www.it-ebooks.info

99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page iii

Data Analysis Using

SQL and Excel®

Gordon S. Linoff

Wiley Publishing, Inc.

Data Analysis Using SQL and Excel®

Published by

Wiley Publishing, Inc.

10475 Crosspoint Boulevard

Indianapolis, IN 46256

www.wiley.com

Copyright © 2008 by Wiley Publishing, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-0-470-09951-3

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any

form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,

except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without

either the prior written permission of the Publisher, or authorization through payment of the

appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA

01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be

addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN

46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations

or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular

purpose. No warranty may be created or extended by sales or promotional materials. The advice

and strategies contained herein may not be suitable for every situation. This work is sold with the

understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising

herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a

potential source of further information does not mean that the author or the publisher endorses the

information the organization or Website may provide or recommendations it may make. Further,

readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or to obtain technical support, please

contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317)

572-3993, or fax (317) 572-4002.

Library of Congress Cataloging-in-Publication Data:

Linoff, Gordon.

Data analysis using SQL and Excel / Gordon S. Linoff.

p. cm.

Includes index.

ISBN 978-0-470-09951-3 (paper/website)

1. SQL (Computer program language) 2. Querying (Computer science) 3. Data mining. 4.

Microsoft Excel (Computer file) I. Title.

QA76.73.S67L56 2007

005.75'85--dc22

2007026313

Trademarks: Wiley, the Wiley logo, and related trade dress are trademarks or registered trademarks

of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not

be used without written permission. Excel is a registered trademark of Microsoft Corporation in the

United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print

may not be available in electronic books.

To Giuseppe for sixteen years, five books, and counting . . .

About the Author

Gordon Linoff (gordon@data-miners.com) is a recognized expert in the field

of data mining. He has more than twenty-five years of experience working

with companies large and small to analyze customer data and to help design

data warehouses. His passion for SQL and relational databases dates to the

early 1990s, when he was building a relational database engine designed for

large corporate data warehouses at the now-defunct Thinking Machines Corporation. Since then, he has had the opportunity to work with all the leading

database vendors, including Microsoft, Oracle, and IBM.

With his colleague Michael Berry, Gordon has written three of the most popular books on data mining, starting with Data Mining Techniques for Marketing,

Sales, and Customer Support. In addition to writing books on data mining, he

also teaches courses on data mining, and has taught thousands of students on

four continents.

Gordon is currently a principal at Data Miners, a consulting company he

and Michael Berry founded in 1998. Data Miners is devoted to doing and

teaching data mining and customer-centric data analysis.

Credits

Acquisitions Editor
Robert Elliott

Vice President and Executive Publisher
Joseph B. Wikert

Development Editor
Ed Connor

Project Coordinator, Cover
Lynsey Osborn

Technical Editor
Michael J. A. Berry

Copy Editor
Kim Cofer

Graphics and Production Specialists
Craig Woods, Happenstance Type-O-Rama
Oso Rey, Happenstance Type-O-Rama

Editorial Manager
Mary Beth Wakefield

Proofreading
Ian Golder, Word One

Production Manager
Tim Tate

Indexing
Johnna VanHoose Dinse

Vice President and Executive Group Publisher
Richard Swadley

Anniversary Logo Design
Richard Pacifico

Production Editor
William A. Barton

99513ftoc.qxd:WileyRed

8/24/07

11:15 AM

Page ix

Contents

Foreword

Acknowledgments

Introduction

Chapter 1 A Data Miner Looks at SQL

Picturing the Structure of the Data

What Is a Data Model?

What Is a Table?

Allowing NULL Values

Column Types

What Is an Entity-Relationship Diagram?

The Zip Code Tables

Subscription Dataset

Purchases Dataset

Picturing Data Analysis Using Dataflows

What Is a Dataflow?

Dataflow Nodes (Operators)

READ: Reading a Database Table

OUTPUT: Outputting a Table (or Chart)

SELECT: Selecting Various Columns in the Table

FILTER: Filtering Rows Based on a Condition

APPEND: Appending New Calculated Columns

UNION: Combining Multiple Datasets into One

AGGREGATE: Aggregating Values

LOOKUP: Looking Up Values in One Table in Another

CROSSJOIN: General Join of Two Tables

JOIN: Join Two Tables Together Using a Key Column

SORT: Ordering the Results of a Dataset

Dataflows, SQL, and Relational Algebra

SQL Queries

What to Do, Not How to Do It

A Basic SQL Query

A Basic Summary SQL Query

What it Means to Join Tables

Cross-Joins: The Most General Joins

Lookup: A Useful Join

Equijoins

Nonequijoins

Outer Joins

Other Important Capabilities in SQL

UNION ALL

CASE

IN

Subqueries Are Our Friend

Subqueries for Naming Variables

Subqueries for Handling Summaries

Subqueries and IN

Rewriting the “IN” as a JOIN

Correlated Subqueries

The NOT IN Operator

Subqueries for UNION ALL

Lessons Learned

Chapter 2 What’s In a Table? Getting Started with Data Exploration

What Is Data Exploration?

Excel for Charting

A Basic Chart: Column Charts

Inserting the Data

Creating the Column Chart

Formatting the Column Chart

Useful Variations on the Column Chart

A New Query

Side-by-Side Columns

Stacked Columns

Stacked and Normalized Columns

Number of Orders and Revenue

Other Types of Charts

Line Charts

Area Charts

X-Y Charts (Scatter Plots)

What Values Are in the Columns?

Histograms

Histograms of Counts

Cumulative Histograms of Counts

Histograms (Frequencies) for Numeric Values

Ranges Based on the Number of Digits, Using Numeric Techniques

Ranges Based on the Number of Digits, Using String Techniques

More Refined Ranges: First Digit Plus Number of Digits

Breaking Numerics into Equal-Sized Groups

More Values to Explore — Min, Max, and Mode

Minimum and Maximum Values

The Most Common Value (Mode)

Calculating Mode Using Standard SQL

Calculating Mode Using SQL Extensions

Calculating Mode Using String Operations

Exploring String Values

Histogram of Length

Strings Starting or Ending with Spaces

Handling Upper- and Lowercase

What Characters Are in a String?

Exploring Values in Two Columns

What Are Average Sales By State?

How Often Are Products Repeated within a Single Order?

Direct Counting Approach

Comparison of Distinct Counts to Overall Counts

Which State Has the Most American Express Users?

From Summarizing One Column to Summarizing All Columns

Good Summary for One Column

Query to Get All Columns in a Table

Using SQL to Generate Summary Code

Lessons Learned

Chapter 3 How Different Is Different?

Basic Statistical Concepts

The Null Hypothesis

Confidence and Probability

Normal Distribution

How Different Are the Averages?

The Approach

Standard Deviation for Subset Averages

Three Approaches

Estimation Based on Two Samples

Estimation Based on Difference

Counting Possibilities

How Many Men?

How Many Californians?

Null Hypothesis and Confidence

How Many Customers Are Still Active?

Given the Count, What Is the Probability?

Given the Probability, What Is the Number of Stops?

The Rate or the Number?

Ratios, and Their Statistics

Standard Error of a Proportion

Confidence Interval on Proportions

Difference of Proportions

Conservative Lower Bounds

Chi-Square

Expected Values

Chi-Square Calculation

Chi-Square Distribution

Chi-Square in SQL

What States Have Unusual Affinities for Which Types of Products?

Data Investigation

SQL to Calculate Chi-Square Values

Affinity Results

Lessons Learned

Chapter 4 Where Is It All Happening? Location, Location, Location

Latitude and Longitude

Definition of Latitude and Longitude

Degrees, Minutes, Seconds, and All That

Distance between Two Locations

Euclidian Method

Accurate Method

Finding All Zip Codes within a Given Distance

Finding Nearest Zip Code in Excel

Pictures with Zip Codes

The Scatter Plot Map

Who Uses Solar Power for Heating?

Where Are the Customers?

Census Demographics

The Extremes: Richest and Poorest

Median Income

Proportion of Wealthy and Poor

Income Similarity and Dissimilarity Using Chi-Square

Comparison of Zip Codes with and without Orders

Zip Codes Not in Census File

Profiles of Zip Codes with and without Orders

Classifying and Comparing Zip Codes

Geographic Hierarchies

Wealthiest Zip Code in a State?

Zip Code with the Most Orders in Each State

Interesting Hierarchies in Geographic Data

Counties

Designated Marketing Areas (DMAs)

Census Hierarchies

Other Geographic Subdivisions

Calculating County Wealth

Identifying Counties

Measuring Wealth

Distribution of Values of Wealth

Which Zip Code Is Wealthiest Relative to Its County?

County with Highest Relative Order Penetration

Mapping in Excel

Why Create Maps?

It Can’t Be Done

Mapping on the Web

State Boundaries on Scatter Plots of Zip Codes

Plotting State Boundaries

Pictures of State Boundaries

Lessons Learned

Chapter 5 It’s a Matter of Time

Dates and Times in Databases

Some Fundamentals of Dates and Times in Databases

Extracting Components of Dates and Times

Converting to Standard Formats

Intervals (Durations)

Time Zones

Calendar Table

Starting to Investigate Dates

Verifying that Dates Have No Times

Comparing Counts by Date

Orderlines Shipped and Billed

Customers Shipped and Billed

Number of Different Bill and Ship Dates per Order

Counts of Orders and Order Sizes

Items as Measured by Number of Units

Items as Measured by Distinct Products

Size as Measured by Dollars

Days of the Week

Billing Date by Day of the Week

Changes in Day of the Week by Year

Comparison of Days of the Week for Two Dates

How Long between Two Dates?

Duration in Days

Duration in Weeks

Duration in Months

How Many Mondays?

A Business Problem about Days of the Week

Outline of a Solution

Solving It in SQL

Using a Calendar Table Instead

Year-over-Year Comparisons

Comparisons by Day

Adding a Moving Average Trend Line

Comparisons by Week

Comparisons by Month

Month-to-Date Comparison

Extrapolation by Days in Month

Estimation Based on Day of Week

Estimation Based on Previous Year

Counting Active Customers by Day

How Many Customers on a Given Day?

How Many Customers Every Day?

How Many Customers of Different Types?

How Many Customers by Tenure Segment?

Simple Chart Animation in Excel

Order Date to Ship Date

Order Date to Ship Date by Year

Querying the Data

Creating the One-Year Excel Table

Creating and Customizing the Chart

Lessons Learned

Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value

Background on Survival Analysis

Life Expectancy

Medical Research

Examples of Hazards

The Hazard Calculation

Data Investigation

Stop Flag

Tenure

Hazard Probability

Visualizing Customers: Time versus Tenure

Censoring

Survival and Retention

Point Estimate for Survival

Calculating Survival for All Tenures

Calculating Survival in SQL

Step 1: Create the Survival Table

Step 2: Load POPT and STOPT

Step 3: Calculate Cumulative Population

Step 4: Calculate the Hazard

Step 5: Calculate the Survival

Step 6: Fix ENDTENURE and NUMDAYS in Last Row

Generalizing the SQL

A Simple Customer Retention Calculation

Comparison between Retention and Survival

Simple Example of Hazard and Survival

Constant Hazard

What Happens to a Mixture

Constant Hazard Corresponding to Survival

Comparing Different Groups of Customers

Summarizing the Markets

Stratifying by Market

Survival Ratio

Conditional Survival

Comparing Survival over Time

How Has a Particular Hazard Changed over Time?

What Is Customer Survival by Year of Start?

What Did Survival Look Like in the Past?

Important Measures Derived from Survival

Point Estimate of Survival

Median Customer Tenure

Average Customer Lifetime

Confidence in the Hazards

Using Survival for Customer Value Calculations

Estimated Revenue

Estimating Future Revenue for One Future Start

SQL Day-by-Day Approach

SQL Summary Approach

Estimated Revenue for a Simple Group of Existing Customers

Estimated Second Year Revenue for a Homogenous Group

Pre-calculating Yearly Revenue by Tenure

Estimated Future Revenue for All Customers

Lessons Learned

Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure

What Factors Are Important and When

Explanation of the Approach

Using Averages to Compare Numeric Variables

The Answer

Answering the Question in SQL

Extension to Include Confidence Bounds

Hazard Ratios

Interpreting Hazard Ratios

Calculating Hazard Ratios

Why the Hazard Ratio

Left Truncation

Recognizing Left Truncation

Effect of Left Truncation

How to Fix Left Truncation, Conceptually

Estimating Hazard Probability for One Tenure

Estimating Hazard Probabilities for All Tenures

Time Windowing

A Business Problem

Time Windows = Left Truncation + Right Censoring

Calculating One Hazard Probability Using a Time Window

All Hazard Probabilities for a Time Window

Comparison of Hazards by Stops in Year

Competing Risks

Examples of Competing Risks

I=Involuntary Churn

V=Voluntary Churn

M=Migration

Other

Competing Risk “Hazard Probability”

Competing Risk “Survival”

What Happens to Customers over Time

Example

A Cohort-Based Approach

The Survival Analysis Approach

Before and After

Three Scenarios

A Billing Mistake

A Loyalty Program

Raising Prices

Using Survival Forecasts

Forecasting Identified Customers Who Stopped

Estimating Excess Stops

Before and After Comparison

Cohort-Based Approach

Direct Estimation of Event Effect

Approach to the Calculation

Time-Varying Covariate Survival Using SQL and Excel

Lessons Learned

Chapter 8 Customer Purchases and Other Repeated Events

Identifying Customers

Who Is the Customer?

How Many?

How Many Genders in a Household

Investigating First Names

Other Customer Information

First and Last Names

Addresses

Other Identifying Information

How Many New Customers Appear Each Year?

Counting Customers

Span of Time Making Purchases

Average Time between Orders

Purchase Intervals

RFM Analysis

The Dimensions

Recency

Frequency

Monetary

Calculating the RFM Cell

Utility of RFM

A Methodology for Marketing Experiments

Customer Migration

RFM Limits

Which Households Are Increasing Purchase Amounts Over Time?

Comparison of Earliest and Latest Values

Calculating the Earliest and Latest Values

Comparing the First and Last Values

Comparison of First Year Values and Last Year Values

Trend from the Best Fit Line

Using the Slope

Calculating the Slope

Time to Next Event

Idea behind the Calculation

Calculating Next Purchase Date Using SQL

From Next Purchase Date to Time-to-Event

Stratifying Time-to-Event

Lessons Learned

Chapter 9 What’s in a Shopping Cart? Market Basket Analysis and Association Rules

Exploratory Market Basket Analysis

Scatter Plot of Products

Duplicate Products in Orders

Histogram of Number of Units

Products Associated with One-Time Customers

Products Associated with the Best Customers

Changes in Price

Combinations (Item Sets)

Combinations of Two Products

Number of Two-Way Combinations

Generating All Two-Way Combinations

Examples of Combinations

Variations on Combinations

Combinations of Product Groups

Multi-Way Combinations

Households Not Orders

Combinations within a Household

Investigating Products within Households but Not within Orders

Multiple Purchases of the Same Product

The Simplest Association Rules

Associations and Rules

Zero-Way Association Rules

What Is the Distribution of Probabilities?

What Do Zero-Way Associations Tell Us?

One-Way Association Rules

Example of One-Way Association Rules

Generating All One-Way Rules

One-Way Rules with Evaluation Information

One-Way Rules on Product Groups

Calculating Product Group Rules Using an Intermediate Table

Calculating Product Group Rules Using Window Functions

Two-Way Associations

Calculating Two-Way Associations

Using Chi-Square to Find the Best Rules

Applying Chi-Square to Rules

Applying Chi-Square to Rules in SQL

Comparing Chi-Square Rules to Lift

Chi-Square for Negative Rules

Heterogeneous Associations

Rules of the Form “State Plus Product”

Rules Mixing Different Types of Products

Extending Association Rules

Multi-Way Associations

Rules Using Attributes of Products

Rules with Different Left- and Right-Hand Sides

Before and After: Sequential Associations

Lessons Learned

Chapter 10 Data Mining Models in SQL

Introduction to Directed Data Mining

Directed Models

The Data in Modeling

Model Set

Score Set

Prediction Model Sets versus Profiling Model Sets

Examples of Modeling Tasks

Similarity Models

Yes-or-No Models (Binary Response Classification)

Yes-or-No Models with Propensity Scores

Multiple Categories

Estimating Numeric Values

Model Evaluation

Look-Alike Models

What Is the Model?

What Is the Best Zip Code?

A Basic Look-Alike Model

Look-Alike Using Z-Scores

Example of Nearest Neighbor Model

Lookup Model for Most Popular Product

Most Popular Product

Calculating Most Popular Product Group

Evaluating the Lookup Model

Using a Profiling Lookup Model for Prediction

Using Binary Classification Instead

Lookup Model for Order Size

Most Basic Example: No Dimensions

Adding One Dimension

Adding More Dimensions

Examining Nonstationarity

Evaluating the Model Using an Average Value Chart

Lookup Model for Probability of Response

The Overall Probability as a Model

Exploring Different Dimensions

How Accurate Are the Models?

Adding More Dimensions

Naïve Bayesian Models (Evidence Models)

Some Ideas in Probability

Probabilities

Odds

Likelihood

Calculating the Naïve Bayesian Model

An Intriguing Observation

Bayesian Model of One Variable

Bayesian Model of One Variable in SQL

The “Naïve” Generalization

Naïve Bayesian Model: Scoring and Lift

Scoring with More Attributes

Creating a Cumulative Gains Chart

Comparison of Naïve Bayesian and Lookup Models

Lessons Learned

Chapter 11 The Best-Fit Line: Linear Regression Models

The Best-Fit Line

Tenure and Amount Paid

Properties of the Best-fit Line

What Does Best-Fit Mean?

Formula for Line

Expected Value

Error (Residuals)

Preserving the Averages

Inverse Model

Beware of the Data

Trend Lines in Charts

Best-fit Line in Scatter Plots

Logarithmic, Power, and Exponential Trend Curves

Polynomial Trend Curves

Moving Average

Best-fit Using LINEST() Function

Returning Values in Multiple Cells

Calculating Expected Values

LINEST() for Logarithmic, Exponential, and Power Curves

Measuring Goodness of Fit Using R²

The R² Value

Limitations of R²

What R² Really Means

Direct Calculation of Best-Fit Line Coefficients

Doing the Calculation

Calculating the Best-Fit Line in SQL

Price Elasticity

Price Frequency

Price Frequency for $20 Books

Price Elasticity Model in SQL

Price Elasticity Average Value Chart

Weighted Linear Regression

Customer Stops during the First Year

Weighted Best Fit

Weighted Best-Fit Line in a Chart

Weighted Best-Fit in SQL

Weighted Best-Fit Using Solver

The Weighted Best-Fit Line

Solver Is Better Than Guessing

More Than One Input Variable

Multiple Regression in Excel

Getting the Data

Investigating Each Variable Separately

Building a Model with Three Input Variables

Using Solver for Multiple Regression

Choosing Input Variables One-By-One

Multiple Regression in SQL

Lessons Learned

Chapter 12 Building Customer Signatures for Further Analysis

What Is a Customer Signature?

What Is a Customer?

Sources of Data for the Customer Signature

Current Customer Snapshot

Initial Customer Information

Self-Reported Information

External Data (Demographic and So On)

About Their Neighbors

Transaction Summaries

Using Customer Signatures

Predictive and Profile Modeling

Ad Hoc Analysis

Repository of Customer-Centric Business Metrics

Designing Customer Signatures

Column Roles

Identification Columns

Input Columns

Target Columns

Foreign Key Columns

Cutoff Date

Profiling versus Prediction

Time Frames

Naming of Columns

Eliminating Seasonality

Adding Seasonality Back In

Multiple Time Frames

Operations to Build a Customer Signature

Driving Table

Using an Existing Table as the Driving Table

Derived Table as the Driving Table

Looking Up Data

Fixed Lookup Tables

Customer Dimension Lookup Tables

Initial Transaction

Without Window Functions

With Window Functions

Pivoting

Payment Type Pivot

Channel Pivot

Year Pivot

Order Line Information Pivot

Summarizing

Basic Summaries

More Complex Summaries

Extracting Features

Geographic Location Information

Date Time Columns

Patterns in Strings

Email Addresses

Addresses

Product Descriptions

Credit Card Numbers

Summarizing Customer Behaviors

Calculating Slope for Time Series

Calculating Slope from Pivoted Time Series

Calculating Slope for a Regular Time Series

Calculating Slope for an Irregular Time Series

Weekend Shoppers

Declining Usage Behavior

Lessons Learned

Appendix Equivalent Constructs Among Databases

String Functions

Searching for Position of One String within Another

IBM

Microsoft

mysql

Oracle

SAS proc sql

String Concatenation

IBM

Microsoft

mysql

Oracle

SAS proc sql

String Length Function

IBM

Microsoft

mysql

Oracle

SAS proc sql

Substring Function

IBM

Microsoft

mysql

Oracle

SAS proc sql

Replace One Substring with Another

IBM

Microsoft

www.it-ebooks.info

612

612

613

613

613

613

614

614

614

614

614

614

614

614

615

615

615

615

615

615

615

615

616

616

616

616

616

99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page iii

Data Analysis Using

SQL and Excel®

Gordon S. Linoff

Wiley Publishing, Inc.

www.it-ebooks.info

99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page ii

www.it-ebooks.info

99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page i

Data Analysis Using

SQL and Excel®

www.it-ebooks.info

99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page ii

www.it-ebooks.info

99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page iii

Data Analysis Using

SQL and Excel®

Gordon S. Linoff

Wiley Publishing, Inc.

www.it-ebooks.info

99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page iv

Data Analysis Using SQL and Excel®

Published by

Wiley Publishing, Inc.

10475 Crosspoint Boulevard

Indianapolis, IN 46256

www.wiley.com

Copyright © 2008 by Wiley Publishing, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-0-470-09951-3

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any

form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,

except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without

either the prior written permission of the Publisher, or authorization through payment of the

appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA

01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be

addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN

46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations

or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular

purpose. No warranty may be created or extended by sales or promotional materials. The advice

and strategies contained herein may not be suitable for every situation. This work is sold with the

understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising

herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a

potential source of further information does not mean that the author or the publisher endorses the

information the organization or Website may provide or recommendations it may make. Further,

readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or to obtain technical support, please

contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317)

572-3993, or fax (317) 572-4002.

Library of Congress Cataloging-in-Publication Data:

Linoff, Gordon.

Data analysis using SQL and Excel / Gordon S. Linoff.

p. cm.

Includes index.

ISBN 978-0-470-09951-3 (paper/website)

1. SQL (Computer program language) 2. Querying (Computer science) 3. Data mining. 4.

Microsoft Excel (Computer file) I. Title.

QA76.73.S67L56 2007

005.75'85--dc22

2007026313

Trademarks: Wiley, the Wiley logo, and related trade dress are trademarks or registered trademarks

of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not

be used without written permission. Excel is a registered trademark of Microsoft Corporation in the

United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print

may not be available in electronic books.


To Giuseppe for sixteen years, five books, and counting . . .


About the Author

Gordon Linoff (gordon@data-miners.com) is a recognized expert in the field

of data mining. He has more than twenty-five years of experience working

with companies large and small to analyze customer data and to help design

data warehouses. His passion for SQL and relational databases dates to the

early 1990s, when he was building a relational database engine designed for

large corporate data warehouses at the now-defunct Thinking Machines Corporation. Since then, he has had the opportunity to work with all the leading

database vendors, including Microsoft, Oracle, and IBM.

With his colleague Michael Berry, Gordon has written three of the most popular books on data mining, starting with Data Mining Techniques for Marketing,

Sales, and Customer Support. In addition to writing books on data mining, he

also teaches courses on data mining, and has taught thousands of students on

four continents.

Gordon is currently a principal at Data Miners, a consulting company he

and Michael Berry founded in 1998. Data Miners is devoted to doing and

teaching data mining and customer-centric data analysis.


Credits

Acquisitions Editor: Robert Elliott

Vice President and Executive Publisher: Joseph B. Wikert

Development Editor: Ed Connor

Project Coordinator, Cover: Lynsey Osborn

Technical Editor: Michael J. A. Berry

Copy Editor: Kim Cofer

Graphics and Production Specialists: Craig Woods, Happenstance Type-O-Rama; Oso Rey, Happenstance Type-O-Rama

Editorial Manager: Mary Beth Wakefield

Proofreading: Ian Golder, Word One

Production Manager: Tim Tate

Indexing: Johnna VanHoose Dinse

Vice President and Executive Group Publisher: Richard Swadley

Anniversary Logo Design: Richard Pacifico

Production Editor: William A. Barton


Contents

Foreword

Acknowledgments

Introduction

Chapter 1 A Data Miner Looks at SQL

Picturing the Structure of the Data

What Is a Data Model?

What Is a Table?

Allowing NULL Values

Column Types

What Is an Entity-Relationship Diagram?

The Zip Code Tables

Subscription Dataset

Purchases Dataset

Picturing Data Analysis Using Dataflows

What Is a Dataflow?

Dataflow Nodes (Operators)

READ: Reading a Database Table

OUTPUT: Outputting a Table (or Chart)

SELECT: Selecting Various Columns in the Table

FILTER: Filtering Rows Based on a Condition

APPEND: Appending New Calculated Columns

UNION: Combining Multiple Datasets into One

AGGREGATE: Aggregating Values

LOOKUP: Looking Up Values in One Table in Another

CROSSJOIN: General Join of Two Tables

JOIN: Join Two Tables Together Using a Key Column

SORT: Ordering the Results of a Dataset

Dataflows, SQL, and Relational Algebra


SQL Queries


What to Do, Not How to Do It

A Basic SQL Query

A Basic Summary SQL Query

What it Means to Join Tables

Cross-Joins: The Most General Joins

Lookup: A Useful Join

Equijoins

Nonequijoins

Outer Joins

Other Important Capabilities in SQL

UNION ALL

CASE

IN

Subqueries Are Our Friend

Subqueries for Naming Variables

Subqueries for Handling Summaries

Subqueries and IN

Rewriting the “IN” as a JOIN

Correlated Subqueries

The NOT IN Operator

Subqueries for UNION ALL

Lessons Learned

Chapter 2 What’s In a Table? Getting Started with Data Exploration

What Is Data Exploration?

Excel for Charting


A Basic Chart: Column Charts

Inserting the Data

Creating the Column Chart

Formatting the Column Chart

Useful Variations on the Column Chart

A New Query

Side-by-Side Columns

Stacked Columns

Stacked and Normalized Columns

Number of Orders and Revenue

Other Types of Charts

Line Charts

Area Charts

X-Y Charts (Scatter Plots)

What Values Are in the Columns?

Histograms

Histograms of Counts

Cumulative Histograms of Counts

Histograms (Frequencies) for Numeric Values

Ranges Based on the Number of Digits, Using Numeric Techniques

Ranges Based on the Number of Digits, Using String Techniques

More Refined Ranges: First Digit Plus Number of Digits

Breaking Numerics into Equal-Sized Groups

More Values to Explore — Min, Max, and Mode


Minimum and Maximum Values

The Most Common Value (Mode)

Calculating Mode Using Standard SQL

Calculating Mode Using SQL Extensions

Calculating Mode Using String Operations


Exploring String Values

Histogram of Length

Strings Starting or Ending with Spaces

Handling Upper- and Lowercase

What Characters Are in a String?

Exploring Values in Two Columns

What Are Average Sales By State?

How Often Are Products Repeated within a Single Order?

Direct Counting Approach

Comparison of Distinct Counts to Overall Counts

Which State Has the Most American Express Users?

From Summarizing One Column to Summarizing All Columns

Good Summary for One Column

Query to Get All Columns in a Table

Using SQL to Generate Summary Code

Lessons Learned

Chapter 3 How Different Is Different?

Basic Statistical Concepts


The Null Hypothesis

Confidence and Probability

Normal Distribution


How Different Are the Averages?

The Approach

Standard Deviation for Subset Averages

Three Approaches

Estimation Based on Two Samples

Estimation Based on Difference

Counting Possibilities

How Many Men?

How Many Californians?

Null Hypothesis and Confidence

How Many Customers Are Still Active?

Given the Count, What Is the Probability?

Given the Probability, What Is the Number of Stops?

The Rate or the Number?


Ratios, and Their Statistics

Standard Error of a Proportion

Confidence Interval on Proportions

Difference of Proportions

Conservative Lower Bounds

Chi-Square


Expected Values

Chi-Square Calculation

Chi-Square Distribution

Chi-Square in SQL

What States Have Unusual Affinities for Which Types of Products?

Data Investigation

SQL to Calculate Chi-Square Values

Affinity Results

Lessons Learned

Chapter 4 Where Is It All Happening? Location, Location, Location

Latitude and Longitude


Definition of Latitude and Longitude

Degrees, Minutes, Seconds, and All That

Distance between Two Locations

Euclidian Method

Accurate Method

Finding All Zip Codes within a Given Distance

Finding Nearest Zip Code in Excel

Pictures with Zip Codes

The Scatter Plot Map

Who Uses Solar Power for Heating?

Where Are the Customers?

Census Demographics

The Extremes: Richest and Poorest

Median Income

Proportion of Wealthy and Poor

Income Similarity and Dissimilarity Using Chi-Square

Comparison of Zip Codes with and without Orders

Zip Codes Not in Census File

Profiles of Zip Codes with and without Orders

Classifying and Comparing Zip Codes

Geographic Hierarchies

Wealthiest Zip Code in a State?

Zip Code with the Most Orders in Each State

Interesting Hierarchies in Geographic Data

Counties

Designated Marketing Areas (DMAs)

Census Hierarchies

Other Geographic Subdivisions


Calculating County Wealth

Identifying Counties

Measuring Wealth

Distribution of Values of Wealth

Which Zip Code Is Wealthiest Relative to Its County?

County with Highest Relative Order Penetration

Mapping in Excel

Why Create Maps?

It Can’t Be Done

Mapping on the Web

State Boundaries on Scatter Plots of Zip Codes

Plotting State Boundaries

Pictures of State Boundaries

Lessons Learned

Chapter 5 It’s a Matter of Time

Dates and Times in Databases


Some Fundamentals of Dates and Times in Databases

Extracting Components of Dates and Times

Converting to Standard Formats

Intervals (Durations)

Time Zones

Calendar Table

Starting to Investigate Dates

Verifying that Dates Have No Times

Comparing Counts by Date

Orderlines Shipped and Billed

Customers Shipped and Billed

Number of Different Bill and Ship Dates per Order

Counts of Orders and Order Sizes

Items as Measured by Number of Units

Items as Measured by Distinct Products

Size as Measured by Dollars

Days of the Week

Billing Date by Day of the Week

Changes in Day of the Week by Year

Comparison of Days of the Week for Two Dates

How Long between Two Dates?

Duration in Days

Duration in Weeks

Duration in Months

How Many Mondays?

A Business Problem about Days of the Week

Outline of a Solution

Solving It in SQL

Using a Calendar Table Instead


Year-over-Year Comparisons

Comparisons by Day

Adding a Moving Average Trend Line

Comparisons by Week

Comparisons by Month

Month-to-Date Comparison

Extrapolation by Days in Month

Estimation Based on Day of Week

Estimation Based on Previous Year

Counting Active Customers by Day

How Many Customers on a Given Day?

How Many Customers Every Day?

How Many Customers of Different Types?

How Many Customers by Tenure Segment?

Simple Chart Animation in Excel

Order Date to Ship Date

Order Date to Ship Date by Year

Querying the Data

Creating the One-Year Excel Table

Creating and Customizing the Chart

Lessons Learned

Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value

Background on Survival Analysis


Life Expectancy

Medical Research

Examples of Hazards

The Hazard Calculation

Data Investigation

Stop Flag

Tenure

Hazard Probability

Visualizing Customers: Time versus Tenure

Censoring

Survival and Retention

Point Estimate for Survival

Calculating Survival for All Tenures

Calculating Survival in SQL

Step 1. Create the Survival Table

Step 2: Load POPT and STOPT

Step 3: Calculate Cumulative Population

Step 4: Calculate the Hazard

Step 5: Calculate the Survival

Step 6: Fix ENDTENURE and NUMDAYS in Last Row

Generalizing the SQL


A Simple Customer Retention Calculation

Comparison between Retention and Survival

Simple Example of Hazard and Survival

Constant Hazard

What Happens to a Mixture

Constant Hazard Corresponding to Survival

Comparing Different Groups of Customers


Summarizing the Markets

Stratifying by Market

Survival Ratio

Conditional Survival


Comparing Survival over Time


How Has a Particular Hazard Changed over Time?

What Is Customer Survival by Year of Start?

What Did Survival Look Like in the Past?

Important Measures Derived from Survival

Point Estimate of Survival

Median Customer Tenure

Average Customer Lifetime

Confidence in the Hazards

Using Survival for Customer Value Calculations

Estimated Revenue

Estimating Future Revenue for One Future Start

SQL Day-by-Day Approach

SQL Summary Approach

Estimated Revenue for a Simple Group of Existing Customers

Estimated Second Year Revenue for a Homogenous Group

Pre-calculating Yearly Revenue by Tenure

Estimated Future Revenue for All Customers

Lessons Learned

Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure

What Factors Are Important and When


Explanation of the Approach

Using Averages to Compare Numeric Variables

The Answer

Answering the Question in SQL

Extension to Include Confidence Bounds

Hazard Ratios

Interpreting Hazard Ratios

Calculating Hazard Ratios

Why the Hazard Ratio

Left Truncation

Recognizing Left Truncation

Effect of Left Truncation


How to Fix Left Truncation, Conceptually

Estimating Hazard Probability for One Tenure

Estimating Hazard Probabilities for All Tenures

Time Windowing


A Business Problem

Time Windows = Left Truncation + Right Censoring

Calculating One Hazard Probability Using a Time Window

All Hazard Probabilities for a Time Window

Comparison of Hazards by Stops in Year

Competing Risks


Examples of Competing Risks

I=Involuntary Churn

V=Voluntary Churn

M=Migration

Other

Competing Risk “Hazard Probability”

Competing Risk “Survival”

What Happens to Customers over Time

Example

A Cohort-Based Approach

The Survival Analysis Approach

Before and After


Three Scenarios

A Billing Mistake

A Loyalty Program

Raising Prices

Using Survival Forecasts

Forecasting Identified Customers Who Stopped

Estimating Excess Stops

Before and After Comparison

Cohort-Based Approach

Direct Estimation of Event Effect

Approach to the Calculation

Time-Varying Covariate Survival Using SQL and Excel

Lessons Learned

Chapter 8 Customer Purchases and Other Repeated Events

Identifying Customers


Who Is the Customer?

How Many?

How Many Genders in a Household

Investigating First Names

Other Customer Information

First and Last Names

Addresses

Other Identifying Information


How Many New Customers Appear Each Year?

Counting Customers

Span of Time Making Purchases

Average Time between Orders

Purchase Intervals

RFM Analysis


The Dimensions

Recency

Frequency

Monetary

Calculating the RFM Cell

Utility of RFM

A Methodology for Marketing Experiments

Customer Migration

RFM Limits

Which Households Are Increasing Purchase Amounts Over Time?

Comparison of Earliest and Latest Values

Calculating the Earliest and Latest Values

Comparing the First and Last Values

Comparison of First Year Values and Last Year Values

Trend from the Best Fit Line

Using the Slope

Calculating the Slope

Time to Next Event

Idea behind the Calculation

Calculating Next Purchase Date Using SQL

From Next Purchase Date to Time-to-Event

Stratifying Time-to-Event

Lessons Learned

Chapter 9 What’s in a Shopping Cart? Market Basket Analysis and Association Rules

Exploratory Market Basket Analysis


Scatter Plot of Products

Duplicate Products in Orders

Histogram of Number of Units

Products Associated with One-Time Customers

Products Associated with the Best Customers

Changes in Price

Combinations (Item Sets)

Combinations of Two Products

Number of Two-Way Combinations

Generating All Two-Way Combinations

Examples of Combinations

Variations on Combinations

Combinations of Product Groups

Multi-Way Combinations


Households Not Orders

Combinations within a Household

Investigating Products within Households but Not within Orders

Multiple Purchases of the Same Product

The Simplest Association Rules

Associations and Rules

Zero-Way Association Rules

What Is the Distribution of Probabilities?

What Do Zero-Way Associations Tell Us?

One-Way Association Rules

Example of One-Way Association Rules

Generating All One-Way Rules

One-Way Rules with Evaluation Information

One-Way Rules on Product Groups

Calculating Product Group Rules Using an Intermediate Table

Calculating Product Group Rules Using Window Functions

Two-Way Associations

Calculating Two-Way Associations

Using Chi-Square to Find the Best Rules

Applying Chi-Square to Rules

Applying Chi-Square to Rules in SQL

Comparing Chi-Square Rules to Lift

Chi-Square for Negative Rules

Heterogeneous Associations

Rules of the Form “State Plus Product”

Rules Mixing Different Types of Products

Extending Association Rules

Multi-Way Associations

Rules Using Attributes of Products

Rules with Different Left- and Right-Hand Sides

Before and After: Sequential Associations

Lessons Learned


Chapter 10 Data Mining Models in SQL

Introduction to Directed Data Mining

Directed Models

The Data in Modeling

Model Set

Score Set

Prediction Model Sets versus Profiling Model Sets

Examples of Modeling Tasks

Similarity Models

Yes-or-No Models (Binary Response Classification)


Yes-or-No Models with Propensity Scores

Multiple Categories

Estimating Numeric Values

Model Evaluation

Look-Alike Models


What Is the Model?

What Is the Best Zip Code?

A Basic Look-Alike Model

Look-Alike Using Z-Scores

Example of Nearest Neighbor Model


Lookup Model for Most Popular Product


Most Popular Product

Calculating Most Popular Product Group

Evaluating the Lookup Model

Using a Profiling Lookup Model for Prediction

Using Binary Classification Instead

Lookup Model for Order Size

Most Basic Example: No Dimensions

Adding One Dimension

Adding More Dimensions

Examining Nonstationarity

Evaluating the Model Using an Average Value Chart

Lookup Model for Probability of Response

The Overall Probability as a Model

Exploring Different Dimensions

How Accurate Are the Models?

Adding More Dimensions

Naïve Bayesian Models (Evidence Models)

Some Ideas in Probability

Probabilities

Odds

Likelihood

Calculating the Naïve Bayesian Model

An Intriguing Observation

Bayesian Model of One Variable

Bayesian Model of One Variable in SQL

The “Naïve” Generalization

Naïve Bayesian Model: Scoring and Lift

Scoring with More Attributes

Creating a Cumulative Gains Chart

Comparison of Naïve Bayesian and Lookup Models

Lessons Learned

Chapter 11 The Best-Fit Line: Linear Regression Models

The Best-Fit Line

Tenure and Amount Paid


Properties of the Best-fit Line

What Does Best-Fit Mean?

Formula for Line

Expected Value

Error (Residuals)

Preserving the Averages

Inverse Model

Beware of the Data

Trend Lines in Charts

Best-fit Line in Scatter Plots

Logarithmic, Power, and Exponential Trend Curves

Polynomial Trend Curves

Moving Average

Best-fit Using LINEST() Function

Returning Values in Multiple Cells

Calculating Expected Values

LINEST() for Logarithmic, Exponential, and Power Curves

Measuring Goodness of Fit Using R²

The R² Value

Limitations of R²

What R² Really Means

Direct Calculation of Best-Fit Line Coefficients

Doing the Calculation

Calculating the Best-Fit Line in SQL

Price Elasticity

Price Frequency

Price Frequency for $20 Books

Price Elasticity Model in SQL

Price Elasticity Average Value Chart

Weighted Linear Regression

Customer Stops during the First Year

Weighted Best Fit

Weighted Best-Fit Line in a Chart

Weighted Best-Fit in SQL

Weighted Best-Fit Using Solver

The Weighted Best-Fit Line

Solver Is Better Than Guessing

More Than One Input Variable

Multiple Regression in Excel

Getting the Data

Investigating Each Variable Separately

Building a Model with Three Input Variables

Using Solver for Multiple Regression

Choosing Input Variables One-By-One

Multiple Regression in SQL

Lessons Learned


Chapter 12 Building Customer Signatures for Further Analysis

What Is a Customer Signature?


What Is a Customer?

Sources of Data for the Customer Signature

Current Customer Snapshot

Initial Customer Information

Self-Reported Information

External Data (Demographic and So On)

About Their Neighbors

Transaction Summaries

Using Customer Signatures

Predictive and Profile Modeling

Ad Hoc Analysis

Repository of Customer-Centric Business Metrics


Designing Customer Signatures


Column Roles

Identification Columns

Input Columns

Target Columns

Foreign Key Columns

Cutoff Date

Profiling versus Prediction

Time Frames

Naming of Columns

Eliminating Seasonality

Adding Seasonality Back In

Multiple Time Frames


Operations to Build a Customer Signature

Driving Table

Using an Existing Table as the Driving Table

Derived Table as the Driving Table

Looking Up Data

Fixed Lookup Tables

Customer Dimension Lookup Tables

Initial Transaction

Without Window Functions

With Window Functions

Pivoting

Payment Type Pivot

Channel Pivot

Year Pivot

Order Line Information Pivot

Summarizing

Basic Summaries

More Complex Summaries


Extracting Features


Geographic Location Information

Date Time Columns

Patterns in Strings

Email Addresses

Addresses

Product Descriptions

Credit Card Numbers


Summarizing Customer Behaviors


Calculating Slope for Time Series

Calculating Slope from Pivoted Time Series

Calculating Slope for a Regular Time Series

Calculating Slope for an Irregular Time Series

Weekend Shoppers

Declining Usage Behavior

Lessons Learned

Appendix Equivalent Constructs Among Databases

String Functions


Searching for Position of One String within Another

IBM

Microsoft

mysql

Oracle

SAS proc sql

String Concatenation

IBM

Microsoft

mysql

Oracle

SAS proc sql

String Length Function

IBM

Microsoft

mysql

Oracle

SAS proc sql

Substring Function

IBM

Microsoft

mysql

Oracle

SAS proc sql

Replace One Substring with Another

IBM

Microsoft

