Business Analytics
Descriptive Predictive Prescriptive

Jeffrey D. Camm
Wake Forest University

James J. Cochran
University of Alabama

Michael J. Fry
University of Cincinnati

Jeffrey W. Ohlmann
University of Iowa

Australia ● Brazil ● Mexico ● Singapore ● United Kingdom ● United States

Copyright 2021 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

This is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. The publisher reserves the right to remove content from this title at any time if subsequent rights restrictions require it. For valuable information on pricing, previous editions, changes to current editions, and alternate formats, please visit www.cengage.com/highered to search by ISBN#, author, title, or keyword for materials in your areas of interest. Important Notice: Media content referenced within the product description or the product text may not be available in the eBook version.

Business Analytics, Fourth Edition

Jeffrey D. Camm, James J. Cochran, Michael J. Fry, Jeffrey W. Ohlmann

Senior Vice President, Higher Education & Skills Product: Erin Joyner
Product Director: Jason Fremder
Senior Product Manager: Aaron Arnsparger
Senior Content Manager: Conor Allen
Product Assistant: Maggie Russo
Marketing Manager: Chris Walz
Senior Learning Designer: Brandon Foltz
Digital Delivery Lead: Mark Hopkinson
Intellectual Property Analyst: Ashley Maynard
Intellectual Property Project Manager: Kelli Besse
Production Service: MPS Limited
Senior Project Manager, MPS Limited: Santosh Pandey
Art Director: Chris Doughman
Text Designer: Beckmeyer Design
Cover Designer: Beckmeyer Design
Cover Image: iStockPhoto.com/tawanlubfah

© 2021, 2019 Cengage Learning, Inc.

WCN: 02-300

Unless otherwise noted, all content is © Cengage. ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced or distributed in any form or by any means, except as permitted by U.S. copyright law, without the prior written permission of the copyright owner.

For product information and technology assistance, contact us at Cengage Customer & Sales Support, 1-800-354-9706 or support.cengage.com.

For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions.

Library of Congress Control Number: 2019921119
ISBN: 978-0-357-13178-7
Loose-leaf Edition: ISBN: 978-0-357-13179-4

Cengage
200 Pier 4 Boulevard
Boston, MA 02210
USA

Cengage is a leading provider of customized learning solutions with employees residing in nearly 40 different countries and sales in more than 125 countries around the world. Find your local representative at www.cengage.com.

Cengage products are represented in Canada by Nelson Education, Ltd.

To learn more about Cengage platforms and services, register or access your online learning solution, or purchase materials for your course, visit www.cengage.com.

Printed in the United States of America
Print Number: 01 Print Year: 2020

Brief Contents

About the Authors
Preface
Chapter 1 Introduction
Chapter 2 Descriptive Statistics
Chapter 3 Data Visualization
Chapter 4 Probability: An Introduction to Modeling Uncertainty
Chapter 5 Descriptive Data Mining
Chapter 6 Statistical Inference
Chapter 7 Linear Regression
Chapter 8 Time Series Analysis and Forecasting
Chapter 9 Predictive Data Mining
Chapter 10 Spreadsheet Models
Chapter 11 Monte Carlo Simulation
Chapter 12 Linear Optimization Models
Chapter 13 Integer Linear Optimization Models
Chapter 14 Nonlinear Optimization Models
Chapter 15 Decision Analysis
Multi-Chapter Case Problems
  Capital State University Game-Day Magazines
  Hanover Inc.
Appendix A Basics of Excel
Appendix B Database Basics with Microsoft Access
Appendix C Solutions to Even-Numbered Problems (MindTap Reader)
References
Index

Contents

About the Authors
Preface

Chapter 1 Introduction
1.1 Decision Making
1.2 Business Analytics Defined
1.3 A Categorization of Analytical Methods and Models
  Descriptive Analytics
  Predictive Analytics
  Prescriptive Analytics
1.4 Big Data
  Volume
  Velocity
  Variety
  Veracity
1.5 Business Analytics in Practice
  Financial Analytics
  Human Resource (HR) Analytics
  Marketing Analytics
  Health Care Analytics
  Supply Chain Analytics
  Analytics for Government and Nonprofits
  Sports Analytics
  Web Analytics
1.6 Legal and Ethical Issues in the Use of Data and Analytics
Summary
Glossary
Available in the MindTap Reader:
  Appendix: Getting Started with R and RStudio
  Appendix: Basic Data Manipulation with R

Chapter 2 Descriptive Statistics
2.1 Overview of Using Data: Definitions and Goals
2.2 Types of Data
  Population and Sample Data
  Quantitative and Categorical Data
  Cross-Sectional and Time Series Data
  Sources of Data
2.3 Modifying Data in Excel
  Sorting and Filtering Data in Excel
  Conditional Formatting of Data in Excel
2.4 Creating Distributions from Data
  Frequency Distributions for Categorical Data
  Relative Frequency and Percent Frequency Distributions
  Frequency Distributions for Quantitative Data
  Histograms
  Cumulative Distributions
2.5 Measures of Location
  Mean (Arithmetic Mean)
  Median
  Mode
  Geometric Mean
2.6 Measures of Variability
  Range
  Variance
  Standard Deviation
  Coefficient of Variation
2.7 Analyzing Distributions
  Percentiles
  Quartiles
  z-Scores
  Empirical Rule
  Identifying Outliers
  Boxplots
2.8 Measures of Association Between Two Variables
  Scatter Charts
  Covariance
  Correlation Coefficient
2.9 Data Cleansing
  Missing Data
  Blakely Tires
  Identification of Erroneous Outliers and Other Erroneous Values
  Variable Representation
Summary
Glossary
Problems
Case Problem 1: Heavenly Chocolates Web Site Transactions
Case Problem 2: African Elephant Populations
Available in the MindTap Reader:
  Appendix: Descriptive Statistics with R

Chapter 3 Data Visualization
3.1 Overview of Data Visualization
  Effective Design Techniques
3.2 Tables
  Table Design Principles
  Crosstabulation
  PivotTables in Excel
  Recommended PivotTables in Excel
3.3 Charts
  Scatter Charts
  Recommended Charts in Excel
  Line Charts
  Bar Charts and Column Charts
  A Note on Pie Charts and Three-Dimensional Charts
  Bubble Charts
  Heat Maps
  Additional Charts for Multiple Variables
  PivotCharts in Excel
3.4 Advanced Data Visualization
  Advanced Charts
  Geographic Information Systems Charts
3.5 Data Dashboards
  Principles of Effective Data Dashboards
  Applications of Data Dashboards
Summary
Glossary
Problems
Case Problem 1: Pelican Stores
Case Problem 2: Movie Theater Releases
Appendix: Data Visualization in Tableau
Available in the MindTap Reader:
  Appendix: Creating Tabular and Graphical Presentations with R

Chapter 4 Probability: An Introduction to Modeling Uncertainty
4.1 Events and Probabilities
4.2 Some Basic Relationships of Probability
  Complement of an Event
  Addition Law
4.3 Conditional Probability
  Independent Events
  Multiplication Law
  Bayes’ Theorem
4.4 Random Variables
  Discrete Random Variables
  Continuous Random Variables
4.5 Discrete Probability Distributions
  Custom Discrete Probability Distribution
  Expected Value and Variance
  Discrete Uniform Probability Distribution
  Binomial Probability Distribution
  Poisson Probability Distribution
4.6 Continuous Probability Distributions
  Uniform Probability Distribution
  Triangular Probability Distribution
  Normal Probability Distribution
  Exponential Probability Distribution
Summary
Glossary
Problems
Case Problem 1: Hamilton County Judges
Case Problem 2: McNeil’s Auto Mall
Case Problem 3: Gebhardt Electronics
Available in the MindTap Reader:
  Appendix: Discrete Probability Distributions with R
  Appendix: Continuous Probability Distributions with R

Chapter 5 Descriptive Data Mining
5.1 Cluster Analysis
  Measuring Distance Between Observations
  k-Means Clustering
  Hierarchical Clustering and Measuring Dissimilarity Between Clusters
  Hierarchical Clustering Versus k-Means Clustering
5.2 Association Rules
  Evaluating Association Rules
5.3 Text Mining
  Voice of the Customer at Triad Airline
  Preprocessing Text Data for Analysis
  Movie Reviews
  Computing Dissimilarity Between Documents
  Word Clouds
Summary
Glossary
Problems
Case Problem 1: Big Ten Expansion
Case Problem 2: Know Thy Customer
Available in the MindTap Reader:
  Appendix: Getting Started with Rattle in R
  Appendix: k-Means Clustering with R
  Appendix: Hierarchical Clustering with R
  Appendix: Association Rules with R
  Appendix: Text Mining with R
  Appendix: R/Rattle Settings to Solve Chapter 5 Problems
  Appendix: Opening and Saving Excel Files in JMP Pro
  Appendix: Hierarchical Clustering with JMP Pro
  Appendix: k-Means Clustering with JMP Pro
  Appendix: Association Rules with JMP Pro
  Appendix: Text Mining with JMP Pro
  Appendix: JMP Pro Settings to Solve Chapter 5 Problems

Chapter 6 Statistical Inference
6.1 Selecting a Sample
  Sampling from a Finite Population
  Sampling from an Infinite Population
6.2 Point Estimation
  Practical Advice
6.3 Sampling Distributions
  Sampling Distribution of x̄
  Sampling Distribution of p̄
6.4 Interval Estimation
  Interval Estimation of the Population Mean
  Interval Estimation of the Population Proportion
6.5 Hypothesis Tests
  Developing Null and Alternative Hypotheses
  Type I and Type II Errors
  Hypothesis Test of the Population Mean
  Hypothesis Test of the Population Proportion
6.6 Big Data, Statistical Inference, and Practical Significance
  Sampling Error
  Nonsampling Error
  Big Data
  Understanding What Big Data Is
  Big Data and Sampling Error
  Big Data and the Precision of Confidence Intervals
  Implications of Big Data for Confidence Intervals
  Big Data, Hypothesis Testing, and p Values
  Implications of Big Data in Hypothesis Testing
Summary
Glossary
Problems
Case Problem 1: Young Professional Magazine
Case Problem 2: Quality Associates, Inc.
Available in the MindTap Reader:
  Appendix: Random Sampling with R
  Appendix: Interval Estimation with R
  Appendix: Hypothesis Testing with R

Chapter 7 Linear Regression
7.1 Simple Linear Regression Model
  Regression Model
  Estimated Regression Equation
7.2 Least Squares Method
  Least Squares Estimates of the Regression Parameters
  Using Excel’s Chart Tools to Compute the Estimated Regression Equation
7.3 Assessing the Fit of the Simple Linear Regression Model
  The Sums of Squares
  The Coefficient of Determination
  Using Excel’s Chart Tools to Compute the Coefficient of Determination
7.4 The Multiple Regression Model
  Regression Model
  Estimated Multiple Regression Equation
  Least Squares Method and Multiple Regression
  Butler Trucking Company and Multiple Regression
  Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation
7.5 Inference and Regression
  Conditions Necessary for Valid Inference in the Least Squares Regression Model
  Testing Individual Regression Parameters
  Addressing Nonsignificant Independent Variables
  Multicollinearity
7.6 Categorical Independent Variables
  Butler Trucking Company and Rush Hour
  Interpreting the Parameters
  More Complex Categorical Variables
7.7 Modeling Nonlinear Relationships
  Quadratic Regression Models
  Piecewise Linear Regression Models
  Interaction Between Independent Variables
7.8 Model Fitting
  Variable Selection Procedures
  Overfitting
7.9 Big Data and Regression
  Inference and Very Large Samples
  Model Selection
7.10 Prediction with Regression
Summary
Glossary
Problems
Case Problem 1: Alumni Giving
Case Problem 2: Consumer Research, Inc.
Case Problem 3: Predicting Winnings for NASCAR Drivers
Available in the MindTap Reader:
  Appendix: Simple Linear Regression with R
  Appendix: Multiple Linear Regression with R
  Appendix: Regression Variable Selection Procedures with R

Chapter 8 Time Series Analysis and Forecasting
8.1 Time Series Patterns
  Horizontal Pattern
  Trend Pattern
  Seasonal Pattern
  Trend and Seasonal Pattern
  Cyclical Pattern
  Identifying Time Series Patterns
8.2 Forecast Accuracy
8.3 Moving Averages and Exponential Smoothing
  Moving Averages
  Exponential Smoothing
8.4 Using Regression Analysis for Forecasting
  Linear Trend Projection
  Seasonality Without Trend
  Seasonality with Trend
  Using Regression Analysis as a Causal Forecasting Method
  Combining Causal Variables with Trend and Seasonality Effects
  Considerations in Using Regression in Forecasting
8.5 Determining the Best Forecasting Model to Use
Summary
Glossary
Problems
Case Problem 1: Forecasting Food and Beverage Sales
Case Problem 2: Forecasting Lost Sales
Appendix: Using the Excel Forecast Sheet
Available in the MindTap Reader:
  Appendix: Forecasting with R

Chapter 9 Predictive Data Mining
9.1 Data Sampling, Preparation, and Partitioning
  Static Holdout Method
  k-Fold Cross-Validation
  Class Imbalanced Data
9.2 Performance Measures
  Evaluating the Classification of Categorical Outcomes
  Evaluating the Estimation of Continuous Outcomes
9.3 Logistic Regression
9.4 k-Nearest Neighbors
  Classifying Categorical Outcomes with k-Nearest Neighbors
  Estimating Continuous Outcomes with k-Nearest Neighbors
9.5 Classification and Regression Trees
  Classifying Categorical Outcomes with a Classification Tree
  Estimating Continuous Outcomes with a Regression Tree
  Ensemble Methods
Summary
Glossary
Problems
Case Problem: Grey Code Corporation
Available in the MindTap Reader:
  Appendix: Classification via Logistic Regression with R
  Appendix: k-Nearest Neighbor Classification with R
  Appendix: k-Nearest Neighbor Regression with R
  Appendix: Individual Classification Trees with R
  Appendix: Individual Regression Trees with R
  Appendix: Random Forests of Classification Trees with R
  Appendix: Random Forests of Regression Trees with R
  Appendix: R/Rattle Settings to Solve Chapter 9 Problems
  Appendix: Data Partitioning with JMP Pro
  Appendix: Classification via Logistic Regression with JMP Pro
  Appendix: k-Nearest Neighbors Classification and Regression with JMP Pro
  Appendix: Individual Classification and Regression Trees with JMP Pro
  Appendix: Random Forests of Classification or Regression Trees with JMP Pro
  Appendix: JMP Pro Settings to Solve Chapter 9 Problems

Chapter 10 Spreadsheet Models
10.1 Building Good Spreadsheet Models
  Influence Diagrams
  Building a Mathematical Model
  Spreadsheet Design and Implementing the Model in a Spreadsheet
10.2 What-If Analysis
  Data Tables
  Goal Seek
  Scenario Manager
10.3 Some Useful Excel Functions for Modeling
  SUM and SUMPRODUCT
  IF and COUNTIF
  VLOOKUP
10.4 Auditing Spreadsheet Models
  Trace Precedents and Dependents
  Show Formulas
  Evaluate Formulas
  Error Checking
  Watch Window
10.5 Predictive and Prescriptive Spreadsheet Models
Summary
Glossary
Problems
Case Problem: Retirement Plan

Chapter 11 Monte Carlo Simulation
11.1 Risk Analysis for Sanotronics LLC
  Base-Case Scenario
  Worst-Case Scenario
  Best-Case Scenario
  Sanotronics Spreadsheet Model
  Use of Probability Distributions to Represent Random Variables
  Generating Values for Random Variables with Excel
  Executing Simulation Trials with Excel
  Measuring and Analyzing Simulation Output
11.2 Inventory Policy Analysis for Promus Corp
  Spreadsheet Model for Promus
  Generating Values for Promus Corp’s Demand
  Executing Simulation Trials and Analyzing Output
11.3 Simulation Modeling for Land Shark Inc.
  Spreadsheet Model for Land Shark
  Generating Values for Land Shark’s Random Variables
  Executing Simulation Trials and Analyzing Output
  Generating Bid Amounts with Fitted Distributions
11.4 Simulation with Dependent Random Variables
  Spreadsheet Model for Press Teag Worldwide
11.5 Simulation Considerations
  Verification and Validation
  Advantages and Disadvantages of Using Simulation
Summary
Summary of Steps for Conducting a Simulation Analysis
Glossary
Problems
Case Problem: Four Corners
Appendix: Common Probability Distributions for Simulation

Chapter 12 Linear Optimization Models
12.1 A Simple Maximization Problem
  Problem Formulation
  Mathematical Model for the Par, Inc. Problem
12.2 Solving the Par, Inc. Problem
  The Geometry of the Par, Inc. Problem
  Solving Linear Programs with Excel Solver
12.3 A Simple Minimization Problem
  Problem Formulation
  Solution for the M&D Chemicals Problem
12.4 Special Cases of Linear Program Outcomes
  Alternative Optimal Solutions
  Infeasibility
  Unbounded
12.5 Sensitivity Analysis
  Interpreting Excel Solver Sensitivity Report
12.6 General Linear Programming Notation and More Examples
  Investment Portfolio Selection
  Transportation Planning
  Maximizing Banner Ad Revenue
12.7 Generating an Alternative Optimal Solution for a Linear Program
Summary
Glossary
Problems
Case Problem: Investment Strategy

Chapter 13 Integer Linear Optimization Models
13.1 Types of Integer Linear Optimization Models
13.2 Eastborne Realty, an Example of Integer Optimization
  The Geometry of Linear All-Integer Optimization
13.3 Solving Integer Optimization Problems with Excel Solver
  A Cautionary Note About Sensitivity Analysis
13.4 Applications Involving Binary Variables
  Capital Budgeting
  Fixed Cost
  Bank Location
  Product Design and Market Share Optimization
13.5 Modeling Flexibility Provided by Binary Variables
  Multiple-Choice and Mutually Exclusive Constraints
  k Out of n Alternatives Constraint
  Conditional and Corequisite Constraints
13.6 Generating Alternatives in Binary Optimization
Summary
Glossary
Problems
Case Problem: Applecore Children’s Clothing

Chapter 14 Nonlinear Optimization Models
14.1 A Production Application: Par, Inc. Revisited
  An Unconstrained Problem
  A Constrained Problem
  Solving Nonlinear Optimization Models Using Excel Solver
  Sensitivity Analysis and Shadow Prices in Nonlinear Models
14.2 Local and Global Optima
  Overcoming Local Optima with Excel Solver
14.3 A Location Problem
14.4 Markowitz Portfolio Model
14.5 Adoption of a New Product: The Bass Forecasting Model
Summary
Glossary
Problems
Case Problem: Portfolio Optimization with Transaction Costs

Chapter 15 Decision Analysis
15.1 Problem Formulation
  Payoff Tables
  Decision Trees
15.2 Decision Analysis Without Probabilities
  Optimistic Approach
  Conservative Approach
  Minimax Regret Approach
15.3 Decision Analysis with Probabilities
  Expected Value Approach
  Risk Analysis
  Sensitivity Analysis
15.4 Decision Analysis with Sample Information
  Expected Value of Sample Information
  Expected Value of Perfect Information
15.5 Computing Branch Probabilities with Bayes’ Theorem
15.6 Utility Theory
  Utility and Decision Analysis
  Utility Functions
  Exponential Utility Function
Summary
Glossary
Problems
Case Problem: Property Purchase Strategy

Multi-Chapter Case Problems
  Capital State University Game-Day Magazines
  Hanover Inc.
Appendix A Basics of Excel
Appendix B Database Basics with Microsoft Access
Appendix C Solutions to Even-Numbered Problems (MindTap Reader)
References
Index

About the Authors

Jeffrey D. Camm. Jeffrey D. Camm is the Inmar Presidential Chair and Associate Dean of Business Analytics in the School of Business at Wake Forest University. Born in Cincinnati, Ohio, he holds a B.S. from Xavier University (Ohio) and a Ph.D. from Clemson University. Prior to joining the faculty at Wake Forest, he was on the faculty of the University of Cincinnati. He has also been a visiting scholar at Stanford University and a visiting professor of business administration at the Tuck School of Business at Dartmouth College. Dr. Camm has published over 40 papers in the general area of optimization applied to problems in operations management and marketing. He has published his research in Science, Management Science, Operations Research, Interfaces, and other professional journals. Dr. Camm was named the Dornoff Fellow of Teaching Excellence at the University of Cincinnati and he was the 2006 recipient of the INFORMS Prize for the Teaching of Operations Research Practice. A firm believer in practicing what he preaches, he has served as an operations research consultant to numerous companies and government agencies. From 2005 to 2010 he served as editor-in-chief of Interfaces. In 2016, Professor Camm received the George E. Kimball Medal for service to the operations research profession, and in 2017 he was named an INFORMS Fellow.

James J. Cochran. James J. Cochran is Associate Dean for Research, Professor of Applied Statistics and the Rogers-Spivey Faculty Fellow at The University of Alabama. Born in Dayton, Ohio, he earned his B.S., M.S., and M.B.A. from Wright State University and his Ph.D. from the University of Cincinnati. He has been at The University of Alabama since 2014 and has been a visiting scholar at Stanford University, Universidad de Talca, the University of South Africa and Pole Universitaire Leonard de Vinci. Dr. Cochran has published more than 40 papers in the development and application of operations research and statistical methods. He has published in several journals, including Management Science, The American Statistician, Communications in Statistics - Theory and Methods, Annals of Operations Research, European Journal of Operational Research, Journal of Combinatorial Optimization, Interfaces, and Statistics and Probability Letters. He received the 2008 INFORMS Prize for the Teaching of Operations Research Practice, 2010 Mu Sigma Rho Statistical Education Award and 2016 Waller Distinguished Teaching Career Award from the American Statistical Association. Dr. Cochran was elected to the International Statistics Institute in 2005, named a Fellow of the American Statistical Association in 2011, and named a Fellow of INFORMS in 2017. He also received the Founders Award in 2014 and the Karl E. Peace Award in 2015 from the American Statistical Association, and he received the INFORMS President’s Award in 2019. A strong advocate for effective operations research and statistics education as a means of improving the quality of applications to real problems, Dr. Cochran has chaired teaching effectiveness workshops around the globe. He has served as an operations research consultant to numerous companies and not-for-profit organizations. He served as editor-in-chief of INFORMS Transactions on Education and is on the editorial board of INFORMS Journal of Applied Analytics, International Transactions in Operational Research, and Significance.

Michael J. Fry. Michael J. Fry is Professor of Operations, Business Analytics, and Information Systems (OBAIS) and Academic Director of the Center for Business Analytics in the Carl H. Lindner College of Business at the University of Cincinnati. Born in Killeen, Texas, he earned a B.S. from Texas A&M University, and M.S.E. and Ph.D. degrees from the University of Michigan. He has been at the University of Cincinnati since 2002, where he served as Department Head from 2014 to 2018 and has been named a Lindner Research Fellow. He has also been a visiting professor at Cornell University and at the University of British Columbia. Professor Fry has published more than 25 research papers in journals such as Operations Research, Manufacturing & Service Operations Management, Transportation Science, Naval Research Logistics, IIE Transactions, Critical Care Medicine, and Interfaces. He serves on editorial boards for journals such as Production and Operations Management, INFORMS Journal of Applied Analytics (formerly Interfaces), and Journal of Quantitative Analysis in Sports. His research interests are in applying analytics to the areas of supply chain management, sports, and public-policy operations. He has worked with many different organizations for his research, including Dell, Inc., Starbucks Coffee Company, Great American Insurance Group, the Cincinnati Fire Department, the State of Ohio Election Commission, the Cincinnati Bengals, and the Cincinnati Zoo & Botanical Gardens. In 2008, he was named a finalist for the Daniel H. Wagner Prize for Excellence in Operations Research Practice, and he has been recognized for both his research and teaching excellence at the University of Cincinnati. In 2019, he led the team that was awarded the INFORMS UPS George D. Smith Prize on behalf of the OBAIS Department at the University of Cincinnati.

Jeffrey W. Ohlmann. Jeffrey W. Ohlmann is Associate Professor of Business Analytics and Huneke Research Fellow in the Tippie College of Business at the University of Iowa. Born in Valentine, Nebraska, he earned a B.S. from the University of Nebraska, and M.S. and Ph.D. degrees from the University of Michigan. He has been at the University of Iowa since 2003. Professor Ohlmann’s research on the modeling and solution of decision-making problems has produced more than two dozen research papers in journals such as Operations Research, Mathematics of Operations Research, INFORMS Journal on Computing, Transportation Science, and the European Journal of Operational Research. He has collaborated with companies such as Transfreight, LeanCor, Cargill, the Hamilton County Board of Elections, and three National Football League franchises. Because of the relevance of his work to industry, he was bestowed the George B. Dantzig Dissertation Award and was recognized as a finalist for the Daniel H. Wagner Prize for Excellence in Operations Research Practice.

Preface

Business Analytics 4E is designed to introduce the concept of business analytics to undergraduate and graduate students. This edition builds upon what was one of the first collections of materials that are essential to the growing field of business analytics. In Chapter 1, we present an overview of business analytics and our approach to the material in this textbook. In simple terms, business analytics helps business professionals make better decisions based on data. We discuss models for summarizing, visualizing, and understanding useful information from historical data in Chapters 2 through 6. Chapters 7 through 9 introduce methods for both gaining insights from historical data and predicting possible future outcomes. Chapter 10 covers the use of spreadsheets for examining data and building decision models. In Chapter 11, we demonstrate how to explicitly introduce uncertainty into spreadsheet models through the use of Monte Carlo simulation. In Chapters 12 through 14, we discuss optimization models to help decision makers choose the best decision based on the available data. Chapter 15 is an overview of decision analysis approaches for incorporating a decision maker’s views about risk into decision making. In Appendix A we present optional material for students who need to learn the basics of using Microsoft Excel. The use of databases and manipulating data in Microsoft Access is discussed in Appendix B. Appendixes in many chapters illustrate the use of additional software tools such as R, JMP Pro and Tableau to apply analytics methods.

This textbook can be used by students who have previously taken a course on basic statistical methods as well as students who have not had a prior course in statistics. Business Analytics 4E is also amenable to a two-course sequence in business statistics and analytics. All statistical concepts contained in this textbook are presented from a business analytics perspective using practical business examples. Chapters 2, 4, 6, and 7 provide an introduction to basic statistical concepts that form the foundation for more advanced analytics methods. Chapters 3, 5, and 9 cover additional topics of data visualization and data mining that are not traditionally part of most introductory business statistics courses, but they are exceedingly important and commonly used in current business environments. Chapter 10 and Appendix A provide the foundational knowledge students need to use Microsoft Excel for analytics applications. Chapters 11 through 15 build upon this spreadsheet knowledge to present additional topics that are used by many organizations that are leaders in the use of prescriptive analytics to improve decision making.

Updates in the Fourth Edition The fourth edition of Business Analytics is a major revision. We have added online appendixes for many topics in Chapters 1 through 9 that introduce the use of R, the exceptionally popular open-source software for analytics. Business Analytics 4E also includes an appendix to Chapter 3 introducing the powerful data visualization software Tableau. We have further enhanced our data mining chapters to allow instructors to choose their preferred means of teaching this material in terms of software usage. We have expanded the number of conceptual homework problems in both Chapters 5 and 9 to increase the number of opportunities for students learn about data mining and solve problems without the use of data mining software. Additionally, we now include online appendixes on using JMP Pro and R for teaching data mining so that instructors can choose their favored way of teaching this material. Other changes in this edition include an expanded discussion of binary variables for integer optimization in Chapter 13, an additional example in Chapter 11 for Monte Carlo simulation, and new and revised homework problems and cases. ●●

Tableau Appendix for Data Visualization. Chapter 3 now includes a new appendix that introduces the use of the software Tableau for data visualization. Tableau is a very powerful software for creating meaningful data visualizations that can be used to display, and to analyze, data. The appendix includes step-by-step directions for generating many of the charts used in Chapters 2 and 3 in Tableau.


Incorporation of R. R is an exceptionally powerful open-source software that is widely used for a variety of statistical and analytics methods. We now include online appendixes that introduce the use of R for many of the topics covered in Chapters 1 through 9, including data visualization and data mining. These appendixes include step-by-step directions for using R to implement the methods described in these chapters. To facilitate the use of R, we introduce RStudio, an open-source integrated development environment (IDE) that provides a menu-driven interface for R. For Chapters 5 and 9, which cover data mining, we introduce the use of Rattle, a library package providing a graphical user interface for R specifically tailored for data mining functionality. The use of RStudio and Rattle eases the learning curve of using R so that students can focus on learning the methods and interpreting the output.

Updates for Data Mining Chapters. Chapters 5 and 9 have received extensive updates. We have moved the Descriptive Data Mining chapter to Chapter 5 so that it is located after our chapter on Probability. This allows us to use probability concepts such as conditional probability to explain association rule measures. Additional content on text mining and further discussion of ways to measure distance between observations have been added to a reorganized Descriptive Data Mining chapter. Descriptions of cross-validation approaches, methods of addressing class-imbalanced data, and out-of-bag estimation in ensemble methods have been added to Chapter 9 on Predictive Data Mining. The end-of-chapter problems in Chapters 5 and 9 have been revised and generalized to accommodate the use of a wide range of data mining software. To allow instructors to choose different software for use with these chapters, we have created online appendixes for both JMP Pro and R. JMP has introduced a new version of its software (JMP Pro 14) since the previous edition of this textbook, so we have updated our JMP Pro output and step-by-step instructions to reflect changes in this software. We have also written online appendixes for Chapters 5 and 9 that use R and the graphical user interface Rattle to introduce topics from these chapters to students. The use of Rattle removes some of the more difficult line-by-line coding in R to perform many common data mining techniques so that students can concentrate on learning the methods rather than coding syntax. For some data mining techniques that are not available in Rattle, we show how to accomplish these methods using R code. And for all of our textbook examples, we include the exact R code that can be used to solve the examples. We have also added homework problems to Chapters 5 and 9 that can be solved without using any specialized software. This allows instructors to cover the basics of data mining without introducing any additional software. The online appendixes for Chapters 5 and 9 also include JMP Pro- and R-specific instructions for how to solve the end-of-chapter problems and cases. Problem and case solutions using both JMP Pro and R are also available to instructors.

Additional Simulation Model Example. We have added an additional example of a simulation model in Chapter 11. This new example helps bridge the gap in the difficulty levels of the previous examples. The new example also gives students additional information on how to build and interpret simulation models.

New Cases. Business Analytics 4E includes nine new end-of-chapter cases that allow students to work on more extensive problems related to the chapter material and work with larger data sets. We have also written two new cases that require the use of material from multiple chapters. This helps students understand the connections between the material in different chapters and is more representative of analytics projects in practice where the methods used are often not limited to a single type.

Legal and Ethical Issues Related to Analytics and Big Data. Chapter 1 now includes a section that discusses legal and ethical issues related to analytics and the use of big data. This section discusses legal issues related to the protection of data as well as ethical issues related to the misuse and unintended consequences of analytics applications.


New End-of-Chapter Problems. The fourth edition of this textbook includes more than 20 new problems. We have also revised many of the existing problems to update them and improve clarity. Each end-of-chapter problem now also includes a short header to make the application of the exercise clearer. As we have done in past editions, Excel solution files are available to instructors for problems that require the use of Excel. For problems that require the use of software in the data-mining chapters (Chapters 5 and 9), we include solutions for both JMP Pro and R/Rattle.

Continued Features and Pedagogy

In the fourth edition of this textbook, we continue to offer all of the features that have been successful in previous editions. Some of the specific features that we use in this textbook are listed below.

Integration of Microsoft Excel: Excel has been thoroughly integrated throughout this textbook. For many methodologies, we provide instructions for how to perform calculations both by hand and with Excel. In other cases where realistic models are practical only with the use of a spreadsheet, we focus on the use of Excel to describe the methods to be used.

Notes and Comments: At the end of many sections, we provide Notes and Comments to give the student additional insights about the methods presented in that section. These insights include comments on the limitations of the presented methods, recommendations for applications, and other matters. Additionally, margin notes are used throughout the textbook to provide additional insights and tips related to the specific material being discussed.

Analytics in Action: Each chapter contains an Analytics in Action article. Several of these have been updated and replaced for the fourth edition. These articles present interesting examples of the use of business analytics in practice. The examples are drawn from many different organizations in a variety of areas including healthcare, finance, manufacturing, marketing, and others.

DATAfiles and MODELfiles: All data sets used as examples and in student exercises are also provided online on the companion site as files available for download by the student. DATAfiles are Excel files (or .csv files for easy import into JMP Pro and R/Rattle) that contain data needed for the examples and problems given in the textbook. MODELfiles contain additional modeling features such as extensive use of Excel formulas or the use of Excel Solver, JMP Pro, or R.

Problems and Cases: With the exception of Chapter 1, each chapter contains an extensive selection of problems to help the student master the material presented in that chapter. The problems vary in difficulty and most relate to specific examples of the use of business analytics in practice. Answers to even-numbered problems are provided in an online supplement for student access. With the exception of Chapter 1, each chapter also includes at least one in-depth case study that connects many of the different methods introduced in the chapter. The case studies are designed to be more open-ended than the chapter problems, but enough detail is provided to give the student some direction in solving the cases. New to the fourth edition is the inclusion of two cases that require the use of material from multiple chapters in the text to better illustrate how concepts from different chapters relate to each other.

MindTap

MindTap is a customizable digital course solution that includes an interactive eBook, autograded exercises from the textbook, algorithmic practice problems with solutions feedback, Exploring Analytics visualizations, Adaptive Test Prep, and more! MindTap is also where instructors and users can find the online appendixes for JMP Pro and R/Rattle. All of these materials offer students better access to resources to understand the materials within the course. For more information on MindTap, please contact your Cengage representative.

WebAssign

Prepare for class with confidence using WebAssign from Cengage. This online learning platform fuels practice, so students can truly absorb what they learn and are better prepared come test time. Videos, Problem Walk-Throughs, and End-of-Chapter problems and cases with instant feedback help them understand the important concepts, while instant grading allows you and them to see where they stand in class. Class Insights allows students to see what topics they have mastered and which they are struggling with, helping them identify where to spend extra time. Study Smarter with WebAssign.

For Students

Online resources are available to help the student work more efficiently. The resources can be accessed through www.cengage.com/decisionsciences/camm/ba/4e.

R, RStudio, and Rattle: R, RStudio, and Rattle are open-source software, so they are free to download. Business Analytics 4E includes step-by-step instructions for downloading this software.

JMP Pro: Many universities have site licenses of SAS Institute’s JMP Pro software on both Mac and Windows. These are typically offered through your university’s software licensing administrator. Faculty may contact the JMP Academic team to find out if their universities have a license or to request a complimentary instructor copy at www.jmp.com/contact-academic. For institutions without a site license, students may rent a 6- or 12-month license for JMP at www.onthehub.com/jmp.

Data Files: A complete download of all data files associated with this text is available.

For Instructors

Instructor resources are available to adopters on the Instructor Companion Site, which can be found and accessed at www.cengage.com/decisionsciences/camm/ba/4e, including:

Solutions Manual: The Solutions Manual, prepared by the authors, includes solutions for all problems in the text. It is available online as well as in print. Excel solution files are available to instructors for those problems that require the use of Excel. Solutions for Chapters 5 and 9 are available using both JMP Pro and R/Rattle for data mining problems.

Solutions to Case Problems: These are also prepared by the authors and contain solutions to all case problems presented in the text. Case solutions for Chapters 5 and 9 are provided using both JMP Pro and R/Rattle. Extensive case solutions are also provided for the new multi-chapter cases that draw on material from multiple chapters.

PowerPoint Presentation Slides: The presentation slides contain a teaching outline that incorporates figures to complement instructor lectures.

Test Bank: Cengage Learning Testing Powered by Cognero is a flexible, online system that allows you to author, edit, and manage test bank content from multiple Cengage Learning solutions; create multiple test versions in an instant; and deliver tests from your Learning Management System (LMS), your classroom, or wherever you want.


Acknowledgments We would like to acknowledge the work of reviewers and users who have provided comments and suggestions for improvement of this text. Thanks to: Rafael Becerril Arreola University of South Carolina Matthew D. Bailey Bucknell University Phillip Beaver University of Denver M. Khurrum S. Bhutta Ohio University Paolo Catasti Virginia Commonwealth University Q B. Chung Villanova University Elizabeth A. Denny University of Kentucky Mike Taein Eom University of Portland Yvette Njan Essounga Fayetteville State University Lawrence V. Fulton Texas State University Tom Groleau Carthage College James F. Hoelscher Lincoln Memorial University Eric Huggins Fort Lewis College Faizul Huq Ohio University Marco Lam York College of Pennsylvania Thomas Lee University of California, Berkeley Roger Myerson Northwestern University Ram Pakath University of Kentucky Susan Palocsay James Madison University Andy Shogan University of California, Berkeley Dothan Truong Embry-Riddle Aeronautical University


Kai Wang Wake Technical Community College Ed Wasil American University Ed Winkofsky University of Cincinnati A special thanks goes to our associates from business and industry who supplied the Analytics in Action features. We recognize them individually by a credit line in each of the articles. We are also indebted to our senior product manager, Aaron Arnsparger; our Senior Content Manager, Conor Allen; senior learning designer, Brandon Foltz; digital delivery lead, Mark Hopkinson; and our senior project manager at MPS Limited, Santosh Pandey, for their editorial counsel and support during the preparation of this text. Jeffrey D. Camm James J. Cochran Michael J. Fry Jeffrey W. Ohlmann

Chapter 1 Introduction

Contents
1.1 DECISION MAKING
1.2 BUSINESS ANALYTICS DEFINED
1.3 A CATEGORIZATION OF ANALYTICAL METHODS AND MODELS
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics
1.4 BIG DATA
Volume
Velocity
Variety
Veracity
1.5 BUSINESS ANALYTICS IN PRACTICE
Financial Analytics
Human Resource (HR) Analytics
Marketing Analytics
Health Care Analytics
Supply Chain Analytics
Analytics for Government and Nonprofits
Sports Analytics
Web Analytics
1.6 LEGAL AND ETHICAL ISSUES IN THE USE OF DATA AND ANALYTICS
Summary
Glossary

Available in the MindTap Reader:
Appendix: Getting Started with R and RStudio
Appendix: Basic Data Manipulation with R


You apply for a loan for the first time. How does the bank assess the riskiness of the loan it might make to you? How does Amazon.com know which books and other products to recommend to you when you log in to their web site? How do airlines determine what price to quote to you when you are shopping for a plane ticket? How can doctors better diagnose and treat you when you are ill or injured? You may be applying for a loan for the first time, but millions of people around the world have applied for loans before. Many of these loan recipients have paid back their loans in full and on time, but some have not. The bank wants to know whether you are more like those who have paid back their loans or more like those who defaulted. By comparing your credit history, financial situation, and other factors to the vast database of previous loan recipients, the bank can effectively assess how likely you are to default on a loan. Similarly, Amazon.com has access to data on millions of purchases made by customers on its web site. Amazon.com examines your previous purchases, the products you have viewed, and any product recommendations you have provided. Amazon.com then searches through its huge database for customers who are similar to you in terms of product purchases, recommendations, and interests. Once similar customers have been identified, their purchases form the basis of the recommendations given to you. Prices for airline tickets are frequently updated. The price quoted to you for a flight between New York and San Francisco today could be very different from the price that will be quoted tomorrow. These changes happen because airlines use a pricing strategy known as revenue management. Revenue management works by examining vast amounts of data on past airline customer purchases and using these data to forecast future purchases. 
These forecasts are then fed into sophisticated optimization algorithms that determine the optimal price to charge for a particular flight and when to change that price. Revenue management has resulted in substantial increases in airline revenues.

Finally, consider the case of being evaluated by a doctor for a potentially serious medical issue. Hundreds of medical papers may describe research studies done on patients facing similar diagnoses, and thousands of data points exist on their outcomes. However, it is extremely unlikely that your doctor has read every one of these research papers or is aware of all previous patient outcomes. Instead of relying only on her medical training and knowledge gained from her limited set of previous patients, wouldn’t it be better for your doctor to have access to the expertise and patient histories of thousands of doctors around the world? A group of IBM computer scientists initiated a project to develop a new decision technology to help in answering these types of questions. That technology is called Watson, named after the founder of IBM, Thomas J. Watson. The team at IBM focused on one aim: how the vast amounts of data now available on the Internet can be used to make more data-driven, smarter decisions. Watson is an example of the exploding field of artificial intelligence (AI). Broadly speaking, AI is the use of data and computers to make decisions that would have in the past required human intelligence. Often, the computer software mimics the way we understand the human brain to function. Watson became a household name in 2011, when it famously won the television game show Jeopardy! Since that proof of concept in 2011, IBM has reached agreements with the health insurance provider WellPoint (now part of Anthem), the financial services company Citibank, Memorial Sloan-Kettering Cancer Center, and automobile manufacturer General Motors to apply Watson to the decision problems that they face.
Watson is a system of computing hardware, high-speed data processing, and analytical algorithms that are combined to make data-based recommendations. As more and more data are collected, Watson has the capability to learn over time. In simple terms, according to IBM, Watson gathers hundreds of thousands of possible solutions from a huge data bank, evaluates them using analytical techniques, and proposes only the best solutions for consideration. Watson provides not just a single solution, but rather a range of good solutions with a confidence level for each. For example, at a data center in Virginia, to the delight of doctors and patients, Watson is already being used to speed up the approval of medical procedures. Citibank is beginning to explore how to use Watson to better serve its customers, and cancer specialists at


more than a dozen hospitals in North America are using Watson to assist with the diagnosis and treatment of patients.1 This book is concerned with data-driven decision making and the use of analytical approaches in the decision-making process. Three developments spurred recent explosive growth in the use of analytical methods in business applications. First, technological advances—such as improved point-of-sale scanner technology and the collection of data through e-commerce and social networks, data obtained by sensors on all kinds of mechanical devices such as aircraft engines, automobiles, and farm machinery through the so-called Internet of Things and data generated from personal electronic devices—produce incredible amounts of data for businesses. Naturally, businesses want to use these data to improve the efficiency and profitability of their operations, better understand their customers, price their products more effectively, and gain a competitive advantage. Second, ongoing research has resulted in numerous methodological developments, including advances in computational approaches to effectively handle and explore massive amounts of data, faster algorithms for optimization and simulation, and more effective approaches for visualizing data. Third, these methodological developments were paired with an explosion in computing power and storage capability. Better computing hardware, parallel computing, and, more recently, cloud computing (the remote use of hardware and software over the Internet) have enabled businesses to solve big problems more quickly and more accurately than ever before. In summary, the availability of massive amounts of data, improvements in analytic methodologies, and substantial increases in computing power have all come together to result in a dramatic upsurge in the use of analytical methods in business and a reliance on the discipline that is the focus of this text: business analytics. 
As stated in the Preface, the purpose of this text is to provide students with a sound conceptual understanding of the role that business analytics plays in the decision-making process. To reinforce the applications orientation of the text and to provide a better understanding of the variety of applications in which analytical methods have been used successfully, Analytics in Action articles are presented throughout the book. Each Analytics in Action article summarizes an application of analytical methods in practice.

1 “IBM’s Watson Is Learning Its Way to Saving Lives,” Fastcompany web site, December 8, 2012; H. Landi, “IBM Watson Health Touts Recent Studies Showing AI Improves How Physicians Treat Cancer,” FierceHealthcare web site, June 4, 2019.

1.1 Decision Making

It is the responsibility of managers to plan, coordinate, organize, and lead their organizations to better performance. Ultimately, managers’ responsibilities require that they make strategic, tactical, or operational decisions. Strategic decisions involve higher-level issues concerned with the overall direction of the organization; these decisions define the organization’s overall goals and aspirations for the future. Strategic decisions are usually the domain of higher-level executives and have a time horizon of three to five years. Tactical decisions concern how the organization should achieve the goals and objectives set by its strategy, and they are usually the responsibility of midlevel management. Tactical decisions usually span a year and thus are revisited annually or even every six months. Operational decisions affect how the firm is run from day to day; they are the domain of operations managers, who are the closest to the customer.

Consider the case of the Thoroughbred Running Company (TRC). Historically, TRC had been a catalog-based retail seller of running shoes and apparel. TRC sales revenues grew quickly as it changed its emphasis from catalog-based sales to Internet-based sales. Recently, TRC decided that it should also establish retail stores in the malls and downtown areas of major cities. This strategic decision will take the firm in a new direction that it hopes will complement its Internet-based strategy. TRC middle managers will therefore have to make a variety of tactical decisions in support of this strategic decision, including how many new stores to open this year, where to open these new stores, how many distribution centers will be needed to support the new stores, and where to locate these distribution centers. Operations managers in the stores will need to make day-to-day decisions regarding, for instance, how many pairs of each model and size of shoes to order from the distribution centers and how to schedule their sales personnel’s work time.

Regardless of the level within the firm, decision making can be defined as the following process:

1. Identify and define the problem.
2. Determine the criteria that will be used to evaluate alternative solutions.
3. Determine the set of alternative solutions.
4. Evaluate the alternatives.
5. Choose an alternative.

Step 1 of decision making, identifying and defining the problem, is the most critical. Only if the problem is well-defined, with clear metrics of success or failure (step 2), can a proper approach for solving the problem (steps 3 and 4) be devised. Decision making concludes with the choice of one of the alternatives (step 5). There are a number of approaches to making decisions: tradition (“We’ve always done it this way”), intuition (“gut feeling”), and rules of thumb (“As the restaurant owner, I schedule twice the number of waiters and cooks on holidays”). The power of each of these approaches should not be underestimated. Managerial experience and intuition are valuable inputs to making decisions, but what if relevant data were available to help us make more informed decisions? With the vast amounts of data now generated and stored electronically, it is estimated that the amount of data stored by businesses more than doubles every two years. How can managers convert these data into knowledge that they can use to be more efficient and effective in managing their businesses?
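The five-step process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the store-location problem, the criteria weights, and the 1-to-10 scores are hypothetical values invented for the example, and a weighted sum is just one simple way to carry out the evaluation in step 4.

```python
# Hypothetical illustration of the five-step decision process.
# The criteria, weights, alternatives, and scores are invented for this sketch.

def choose_alternative(weights, scores):
    """Return (best_name, totals) using a weighted sum of criterion scores."""
    totals = {}
    for name, criterion_scores in scores.items():
        # Step 4: evaluate each alternative against every criterion.
        totals[name] = sum(weights[c] * criterion_scores[c] for c in weights)
    # Step 5: choose the alternative with the highest weighted score.
    best = max(totals, key=totals.get)
    return best, totals

# Step 1: the problem is where a retailer should open its next store.
# Step 2: criteria and their relative importance (weights sum to 1).
weights = {"foot_traffic": 0.5, "rent": 0.3, "competition": 0.2}

# Step 3: the alternatives, each scored from 1 (poor) to 10 (good).
scores = {
    "downtown": {"foot_traffic": 9, "rent": 3, "competition": 4},
    "mall": {"foot_traffic": 8, "rent": 6, "competition": 5},
    "suburb": {"foot_traffic": 4, "rent": 9, "competition": 8},
}

best, totals = choose_alternative(weights, scores)
print(best, totals)
```

Changing the weights in step 2 can change the winner, which is one reason a well-defined problem with clear criteria matters so much.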

1.2 Business Analytics Defined

Some firms and industries use the simpler term, analytics. Analytics is often thought of as a broader category than business analytics, encompassing the use of analytical techniques in the sciences and engineering as well. In this text, we use business analytics and analytics synonymously.

What makes decision making difficult and challenging? Uncertainty is probably the number one challenge. If we knew how much the demand will be for our product, we could do a much better job of planning and scheduling production. If we knew exactly how long each step in a project will take to be completed, we could better predict the project’s cost and completion date. If we knew how stocks will perform, investing would be a lot easier. Another factor that makes decision making difficult is that we often face such an enormous number of alternatives that we cannot evaluate them all. What is the best combination of stocks to help me meet my financial objectives? What is the best product line for a company that wants to maximize its market share? How should an airline price its tickets so as to maximize revenue?

Business analytics is the scientific process of transforming data into insight for making better decisions.2 Business analytics is used for data-driven or fact-based decision making, which is often seen as more objective than other alternatives for decision making. As we shall see, the tools of business analytics can aid decision making by creating insights from data, by improving our ability to more accurately forecast for planning, by helping us quantify risk, and by yielding better alternatives through analysis and optimization. A study based on a large sample of firms that was conducted by researchers at MIT’s Sloan School of Management and the University of Pennsylvania concluded that firms guided by data-driven decision making have higher productivity and market value and increased output and profitability.3

2 We adopt the definition of analytics developed by the Institute for Operations Research and the Management Sciences (INFORMS).

3 E. Brynjolfsson, L. M. Hitt, and H. H. Kim, “Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?” Thirty-Second International Conference on Information Systems, Shanghai, China, December 2011.


1.3 A Categorization of Analytical Methods and Models Business analytics can involve anything from simple reports to the most advanced optimization techniques (methods for finding the best course of action). Analytics is generally thought to comprise three broad categories of techniques: descriptive analytics, predictive analytics, and prescriptive analytics.

Descriptive Analytics

Appendix B, at the end of this book, describes how to use Microsoft Access to conduct data queries.

Descriptive analytics encompasses the set of techniques that describes what has happened in the past. Examples are data queries, reports, descriptive statistics, data visualization including data dashboards, some data-mining techniques, and basic what-if spreadsheet models. A data query is a request for information with certain characteristics from a database. For example, a query to a manufacturing plant’s database might be for all records of shipments to a particular distribution center during the month of March. This query provides descriptive information about these shipments: the number of shipments, how much was included in each shipment, the date each shipment was sent, and so on. A report summarizing relevant historical information for management might be conveyed by the use of descriptive statistics (means, measures of variation, etc.) and data-visualization tools (tables, charts, and maps). Simple descriptive statistics and data-visualization techniques can be used to find patterns or relationships in a large database. Data dashboards are collections of tables, charts, maps, and summary statistics that are updated as new data become available. Dashboards are used to help management monitor specific aspects of the company’s performance related to their decision-making responsibilities. For corporate-level managers, daily data dashboards might summarize sales by region, current inventory levels, and other company-wide metrics; front-line managers may view dashboards that contain metrics related to staffing levels, local inventory levels, and short-term sales forecasts. Data mining is the use of analytical techniques for better understanding patterns and relationships that exist in large data sets. For example, by analyzing text on social network platforms like Twitter, data-mining techniques (including cluster analysis and sentiment analysis) are used by companies to better understand their customers. 
By categorizing certain words as positive or negative and keeping track of how often those words appear in tweets, a company like Apple can better understand how its customers are feeling about a product like the Apple Watch.
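The word-counting idea in the sentiment example can be sketched as follows. This is a minimal illustration, not how any particular company does it: the positive and negative word lists and the sample posts are hypothetical, and real sentiment analysis has to deal with negation, sarcasm, and far larger vocabularies.

```python
import re
from collections import Counter

# Hypothetical word lists and posts, invented for this sketch.
POSITIVE = {"love", "great", "amazing"}
NEGATIVE = {"hate", "broken", "slow"}

def sentiment_counts(posts):
    """Count positive and negative word occurrences across a list of posts."""
    counts = Counter()
    for post in posts:
        # Lowercase the text and split it into simple word tokens.
        for word in re.findall(r"[a-z']+", post.lower()):
            if word in POSITIVE:
                counts["positive"] += 1
            elif word in NEGATIVE:
                counts["negative"] += 1
    return counts

posts = [
    "I love my new watch, the battery life is great",
    "The strap is broken already and syncing is slow",
    "Amazing display, but I hate the charger",
]

counts = sentiment_counts(posts)
print(counts["positive"], counts["negative"])
```

Tracking these two counts over time, for example by day or by product release, gives a rough descriptive picture of how customer sentiment is moving.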

Predictive Analytics

Predictive analytics consists of techniques that use models constructed from past data to predict the future or ascertain the impact of one variable on another. For example, past data on product sales may be used to construct a mathematical model to predict future sales. This model can factor in the product’s growth trajectory and seasonality based on past patterns. A packaged-food manufacturer may use point-of-sale scanner data from retail outlets to help in estimating the lift in unit sales due to coupons or sales events. Survey data and past purchase behavior may be used to help predict the market share of a new product. All of these are applications of predictive analytics. Linear regression, time series analysis, some data-mining techniques, and simulation, often referred to as risk analysis, all fall under the banner of predictive analytics. We discuss all of these techniques in greater detail later in this text. Data mining, previously discussed as a descriptive analytics tool, is also often used in predictive analytics. For example, a large grocery store chain might be interested in developing a targeted marketing campaign that offers a discount coupon on potato chips. By studying historical point-of-sale data, the store may be able to use data mining to predict which customers are the most likely to respond to an offer on discounted chips by purchasing higher-margin items such as beer or soft drinks in addition to the chips, thus increasing the store’s overall revenue.
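As a small illustration of building a model from past sales to predict future sales, the sketch below fits a straight-line trend by ordinary least squares and extrapolates one period ahead. The sales figures are hypothetical, and the sketch captures only the growth trend; it deliberately ignores seasonality, which a fuller model would also need to represent.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical unit sales for six past periods.
periods = [1, 2, 3, 4, 5, 6]
sales = [102, 108, 115, 119, 127, 131]

slope, intercept = fit_line(periods, sales)

# Predict sales for the next period by extending the fitted trend.
forecast = intercept + slope * 7
print(round(slope, 2), round(forecast, 1))
```

The fitted slope estimates the average growth in sales per period; the forecast is only as trustworthy as the assumption that the past trend continues.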


Simulation involves the use of probability and statistics to construct a computer model to study the impact of uncertainty on a decision. For example, banks often use simulation to model investment and default risk in order to stress-test financial models. Simulation is also often used in the pharmaceutical industry to assess the risk of introducing a new drug.
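A minimal Monte Carlo sketch of the default-risk idea follows. It assumes, purely for illustration, a portfolio of identical loans that default independently with the same probability and the same loss per default; the function name and all parameter values are hypothetical, and real bank models are considerably richer.

```python
import random

def simulate_portfolio_losses(n_loans, default_prob, loss_per_default,
                              n_trials, seed=0):
    """Simulate total portfolio loss over many trials.

    Assumes each loan defaults independently with probability default_prob.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    losses = []
    for _ in range(n_trials):
        # Draw one random number per loan; a small draw means a default.
        defaults = sum(1 for _ in range(n_loans) if rng.random() < default_prob)
        losses.append(defaults * loss_per_default)
    return losses

# Hypothetical portfolio: 100 loans, 5% default chance, $10,000 lost per default.
losses = simulate_portfolio_losses(100, 0.05, 10_000, 2_000, seed=42)

mean_loss = sum(losses) / len(losses)
# Approximate 95th percentile: a loss level exceeded in roughly 5% of trials.
p95_loss = sorted(losses)[int(0.95 * len(losses))]
print(mean_loss, p95_loss)
```

The spread between the average loss and the high-percentile loss is the kind of information a stress test is after: it shows how bad outcomes can get, not just what is expected.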

Prescriptive Analytics

Prescriptive analytics differs from descriptive and predictive analytics in that prescriptive analytics indicates a course of action to take; that is, the output of a prescriptive model is a decision. Predictive models provide a forecast or prediction, but do not provide a decision. However, a forecast or prediction, when combined with a rule, becomes a prescriptive model. For example, we may develop a model to predict the probability that a person will default on a loan. If we create a rule that says we should not award a loan when the estimated probability of default is more than 0.6, then the predictive model, coupled with the rule, is prescriptive analytics. These types of prescriptive models that rely on a rule or set of rules are often referred to as rule-based models. Other examples of prescriptive analytics are portfolio models in finance, supply network design models in operations, and price-markdown models in retailing. Portfolio models use historical investment return data to determine which mix of investments will yield the highest expected return while controlling or limiting exposure to risk. Supply-network design models provide plant and distribution center locations that will minimize costs while still meeting customer service requirements. Given historical data, retail price markdown models yield revenue-maximizing discount levels and the timing of discount offers when goods have not sold as planned. All of these models are known as optimization models, that is, models that give the best decision subject to the constraints of the situation. Another type of modeling in the prescriptive analytics category is simulation optimization, which combines the use of probability and statistics to model uncertainty with optimization techniques to find good decisions in highly complex and highly uncertain settings.
Finally, the techniques of decision analysis can be used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain set of future events. Decision analysis also employs utility theory, which assigns values to outcomes based on the decision maker’s attitude toward risk, loss, and other factors. In this text we cover all three areas of business analytics: descriptive, predictive, and prescriptive. Table 1.1 shows how the chapters cover the three categories.
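As a rough illustration of decision analysis, the sketch below picks the alternative with the highest expected payoff. The alternatives, future events, payoffs, and probabilities are all invented for illustration, and the sketch assumes a risk-neutral decision maker, so it does not apply utility theory.

```python
# Hypothetical payoffs (in $ thousands) for each decision alternative
# under each uncertain future event.
payoffs = {
    "small plant": {"low demand": 8, "high demand": 10},
    "large plant": {"low demand": -5, "high demand": 25},
}
# Hypothetical probabilities of the future events (they must sum to 1).
probabilities = {"low demand": 0.5, "high demand": 0.5}


def expected_value(outcomes, probs):
    """Probability-weighted average payoff of one decision alternative."""
    return sum(probs[event] * payoff for event, payoff in outcomes.items())


expected = {alt: expected_value(out, probabilities) for alt, out in payoffs.items()}
best = max(expected, key=expected.get)
print(expected, best)
```

A risk-averse decision maker might still prefer the small plant, which never loses money; capturing that preference is the role of utility theory.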

1.4 Big Data On any given day, 500 million tweets and 294 billion e-mails are sent, 95 million photos and videos are shared on Instagram, 350 million photos are posted on Facebook, and 3.5 billion searches are made with Google.4 It is through technology that we have truly been thrust into the data age. Because data can now be collected electronically, the available amounts of it are staggering. The Internet, cell phones, retail checkout scanners, surveillance video, and sensors on everything from aircraft to cars to bridges allow us to collect and store vast amounts of data in real time. In the midst of all of this data collection, the term big data has been created. There is no universally accepted definition of big data. However, probably the most accepted and most general definition is that big data is any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software. IBM describes the phenomenon of big data through the four Vs: volume, velocity, variety, and veracity, as shown in Figure 1.1.5

4 J. Desjardins, “How Much Data Is Generated Each Day?” Visual Capitalist web site, April 15, 2019.

5 IBM web site: www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg.


Table 1.1 Coverage of Business Analytics Topics in This Text

Chapter  Title                                                 Descriptive  Predictive  Prescriptive
1        Introduction                                          ●            ●           ●
2        Descriptive Statistics                                ●
3        Data Visualization                                    ●
4        Probability: An Introduction to Modeling Uncertainty  ●
5        Descriptive Data Mining                               ●
6        Statistical Inference                                 ●
7        Linear Regression                                                  ●
8        Time Series and Forecasting                                        ●
9        Predictive Data Mining                                             ●
10       Spreadsheet Models                                    ●            ●           ●
11       Monte Carlo Simulation                                             ●           ●
12       Linear Optimization Models                                                     ●
13       Integer Linear Optimization Models                                             ●
14       Nonlinear Optimization Models                                                  ●
15       Decision Analysis                                                              ●

FIGURE 1.1 The Four Vs of Big Data

Volume: Data at Rest. Terabytes to exabytes of existing data to process.
Velocity: Data in Motion. Streaming data, milliseconds to seconds to respond.
Variety: Data in Many Forms. Structured, unstructured, text, multimedia.
Veracity: Data in Doubt. Uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations.

Source: IBM.


Chapter 1 Introduction

Volume Because data are collected electronically, we are able to collect more of it. To be useful, these data must be stored, and this storage has led to vast quantities of data. Many companies now store in excess of 100 terabytes of data (a terabyte is 1,024 gigabytes).

Velocity Real-time capture and analysis of data present unique challenges both in how data are stored, and the speed with which those data can be analyzed for decision making. For example, the New York Stock Exchange collects 1 terabyte of data in a single trading session, and having current data and real-time rules for trades and predictive modeling are important for managing stock portfolios.

Variety In addition to the sheer volume and speed with which companies now collect data, more complicated types of data are now available and are proving to be of great value to businesses. Text data are collected by monitoring what is being said about a company’s products or services on social media platforms such as Twitter. Audio data are collected from service calls (on a service call, you will often hear “this call may be monitored for quality control”). Video data collected by in-store video cameras are used to analyze shopping behavior. Analyzing information generated by these nontraditional sources is more complicated in part because of the processing required to transform the data into a numerical form that can be analyzed.

Veracity Veracity has to do with how much uncertainty is in the data. For example, the data could have many missing values, which makes reliable analysis a challenge. Inconsistencies in units of measure and the lack of reliability of responses in terms of bias also increase the complexity of the data.

Businesses have realized that understanding big data can lead to a competitive advantage. Although big data represents opportunities, it also presents challenges in terms of data storage and processing, security, and available analytical talent. The four Vs indicate that big data creates challenges in terms of how these complex data can be captured, stored, and processed; secured; and then analyzed. Traditional databases more or less assume that data fit into nice rows and columns, but that is not always the case with big data. Also, the sheer volume (the first V) often means that it is not possible to store all of the data on a single computer. This has led to new technologies like Hadoop, an open-source programming environment that supports big data processing through distributed storage and distributed processing on clusters of computers. Essentially, Hadoop provides a divide-and-conquer approach to handling massive amounts of data, dividing the storage and processing over multiple computers. MapReduce is a programming model used within Hadoop that performs the two major steps for which it is named: the map step and the reduce step. The map step divides the data into manageable subsets and distributes them to the computers in the cluster (often termed nodes) for storing and processing. The reduce step collects answers from the nodes and combines them into an answer to the original problem. Technologies like Hadoop and MapReduce, paired with relatively inexpensive computer power, enable cost-effective processing of big data; otherwise, in some cases, processing might not even be possible.
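The map and reduce steps described above can be imitated on a single machine. The sketch below is not Hadoop's API; it is a plain-Python illustration in which the "nodes" are just list slices, and the function names and sample records are hypothetical. It counts words by having each simulated node process its own subset and then combining the partial answers.

```python
from functools import reduce


def split_into_chunks(records, n_nodes):
    """Divide the data into subsets, one per simulated node."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def map_step(chunk):
    """Each node counts the words in its own subset only."""
    counts = {}
    for line in chunk:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def reduce_step(left, right):
    """Combine two partial answers into one."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


# Hypothetical records standing in for a much larger data set.
records = ["big data is big", "data in motion", "big data at rest"]
partials = [map_step(chunk) for chunk in split_into_chunks(records, n_nodes=2)]
totals = reduce(reduce_step, partials, {})
print(totals)
```

In a real cluster the subsets would live on different computers and the map step would run on them in parallel; here the list comprehension merely stands in for that distribution.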
While some sources of big data are publicly available (Twitter, weather data, etc.), much of it is private information. Medical records, bank account information, and credit card transactions, for example, are all highly confidential and must be protected from computer hackers. Data security, the protection of stored data from destructive forces or unauthorized users, is of critical importance to companies. For example, credit card transactions are potentially very useful for understanding consumer behavior, but compromise of these data could lead to unauthorized use of the credit card or identity theft. A 2016 study of 383 companies in 12 countries conducted by the Ponemon Institute and IBM found that the average cost of a data breach is $3.86 million.6 Companies such as Target, Anthem, JPMorgan Chase, Yahoo!, Facebook, Marriott, Equifax, and Home Depot have faced major data breaches costing millions of dollars.

The complexities of the four Vs have increased the demand for analysts, but a shortage of qualified analysts has made hiring more challenging. More companies are searching for data scientists, who know how to effectively process and analyze massive amounts of data because they are well trained in both computer science and statistics. Next we discuss three examples of how companies are collecting big data for competitive advantage.

Kroger Understands Its Customers7 Kroger is the largest retail grocery chain in the United States. It sends over 11 million pieces of direct mail to its customers each quarter. The quarterly mailers each contain 12 coupons that are tailored to each household based on several years of shopping data obtained through its customer loyalty card program. By collecting and analyzing consumer behavior at the individual household level, and better matching its coupon offers to shopper interests, Kroger has been able to realize a far higher redemption rate on its coupons. In the six-week period following distribution of the mailers, over 70% of households redeem at least one coupon, leading to an estimated coupon revenue of $10 billion for Kroger.

MagicBand at Disney8 The Walt Disney Company offers a wristband to visitors to its Orlando, Florida, Disney World theme park. Known as the MagicBand, the wristband contains technology that can transmit more than 40 feet and can be used to track each visitor’s location in the park in real time. The band can link to information that allows Disney to better serve its visitors. For example, prior to the trip to Disney World, a visitor might be asked to fill out a survey on his or her birth date and favorite rides, characters, and restaurant table type and location. This information, linked to the MagicBand, can allow Disney employees using smartphones to greet you by name as you arrive, offer you products they know you prefer, wish you a happy birthday, and have your favorite characters show up as you wait in line or have lunch at your favorite table. The MagicBand can be linked to your credit card, so there is no need to carry cash or a credit card. And during your visit, your movement throughout the park can be tracked and the data can be analyzed to better serve you during your next visit to the park.

General Electric and the Internet of Things9 The Internet of Things (IoT) is the technology that allows data, collected from sensors in all types of machines, to be sent over the Internet to repositories where it can be stored and analyzed. This ability to collect data from products has enabled the companies that produce and sell those products to better serve their customers and offer new services based on analytics. For example, each day General Electric (GE) gathers nearly 50 million pieces of data from 10 million sensors on medical equipment and aircraft engines it has sold to customers throughout the world. In the case of aircraft engines, through a service agreement with its customers, GE collects data each time an airplane powered by its engines takes off and lands. By analyzing these data, GE can better predict when maintenance is needed, which helps customers avoid unplanned maintenance and downtime and helps ensure safe operation. GE can also use the data to better control how the plane is flown, leading to a decrease in fuel cost by flying more efficiently. GE spun off a new company called GE Digital 2.0, which operates as a stand-alone company focused on software that leverages IoT data. In 2018, GE announced that it would spin off a new company from its existing GE Digital business that will focus on industrial IoT applications.

Although big data is clearly one of the drivers for the strong demand for analytics, it is important to understand that, in some sense, big data issues are a subset of analytics. Many very valuable applications of analytics do not involve big data, but rather traditional data sets that are very manageable by traditional database and analytics software. The key to analytics is that it provides useful insights and better decision making using the data that are available, whether those data are “big” or “small.”

6 S. Shepard, “The Average Cost of a Data Breach,” Security Today web site, July 17, 2018.

7 Based on “Kroger Knows Your Shopping Patterns Better than You Do,” Forbes.com, October 23, 2013.

8 Based on “Disney’s $1 Billion Bet on a Magical Wristband,” Wired.com, March 10, 2015.

9 Based on “G.E. Opens Its Big Data Platform,” NYTimes.com, October 9, 2014; “GE Announces New Industrial IoT Software Business,” Forbes web site, December 14, 2018.

1.5 Business Analytics in Practice Business analytics involves tools that range from those as simple as reports and graphs to those as sophisticated as optimization, data mining, and simulation. In practice, companies that apply analytics often follow a trajectory similar to that shown in Figure 1.2. Organizations start with basic analytics in the lower left. As they realize the advantages of these analytic techniques, they often progress to more sophisticated techniques in an effort to reap the derived competitive advantage. Therefore, predictive and prescriptive analytics are sometimes referred to as advanced analytics. Not all companies reach that level of usage, but those that embrace analytics as a competitive strategy often do. Analytics has been applied in virtually all sectors of business and government. Organizations such as Procter & Gamble, IBM, UPS, Netflix, Amazon.com, Google, the Internal Revenue Service, and General Electric have embraced analytics to solve important problems or to achieve a competitive advantage. In this section, we briefly discuss some of the types of applications of analytics by application area.

Financial Analytics Applications of analytics in finance are numerous and pervasive. Predictive models are used to forecast financial performance, to assess the risk of investment portfolios and projects, and to construct financial instruments such as derivatives. Prescriptive models are used to construct optimal portfolios of investments, to allocate assets, and to create optimal capital budgeting plans. For example, Europcar, the leading rental car company in Europe, uses forecasting models, simulation, and optimization to predict demand, assess risk, and optimize the use of its fleet. Its models are implemented via a decision support system used in nine countries in Europe and have led to higher utilization of its fleet, decreased costs, and increased profitability.10 Simulation is also often used to assess risk in the financial sector; one example is the deployment by Hypo Real Estate International of simulation models to successfully manage commercial real estate risk.11

FIGURE 1.2 The Spectrum of Business Analytics

[Figure: competitive advantage (vertical axis) versus degree of complexity (horizontal axis). Techniques, listed from lower left to upper right: Standard Reporting, Data Query, Data Visualization, Descriptive Statistics, Data Mining, Forecasting, Predictive Modeling, Simulation, Rule-Based Models, Decision Analysis, Optimization. The techniques are grouped into Descriptive, Predictive, and Prescriptive categories.]

Source: Adapted from SAS.

10 J. Guillen et al., “Europcar Integrates Forecasting, Simulation, and Optimization Techniques in a Capacity and Revenue Management System,” INFORMS Journal on Applied Analytics, 49, no. 1 (January–February 2019).

11 Y. Jafry, C. Marrison, and U. Umkehrer-Neudeck, “Hypo International Strengthens Risk Management with a Large-Scale, Secure Spreadsheet-Management Framework,” Interfaces 38, no. 4 (July–August 2008).


Human Resource (HR) Analytics A relatively new area of application for analytics is the management of an organization’s human resources. The HR function is charged with ensuring that the organization (1) has the mix of skill sets necessary to meet its needs, (2) is hiring the highest-quality talent and providing an environment that retains it, and (3) achieves its organizational diversity goals. Google refers to its HR Analytics function as “people analytics.” Google has analyzed substantial data on its own employees to determine the characteristics of great leaders, to assess factors that contribute to productivity, and to evaluate potential new hires. Google also uses predictive analytics to continually update its forecast of future employee turnover and retention.12

Marketing Analytics Marketing is one of the fastest-growing areas for the application of analytics. A better understanding of consumer behavior through the use of scanner data and data generated from social media has led to an increased interest in marketing analytics. As a result, descriptive, predictive, and prescriptive analytics are all heavily used in marketing. A better understanding of consumer behavior through analytics leads to the better use of advertising budgets, more effective pricing strategies, improved forecasting of demand, improved product-line management, and increased customer satisfaction and loyalty. For example, Turner Broadcasting System Inc. uses forecasting and optimization models to create more-targeted audiences and to better schedule commercials for its advertising partners. The use of these models has led to an increase in Turner year-over-year advertising revenue of 186% and, at the same time, dramatically increased sales for the advertisers. Those advertisers that chose to benchmark found an increase in sales of $118 million.13 In another example of high-impact marketing analytics, automobile manufacturer Chrysler teamed with J.D. Power and Associates to develop an innovative set of predictive models to support its pricing decisions for automobiles. These models help Chrysler to better understand the ramifications of proposed pricing structures (a combination of manufacturer’s suggested retail price, interest rate offers, and rebates) and, as a result, to improve its pricing decisions. The models have generated an estimated annual savings of $500 million.14

Health Care Analytics The use of analytics in health care is on the increase because of pressure to simultaneously control costs and provide more effective treatment. Descriptive, predictive, and prescriptive analytics are used to improve patient, staff, and facility scheduling; patient flow; purchasing; and inventory control. A study by McKinsey Global Institute (MGI) and McKinsey & Company15 estimates that the health care system in the United States could save more than $300 billion per year by better utilizing analytics; these savings are approximately the equivalent of the entire gross domestic product of countries such as Finland, Singapore, and Ireland. The use of prescriptive analytics for diagnosis and treatment is relatively new, but it may prove to be the most important application of analytics in health care. For example, a group of scientists in Georgia used predictive models and optimization to develop personalized treatment for diabetes. They developed a predictive model that uses fluid dynamics and patient monitoring data to establish the relationship between drug dosage and drug effect at the individual level. This alleviates the need for more invasive procedures to monitor drug concentration. Then they used an optimization model that takes output from the predictive model to determine how an individual achieves better glycemic control using less dosage. Using the models results in about a 39% savings in hospital costs, which equates to about $40,880 per patient.16

12 J. Sullivan, “How Google Is Using People Analytics to Completely Reinvent HR,” Talent Management and HR web site, February 26, 2013.

13 J. A. Carbajal, P. Williams, A. Popescu, and W. Chaar, “Turner Blazes a Trail for Audience Targeting on Television with Operations Research and Advanced Analytics,” INFORMS Journal on Applied Analytics, 49, no. 1 (January–February 2019).

14 J. Silva-Risso et al., “Chrysler and J. D. Power: Pioneering Scientific Price Customization in the Automobile Industry,” Interfaces 38, no. 1 (January–February 2008).

15 J. Manyika et al., “Big Data: The Next Frontier for Innovation, Competition and Productivity,” McKinsey Global Institute Report, 2011.

Supply Chain Analytics The core service of companies such as UPS and FedEx is the efficient delivery of goods, and analytics has long been used to achieve efficiency. The optimal sorting of goods, vehicle and staff scheduling, and vehicle routing are all key to profitability for logistics companies such as UPS and FedEx. Companies can benefit from better inventory and processing control and more efficient supply chains. Analytic tools used in this area span the entire spectrum of analytics. For example, the women’s apparel manufacturer Bernard Claus, Inc. has successfully used descriptive analytics to provide its managers a visual representation of the status of its supply chain.17 ConAgra Foods uses predictive and prescriptive analytics to better plan capacity utilization by incorporating the inherent uncertainty in commodities pricing. ConAgra realized a 100% return on its investment in analytics in under three months—an unheard of result for a major technology investment.18

Analytics for Government and Nonprofits Government agencies and other nonprofits have used analytics to drive out inefficiencies and increase the effectiveness and accountability of programs. Indeed, much of advanced analytics has its roots in the U.S. and English military dating back to World War II. Today, the use of analytics in government is becoming pervasive in everything from elections to tax collection. For example, the New York State Department of Taxation and Finance has worked with IBM to use prescriptive analytics in the development of a more effective approach to tax collection. The result was an increase in collections from delinquent payers of $83 million over two years.19 The U.S. Internal Revenue Service has used data mining to identify patterns that distinguish questionable annual personal income tax filings. In one application, the IRS combines its data on individual taxpayers with data received from banks, on mortgage payments made by those taxpayers. When taxpayers report a mortgage payment that is unrealistically high relative to their reported taxable income, they are flagged as possible underreporters of taxable income. The filing is then further scrutinized and may trigger an audit. Likewise, nonprofit agencies have used analytics to ensure their effectiveness and accountability to their donors and clients. Catholic Relief Services (CRS) is the official international humanitarian agency of the U.S. Catholic community. The CRS mission is to provide relief for the victims of both natural and human-made disasters and to help people in need around the world through its health, educational, and agricultural programs. CRS uses an analytical spreadsheet model to assist in the allocation of its annual budget based on the impact that its various relief efforts and programs will have in different countries.20
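The IRS mortgage screen described above amounts to a simple comparison of two data sources. The sketch below is only an illustration of that idea: the `Filing` record, the field names, the sample filings, and the 0.5 ratio threshold are all hypothetical and are not taken from any actual IRS procedure.

```python
from dataclasses import dataclass


@dataclass
class Filing:
    taxpayer_id: str
    reported_income: float    # annual taxable income reported by the taxpayer
    mortgage_payments: float  # annual mortgage payments reported by banks


def flag_possible_underreporters(filings, max_ratio=0.5):
    """Return IDs whose mortgage payments look too high for their reported income."""
    flagged = []
    for f in filings:
        if f.mortgage_payments <= 0:
            continue  # no mortgage data, so nothing to compare
        # Zero reported income with a mortgage is treated as unrealistic,
        # which also avoids dividing by zero.
        if f.reported_income <= 0 or f.mortgage_payments / f.reported_income > max_ratio:
            flagged.append(f.taxpayer_id)
    return flagged


# Made-up filings for illustration only.
filings = [
    Filing("T1", 80_000, 18_000),
    Filing("T2", 30_000, 36_000),
    Filing("T3", 0, 12_000),
    Filing("T4", 0, 0),
]
print(flag_possible_underreporters(filings))
```

As in the text, a flag is only a prompt for further scrutiny, not a conclusion that income was underreported.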

Sports Analytics The use of analytics in sports has gained considerable notoriety since 2003 when renowned author Michael Lewis published Moneyball. Lewis’ book tells the story of how the Oakland Athletics used an analytical approach to player evaluation in order to assemble a competitive team with a limited budget. The use of analytics for player evaluation and on-field strategy is now common, especially in professional sports. Professional sports teams use analytics to assess players for the amateur drafts and to decide how much to offer players in contract negotiations;21 professional motorcycle racing teams use sophisticated optimization for gearbox design to gain competitive advantage;22 and teams use analytics to assist with on-field decisions such as which pitchers to use in various games of a Major League Baseball playoff series. The use of analytics for off-the-field business decisions is also increasing rapidly. Ensuring customer satisfaction is important for any company, and fans are the customers of sports teams. The Cleveland Indians professional baseball team used a type of predictive modeling known as conjoint analysis to design its premium seating offerings at Progressive Field based on fan survey data. Using prescriptive analytics, franchises across several major sports dynamically adjust ticket prices throughout the season to reflect the relative attractiveness and potential demand for each game.

16 E. Lee et al., “Outcome-Driven Personalized Treatment Design for Managing Diabetes,” Interfaces, 48, no. 5 (September–October 2018).

17 T. H. Davenport, ed., Enterprise Analytics (Upper Saddle River, NJ: Pearson Education Inc., 2013).

18 “ConAgra Mills: Up-to-the-Minute Insights Drive Smarter Selling Decisions and Big Improvements in Capacity Utilization,” IBM Smarter Planet Leadership Series. Available at: www.ibm.com/smarterplanet/us/en/leadership/conagra/, retrieved December 1, 2012.

19 G. Miller et al., “Tax Collection Optimization for New York State,” Interfaces 42, no. 1 (January–February 2013).

20 I. Gamvros, R. Nidel, and S. Raghavan, “Investment Analysis and Budget Allocation at Catholic Relief Services,” Interfaces 36, no. 5 (September–October 2006).

21 N. Streib, S. J. Young, and J. Sokol, “A Major League Baseball Team Uses Operations Research to Improve Draft Preparation,” Interfaces 42, no. 2 (March–April 2012).

Web Analytics Web analytics is the analysis of online activity, which includes, but is not limited to, visits to web sites and social media sites such as Facebook and LinkedIn. Web analytics obviously has huge implications for promoting and selling products and services via the Internet. Leading companies apply descriptive and advanced analytics to data collected in online experiments to determine the best way to configure web sites, position ads, and utilize social networks for the promotion of products and services. Online experimentation involves exposing various subgroups to different versions of a web site and tracking the results. Because of the massive pool of Internet users, experiments can be conducted without risking the disruption of the overall business of the company. Such experiments are proving to be invaluable because they enable the company to use trial-and-error in determining statistically what makes a difference in their web site traffic and sales.
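One common way to decide "statistically what makes a difference" in an online experiment is a two-proportion z-test on conversion rates. The text does not name a specific test, so this choice is an assumption, and the visitor and purchase counts below are made up. The sketch uses only the Python standard library.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of two web site versions.

    Returns the z statistic and the two-sided p-value under the
    null hypothesis that both versions convert at the same rate.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 5,000 visitors see each version of the site.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the difference in conversion rates is unlikely to be due to chance alone, although the business must still judge whether the difference is large enough to matter.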

1.6 Legal and Ethical Issues in the Use of Data and Analytics With the advent of big data and the dramatic increase in the use of analytics and data science to improve decision making, increased attention has been paid to ethical concerns around data privacy and the ethical use of models based on data. As businesses routinely collect data about their customers, they have an obligation to protect the data and to not misuse that data. Clients and customers have an obligation to understand the trade-offs between allowing their data to be collected and used, and the benefits they accrue from allowing a company to collect and use that data. For example, many companies have loyalty cards that collect data on customer purchases. In return for the benefits of using a loyalty card, typically discounted prices, customers must agree to allow the company to collect and use the data on purchases. An agreement must be signed between the customer and the company, and the agreement must specify what data will be collected and how it will be used. For example, the agreement might say that all scanned purchases will be collected with the date, time, location, and card number, but that the company agrees to only use that data internally to the company and to not give or sell that data to outside firms or individuals. The company then has an ethical obligation to uphold that agreement and make every effort to ensure that the data are protected from any type of unauthorized access. Unauthorized access of data is known as a data breach. Data breaches are a major concern for all companies in the digital age. A study by IBM and the Ponemon Institute estimated that the average cost of a data breach is $3.86 million. Data privacy laws are designed to protect individuals’ data from being used against their wishes. One of the strictest data privacy laws is the General Data Protection Regulation (GDPR) which went into effect in the European Union in May 2018. 
The law stipulates that the request for consent to use an individual’s data must be easily understood and accessible, the intended uses of the data must be specified, and it must be easy to withdraw consent. The law also stipulates that an individual has a right to a copy of their data and the right “to be forgotten,” that is, the right to demand that their data be erased. It is the responsibility of analytics professionals, indeed, anyone who handles or stores data, to understand the laws associated with the collection, storage, and use of individuals’ data.

Ethical issues that arise in the use of data and analytics are just as important as the legal issues. Analytics professionals have a responsibility to behave ethically, which includes protecting data and being transparent about the data, how it was collected, and what it does and does not contain. Analysts must be transparent about the methods used to analyze the data and any assumptions that have to be made for the methods used. Finally, analysts must provide valid conclusions and understandable recommendations to their clients. Intentionally using data and analytics for unethical purposes is clearly unethical. For example, using analytics to identify whom to target for fraud is of course inherently unethical because the goal itself is an unethical objective. Intentionally using biased data to achieve a goal is likewise inherently unethical. Misleading a client by misrepresenting results is clearly unethical. For example, consider the case of an airline that runs an advertisement that “84% of business fliers to Chicago prefer that airline over its competitors.” Such a statement is valid if the airline randomly surveyed business fliers across all airlines with a destination of Chicago. But if, for convenience, the airline surveyed only its own customers, the survey would be biased, and the claim would be misleading because fliers on other airlines were not surveyed. Indeed, if anything, the only conclusion one can legitimately draw from the biased sample of its own customers would be that 84% of that airline’s own customers preferred that airline and 16% of its own customers actually preferred another airline!23

In her book, Weapons of Math Destruction, author Cathy O’Neil discusses how algorithms and models can be unintentionally biased.24 For example, consider an analyst who is building a credit risk model for awarding loans. The location of the home of the applicant might be a variable that is correlated with other variables like income and ethnicity. Income is perhaps a relevant variable for determining the amount of a loan, but ethnicity is not. A model using home location could therefore lead to unintentional bias in the credit risk model. It is the analyst’s responsibility to make sure this type of model bias and data bias do not become a part of the model.

Researcher and opinion writer Zeynep Tufekci25 examines so-called “unintended consequences” of analytics, and particularly of machine learning and recommendation engines. Tufekci has pointed out that many Internet sites that use recommendation engines often suggest more extreme content, in terms of political views and conspiracy theories, to users based on their past viewing history. Tufekci and others theorize that this is because the machine learning algorithms being used have identified that more extreme content increases users’ viewing time on the site, which is often the objective function being maximized by the machine learning algorithm. Therefore, while it is not the intention of the algorithm to promote more extreme views and disseminate false information, this may be the unintended consequence of using a machine learning algorithm that maximizes users’ viewing time on the site. Analysts and decision makers must be aware of potential unintended consequences of their models, and they must decide how to react to these consequences once they are discovered.

22 J. Amoros, L. F. Escudero, J. F. Monge, J. V. Segura, and O. Reinoso, “TEAM ASPAR Uses Binary Optimization to Obtain Optimal Gearbox Ratios in Motorcycle Racing,” Interfaces 42, no. 2 (March–April 2012).
Several organizations, including the American Statistical Association (ASA) and the Institute for Operations Research and the Management Sciences (INFORMS), provide ethical guidelines for analysts. In their “Ethical Guidelines for Statistical Practice,”26 the ASA uses the term statistician throughout, but states that this “includes all practitioners of statistics and quantitative sciences—regardless of job title or field of degree— comprising statisticians at all levels of the profession and members of other professions who utilize and report statistical analyses and their applications.” Their guidelines

23 A. Barnett, “Misapplications Reviews: Newswatch,” Interfaces 14, no. 6 (November–December 1984).

24 C. O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Publishing, 2016).

25 Z. Tufekci, “YouTube, the Great Radicalizer,” The New York Times, March 10, 2018.

26 Ethical Guidelines for Statistical Practice, the American Statistical Association, April 14, 2018.


state that “Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” More details are given in eight different sections of the guidelines, and we encourage you to read and familiarize yourself with these guidelines.

INFORMS is a professional society focused on operations research and the management sciences, including analytics. INFORMS offers an analytics certification called CAP (Certified Analytics Professional). All candidates for CAP are required to comply with the code of ethics/conduct provided by INFORMS.27 The INFORMS CAP guidelines state, “In general, analytics professionals are obliged to conduct their professional activities responsibly, with particular attention to the values of consistency, respect for individuals, autonomy of all, integrity, justice, utility and competence.”

INFORMS also offers a set of Ethics Guidelines for its members, which covers ethical behavior for analytics professionals in three domains: Society, Organizations (businesses, government, nonprofit organizations, and universities), and the Profession (operations research and analytics).28 As these guidelines are fairly easy to understand and at the same time fairly comprehensive, we list them here in Table 1.2 and encourage you as a user/provider of analytics to make them your guiding principles.

Table 1.2 INFORMS Ethics Guidelines

Relative to Society
Analytics professionals should aspire to be:
● Accountable for their professional actions and the impact of their work.
● Forthcoming about their assumptions, interests, sponsors, motivations, limitations, and potential conflicts of interest.
● Honest in reporting their results, even when they fail to yield the desired outcome.
● Objective in their assessments of facts, irrespective of their opinions or beliefs.
● Respectful of the viewpoints and the values of others.
● Responsible for undertaking research and projects that provide positive benefits by advancing our scientific understanding, contributing to organizational improvements, and supporting social good.

Relative to Organizations
Analytics professionals should aspire to be:
● Accurate in our assertions, reports, and presentations.
● Alert to possible unintended or negative consequences that our results and recommendations may have on others.
● Informed of advances and developments in the fields relevant to our work.
● Questioning of whether there are more effective and efficient ways to reach a goal.
● Realistic in our claims of achievable results, and in acknowledging when the best course of action may be to terminate a project.
● Rigorous by adhering to proper professional practices in the development and reporting of our work.

Relative to the Profession
Analytics professionals should aspire to be:
● Cooperative by sharing best practices, information, and ideas with colleagues, young professionals, and students.
● Impartial in our praise or criticism of others and their accomplishments, setting aside personal interests.
● Inclusive of all colleagues, and rejecting discrimination and harassment in any form.
● Tolerant of well-conducted research and well-reasoned results, which may differ from our own findings or opinions.
● Truthful in providing attribution when our work draws from the ideas of others.
● Vigilant by speaking out against actions that are damaging to the profession.

27 Certified Analytics Professional Code of Ethics/Conduct. Available at www.certifiedanalytics.org/ethics.php.

28 INFORMS Ethics Guidelines. Available at www.informs.org/About-INFORMS/Governance/INFORMS-Ethics-Guidelines.


Summary

This introductory chapter began with a discussion of decision making. Decision making can be defined as the following process: (1) identify and define the problem, (2) determine the criteria that will be used to evaluate alternative solutions, (3) determine the set of alternative solutions, (4) evaluate the alternatives, and (5) choose an alternative. Decisions may be strategic (high level, concerned with the overall direction of the business), tactical (midlevel, concerned with how to achieve the strategic goals of the business), or operational (day-to-day decisions that must be made to run the company). Uncertainty and an overwhelming number of alternatives are two key factors that make decision making difficult. Business analytics approaches can assist by identifying and mitigating uncertainty and by prescribing the best course of action from a very large number of alternatives. In short, business analytics can help us make better-informed decisions.

There are three categories of analytics: descriptive, predictive, and prescriptive. Descriptive analytics describes what has happened and includes tools such as reports, data visualization, data dashboards, descriptive statistics, and some data-mining techniques. Predictive analytics consists of techniques that use past data to predict future events or ascertain the impact of one variable on another. These techniques include regression, data mining, forecasting, and simulation. Prescriptive analytics uses data to determine a course of action. This class of analytical techniques includes rule-based models, simulation, decision analysis, and optimization. Descriptive and predictive analytics can help us better understand the uncertainty and risk associated with our decision alternatives. Predictive and prescriptive analytics, also often referred to as advanced analytics, can help us make the best decision when facing a myriad of alternatives.
Big data is a set of data that is too large or too complex to be handled by standard data-processing techniques or typical desktop software. The increasing prevalence of big data is leading to an increase in the use of analytics. The Internet, retail scanners, and cell phones are making huge amounts of data available to companies, and these companies want to better understand these data. Business analytics helps them understand these data and use them to make better decisions.

We also discussed various application areas of analytics. Our discussion focused on financial analytics, human resource analytics, marketing analytics, health care analytics, supply chain analytics, analytics for government and nonprofit organizations, sports analytics, and web analytics. However, the use of analytics is rapidly spreading to other sectors, industries, and functional areas of organizations. We concluded this chapter with a discussion of legal and ethical issues in the use of data and analytics, a topic that should be of great importance to all practitioners and consumers of analytics. Each remaining chapter in this text will provide a real-world vignette in which business analytics is applied to a problem faced by a real organization.

Glossary

Artificial Intelligence (AI) The use of data and computers to make decisions that would have in the past required human intelligence.

Advanced analytics Predictive and prescriptive analytics.

Big data Any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software.

Business analytics The scientific process of transforming data into insight for making better decisions.

Data dashboard A collection of tables, charts, and maps to help management monitor selected aspects of the company’s performance.

Data mining The use of analytical techniques for better understanding patterns and relationships that exist in large data sets.
Data query A request for information with certain characteristics from a database.


Data scientists Analysts trained in both computer science and statistics who know how to effectively process and analyze massive amounts of data.

Data security Protecting stored data from destructive forces or unauthorized users.

Decision analysis A technique used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain set of future events.

Descriptive analytics Analytical tools that describe what has happened.

Hadoop An open-source programming environment that supports big data processing through distributed storage and distributed processing on clusters of computers.

Internet of Things (IoT) The technology that allows data collected from sensors in all types of machines to be sent over the Internet to repositories where it can be stored and analyzed.

MapReduce Programming model used within Hadoop that performs the two major steps for which it is named: the map step and the reduce step. The map step divides the data into manageable subsets and distributes it to the computers in the cluster for storing and processing. The reduce step collects answers from the nodes and combines them into an answer to the original problem.

Operational decision A decision concerned with how the organization is run from day to day.

Optimization model A mathematical model that gives the best decision, subject to the situation’s constraints.

Predictive analytics Techniques that use models constructed from past data to predict the future or to ascertain the impact of one variable on another.

Prescriptive analytics Techniques that analyze input data and yield a best course of action.

Rule-based model A prescriptive model that is based on a rule or set of rules.

Simulation The use of probability and statistics to construct a computer model to study the impact of uncertainty on the decision at hand.
Simulation optimization The use of probability and statistics to model uncertainty, combined with optimization techniques, to find good decisions in highly complex and highly uncertain settings.

Strategic decision A decision that involves higher-level issues and that is concerned with the overall direction of the organization, defining the overall goals and aspirations for the organization’s future.

Tactical decision A decision concerned with how the organization should achieve the goals and objectives set by its strategy.

Utility theory The study of the total worth or relative desirability of a particular outcome that reflects the decision maker’s attitude toward a collection of factors such as profit, loss, and risk.

Chapter 2 Descriptive Statistics

CONTENTS

Analytics in Action: U.S. Census Bureau

2.1 OVERVIEW OF USING DATA: DEFINITIONS AND GOALS

2.2 TYPES OF DATA
    Population and Sample Data
    Quantitative and Categorical Data
    Cross-Sectional and Time Series Data
    Sources of Data

2.3 MODIFYING DATA IN EXCEL
    Sorting and Filtering Data in Excel
    Conditional Formatting of Data in Excel

2.4 CREATING DISTRIBUTIONS FROM DATA
    Frequency Distributions for Categorical Data
    Relative Frequency and Percent Frequency Distributions
    Frequency Distributions for Quantitative Data
    Histograms
    Cumulative Distributions

2.5 MEASURES OF LOCATION
    Mean (Arithmetic Mean)
    Median
    Mode
    Geometric Mean

2.6 MEASURES OF VARIABILITY
    Range
    Variance
    Standard Deviation
    Coefficient of Variation

2.7 ANALYZING DISTRIBUTIONS
    Percentiles
    Quartiles
    z-Scores
    Empirical Rule
    Identifying Outliers
    Boxplots

2.8 MEASURES OF ASSOCIATION BETWEEN TWO VARIABLES
    Scatter Charts
    Covariance
    Correlation Coefficient

2.9 DATA CLEANSING
    Missing Data
    Blakely Tires
    Identification of Erroneous Outliers and Other Erroneous Values
    Variable Representation

Summary
Glossary
Problems

Available in MindTap Reader:
Appendix: Descriptive Statistics with R

Analytics in Action

U.S. Census Bureau

The U.S. Census Bureau is part of the Department of Commerce. The U.S. Census Bureau collects data related to the population and economy of the United States using a variety of methods and for many purposes. These data are essential to many government and business decisions.

Probably the best-known data collected by the U.S. Census Bureau is the decennial census, which is an effort to count the total U.S. population. Collecting these data is a huge undertaking involving mailings, door-to-door visits, and other methods. The decennial census collects categorical data such as the sex and race of the respondents, as well as quantitative data such as the number of people living in the household. The data collected in the decennial census are used to determine the number of representatives assigned to each state, the number of Electoral College votes apportioned to each state, and how federal government funding is divided among communities.

The U.S. Census Bureau also administers the Current Population Survey (CPS). The CPS is a cross-sectional monthly survey of a sample of 60,000 households used to estimate employment and unemployment rates in different geographic areas. The CPS has been administered since 1940, so an extensive time series of employment and unemployment data now exists. These data drive government policies such as job assistance programs. The estimated unemployment rates are watched closely as an overall indicator of the health of the U.S. economy.

The data collected by the U.S. Census Bureau are also very useful to businesses. Retailers use data on population changes in different areas to plan new store openings. Mail-order catalog companies use the demographic data when designing targeted marketing campaigns. In many cases, businesses combine the data collected by the U.S. Census Bureau with their own data on customer behavior to plan strategies and to identify potential customers. The data collected by the U.S. Census Bureau are publicly available and can be downloaded from its web site.

In this chapter, we first explain the need to collect and analyze data and identify some common sources of data. Then we discuss the types of data that you may encounter in practice and present several numerical measures for summarizing data. We cover some common ways of manipulating and summarizing data using spreadsheets. We then develop numerical summary measures for data sets consisting of a single variable. When a data set contains more than one variable, the same numerical measures can be computed separately for each variable. In the two-variable case, we also develop measures of the relationship between the variables.

2.1 Overview of Using Data: Definitions and Goals

Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation. Table 2.1 shows a data set containing information for stocks in the Dow Jones Industrial Index (or simply “the Dow”) on June 25, 2019. The Dow is tracked by many financial advisors and investors as an indication of the state of the overall financial markets and the economy in the United States. The share prices for the 30 companies listed in Table 2.1 are the basis for computing the Dow Jones Industrial Average (DJI), which is tracked continuously by virtually every financial publication. The index is named for Charles Dow and Edward Jones, who first began calculating the DJI in 1896. A characteristic or a quantity of interest that can take on different values is known as a variable; for the data in Table 2.1, the variables are Symbol, Industry, Share Price, and Volume. An observation is a set of values corresponding to a set of variables; each row in Table 2.1 corresponds to an observation.

Table 2.1 Data for Dow Jones Industrial Index Companies

Company                   Symbol  Industry                Share Price ($)      Volume
Apple                     AAPL    Technology                       195.57  21,060,685
American Express          AXP     Financial                        123.16   2,387,770
Boeing                    BA      Manufacturing                    369.32   3,002,708
Caterpillar               CAT     Manufacturing                    133.71   3,747,782
Cisco Systems             CSCO    Technology                        56.08  25,533,426
Chevron Corporation       CVX     Chemical, Oil, and Gas           123.64   4,705,879
Disney                    DIS     Entertainment                    139.94  14,670,995
Dow, Inc.                 DOW     Chemical, Oil, and Gas            49.69   4,002,257
Goldman Sachs             GS      Financial                        196.06   1,828,219
The Home Depot            HD      Retail                           204.74   3,583,573
IBM                       IBM     Technology                       138.36   2,797,803
Intel                     INTC    Technology                        46.85  16,658,127
Johnson & Johnson         JNJ     Pharmaceuticals                  144.24   7,516,973
JPMorgan Chase            JPM     Banking                          107.76  18,654,861
Coca-Cola                 KO      Food and Drink                    51.76  11,517,843
McDonald’s                MCD     Food and Drink                   205.71   3,017,625
3M                        MMM     Conglomerate                     172.03   2,730,927
Merck                     MRK     Pharmaceuticals                   85.24   8,909,750
Microsoft                 MSFT    Technology                       133.43  33,328,420
Nike                      NKE     Consumer Goods                    82.62   7,335,836
Pfizer                    PFE     Pharmaceuticals                   43.76  26,952,088
Procter & Gamble          PG      Consumer Goods                   111.72   6,795,912
Travelers                 TRV     Insurance                        153.13   1,295,768
UnitedHealth Group        UNH     Healthcare                       247.66   3,178,942
United Technologies       UTX     Conglomerate                     129.02   2,790,767
Visa                      V       Financial                        171.28   9,897,832
Verizon                   VZ      Telecommunications                58.00  10,554,753
Walgreens Boots Alliance  WBA     Retail                            52.95   8,535,442
Wal-Mart                  WMT     Retail                           110.72   6,104,935
ExxonMobil                XOM     Chemical, Oil, and Gas            76.27   9,722,688

Practically every problem (and opportunity) that an organization (or individual) faces is concerned with the impact of the possible values of relevant variables on the business outcome. Thus, we are concerned with how the value of a variable can vary; variation is the difference in a variable measured over observations (time, customers, items, etc.). The role of descriptive analytics is to collect and analyze data to gain a better understanding of variation and its impact on the business setting. The values of some variables are under direct control of the decision maker (these are often called decision variables). The values of other variables may fluctuate with uncertainty because of factors outside the direct control of the decision maker. In general, a quantity whose values are not known with certainty is called a random variable, or uncertain variable. When we collect data, we are gathering past observed values, or realizations of a variable. By collecting these past realizations of one or more variables, our goal is to learn more about the variation of a particular business situation.

Decision variables used in optimization models are covered in Chapters 12, 13, and 14. Random variables are covered in greater detail in Chapters 4 and 11.
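To make the vocabulary concrete, the following Python sketch treats each row of Table 2.1 as an observation holding one value per variable, and then measures variation in one variable across observations. The class name `Observation`, its field names, and the choice of three rows are illustrative assumptions, not part of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One row of Table 2.1: a single value for each variable."""
    company: str
    symbol: str
    industry: str
    share_price: float  # dollars
    volume: int         # shares traded


# Three of the 30 observations in Table 2.1 (June 25, 2019).
observations = [
    Observation("Apple", "AAPL", "Technology", 195.57, 21_060_685),
    Observation("Boeing", "BA", "Manufacturing", 369.32, 3_002_708),
    Observation("Pfizer", "PFE", "Pharmaceuticals", 43.76, 26_952_088),
]

# Variation: how the variable Share Price differs across observations.
prices = [obs.share_price for obs in observations]
price_spread = max(prices) - min(prices)

print(f"Share price spread across these observations: ${price_spread:.2f}")
```

The spread computed here is only one simple way to express variation; later sections of the chapter develop measures of variability more fully.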


2.2 Types of Data

Population and Sample Data

Data can be categorized in several ways based on how they are collected and the type collected. In many cases, it is not feasible to collect data from the population of all elements of interest. In such instances, we collect data from a subset of the population known as a sample. For example, with the thousands of publicly traded companies in the United States, tracking and analyzing all of these stocks every day would be too time consuming and expensive. The Dow represents a sample of 30 stocks of large public companies based in the United States, and it is often interpreted to represent the larger population of all publicly traded companies. It is very important to collect sample data that are representative of the population data so that generalizations can be made from them. In most cases (although not true of the Dow), a representative sample can be gathered by random sampling from the population data. Dealing with populations and samples can introduce subtle differences in how we calculate and interpret summary statistics. In almost all practical applications of business analytics, we will be dealing with sample data.

To ensure that the companies in the Dow form a representative sample, companies are periodically added to and removed from the Dow. It is possible that the companies in the Dow today have changed from what is shown in Table 2.1.
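Random sampling from a population can be sketched in a few lines of Python. The population below is a hypothetical list of 3,000 placeholder company labels (the text says only that there are thousands of publicly traded companies), and the fixed seed is an assumption made so the sketch gives the same sample each time it runs.

```python
import random

# Hypothetical population: placeholder labels for 3,000 companies.
population = [f"COMPANY_{i:04d}" for i in range(3000)]

# A fixed seed keeps the sketch repeatable; real studies may not fix it.
rng = random.Random(42)

# Simple random sample of 30 companies, drawn without replacement,
# so every company has the same chance of being selected.
sample = rng.sample(population, k=30)

print(len(sample), "of", len(population), "companies sampled")
```

Note that the Dow itself is not chosen this way, as the text points out; the sketch shows only the general idea of random sampling.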

Quantitative and Categorical Data

Data are considered quantitative data if numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed on them. For instance, we can sum the values for Volume in the Dow data in Table 2.1 to calculate a total volume of all shares traded by companies included in the Dow. If arithmetic operations cannot be performed on the data, they are considered categorical data. We can summarize categorical data by counting the number of observations or computing the proportions of observations in each category. For instance, the data in the Industry column in Table 2.1 are categorical. We can count the number of companies in the Dow that are in a given industry. For example, Table 2.1 shows three companies in the financial industry: American Express, Goldman Sachs, and Visa. We cannot perform arithmetic operations on the data in the Industry column.
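The difference between the two kinds of data shows up in what we compute from them. The Python sketch below uses five rows copied from Table 2.1: it sums the quantitative Volume values and counts the categorical Industry values. The variable names are illustrative, and the totals apply only to these five rows, not to all 30 companies.

```python
from collections import Counter

# (company, industry, volume) for five rows of Table 2.1.
rows = [
    ("American Express", "Financial", 2_387_770),
    ("Goldman Sachs", "Financial", 1_828_219),
    ("Visa", "Financial", 9_897_832),
    ("Verizon", "Telecommunications", 10_554_753),
    ("Boeing", "Manufacturing", 3_002_708),
]

# Quantitative data: arithmetic such as summing is meaningful.
total_volume = sum(volume for _, _, volume in rows)

# Categorical data: summarize by counts and proportions per category.
industry_counts = Counter(industry for _, industry, _ in rows)
industry_proportions = {
    industry: count / len(rows) for industry, count in industry_counts.items()
}

print(f"Total volume for these rows: {total_volume:,}")
print(industry_counts)
print(industry_proportions)
```

Summing the Industry column would raise an error or produce nonsense, which is the practical meaning of "arithmetic operations cannot be performed" on categorical data.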

Cross-Sectional and Time Series Data

For statistical analysis, it is important to distinguish between cross-sectional data and time series data. Cross-sectional data are collected from several entities at the same, or approximately the same, point in time. The data in Table 2.1 are cross-sectional because they describe the 30 companies that comprise the Dow at the same point in time (June 2019). Time series data are collected over several time periods. Graphs of time series data are frequently found in business and economic publications. Such graphs help analysts understand what happened in the past, identify trends over time, and project future levels for the time series. For example, the graph of the time series in Figure 2.1 shows the DJI value from January 2006 to May 2019. The figure illustrates that the DJI climbed to above 14,000 in 2007. However, the financial crisis in 2008 led to a significant decline in the DJI to between 6,000 and 7,000 by 2009. Since 2009, the DJI has been generally increasing and topped 26,000 in 2019.

FIGURE 2.1 Dow Jones Industrial Average Values Since 2006 (line chart; vertical axis: DJI Values, 0 to 30,000; horizontal axis: years 2006 to 2019)

Sources of Data

Data necessary to analyze a business problem or opportunity can often be obtained with an appropriate study; such statistical studies can be classified as either experimental or observational. In an experimental study, a variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest. For example, if a pharmaceutical firm conducts an experiment to learn about how a new drug affects blood pressure, then blood pressure is the variable of interest. The dosage level of the new drug is another variable that is hoped to have a causal effect on blood pressure. To obtain data about the effect of the new drug, researchers select a sample of individuals. The dosage level of the new drug is controlled by giving different dosages to the different groups of individuals. Before and after the study, data on blood pressure are collected for each group. Statistical analysis of these experimental data can help determine how the new drug affects blood pressure.

Nonexperimental, or observational, studies make no attempt to control the variables of interest. A survey is perhaps the most common type of observational study. For instance, in a personal interview survey, research questions are first identified. Then a questionnaire is designed and administered to a sample of individuals. Some restaurants use observational studies to obtain data about customer opinions with regard to the quality of food, quality of service, atmosphere, and so on. A customer opinion questionnaire used by Chops City Grill in Naples, Florida, is shown in Figure 2.2. Note that the customers who fill out the questionnaire are asked to provide ratings for 12 variables, including overall experience, the greeting by hostess, the table visit by the manager, overall service, and so on. The response categories of excellent, good, average, fair, and poor provide categorical data that enable Chops City Grill management to maintain high standards for the restaurant’s food and service.

In some cases, the data needed for a particular application exist from an experimental or observational study that has already been conducted. For example, companies maintain a variety of databases about their employees, customers, and business operations. Data on employee salaries, ages, and years of experience can usually be obtained from internal personnel records. Other internal records contain data on sales, advertising expenditures, distribution costs, inventory levels, and production quantities. Most companies also maintain detailed data about their customers.

Anyone who wants to use data and statistical analysis to aid in decision making must be aware of the time and cost required to obtain the data. The use of existing data sources is desirable when data must be obtained in a relatively short period of time. If important data are not readily available from a reliable existing source, the additional time and cost involved in obtaining the data must be taken into account. In all cases, the decision maker should consider the potential contribution of the statistical analysis to the decision-making process. The cost of data acquisition and the subsequent statistical analysis should not exceed the savings generated by using the information to make a better decision.

In Chapter 15 we discuss methods for determining the value of additional information that can be provided by collecting data.


FIGURE 2.2 Customer Opinion Questionnaire Used by Chops City Grill Restaurant

Date: ____________    Server Name: ____________

Our customers are our top priority. Please take a moment to fill out our survey card, so we can better serve your needs. You may return this card to the front desk or return by mail. Thank you!

SERVICE SURVEY

                        Excellent   Good   Average   Fair   Poor
Overall Experience          ❑        ❑        ❑       ❑      ❑
Greeting by Hostess         ❑        ❑        ❑       ❑      ❑
Manager (Table Visit)       ❑        ❑        ❑       ❑      ❑
Overall Service             ❑        ❑        ❑       ❑      ❑
Professionalism             ❑        ❑        ❑       ❑      ❑
Menu Knowledge              ❑        ❑        ❑       ❑      ❑
Friendliness                ❑        ❑        ❑       ❑      ❑
Wine Selection              ❑        ❑        ❑       ❑      ❑
Menu Selection              ❑        ❑        ❑       ❑      ❑
Food Quality                ❑        ❑        ❑       ❑      ❑
Food Presentation           ❑        ❑        ❑       ❑      ❑
Value for $ Spent           ❑        ❑        ❑       ❑      ❑

What comments could you give us to improve our restaurant?

Thank you, we appreciate your comments.
The staff of Chops City Grill.

Notes + Comments

1. Organizations that specialize in collecting and maintaining data make available substantial amounts of business and economic data. Companies can access these external data sources through leasing arrangements or by purchase. Dun & Bradstreet, Bloomberg, and Dow Jones & Company are three firms that provide extensive business database services to clients. Nielsen and Ipsos are two companies that have built successful businesses collecting and processing data that they sell to advertisers and product manufacturers. Data are also available from a variety of industry associations and special-interest organizations.

2. Government agencies are another important source of existing data. For instance, the web site data.gov was launched by the U.S. government in 2009 to make it easier for the public to access data collected by the U.S. federal government. The data.gov web site includes over 150,000 data sets from a variety of U.S. federal departments and agencies, but many other federal agencies maintain their own web sites and data repositories. Many state and local governments are also now providing data sets online. As examples, the states of California and Texas maintain open data portals at data.ca.gov and data.texas.gov, respectively. New York City’s open data web site is opendata.cityofnewyork.us and the city of Cincinnati, Ohio, is at data.cincinnati-oh.gov. In general, the Internet is an important source of data and statistical information. One can obtain access to stock quotes, meal prices at restaurants, salary data, and a wide array of other information simply by performing an Internet search.

2.3 Modifying Data in Excel

Projects often involve so much data that it is difficult to analyze all of the data at once. In this section, we examine methods for summarizing and manipulating data using Excel to make the data more manageable and to develop insights.

Sorting and Filtering Data in Excel

Excel contains many useful features for sorting and filtering data so that one can more easily identify patterns. Table 2.2 contains data on the 20 top-selling passenger-car automobiles in the United States in February 2019. The table shows the model and manufacturer of each automobile as well as the sales for the model in February 2019 and February 2018. Figure 2.3 shows the data from Table 2.2 entered into an Excel spreadsheet, and the percent change in sales for each model from February 2018 to February 2019 has been calculated. This is done by entering the formula =(D2-E2)/E2 in cell F2 and then copying the contents of this cell to cells F3 to F21.

TABLE 2.2 20 Top-Selling Automobiles in United States in February 2019

Rank (by February 2019 Sales)  Manufacturer  Model     Sales (February 2019)  Sales (February 2018)
1                              Toyota        Corolla                  29,016                 25,021
2                              Toyota        Camry                    24,267                 30,865
3                              Honda         Civic                    22,979                 25,816
4                              Honda         Accord                   20,254                 19,753
5                              Nissan        Sentra                   17,072                 17,148
6                              Nissan        Altima                   16,216                 19,703
7                              Ford          Fusion                   13,163                 16,721
8                              Chevrolet     Malibu                   10,799                 11,890
9                              Hyundai       Elantra                  10,304                 15,724
10                             Kia           Soul                      8,592                  6,631
11                             Chevrolet     Cruze                     7,361                 12,875
12                             Nissan        Versa                     7,410                  7,196
13                             Volkswagen    Jetta                     7,109                  4,592
14                             Kia           Optima                    7,212                  6,402
15                             Kia           Forte                     6,953                  7,662
16                             Hyundai       Sonata                    6,481                  6,700
17                             Tesla         Model 3                   5,750                  2,485
18                             Dodge         Charger                   6,547                  7,568
19                             Ford          Mustang                   5,342                  5,800
20                             Ford          Fiesta                    5,035                  3,559

Source: Manufacturers and Automotive News Data Center.
Data file: Top20Cars2019

FIGURE 2.3 Data for 20 Top-Selling Automobiles Entered into Excel with Percent Change in Sales from 2018

Suppose that we want to sort these automobiles by February 2018 sales instead of by February 2019 sales. To do this, we use Excel’s Sort function, as shown in the following steps.

Step 1. Select cells A1:F21
Step 2. Click the Data tab in the Ribbon
Step 3. Click Sort in the Sort & Filter group
Step 4. Select the check box for My data has headers
Step 5. In the first Sort by dropdown menu, select Sales (February 2018)
Step 6. In the Order dropdown menu, select Largest to Smallest (see Figure 2.4)
Step 7. Click OK
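For readers who also work outside Excel, the percent-change calculation and the sort described above can be sketched in Python. This is only an illustration under stated assumptions: it uses the first five rows of Table 2.2, and the tuple layout and the function name `percent_change` are hypothetical choices, not something the text prescribes.

```python
# (manufacturer, model, sales_feb_2019, sales_feb_2018): first five rows of Table 2.2.
cars = [
    ("Toyota", "Corolla", 29_016, 25_021),
    ("Toyota", "Camry", 24_267, 30_865),
    ("Honda", "Civic", 22_979, 25_816),
    ("Honda", "Accord", 20_254, 19_753),
    ("Nissan", "Sentra", 17_072, 17_148),
]


def percent_change(sales_2019, sales_2018):
    """Same arithmetic as the Excel formula =(D2-E2)/E2."""
    return (sales_2019 - sales_2018) / sales_2018


# Add the percent change as a fifth value in each row (column F in Excel).
with_change = [
    (mfr, model, s19, s18, percent_change(s19, s18))
    for mfr, model, s19, s18 in cars
]

# Sort whole rows by February 2018 sales, largest to smallest.
sorted_by_2018 = sorted(with_change, key=lambda row: row[3], reverse=True)

for mfr, model, s19, s18, change in sorted_by_2018:
    print(f"{mfr:8} {model:8} {s19:>7,} {s18:>7,} {change:>8.1%}")
```

Because each row is sorted as a unit, the other values stay attached to their model, which mirrors how Excel adjusts all columns when sorting on one.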

The result of using Excel’s Sort function for the February 2018 data is shown in Figure 2.5. Now we can easily see that, although the Toyota Corolla was the best-selling automobile in February 2019, both the Toyota Camry and the Honda Civic outsold the Toyota Corolla in February 2018. Note that while we sorted on Sales (February 2018), which is in column E, the data in all other columns are adjusted accordingly.

Now let’s suppose that we are interested only in seeing the sales of models made by Nissan. We can do this using Excel’s Filter function:

Step 1. Select cells A1:F21
Step 2. Click the Data tab in the Ribbon
Step 3. Click Filter in the Sort & Filter group
Step 4. Click on the Filter Arrow in column B, next to Manufacturer
Step 5. If all choices are checked, you can easily deselect all choices by unchecking (Select All). Then select only the check box for Nissan.
Step 6. Click OK

The result is a display of only the data for models made by Nissan (see Figure 2.6). We now see that of the 20 top-selling models in February 2019, Nissan made three of them: the Altima, the Sentra, and the Versa. We can further filter the data by choosing the down arrows in the other columns. We can make all data visible again by clicking on the down arrow in column B, checking (Select All), and clicking OK, or by clicking Filter in the Sort & Filter group again from the Data tab.
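The percent change, sort, and filter operations described above can also be sketched outside Excel. The following is a minimal illustration in Python using only the first six rows of Table 2.2; the names cars, percent_change, by_2018, and nissan_only are our own and are not part of the Excel workflow.

```python
# Rows from Table 2.2: (manufacturer, model, Feb 2019 sales, Feb 2018 sales).
# Only the first six rows are reproduced to keep the sketch short.
cars = [
    ("Toyota", "Corolla", 29016, 25021),
    ("Toyota", "Camry", 24267, 30865),
    ("Honda", "Civic", 22979, 25816),
    ("Honda", "Accord", 20254, 19753),
    ("Nissan", "Sentra", 17072, 17148),
    ("Nissan", "Altima", 16216, 19703),
]


def percent_change(sales_2019, sales_2018):
    """Same calculation as the Excel formula =(D2-E2)/E2."""
    return (sales_2019 - sales_2018) / sales_2018


# Analogue of Sort: order rows by February 2018 sales, largest to smallest.
by_2018 = sorted(cars, key=lambda row: row[3], reverse=True)

# Analogue of Filter: keep only the rows whose manufacturer is Nissan.
nissan_only = [row for row in cars if row[0] == "Nissan"]

for manufacturer, model, s2019, s2018 in by_2018:
    change = percent_change(s2019, s2018)
    print(f"{manufacturer:<8}{model:<9}{s2018:>7,}{change:>8.1%}")
```

As in Excel, sorting on one column moves each whole row, so every model keeps its own manufacturer and sales figures.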

2.3 Modifying Data in Excel

FIGURE 2.4 Using Excel’s Sort Function to Sort the Top-Selling Automobiles Data

FIGURE 2.5 Top-Selling Automobiles Data Sorted by Sales in February 2018

FIGURE 2.6 Top-Selling Automobiles Data Filtered to Show Only Automobiles Manufactured by Nissan

Conditional Formatting of Data in Excel

Conditional formatting in Excel can make it easy to identify data that satisfy certain conditions in a data set. For instance, suppose that we wanted to quickly identify the automobile models in Table 2.2 for which sales had decreased from February 2018 to February 2019. We can quickly highlight these models:

Step 1. Starting with the original data shown in Figure 2.3, select cells F1:F21
Step 2. Click the Home tab in the Ribbon
Step 3. Click Conditional Formatting in the Styles group
Step 4. Select Highlight Cells Rules, and click Less Than... from the dropdown menu
Step 5. Enter 0% in the Format cells that are LESS THAN: box
Step 6. Click OK

The results are shown in Figure 2.7. Here we see that the models with decreasing sales (for example, Toyota Camry, Honda Civic, Nissan Sentra, Nissan Altima) are now clearly visible. Note that Excel’s Conditional Formatting function offers tremendous flexibility. Instead of highlighting only models with decreasing sales, we could instead choose Data Bars from the Conditional Formatting dropdown menu in the Styles group of the Home tab in the Ribbon. The result of using the Blue Data Bar Gradient Fill option is shown in Figure 2.8. Data bars are essentially a bar chart input into the cells that shows the magnitude of the cell values. The widths of the bars in this display are proportional to the values of the variable for which the bars have been drawn; a value of 20 creates a bar twice as wide as that for a value of 10. Negative values are shown to the left side of the axis; positive values are shown to the right. Cells with negative values are shaded in red, and those with positive values are shaded in blue. Again, we can easily see which models had decreasing sales, but Data Bars also provide us with a visual representation of the magnitude of the change in sales. Many other Conditional Formatting options are available in Excel.

The Quick Analysis button in Excel appears just outside the bottom-right corner of a group of selected cells whenever you select multiple cells. Clicking the Quick Analysis button gives you shortcuts for Conditional Formatting, adding Data Bars, and other operations. Clicking on this button gives you the options shown in Figure 2.9 for Formatting. Note that there are also tabs for Charts, Totals, Tables, and Sparklines.

Bar charts and other graphical presentations will be covered in detail in Chapter 3. We will see other uses for Conditional Formatting in Excel in Chapter 3.

FIGURE 2.7 Using Conditional Formatting in Excel to Highlight Automobiles with Declining Sales from February 2018

FIGURE 2.8 Using Conditional Formatting in Excel to Generate Data Bars for the Top-Selling Automobiles Data
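The idea behind the Less Than rule and the Data Bars can be imitated in a few lines of Python. This is only a rough sketch for four rows of Table 2.2; the text markers and the scale of one bar character per 2 percentage points are our own assumptions and do not reproduce Excel's formatting.

```python
# (model, Feb 2019 sales, Feb 2018 sales) for a few rows of Table 2.2.
sales = [
    ("Corolla", 29016, 25021),
    ("Camry", 24267, 30865),
    ("Civic", 22979, 25816),
    ("Accord", 20254, 19753),
]

for model, s2019, s2018 in sales:
    change = (s2019 - s2018) / s2018
    # Rule analogous to "Format cells that are LESS THAN 0%".
    flag = "<-- decline" if change < 0 else ""
    # Crude text "data bar": one character per 2 percentage points of change.
    bar = ("-" if change < 0 else "+") * round(abs(change) * 100 / 2)
    print(f"{model:<8}{change:>8.1%}  {bar:<12}{flag}")

# Models that the highlighting rule would mark.
declining = [model for model, s2019, s2018 in sales if s2019 < s2018]
```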

FIGURE 2.9 Excel Quick Analysis Button Formatting Options

(Tabs shown: Formatting, Charts, Totals, Tables, Sparklines. Formatting options shown: Data Bars, Color..., Icon Set, Greater..., Text..., Clear.... Description shown: Conditional Formatting uses rules to highlight interesting data.)

2.4 Creating Distributions from Data

Distributions help summarize many characteristics of a data set by describing how often certain values for a variable appear in that data set. Distributions can be created for both categorical and quantitative data, and they assist the analyst in gauging variation.

Frequency Distributions for Categorical Data

Bins for categorical data are also referred to as classes.

It is often useful to create a frequency distribution for a data set. A frequency distribution is a summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes, typically referred to as bins. Consider the data in Table 2.3, taken from a sample of 50 soft drink purchases. Each purchase is for one of five popular soft drinks, which define the five bins: Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and Sprite.

Table 2.3 Data from a Sample of 50 Soft Drink Purchases

SoftDrinks

Coca-Cola   Sprite      Pepsi       Diet Coke   Coca-Cola
Coca-Cola   Pepsi       Diet Coke   Coca-Cola   Diet Coke
Coca-Cola   Coca-Cola   Coca-Cola   Diet Coke   Pepsi
Coca-Cola   Coca-Cola   Dr. Pepper  Dr. Pepper  Sprite
Coca-Cola   Diet Coke   Pepsi       Diet Coke   Pepsi
Coca-Cola   Pepsi       Pepsi       Coca-Cola   Pepsi
Coca-Cola   Coca-Cola   Pepsi       Dr. Pepper  Pepsi
Pepsi       Sprite      Coca-Cola   Coca-Cola   Coca-Cola
Sprite      Dr. Pepper  Diet Coke   Dr. Pepper  Pepsi
Coca-Cola   Pepsi       Sprite      Coca-Cola   Diet Coke

To develop a frequency distribution for these data, we count the number of times each soft drink appears in Table 2.3. Coca-Cola appears 19 times, Diet Coke appears 8 times, Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5 times. These counts are summarized in the frequency distribution in Table 2.4. This frequency distribution provides a summary of how the 50 soft drink purchases are distributed across the 5 soft drinks. This summary offers more insight than the original data shown in Table 2.3. The frequency distribution shows that Coca-Cola is the leader, Pepsi is second, Diet Coke is third, and Sprite and Dr. Pepper are tied for fourth. The frequency distribution thus summarizes information about the popularity of the five soft drinks.

We can use Excel to calculate the frequency of categorical observations occurring in a data set using the COUNTIF function. Figure 2.10 shows the sample of 50 soft drink purchases in an Excel spreadsheet. Column D contains the five different soft drink categories as the bins. In cell E2, we enter the formula =COUNTIF($A$2:$B$26, D2), where A2:B26 is the range for the sample data, and D2 is the bin (Coca-Cola) that we are trying to match. The COUNTIF function in Excel counts the number of times a certain value appears in the indicated range. In this case we want to count the number of times Coca-Cola appears in the sample data. The result is a value of 19 in cell E2, indicating that Coca-Cola appears 19 times in the sample data. We can copy the formula from cell E2 to cells E3 to E6 to get frequency counts for Diet Coke, Dr. Pepper, Pepsi, and Sprite. By using the absolute reference $A$2:$B$26 in our formula, Excel always searches the same sample data for the values we want when we copy the formula.

See Appendix A for more information on absolute versus relative references in Excel.
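The counting that COUNTIF performs for each bin can be sketched in Python with collections.Counter. In this illustration the 50 purchases of Table 2.3 are entered as one-letter codes purely to save space; the codes and the names names, codes, purchases, and frequency are our own shorthand, not part of the data set.

```python
from collections import Counter

# One-letter shorthand (an assumption for brevity) for the five bins.
names = {"C": "Coca-Cola", "D": "Diet Coke", "R": "Dr. Pepper",
         "P": "Pepsi", "S": "Sprite"}

# The 50 purchases of Table 2.3, in the order listed there.
codes = (
    "C S P D C C P D C D "
    "C C C D P C C R R S "
    "C D P D P C P P C P "
    "C C P R P P S C C C "
    "S R D R P C P S C D"
).split()

purchases = [names[code] for code in codes]

# Counter tallies how many times each value appears, like COUNTIF per bin.
frequency = Counter(purchases)

for bin_name in sorted(names.values()):
    print(f"{bin_name:<11}{frequency[bin_name]:>3}")
```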

Relative Frequency and Percent Frequency Distributions

The percent frequency of a bin is the relative frequency multiplied by 100.

A frequency distribution shows the number (frequency) of items in each of several nonoverlapping bins. However, we are often interested in the proportion, or percentage, of items in each bin. The relative frequency of a bin equals the fraction or proportion of items belonging to that bin. For a data set with n observations, the relative frequency of each bin can be determined as follows:

$$\text{Relative frequency of a bin} = \frac{\text{Frequency of the bin}}{n}$$

A relative frequency distribution is a tabular summary of data showing the relative frequency for each bin. A percent frequency distribution summarizes the percent frequency of the data for each bin. Table 2.5 shows a relative frequency distribution and a percent frequency distribution for the soft drink data. Using the data from Table 2.4, we see that the relative frequency for Coca-Cola is 19/50 = 0.38, the relative frequency for Diet Coke is 8/50 = 0.16, and so on. From the percent frequency distribution, we see that 38% of the purchases were Coca-Cola, 16% were Diet Coke, and so on. We can also note that 38% + 26% + 16% = 80% of the purchases were the top three soft drinks.

A percent frequency distribution can be used to provide estimates of the relative likelihoods of different values for a random variable. So, by constructing a percent frequency distribution from observations of a random variable, we can estimate the probability distribution that characterizes its variability. For example, the volume of soft drinks sold by a concession stand at an upcoming concert may not be known with certainty. However, if the data used to construct Table 2.5 are representative of the concession stand’s customer population, then the concession stand manager can use this information to determine the appropriate volume of each type of soft drink.

Table 2.4 Frequency Distribution of Soft Drink Purchases

Soft Drink   Frequency
Coca-Cola    19
Diet Coke     8
Dr. Pepper    5
Pepsi        13
Sprite        5
Total        50

FIGURE 2.10 Creating a Frequency Distribution for Soft Drinks Data in Excel

(Worksheet layout: the 50 sample observations are in cells A2:B26 under the heading Sample Data; the bins Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and Sprite are in cells D2:D6 under the heading Bins; the resulting frequencies 19, 8, 5, 13, and 5 are in cells E2:E6.)

Table 2.5 Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases

Soft Drink   Relative Frequency   Percent Frequency (%)
Coca-Cola    0.38                  38
Diet Coke    0.16                  16
Dr. Pepper   0.10                  10
Pepsi        0.26                  26
Sprite       0.10                  10
Total        1.00                 100
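As a small worked illustration of these definitions, the following Python sketch starts from the frequencies in Table 2.4 and derives the relative and percent frequencies of Table 2.5. The variable names are our own.

```python
# Frequencies from Table 2.4.
frequency = {"Coca-Cola": 19, "Diet Coke": 8, "Dr. Pepper": 5,
             "Pepsi": 13, "Sprite": 5}

n = sum(frequency.values())  # total number of observations (50)

# Relative frequency of a bin = frequency of the bin / n.
relative = {name: count / n for name, count in frequency.items()}

# Percent frequency = relative frequency multiplied by 100.
percent = {name: 100 * value for name, value in relative.items()}

# Combined share of the three most frequently purchased soft drinks.
top_three = sum(sorted(percent.values(), reverse=True)[:3])

for name in frequency:
    print(f"{name:<11}{relative[name]:>6.2f}{percent[name]:>6.0f}")
print(f"Top three share: {top_three:.0f}%")
```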

Frequency Distributions for Quantitative Data

We can also create frequency distributions for quantitative data, but we must be more careful in defining the nonoverlapping bins to be used in the frequency distribution. For example, consider the quantitative data in Table 2.6. These data show the time in days required to complete year-end audits for a sample of 20 clients of Sanderson and Clifford, a small public accounting firm. The three steps necessary to define the classes for a frequency distribution with quantitative data are as follows:

1. Determine the number of nonoverlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.

Let us demonstrate these steps by developing a frequency distribution for the audit time data shown in Table 2.6.

Table 2.6 Year-End Audit Times (Days)

AuditTime

12  14  19  18
15  15  18  17
20  27  22  23
22  21  33  28
14  18  16  13

Number of Bins

Bins are formed by specifying the ranges used to group the data. As a general guideline, we recommend using from 5 to 20 bins. For a small number of data items, as few as five or six bins may be used to summarize the data. For a larger number of data items, more bins are usually required. The goal is to use enough bins to show the variation in the data, but not so many that some contain only a few data items. Because the number of data items in Table 2.6 is relatively small (n = 20), we chose to develop a frequency distribution with five bins.

Width of the Bins

Second, choose a width for the bins. As a general guideline, we recommend that the width be the same for each bin. Thus, the choices of the number of bins and the width of bins are not independent decisions. A larger number of bins means a smaller bin width and vice versa. To determine an approximate bin width, we begin by identifying the largest and smallest data values. Then, with the desired number of bins specified, we can use the following expression to determine the approximate bin width.

Approximate Bin Width

$$\frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of bins}} \tag{2.1}$$

The approximate bin width given by equation (2.1) can be rounded to a more convenient value based on the preference of the person developing the frequency distribution. For example, an approximate bin width of 9.28 might be rounded to 10 simply because 10 is a more convenient bin width to use in presenting a frequency distribution. For the data involving the year-end audit times, the largest data value is 33, and the smallest data value is 12. Because we decided to summarize the data with five bins, using equation (2.1) provides an approximate bin width of (33 − 12)/5 = 4.2. We therefore decided to round up and use a bin width of five days in the frequency distribution. In practice, the number of bins and the appropriate bin width are determined by trial and error. Once a possible number of bins is chosen, equation (2.1) is used to find the approximate bin width. The process can be repeated for a different number of bins.
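The trial-and-error use of equation (2.1) can be sketched in Python as follows. The candidate bin counts 4, 5, and 6 are chosen only for illustration, and rounding up with math.ceil is one convenient rounding choice among many.

```python
import math

# Year-end audit times (days) from Table 2.6.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

data_range = max(audit_times) - min(audit_times)  # 33 - 12

# Trial and error: approximate width from equation (2.1) for several bin counts.
for k in (4, 5, 6):
    print(f"{k} bins -> approximate width {data_range / k:.2f}")

number_of_bins = 5
approx_width = data_range / number_of_bins  # equation (2.1)
bin_width = math.ceil(approx_width)         # round up to a convenient width
```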


Although an audit time of 12 days is actually the smallest observation in our data, we have chosen a lower bin limit of 10 simply for convenience. The lowest bin limit should include the smallest observation, and the highest bin limit should include the largest observation.

Ultimately, the analyst judges the combination of the number of bins and bin width that provides the best frequency distribution for summarizing the data. For the audit time data in Table 2.6, after deciding to use five bins, each with a width of five days, the next task is to specify the bin limits for each of the bins.

Bin Limits

Bin limits must be chosen so that each data item belongs to one and only one bin. The lower bin limit identifies the smallest possible data value assigned to the bin. The upper bin limit identifies the largest possible data value assigned to the bin. In developing frequency distributions for qualitative data, we did not need to specify bin limits because each data item naturally fell into a separate bin. But with quantitative data, such as the audit times in Table 2.6, bin limits are necessary to determine where each data value belongs.

Using the audit time data in Table 2.6, we selected 10 days as the lower bin limit and 14 days as the upper bin limit for the first bin. This bin is denoted 10–14 in Table 2.7. The smallest data value, 12, is included in the 10–14 bin. We then selected 15 days as the lower bin limit and 19 days as the upper bin limit of the next bin. We continued defining the lower and upper bin limits to obtain a total of five bins: 10–14, 15–19, 20–24, 25–29, and 30–34. The largest data value, 33, is included in the 30–34 bin. The difference between the upper bin limits of adjacent bins is the bin width. Using the first two upper bin limits of 14 and 19, we see that the bin width is 19 − 14 = 5.

With the number of bins, bin width, and bin limits determined, a frequency distribution can be obtained by counting the number of data values belonging to each bin. For example, the data in Table 2.6 show that four values (12, 14, 14, and 13) belong to the 10–14 bin. Thus, the frequency for the 10–14 bin is 4. Continuing this counting process for the 15–19, 20–24, 25–29, and 30–34 bins provides the frequency distribution shown in Table 2.7. Using this frequency distribution, we can observe the following:

● The most frequently occurring audit times are in the bin of 15–19 days. Eight of the 20 audit times are in this bin.
● Only one audit required 30 or more days.

We define the relative frequency and percent frequency distributions for quantitative data in the same manner as for qualitative data.

Other conclusions are possible, depending on the interests of the person viewing the frequency distribution. The value of a frequency distribution is that it provides insights about the data that are not easily obtained by viewing the data in their original unorganized form. Table 2.7 also shows the relative frequency distribution and percent frequency distribution for the audit time data. Note that 0.40 of the audits, or 40%, required from 15 to 19 days. Only 0.05 of the audits, or 5%, required 30 or more days. Again, additional interpretations and insights can be obtained by using Table 2.7.

Frequency distributions for quantitative data can also be created using Excel. Figure 2.11 shows the data from Table 2.6 entered into an Excel Worksheet. The sample of 20 audit times is contained in cells A2:A21. The upper limits of the defined bins are in cells C2:C6.

Table 2.7 Frequency, Relative Frequency, and Percent Frequency Distributions for the Audit Time Data

AuditTime

Audit Times (days)   Frequency   Relative Frequency   Percent Frequency
10–14                4           0.20                 20
15–19                8           0.40                 40
20–24                5           0.25                 25
25–29                2           0.10                 10
30–34                1           0.05                  5

FIGURE 2.11 Using Excel to Generate a Frequency Distribution for Audit Times Data

We can use the FREQUENCY function in Excel to count the number of observations in each bin.

Pressing CTRL+SHIFT+ENTER in Excel indicates that the function should return an array of values.

Step 1. Select cells D2:D6
Step 2. Type the formula =FREQUENCY(A2:A21, C2:C6). The range A2:A21 defines the data set, and the range C2:C6 defines the bins.
Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2

Because cells D2:D6 were selected in Step 1 (see Figure 2.11), Excel then fills in the number of observations in each bin in cells D2 through D6.
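The counting rule can be sketched in Python under the assumption that each value is assigned to the first bin whose upper limit is greater than or equal to it, which is how the bins 10–14 through 30–34 are defined above. The extra final slot for values above the last upper limit is our own addition to make the sketch safe for other data; the names upper_limits and counts are illustrative.

```python
from bisect import bisect_left

# Year-end audit times (days) from Table 2.6.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Upper limits of the five bins, as entered in cells C2:C6.
upper_limits = [14, 19, 24, 29, 34]

# One counter per bin plus a final slot for values above the last limit.
counts = [0] * (len(upper_limits) + 1)

for t in audit_times:
    # Index of the first upper limit that is >= t.
    counts[bisect_left(upper_limits, t)] += 1

lower = 10
for limit, count in zip(upper_limits, counts):
    print(f"{lower}-{limit}: {count}")
    lower = limit + 1
```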

Histograms

A common graphical presentation of quantitative data is a histogram. This graphical summary can be prepared for data previously summarized in either a frequency, a relative frequency, or a percent frequency distribution. A histogram is constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis. The frequency measure of each bin is shown by drawing a rectangle whose base is the bin limits on the horizontal axis and whose height is the corresponding frequency measure.

Figure 2.12 is a histogram for the audit time data. Note that the bin with the greatest frequency is shown by the rectangle appearing above the bin of 15–19 days. The height of the rectangle shows that the frequency of this bin is 8. A histogram for the relative or percent frequency distribution of these data would look the same as the histogram in Figure 2.12, with the exception that the vertical axis would be labeled with relative or percent frequency values.

FIGURE 2.12 Histogram for the Audit Time Data

(Vertical axis: Frequency, from 1 to 8. Horizontal axis: Audit Time (days), with bins 10–14, 15–19, 20–24, 25–29, and 30–34.)

Histograms can be created in Excel using the Data Analysis ToolPak. We will use the sample of 20 year-end audit times and the bins defined in Table 2.7 to create a histogram using the Data Analysis ToolPak. As before, we begin with an Excel Worksheet in which the sample of 20 audit times is contained in cells A2:A21, and the upper limits of the bins defined in Table 2.7 are in cells C2:C6 (see Figure 2.11).

Step 1. Click the Data tab in the Ribbon
Step 2. Click Data Analysis in the Analyze group
Step 3. When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
        In the Input Range: box, enter A2:A21
        In the Bin Range: box, enter C2:C6
        Under Output Options:, select New Worksheet Ply:
        Select the check box for Chart Output (see Figure 2.13)
        Click OK

The text “10-14” in cell A2 can be entered in Excel as ‘10-14. The single quote indicates to Excel that this should be treated as text rather than a numerical or date value.
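To see the shape of the audit time distribution without a charting tool, a rough text histogram can be printed from the frequencies in Table 2.7. This sketch draws the bars horizontally, unlike the vertical rectangles of Figure 2.12, and the '#' character is an arbitrary choice.

```python
# Frequencies for the audit time bins from Table 2.7.
frequency = {"10-14": 4, "15-19": 8, "20-24": 5, "25-29": 2, "30-34": 1}

# One row per bin; the bar length equals the bin frequency.
lines = [f"{label:>5} | {'#' * count} {count}"
         for label, count in frequency.items()]

print("\n".join(lines))
```

Replacing each frequency with its relative or percent frequency would rescale the bars but leave their relative lengths unchanged, which matches the remark about the vertical axis above.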

The histogram created by Excel for these data is shown in Figure 2.14. We have modified the bin ranges in column A by typing the values shown in Figure 2.14 into cells A2:A6 so that the chart created by Excel shows both the lower and upper limits for each bin. We have also removed the gaps between the columns in the histogram in Excel to match the traditional format of histograms. To remove the gaps between the columns in the histogram created by Excel, follow these steps:

Step 1. Right-click on one of the columns in the histogram
        Select Format Data Series…
Step 2. When the Format Data Series pane opens, click the Series Options button
        Set the Gap Width to 0%

FIGURE 2.13 Creating a Histogram for the Audit Time Data Using Data Analysis ToolPak in Excel

FIGURE 2.14 Completed Histogram for the Audit Time Data Using Data Analysis ToolPak in Excel

(Worksheet output: column A lists the bins 10–14, 15–19, 20–24, 25–29, 30–34, and More; column B lists the frequencies 4, 8, 5, 2, 1, and 0. The chart, titled Histogram, plots Frequency on the vertical axis against Bin on the horizontal axis.)

One of the most important uses of a histogram is to provide information about the shape, or form, of a distribution. Skewness, or the lack of symmetry, is an important characteristic of the shape of a distribution. Figure 2.15 contains four histograms constructed from relative frequency distributions that exhibit different patterns of skewness.

FIGURE 2.15 Histograms Showing Distributions with Different Levels of Skewness

(Panel A: Moderately Skewed Left. Panel B: Moderately Skewed Right. Panel C: Symmetric. Panel D: Highly Skewed Right. The vertical axis of each panel shows relative frequency.)

Panel A shows the histogram for a set of data moderately skewed to the left. A histogram is said to be skewed to the left if its tail extends farther to the left than to the right. This histogram is typical for exam scores, with no scores above 100%, most of the scores above 70%, and only a few really low scores.

Panel B shows the histogram for a set of data moderately skewed to the right. A histogram is said to be skewed to the right if its tail extends farther to the right than to the left. An example of this type of histogram would be for data such as housing prices; a few expensive houses create the skewness in the right tail.

Panel C shows a symmetric histogram, in which the left tail mirrors the shape of the right tail. Histograms for data found in applications are never perfectly symmetric, but the histogram for many applications may be roughly symmetric. Data for SAT scores, the heights and weights of people, and so on lead to histograms that are roughly symmetric.

Panel D shows a histogram highly skewed to the right. This histogram was constructed from data on the amount of customer purchases in one day at a women’s apparel store. Data from applications in business and economics often lead to histograms that are skewed to the right. For instance, data on housing prices, salaries, purchase amounts, and so on often result in histograms skewed to the right.

Cumulative Distributions

A variation of the frequency distribution that provides another tabular summary of quantitative data is the cumulative frequency distribution, which uses the number of classes, class widths, and class limits developed for the frequency distribution. However, rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class. The first two columns of Table 2.8 provide the cumulative frequency distribution for the audit time data.

To understand how the cumulative frequencies are determined, consider the class with the description “Less than or equal to 24.” The cumulative frequency for this class is simply the sum of the frequencies for all classes with data values less than or equal to 24. For the frequency distribution in Table 2.7, the sum of the frequencies for classes 10–14, 15–19, and 20–24 indicates that 4 + 8 + 5 = 17 data values are less than or equal to 24. Hence, the cumulative frequency for this class is 17. In addition, the cumulative frequency distribution in Table 2.8 shows that four audits were completed in 14 days or less and that 19 audits were completed in 29 days or less.

As a final point, a cumulative relative frequency distribution shows the proportion of data items, and a cumulative percent frequency distribution shows the percentage of data items with values less than or equal to the upper limit of each class. The cumulative relative frequency distribution can be computed either by summing the relative frequencies in the relative frequency distribution or by dividing the cumulative frequencies by the total number of items. Using the latter approach, we found the cumulative relative frequencies in column 3 of Table 2.8 by dividing the cumulative frequencies in column 2 by the total number of items (n = 20). The cumulative percent frequencies were again computed by multiplying the cumulative relative frequencies by 100. The cumulative relative and percent frequency distributions show that 0.85 of the audits, or 85%, were completed in 24 days or less, 0.95 of the audits, or 95%, were completed in 29 days or less, and so on.
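The running sums behind a cumulative distribution can be sketched in Python with itertools.accumulate. The sketch starts from the frequencies in Table 2.7 and uses the second approach described above, dividing the cumulative frequencies by n; the variable names are our own.

```python
from itertools import accumulate

# Upper class limits and frequencies from Table 2.7.
upper_limits = [14, 19, 24, 29, 34]
frequency = [4, 8, 5, 2, 1]

n = sum(frequency)  # total number of audits (20)

# Running total of the frequencies up to and including each class.
cumulative = list(accumulate(frequency))

# Divide by n for proportions, then multiply by 100 for percentages.
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * r for r in cumulative_relative]

for limit, c, r, p in zip(upper_limits, cumulative,
                          cumulative_relative, cumulative_percent):
    print(f"Less than or equal to {limit}: {c:>2}  {r:.2f}  {p:.0f}")
```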

Table 2.8 Cumulative Frequency, Cumulative Relative Frequency, and Cumulative Percent Frequency Distributions for the Audit Time Data

Audit Time (days)          Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
Less than or equal to 14    4                     0.20                             20
Less than or equal to 19   12                     0.60                             60
Less than or equal to 24   17                     0.85                             85
Less than or equal to 29   19                     0.95                             95
Less than or equal to 34   20                     1.00                            100

NOTES + COMMENTS

1. If Data Analysis does not appear in your Analyze group, then you need to include the Data Analysis ToolPak Add-In. To do so, click on the File tab in the Ribbon and choose Options. When the Excel Options dialog box opens, click Add-Ins. At the bottom of the Excel Options dialog box, where it says Manage: Excel Add-ins, click Go.... Select the check box for Analysis ToolPak, and click OK.
2. Distributions are often used when discussing concepts related to probability and simulation because they are used to describe uncertainty. In Chapter 4 we will discuss probability distributions, and then in Chapter 11 we will revisit distributions when we introduce simulation models.
3. In more recent versions of Excel, histograms can also be created using the new Histogram chart, which can be found by clicking on the Insert tab in the Ribbon, clicking Insert Statistic Chart in the Charts group, and selecting Histogram. Excel automatically chooses the number of bins and bin sizes. These values can be changed using Format Axis, but the functionality is more limited than the steps we provide in this section to create your own histogram.


2.5 Measures of Location

Mean (Arithmetic Mean)

The most commonly used measure of location is the mean (arithmetic mean), or average value, for a variable. The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by $\bar{x}$. The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted by the Greek letter $\mu$.

In statistical formulas, it is customary to denote the value of variable x for the first observation by $x_1$, the value of variable x for the second observation by $x_2$, and so on. In general, the value of variable x for the ith observation is denoted by $x_i$. For a sample with n observations, the formula for the sample mean is as follows.

SAMPLE MEAN

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \tag{2.2}$$

If the data set is not a sample, but is the entire population with N observations, the population mean is computed directly by: $\mu = \frac{\sum x_i}{N}$.

To illustrate the computation of a sample mean, suppose a sample of home sales is taken for a suburb of Cincinnati, Ohio. Table 2.9 shows the collected data. The mean home selling price for the sample of 12 home sales is

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12} = \frac{138{,}000 + 254{,}000 + \cdots + 456{,}250}{12} = \frac{2{,}639{,}250}{12} = 219{,}937.50$$

The mean can be found in Excel using the AVERAGE function. Figure 2.16 shows the Home Sales data from Table 2.9 in an Excel spreadsheet. The value for the mean in cell E2 is calculated using the formula =AVERAGE(B2:B13).
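The same calculation can be checked in Python, both by applying equation (2.2) directly and with the standard library function statistics.mean. The list name selling_prices is our own; the values are those of Table 2.9.

```python
from statistics import mean

# Selling prices ($) for the 12 home sales in Table 2.9.
selling_prices = [138000, 254000, 186000, 257500, 108000, 254000,
                  138000, 298000, 199500, 208000, 142000, 456250]

# Equation (2.2): sum of the observations divided by the sample size n.
total = sum(selling_prices)
sample_mean = total / len(selling_prices)

print(f"Sum of prices: {total:,}")
print(f"Sample mean:   {sample_mean:,.2f}")
print(f"statistics.mean agrees: {mean(selling_prices) == sample_mean}")
```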

Table 2.9 Data on Home Sales in a Cincinnati, Ohio, Suburb

HomeSales

Home Sale   Selling Price ($)
 1          138,000
 2          254,000
 3          186,000
 4          257,500
 5          108,000
 6          254,000
 7          138,000
 8          298,000
 9          199,500
10          208,000
11          142,000
12          456,250

FIGURE 2.16 Calculating the Mean, Median, and Modes for the Home Sales Data Using Excel

(Worksheet layout: Home Sale numbers 1 to 12 are in cells A2:A13 and Selling Price ($) values are in cells B2:B13, as in Table 2.9. Cell E2, Mean: =AVERAGE(B2:B13), result $219,937.50. Cell E3, Median: =MEDIAN(B2:B13), result $203,750.00. Cells E4 and E5, Mode 1 and Mode 2: =MODE.MULT(B2:B13), results $138,000.00 and $254,000.00.)

Median

The median, another measure of central location, is the value in the middle when the data are arranged in ascending order (smallest to largest value). With an odd number of observations, the median is the middle value. An even number of observations has no single middle value. In this case, we follow convention and define the median as the average of the values for the middle two observations.

Let us apply this definition to compute the median class size for a sample of five college classes. Arranging the data in ascending order provides the following list:

32 42 46 46 54

Because n = 5 is odd, the median is the middle value. Thus, the median class size is 46 students. Even though this data set contains two observations with values of 46, each observation is treated separately when we arrange the data in ascending order.

Suppose we also compute the median value for the 12 home sales in Table 2.9. We first arrange the data in ascending order. The middle two values are 199,500 and 208,000.

108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250


Because n = 12 is even, the median is the average of the middle two values: 199,500 and 208,000.

$$\text{Median} = \frac{199{,}500 + 208{,}000}{2} = 203{,}750$$

The median of a data set can be found in Excel using the function MEDIAN. In Figure 2.16, the value for the median in cell E3 is found using the formula =MEDIAN(B2:B13).

Although the mean is the more commonly used measure of central location, in some situations the median is preferred. The mean is influenced by extremely small and large data values. Notice that the median is smaller than the mean in Figure 2.16. This is because the one large value of $456,250 in our data set inflates the mean but does not have the same effect on the median. Notice also that the median would remain unchanged if we replaced the $456,250 with a sales price of $1.5 million. In this case, the median selling price would remain $203,750, but the mean would increase to $306,916.67. If you were looking to buy a home in this suburb, the median gives a better indication of the central selling price of the homes there. We can generalize, saying that whenever a data set contains extreme values or is severely skewed, the median is often the preferred measure of central location.
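The sensitivity of the mean, and the stability of the median, can be verified with a short Python sketch using statistics.mean and statistics.median. The list altered, in which the $456,250 sale is replaced by $1.5 million, mirrors the thought experiment in the text; the names are our own.

```python
from statistics import mean, median

# Class sizes and home selling prices ($) used in the text.
class_sizes = [32, 42, 46, 46, 54]
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Replace the largest sale with $1.5 million.
altered = [1_500_000 if p == 456_250 else p for p in prices]

print("Median class size:", median(class_sizes))
print(f"Original: mean {mean(prices):,.2f}, median {median(prices):,.2f}")
print(f"Altered:  mean {mean(altered):,.2f}, median {median(altered):,.2f}")
```

The median function sorts the data internally, so the prices do not need to be arranged in ascending order first.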

Mode

A third measure of location, the mode, is the value that occurs most frequently in a data set. To illustrate the identification of the mode, consider the sample of five class sizes.

32 42 46 46 54

The only value that occurs more than once is 46. Because this value, occurring with a frequency of 2, has the greatest frequency, it is the mode. To find the mode for a data set with only one most often occurring value in Excel, we use the MODE.SNGL function. Occasionally the greatest frequency occurs at two or more different values, in which case more than one mode exists. If data contain at least two modes, we say that they are multimodal. A special case of multimodal data occurs when the data contain exactly two modes; in such cases we say that the data are bimodal. In multimodal cases when there are more than two modes, the mode is almost never reported because listing three or more modes is not particularly helpful in describing a location for the data. Also, if no value in the data occurs more than once, we say the data have no mode. The Excel MODE.SNGL function will return only a single most-often-occurring value. For multimodal distributions, we must use the MODE.MULT command in Excel to return more than one mode. For example, two selling prices occur twice in Table 2.9: $138,000 and $254,000. Hence, these data are bimodal. To find both of the modes in Excel, we take these steps:

Step 1. Select cells E4 and E5
Step 2. Type the formula =MODE.MULT(B2:B13)
Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2.

We must press CTRL+SHIFT+ENTER because the MODE.MULT function returns an array of values. Excel enters the values for both modes of this data set in cells E4 and E5: $138,000 and $254,000.
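For readers working outside Excel, a rough Python counterpart to MODE.MULT is the standard library function statistics.multimode. The sketch below is illustrative only; the variable names are our own.

```python
from statistics import multimode

class_sizes = [46, 54, 42, 46, 32]
home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

# multimode returns every value tied for the highest frequency
class_modes = multimode(class_sizes)
price_modes = multimode(home_prices)

print("class size mode(s):", class_modes)
print("home price mode(s):", price_modes)
```

One caveat: when no value repeats, multimode returns every value in the data, whereas the convention in this section is to say that such data have no mode.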

Geometric Mean

The geometric mean is a measure of location that is calculated by finding the nth root of the product of n values. The general formula for the sample geometric mean, denoted \(\bar{x}_g\), follows.

SAMPLE GEOMETRIC MEAN

\[ \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n} \qquad (2.3) \]

The geometric mean for a population is computed similarly but is denoted \(\mu_g\) to indicate that it is computed using the entire population.


The geometric mean is often used in analyzing growth rates in financial data. In these types of situations, the arithmetic mean or average value will provide misleading results. To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past 10 years. Suppose we want to compute how much $100 invested in the fund at the beginning of year 1 would be worth at the end of year 10. We start by computing the balance in the fund at the end of year 1. Because the percentage annual return for year 1 was −22.1%, the balance in the fund at the end of year 1 would be

$100 − 0.221($100) = $100(1 − 0.221) = $100(0.779) = $77.90

We refer to 0.779 as the growth factor for year 1 in Table 2.10. The growth factor for each year is 1 plus 0.01 times the percentage return. A growth factor less than 1 indicates negative growth, whereas a growth factor greater than 1 indicates positive growth. The growth factor cannot be less than zero.

We can compute the balance at the end of year 1 by multiplying the value invested in the fund at the beginning of year 1 by the growth factor for year 1: $100(0.779) = $77.90. The balance in the fund at the end of year 1, $77.90, now becomes the beginning balance in year 2. So, with a percentage annual return for year 2 of 28.7%, the balance at the end of year 2 would be

$77.90 + 0.287($77.90) = $77.90(1 + 0.287) = $77.90(1.287) = $100.26

Note that 1.287 is the growth factor for year 2. By substituting $100(0.779) for $77.90, we see that the balance in the fund at the end of year 2 is

$100(0.779)(1.287) = $100.26

In other words, the balance at the end of year 2 is just the initial investment at the beginning of year 1 times the product of the first two growth factors. This result can be generalized to show that the balance at the end of year 10 is the initial investment times the product of all 10 growth factors.

$100[(0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021)] = $100(1.335) = $133.45

So a $100 investment in the fund at the beginning of year 1 would be worth $133.45 at the end of year 10. Note that the product of the 10 growth factors is 1.335. Thus, we can compute the balance at the end of year 10 for any amount of money invested at the beginning of year 1 by multiplying the value of the initial investment by 1.335. For instance, an initial investment of $2,500 at the beginning of year 1 would be worth $2,500(1.335), or approximately $3,337.50, at the end of year 10.

TABLE 2.10 Percentage Annual Returns and Growth Factors for the Mutual Fund Data (data file: MutualFundReturns)

Year    Return (%)    Growth Factor
1       −22.1         0.779
2        28.7         1.287
3        10.9         1.109
4         4.9         1.049
5        15.8         1.158
6         5.5         1.055
7       −37.0         0.630
8        26.5         1.265
9        15.1         1.151
10        2.1         1.021


What was the mean percentage annual return or mean rate of growth for this investment over the 10-year period? The geometric mean of the 10 growth factors can be used to answer this question. Because the product of the 10 growth factors is 1.335, the geometric mean is the 10th root of 1.335, or

\[ \bar{x}_g = \sqrt[10]{1.335} = 1.029 \]

The geometric mean tells us that annual returns grew at an average annual rate of (1.029 − 1)100, or 2.9%. In other words, with an average annual growth rate of 2.9%, a $100 investment in the fund at the beginning of year 1 would grow to $100(1.029)^10 = $133.09 at the end of 10 years. We can use Excel to calculate the geometric mean for the data in Table 2.10 by using the function GEOMEAN. In Figure 2.17, the value for the geometric mean in cell C13 is found using the formula =GEOMEAN(C2:C11).

It is important to understand that the arithmetic mean of the percentage annual returns does not provide the mean annual growth rate for this investment. The sum of the 10 percentage annual returns in Table 2.10 is 50.4. Thus, the arithmetic mean of the 10 percentage returns is 50.4/10 = 5.04%. A salesperson might try to convince you to invest in this fund by stating that the mean annual percentage return was 5.04%. Such a statement is not only misleading, it is inaccurate. A mean annual percentage return of 5.04% corresponds to an average growth factor of 1.0504. So, if the average growth factor were really 1.0504, $100 invested in the fund at the beginning of year 1 would have grown to $100(1.0504)^10 = $163.51 at the end of 10 years. But, using the 10 annual percentage returns in Table 2.10, we showed that an initial $100 investment is worth $133.45 at the end of 10 years. The salesperson’s claim that the mean annual percentage return is 5.04% grossly overstates the true growth for this mutual fund. The problem is that the arithmetic mean is appropriate only for an additive process. For a multiplicative process, such as applications involving growth rates, the geometric mean is the appropriate measure of location.

While the application of the geometric mean to problems in finance, investments, and banking is particularly common, the geometric mean should be applied any time you want to determine the mean rate of change over several successive periods. Other common applications include changes in the populations of species, crop yields, pollution levels, and birth and death rates. The geometric mean can also be applied to changes that occur over any number of successive periods of any length. In addition to annual changes, the geometric mean is often applied to find the mean rate of change over quarters, months, weeks, and even days.

FIGURE 2.17 Calculating the Geometric Mean for the Mutual Fund Data Using Excel (worksheet with Year in column A, Return (%) in column B, and Growth Factor in column C, rows 2 to 11, as listed in Table 2.10; the geometric mean, 1.029, appears in cell C13)
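The contrast between the geometric and arithmetic means for the mutual fund data can be reproduced with a short Python sketch. It assumes Python 3.8 or later (for math.prod and statistics.geometric_mean); the variable names are our own, and the returns are those in Table 2.10.

```python
from math import prod
from statistics import geometric_mean, mean

# Percentage annual returns for years 1-10 (Table 2.10)
returns_pct = [-22.1, 28.7, 10.9, 4.9, 15.8, 5.5, -37.0, 26.5, 15.1, 2.1]

# Growth factor = 1 plus 0.01 times the percentage return
growth_factors = [1 + 0.01 * r for r in returns_pct]

total_growth = prod(growth_factors)        # product of the 10 growth factors
g_mean = geometric_mean(growth_factors)    # 10th root of that product
a_mean = 1 + 0.01 * mean(returns_pct)      # factor implied by the arithmetic mean

print(f"value of $100 after 10 years:       {100 * total_growth:.2f}")
print(f"compounding the geometric mean:     {100 * g_mean ** 10:.2f}")
print(f"compounding the arithmetic mean:    {100 * a_mean ** 10:.2f}")
```

Compounding the unrounded geometric mean recovers the actual ending balance, while compounding the arithmetic mean overstates it. The small gap between $133.45 and $133.09 in the text arises only because 1.029 is a rounded value.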

2.6 Measures of Variability

In addition to measures of location, it is often desirable to consider measures of variability, or dispersion. For example, suppose that you are considering two financial funds. Both funds require a $1,000 annual investment. Table 2.11 shows the annual payouts for Fund A and Fund B for $1,000 investments over the past 20 years. Fund A has paid out exactly $1,100 each year for an initial $1,000 investment. Fund B has had many different payouts, but the mean payout over the previous 20 years is also $1,100. But would you consider the payouts of Fund A and Fund B to be equivalent? Clearly, the answer is no. The difference between the two funds is due to variability. Figure 2.18 shows a histogram for the payouts received from Funds A and B. Although the mean payout is the same for the two funds, their histograms differ in that the payouts associated with Fund B have greater variability. Sometimes the payouts are considerably larger than the mean, and sometimes they are considerably smaller. In this section, we present several different ways to measure variability.

Range

The simplest measure of variability is the range. The range can be found by subtracting the smallest value from the largest value in a data set. Let us return to the home sales data set to demonstrate the calculation of range. Refer to the data from home sales prices in Table 2.9. The largest home sales price is $456,250, and the smallest is $108,000. The range is $456,250 − $108,000 = $348,250.

TABLE 2.11 Annual Payouts for Two Different Investment Funds

Year    Fund A ($)    Fund B ($)
1       1,100           700
2       1,100         2,500
3       1,100         1,200
4       1,100         1,550
5       1,100         1,300
6       1,100           800
7       1,100           300
8       1,100         1,600
9       1,100         1,500
10      1,100           350
11      1,100           460
12      1,100           890
13      1,100         1,050
14      1,100           800
15      1,100         1,150
16      1,100         1,200
17      1,100         1,800
18      1,100           100
19      1,100         1,750
20      1,100         1,000
Mean    1,100         1,100


FIGURE 2.18 Histograms for Payouts of Past 20 Years from Fund A and Fund B (two histograms; horizontal axes: Fund A Payouts ($) and Fund B Payouts ($); vertical axes: Frequency)

Although the range is the easiest of the measures of variability to compute, it is seldom used as the only measure. The reason is that the range is based on only two of the observations and thus is highly influenced by extreme values. If, for example, we replace the selling price of $456,250 with $1.5 million, the range would be $1,500,000 − $108,000 = $1,392,000. This large value for the range would not be especially descriptive of the variability in the data because 11 of the 12 home selling prices are between $108,000 and $298,000. The range can be calculated in Excel using the MAX and MIN functions. The range value in cell E7 of Figure 2.19 calculates the range using the formula =MAX(B2:B13)-MIN(B2:B13). This subtracts the smallest value in the range B2:B13 from the largest value in the range B2:B13.
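The same range calculation, and its sensitivity to a single extreme value, can be sketched in Python. This is an illustration only; the function name value_range is our own.

```python
def value_range(data):
    """Range = largest value minus smallest value."""
    return max(data) - min(data)

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Replace the largest selling price with $1.5 million
prices_extreme = [1_500_000 if p == 456250 else p for p in prices]

range_original = value_range(prices)
range_extreme = value_range(prices_extreme)

print(f"range of original data:   {range_original:,}")
print(f"range with extreme value: {range_extreme:,}")
```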

Variance

The variance is a measure of variability that utilizes all the data. The variance is based on the deviation about the mean, which is the difference between the value of each observation (\(x_i\)) and the mean. For a sample, a deviation of an observation about the mean is written \((x_i - \bar{x})\). In the computation of the variance, the deviations about the mean are squared. In most statistical applications, the data being analyzed are for a sample. When we compute a sample variance, we are often interested in using it to estimate the population variance, \(\sigma^2\). Although a detailed explanation is beyond the scope of this text, for a random sample, it can be shown that, if the sum of the squared deviations about the sample mean is divided by n − 1, and not n, the resulting sample variance provides an unbiased estimate of the population variance.¹ For this reason, the sample variance, denoted by \(s^2\), is defined as follows:

SAMPLE VARIANCE

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \qquad (2.4) \]

If the data are for a population, the population variance, \(\sigma^2\), can be computed directly (rather than estimated by the sample variance). For a population of N observations and with \(\mu\) denoting the population mean, population variance is computed by

\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]

¹ Unbiased means that if we take a large number of independent random samples of the same size from the population and calculate the sample variance for each sample, the average of these sample variances will tend to be equal to the population variance.


FIGURE 2.19 Calculating Variability Measures for the Home Sales Data in Excel

Worksheet data: column A holds the Home Sale number (1 to 12) and column B the Selling Price ($) in cells B2:B13:
138000 254000 186000 257500 108000 254000 138000 298000 199500 208000 142000 456250

Cell    Statistic                   Formula                           Value
E2      Mean                        =AVERAGE(B2:B13)                  $219,937.50
E3      Median                      =MEDIAN(B2:B13)                   $203,750.00
E4      Mode 1                      =MODE.MULT(B2:B13)                $138,000.00
E5      Mode 2                      =MODE.MULT(B2:B13)                $254,000.00
E7      Range                       =MAX(B2:B13)-MIN(B2:B13)          $348,250.00
E8      Variance                    =VAR.S(B2:B13)                    9037501420
E9      Standard Deviation          =STDEV.S(B2:B13)                  $95,065.77
E11     Coefficient of Variation    =E9/E2                            43.22%
E13     85th Percentile             =PERCENTILE.EXC(B2:B13,0.85)      $305,912.50
E15     1st Quartile                =QUARTILE.EXC(B2:B13,1)           $139,000.00
E16     2nd Quartile                =QUARTILE.EXC(B2:B13,2)           $203,750.00
E17     3rd Quartile                =QUARTILE.EXC(B2:B13,3)           $256,625.00
E19     IQR                         =E17-E15                          $117,625.00

To illustrate the computation of the sample variance, we will use the data on class size from page 41 for the sample of five college classes. A summary of the data, including the computation of the deviations about the mean and the squared deviations about the mean, is shown in Table 2.12. The sum of squared deviations about the mean is \(\sum (x_i - \bar{x})^2 = 256\). Hence, with n − 1 = 4, the sample variance is

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{256}{4} = 64 \]

Note that the units of variance are squared. For instance, the sample variance for our calculation is \(s^2 = 64\) (students)². In Excel, you can find the variance for sample data using the VAR.S function. Figure 2.19 shows the data for home sales examined in the previous section. The variance in cell E8 is calculated using the formula =VAR.S(B2:B13). Excel calculates the variance of the sample of 12 home sales to be 9,037,501,420.
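The class size calculation can be traced step by step in Python. The sketch below mirrors equation (2.4) directly and then checks the result against the standard library; the variable names are our own, and the population variance line is included only to show the effect of dividing by n instead of n − 1.

```python
from statistics import variance

class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
x_bar = sum(class_sizes) / n                      # sample mean
deviations = [x - x_bar for x in class_sizes]     # deviations about the mean
sum_sq_dev = sum(d ** 2 for d in deviations)      # sum of squared deviations

sample_var = sum_sq_dev / (n - 1)   # divide by n - 1 for a sample
population_var = sum_sq_dev / n     # divide by N if these were the whole population

print(x_bar, sum(deviations), sum_sq_dev)
print(sample_var, population_var)
print(variance(class_sizes))        # library sample variance, for comparison
```

The deviations sum to zero, as in Table 2.12, and the hand computation agrees with statistics.variance, which also divides by n − 1.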

Standard Deviation

The standard deviation is defined to be the positive square root of the variance. We use s to denote the sample standard deviation and \(\sigma\) to denote the population standard deviation. The sample standard deviation, s, is a point estimate of the population standard deviation, \(\sigma\), and is derived from the sample variance in the following way:

SAMPLE STANDARD DEVIATION

\[ s = \sqrt{s^2} \qquad (2.5) \]


TABLE 2.12 Computation of Deviations and Squared Deviations About the Mean for the Class Size Data

Number of Students    Mean Class Size    Deviation About the Mean    Squared Deviation About the Mean
in Class (x_i)        (x̄)                (x_i − x̄)                   (x_i − x̄)²
46                    44                    2                           4
54                    44                   10                         100
42                    44                   −2                           4
46                    44                    2                           4
32                    44                  −12                         144
                                          Σ(x_i − x̄) = 0             Σ(x_i − x̄)² = 256

If the data are for a population, the population standard deviation \(\sigma\) is obtained by taking the positive square root of the population variance: \(\sigma = \sqrt{\sigma^2}\). To calculate the population variance and population standard deviation in Excel, we use the functions VAR.P and STDEV.P.

The sample variance for the sample of class sizes in five college classes is \(s^2 = 64\). Thus, the sample standard deviation is \(s = \sqrt{64} = 8\). Recall that the units associated with the variance are squared and that it is difficult to interpret the meaning of squared units. Because the standard deviation is the square root of the variance, the units of the variance, (students)² in our example, are converted to students in the standard deviation. In other words, the standard deviation is measured in the same units as the original data. For this reason, the standard deviation is more easily compared to the mean and other statistics that are measured in the same units as the original data. Figure 2.19 shows the Excel calculation for the sample standard deviation of the home sales data, which can be calculated using Excel’s STDEV.S function. The sample standard deviation in cell E9 is calculated using the formula =STDEV.S(B2:B13). Excel calculates the sample standard deviation for the home sales to be $95,065.77.

Coefficient of Variation

In some situations we may be interested in a descriptive statistic that indicates how large the standard deviation is relative to the mean. This measure is called the coefficient of variation and is usually expressed as a percentage.

COEFFICIENT OF VARIATION

\[ \left( \frac{\text{Standard deviation}}{\text{Mean}} \times 100 \right)\% \qquad (2.6) \]

For the class size data on page 41, we found a sample mean of 44 and a sample standard deviation of 8. The coefficient of variation is (8/44 × 100) = 18.2%. In words, the coefficient of variation tells us that the sample standard deviation is 18.2% of the value of the sample mean. The coefficient of variation for the home sales data is shown in Figure 2.19. It is calculated in cell E11 using the formula =E9/E2, which divides the standard deviation by the mean. The coefficient of variation for the home sales data is 43.22%. In general, the coefficient of variation is a useful statistic for comparing the relative variability of different variables, each with different standard deviations and different means.
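As a cross-check on equations (2.5) and (2.6), the sample standard deviation and coefficient of variation for both data sets can be computed in Python. This is a minimal sketch; coefficient_of_variation is a helper name of our own, not a library function.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the sample mean."""
    return stdev(data) / mean(data) * 100

class_sizes = [46, 54, 42, 46, 32]
home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

s_class = stdev(class_sizes)    # sample standard deviation (divides by n - 1)
s_home = stdev(home_prices)
cv_class = coefficient_of_variation(class_sizes)
cv_home = coefficient_of_variation(home_prices)

print(f"class sizes: s = {s_class:.2f}, CV = {cv_class:.1f}%")
print(f"home prices: s = {s_home:,.2f}, CV = {cv_home:.2f}%")
```

Because the coefficient of variation is unit-free, the two results can be compared directly even though one data set is measured in students and the other in dollars.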

2.7 Analyzing Distributions

In Section 2.4 we demonstrated how to create frequency, relative, and cumulative distributions for data sets. Distributions are very useful for interpreting and analyzing data. A distribution describes the overall variability of the observed values of a variable. In this section we introduce additional ways of analyzing distributions.


Percentiles

A percentile is the value of a variable at which a specified (approximate) percentage of observations are below that value. The pth percentile tells us the point in the data where approximately p% of the observations have values less than the pth percentile; hence, approximately (100 − p)% of the observations have values greater than the pth percentile. Colleges and universities frequently report admission test scores in terms of percentiles. For instance, suppose an applicant obtains a raw score of 54 on the verbal portion of an admission test. How this student performed in relation to other students taking the same test may not be readily apparent. However, if the raw score of 54 corresponds to the 70th percentile, we know that approximately 70% of the students scored lower than this individual, and approximately 30% of the students scored higher. To calculate the pth percentile for a data set containing n observations we must first arrange the data in ascending order (smallest value to largest value). The smallest value is in position 1, the next smallest value is in position 2, and so on. The location of the pth percentile, denoted by \(L_p\), is computed using the following equation:

LOCATION OF THE pth PERCENTILE

\[ L_p = \frac{p}{100}(n + 1) \qquad (2.7) \]

Several procedures can be used to compute the location of the pth percentile using sample data. All provide similar values, especially for large data sets. The procedure we show here is the procedure used by Excel’s PERCENTILE.EXC function as well as several other statistical software packages.

Once we find the position of the value of the pth percentile, we have the information we need to calculate the pth percentile. To illustrate the computation of the pth percentile, let us compute the 85th percentile for the home sales data in Table 2.9. We begin by arranging the sample of 12 home selling prices in ascending order.

Value:     108,000  138,000  138,000  142,000  186,000  199,500  208,000  254,000  254,000  257,500  298,000  456,250
Position:        1        2        3        4        5        6        7        8        9       10       11       12

The position of each observation in the sorted data is shown directly below its value. For instance, the smallest value (108,000) is in position 1, the next smallest value (138,000) is in position 2, and so on. Using equation (2.7) with p = 85 and n = 12, the location of the 85th percentile is

\[ L_{85} = \frac{p}{100}(n + 1) = \frac{85}{100}(12 + 1) = 11.05 \]

The interpretation of \(L_{85} = 11.05\) is that the 85th percentile is 5% of the way between the value in position 11 and the value in position 12. In other words, the 85th percentile is the value in position 11 (298,000) plus 0.05 times the difference between the value in position 12 (456,250) and the value in position 11 (298,000). Thus, the 85th percentile is

85th percentile = 298,000 + 0.05(456,250 − 298,000) = 298,000 + 0.05(158,250) = 305,912.50

Therefore, $305,912.50 represents the 85th percentile of the home sales data. The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC. Figure 2.19 shows the Excel calculation for the 85th percentile of the home sales data. The value in cell E13 is calculated using the formula =PERCENTILE.EXC(B2:B13,0.85); B2:B13 defines the data set for which we are calculating a percentile, and 0.85 defines the percentile of interest.
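The location-and-interpolation procedure of equation (2.7) can be written as a short Python function. This is a sketch of the procedure described above, not Excel's implementation; the function name percentile_exc and its error handling for locations outside positions 1 to n are our own assumptions.

```python
import math

def percentile_exc(data, p):
    """pth percentile using the location L_p = (p / 100) * (n + 1)."""
    values = sorted(data)                  # ascending order; position 1 is values[0]
    n = len(values)
    location = p / 100 * (n + 1)
    if location < 1 or location > n:
        raise ValueError("location falls outside positions 1..n")
    lower_pos = math.floor(location)       # position at or just below the location
    fraction = location - lower_pos        # how far toward the next position
    lower = values[lower_pos - 1]
    if fraction == 0:
        return lower
    upper = values[lower_pos]
    return lower + fraction * (upper - lower)

home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

p85 = percentile_exc(home_prices, 85)
print(f"85th percentile: {p85:,.2f}")
```

Because the location is computed in floating point, the result may differ from the hand calculation in the last decimal places, so comparisons should allow a small tolerance.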


Quartiles

It is often desirable to divide data into four parts, with each part containing approximately one-fourth, or 25 percent, of the observations. These division points are referred to as the quartiles and are defined as follows:

Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (also the median)
Q3 = third quartile, or 75th percentile

Similar to percentiles, there are multiple methods for computing quartiles that all give similar results. Here we describe a commonly used method that is equivalent to Excel’s QUARTILE.EXC function.

To demonstrate quartiles, the home sales data are again arranged in ascending order.

Value:     108,000  138,000  138,000  142,000  186,000  199,500  208,000  254,000  254,000  257,500  298,000  456,250
Position:        1        2        3        4        5        6        7        8        9       10       11       12

We already identified Q2, the second quartile (median), as 203,750. To find Q1 and Q3 we must find the 25th and 75th percentiles. For Q1,

\[ L_{25} = \frac{p}{100}(n + 1) = \frac{25}{100}(12 + 1) = 3.25 \]

25th percentile = 138,000 + 0.25(142,000 − 138,000) = 138,000 + 0.25(4,000) = 139,000

For Q3,

\[ L_{75} = \frac{p}{100}(n + 1) = \frac{75}{100}(12 + 1) = 9.75 \]

75th percentile = 254,000 + 0.75(257,500 − 254,000) = 254,000 + 0.75(3,500) = 256,625

Therefore, the 25th percentile for the home sales data is $139,000 and the 75th percentile is $256,625. The quartiles divide the home sales data into four parts, with each part containing 25% of the observations.

108,000 138,000 138,000 | 142,000 186,000 199,500 | 208,000 254,000 254,000 | 257,500 298,000 456,250
                   Q1 = 139,000              Q2 = 203,750              Q3 = 256,625

The difference between the third and first quartiles is often referred to as the interquartile range, or IQR. For the home sales data, IQR = Q3 − Q1 = 256,625 − 139,000 = 117,625. Because it excludes the smallest and largest 25% of values in the data, the IQR is a useful measure of variation for data that have extreme values or are highly skewed. A quartile can be computed in Excel using the function QUARTILE.EXC. Figure 2.19 shows the calculations for first, second, and third quartiles for the home sales data. The formula used in cell E15 is =QUARTILE.EXC(B2:B13,1). The range B2:B13 defines the data set, and 1 indicates that we want to compute the first quartile. Cells E16 and E17 use similar formulas to compute the second and third quartiles.
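In Python, the standard library function statistics.quantiles with method="exclusive" locates cut points at (p/100)(n + 1), the same location rule as equation (2.7), so it can be used to cross-check the quartiles and IQR. The sketch below assumes Python 3.8 or later; the variable names are our own.

```python
from statistics import quantiles

home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = quantiles(home_prices, n=4, method="exclusive")
iqr = q3 - q1

print(f"Q1 = {q1:,.2f}, Q2 = {q2:,.2f}, Q3 = {q3:,.2f}, IQR = {iqr:,.2f}")
```

Passing method="inclusive" instead would apply a different location rule and can give slightly different quartiles, which illustrates the earlier remark that several procedures exist.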

z-Scores

A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set’s standard deviation. Suppose we have a sample of n observations, with the values denoted by \(x_1, x_2, \ldots, x_n\). In addition, assume that the sample mean, \(\bar{x}\), and the sample standard deviation, s, are already computed. Associated with each value, \(x_i\), is another value called its z-score. Equation (2.8) shows how the z-score is computed for each \(x_i\):

z-SCORE

\[ z_i = \frac{x_i - \bar{x}}{s} \qquad (2.8) \]

where
\(z_i\) = the z-score for \(x_i\)
\(\bar{x}\) = the sample mean
s = the sample standard deviation

The z-score is often called the standardized value. The z-score, \(z_i\), can be interpreted as the number of standard deviations \(x_i\) is from the mean. For example, \(z_1 = 1.2\) indicates that \(x_1\) is 1.2 standard deviations greater than the sample mean. Similarly, \(z_2 = -0.5\) indicates that \(x_2\) is 0.5, or 1/2, standard deviation less than the sample mean. A z-score greater than zero occurs for observations with a value greater than the mean, and a z-score less than zero occurs for observations with a value less than the mean. A z-score of zero indicates that the value of the observation is equal to the mean. The z-scores for the class size data are computed in Table 2.13. Recall the previously computed sample mean, \(\bar{x} = 44\), and sample standard deviation, s = 8. The z-score of −1.50 for the fifth observation shows that it is farthest from the mean; it is 1.50 standard deviations below the mean. The z-score can be calculated in Excel using the function STANDARDIZE. Figure 2.20 demonstrates the use of the STANDARDIZE function to compute z-scores for the home sales data. To calculate the z-scores, we must provide the mean and standard deviation for the data set in the arguments of the STANDARDIZE function. For instance, the z-score in cell C2 is calculated with the formula =STANDARDIZE(B2, $B$15, $B$16), where cell B15 contains the mean of the home sales data and cell B16 contains the standard deviation of the home sales data. We can then copy and paste this formula into cells C3:C13.
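Equation (2.8) translates directly into a few lines of Python. The sketch below is illustrative; z_scores is a helper name of our own that plays the role of applying STANDARDIZE to every observation.

```python
from statistics import mean, stdev

def z_scores(data):
    """z-score of each observation: (x_i - sample mean) / sample standard deviation."""
    x_bar = mean(data)
    s = stdev(data)
    return [(x - x_bar) / s for x in data]

class_sizes = [46, 54, 42, 46, 32]
home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

class_z = z_scores(class_sizes)
home_z = z_scores(home_prices)

print([round(z, 2) for z in class_z])
print([round(z, 3) for z in home_z])
```

The z-scores of any data set computed this way have a mean of zero, since the deviations about the mean sum to zero.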

TABLE 2.13 z-Scores for the Class Size Data

No. of Students    Deviation About the Mean    z-Score
in Class (x_i)     (x_i − x̄)                   (x_i − x̄)/s
46                   2                           2/8 =  0.25
54                  10                          10/8 =  1.25
42                  −2                          −2/8 = −0.25
46                   2                           2/8 =  0.25
32                 −12                         −12/8 = −1.50

Empirical Rule

When the distribution of data exhibits a symmetric bell-shaped distribution, as shown in Figure 2.21, the empirical rule can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean. Many, but not all, distributions of data found in practice exhibit a symmetric bell-shaped distribution.


FIGURE 2.20 Calculating z-Scores for the Home Sales Data in Excel

Row    A: Home Sale    B: Selling Price ($)    C: z-Score formula                  z-Score value
2      1               138,000                 =STANDARDIZE(B2,$B$15,$B$16)        −0.862
3      2               254,000                 =STANDARDIZE(B3,$B$15,$B$16)         0.358
4      3               186,000                 =STANDARDIZE(B4,$B$15,$B$16)        −0.357
5      4               257,500                 =STANDARDIZE(B5,$B$15,$B$16)         0.395
6      5               108,000                 =STANDARDIZE(B6,$B$15,$B$16)        −1.177
7      6               254,000                 =STANDARDIZE(B7,$B$15,$B$16)         0.358
8      7               138,000                 =STANDARDIZE(B8,$B$15,$B$16)        −0.862
9      8               298,000                 =STANDARDIZE(B9,$B$15,$B$16)         0.821
10     9               199,500                 =STANDARDIZE(B10,$B$15,$B$16)       −0.215
11     10              208,000                 =STANDARDIZE(B11,$B$15,$B$16)       −0.126
12     11              142,000                 =STANDARDIZE(B12,$B$15,$B$16)       −0.820
13     12              456,250                 =STANDARDIZE(B13,$B$15,$B$16)        2.486

Cell B15, Mean: =AVERAGE(B2:B13), value $219,937.50
Cell B16, Standard Deviation: =STDEV.S(B2:B13), value $95,065.77

FIGURE 2.21 A Symmetric Bell-Shaped Distribution


EMPIRICAL RULE

For data having a bell-shaped distribution:
• Approximately 68% of the data values will be within 1 standard deviation of the mean.
• Approximately 95% of the data values will be within 2 standard deviations of the mean.
• Almost all of the data values will be within 3 standard deviations of the mean.

The height of adult males in the United States has a bell-shaped distribution similar to that shown in Figure 2.21, with a mean of approximately 69.5 inches and standard deviation of approximately 3 inches. Using the empirical rule, we can draw the following conclusions.
• Approximately 68% of adult males in the United States have heights between 69.5 − 3 = 66.5 and 69.5 + 3 = 72.5 inches.
• Approximately 95% of adult males in the United States have heights between 63.5 and 75.5 inches.
• Almost all adult males in the United States have heights between 60.5 and 78.5 inches.
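The percentages in the empirical rule can be checked numerically if we are willing to assume that the bell-shaped distribution is exactly normal, which is a stronger assumption than the text makes. The sketch below uses Python's statistics.NormalDist (Python 3.8 or later); the function name share_within is our own.

```python
from statistics import NormalDist

def share_within(k, dist):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

# Assumption: adult male heights treated as exactly normal, mean 69.5 in., sd 3 in.
heights = NormalDist(mu=69.5, sigma=3)

for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    print(f"within {k} sd ({low:.1f} to {high:.1f} inches): {share_within(k, heights):.1%}")
```

Under this assumption the three proportions are roughly 68%, 95%, and 99.7%, consistent with the approximate statements of the rule. Real data that are only roughly bell-shaped will match these figures less closely.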

Identifying Outliers

Sometimes a data set will have one or more observations with unusually large or unusually small values. These extreme values are called outliers. Experienced statisticians take steps to identify outliers and then review each one carefully. An outlier may be a data value that has been incorrectly recorded; if so, it can be corrected before the data are analyzed further. An outlier may also be from an observation that doesn’t belong to the population we are studying and was incorrectly included in the data set; if so, it can be removed. Finally, an outlier may be an unusual data value that has been recorded correctly and is a member of the population we are studying. In such cases, the observation should remain. Standardized values (z-scores) can be used to identify outliers. Recall that the empirical rule allows us to conclude that for data with a bell-shaped distribution, almost all the data values will be within 3 standard deviations of the mean. Hence, in using z-scores to identify outliers, we recommend treating any data value with a z-score less than −3 or greater than +3 as an outlier. Such data values can then be reviewed to determine their accuracy and whether they belong in the data set.
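The z-score rule for flagging outliers can be sketched in Python as follows. The helper name z_score_outliers and its threshold parameter are our own; the default of 3 follows the recommendation above.

```python
from statistics import mean, stdev

def z_score_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(data)
    s = stdev(data)
    return [x for x in data if abs((x - x_bar) / s) > threshold]

home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

flagged = z_score_outliers(home_prices)
flagged_at_2 = z_score_outliers(home_prices, threshold=2.0)

print("flagged with |z| > 3:", flagged)
print("flagged with |z| > 2:", flagged_at_2)
```

For the home sales data the largest z-score is about 2.49 (Figure 2.20), so no observation is flagged at the ±3 threshold; the $456,250 sale is flagged only under a looser threshold such as ±2, shown here purely for comparison.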

Boxplots

A boxplot is a graphical summary of the distribution of data. A boxplot is developed from the quartiles for a data set. Boxplots are also known as box-and-whisker plots. Figure 2.22 is a boxplot for the home sales data. Here are the steps used to construct the boxplot:

1. A box is drawn with the ends of the box located at the first and third quartiles. For the home sales data, Q1 = 139,000 and Q3 = 256,625. This box contains the middle 50% of the data.
2. A vertical line is drawn in the box at the location of the median (203,750 for the home sales data).
3. By using the interquartile range, IQR = Q3 − Q1, limits are located. The limits for the boxplot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the home sales data, IQR = Q3 − Q1 = 256,625 − 139,000 = 117,625. Thus, the limits are 139,000 − 1.5(117,625) = −37,437.5 and 256,625 + 1.5(117,625) = 433,062.5. Data outside these limits are considered outliers. Clearly, we would not expect a home sales price less than 0, so we could also define the lower limit here to be $0.
4. The dashed lines in Figure 2.22 are called whiskers. The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits computed in Step 3. Thus, the whiskers end at home sales values of 108,000 and 298,000.
5. Finally, the location of each outlier is shown with an asterisk (*). In Figure 2.22, we see one outlier, 456,250.

FIGURE 2.22 Boxplot for the Home Sales Data (horizontal boxplot; horizontal axis: Price ($), 0 to 500,000; labeled elements: Q1, Median, Q3, IQR, Whisker, Outlier marked with *)

Boxplots can be drawn horizontally or vertically. Figure 2.22 shows a horizontal boxplot, and Figure 2.23 shows vertical boxplots.

Boxplots are also very useful for comparing different data sets. For instance, if we want to compare home sales from several different communities, we could create boxplots for recent home sales in each community. An example of such boxplots is shown in Figure 2.23. What can we learn from these boxplots? The most expensive houses appear to be in Shadyside and the cheapest houses in Hamilton. The median home sales price in Groton is about the same as the median home sales price in Irving. However, home sales prices in Irving have much greater variability. Homes appear to be selling in Irving for many different prices, from very low to very high. Home sales prices have the least variation in Groton and Hamilton. The only outlier that appears in these boxplots is for home sales in Groton. However, note that most homes sell for very similar prices in Groton, so the selling price does not have to be too far from the median to be considered an outlier.
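The numbers behind the boxplot construction steps (quartiles, limits, whisker ends, and outliers) can be computed without drawing anything. The Python sketch below follows those steps for the home sales data; boxplot_summary is a helper name of our own, and the quartiles use the standard library's exclusive method, which locates cut points at (p/100)(n + 1).

```python
from statistics import quantiles

def boxplot_summary(data):
    """Quartiles, 1.5*IQR limits, whisker ends, and outliers for a boxplot."""
    values = sorted(data)
    q1, med, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in values if lower_limit <= x <= upper_limit]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "whisker_low": min(inside),    # smallest value inside the limits
        "whisker_high": max(inside),   # largest value inside the limits
        "outliers": [x for x in values if x < lower_limit or x > upper_limit],
    }

home_prices = [138000, 254000, 186000, 257500, 108000, 254000,
               138000, 298000, 199500, 208000, 142000, 456250]

summary = boxplot_summary(home_prices)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that the boxplot rule flags $456,250 as an outlier even though its z-score is below 3, because the two rules use different definitions of an outlier.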

FIGURE 2.23 Boxplots Comparing Home Sales Prices in Different Communities

[Vertical boxplots for Fairview, Shadyside, Groton, Irving, and Hamilton. Axis: Selling Price ($), 100,000 to 500,000.]


Note that boxplots use a different definition of an outlier than what we described for using z-scores because the distribution of the data in a boxplot is not assumed to follow a bell-shaped curve. However, the interpretation is the same. The outliers in a boxplot are extreme values that should be investigated to ensure data accuracy.

The step-by-step directions below illustrate how to create boxplots in Excel for both a single variable and multiple variables. First we will create a boxplot for a single variable using the HomeSales file.

Step 1. Select cells B1:B13
Step 2. Click the Insert tab on the Ribbon
        Click the Insert Statistic Chart button in the Charts group
        Choose the Box and Whisker chart from the drop-down menu

The resulting boxplot created in Excel is shown in Figure 2.24. Comparing this figure to Figure 2.22, we see that all the important elements of a boxplot are generated here. Excel orients the boxplot vertically, and by default it also includes a marker for the mean.

Next we will use the HomeSalesComparison file to create boxplots in Excel for multiple variables similar to what is shown in Figure 2.23.

Step 1. Select cells B1:F11
Step 2. Click the Insert tab on the Ribbon
        Click the Insert Statistic Chart button in the Charts group
        Choose the Box and Whisker chart from the drop-down menu

The boxplot created in Excel is shown in Figure 2.25. Excel again orients the boxplot vertically. The different selling locations are shown in the Legend at the top of the figure, and different colors are used for each boxplot.

FIGURE 2.24 Boxplot Created in Excel for Home Sales Data

Home Sale   Selling Price ($)
1           138000
2           254000
3           186000
4           257500
5           108000
6           254000
7           138000
8           298000
9           199500
10          208000
11          142000
12          456250

[Vertical boxplot of the selling prices. Axis: Selling Price ($), 0 to 500,000. Labeled elements: Whisker, Q1, Mean (X), Median, IQR, Q3, Outlier.]


FIGURE 2.25 Boxplots for Multiple Variables Created in Excel

Selling Prices
Fairview   Shadyside   Groton    Irving    Hamilton
302,000    336,000     152,000   201,000   102,000
265,000    398,000     158,000   365,000   108,000
280,000    378,000     154,000   115,000   88,000
220,000    298,000     170,000   105,000   111,000
149,000    425,000     132,000   225,000   105,000
155,000    344,000     164,000   115,000   87,000
198,000    302,000     198,000   108,000   95,000
187,000    300,000     158,000   218,000   111,000
208,000    298,000     149,000   454,000   98,000
174,000    342,000     165,000   103,000   78,000

[Vertical boxplots, one per community, with a legend listing Fairview, Shadyside, Groton, Irving, and Hamilton. Axis: Selling Price ($), 0 to 500,000.]

Notes + Comments

1. The empirical rule applies only to distributions that have an approximately bell-shaped distribution because it is based on properties of the normal probability distribution, which we will discuss in Chapter 4. For distributions that do not have a bell-shaped distribution, one can use Chebyshev’s theorem to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean. Chebyshev’s theorem states that at least \( (1 - 1/z^2) \times 100\% \) of the data values must be within z standard deviations of the mean, where z is any value greater than 1.
2. The ability to create boxplots in Excel is only available in more recent versions of Excel. Unfortunately, there is no easy way to generate boxplots in older versions of Excel that do not have the Insert Statistic Chart button.
3. Note that the boxplot in Figure 2.24 has been formatted using Excel’s Chart Elements button. These options will be discussed in more detail in Chapter 3. We have also added the text descriptions of the different elements of the boxplot.
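As a quick illustration of Chebyshev's theorem from note 1, the sketch below evaluates the bound (1 − 1/z²) × 100% for a few values of z. The function name `chebyshev_min_percent` is a hypothetical helper introduced only for this example.

```python
def chebyshev_min_percent(z):
    """Minimum percentage of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z greater than 1")
    return (1 - 1 / z**2) * 100

for z in (2, 3, 4):
    print(f"z = {z}: at least {chebyshev_min_percent(z):.1f}% of the data")
```

For bell-shaped data the empirical rule gives sharper statements than this bound; Chebyshev's theorem trades precision for the fact that it holds for any distribution.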

2.8 Measures of Association Between Two Variables

Thus far, we have examined numerical methods used to summarize the data for one variable at a time. Often a manager or decision maker is interested in the relationship between two variables. In this section, we present covariance and correlation as descriptive measures of the relationship between two variables.

To illustrate these concepts, we consider the case of the sales manager of Queensland Amusement Park, who is in charge of ordering bottled water to be purchased by park customers. The sales manager believes that daily bottled water sales in the summer are related to the outdoor temperature. Table 2.14 shows data for high temperatures and bottled water sales for 14 summer days. The data have been sorted by high temperature from lowest value to highest value.

Scatter Charts

A scatter chart is also known as a scatter diagram or a scatter plot.

A scatter chart is a useful graph for analyzing the relationship between two variables. Figure 2.26 shows a scatter chart for sales of bottled water versus the high temperature experienced on 14 consecutive days. The scatter chart in the figure suggests that higher daily high temperatures are associated with higher bottled water sales. This is an example of a positive relationship, because when one variable (high temperature) increases, the other variable (sales of bottled water) generally also increases. The scatter chart also suggests that a straight line could be used as an approximation for the relationship between high temperature and sales of bottled water.

Scatter charts are covered in Chapter 3.

Table 2.14 Data for Bottled Water Sales at Queensland Amusement Park for a Sample of 14 Summer Days (file: BottledWater)

High Temperature (°F)   Bottled Water Sales (Cases)
78                      23
79                      22
80                      24
80                      22
82                      24
83                      26
85                      27
86                      25
87                      28
87                      26
88                      29
88                      30
90                      31
92                      31

FIGURE 2.26 Chart Showing the Positive Linear Relation Between Sales and High Temperatures

[Scatter chart. Horizontal axis: High Temperature (°F), 76 to 94. Vertical axis: Sales (cases), 0 to 35.]


Covariance

Covariance is a descriptive measure of the linear association between two variables. For a sample of size n with the observations \((x_1, y_1)\), \((x_2, y_2)\), and so on, the sample covariance is defined as follows:

Sample Covariance

\[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \tag{2.9} \]

If data consist of a population of N observations, the population covariance \(\sigma_{xy}\) is computed by:

\[ \sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}. \]

Note that this equation is similar to equation (2.9), but uses population parameters instead of sample estimates (and divides by N instead of n − 1 for technical reasons beyond the scope of this book).

Equation (2.9) pairs each \(x_i\) with a \(y_i\). We then sum the products obtained by multiplying the deviation of each \(x_i\) from its sample mean, \((x_i - \bar{x})\), by the deviation of the corresponding \(y_i\) from its sample mean, \((y_i - \bar{y})\); this sum is then divided by n − 1.

To measure the strength of the linear relationship between the high temperature x and the sales of bottled water y at Queensland, we use equation (2.9) to compute the sample covariance. The calculations in Table 2.15 show the computation of \(\sum (x_i - \bar{x})(y_i - \bar{y})\). Note that for our calculations, \(\bar{x} = 84.6\) and \(\bar{y} = 26.3\). The covariance calculated in Table 2.15 is \(s_{xy} = 12.8\). Because the covariance is greater than 0, it indicates a positive relationship between the high temperature and sales of bottled water. This verifies the relationship we saw in the scatter chart in Figure 2.26 that as the high temperature for a day increases, sales of bottled water generally increase.

The sample covariance can also be calculated in Excel using the COVARIANCE.S function. Figure 2.27 shows the data from Table 2.14 entered into an Excel Worksheet. The covariance is calculated in cell B17 using the formula =COVARIANCE.S(A2:A15, B2:B15).

Table 2.15 Sample Covariance Calculations for Daily High Temperature and Bottled Water Sales at Queensland Amusement Park

         \(x_i\)    \(y_i\)   \(x_i - \bar{x}\)   \(y_i - \bar{y}\)   \((x_i - \bar{x})(y_i - \bar{y})\)
         78       23       −6.6       −3.3       21.78
         79       22       −5.6       −4.3       24.08
         80       24       −4.6       −2.3       10.58
         80       22       −4.6       −4.3       19.78
         82       24       −2.6       −2.3        5.98
         83       26       −1.6       −0.3        0.48
         85       27        0.4        0.7        0.28
         86       25        1.4       −1.3       −1.82
         87       28        2.4        1.7        4.08
         87       26        2.4       −0.3       −0.72
         88       29        3.4        2.7        9.18
         88       30        3.4        3.7       12.58
         90       31        5.4        4.7       25.38
         92       31        7.4        4.7       34.78
Totals   1,185    368       0.6       −0.2      166.42

\[ \bar{x} = 84.6, \qquad \bar{y} = 26.3 \]

\[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{166.42}{14 - 1} = 12.8 \]
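The calculation in Table 2.15 can be reproduced outside Excel. The sketch below is one possible Python version of equation (2.9) applied to the Table 2.14 data; the function name `sample_covariance` is an illustrative choice, not something defined in the text. It uses unrounded means, whereas Table 2.15 rounds the means to one decimal place, so intermediate values differ slightly while the covariance still rounds to 12.8.

```python
# Table 2.14: daily high temperature (deg F) and bottled water sales (cases).
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
cases = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]

def sample_covariance(x, y):
    """Equation (2.9): sum of paired deviation products divided by n - 1."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of observations")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

s_xy = sample_covariance(temps, cases)
print(f"Sample covariance: {s_xy:.2f}")
```

Dividing by `n` instead of `n - 1` in the last line of the function would give the population covariance instead.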


FIGURE 2.27 Calculating Covariance and Correlation Coefficient for Bottled Water Sales Using Excel

[Excel worksheet. Column A, High Temperature (°F), and column B, Bottled Water Sales (cases), hold the 14 observations from Table 2.14 in rows 2 through 15.]

Cell   Label                      Formula                          Value
B17    Covariance:                =COVARIANCE.S(A2:A15,B2:B15)     12.80
B18    Correlation Coefficient:   =CORREL(A2:A15,B2:B15)           0.93

A2:A15 defines the range for the x variable (high temperature), and B2:B15 defines the range for the y variable (sales of bottled water). For the bottled water, the covariance is positive, indicating that higher temperatures (x) are associated with higher sales (y). If the covariance is near 0, then the x and y variables are not linearly related. If the covariance is less than 0, then the x and y variables are negatively related, which means that as x increases, y generally decreases. Figure 2.28 demonstrates several possible scatter charts and their associated covariance values.

One problem with using covariance is that the magnitude of the covariance value is difficult to interpret. Larger \(s_{xy}\) values do not necessarily mean a stronger linear relationship because the units of covariance depend on the units of x and y. For example, suppose we are interested in the relationship between height x and weight y for individuals. Clearly the strength of the relationship should be the same whether we measure height in feet or inches. Measuring the height in inches, however, gives us much larger numerical values for \((x_i - \bar{x})\) than when we measure height in feet. Thus, with height measured in inches, we would obtain a larger value for the numerator \(\sum (x_i - \bar{x})(y_i - \bar{y})\) in equation (2.9), and hence a larger covariance, when in fact the relationship does not change.
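The unit dependence of covariance is easy to see numerically. The sketch below uses a small set of hypothetical heights and weights (invented for illustration; they do not come from any data file in this chapter) and computes the sample covariance twice, once with height in feet and once in inches.

```python
import math

def sample_covariance(x, y):
    """Sample covariance as in equation (2.9)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

# Hypothetical data: heights in feet and weights in pounds for five people.
heights_ft = [5.0, 5.5, 6.0, 5.25, 5.75]
weights_lb = [120, 150, 190, 140, 170]

# The same heights expressed in inches.
heights_in = [12 * h for h in heights_ft]

cov_ft = sample_covariance(heights_ft, weights_lb)
cov_in = sample_covariance(heights_in, weights_lb)

print("Covariance with height in feet:  ", cov_ft)
print("Covariance with height in inches:", cov_in)
print("Ratio:", cov_in / cov_ft)
```

The covariance grows by the conversion factor of 12 even though the people, and therefore the underlying relationship, are unchanged.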


FIGURE 2.28 Scatter Charts and Associated Covariance Values for Different Variable Relationships

[Three scatter charts of y versus x:
 \(s_{xy}\) positive (x and y are positively linearly related);
 \(s_{xy}\) approximately 0 (x and y are not linearly related);
 \(s_{xy}\) negative (x and y are negatively linearly related).]

Correlation Coefficient

The correlation coefficient measures the strength of the linear relationship between two variables and, unlike covariance, is not affected by the units of measurement for x and y. For sample data, the correlation coefficient is defined as follows:

Sample Correlation Coefficient

\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \tag{2.10} \]

where

\(r_{xy}\) = sample correlation coefficient
\(s_{xy}\) = sample covariance
\(s_x\) = sample standard deviation of x
\(s_y\) = sample standard deviation of y

If data are a population, the population correlation coefficient is computed by \(\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}\). Note that this is similar to equation (2.10) but uses population parameters instead of sample estimates.

The sample correlation coefficient is computed by dividing the sample covariance by the product of the sample standard deviation of x and the sample standard deviation of y. This scales the correlation coefficient so that it will always take values between −1 and +1.

Let us now compute the sample correlation coefficient for bottled water sales at Queensland Amusement Park. Recall that we calculated \(s_{xy} = 12.8\) using equation (2.9). Using data in Table 2.14, we can compute sample standard deviations for x and y.

\[ s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = 4.36 \]

\[ s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = 3.15 \]

The sample correlation coefficient is computed from equation (2.10) as follows:

\[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{12.8}{(4.36)(3.15)} = 0.93 \]
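The same computation can be sketched in Python as a check on the hand calculation. The helper names below (`sample_cov`, `sample_std`) are illustrative assumptions; the data are the 14 observations from Table 2.14.

```python
from math import sqrt

# Table 2.14: daily high temperature (deg F) and bottled water sales (cases).
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
cases = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]

def sample_cov(x, y):
    """Sample covariance, equation (2.9)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def sample_std(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return sqrt(sample_cov(x, x))

s_xy = sample_cov(temps, cases)
s_x = sample_std(temps)
s_y = sample_std(cases)

# Equation (2.10): covariance scaled by the two standard deviations.
r_xy = s_xy / (s_x * s_y)

print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}, s_xy = {s_xy:.2f}, r_xy = {r_xy:.2f}")
```

Because both the numerator and the denominator carry the units of x times the units of y, the units cancel, which is why the correlation coefficient does not depend on the measurement scale.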

The correlation coefficient can take only values between −1 and +1. Correlation coefficient values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients greater than 0 indicate a positive linear relationship between the x and y variables. The closer the correlation coefficient is to +1, the closer the x and y values are to forming a straight line that trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative linear relationship between the x and y variables. The closer the correlation coefficient is to −1, the closer the x and y values are to forming a straight line with negative slope. Because \(r_{xy} = 0.93\) for the bottled water, we know that there is a very strong positive linear relationship between these two variables. As we can see in Figure 2.26, one could draw a straight line with a positive slope that would be very close to all of the data points in the scatter chart.

Because the correlation coefficient defined here measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between the two variables is nonlinear. For example, the scatter chart in Figure 2.29 shows the relationship between the amount spent by a small retail store for environmental control (heating and cooling) and the daily high outside temperature for 100 consecutive days. The sample correlation coefficient for these data is \(r_{xy} = -0.007\) and indicates that there is no linear relationship between the two variables. However, Figure 2.29 provides strong visual evidence of a nonlinear relationship. That is, we can see that as the daily high outside temperature increases, the money spent on environmental control first decreases as less heating is required and then increases as greater cooling is required.

FIGURE 2.29 Example of Nonlinear Relationship Producing a Correlation Coefficient Near Zero

[Scatter chart. Horizontal axis: Outside Temperature (°F), 20 to 100. Vertical axis: Dollars Spent on Environmental Control, $0 to $1,600.]

We can compute correlation coefficients using the Excel function CORREL. The correlation coefficient in Figure 2.27 is computed in cell B18 for the sales of bottled water using the formula =CORREL(A2:A15, B2:B15), where A2:A15 defines the range for the x variable and B2:B15 defines the range for the y variable.
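To see how a strong nonlinear relationship can produce a correlation coefficient near zero, consider the following sketch. The data are hypothetical and constructed only for illustration (they are not the 100 observations behind Figure 2.29): spending is lowest at 60°F and rises symmetrically as the temperature moves away from 60°F in either direction.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation coefficient; the (n - 1) factors cancel."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical temperatures: 20, 30, ..., 100 degrees F.
temps = list(range(20, 101, 10))

# Hypothetical spending: a U-shaped function of temperature, minimized at 60.
spend = [200 + (t - 60) ** 2 for t in temps]

r = pearson_r(temps, spend)
print(f"r = {r:.6f}")
```

Spending here is completely determined by temperature, yet the positive and negative deviation products cancel, so the coefficient is essentially zero. This is why a scatter chart should accompany any correlation calculation.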

Notes + Comments

1. The correlation coefficient discussed in this chapter was developed by Karl Pearson and is sometimes referred to as the Pearson product moment correlation coefficient. It is appropriate for use only with two quantitative variables. A variety of alternatives, such as the Spearman rank-correlation coefficient, exist to measure the association of categorical variables. The Spearman rank-correlation coefficient is discussed in Chapter 11.
2. Correlation measures only the association between two variables. A large positive or large negative correlation coefficient does not indicate that a change in the value of one of the two variables causes a change in the value of the other variable.

2.9 Data Cleansing

The data in a data set are often said to be “dirty” and “raw” before they have been put into a form that is best suited for investigation, analysis, and modeling. Data preparation makes heavy use of the descriptive statistics and data-visualization methods to gain an understanding of the data. Common tasks in data preparation include treating missing data, identifying erroneous data and outliers, and defining the appropriate way to represent variables.

Missing Data

Data sets commonly include observations with missing values for one or more variables. In some cases missing data naturally occur; these are called legitimately missing data. For example, respondents to a survey may be asked if they belong to a fraternity or a sorority, and then in the next question are asked how long they have belonged to a fraternity or a sorority. If a respondent does not belong to a fraternity or a sorority, she or he should skip the ensuing question about how long. Generally no remedial action is taken for legitimately missing data.

In other cases missing data occur for different reasons; these are called illegitimately missing data. These cases can result for a variety of reasons, such as a respondent electing not to answer a question that she or he is expected to answer, a respondent dropping out of a study before its completion, or sensors or other electronic data collection equipment failing during a study. Remedial action is considered for illegitimately missing data. The primary options for addressing such missing data are (1) to discard observations (rows) with any missing values, (2) to discard any variable (column) with missing values, (3) to fill in missing entries with estimated values, or (4) to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values.

Deciding on a strategy for dealing with missing data requires some understanding of why the data are missing and the potential impact these missing values might have on an analysis. If the tendency for an observation to be missing the value for some variable is entirely random, then whether data are missing does not depend on either the value of the missing data or the value of any other variable in the data. In such cases the missing value is called missing completely at random (MCAR). For example, if a missing value for a question on a survey is completely unrelated to the value that is missing and is also completely unrelated to the value of any other question on the survey, the missing value is MCAR. However, the occurrence of some missing values may not be completely at random.

If the tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data, the missing value is called missing at random (MAR). For data that are MAR, the reason for the missing values may determine their importance. For example, if the responses to one survey question collected by a specific employee were lost due to a data entry error, then the treatment of the missing data may be less critical. However, in a health care study, suppose observations corresponding to patient visits are missing the results of diagnostic tests whenever the doctor deems the patient too sick to undergo the procedure. In this case, the absence of a variable measurement actually provides additional information about the patient’s condition, which may be helpful in understanding other relationships in the data.

A third category of missing data is missing not at random (MNAR). Data are MNAR if the tendency for the value of a variable to be missing is related to the value that is missing. For example, survey respondents with high incomes may be less inclined than respondents with lower incomes to respond to the question on annual income, and so these missing data for annual income are MNAR.

Understanding which of these three categories (MCAR, MAR, and MNAR) missing values fall into is critical in determining how to handle missing data. If a variable has observations for which the missing values are MCAR or MAR and only a relatively small number of observations are missing values, the observations that are missing values can be ignored. We will certainly lose information if the observations that are missing values for the variable are ignored, but the results of an analysis of the data will not be biased by the missing values. If a variable has observations for which the missing values are MNAR, the observations with missing values cannot be ignored because any analysis that includes the variable with MNAR values will be biased.

If the variable with MNAR values is thought to be redundant with another variable in the data for which there are few or no missing values, removing the MNAR variable from consideration may be an option. In particular, if the MNAR variable is highly correlated with another variable that is known for a majority of observations, the loss of information may be minimal.

Whether the missing values are MCAR, MAR, or MNAR, the first course of action when faced with missing values is to try to determine the actual value that is missing by examining the source of the data or logically determining the likely value that is missing. If the missing values cannot be determined and ignoring missing values or removing a variable with missing values from consideration is not an option, imputation (systematic replacement of missing values with values that seem reasonable) may be useful. Options for replacing the missing entries for a variable include replacing the missing value with the variable’s mode, mean, or median. Imputing values in this manner is truly valid only if variable values are MCAR; otherwise, we may be introducing misleading information into the data. If missing values are particularly troublesome and MAR, it may be possible to build a model to predict a variable with missing values and then to use these predictions in place of the missing entries. How to deal with missing values is fairly subjective, and caution must be used to not induce bias by replacing missing values.
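A minimal sketch of imputation with the mode, mean, or median is shown below. The function name `impute`, the use of `None` to mark a missing entry, and the income values are all assumptions made for illustration; as noted above, this kind of replacement is truly valid only when the values are MCAR.

```python
from statistics import mean, median, mode

def impute(values, strategy="median"):
    """Replace missing entries (None) with the mean, median, or mode
    of the observed entries."""
    strategies = {"mean": mean, "median": median, "mode": mode}
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy}")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = strategies[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical annual incomes in $1,000s; None marks a missing response.
incomes = [42, 55, None, 61, 55, None, 300]

print("Median imputation:", impute(incomes, "median"))
print("Mean imputation:  ", impute(incomes, "mean"))
print("Mode imputation:  ", impute(incomes, "mode"))
```

In this made-up example the single large income pulls the mean well above the median, so the choice of replacement value matters; it should be made with the shape of the observed distribution in mind.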

Blakely Tires

Blakely Tires is a U.S. producer of automobile tires. In an attempt to learn about the conditions of its tires on automobiles in Texas, the company has obtained information for each of the four tires from 116 automobiles with Blakely brand tires that have been collected through recent state automobile inspection facilities in Texas. The data obtained by Blakely include the position of the tire on the automobile (left front, left rear, right front, right rear), age of the tire, mileage on the tire, and depth of the remaining tread on the tire. Before Blakely management attempts to learn more about its tires on automobiles in Texas, it wants to assess the quality of these data.

The tread depth of a tire is a vertical measurement from the top of the tread rubber to the bottom of the tire’s deepest grooves, and is measured in 32nds of an inch in the United States. New Blakely brand tires have a tread depth of 10/32nds of an inch, and a tire’s tread depth is considered insufficient if it is 2/32nds of an inch or less. Shallow tread depth is dangerous as it results in poor traction and so makes steering the automobile more difficult. Blakely’s tires generally last for four to five years or 40,000 to 60,000 miles.

We begin assessing the quality of these data by determining which (if any) observations have missing values for any of the variables in the TreadWear data. We can do so using Excel’s COUNTBLANK function. After opening the file TreadWear:

Step 1. Enter the heading # of Missing Values in cell G2
Step 2. Enter the heading Life of Tire (Months) in cell H1
Step 3. Enter =COUNTBLANK(C2:C457) in cell H2

The result in cell H2 shows that none of the observations in these data is missing its value for Life of Tire. By repeating this process for the remaining quantitative variables in the data (Tread Depth and Miles) in columns I and J, we determine that there are no missing values for Tread Depth and one missing value for Miles. The first few rows of the resulting Excel spreadsheet are provided in Figure 2.30.

Next we sort all of Blakely’s data on Miles from smallest to largest value to determine which observation is missing its value of this variable. Excel’s sort procedure will list all observations with missing values for the sort variable, Miles, as the last observations in the sorted data.

FIGURE 2.30 Portion of Excel Spreadsheet Showing Number of Missing Values for Variables in TreadWear Data

ID Number   Position on Automobile   Life of Tire (Months)   Tread Depth   Miles
13391487    LR                       58.4                    2.2           2805
21678308    LR                       17.3                    8.3           39371
18414311    RR                       16.5                    8.6           13367
19778103    RR                       8.2                     9.8           1931
16355454    RR                       13.7                    8.9           23992
8952817     LR                       52.8                    3.0           48961
6559652     RR                       14.7                    8.8           4585

                      Life of Tire (Months)   Tread Depth   Miles
# of Missing Values   0                       0             1
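The COUNTBLANK steps have a straightforward analogue in code. The sketch below counts missing entries per variable for a five-row excerpt of the TreadWear data (the rows are taken from Figure 2.32; `None` stands for an empty cell). The `missing_code` argument is a hypothetical extension that also counts a sentinel such as 9999999 when a data set uses one to mark missing values.

```python
def count_missing(values, missing_code=None):
    """Count entries that are empty (None or "") or equal to a sentinel code."""
    count = 0
    for v in values:
        if v is None or v == "":
            count += 1
        elif missing_code is not None and v == missing_code:
            count += 1
    return count

# Excerpt of the TreadWear data, one list per quantitative variable.
columns = {
    "Life of Tire (Months)": [17.1, 17.1, 21.4, 21.4, 21.4],
    "Tread Depth": [8.5, 8.5, 7.7, 7.8, 7.7],
    "Miles": [21378, None, 33254, 33254, 33254],
}

missing_counts = {name: count_missing(vals) for name, vals in columns.items()}
for name, count in missing_counts.items():
    print(f"{name}: {count} missing")

# The same count when a sentinel value is used to mark missing entries.
coded_miles = [21378, 9999999, 33254, 33254, 33254]
print("Miles (sentinel-coded):", count_missing(coded_miles, missing_code=9999999))
```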


Occasionally missing values in a data set are indicated with a unique value, such as 9999999. Be sure to check to see if a unique value is being used to indicate a missing value in the data.

We can see in Figure 2.31 that the value of Miles is missing from the left front tire of the automobile with ID Number 3354942. Because only one of the 456 observations is missing its value for Miles, this is likely MCAR and so ignoring the observation would not likely bias any analysis we wish to undertake with these data. However, we may be able to salvage this observation by logically determining a reasonable value to substitute for this missing value. It is sensible to suspect that the value of Miles for the left front tire of the automobile with the ID Number 3354942 would be identical to the value of miles for the other three tires on this automobile, so we sort all the data on ID number and scroll through the data to find the four tires that belong to the automobile with the ID Number 3354942. Figure 2.32 shows that the value of Miles for the other three tires on the automobile with the ID Number 3354942 is 33,254, so this may be a reasonable value for the Miles of the left front tire of the automobile with the ID Number 3354942. However, before substituting this value for the missing value of the left front tire of the automobile with ID Number 3354942, we should attempt to ascertain (if possible) that this value is valid—there are legitimate reasons why a driver might replace a single tire. In this instance we will assume that the correct value of Miles for the left front tire on the automobile with the ID Number 3354942 is 33,254 and substitute that number in the appropriate cell of the spreadsheet.
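The logic of borrowing the Miles value from the other tires on the same automobile can be sketched as follows. The rows are an excerpt based on Figure 2.32, and the dictionary layout and variable names are assumptions made for this example. The code only proposes a replacement when all the other tires on the automobile agree; it does not overwrite the data, since the proposed value should still be checked.

```python
from collections import defaultdict

# Excerpt of the TreadWear data; None marks the missing Miles value.
rows = [
    {"id": 3121851, "position": "LF", "miles": 21378},
    {"id": 3354942, "position": "LF", "miles": None},
    {"id": 3354942, "position": "RF", "miles": 33254},
    {"id": 3354942, "position": "RR", "miles": 33254},
    {"id": 3354942, "position": "LR", "miles": 33254},
]

# Collect the distinct known Miles values for each automobile.
known_miles = defaultdict(set)
for row in rows:
    if row["miles"] is not None:
        known_miles[row["id"]].add(row["miles"])

# Propose a value only when the automobile's other tires all agree.
proposals = {}
for row in rows:
    if row["miles"] is None:
        candidates = known_miles.get(row["id"], set())
        if len(candidates) == 1:
            proposals[(row["id"], row["position"])] = next(iter(candidates))

print(proposals)
```

If the other tires disagreed, or if no other tires from the automobile were in the data, no proposal would be made and the observation would need to be handled another way.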

FIGURE 2.31 Portion of Excel Spreadsheet Showing TreadWear Data Sorted on Miles from Lowest to Highest Value

Note that we have hidden rows 5 through 454.

Row   ID Number   Position on Automobile   Life of Tire (Months)   Tread Depth   Miles
2     15890813    LF                       16.1                    8.6           206
3     15890813    LR                       16.1                    8.6           206
4     15890813    RF                       16.1                    8.6           206
455   9306585     RR                       45.4                    4.1           107237
456   9306585     LF                       45.4                    4.1           107237
457   3354942     LF                       17.1                    8.5           (blank)

                      Life of Tire (Months)   Tread Depth   Miles
# of Missing Values   0                       0             1

FIGURE 2.32 Portion of Excel Spreadsheet Showing TreadWear Data Sorted from Lowest to Highest by ID Number

Row   ID Number   Position on Automobile   Life of Tire (Months)   Tread Depth   Miles
54    3121851     LR                       17.1                    8.4           21378
55    3121851     RR                       17.1                    8.4           21378
56    3121851     RF                       17.1                    8.4           21378
57    3121851     LF                       17.1                    8.5           21378
58    3354942     LF                       17.1                    8.5           (blank)
59    3354942     RF                       21.4                    7.7           33254
60    3354942     RR                       21.4                    7.8           33254
61    3354942     LR                       21.4                    7.7           33254
62    3374739     RR                       73.3                    0.2           57313
63    3574739     RF                       73.3                    0.2           57313
64    3574739     LF                       73.3                    0.2           57313
65    3574739     LR                       73.3                    0.2           57313


Identification of Erroneous Outliers and Other Erroneous Values

Examining the variables in the data set by use of summary statistics, frequency distributions, bar charts and histograms, z-scores, scatter charts, correlation coefficients, and other tools can uncover data-quality issues and outliers. For example, finding the minimum or maximum value for Tread Depth in the TreadWear data may reveal unrealistic values, perhaps even negative values, for Tread Depth, which would indicate a problem for the value of Tread Depth for any such observation.

It is important to note here that many software packages, including Excel, ignore missing values when calculating various summary statistics such as the mean, standard deviation, minimum, and maximum. However, if missing values in a data set are indicated with a unique value (such as 9999999), these values may be used by software when calculating various summary statistics such as the mean, standard deviation, minimum, and maximum. Both cases can result in misleading values for summary statistics, which is why many analysts prefer to deal with missing data issues prior to using summary statistics to attempt to identify erroneous outliers and other erroneous values in the data.

We again consider the Blakely tire data. We calculate the mean and standard deviation of each variable (age of the tire, mileage on the tire, and depth of the remaining tread on the tire) to assess whether values of these variables are reasonable in general. Return to the file TreadWear and complete the following steps:

Step 1. Enter the heading Mean in cell G3
Step 2. Enter the heading Standard Deviation in cell G4
Step 3. Enter =AVERAGE(C2:C457) in cell H3
Step 4. Enter =STDEV.S(C2:C457) in cell H4

If you do not have good information on what are reasonable values for a variable, you can use z-scores to identify outliers to be investigated.

The results in cells H3 and H4 show that the mean and standard deviation for life of tires are 23.8 months and 31.83 months, respectively. These values appear to be reasonable for the life of tires in months. By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the mean and standard deviation for tread depth are 7.62/32nds of an inch and 2.47/32nds of an inch, respectively, and the mean and standard deviation for miles are 25,440.22 and 23,600.21, respectively. These values appear to be reasonable for tread depth and miles. The results of this analysis are provided in Figure 2.33.

Summary statistics only provide an overall perspective on the data. We also need to attempt to determine if there are any erroneous individual values for our three variables. We start by finding the minimum and maximum values for each variable. Return again to the file TreadWear and complete the following steps:

Step 1. Enter the heading Minimum in cell G5
Step 2. Enter the heading Maximum in cell G6
Step 3. Enter =MIN(C2:C457) in cell H5
Step 4. Enter =MAX(C2:C457) in cell H6
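The AVERAGE, STDEV.S, MIN, and MAX steps can be mirrored in a few lines of code. The sketch below uses an 11-value excerpt of Life of Tire (Months) taken from the rows shown in Figure 2.34, so its mean and standard deviation will not match the full-data results. The sentinel `MISSING_CODE` and the cutoff `MAX_PLAUSIBLE_MONTHS` are hypothetical values chosen for illustration, not thresholds given in the text.

```python
from statistics import mean, stdev

# Excerpt of Life of Tire (Months) values from the sorted TreadWear data.
life_months = [1.8, 1.8, 1.8, 2.1, 2.1, 73.3, 73.3, 73.3, 73.3, 111.0, 601.0]

MISSING_CODE = 9999999        # hypothetical sentinel for a missing value
MAX_PLAUSIBLE_MONTHS = 100    # hypothetical cutoff for a believable tire age

# Drop empty entries and sentinel codes before summarizing, so that
# neither distorts the statistics.
observed = [v for v in life_months if v is not None and v != MISSING_CODE]

summary = {
    "mean": mean(observed),
    "std_dev": stdev(observed),   # sample standard deviation, like STDEV.S
    "min": min(observed),
    "max": max(observed),
}

# Individual values that exceed the plausibility cutoff need investigation.
suspects = [v for v in observed if v > MAX_PLAUSIBLE_MONTHS]

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print("Values to investigate:", suspects)
```

A flagged value is only a candidate for investigation; whether it is actually erroneous has to be settled by checking it against other information, such as the other tires on the same automobile.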

FIGURE 2.33  Portion of Excel Spreadsheet Showing the Mean and Standard Deviation for Each Variable in the TreadWear Data

Row  ID Number  Position on Automobile  Life of Tire (Months)  Tread Depth  Miles
2    80441      LR                      19.0                   8.1          37419
3    80441      LF                      19.0                   8.1          37419
4    80441      RR                      19.0                   8.2          37419
5    80441      RF                      19.0                   8.1          37419
6    95990      RR                      8.6                    9.7          5670
7    95990      LR                      8.6                    9.7          5670
8    95990      LF                      8.6                    9.7          5670

(Columns G through J)  Life of Tire (Months)  Tread Depth  Miles
# of Missing Values    0                      0            1
Mean                   23.80                  7.68         25440.22
Standard Deviation     31.82                  2.62         23600.21

FIGURE 2.34  Portion of Excel Spreadsheet Showing the TreadWear Data Sorted on Life of Tires (Months) from Lowest to Highest Value

Row  ID Number  Position on Automobile  Life of Tire (Months)  Tread Depth  Miles
2    9091771    RF                      1.8                    10.8         2917
3    9091771    RR                      1.8                    10.7         2917
4    9091771    LF                      1.8                    10.7         2917
5    7712178    LF                      2.1                    10.7         2186
6    7712178    RR                      2.1                    10.7         2186
452  3574739    RR                      73.3                   0.2          57313
453  3574739    LF                      73.3                   0.2          57313
454  3574739    LR                      73.3                   0.2          57313
455  3574739    LR                      73.3                   0.2          57313
456  2122934    LR                      111.0                  9.3          21000
457  8696859    LR                      601.0                  2.0          26129

(Columns G through J)  Life of Tire (Months)  Tread Depth  Miles
# of Missing Values    0                      0            1
Mean                   23.80                  7.68         25440.22
Standard Deviation     31.82                  2.62         23600.21
Minimum                1.8
Maximum                601.0

The results in cells H5 and H6 show that the minimum and maximum values for Life of Tires (Months) are 1.8 months and 601.0 months, respectively. The minimum value of life of tires in months appears to be reasonable, but the maximum (which is equal to slightly over 50 years) is not a reasonable value for Life of Tires (Months). In order to identify the automobile with this extreme value, we again sort the entire data set on Life of Tire (Months) and scroll to the last few rows of the data. We see in Figure 2.34 that the observation with Life of Tire (Months) value of 601.0 is the left rear tire from the automobile with ID Number 8696859. Also note that the left rear tire of the automobile with ID Number 2122934 has a suspiciously high value for Life of Tire (Months) of 111. Sorting the data by ID Number and scrolling until we find the four tires from the automobile with ID Number 8696859, we find the value for Life of Tire (Months) for the other three tires from this automobile is 60.1. This suggests that the
decimal for Life of Tire (Months) for this automobile’s left rear tire value is in the wrong place. Scrolling to find the four tires from the automobile with ID Number 2122934, we find the value for Life of Tire (Months) for the other three tires from this automobile is 11.1, which suggests that the decimal for Life of Tire (Months) for this automobile’s left rear tire value is also misplaced. Both of these erroneous entries can now be corrected.

By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the minimum and maximum values for Tread Depth are 0.0/12ths of an inch and 16.7/12ths of an inch, respectively, and the minimum and maximum values for Miles are 206.0 and 107237.0, respectively. Neither the minimum nor the maximum value for Tread Depth is reasonable; a tire with no tread would not be drivable, and the maximum value for tread depth in the data actually exceeds the tread depth on new Blakely brand tires. The minimum value for Miles is reasonable, but the maximum value is not. A similar investigation should be made into these values to determine if they are in error and if so, what might be the correct value.

Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find. However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge to look for erroneous values. Here we will consider the variables Tread Depth and Miles; because more miles driven should lead to less tread depth on an automobile tire, we expect
these two variables to have a negative relationship. A scatter chart will enable us to see whether any of the tires in the data set have values for Tread Depth and Miles that are counter to this expectation. The red ellipse in Figure 2.35 shows the region in which the points representing Tread Depth and Miles would generally be expected to lie on this scatter plot. The points that lie outside of this ellipse have values for at least one of these variables that is inconsistent with the negative relationship exhibited by the points inside the ellipse. If we position the cursor over one of the points outside the ellipse, Excel will generate a pop-up box that shows that the values of Tread Depth and Miles for this point are 1.0 and 1472.1, respectively. The tire represented by this point has very little tread and has been driven relatively few miles, which suggests that the value of one or both of these two variables for this tire may be inaccurate and should be investigated. Closer examination of outliers and potential erroneous values may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis. A conservative approach is to create two data sets, one with and one without outliers and potentially erroneous values, and then construct a model on both data sets. If a model’s implications depend on the inclusion or exclusion of outliers and erroneous values, then you should spend additional time to track down the cause of the outliers.
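The comparison used earlier to find the misplaced decimals (checking one tire against the other tires on the same automobile) can be automated. The sketch below is one possible approach under stated assumptions: the records are hypothetical (only the left rear positions and the values 60.1, 601.0, 11.1, 111.0, and 19.0 come from the text), and the rule of flagging a value that differs from the automobile's median by more than a factor of 2 is an illustrative choice, not a method prescribed in this chapter.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (ID Number, Position on Automobile, Life of Tire in months).
records = [
    (8696859, "LF", 60.1), (8696859, "RF", 60.1),
    (8696859, "RR", 60.1), (8696859, "LR", 601.0),
    (2122934, "LF", 11.1), (2122934, "RF", 11.1),
    (2122934, "RR", 11.1), (2122934, "LR", 111.0),
    (80441, "LR", 19.0), (80441, "LF", 19.0),
    (80441, "RR", 19.0), (80441, "RF", 19.0),
]


def flag_inconsistent(rows, ratio=2.0):
    """Flag tires whose life differs from their automobile's median by more than `ratio`."""
    by_car = defaultdict(list)
    for car_id, position, life in rows:
        by_car[car_id].append((position, life))

    flagged = []
    for car_id, tires in by_car.items():
        typical = median(life for _, life in tires)
        for position, life in tires:
            # Compare each tire with the typical value for the same automobile.
            if typical > 0 and (life > ratio * typical or life < typical / ratio):
                flagged.append((car_id, position, life, typical))
    return flagged


flagged = flag_inconsistent(records)
for car_id, position, life, typical in flagged:
    print(f"ID {car_id} {position}: {life} months vs. typical {typical} months")
```

Flagged observations are candidates for investigation only; as noted above, whether a value is actually erroneous still has to be determined by examining the record.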

FIGURE 2.35  Scatter Chart of Tread Depth and Miles for the TreadWear Data
[Scatter chart: Tread Depth on the horizontal axis (0.0 to 18.0) and Miles on the vertical axis (0 to 120,000). The pop-up box for one point reads: Series “Miles” Point “1.0” (1.0, 1472.1).]

Variable Representation

In many data-mining applications, it may be prohibitive to analyze the data because of the number of variables recorded. In such cases, the analyst may have to first identify variables that can be safely omitted from further analysis before proceeding with a data-mining technique. Dimension reduction is the process of removing variables from the analysis without losing crucial information. One simple method for reducing the number of variables is to examine pairwise correlations to detect variables or groups of variables that may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.

A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. The treatment of categorical variables is particularly important. Typically, it is best to encode categorical variables with 0–1 dummy variables. Consider a data set that contains the variable Language to track the language preference of callers to a call center. The variable Language with the possible values of English, German, and Spanish would be replaced with three binary variables called English, German, and Spanish. An entry of German would be captured using a 0 for the English dummy variable, a 1 for the German dummy variable, and a 0 for the Spanish dummy variable.

Using 0–1 dummy variables to encode categorical variables with many different categories results in a large number of variables. In these cases, the use of PivotTables is helpful in identifying categories that are similar and can possibly be combined to reduce the number of 0–1 dummy variables. For example, some categorical variables (zip code, product model number) may have many possible categories such that, for the purpose of model building, there is no substantive difference between multiple categories, and therefore the number of categories may be reduced by combining categories.

Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship. Financial data supplying information on stock price and company earnings may not be as useful as the derived variable representing the price/earnings (PE) ratio. A variable tabulating the dollars spent by a household on groceries may not be interesting because this value may depend on the size of the household. Instead, considering the proportion of total household spending on groceries may be more informative.

Notes + Comments

1. Many of the data visualization tools described in Chapter 3 can be used to aid in data cleansing.

2. In some cases, it may be desirable to transform a numerical variable into categories. For example, if we wish to analyze the circumstances in which a numerical outcome variable exceeds a certain value, it may be helpful to create a binary categorical variable that is 1 for observations with the variable value greater than the threshold and 0 otherwise. In another case, if a variable has a skewed distribution, it may be helpful to categorize the values into quantiles.

3. Most dedicated statistical software packages provide functionality to apply a more sophisticated dimension-reduction approach called principal components analysis. Principal components analysis creates a collection of metavariables (components) that are weighted sums of the original variables. These components are not correlated with each other, and often only a few of them are needed to convey the same information as the large set of original variables. In many cases, only one or two components are necessary to explain the majority of the variance in the original variables. Then the analyst can continue to build a data-mining model using just a few of the most explanatory components rather than the entire set of original variables. Although principal components analysis can reduce the number of variables in this manner, it may be harder to explain the results of the model because the interpretation of a component that is a linear combination of variables can be unintuitive.
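The 0–1 dummy-variable encoding described for the Language variable can be illustrated with a short Python sketch. This is only an illustration of the idea: the function name `encode_dummies` and the list of calls are hypothetical, and the three categories are the ones named in the call-center example.

```python
# Categories from the call-center example in the text.
LANGUAGES = ["English", "German", "Spanish"]


def encode_dummies(value, categories=LANGUAGES):
    """Replace one categorical value with a 0-1 dummy variable per category."""
    if value not in categories:
        raise ValueError(f"Unknown category: {value!r}")
    return {category: int(value == category) for category in categories}


# Hypothetical language preferences for four callers.
calls = ["German", "English", "Spanish", "German"]
encoded = [encode_dummies(language) for language in calls]

for language, dummies in zip(calls, encoded):
    print(language, dummies)
```

Each caller is represented by exactly one 1 and two 0s, which matches the entry of German being captured as 0, 1, 0 for the English, German, and Spanish dummy variables.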

Summary

In this chapter we have provided an introduction to descriptive statistics that can be used to summarize data. We began by explaining the need for data collection, defining the types of data one may encounter, and providing a few commonly used sources for finding data. We presented several useful functions for modifying data in Excel, such as sorting and filtering to aid in data analysis.

We introduced the concept of a distribution and explained how to generate frequency, relative frequency, percent frequency, and cumulative distributions for data. We also demonstrated the use of histograms as a way to visualize the distribution of data. We then introduced measures of location for a distribution of data such as mean, median, mode, and geometric mean, as well as measures of variability such as range, variance, standard deviation, coefficient of variation, and interquartile range. We presented additional measures for analyzing a distribution of data including percentiles, quartiles, and z-scores. We showed that boxplots are effective for visualizing a distribution.

We discussed measures of association between two variables. Scatter plots allow one to visualize the relationship between variables. Covariance and the correlation coefficient summarize the linear relationship between variables into a single number.

We also introduced methods for data cleansing. Analysts typically spend large amounts of their time trying to understand and cleanse raw data before applying analytics models. We discussed methods for identifying missing data and how to deal with missing data values and outliers.

Glossary

Bins The nonoverlapping groupings of data used to create a frequency distribution. Bins for categorical data are also known as classes.
Boxplot A graphical summary of data based on the quartiles of a distribution.
Categorical data Data for which categories of like items are identified by labels or names. Arithmetic operations cannot be performed on categorical data.
Coefficient of variation A measure of relative variability computed by dividing the standard deviation by the mean and multiplying by 100.
Correlation coefficient A standardized measure of linear association between two variables that takes on values between -1 and +1. Values near -1 indicate a strong negative linear relationship, values near +1 indicate a strong positive linear relationship, and values near zero indicate the lack of a linear relationship.
Covariance A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
Cross-sectional data Data collected at the same or approximately the same point in time.
Cumulative frequency distribution A tabular summary of quantitative data showing the number of data values that are less than or equal to the upper class limit of each bin.
Data The facts and figures collected, analyzed, and summarized for presentation and interpretation.
Dimension reduction The process of removing variables from the analysis without losing crucial information.
Empirical rule A rule that can be used to compute the percentage of data values that must be within 1, 2, or 3 standard deviations of the mean for data that exhibit a bell-shaped distribution.
Frequency distribution A tabular summary of data showing the number (frequency) of data values in each of several nonoverlapping bins.
Geometric mean A measure of central location that is calculated by finding the nth root of the product of n values.
Growth factor The percentage increase of a value over a period of time is calculated using the formula (growth factor - 1). A growth factor less than 1 indicates negative growth, whereas a growth factor greater than 1 indicates positive growth. The growth factor cannot be less than zero.
Histogram A graphical presentation of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the bin intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.
Illegitimately missing data Missing data that do not occur naturally.
Imputation Systematic replacement of missing values with values that seem reasonable.
Interquartile range The difference between the third and first quartiles.
Legitimately missing data Missing data that occur naturally.
Mean (arithmetic mean) A measure of central location computed by summing the data values and dividing by the number of observations.
Median A measure of central location provided by the value in the middle when the data are arranged in ascending order.
Missing at random (MAR) The tendency for an observation to be missing a value of some variable is related to the value of some other variable(s) in the data.
Missing completely at random (MCAR) The tendency for an observation to be missing a value of some variable is entirely random.
Missing not at random (MNAR) The tendency for an observation to be missing a value of some variable is related to the missing value.
Mode A measure of central location defined as the value that occurs with greatest frequency.
Observation A set of values corresponding to a set of variables.
Outliers Unusually large or unusually small data values.
Percent frequency distribution A tabular summary of data showing the percentage of data values in each of several nonoverlapping bins.
Percentile A value such that approximately p% of the observations have values less than the pth percentile; hence, approximately (100 - p)% of the observations have values greater than the pth percentile. The 50th percentile is the median.
Population The set of all elements of interest in a particular study.
Quantitative data Data for which numerical values are used to indicate magnitude, such as how many or how much. Arithmetic operations such as addition, subtraction, and multiplication can be performed on quantitative data.
Quartiles The 25th, 50th, and 75th percentiles, referred to as the first quartile, second quartile (median), and third quartile, respectively. The quartiles can be used to divide a data set into four parts, with each part containing approximately 25% of the data.
Random sampling Collecting a sample that ensures that (1) each element selected comes from the same population and (2) each element is selected independently.
Random variable, or uncertain variable A quantity whose values are not known with certainty.
Range A measure of variability defined to be the largest value minus the smallest value.
Relative frequency distribution A tabular summary of data showing the fraction or proportion of data values in each of several nonoverlapping bins.
Sample A subset of the population.
Scatter chart A graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other on the vertical axis (scatter chart or scatter plot).
Skewness A measure of the lack of symmetry in a distribution.
Standard deviation A measure of variability computed by taking the positive square root of the variance.
Time series data Data that are collected over a period of time (minutes, hours, days, months, years, etc.).
Variable A characteristic or quantity of interest that can take on different values.
Variance A measure of variability based on the squared deviations of the data values about the mean.
Variation Differences in values of a variable over observations.
z-score A value computed by dividing the deviation about the mean (x_i - x̄) by the standard deviation s. A z-score is referred to as a standardized value and denotes the number of standard deviations that x_i is from the mean.

Problems

1. Wall Street Journal Subscriber Characteristics. A Wall Street Journal subscriber survey asked 46 questions about subscriber characteristics and interests. State whether each of the following questions provides categorical or quantitative data.
a. What is your age?
b. Are you male or female?
c. When did you first start reading the WSJ? High school, college, early career, midcareer, late career, or retirement?
d. How long have you been in your present job or position?
e. What type of vehicle are you considering for your next purchase? Nine response categories include sedan, sports car, SUV, minivan, and so on.

2. Gross Domestic Products. The following table contains a partial list of countries, the continents on which they are located, and their respective gross domestic products (GDPs) in U.S. dollars. A list of 125 countries and their GDPs is contained in the file GDPlist.

Country                  Continent       GDP (Millions of US$)
Afghanistan              Asia            18,181
Albania                  Europe          12,847
Algeria                  Africa          190,709
Angola                   Africa          100,948
Argentina                South America   447,644
Australia                Oceania         1,488,221
Austria                  Europe          419,243
Azerbaijan               Europe          62,321
Bahrain                  Asia            26,108
Bangladesh               Asia            113,032
Belarus                  Europe          55,483
Belgium                  Europe          513,396
Bolivia                  South America   24,604
Bosnia and Herzegovina   Europe          17,965
Botswana                 Africa          17,570

a. Sort the countries in GDPlist from largest to smallest GDP. What are the top 10 countries according to GDP?
b. Filter the countries to display only the countries located in Africa. What are the top 5 countries located in Africa according to GDP?
c. What are the top 5 countries by GDP that are located in Europe?

3. On-Time Performance of Logistics Companies. Ohio Logistics manages the logistical activities for firms by matching companies that need products shipped with carriers that can provide the best rates and best service for the companies. Ohio Logistics is very concerned that its carriers deliver their customers’ material on time, so it carefully monitors the percentage of on-time deliveries. The following table contains a list of the carriers used by Ohio Logistics and the corresponding on-time percentages for the current and previous years.

(Data file: Carriers)

Carrier                  Previous Year On-Time Deliveries (%)   Current Year On-Time Deliveries (%)
Blue Box Shipping        88.4                                   94.8
Cheetah LLC              89.3                                   91.8
Granite State Carriers   81.8                                   87.6
Honsin Limited           74.2                                   80.1
Jones Brothers           68.9                                   82.8
Minuteman Company        91.0                                   84.2
Rapid Response           78.8                                   70.9
Smith Logistics          84.3                                   88.7
Super Freight            92.1                                   86.8

a. Sort the carriers in descending order by their current year’s percentage of on-time deliveries. Which carrier is providing the best service in the current year? Which carrier is providing the worst service in the current year?
b. Calculate the change in percentage of on-time deliveries from the previous to the current year for each carrier. Use Excel’s conditional formatting to highlight the carriers whose on-time percentage decreased from the previous year to the current year.
c. Use Excel’s conditional formatting tool to create data bars for the change in percentage of on-time deliveries from the previous year to the current year for each carrier calculated in part b.
d. Which carriers should Ohio Logistics try to use in the future? Why?

4. Relative Frequency Distribution. A partial relative frequency distribution is given.

Class   Relative Frequency
A       0.22
B       0.18
C       0.40
D

a. What is the relative frequency of class D?
b. The total sample size is 200. What is the frequency of class D?
c. Show the frequency distribution.
d. Show the percent frequency distribution.

5. Most Visited Web Sites. In a recent report, the top five most-visited English-language web sites were google.com (GOOG), facebook.com (FB), youtube.com (YT), yahoo.com (YAH), and wikipedia.com (WIKI). The most-visited web sites for a sample of 50 Internet users are shown in the following table:

(Data file: WebSites)

YAH    WIKI   YT     WIKI   GOOG
YT     YAH    GOOG   GOOG   GOOG
WIKI   GOOG   YAH    YAH    YAH
YAH    YT     GOOG   YT     YAH
GOOG   FB     FB     WIKI   GOOG
GOOG   GOOG   FB     FB     WIKI
FB     YAH    YT     YAH    YAH
YT     GOOG   YAH    FB     FB
WIKI   GOOG   YAH    WIKI   WIKI
YAH    YT     GOOG   GOOG   WIKI

a. Are these data categorical or quantitative?
b. Provide frequency and percent frequency distributions.
c. On the basis of the sample, which web site is most frequently the most-often-visited web site for Internet users? Which is second?

6. CEO Time in Meetings. In a study of how chief executive officers (CEOs) spend their days, it was found that CEOs spend an average of about 18 hours per week in meetings, not including conference calls, business meals, and public events. Shown here are the times spent per week in meetings (hours) for a sample of 25 CEOs:

(Data file: CEOtime)

14 19 23 16 19
15 20 21 15 22
23 15 20 18 21
15 23 21 19 12
18 13 15 18 23

a. What is the least amount of time a CEO spent per week in meetings in this sample? The highest?
b. Use a class width of 2 hours to prepare a frequency distribution and a percent frequency distribution for the data.
c. Prepare a histogram and comment on the shape of the distribution.

7. Complaints Reported to BBB. Consumer complaints are frequently reported to the Better Business Bureau. Industries with the most complaints to the Better Business Bureau are often banks, cable and satellite television companies, collection agencies, cellular phone providers, and new car dealerships. The results for a sample of 200 complaints are in the file BBB.
a. Show the frequency and percent frequency of complaints by industry.
b. Which industry had the highest number of complaints?
c. Comment on the percentage frequency distribution for complaints.

8. Busiest North American Airports. Based on the total passenger traffic, the airports in the following list are the 20 busiest airports in North America in 2018 (The World Almanac).

(Data file: Airports)

Airport (Airport Code)             Total Passengers (Million)
Boston Logan (BOS)                 36.3
Charlotte Douglas (CLT)            44.4
Chicago O’Hare (ORD)               78
Dallas/Ft. Worth (DFW)             65.7
Denver (DEN)                       58.3
Detroit Metropolitan (DTW)         34.4
Hartsfield-Jackson Atlanta (ATL)   104.2
Houston George Bush (IAH)          41.6
Las Vegas McCarran (LAS)           47.5
Los Angeles (LAX)                  80.9
Miami (MIA)                        44.6
Minneapolis/St. Paul (MSP)         37.4
New York John F. Kennedy (JFK)     59.1
Newark Liberty (EWR)               40.6
Orlando (MCO)                      41.9
Philadelphia (PHL)                 36.4
Phoenix Sky Harbor (PHX)           43.3
San Francisco (SFO)                53.1
Seattle-Tacoma (SEA)               45.7
Toronto Pearson (YYZ)              44.3

a. Which is the busiest airport in terms of total passenger traffic? Which is the least busy airport in terms of total passenger traffic?
b. Using a class width of 10, develop a frequency distribution of the data starting with 30–39.9, 40–49.9, 50–59.9, and so on.
c. Prepare a histogram. Interpret the histogram.

9. Relative and Percent Frequency Distributions. Consider the following data:

(Data file: Frequency)

14 19 24 19 16 20 24 20 21 22
24 18 17 23 26 22 23 25 25 19
18 16 15 24 21 16 19 21 23 20
22 22 16 16 16 12 25 19 24 20

a. Develop a frequency distribution using classes of 12–14, 15–17, 18–20, 21–23, and 24–26.
b. Develop a relative frequency distribution and a percent frequency distribution using the classes in part a.

10. Cumulative Frequency Distribution. Consider the following frequency distribution.

Class   Frequency
10–19   10
20–29   14
30–39   17
40–49   7
50–59   2

Construct a cumulative frequency distribution.

11. Repair Shop Waiting Times. The owner of an automobile repair shop studied the waiting times for customers who arrive at the shop for an oil change. The following data with waiting times in minutes were collected over a one-month period.

(Data file: RepairShop)

2 5 10 12 4 4 5 17 11 8 9 8 12 21 6 8 7 13 18 3

Using classes of 0–4, 5–9, and so on, show the following:
a. The frequency distribution
b. The relative frequency distribution
c. The cumulative frequency distribution
d. The cumulative relative frequency distribution
e. The proportion of customers needing an oil change who wait 9 minutes or less

12. Largest University Endowments. University endowments are financial assets that are donated by supporters to be used to provide income to universities. There is a large discrepancy in the size of university endowments. The following table provides a listing of many of the universities that have the largest endowments as reported by the National Association of College and University Business Officers in 2017.

(Data file: Endowments)

University                                     Endowment Amount ($ Billion)
Amherst College                                2.2
Boston College                                 2.3
Boston University                              2.0
Brown University                               3.2
California Institute of Technology             2.6
Carnegie Mellon University                     2.2
Case Western Reserve University                1.8
Columbia University                            10.0
Cornell University                             6.8
Dartmouth College                              5.0
Duke University                                7.9
Emory University                               6.9
George Washington University                   1.7
Georgetown University                          1.7
Georgia Institute of Technology                2.0
Grinnell College                               1.9
Harvard University                             36.0
Indiana University                             2.2
Johns Hopkins University                       3.8
Massachusetts Institute of Technology          15.0
Michigan State University                      2.7
New York University                            4.0
Northwestern University                        10.4
Ohio State University                          4.3
Pennsylvania State University                  4.0
Pomona College                                 2.2
Princeton University                           23.8
Purdue University                              2.4
Rice University                                5.8
Rockefeller University                         2.0
Smith College                                  1.8
Stanford University                            24.8
Swarthmore College                             2.0
Texas A&M University                           11.6
Tufts University                               1.7
University of California                       9.8
University of California, Berkeley             1.8
University of California, Los Angeles          2.1
University of Chicago                          7.5
University of Illinois                         2.6
University of Michigan                         10.9
University of Minnesota                        3.5
University of North Carolina at Chapel Hill    3.0
University of Notre Dame                       9.4
University of Oklahoma                         1.6
University of Pennsylvania                     12.2
University of Pittsburgh                       3.9
University of Richmond                         2.4
University of Rochester                        2.1
University of Southern California              5.1
University of Texas                            26.5
University of Virginia                         8.6
University of Washington                       2.5
University of Wisconsin–Madison                2.7
Vanderbilt University                          4.1
Virginia Commonwealth University               1.8
Washington University in St. Louis             7.9
Wellesley College                              1.9
Williams College                               2.5
Yale University                                27.2

Summarize the data by constructing the following:
a. A frequency distribution (classes 0–1.9, 2.0–3.9, 4.0–5.9, 6.0–7.9, and so on).
b. A relative frequency distribution.
c. A cumulative frequency distribution.
d. A cumulative relative frequency distribution.
e. What do these distributions tell you about the endowments of universities?
f. Show a histogram. Comment on the shape of the distribution.
g. What is the largest university endowment and which university holds it?

13. Computing Mean and Median. Consider a sample with data values of 10, 20, 12, 17, and 16.
a. Compute the mean and median.
b. Consider a sample with data values 10, 20, 12, 17, 16, and 12. How would you expect the mean and median for these sample data to compare to the mean and median for part a (higher, lower, or the same)? Compute the mean and median for the sample data 10, 20, 12, 17, 16, and 12.

14. Computing Percentiles. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25. Compute the 20th, 25th, 65th, and 75th percentiles.

15. Computing Mean, Median, and Mode. Consider a sample with data values of 53, 55, 70, 58, 64, 57, 53, 69, 57, 68, and 53. Compute the mean, median, and mode.

16. Mean Annual Growth Rate of Asset. If an asset declines in value from $5,000 to $3,500 over nine years, what is the mean annual growth rate in the asset’s value over these nine years?

17. Comparing Mutual Fund Investments. Suppose that you initially invested $10,000 in the Stivers mutual fund and $5,000 in the Trippi mutual fund. The value of each investment at the end of each subsequent year is provided in the table:

(Data file: StiversTrippi)

Year   Stivers ($)   Trippi ($)
1      11,000        5,600
2      12,000        6,300
3      13,000        6,900
4      14,000        7,600
5      15,000        8,500
6      16,000        9,200
7      17,000        9,900
8      18,000        10,600

Which of the two mutual funds performed better over this time period?

18. Commute Times. The average time that Americans commute to work is 27.7 minutes (Sterling’s Best Places). The average commute times in minutes for 48 cities are as follows:

(Data file: CommuteTimes)

City               Commute Time (Minutes)
Albuquerque        23.3
Atlanta            28.3
Austin             24.6
Baltimore          32.1
Boston             31.7
Charlotte          25.8
Chicago            38.1
Cincinnati         24.9
Cleveland          26.8
Columbus           23.4
Dallas             28.5
Denver             28.1
Detroit            29.3
El Paso            24.4
Fresno             23.0
Indianapolis       24.8
Jacksonville       26.2
Kansas City        23.4
Las Vegas          28.4
Little Rock        20.1
Los Angeles        32.2
Louisville         21.4
Memphis            23.8
Miami              30.7
Milwaukee          24.8
Minneapolis        23.6
Nashville          25.3
New Orleans        31.7
New York           43.8
Oklahoma City      22.0
Orlando            27.1
Philadelphia       34.2
Phoenix            28.3
Pittsburgh         25.0
Portland           26.4
Providence         23.6
Richmond           23.4
Sacramento         25.8
Salt Lake City     20.2
San Antonio        26.1
San Diego          24.8
San Francisco      32.6
San Jose           28.5
Seattle            27.3
St. Louis          26.8
Tucson             24.0
Tulsa              20.1
Washington, D.C.   32.8

a. What is the mean commute time for these 48 cities?
b. What is the median commute time for these 48 cities?
c. What is the mode for these 48 cities?
d. What is the variance and standard deviation of commute times for these 48 cities?
e. What is the third quartile of commute times for these 48 cities?

19. Patient Waiting Times. Suppose that the average waiting time for a patient at a physician’s office is just over 29 minutes. To address the issue of long patient wait times, some physicians’ offices are using wait-tracking systems to notify patients of expected wait times. Patients can adjust their arrival times based on this information and spend less time in waiting rooms. The following data show wait times (in minutes) for a sample of patients at offices that do not have a wait-tracking system and wait times for a sample of patients at offices with such systems.

(Data file: PatientWaits)

Without Wait-Tracking System   With Wait-Tracking System
24                             31
67                             11
17                             14
20                             18
31                             12
44                             37
12                             9
23                             13
16                             12
37                             15

a. What are the mean and median patient wait times for offices with a wait-tracking system? What are the mean and median patient wait times for offices without a wait-tracking system?
b. What are the variance and standard deviation of patient wait times for offices with a wait-tracking system? What are the variance and standard deviation of patient wait times for visits to offices without a wait-tracking system?
c. Create a boxplot for patient wait times for offices without a wait-tracking system.
d. Create a boxplot for patient wait times for offices with a wait-tracking system.
e. Do offices with a wait-tracking system have shorter patient wait times than offices without a wait-tracking system? Explain.

20. Number of Hours Worked per Week by Teachers. According to the National Education Association (NEA), teachers generally spend more than 40 hours each week working on instructional duties. The following data show the number of hours worked per week for a sample of 13 high school science teachers and a sample of 11 high school English teachers.

(Data file: Teachers)

High school science teachers: 53 56 54 54 55 58 49 61 54 54 52 53 54
High school English teachers: 52 47 50 46 47 48 49 46 55 44 47

a. What is the median number of hours worked per week for the sample of 13 high school science teachers?
b. What is the median number of hours worked per week for the sample of 11 high school English teachers?
c. Create a boxplot for the number of hours worked for high school science teachers.
d. Create a boxplot for the number of hours worked for high school English teachers.
e. Comment on the differences between the boxplots for science and English teachers.


PatientWaits

21. z-Scores for Patient Waiting Times. Return to the waiting times given for the physician’s office in Problem 19.
a. Considering only offices without a wait-tracking system, what is the z-score for the 10th patient in the sample (wait time = 37 minutes)?
b. Considering only offices with a wait-tracking system, what is the z-score for the 6th patient in the sample (wait time = 37 minutes)? How does this z-score compare with the z-score you calculated for part a?
c. Based on z-scores, do the data for offices without a wait-tracking system contain any outliers? Based on z-scores, do the data for offices with a wait-tracking system contain any outliers?

22. Amount of Sleep per Night. The results of a national survey showed that on average adults sleep 6.9 hours per night. Suppose that the standard deviation is 1.2 hours and that the number of hours of sleep follows a bell-shaped distribution.
a. Use the empirical rule to calculate the percentage of individuals who sleep between 4.5 and 9.3 hours per day.
b. What is the z-score for an adult who sleeps 8 hours per night?
c. What is the z-score for an adult who sleeps 6 hours per night?

23. GMAT Exam Scores. The Graduate Management Admission Test (GMAT) is a standardized exam used by many universities as part of the assessment for admission to graduate study in business. The average GMAT score is 547 (Magoosh web site). Assume that GMAT scores are bell-shaped with a standard deviation of 100.
a. What percentage of GMAT scores are 647 or higher?
b. What percentage of GMAT scores are 747 or higher?
c. What percentage of GMAT scores are between 447 and 547?
d. What percentage of GMAT scores are between 347 and 647?

24. Scatter Chart. Five observations taken for two variables follow.

x_i     4     6    11     3    16
y_i    50    50    40    60    30

a. Develop a scatter chart with x on the horizontal axis.
b. What does the scatter chart developed in part a indicate about the relationship between the two variables?
c. Compute and interpret the sample covariance.
d. Compute and interpret the sample correlation coefficient.

25. Company Profits and Market Cap. The scatter chart in the following figure was created using sample data for profits and market capitalizations from a sample of firms in the Fortune 500.

Fortune500

[Figure: scatter chart with Market Cap (millions of $) on the vertical axis, from 0 to 200,000, and Profits (millions of $) on the horizontal axis, up to 16,000]


a. Discuss what the scatter chart indicates about the relationship between profits and market capitalization.
b. The data used to produce this chart are contained in the file Fortune500. Calculate the covariance between profits and market capitalization. Discuss what the covariance indicates about the relationship between profits and market capitalization.
c. Calculate the correlation coefficient between profits and market capitalization. What does the correlation coefficient indicate about the relationship between profits and market capitalization?

26. Jobless Rate and Percent of Delinquent Loans. The economic downturn in 2008–2009 resulted in the loss of jobs and an increase in delinquent loans for housing. In projecting where the real estate market was headed in the coming year, economists studied the relationship between the jobless rate and the percentage of delinquent loans. The expectation was that if the jobless rate continued to increase, there would also be an increase in the percentage of delinquent loans. The following data show the jobless rate and the delinquent loan percentage for 27 major real estate markets.

JoblessRate

Metro Area        Jobless Rate (%)    Delinquent Loans (%)
Atlanta                  7.1                  7.02
Boston                   5.2                  5.31
Charlotte                7.8                  5.38
Chicago                  7.8                  5.40
Dallas                   5.8                  5.00
Denver                   5.8                  4.07
Detroit                  9.3                  6.53
Houston                  5.7                  5.57
Jacksonville             7.3                  6.99
Las Vegas                7.6                 11.12
Los Angeles              8.2                  7.56
Miami                    7.1                 12.11
Minneapolis              6.3                  4.39
Nashville                6.6                  4.78
New York                 6.2                  5.78
Orange County            6.3                  6.08
Orlando                  7.0                 10.05
Philadelphia             6.2                  4.75
Phoenix                  5.5                  7.22
Portland                 6.5                  3.79
Raleigh                  6.0                  3.62
Sacramento               8.3                  9.24
St. Louis                7.5                  4.40
San Diego                7.1                  6.91
San Francisco            6.8                  5.57
Seattle                  5.5                  3.87
Tampa                    7.5                  8.42

Source: The Wall Street Journal, January 27, 2009.

JavaCup

a. Compute the correlation coefficient. Is there a positive correlation between the jobless rate and the percentage of delinquent housing loans? What is your interpretation?
b. Show a scatter chart of the relationship between the jobless rate and the percentage of delinquent housing loans.

27. Java Cup Taste Data. Huron Lakes Candies (HLC) has developed a new candy bar called Java Cup that is a milk chocolate cup with a coffee-cream center. In order to assess the market potential of Java Cup, HLC has developed a taste test and follow-up survey. Respondents were asked to taste Java Cup and then rate Java Cup’s taste, texture, creaminess of filling, sweetness, and depth of the chocolate flavor of the cup on a 100-point scale. The taste test and survey were administered to 217 randomly selected adult consumers. Data collected from each respondent are provided in the file JavaCup.
a. Are there any missing values in HLC’s survey data? If so, identify the respondents for which data are missing and which values are missing for each of these respondents.


AttendMLB

b. Are there any values in HLC’s survey data that appear to be erroneous? If so, identify the respondents for which data appear to be erroneous and which values appear to be erroneous for each of these respondents.

28. Major League Baseball Attendance. Marilyn Marshall, a professor of sports economics, has obtained a data set of home attendance for each of the 30 major league baseball franchises for each season from 2010 through 2016. Dr. Marshall suspects the data, provided in the file AttendMLB, are in need of a thorough cleansing. You should also find a reliable source of Major League Baseball attendance for each franchise between 2010 and 2016 to use to help you identify appropriate imputation values for data missing in the AttendMLB file.
a. Are there any missing values in Dr. Marshall’s data? If so, identify the teams and seasons for which data are missing and which values are missing for each of these teams and seasons. Use the reliable source of Major League Baseball attendance for each franchise between 2010 and 2016 you have found to find the correct value in each instance.
b. Are there any values in Dr. Marshall’s data that appear to be erroneous? If so, identify the teams and seasons for which data appear to be erroneous and which values appear to be erroneous for each of these teams and seasons. Use the reliable source of Major League Baseball attendance for each franchise between 2010 and 2016 you have found to find the correct value in each instance.

Case Problem 1: Heavenly Chocolates Web Site Transactions

Heavenly Chocolates manufactures and sells quality chocolate products at its plant and retail store located in Saratoga Springs, New York. Two years ago, the company developed a web site and began selling its products over the Internet. Web site sales have exceeded the company’s expectations, and management is now considering strategies to increase sales even further. To learn more about the web site customers, a sample of 50 Heavenly Chocolates transactions was selected from the previous month’s sales. Data showing the day of the week each transaction was made, the type of browser the customer used, the time spent on the web site, the number of web pages viewed, and the amount spent by each of the 50 customers are contained in the file HeavenlyChocolates. A portion of the data is shown in the table that follows:

HeavenlyChocolates

Customer    Day    Browser    Time (min)    Pages Viewed    Amount Spent ($)
    1       Mon    Chrome        12.0             4               54.52
    2       Wed    Other         19.5             6               94.90
    3       Mon    Chrome         8.5             4               26.68
    4       Tue    Firefox       11.4             2               44.73
    5       Wed    Chrome        11.3             4               66.27
    6       Sat    Firefox       10.5             6               67.80
    7       Sun    Chrome        11.4             2               36.04
    .        .       .             .              .                 .
    .        .       .             .              .                 .
    .        .       .             .              .                 .
   48       Fri    Chrome         9.7             5              103.15
   49       Mon    Other          7.3             6               52.15
   50       Fri    Chrome        13.4             3               98.75


Heavenly Chocolates would like to use the sample data to determine whether online shoppers who spend more time and view more pages also spend more money during their visit to the web site. The company would also like to investigate the effect that the day of the week and the type of browser have on sales.

Managerial Report

Use the methods of descriptive statistics to learn about the customers who visit the Heavenly Chocolates web site. Include the following in your report.
1. Graphical and numerical summaries for the length of time the shopper spends on the web site, the number of pages viewed, and the mean amount spent per transaction. Discuss what you learn about Heavenly Chocolates’ online shoppers from these numerical summaries.
2. Summarize the frequency, the total dollars spent, and the mean amount spent per transaction for each day of week. Discuss the observations you can make about Heavenly Chocolates’ business based on the day of the week.
3. Summarize the frequency, the total dollars spent, and the mean amount spent per transaction for each type of browser. Discuss the observations you can make about Heavenly Chocolates’ business based on the type of browser.
4. Develop a scatter chart, and compute the sample correlation coefficient to explore the relationship between the time spent on the web site and the dollar amount spent. Use the horizontal axis for the time spent on the web site. Discuss your findings.
5. Develop a scatter chart, and compute the sample correlation coefficient to explore the relationship between the number of web pages viewed and the amount spent. Use the horizontal axis for the number of web pages viewed. Discuss your findings.
6. Develop a scatter chart, and compute the sample correlation coefficient to explore the relationship between the time spent on the web site and the number of pages viewed. Use the horizontal axis to represent the number of pages viewed. Discuss your findings.

Case Problem 2: African Elephant Populations

Although millions of elephants once roamed across Africa, by the mid-1980s elephant populations in African nations had been devastated by poaching. Elephants are important to African ecosystems. In tropical forests, elephants create clearings in the canopy that encourage new tree growth. In savannas, elephants reduce bush cover to create an environment that is favorable to browsing and grazing animals. In addition, the seeds of many plant species depend on passing through an elephant’s digestive tract before germination.

The status of the elephant now varies greatly across the continent. In some nations, strong measures have been taken to effectively protect elephant populations; for example, Kenya has destroyed over five tons of elephant ivory confiscated from poachers in an attempt to deter the growth of illegal ivory trade (Associated Press, July 20, 2011). In other nations the elephant populations remain in danger due to poaching for meat and ivory, loss of habitat, and conflict with humans. The table below shows elephant populations for several African nations in 1979, 1989, 2007, and 2012 (ElephantDatabase.org web site).

The David Sheldrick Wildlife Trust was established in 1977 to honor the memory of naturalist David Leslie William Sheldrick, who was the founding warden of Tsavo East National Park in Kenya and headed the Planning Unit of the Wildlife Conservation and Management Department in that country. Management of the Sheldrick Trust would like to know what these data indicate about elephant populations in various African countries since 1979.


AfricanElephants

                          Elephant Population
Country               1979       1989       2007       2012
Angola              12,400     12,400      2,530      2,530
Botswana            20,000     51,000    175,487    175,454
Cameroon            16,200     21,200     15,387     14,049
Cen African Rep     63,000     19,000      3,334      2,285
Chad                15,000      3,100      6,435      3,004
Congo               10,800     70,000     22,102     49,248
Dem Rep of Congo   377,700     85,000     23,714     13,674
Gabon               13,400     76,000     70,637     77,252
Kenya               65,000     19,000     31,636     36,260
Mozambique          54,800     18,600     26,088     26,513
Somalia             24,300      6,000         70         70
Tanzania           316,300     80,000    167,003    117,456
Zambia             150,000     41,000     29,231     21,589
Zimbabwe            30,000     43,000     99,107    100,291

Managerial Report

Use methods of descriptive statistics to summarize the data and comment on changes in elephant populations since 1979. Include the following in your report.
1. Use the geometric mean calculation to find the mean annual change in elephant population for each country in the 10 years from 1979 to 1989, and discuss which countries saw the largest changes in elephant population over this 10-year period.
2. Use the geometric mean calculation to find the mean annual change in elephant population for each country in the 18 years from 1989 to 2007, and discuss which countries saw the largest changes in elephant population over this 18-year period.
3. Use the geometric mean calculation to find the mean annual change in elephant population for each country in the 5 years from 2007 to 2012, and discuss which countries saw the largest changes in elephant population over this 5-year period.
4. Create a multiple boxplot graph that includes boxplots of the elephant population observations in each year (1979, 1989, 2007, 2012). Use these boxplots and the results of your analyses in points 1 through 3 above to comment on how the populations of elephants have changed during these time periods.
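The geometric mean calculation requested in points 1 through 3 can be sketched in a few lines of Python. When only the first and last counts of a period are known, the product of the n annual growth factors equals the ending count divided by the starting count, so the geometric mean growth factor is the nth root of that ratio. The function name `mean_annual_change` and the sample numbers below are illustrative assumptions; they are not part of the AfricanElephants file.

```python
def mean_annual_change(start, end, years):
    """Geometric mean annual rate of change between two counts.

    The geometric mean of the annual growth factors is (end / start) ** (1 / years);
    subtracting 1 turns that factor into a rate (negative for a decline).
    """
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end, and years must all be positive")
    growth_factor = (end / start) ** (1 / years)
    return growth_factor - 1


# Illustrative numbers only (hypothetical, not taken from the case data):
# a population that grows from 100 to 121 over 2 years.
rate = mean_annual_change(start=100, end=121, years=2)
print(f"Mean annual change: {rate:.1%}")  # prints "Mean annual change: 10.0%"
```

Applying the resulting rate to the starting count for the full number of years reproduces the ending count, which is a convenient check on any value computed this way.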

Chapter 3
Data Visualization

Contents
Analytics in Action: Cincinnati Zoo & Botanical Garden
3.1 OVERVIEW OF DATA VISUALIZATION
    Effective Design Techniques
3.2 TABLES
    Table Design Principles
    Crosstabulation
    PivotTables in Excel
    Recommended PivotTables in Excel
3.3 CHARTS
    Scatter Charts
    Recommended Charts in Excel
    Line Charts
    Bar Charts and Column Charts
    A Note on Pie Charts and Three-Dimensional Charts
    Bubble Charts
    Heat Maps
    Additional Charts for Multiple Variables
    PivotCharts in Excel
3.4 ADVANCED DATA VISUALIZATION
    Advanced Charts
    Geographic Information Systems Charts
3.5 DATA DASHBOARDS
    Principles of Effective Data Dashboards
    Applications of Data Dashboards
Summary
Glossary
Problems
APPENDIX: DATA VISUALIZATION IN TABLEAU
AVAILABLE IN THE MINDTAP READER:
APPENDIX: CREATING TABULAR AND GRAPHICAL PRESENTATIONS WITH R

Analytics in Action

Cincinnati Zoo & Botanical Garden¹

The Cincinnati Zoo & Botanical Garden, located in Cincinnati, Ohio, is one of the oldest zoos in the United States. In 2019, it was named the best zoo in North America by USA Today. To improve decision making by becoming more data-driven, management decided they needed to link the various facets of their business and provide nontechnical managers and executives with an intuitive way to better understand their data. A complicating factor is that when the zoo is busy, managers are expected to be on the grounds interacting with guests, checking on operations, and dealing with issues as they arise or anticipating them. Therefore, being able to monitor what is happening in real time was a key factor in deciding what to do. Zoo management concluded that a data-visualization strategy was needed to address the problem.

Because of its ease of use, real-time updating capability, and iPad compatibility, the Cincinnati Zoo decided to implement its data-visualization strategy using IBM’s Cognos advanced data-visualization software. Using this software, the Cincinnati Zoo developed the set of charts shown in Figure 3.1 (known as a data dashboard) to enable management to track the following key measures of performance:
● Item analysis (sales volumes and sales dollars by location within the zoo)
● Geoanalytics (using maps and displays of where the day’s visitors are spending their time at the zoo)
● Customer spending
● Cashier sales performance
● Sales and attendance data versus weather patterns
● Performance of the zoo’s loyalty rewards program

FIGURE 3.1 Data Dashboard for the Cincinnati Zoo

An iPad mobile application was also developed to enable the zoo’s managers to be out on the grounds and still see and anticipate occurrences in real time. The Cincinnati Zoo’s iPad application, shown in Figure 3.2, provides managers with access to the following information:
● Real-time attendance data, including what types of guests are coming to the zoo (members, nonmembers, school groups, and so on)
● Real-time analysis showing which locations are busiest and which items are selling the fastest inside the zoo
● Real-time geographical representation of where the zoo’s visitors live

FIGURE 3.2 The Cincinnati Zoo iPad Data Dashboard

Having access to the data shown in Figures 3.1 and 3.2 allows the zoo managers to make better decisions about staffing levels, which items to stock based on weather and other conditions, and how to better target advertising based on geodemographics. The impact that data visualization has had on the zoo has been substantial. Within the first year of use, the system was directly responsible for revenue growth of over $500,000, increased visitation to the zoo, enhanced customer service, and reduced marketing costs.

¹The authors are indebted to John Lucas of the Cincinnati Zoo & Botanical Garden for providing this application.


The first step in trying to interpret data is often to visualize it in some way. Data visualization can be as simple as creating a summary table, or it could require generating charts to help interpret, analyze, and learn from the data. Data visualization is very helpful for identifying data errors and for reducing the size of your data set by highlighting important relationships and trends. Data visualization is also important in conveying your analysis to others. Although business analytics is about making better decisions, in many cases, the ultimate decision maker is not the person who analyzes the data. Therefore, the person analyzing the data has to make the analysis simple for others to understand. Proper data-visualization techniques greatly improve the ability of the decision maker to interpret the analysis easily. In this chapter we discuss some general concepts related to data visualization to help you analyze data and convey your analysis to others. We cover specifics dealing with how to design tables and charts, as well as the most commonly used charts, and present an overview of some more advanced charts. We also introduce the concept of data dashboards and geographic information systems (GISs). Our detailed examples use Excel to generate tables and charts, and we discuss several software packages that can be used for advanced data visualization.

3.1 Overview of Data Visualization

Decades of research studies in psychology and other fields show that the human mind can process visual images such as charts much faster than it can interpret rows of numbers. However, these same studies also show that the human mind has certain limitations in its ability to interpret visual images and that some images are better at conveying information than others. The goal of this chapter is to introduce some of the most common forms of visualizing data and demonstrate when each form is appropriate.

Microsoft Excel is a ubiquitous tool used in business for basic data visualization. Software tools such as Excel make it easy for anyone to create many standard examples of data visualization. However, as discussed in this chapter, the default settings for tables and charts created with Excel can be altered to increase clarity. New types of software that are dedicated to data visualization have appeared recently. We focus our techniques on Excel in this chapter, but we also mention some of these more advanced software packages for specific data-visualization uses.

Effective Design Techniques

One of the most helpful ideas for creating effective tables and charts for data visualization is the idea of the data-ink ratio, first described by Edward R. Tufte in his book The Visual Display of Quantitative Information (first published in 1983; second edition, 2001). The data-ink ratio measures the proportion of what Tufte terms “data-ink” to the total amount of ink used in a table or chart. Data-ink is the ink used in a table or chart that is necessary to convey the meaning of the data to the audience. Non-data-ink is ink used in a table or chart that serves no useful purpose in conveying the data to the audience.

Let us consider the case of Gossamer Industries, a firm that produces fine silk clothing products. Gossamer is interested in tracking the sales of one of its most popular items, a particular style of women’s scarf. Table 3.1 and Figure 3.3 provide examples of a table and chart with low data-ink ratios used to display sales of this style of women’s scarf. The data used in this table and figure represent product sales by day. Both of these examples are similar to tables and charts generated with Excel using common default settings. In Table 3.1, most of the grid lines serve no useful purpose. Likewise, in Figure 3.3, the horizontal lines in the chart also add little additional information. In both cases, most of these lines can be deleted without reducing the information conveyed. However, an important piece of information is missing from Figure 3.3: labels for axes. Axes should always be labeled in a chart unless both the meaning and unit of measure are obvious.
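Written as a formula, the definition of the data-ink ratio given above is simply a proportion. The notation below is a restatement of that verbal definition, not a formula quoted from the text:

```latex
\text{data-ink ratio} \;=\; \frac{\text{data-ink}}{\text{total ink used in the table or chart}},
\qquad 0 \le \text{data-ink ratio} \le 1
```

Deleting non-data-ink shrinks the denominator while leaving the numerator unchanged, which is why removing unnecessary grid lines raises the ratio.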


Table 3.1 Example of a Low Data-Ink Ratio Table

              Scarf Sales
Day    Sales (units)    Day    Sales (units)
  1        150           11        170
  2        170           12        160
  3        140           13        290
  4        150           14        200
  5        180           15        210
  6        180           16        110
  7        210           17         90
  8        230           18        140
  9        140           19        150
 10        200           20        230

Table 3.2 shows a modified table in which all grid lines have been deleted except for those around the title of the table. Deleting the grid lines in Table 3.1 increases the data-ink ratio because a larger proportion of the ink in the table is used to convey the information (the actual numbers). Similarly, deleting the unnecessary horizontal lines in Figure 3.4 increases the data-ink ratio. Note that deleting these horizontal lines and removing (or reducing the size of) the markers at each data point can make it more difficult to determine the exact values plotted in the chart. However, as we discuss later, a simple chart is not the most effective way of presenting data when the audience needs to know exact values; in these cases, it is better to use a table.

In many cases, white space in a table or a chart can improve readability. This principle is similar to the idea of increasing the data-ink ratio. Consider Table 3.2 and Figure 3.4. Removing the unnecessary lines has increased the “white space,” making it easier to read both the table and the chart. The fundamental idea in creating effective tables and charts is to make them as simple as possible in conveying information to the reader.

FIGURE 3.3 Example of a Low Data-Ink Ratio Chart
[Figure: line chart titled Scarf Sales with a single Sales series; vertical scale from 0 to 350; horizontal scale from 1 to 20]


Table 3.2 Increasing the Data-Ink Ratio by Removing Unnecessary Gridlines

              Scarf Sales
Day    Sales (units)    Day    Sales (units)
  1        150           11        170
  2        170           12        160
  3        140           13        290
  4        150           14        200
  5        180           15        210
  6        180           16        110
  7        210           17         90
  8        230           18        140
  9        140           19        150
 10        200           20        230

FIGURE 3.4 Increasing the Data-Ink Ratio by Adding Labels to Axes and Removing Unnecessary Lines and Labels
[Figure: line chart titled Scarf Sales; vertical axis labeled Sales (Units), from 0 to 350; horizontal axis labeled Day, from 1 to 19 in steps of 2]

Notes + Comments

1. Tables have been used to display data for more than a thousand years. However, charts are much more recent inventions. The famous 17th-century French mathematician, René Descartes, is credited with inventing the now familiar graph with horizontal and vertical axes. William Playfair invented bar charts, line charts, and pie charts in the late 18th century, all of which we will discuss in this chapter. More recently, individuals such as William Cleveland, Edward R. Tufte, and Stephen Few have introduced design techniques for both clarity and beauty in data visualization.
2. Many of the default settings in Excel are not ideal for displaying data using tables and charts that communicate effectively. Before presenting Excel-generated tables and charts to others, it is worth the effort to remove unnecessary lines and labels.


Table 3.3 Table Showing Exact Values for Costs and Revenues by Month for Gossamer Industries

                                       Month
                      1        2        3        4        5        6    Total
Costs ($)        48,123   56,458   64,125   52,158   54,718   50,985  326,567
Revenues ($)     64,124   66,128   67,125   48,178   51,785   55,687  353,027

3.2 Tables

The first decision in displaying data is whether a table or a chart will be more effective. In general, charts can often convey information to readers more quickly and easily, but in some cases a table is more appropriate. Tables should be used when the
1. reader needs to refer to specific numerical values.
2. reader needs to make precise comparisons between different values and not just relative comparisons.
3. values being displayed have different units or very different magnitudes.

When the accounting department of Gossamer Industries is summarizing the company’s annual data for completion of its federal tax forms, the specific numbers corresponding to revenues and expenses are important and not just the relative values. Therefore, these data should be presented in a table similar to Table 3.3. Similarly, if it is important to know by exactly how much revenues exceed expenses each month, then this would also be better presented as a table rather than as a line chart as seen in Figure 3.5. Notice that it is very difficult to determine the monthly revenues and costs in Figure 3.5. We could add these values using data labels, but they would clutter the figure. The preferred solution is to combine the chart with the table into a single figure, as in Figure 3.6, to allow the reader to easily see the monthly changes in revenues and costs while also being able to refer to the exact numerical values.

FIGURE 3.5 Line Chart of Monthly Costs and Revenues at Gossamer Industries
[Figure: line chart of Revenues ($) and Costs ($) by Month, months 1 through 6; vertical scale from 0 to 80,000]

FIGURE 3.6 Combined Line Chart and Table for Monthly Costs and Revenues at Gossamer Industries
[Figure: the line chart of Revenues ($) and Costs ($) by Month from Figure 3.5, placed above the following table of exact values]

                                       Month
                      1        2        3        4        5        6    Total
Costs ($)        48,123   56,458   64,125   52,158   54,718   50,985  326,567
Revenues ($)     64,124   66,128   67,125   48,178   51,785   55,687  353,027

Now suppose that you wish to display data on revenues, costs, and head count for each month. Costs and revenues are measured in dollars, but head count is measured in number of employees. Although all these values can be displayed on a line chart using multiple vertical axes, this is generally not recommended. Because the values have widely different magnitudes (costs and revenues are in the tens of thousands, whereas head count is approximately 10 each month), it would be difficult to interpret changes on a single chart. Therefore, a table similar to Table 3.4 is recommended.
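To make the earlier point about exact month-by-month differences concrete, the short Python sketch below computes revenue minus cost for each month from the Table 3.3 values and prints the result as a plain-text table. The variable names and the column widths are arbitrary choices made for this illustration.

```python
# Monthly values copied from Table 3.3 (months 1 through 6).
costs = [48123, 56458, 64125, 52158, 54718, 50985]
revenues = [64124, 66128, 67125, 48178, 51785, 55687]

# Revenue minus cost for each month; a negative value means costs exceeded revenues.
differences = [r - c for r, c in zip(revenues, costs)]

# Right-align every numeric column so the final digits line up.
print(f"{'Month':<6}{'Costs ($)':>12}{'Revenues ($)':>14}{'Difference ($)':>16}")
for month, (c, r, d) in enumerate(zip(costs, revenues, differences), start=1):
    print(f"{month:<6}{c:>12,}{r:>14,}{d:>16,}")
print(f"{'Total':<6}{sum(costs):>12,}{sum(revenues):>14,}{sum(differences):>16,}")
```

Reading the same differences off a line chart would require estimating the vertical gap between two lines, which is the situation in which a table is the better choice.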

Table Design Principles In designing an effective table, keep in mind the data-ink ratio and avoid the use of unnecessary ink in tables. In general, this means that we should avoid using vertical lines in a table unless they are necessary for clarity. Horizontal lines are generally necessary only for separating column titles from data values or when indicating that a calculation has taken place. Consider Figure 3.7, which compares several forms of a table displaying Gossamer’s costs and revenue data. Most people find Design D, with the fewest grid lines, easiest to read. In this table, grid lines are used only to separate the column headings from the data and to indicate that a calculation has occurred to generate the Profits row and the Total column. In large tables, vertical lines or light shading can be useful to help the reader differentiate the columns and rows. Table 3.5 breaks out the revenue data by location for nine cities

Table 3.4

Table Displaying Head Count, Costs, and Revenues at Gossamer Industries Month 1

Head count

2

3

4

5

6

Total

8

9

10

9

9

9

Costs ($)

48,123

56,458

64,125

52,158

54,718

50,985

326,567

Revenues ($)

64,124

66,128

67,125

48,178

51,785

55,687

353,027

93

3.2 Tables

Comparing Different Table Designs

FIGURE 3.7 Design A:

Design C: Month

Month

1

2

3

4

5

6

Costs ($)

48,123

56,458

64,125

52,158

54,718

50,985 326,567

Revenues ($)

64,124

66,128

67,125

48,178

51,785

55,687 353,027

Profits ($)

16,001

9,670

3,000

(3,980)

(2,933)

4,702

Total

26,460

Design B:

1

2

3

4

5

6

Total

Costs ($)

48,123

56,458

64,125

52,158

54,718

50,985

326,567

Revenues ($)

64,124

66,128

67,125

48,178

51,785

55,687

353,027

Profits ($)

16,001

9,670

3,000

(3,980)

(2,933)

4,702

26,460

1

2

3

4

5

6

Total

52,158

54,718

50,985

326,567

Design D: Month

Month

1

2

3

4

5

6

Total

Costs ($)

48,123

56,458

64,125

52,158

54,718

50,985

326,567

Costs ($)

48,123

56,458

64,125

Revenues ($)

64,124

66,128

67,125

48,178

51,785

55,687

353,027

Revenues ($)

64,124

66,128

67,125

48,178

51,785

55,687

353,027

Profits ($)

16,001

9,670

3,000

(3,980)

(2,933)

4,702

26,460

Profits ($)

16,001

9,670

3,000

(3,980)

(2,933)

4,702

26,460

We depart from these guidelines in some figures and tables in this textbook to more closely match Excel’s output.

and shows 12 months of revenue and cost data. In Table 3.5, every other column has been lightly shaded. This helps the reader quickly scan the table to see which values correspond with each month. The horizontal line between the revenue for Academy and the Total row helps the reader differentiate the revenue data for each location and indicates that a calculation has taken place to generate the totals by month. If one wanted to highlight the differences among locations, the shading could be done for every other row instead of every other column. Notice also the alignment of the text and numbers in Table 3.5. Columns of numerical values in a table should usually be right-aligned; that is, the final digit of each number should be aligned in the column. This makes it easy to see differences in the magnitude of values. If you are showing digits to the right of the decimal point, all values should include the same number of digits to the right of the decimal. Also, use only the number of digits that are necessary to convey the meaning in comparing the values; there is no need to include additional digits if they are not meaningful for comparisons. In many business applications, we report financial values, in which case we often round to the nearest dollar or include two digits to the right of the decimal if such precision is necessary. Additional digits to the right of the decimal are usually unnecessary. For extremely large numbers, we may prefer to display data rounded to the nearest thousand, ten thousand, or even million. For instance, if we need to include, say, $3,457,982 and $10,124,390 in a table when exact dollar values are not necessary, we could write these as 3.458 and 10.124 and indicate that all values in the table are in units of $1,000,000. It is generally best to left-align text values within a column in a table, as in the Revenues by Location (the first) column of Table 3.5. 
In some cases, you may prefer to center text, but you should do this only if the text values are all approximately the same length. Otherwise, aligning the first letter of each data entry promotes readability. Column headings should either match the alignment of the data in the columns or be centered over the values, as in Table 3.5.
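The rounding and alignment guidelines above can be illustrated with a few lines of Python. This is only a sketch: the function name format_in_millions and the column width of 8 are our own assumptions and are not part of Excel or of Table 3.5. It rescales the two dollar amounts from the example to units of $1,000,000, keeps three digits to the right of the decimal, and right-aligns the results so that the final digits line up.

```python
def format_in_millions(dollars, decimals=3, width=8):
    """Return a dollar amount rescaled to $1,000,000 units as right-aligned text."""
    scaled = dollars / 1_000_000
    return f"{scaled:>{width}.{decimals}f}"


values = [3_457_982, 10_124_390]

print("Values in units of $1,000,000")
for v in values:
    # Every value uses the same number of decimals and the same column width
    print(format_in_millions(v))
```

Because every value is printed with the same width and the same number of decimals, differences in magnitude are visible at a glance, which is the purpose of right alignment.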

Crosstabulation

Types of data such as categorical and quantitative are discussed in Chapter 2.

A useful type of table for describing data of two variables is a crosstabulation, which provides a tabular summary of data for two variables. To illustrate, consider the following application based on data from Zagat’s Restaurant Review. Data on the quality rating, meal price, and the usual wait time for a table during peak hours were collected for a sample of 300 Los Angeles area restaurants. Table 3.6 shows the data for the first 10 restaurants.

Table 3.5  Larger Table Showing Revenues by Location for 12 Months of Data

                          Month
Revenues by Location ($)         1       2       3       4       5       6       7       8       9      10      11      12     Total
Temple                       8,987   8,595   8,958   6,718   8,066   8,574   8,701   9,490   9,610   9,262   9,875  11,058   107,895
Killeen                      8,212   9,143   8,714   6,869   8,150   8,891   8,766   9,193   9,603  10,374  10,456  10,982   109,353
Waco                        11,603  12,063  11,173   9,622   8,912   9,553  11,943  12,947  12,925  14,050  14,300  13,877   142,967
Belton                       7,671   7,617   7,896   6,899   7,877   6,621   7,765   7,720   7,824   7,938   7,943   7,047    90,819
Granger                      7,642   7,744   7,836   5,833   6,002   6,728   7,848   7,717   7,646   7,620   7,728   8,013    88,357
Harker Heights               5,257   5,326   4,998   4,304   4,106   4,980   5,084   5,061   5,186   5,179   4,955   5,326    59,763
Gatesville                   5,316   5,245   5,056   3,317   3,852   4,026   5,135   5,132   5,052   5,271   5,304   5,154    57,859
Lampasas                     5,266   5,129   5,022   3,022   3,088   4,289   5,110   5,073   4,978   5,343   4,984   5,315    56,620
Academy                      4,170   5,266   7,472   1,594   1,732   2,025   8,772   1,956   3,304   3,090   3,579   2,487    45,446
------------------------------------------------------------------------------------------------------------------------------------
Total                       64,124  66,128  67,125  48,178  51,785  55,687  69,125  64,288  66,128  68,128  69,125  69,258   759,079
Costs ($)                   48,123  56,458  64,125  52,158  54,718  50,985  57,898  62,050  65,215  61,819  67,828  69,558   710,935

Table 3.6  Quality Rating and Meal Price for 300 Los Angeles Restaurants

Restaurant   Quality Rating   Meal Price ($)   Wait Time (min)
         1   Good                         18                 5
         2   Very Good                    22                 6
         3   Good                         28                 1
         4   Excellent                    38                74
         5   Very Good                    33                 6
         6   Good                         28                 5
         7   Very Good                    19                11
         8   Very Good                    11                 9
         9   Very Good                    23                13
        10   Good                         13                 1

Quality ratings are an example of categorical data, and meal prices are an example of quantitative data. For now, we will limit our consideration to the quality-rating and meal-price variables. A crosstabulation of the data for quality rating and meal price is shown in Table 3.7. The left and top margin labels define the classes for the two variables. In the left margin, the row labels (Good, Very Good, and Excellent) correspond to the three classes of the quality-rating variable. In the top margin, the column labels ($10–19, $20–29, $30–39, and $40–49) correspond to the four classes (or bins) of the meal-price variable. Each restaurant in the sample provides a quality rating and a meal price. Thus, each restaurant in the sample is associated with a cell appearing in one of the rows and one of the columns of the crosstabulation. For example, restaurant 5 is identified as having a very good quality rating and a meal price of $33. This restaurant belongs to the cell in row 2 and column 3. In constructing a crosstabulation, we simply count the number of restaurants that belong to each of the cells in the crosstabulation. Table 3.7 shows that the greatest number of restaurants in the sample (64) have a very good rating and a meal price in the $20–29 range. Only two restaurants have an excellent rating and a meal price in the $10–19 range. Similar interpretations of the other frequencies can be made. In addition, note that the right and bottom margins of the crosstabulation give the frequencies of quality rating and meal price separately. From the right margin, we see that data on quality ratings show 84 good restaurants, 150 very good restaurants, and 66 excellent restaurants. Similarly, the bottom margin shows the counts for the meal price variable. The value of 300 in the bottom-right corner of the table indicates that 300 restaurants were included in this data set.
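The counting procedure described above can be sketched in Python using only the standard library. The example below is an illustration under stated assumptions, not the method used to build Table 3.7: it uses only the first 10 restaurants listed in Table 3.6 (so its counts are far smaller than those in Table 3.7), and the helper names price_bin and label are our own.

```python
from collections import Counter

# First 10 restaurants from Table 3.6: (quality rating, meal price in $)
restaurants = [
    ("Good", 18), ("Very Good", 22), ("Good", 28), ("Excellent", 38),
    ("Very Good", 33), ("Good", 28), ("Very Good", 19), ("Very Good", 11),
    ("Very Good", 23), ("Good", 13),
]

ratings = ["Good", "Very Good", "Excellent"]
bins = [10, 20, 30, 40]  # lower limits of the $10-wide meal-price classes


def price_bin(price):
    """Return the lower limit of the $10-wide class that contains price."""
    return (price // 10) * 10


def label(lower):
    """Column label for a meal-price class, e.g. $10-19."""
    return f"${lower}-{lower + 9}"


# Count the restaurants that belong to each (rating, price class) cell
cells = Counter((rating, price_bin(price)) for rating, price in restaurants)

# Print the crosstabulation with row and column totals in the margins
print(f"{'':<10}" + "".join(f"{label(b):>8}" for b in bins) + f"{'Total':>8}")
for r in ratings:
    row = [cells[(r, b)] for b in bins]
    print(f"{r:<10}" + "".join(f"{n:>8}" for n in row) + f"{sum(row):>8}")
col_totals = [sum(cells[(r, b)] for r in ratings) for b in bins]
print(f"{'Total':<10}" + "".join(f"{n:>8}" for n in col_totals) + f"{sum(col_totals):>8}")
```

Restaurant 5 (Very Good, $33) is counted in the Very Good row and the $30-39 column, matching the cell identified in the text.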

Table 3.7  Crosstabulation of Quality Rating and Meal Price for 300 Los Angeles Restaurants

                            Meal Price
Quality Rating   $10–19   $20–29   $30–39   $40–49   Total
Good                 42       40        2        0      84
Very Good            34       64       46        6     150
Excellent             2       14       28       22      66
Total                78      118       76       28     300

PivotTables in Excel

A crosstabulation in Microsoft Excel is known as a PivotTable. We will first look at a simple example of how Excel’s PivotTable is used to create a crosstabulation of the Zagat’s restaurant data shown previously. Figure 3.8 illustrates a portion of the data contained in the file Restaurant; the data for the 300 restaurants in the sample have been entered into cells B2:D301.

FIGURE 3.8  Excel Worksheet Containing Restaurant Data (worksheet rows 1–26, columns A–D)

Restaurant   Quality Rating   Meal Price ($)   Wait Time (min)
         1   Good                         18                 5
         2   Very Good                    22                 6
         3   Good                         28                 1
         4   Excellent                    38                74
         5   Very Good                    33                 6
         6   Good                         28                 5
         7   Very Good                    19                11
         8   Very Good                    11                 9
         9   Very Good                    23                13
        10   Good                         13                 1
        11   Very Good                    33                18
        12   Very Good                    44                 7
        13   Excellent                    42                18
        14   Excellent                    34                46
        15   Good                         25                 0
        16   Good                         22                 3
        17   Good                         26                 3
        18   Excellent                    17                36
        19   Very Good                    30                 7
        20   Good                         19                 3
        21   Very Good                    33                10
        22   Very Good                    22                14
        23   Excellent                    32                27
        24   Excellent                    33                80
        25   Very Good                    34                 9

To create a PivotTable in Excel, we follow these steps:

Step 1. Click the Insert tab on the Ribbon
Step 2. Click PivotTable in the Tables group
Step 3. When the Create PivotTable dialog box appears:
        Choose Select a table or range
        Enter A1:D301 in the Table/Range: box
        Select New Worksheet as the location for the PivotTable Report
        Click OK

The resulting initial PivotTable Field List and PivotTable Report are shown in Figure 3.9. Each of the four columns in Figure 3.8 [Restaurant, Quality Rating, Meal Price ($), and Wait Time (min)] is considered a field by Excel. Fields may be chosen to represent rows, columns, or values in the body of the PivotTable Report.

FIGURE 3.9  Initial PivotTable Field List and PivotTable Field Report for the Restaurant Data

The following steps show how to use Excel’s PivotTable Field List to assign the Quality Rating field to the rows, the Meal Price ($) field to the columns, and the Restaurant field to the body of the PivotTable report.

Step 4. In the PivotTable Fields task pane, go to Drag fields between areas below:
        Drag the Quality Rating field to the ROWS area
        Drag the Meal Price ($) field to the COLUMNS area
        Drag the Restaurant field to the VALUES area
Step 5. Click on Sum of Restaurant in the VALUES area
Step 6. Select Value Field Settings from the list of options
Step 7. When the Value Field Settings dialog box appears:
        Under Summarize value field by, select Count
        Click OK

Figure 3.10 shows the completed PivotTable Field List and a portion of the PivotTable worksheet as it now appears. To complete the PivotTable, we need to group the columns representing meal prices and place the row labels for quality rating in the proper order:

Step 8. Right-click in cell B4 or any other cell containing a meal price column label
Step 9. Select Group from the list of options
Step 10. When the Grouping dialog box appears:
        Enter 10 in the Starting at: box
        Enter 49 in the Ending at: box
        Enter 10 in the By: box
        Click OK
Step 11. Right-click on “Excellent” in cell A5
Step 12. Select Move and click Move “Excellent” to End

The final PivotTable, shown in Figure 3.11, provides the same information as the crosstabulation in Table 3.7. The values in Figure 3.11 can be interpreted as the frequencies of the data. For instance, row 8 provides the frequency distribution for the data over the quantitative variable of meal price. Seventy-eight restaurants have meal prices of $10 to $19. Column F provides the frequency distribution for the data over the categorical variable of quality.

FIGURE 3.10  Completed PivotTable Field List and a Portion of the PivotTable Report for the Restaurant Data (Columns H:AK Are Hidden)

FIGURE 3.11  Final PivotTable Report for the Restaurant Data

A total of 150 restaurants have a quality rating of Very Good. We can also use a PivotTable to create percent frequency distributions, as shown in the following steps:

Step 1. To invoke the PivotTable Fields task pane, select any cell in the pivot table
Step 2. In the PivotTable Fields task pane, click the Count of Restaurant in the VALUES area
Step 3. Select Value Field Settings . . . from the list of options
Step 4. When the Value Field Settings dialog box appears, click the tab for Show Values As
Step 5. In the Show values as area, select % of Grand Total from the drop-down menu
        Click OK

Figure 3.12 displays the percent frequency distribution for the Restaurant data as a PivotTable. The figure indicates that 50% of the restaurants are in the Very Good quality category and that 26% have meal prices between $10 and $19.

PivotTables in Excel are interactive, and they may be used to display statistics other than a simple count of items. As an illustration, we can easily modify the PivotTable in Figure 3.11 to display summary information on wait times rather than counts of restaurants.

Step 1. To invoke the PivotTable Fields task pane, select any cell in the pivot table
Step 2. In the PivotTable Fields task pane, click the Count of Restaurant field in the VALUES area
        Select Remove Field
Step 3. Drag the Wait Time (min) to the VALUES area
Step 4. Click on Sum of Wait Time (min) in the VALUES area
Step 5. Select Value Field Settings… from the list of options
Step 6. When the Value Field Settings dialog box appears:
        Under Summarize value field by, select Average
        Click Number Format
        In the Category: area, select Number
        Enter 1 for Decimal places:
        Click OK
        When the Value Field Settings dialog box reappears, click OK

FIGURE 3.12  Percent Frequency Distribution as a PivotTable for the Restaurant Data

Count of Restaurant   Column Labels
Row Labels              10–19     20–29     30–39     40–49   Grand Total
Good                   14.00%    13.33%     0.67%     0.00%        28.00%
Very Good              11.33%    21.33%    15.33%     2.00%        50.00%
Excellent               0.67%     4.67%     9.33%     7.33%        22.00%
Grand Total            26.00%    39.33%    25.33%     9.33%       100.00%
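The percent frequencies in Figure 3.12 are simply the cell counts of Table 3.7 divided by the grand total of 300. The short Python sketch below reproduces that arithmetic; the names counts, percent, and percent_table are our own, and the sketch assumes that the Good row has no restaurants in the $40-49 class, as the 0.00% entry in Figure 3.12 indicates.

```python
# Cell counts from Table 3.7 (rows: quality rating; columns: $10-19 ... $40-49).
# Assumption: the Good / $40-49 cell is 0, consistent with 0.00% in Figure 3.12.
counts = {
    "Good":      [42, 40, 2, 0],
    "Very Good": [34, 64, 46, 6],
    "Excellent": [2, 14, 28, 22],
}

grand_total = sum(sum(row) for row in counts.values())


def percent(n, total):
    """Percent of total, rounded to two decimals."""
    return round(100 * n / total, 2)


# Each row holds the four cell percentages followed by the row (margin) percentage
percent_table = {
    rating: [percent(n, grand_total) for n in row] + [percent(sum(row), grand_total)]
    for rating, row in counts.items()
}

for rating, row in percent_table.items():
    print(f"{rating:<10}" + "".join(f"{p:>8.2f}%" for p in row))
```

The last entry of the Very Good row is the 50% cited in the text, and summing the first column of counts (42 + 34 + 2 = 78) and dividing by 300 gives the 26% of restaurants with meal prices between $10 and $19.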

You can also filter data in a PivotTable by dragging the field that you want to filter to the FILTERS area in the PivotTable Fields.

The completed PivotTable appears in Figure 3.13. This PivotTable replaces the counts of restaurants with values for the average wait time for a table at a restaurant for each grouping of meal prices ($10–19, $20–29, $30–39, and $40–49). For instance, cell B7 indicates that the average wait time for a table at an Excellent restaurant with a meal price of $10–19 is 25.5 minutes. Column F displays the overall average wait time for tables in each quality rating category. We see that Excellent restaurants have the longest average wait of 32.1 minutes and that Good restaurants have an average wait time of only 2.5 minutes. Finally, cell D7 shows us that the longest wait times can be expected at Excellent restaurants with meal prices in the $30–39 range (34 minutes). We can also examine only a portion of the data in a PivotTable using the Filter option in Excel. To Filter data in a PivotTable, click on the Filter Arrow next to Row Labels or Column Labels and then uncheck the values that you want to remove from the PivotTable. For example, we could click on the arrow next to Row Labels and then uncheck the Good value to examine only Very Good and Excellent restaurants.
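Summarizing a value field by Average amounts to grouping the records and taking the mean within each group. The following Python sketch illustrates the idea using only the first 10 restaurants from Table 3.6, so its averages will not match the values in Figure 3.13, which are based on all 300 restaurants. The variable names are our own.

```python
from collections import defaultdict
from statistics import mean

# First 10 restaurants from Table 3.6: (quality rating, wait time in minutes)
sample = [
    ("Good", 5), ("Very Good", 6), ("Good", 1), ("Excellent", 74),
    ("Very Good", 6), ("Good", 5), ("Very Good", 11), ("Very Good", 9),
    ("Very Good", 13), ("Good", 1),
]

# Group the wait times by quality rating
waits_by_rating = defaultdict(list)
for rating, wait in sample:
    waits_by_rating[rating].append(wait)

# Average within each group, rounded to one decimal place as in the PivotTable
average_wait = {rating: round(mean(waits), 1) for rating, waits in waits_by_rating.items()}

for rating in ["Good", "Very Good", "Excellent"]:
    print(f"{rating:<10}{average_wait[rating]:>6.1f}")
```

Filtering works the same way in spirit: removing the Good records from sample before grouping would leave only the Very Good and Excellent averages.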

Recommended PivotTables in Excel

Excel also has the ability to recommend PivotTables for your data set. To illustrate Recommended PivotTables in Excel, we return to the restaurant data in Figure 3.8. To create a Recommended PivotTable, follow the steps below using the file Restaurant.

Step 1. Select any cell in table of data (for example, cell A1)
Step 2. Click the Insert tab on the Ribbon
Step 3. Click Recommended PivotTables in the Tables group
Step 4. When the Recommended PivotTables dialog box appears:
        Select the Count of Restaurant, Sum of Wait Time (min), Sum of Meal Price ($) by Quality Rating option (see Figure 3.14)
        Click OK

Hovering your pointer over the different options will display the full name of each option, as shown in Figure 3.14.

The steps above will create the PivotTable shown in Figure 3.15 on a new Worksheet. The Recommended PivotTables tool in Excel is useful for quickly creating commonly used PivotTables for a data set, but note that it may not give you the option to create the exact PivotTable that will be of the most use for your data analysis. Displaying the sum of wait times and the sum of meal prices within each quality-rating category, as shown in Figure 3.15, is not particularly useful here; the average wait times and average meal prices within each quality-rating category would be more useful to us. But we can easily modify the PivotTable in Figure 3.15 to show the average values by selecting any cell in the PivotTable to invoke the PivotTable Fields task pane, clicking on Sum of Wait Time (min) and then Sum of Meal Price ($), and using the Value Field Settings… to change the Summarize value field by option to Average. The finished PivotTable is shown in Figure 3.16.

FIGURE 3.13  PivotTable Report for the Restaurant Data with Average Wait Times Added

Average of Wait Time (min)   Column Labels
Row Labels       10–19   20–29   30–39   40–49   Grand Total
Good               2.6     2.5     0.5                   2.5
Very Good         12.6    12.6    12.0    10.0          12.3
Excellent         25.5    29.1    34.0    32.3          32.1
Grand Total        7.6    11.1    19.8    27.5          13.9

FIGURE 3.14  Recommended PivotTables Dialog Box in Excel

FIGURE 3.15  Default PivotTable Created for Restaurant Data Using Excel’s Recommended PivotTables Tool

Row Labels    Count of Restaurant   Sum of Wait Time (min)   Sum of Meal Price ($)
Excellent                      66                     2120                    2267
Good                           84                      207                    1657
Very Good                     150                     1848                    3845
Grand Total                   300                     4175                    7769

FIGURE 3.16  Completed PivotTable for Restaurant Data Using Excel’s Recommended PivotTables Tool

Row Labels    Count of Restaurant   Average of Wait Time (min)   Average of Meal Price ($)
Excellent                      66                         32.1                       34.35
Good                           84                          2.5                       19.73
Very Good                     150                         12.3                       25.63
Grand Total                   300                         13.9                       25.90
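The averages in Figure 3.16 follow directly from the sums and counts reported in Figure 3.15: each average is the corresponding sum divided by the count of restaurants. The Python sketch below checks that arithmetic; the dictionary layout and key names are our own choices for illustration.

```python
# Counts and sums by quality rating, as reported in Figure 3.15
summary = {
    "Excellent":   {"count": 66,  "wait_sum": 2120, "price_sum": 2267},
    "Good":        {"count": 84,  "wait_sum": 207,  "price_sum": 1657},
    "Very Good":   {"count": 150, "wait_sum": 1848, "price_sum": 3845},
    "Grand Total": {"count": 300, "wait_sum": 4175, "price_sum": 7769},
}

# Average = sum / count; wait times to one decimal, meal prices to two decimals
averages = {
    rating: (round(s["wait_sum"] / s["count"], 1),
             round(s["price_sum"] / s["count"], 2))
    for rating, s in summary.items()
}

for rating, (avg_wait, avg_price) in averages.items():
    print(f"{rating:<12}{avg_wait:>6.1f}{avg_price:>8.2f}")
```

The sums alone are hard to compare because the categories contain different numbers of restaurants; dividing by the count removes that effect, which is why the averages are more useful here.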

3.3 Charts

Charts (or graphs) are visual methods for displaying data. In this section, we introduce some of the most commonly used charts to display and analyze data including scatter charts, line charts, and bar charts. Excel is the most commonly used software package for creating simple charts. We explain how to use Excel to create scatter charts, line charts, sparklines, bar charts, bubble charts, and heat maps.

Scatter Charts

A scatter chart is a graphical presentation of the relationship between two quantitative variables. As an illustration, consider the advertising/sales relationship for an electronics store in San Francisco. On 10 occasions during the past three months, the store used weekend television commercials to promote sales at its stores. The managers want to investigate whether a relationship exists between the number of commercials shown and sales at the store the following week. Sample data for the 10 weeks, with sales in hundreds of dollars, are shown in Table 3.8.

Table 3.8  Sample Data for the San Francisco Electronics Store

        No. of Commercials   Sales ($100s)
Week                     x               y
   1                     2              50
   2                     5              57
   3                     1              41
   4                     3              54
   5                     4              54
   6                     1              38
   7                     5              63
   8                     3              48
   9                     4              59
  10                     2              46

We will use the data from Table 3.8 to create a scatter chart using Excel’s chart tools and the data in the file Electronics:

Hovering the pointer over the chart type buttons in Excel will display the names of the buttons and short descriptions of the types of chart.

Steps 9 and 10 are optional, but they improve the chart’s readability. We would want to retain the gridlines only if they helped the reader to determine more precisely where data points are located relative to certain values on the horizontal and/or vertical axes.

Step 1. Select cells B2:C11
Step 2. Click the Insert tab in the Ribbon
Step 3. Click the Insert Scatter (X,Y) or Bubble Chart button in the Charts group
Step 4. When the list of scatter chart subtypes appears, click the Scatter button
Step 5. Click the Design tab under the Chart Tools Ribbon
Step 6. Click Add Chart Element in the Chart Layouts group
        Select Chart Title, and click Above Chart
        Click on the text box above the chart, and replace the text with Scatter Chart for the San Francisco Electronics Store
Step 7. Click Add Chart Element in the Chart Layouts group
        Select Axis Title, and click Primary Horizontal
        Click on the text box under the horizontal axis, and replace “Axis Title” with Number of Commercials
Step 8. Click Add Chart Element in the Chart Layouts group
        Select Axis Title, and click Primary Vertical
        Click on the text box next to the vertical axis, and replace “Axis Title” with Sales ($100s)
Step 9. Right-click on one of the horizontal grid lines in the body of the chart, and click Delete
Step 10. Right-click on one of the vertical grid lines in the body of the chart, and click Delete

We can also use Excel to add a trendline to the scatter chart. A trendline is a line that provides an approximation of the relationship between the variables. To add a linear trendline using Excel, we use the following steps:

Step 1. Right-click on one of the data points in the scatter chart, and select Add Trendline…
Step 2. When the Format Trendline task pane appears, select Linear under Trendline Options

Figure 3.17 shows the scatter chart and linear trendline created with Excel for the data in Table 3.8. The number of commercials (x) is shown on the horizontal axis, and sales (y) are shown on the vertical axis. For week 1, x = 2 and y = 50. A point is plotted on the scatter chart at those coordinates; similar points are plotted for the other nine weeks. Note that during two of the weeks, one commercial was shown, during two of the weeks, two commercials were shown, and so on.

FIGURE 3.17  Scatter Chart for the San Francisco Electronics Store (horizontal axis: No. of Commercials; vertical axis: Sales ($100s))

Scatter charts are often referred to as scatter plots or scatter diagrams. Chapter 2 introduces scatter charts and relates them to the concepts of covariance and correlation.

The completed scatter chart in Figure 3.17 indicates a positive linear relationship (or positive correlation) between the number of commercials and sales: Higher sales are associated with a higher number of commercials. The linear relationship is not perfect because not all of the points are on a straight line. However, the general pattern of the points and the trendline suggest that the overall relationship is positive. This implies that the covariance between sales and commercials is positive and that the correlation coefficient between these two variables is between 0 and +1.

The Chart Buttons in Excel allow users to quickly modify and format charts. Three buttons appear next to a chart whenever you click on a chart to make it active. Clicking on the Chart Elements button brings up a list of check boxes to quickly add and remove axes, axis titles, chart titles, data labels, trendlines, and more. Clicking on the Chart Styles button allows the user to quickly choose from many preformatted styles to change the look of the chart. Clicking on the Chart Filter button allows the user to select the data to be included in the chart. The Chart Filter button is very useful for performing additional data analysis.
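The linear trendline and the sign of the covariance and correlation discussed for the scatter chart can be checked numerically. The Python sketch below fits a least-squares line to the data in Table 3.8, which is the standard way a linear trendline is computed. The variable names are our own, and the sketch is meant only to illustrate the calculation, not to reproduce Excel's internal procedure.

```python
from math import sqrt

# Table 3.8: number of commercials (x) and sales in $100s (y) for 10 weeks
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

slope = s_xy / s_xx                     # least-squares slope of the trendline
intercept = y_bar - slope * x_bar       # least-squares intercept
covariance = s_xy / (n - 1)             # sample covariance
correlation = s_xy / sqrt(s_xx * s_yy)  # sample correlation coefficient

print(f"trendline: y = {intercept:.2f} + {slope:.2f}x")
print(f"covariance = {covariance:.2f}, correlation = {correlation:.2f}")
```

For these data the slope and the covariance are both positive and the correlation falls between 0 and +1, in agreement with the pattern visible in the scatter chart.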

Recommended Charts in Excel

Similar to the ability to recommend PivotTables, Excel has the ability to recommend charts for a given data set. The steps below demonstrate the Recommended Charts tool in Excel for the Electronics data.

Step 1. Select cells B2:C11
Step 2. Click the Insert tab in the Ribbon
Step 3. Click the Recommended Charts button in the Charts group
Step 4. When the Insert Chart dialog box appears, select the Scatter option (see Figure 3.18)
        Click OK

These steps create the basic scatter chart that can then be formatted (using the Chart Buttons or Chart Tools Ribbon) to create the completed scatter chart shown in Figure 3.17. Note that the Recommended Charts tool gives several possible recommendations for the electronics data in Figure 3.18. These recommendations include scatter charts, line charts, and bar charts, which will be covered later in this chapter. Excel’s Recommended Charts tool generally does a good job of interpreting your data and providing recommended charts, but take care to ensure that the selected chart is meaningful and follows good design practice.

FIGURE 3.18  Insert Chart Dialog Box from Recommended Charts Tool in Excel

Line Charts

A line chart for time series data is often called a time series plot.

Line charts are similar to scatter charts, but a line connects the points in the chart. Line charts are very useful for time series data collected over a period of time (minutes, hours, days, years, etc.). As an example, Kirkland Industries sells air compressors to manufacturing companies. Table 3.9 contains total sales amounts (in $100s) for air compressors during each month in the most recent calendar year. Figure 3.19 displays a scatter chart and a line chart created in Excel for these sales data. The line chart connects the points of the scatter chart. The addition of lines between the points suggests continuity, and it is easier for the reader to interpret changes over time.

Table 3.9  Monthly Sales Data of Air Compressors at Kirkland Industries

Month   Sales ($100s)
Jan               135
Feb               145
Mar               175
Apr               180
May               160
Jun               135
Jul               210
Aug               175
Sep               160
Oct               120
Nov               115
Dec               120

FIGURE 3.19  Scatter Chart and Line Chart for Monthly Sales Data at Kirkland Industries (two panels: Scatter Chart for Monthly Sales Data; Line Chart for Monthly Sales Data; vertical axis: Sales ($100s); horizontal axis: months Jan through Dec)

To create the line chart in Figure 3.19 in Excel, we follow these steps using the file Kirkland:

Step 1. Select cells A2:B13
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Line Chart button in the Charts group
Step 4. When the list of line chart subtypes appears, click the Line with Markers button under 2-D Line
        This creates a line chart for sales with a basic layout and minimum formatting
Step 5. Select the line chart that was just created to reveal the Chart Buttons
Step 6. Click the Chart Elements button
        Select the check boxes for Axes, Axis Titles, and Chart Title
        Deselect the check box for Gridlines
        Click on the text box next to the vertical axis, and replace “Axis Title” with Sales ($100s)
        Click on the text box next to the horizontal axis and replace “Axis Title” with Month
        Click on the text box above the chart, and replace “Sales ($100s)” with Line Chart for Monthly Sales Data

In the line chart in Figure 3.19, we have kept the markers at each data point. This is a matter of personal taste, but removing the markers tends to suggest that the data are continuous when in fact we have only one data point per month.

Because the gridlines do not add any meaningful information here, we do not select the check box for Gridlines in Chart Elements; leaving the gridlines out increases the data-ink ratio.

Figure 3.20 shows the line chart created in Excel along with the selected options for the Chart Elements button.

FIGURE 3.20  Line Chart and Excel’s Chart Elements Button Options for Monthly Sales Data at Kirkland Industries (vertical axis: Sales ($100s); horizontal axis: Month)

Line charts can also be used to graph multiple lines. Suppose we want to break out Kirkland’s sales data by region (North and South), as shown in Table 3.10. We can create a line chart in Excel that shows sales in both regions, as in Figure 3.21, by following similar steps but selecting cells A2:C14 in the file KirklandRegional before creating the line chart. Figure 3.21 shows an interesting pattern. Sales in both the North and the South regions seemed to follow the same increasing/decreasing pattern until October. Starting in October, sales in the North continued to decrease while sales in the South increased. We would probably want to investigate any changes that occurred in the North region around October.

Table 3.10  Regional Sales Data by Month for Air Compressors at Kirkland Industries

         Sales ($100s)
Month    North   South
Jan         95      40
Feb        100      45
Mar        120      55
Apr        115      65
May        100      60
Jun         85      50
Jul        135      75
Aug        110      65
Sep        100      60
Oct         50      70
Nov         40      75
Dec         40      80

FIGURE 3.21  Line Chart of Regional Sales Data at Kirkland Industries (vertical axis: Sales ($100s); horizontal axis: months Jan through Dec; lines labeled North and South)

In the line chart in Figure 3.21, we have replaced Excel’s default legend with text boxes labeling the lines corresponding to sales in the North and the South. This can often make the chart look cleaner and easier to interpret.

A special type of line chart is a sparkline, which is a minimalist type of line chart that can be placed directly into a cell in Excel. Sparklines contain no axes; they display only the line for the data. Sparklines take up very little space, and they can be effectively used to provide information on overall trends for time series data. Figure 3.22 illustrates the use of sparklines in Excel for the regional sales data. To create a sparkline in Excel:

Step 1. Click the Insert tab on the Ribbon
Step 2. Click Line in the Sparklines group
Step 3. When the Create Sparklines dialog box appears:
        Enter B3:B14 in the Data Range: box
        Enter B15 in the Location Range: box
        Click OK
Step 4. Copy cell B15 to cell C15

The sparklines in cells B15 and C15 do not indicate the magnitude of sales in the North and the South regions, but they do show the overall trend for these data. Sales in the North appear to be decreasing and sales in the South increasing overall. Because sparklines are input directly into the cell in Excel, we can also type text directly into the same cell that will then be overlaid on the sparkline, or we can add shading to the cell, which will appear as the background. In Figure 3.22, we have shaded cells B15 and C15 to highlight the sparklines. As can be seen, sparklines provide an efficient and simple way to display basic information about a time series.

FIGURE 3.22  Sparklines for the Regional Sales Data at Kirkland Industries (worksheet containing the data of Table 3.10 in cells A2:C14, with sparklines in cells B15 and C15)
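The idea behind a sparkline, a compact trace of a series scaled to its own range and drawn without axes, can be imitated in plain text. The Python sketch below is a loose analogy rather than a description of how Excel draws sparklines: the function name sparkline and the use of eight Unicode block characters are our own assumptions. It is applied to the regional sales data in Table 3.10.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline(values):
    """Map each value to a block character, scaled to the series' own range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BLOCKS[0] * len(values)  # a flat series has no range to scale
    scale = (len(BLOCKS) - 1) / (hi - lo)
    return "".join(BLOCKS[round((v - lo) * scale)] for v in values)


# Monthly sales ($100s), Jan through Dec, from Table 3.10
north = [95, 100, 120, 115, 100, 85, 135, 110, 100, 50, 40, 40]
south = [40, 45, 55, 65, 60, 50, 75, 65, 60, 70, 75, 80]

north_line = sparkline(north)
south_line = sparkline(south)
print("North", north_line)
print("South", south_line)
```

Because each series is scaled to its own minimum and maximum, the two traces show the overall trend in each region but say nothing about the relative magnitude of sales, the same limitation noted for the sparklines in cells B15 and C15.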

Bar Charts and Column Charts AccountsManaged

Bar charts and column charts provide a graphical summary of categorical data. Bar charts use horizontal bars to display the magnitude of the quantitative variable. Column charts use vertical bars to display the magnitude of the quantitative variable. Bar and column charts are very helpful in making comparisons between categorical variables. Consider a regional supervisor who wants to examine the number of accounts being handled by each manager. Figure 3.23 shows a bar chart created in Excel displaying these data. To create this bar chart in Excel: Step 1. Step 2. Step 3. Step 4.

Select cells A2:B9 Click the Insert tab on the Ribbon Click the Insert Column or Bar Chart button When the list of bar chart subtypes appears: Click the Clustered Bar button

in the Charts group

in the 2-D Bar section

Step 5. Select the bar chart that was just created to reveal the Chart Buttons Step 6. Click the Chart Elements button Select the check boxes for Axes, Axis Titles, and Chart Title Deselect the check box for Gridlines Click on the text box next to the vertical axis, and replace “Axis Title” with Accounts Managed Click on the text box next to the vertical axis, and replace “Axis Title” with Manager Click on the text box above the chart, and replace “Chart Title” with Bar Chart of Accounts Managed From Figure 3.23 we can see that Gentry manages the greatest number of accounts and Williams the fewest. We can make this bar chart even easier to read by ordering the results by the number of accounts managed. We can do this with the following steps: Step 1. Select cells A1:B9 Step 2. Right-click any of the cells A1:B9 Select Sort Click Custom Sort

Chapter 3 Data Visualization

FIGURE 3.23 Bar Chart for Accounts Managed Data

Manager     Accounts Managed
Davis                     24
Edwards                   11
Francois                  28
Gentry                    37
Jones                     15
Lopez                     29
Smith                     21
Williams                   6

[Bar chart titled “Bar Chart of Accounts Managed”; horizontal axis: Accounts Managed (0 to 40); vertical axis: Manager]

Step 3. When the Sort dialog box appears:
        Make sure that the check box for My data has headers is checked
        Select Accounts Managed in the Sort by box under Column
        Select Smallest to Largest under Order
        Click OK

In the completed bar chart in Excel, shown in Figure 3.24, we can easily compare the relative number of accounts managed for all managers. However, note that it is difficult to interpret from the bar chart exactly how many accounts are assigned to each manager. If this information is necessary, these data are better presented as a table or by adding data labels to the bar chart, as in Figure 3.25, which is created in Excel using the following steps:

Step 1. Select the chart to reveal the Chart Buttons
Step 2. Click the Chart Elements button
        Select the check box for Data Labels

This adds labels of the number of accounts managed to the end of each bar so that the reader can easily look up exact values displayed in the bar chart.

Alternatively, you can add Data Labels by right-clicking on a bar in the chart and selecting Add Data Labels.
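The effect of sorting the bars and adding data labels can also be sketched outside Excel. The short Python example below is only an illustration, not part of the Excel procedure: the function names `sorted_bar_rows` and `render_bar_chart` are hypothetical, and the text-based bars merely approximate a real chart. The values are those shown in Figure 3.23.

```python
# Accounts managed per manager, transcribed from Figure 3.23.
accounts = {
    "Davis": 24, "Edwards": 11, "Francois": 28, "Gentry": 37,
    "Jones": 15, "Lopez": 29, "Smith": 21, "Williams": 6,
}


def sorted_bar_rows(data, descending=True):
    """Return (label, value) pairs ordered by value."""
    return sorted(data.items(), key=lambda item: item[1], reverse=descending)


def render_bar_chart(rows, width=40):
    """Draw one text bar per row, scaled to the largest value,
    and append the exact value as a data label."""
    largest = max(value for _, value in rows)
    lines = []
    for label, value in rows:
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<10}{bar} {value}")
    return "\n".join(lines)


rows = sorted_bar_rows(accounts)
print(render_bar_chart(rows))
```

In this sketch the largest value is listed first, which matches the top-to-bottom order of the bars in Figure 3.24, and the number printed at the end of each bar plays the role of the data labels in Figure 3.25.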

A Note on Pie Charts and Three-Dimensional Charts

Pie charts are another common form of chart used to compare categorical data. However, many experts argue that pie charts are inferior to bar charts for comparing data. The pie chart in Figure 3.26 displays the data for the number of accounts managed in Figure 3.23. Visually, it is still relatively easy to see that Gentry has the greatest number of accounts and that Williams has the fewest. However, it is difficult to say whether Lopez or Francois has more accounts. Research has shown that people find it very difficult to perceive differences in area. Compare Figure 3.26 to Figure 3.24. Making visual comparisons is much easier in the bar chart than in the pie chart (particularly when using a limited number of colors for differentiation). Therefore, we recommend against using pie charts in most situations and suggest instead using bar charts for comparing categorical data.
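A short Python sketch, offered only as an illustration, shows how small the difference between two similar pie slices can be. It converts the account counts from Figure 3.23 into slice angles; the helper name `slice_angle` is hypothetical.

```python
# Accounts managed per manager, transcribed from Figure 3.23.
accounts = {
    "Davis": 24, "Edwards": 11, "Francois": 28, "Gentry": 37,
    "Jones": 15, "Lopez": 29, "Smith": 21, "Williams": 6,
}

total = sum(accounts.values())


def slice_angle(count, total):
    """Central angle (in degrees) of a pie slice representing `count` out of `total`."""
    return 360.0 * count / total


lopez = slice_angle(accounts["Lopez"], total)
francois = slice_angle(accounts["Francois"], total)

print(f"Lopez:    {lopez:.1f} degrees")
print(f"Francois: {francois:.1f} degrees")
print(f"Difference: {lopez - francois:.1f} degrees")
```

The two slices differ by only about 2 degrees of the full 360-degree circle, which helps explain why Lopez and Francois are hard to rank in the pie chart but easy to rank in the sorted bar chart.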


3.3 Charts

FIGURE 3.24 Sorted Bar Chart for Accounts Managed Data

Manager     Accounts Managed
Williams                   6
Edwards                   11
Jones                     15
Smith                     21
Davis                     24
Francois                  28
Lopez                     29
Gentry                    37

[Bar chart titled “Bar Chart of Accounts Managed”; horizontal axis: Accounts Managed (0 to 40); vertical axis: Manager; bars ordered from Gentry at the top to Williams at the bottom]

FIGURE 3.25 Bar Chart with Data Labels for Accounts Managed Data

[Same sorted worksheet data and bar chart as in Figure 3.24, with a data label at the end of each bar: Gentry 37, Lopez 29, Francois 28, Davis 24, Smith 21, Jones 15, Edwards 11, Williams 6]

FIGURE 3.26 Pie Chart of Accounts Managed

[Pie chart with one slice per manager: Davis, Edwards, Francois, Gentry, Jones, Lopez, Smith, Williams]

Because of the difficulty in visually comparing area, many experts also recommend against the use of three-dimensional (3-D) charts in most settings. Excel makes it very easy to create 3-D bar, line, pie, and other types of charts. In most cases, however, the 3-D effect simply adds unnecessary detail that does not help explain the data. As an alternative, consider the use of multiple lines on a line chart (instead of adding a z-axis), employing multiple charts, or creating bubble charts in which the size of the bubble can represent the z-axis value. Never use a 3-D chart when a two-dimensional chart will suffice.

Bubble Charts

Data file: Billionaires

A bubble chart is a graphical means of visualizing three variables in a two-dimensional graph and is therefore sometimes a preferred alternative to a 3-D graph. Suppose that we want to compare the number of billionaires in various countries. Table 3.11 provides a sample of six countries, showing, for each country, the number of billionaires per 10 million residents, the per capita income, and the total number of billionaires.

Table 3.11 Sample Data on Billionaires per Country

Country          Billionaires per 10M Residents   Per Capita Income   No. of Billionaires
United States                              54.7             $54,600                 1,764
China                                       1.5             $12,880                   213
Germany                                    12.5             $45,888                   103
India                                       0.7              $5,855                    90
Russia                                      6.2             $24,850                    88
Mexico                                      1.2             $17,881                    15

We can create a bubble chart using Excel to further examine these data:

Step 1. Select cells B2:D7
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click Insert Scatter (X,Y) or Bubble Chart
        In the Bubble subgroup, click Bubble
Step 4. Select the chart that was just created to reveal the Chart Buttons
Step 5. Click the Chart Elements button
        Select the check boxes for Axes, Axis Titles, Chart Title, and Data Labels
        Deselect the check box for Gridlines
        Click on the text box under the horizontal axis, and replace “Axis Title” with Billionaires per 10 Million Residents
        Click on the text box next to the vertical axis, and replace “Axis Title” with Per Capita Income
        Click on the text box above the chart, and replace “Chart Title” with Billionaires by Country
Step 6. Double-click on one of the Data Labels in the chart (e.g., the “$54,600” next to the largest bubble in the chart) to reveal the Format Data Labels task pane
Step 7. In the Format Data Labels task pane, click the Label Options icon and open the Label Options area
        Under Label Contains, select Value from Cells and click the Select Range… button
        When the Data Label Range dialog box opens, select cells A2:A7 in the Worksheet
        Click OK
Step 8. In the Format Data Labels task pane, deselect Y Value under Label Contains, and select Right under Label Position

The completed bubble chart appears in Figure 3.27. The size of each bubble in Figure 3.27 is proportionate to the number of billionaires in that country. The per capita income and billionaires per 10 million residents are displayed on the vertical and horizontal axes, respectively. This chart shows us that the United States has the most billionaires and the highest number of billionaires per 10 million residents. We can also see that China has quite a few billionaires but with much lower per capita income and much lower billionaires per 10 million residents (because of China’s much larger population). Germany, Russia, and India all appear to have similar numbers of billionaires, but the per capita income and billionaires per 10 million residents are very different for each country. Bubble charts can be very effective for comparing categorical variables on two different quantitative values.
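One way to make bubble size proportionate to a third variable is to scale the area of each bubble, rather than its radius, by the value. The Python sketch below illustrates this under that assumption, using the billionaire counts from Table 3.11. The function name `bubble_radii` and the `max_radius` parameter are hypothetical, and the sketch does not claim to reproduce Excel's exact sizing.

```python
import math

# Number of billionaires per country, from Table 3.11.
billionaires = {
    "United States": 1764, "China": 213, "Germany": 103,
    "India": 90, "Russia": 88, "Mexico": 15,
}


def bubble_radii(counts, max_radius=50.0):
    """Return a radius per country so that bubble AREA is proportional to the count.

    Area grows with the square of the radius, so the radius is scaled
    by the square root of the count relative to the largest count.
    """
    largest = max(counts.values())
    return {
        name: max_radius * math.sqrt(value / largest)
        for name, value in counts.items()
    }


radii = bubble_radii(billionaires)
for name, radius in radii.items():
    print(f"{name:<15}{radius:6.1f}")
```

Scaling the radius directly by the count would exaggerate differences, because a bubble with twice the radius covers four times the area.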

Heat Maps

Data file: SameStoreSales

A heat map is a two-dimensional graphical representation of data that uses different shades of color to indicate magnitude. Figure 3.28 shows a heat map indicating the magnitude of changes for a metric called same-store sales, which is commonly used in the retail industry to measure trends in sales. The cells shaded red in Figure 3.28 indicate declining same-store sales for the month, and cells shaded blue indicate increasing same-store sales for the month. Column N in Figure 3.28 also contains sparklines for the same-store sales data. Figure 3.28 can be created in Excel by following these steps:

Step 1. Select cells B2:M17
Step 2. Click the Home tab on the Ribbon
Step 3. Click Conditional Formatting in the Styles group
        Select Color Scales and click on Blue–White–Red Color Scale

To add the sparklines in column N, we use the following steps:

Step 4. Select cell N2
Step 5. Click the Insert tab on the Ribbon
Step 6. Click Line in the Sparklines group
Step 7. When the Create Sparklines dialog box appears:
        Enter B2:M2 in the Data Range: box
        Enter N2 in the Location Range: box
        Click OK
Step 8. Copy cell N2 to N3:N17
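The idea behind a three-color scale can be sketched in a few lines of Python. The example below is a simplified illustration, not Excel's implementation: it assumes that white is anchored at zero, that the most negative value maps to pure red, and that the most positive value maps to pure blue. Excel's own rules for choosing the midpoint and the exact shades may differ. The function names `lerp` and `diverging_color` are hypothetical.

```python
def lerp(start, end, t):
    """Linearly interpolate between two color channel values (0 to 255)."""
    return round(start + (end - start) * t)


def diverging_color(value, low, high):
    """Map `value` to an (R, G, B) tuple on a red-white-blue scale.

    Assumptions of this sketch: `low` is negative and maps to red,
    0 maps to white, and `high` is positive and maps to blue.
    """
    red, white, blue = (255, 0, 0), (255, 255, 255), (0, 0, 255)
    if value >= 0:
        t = min(value / high, 1.0) if high > 0 else 0.0
        target = blue
    else:
        t = min(value / low, 1.0) if low < 0 else 0.0
        target = red
    return tuple(lerp(w, c, t) for w, c in zip(white, target))


# Same-store sales changes (in percent) for one location over 12 months.
cincinnati = [-9, -6, -7, -3, 3, 6, 8, 11, 10, 11, 13, 11]
colors = [diverging_color(v, low=-13, high=19) for v in cincinnati]
for value, color in zip(cincinnati, colors):
    print(f"{value:>4}%  ->  RGB{color}")
```

Values close to zero come out nearly white, while values near either extreme are strongly red or strongly blue, which is what lets the reader judge magnitude as well as direction.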

FIGURE 3.27 Bubble Chart Comparing Billionaires by Country

[Worksheet containing the Table 3.11 data in columns A through D, and a bubble chart titled “Billionaires by Country”; horizontal axis: Billionaires per 10 Million Residents; vertical axis: Per Capita Income; each bubble is labeled with its country name]

Both the heat map and the sparklines described here can also be created using the Quick Analysis button. To display this button, select cells B2:M17. The Quick Analysis button will appear at the bottom right of the selected cells. Click the button to display options for heat maps, sparklines, and other data-analysis tools.

The heat map in Figure 3.28 helps the reader to easily identify trends and patterns. We can see that Austin has had positive increases throughout the year, while Pittsburgh has had consistently negative same-store sales results. Same-store sales at Cincinnati started the year negative but then became increasingly positive after May. In addition, we can differentiate between strong positive increases in Austin and less substantial positive increases in Chicago by means of color shadings. A sales manager could use the heat map in Figure 3.28 to identify stores that may require intervention and stores that may be used as models. Heat maps can be used effectively to convey data over different areas, across time, or both, as seen here.

Because heat maps depend strongly on the use of color to convey information, one must be careful to make sure that the colors can be easily differentiated and that they do not become overwhelming. To avoid problems with interpreting differences in color, we can add sparklines as shown in column N of Figure 3.28. The sparklines clearly show the overall trend (increasing or decreasing) for each location. However, we cannot gauge differences in the magnitudes of increases and decreases among locations using sparklines. The combination of a heat map and sparklines here is a particularly effective way to show both trend and magnitude.

FIGURE 3.28 Heat Map and Sparklines for Same-Store Sales Data

Row  Location         JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC
2    St. Louis        -2%   -1%   -1%    0%    2%    4%    3%    5%    6%    7%    8%    8%
3    Phoenix           5%    4%    4%    2%    2%   -2%   -5%   -8%   -6%   -5%   -7%   -8%
4    Albany           -5%   -6%   -4%   -5%   -2%   -5%   -5%   -3%   -1%   -2%   -1%   -2%
5    Austin           16%   15%   15%   16%   18%   17%   14%   15%   16%   19%   18%   16%
6    Cincinnati       -9%   -6%   -7%   -3%    3%    6%    8%   11%   10%   11%   13%   11%
7    San Francisco     2%    4%    5%    8%    4%    2%    4%    3%    1%   -1%    1%    2%
8    Seattle           7%    7%    8%    7%    5%    4%    2%    0%   -2%   -4%   -6%   -5%
9    Chicago           5%    3%    2%    6%    8%    7%    8%    5%    8%   10%    9%    8%
10   Atlanta          12%   14%   13%   17%   12%   11%    8%    7%    7%    8%    5%    3%
11   Miami             2%    3%    0%    1%   -1%   -4%   -6%   -8%  -11%  -13%  -11%  -10%
12   Minneapolis      -6%   -6%   -8%   -5%   -6%   -5%   -5%   -7%   -5%   -2%   -1%   -2%
13   Denver            5%    4%    1%    1%    2%    3%    1%   -1%    0%    1%    2%    3%
14   Salt Lake City    7%    7%    7%   13%   12%    8%    5%    9%   10%    9%    7%    6%
15   Raleigh           4%    2%    0%    5%    4%    3%    5%    5%    9%   11%    8%    6%
16   Boston           -5%   -5%   -3%    4%   -5%   -4%   -3%   -1%    1%    2%    3%    5%
17   Pittsburgh       -6%   -6%   -4%   -5%   -3%   -3%   -1%   -2%   -2%   -1%   -2%   -1%

[Cells in columns B through M are shaded on a blue–white–red color scale; column N, headed SPARKLINES, contains a line sparkline for each location]
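The reason sparklines show trend but not magnitude can be illustrated with a small Python sketch. It assumes, as a simplification, that each sparkline is scaled to the minimum and maximum of its own row, so every row fills the same vertical range. The function name `sparkline` and the use of Unicode block characters are illustrative choices, not an Excel feature.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Render a list of numbers as a one-line text sparkline.

    Each series is scaled to its own minimum and maximum, so the
    picture shows the shape of the trend but not its magnitude.
    """
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - low) / span * top)] for v in values)


# Monthly same-store sales changes (in percent) for two locations.
austin = [16, 15, 15, 16, 18, 17, 14, 15, 16, 19, 18, 16]
cincinnati = [-9, -6, -7, -3, 3, 6, 8, 11, 10, 11, 13, 11]

print("Austin    ", sparkline(austin))
print("Cincinnati", sparkline(cincinnati))
```

Because of the per-row scaling, a series that moves from 1 to 3 and a series that moves from 10 to 30 produce the same sparkline, which is why the heat map is still needed to compare magnitudes across locations.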

Additional Charts for Multiple Variables

FIGURE 3.29 Stacked-Column Chart for Regional Sales Data for Kirkland Industries

Data file: KirklandRegional

         Sales ($100s)
Month    North    South
Jan         95       40
Feb        100       45
Mar        120       55
Apr        115       65
May        100       60
Jun         85       50
Jul        135       75
Aug        110       65
Sep        100       60
Oct         50       70
Nov         40       75
Dec         40       80

[Stacked-column chart with months Jan through Dec on the horizontal axis and Sales ($100s) on the vertical axis; the North and South values are stacked in each column]

Figure 3.29 provides an alternative display for the regional sales data of air compressors for Kirkland Industries. The figure uses a stacked-column chart to display the North and the South regional sales data previously shown in a line chart in Figure 3.21. We could also use a stacked-bar chart to display the same data by using horizontal bars instead of vertical. To create the stacked-column chart shown in Figure 3.29, we use the following steps:

Step 1. Select cells A2:C14
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click the Insert Column or Bar Chart button
        Select Stacked Column under 2-D Column

Note that here we have not included the additional steps for formatting the chart in Excel using the Chart Elements button, but the steps are similar to those used to create the previous charts.

Clustered-column (bar) charts are also referred to as side-by-side-column (bar) charts.
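The arithmetic behind a stacked-column chart can be sketched in Python as follows. This is an illustration only: the function name `stack` is hypothetical, and the choice to place North at the bottom of each column is an assumption of the sketch. The data are the Kirkland Industries values shown in Figure 3.29.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
north = [95, 100, 120, 115, 100, 85, 135, 110, 100, 50, 40, 40]
south = [40, 45, 55, 65, 60, 50, 75, 65, 60, 70, 75, 80]


def stack(series_list):
    """Compute where each series starts within every stacked column.

    Returns (baselines, totals): baselines[i][j] is the height at which
    series i begins in category j, and totals[j] is the full column height.
    """
    running = [0] * len(series_list[0])
    baselines = []
    for series in series_list:
        baselines.append(list(running))
        running = [r + v for r, v in zip(running, series)]
    return baselines, running


baselines, totals = stack([north, south])
for month, base, total in zip(months, baselines[1], totals):
    print(f"{month}: South segment spans {base} to {total}")
```

Only the bottom series starts at zero in every column. The South segments start at different heights each month, which is one reason values in the upper layers of a stacked chart are harder to compare.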

Stacked-column and stacked-bar charts allow the reader to compare the relative values of quantitative variables for the same category in a bar chart. However, these charts suffer from the same difficulties as pie charts because the human eye has difficulty perceiving small differences in areas. As a result, experts often recommend against the use of stacked-column and stacked-bar charts for more than a couple of quantitative variables in each category.

An alternative chart for these same data is called a clustered-column (or clustered-bar) chart. It is created in Excel following the same steps but selecting Clustered Column under 2-D Column in Step 3. Clustered-column and clustered-bar charts are often superior to stacked-column and stacked-bar charts for comparing quantitative variables, but they can become cluttered for more than a few quantitative variables per category.

An alternative that is often preferred to both stacked and clustered charts, particularly when many quantitative variables need to be displayed, is to use multiple charts. For the regional sales data, we would include two column charts: one for sales in the North and one for sales in the South. For additional regions, we would simply add additional column charts. To facilitate comparisons between the data displayed in each chart, it is important to maintain consistent axes from one chart to another. The categorical variables should be listed in the same order in each chart, and the axis for the quantitative variable should have the same range. For instance, the vertical axis for both North and South sales starts at 0 and ends at 140. This makes it easy to see that, in most months, the North region has greater sales.

Figure 3.30 compares the approaches using stacked-, clustered-, and multiple-column charts for the regional sales data. Figure 3.30 shows that the multiple-column charts require considerably more space than the stacked- and clustered-column charts. However, when comparing many quantitative variables, using multiple charts can often be superior even if each chart must be made smaller. Stacked-column and stacked-bar charts should be used only when comparing a few quantitative variables and when there are large differences in the relative values of the quantitative variables within the category.

FIGURE 3.30 Comparing Stacked-, Clustered-, and Multiple-Column Charts for the Regional Sales Data for Kirkland Industries

[Three panels, each with months Jan through Dec on the horizontal axis and Sales ($100s) on the vertical axis: a stacked-column chart, a clustered-column chart, and multiple-column charts (one for North and one for South) that share a vertical axis from 0 to 140]

An especially useful chart for displaying multiple variables is the scatter-chart matrix. Table 3.12 contains a partial listing of the data for each of New York City’s 55 sub-boroughs (a designation of a community within New York City) on monthly median rent, percentage of college graduates, poverty rate, and mean travel time to work. Suppose we want to examine the relationship between these different variables.

Table 3.12 Rental Data for New York City Sub-Boroughs

Data file: NYCityData

Area (Sub-Borough)             Median Monthly Rent ($)   Percentage College Graduates (%)   Poverty Rate (%)   Travel Time (min)
Astoria                                          1,106                               36.8               15.9                35.4
Bay Ridge                                        1,082                               34.3               15.6                41.9
Bayside/Little Neck                              1,243                               41.3                7.6                40.6
Bedford Stuyvesant                                 822                               21.0               34.2                40.5
Bensonhurst                                        876                               17.7               14.4                44.0
Borough Park                                       980                               26.0               27.6                35.3
Brooklyn Heights/Fort Greene                     1,086                               55.3               17.4                34.5
Brownsville/Ocean Hill                             714                               11.6               36.0                40.3
Bushwick                                           945                               13.3               33.5                35.5
Central Harlem                                     665                               30.6               27.1                25.0
Chelsea/Clinton/Midtown                          1,624                               66.1               12.7                43.7
Coney Island                                       786                               27.2               20.0                46.3
…                                                    …                                  …                  …                   …

Figure 3.31 displays a scatter-chart matrix (scatter-plot matrix) for data related to rentals in New York City. A scatter-chart matrix allows the reader to easily see the relationships among multiple variables. Each scatter chart in the matrix is created in the same manner as for creating a single scatter chart. Each column and row in the scatter-chart matrix corresponds to one variable. For instance, row 1 and column 1 in Figure 3.31 correspond to the median monthly rent variable. Row 2 and column 2 correspond to the percentage of college graduates variable. Therefore, the scatter chart shown in row 1, column 2 shows the relationship between median monthly rent (on the y-axis) and the percentage of college graduates (on the x-axis) in New York City sub-boroughs. The scatter chart shown in row 2, column 3 shows the relationship between the percentage of college graduates (on the y-axis) and poverty rate (on the x-axis).

FIGURE 3.31 Scatter-Chart Matrix for New York City Rental Data

[4 × 4 matrix of scatter charts; rows 1 through 4 and columns 1 through 4 correspond, in order, to MedianRent, CollegeGraduates, PovertyRate, and CommuteTime]

We demonstrate how to create scatter-chart matrixes in several different software packages in the online appendix.

The scatter charts along the diagonal in a scatter-chart matrix (e.g., in row 1, column 1 and in row 2, column 2) display the relationship between a variable and itself. Therefore, the points in these scatter charts will always fall along a straight line at a 45-degree angle, as shown in Figure 3.31.

Figure 3.31 allows us to infer several interesting findings. Because the points in the scatter chart in row 1, column 2 generally get higher moving from left to right, this tells us that sub-boroughs with higher percentages of college graduates appear to have higher median monthly rents. The scatter chart in row 1, column 3 indicates that sub-boroughs with higher poverty rates appear to have lower median monthly rents. The data in row 2, column 3 show that sub-boroughs with higher poverty rates tend to have lower percentages of college graduates. The scatter charts in column 4 show that the relationships between the mean travel time and the other variables are not as clear as relationships in other columns.

The scatter-chart matrix is very useful in analyzing relationships among variables. Unfortunately, it is not possible to generate a scatter-chart matrix using standard Excel functions. Each scatter chart must be created individually in Excel using the data from those two variables to be displayed on the chart.
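A numerical companion to the scatter-chart matrix is a matrix of pairwise correlation coefficients, with one entry for each pair of variables. The Python sketch below is an illustration rather than a substitute for the charts. It uses only the 12 sub-boroughs listed in Table 3.12, not all 55, so the resulting values are not those of the full data set. The function name `pearson` and the variable names are hypothetical.

```python
from math import sqrt

names = ["MedianRent", "CollegeGraduates", "PovertyRate", "CommuteTime"]

# The 12 rows listed in Table 3.12: rent ($), college graduates (%),
# poverty rate (%), travel time (min). The full data set has 55 rows.
rows = [
    (1106, 36.8, 15.9, 35.4), (1082, 34.3, 15.6, 41.9),
    (1243, 41.3, 7.6, 40.6), (822, 21.0, 34.2, 40.5),
    (876, 17.7, 14.4, 44.0), (980, 26.0, 27.6, 35.3),
    (1086, 55.3, 17.4, 34.5), (714, 11.6, 36.0, 40.3),
    (945, 13.3, 33.5, 35.5), (665, 30.6, 27.1, 25.0),
    (1624, 66.1, 12.7, 43.7), (786, 27.2, 20.0, 46.3),
]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


columns = list(zip(*rows))  # one tuple of values per variable
matrix = [[pearson(columns[i], columns[j]) for j in range(4)] for i in range(4)]

for name, row in zip(names, matrix):
    print(f"{name:<17}" + "".join(f"{value:7.2f}" for value in row))
```

The layout mirrors the scatter-chart matrix: the entry in row 1, column 2 summarizes the same pair of variables as the scatter chart in row 1, column 2, and the diagonal entries equal 1 because each variable is perfectly related to itself.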

PivotCharts in Excel

Data file: Restaurant

To summarize and analyze data with both a crosstabulation and charting, Excel pairs PivotCharts with PivotTables. Using the restaurant data introduced in Table 3.7 and Figure 3.7, we can create a PivotChart by taking the following steps:

Step 1. Click the Insert tab on the Ribbon
Step 2. In the Charts group, select PivotChart
Step 3. When the Create PivotChart dialog box appears:
        Choose Select a Table or Range
        Enter A1:D301 in the Table/Range: box
        Select New Worksheet as the location for the PivotTable Report
        Click OK
Step 4. In the PivotChart Fields area, under Choose fields to add to report:
        Drag the Quality Rating field to the AXIS (CATEGORIES) area
        Drag the Meal Price ($) field to the LEGEND (SERIES) area
        Drag the Wait Time (min) field to the VALUES area
Step 5. Click on Sum of Wait Time (min) in the Values area
Step 6. Select Value Field Settings… from the list of options that appear
Step 7. When the Value Field Settings dialog box appears:
        Under Summarize value field by, select Average
        Click Number Format
        In the Category: box, select Number
        Enter 1 for Decimal places:
        Click OK
        When the Value Field Settings dialog box reappears, click OK
Step 8. Right-click in cell B2 or any cell containing a meal price column label
Step 9. Select Group from the list of options that appears
Step 10. When the Grouping dialog box appears:
        Enter 10 in the Starting at: box
        Enter 49 in the Ending at: box
        Enter 10 in the By: box
        Click OK
Step 11. Right-click on “Excellent” in cell A5
Step 12. Select Move and click Move “Excellent” to End

The completed PivotTable and PivotChart appear in Figure 3.32. The PivotChart is a clustered-column chart whose column heights correspond to the average wait times and are clustered into the categorical groupings of Good, Very Good, and Excellent. The columns are different colors to differentiate the wait times at restaurants in the various meal price ranges. Figure 3.32 shows that Excellent restaurants have longer wait times than Good and Very Good restaurants. We also see that Excellent restaurants in the price range of $30–$39 have the longest wait times. The PivotChart displays the same information as that of the PivotTable in Figure 3.13, but the column chart used here makes it easier to compare the restaurants based on quality rating and meal price.

Like PivotTables, PivotCharts are interactive. You can use the arrows on the axes and legend labels to change the categorical data being displayed. For example, you can click on the Quality Rating horizontal axis label (see Figure 3.32) and choose to look at only Very Good and Excellent restaurants, or you can click on the Meal Price ($) legend label and choose to view only certain meal price categories.

FIGURE 3.32 PivotTable and PivotChart for the Restaurant Data

Average of Wait Time (min)   Column Labels
Row Labels     10–19   20–29   30–39   40–49   Grand Total
Good             2.6     2.5     0.5                   2.5
Very Good       12.6    12.6    12.0    10.0          12.3
Excellent       25.5    29.1    34.0    32.3          32.1
Grand Total      7.6    11.1    19.8    27.5          13.9

[Clustered-column PivotChart; horizontal axis: Quality Rating (Good, Very Good, Excellent); vertical axis: Average Wait Time (min), 0.0 to 40.0; legend: Meal Price ($) with groups 10–19, 20–29, 30–39, and 40–49]
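The grouping and averaging that the PivotTable performs in Steps 4 through 10 can be sketched in Python. The records below are hypothetical and are included only so that the example runs; they are not taken from the Restaurant data file. The function names `price_bin` and `average_wait` are also hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (quality rating, meal price in $, wait time in min).
records = [
    ("Good", 14, 1), ("Good", 18, 5), ("Good", 22, 6),
    ("Very Good", 28, 10), ("Very Good", 33, 5),
    ("Excellent", 32, 30), ("Excellent", 36, 47), ("Excellent", 45, 20),
]


def price_bin(price, start=10, width=10):
    """Label the price group, e.g. 10-19 or 20-29, that contains `price`."""
    low = start + (price - start) // width * width
    return f"{low}-{low + width - 1}"


def average_wait(records):
    """Average wait time for each (quality rating, price group) cell,
    rounded to one decimal place as in Step 7."""
    cells = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for quality, price, wait in records:
        cell = cells[(quality, price_bin(price))]
        cell[0] += wait
        cell[1] += 1
    return {key: round(total / count, 1) for key, (total, count) in cells.items()}


averages = average_wait(records)
for (quality, group), value in sorted(averages.items()):
    print(f"{quality:<10}{group:<7}{value:5.1f}")
```

Each key of `averages` corresponds to one cell of the crosstabulation and therefore to one column of the clustered-column PivotChart. Combinations with no records simply do not appear, just as the PivotTable leaves such cells blank.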

Notes + Comments

1. Excel assumes that line charts will be used to graph only time series data. The Line Chart tool in Excel is the most intuitive for creating charts that include text entries for the horizontal axis (e.g., the month labels of Jan, Feb, Mar, etc. for the monthly sales data in Figure 3.19). When the horizontal axis represents numerical values (1, 2, 3, etc.), then it is easiest to go to the Charts group under the Insert tab in the Ribbon, click the Insert Scatter (X,Y) or Bubble Chart button, and then select the Scatter with Straight Lines and Markers button.
2. Color is frequently used to differentiate elements in a chart. However, be wary of the use of color to differentiate for several reasons: (1) Many people are color-blind and may not be able to differentiate colors. (2) Many charts are printed in black and white as handouts, which reduces or eliminates the impact of color. (3) The use of too many colors in a chart can make the chart appear too busy and distract or even confuse the reader. In many cases, it is preferable to differentiate chart elements with dashed lines, patterns, or labels.
3. Histograms and boxplots (discussed in Chapter 2 in relation to analyzing distributions) are other effective data-visualization tools for summarizing the distribution of data.

3.4 Advanced Data Visualization

In this chapter, we have presented only some of the most basic ideas for using data visualization effectively both to analyze data and to communicate data analysis to others. The charts discussed so far are those most commonly used and will suffice for most data-visualization needs. However, many additional concepts, charts, and tools can be used to improve your data-visualization techniques. In this section we briefly mention some of them.

Advanced Charts

Although line charts, bar charts, scatter charts, and bubble charts suffice for most data-visualization applications, other charts can be very helpful in certain situations. One type of helpful chart for examining data with more than two variables is the parallel-coordinates plot, which includes a different vertical axis for each variable. Each observation in the data set is represented by drawing a line on the parallel-coordinates plot connecting each vertical axis. The height of the line on each vertical axis represents the value taken by that observation for the variable corresponding to the vertical axis. For instance, Figure 3.33 displays a parallel-coordinates plot for a sample of Major League Baseball players. The figure contains data for 10 players who play first base (1B) and 10 players who play second base (2B). For each player, the leftmost vertical axis plots his total number of home runs (HR). The center vertical axis plots the player’s total number of stolen bases (SB), and the rightmost vertical axis plots the player’s batting average (AVG). Various colors differentiate 1B players from 2B players (1B players are in blue and 2B players are in red).

We can make several interesting statements upon examining Figure 3.33. The sample of 1B players tend to hit lots of HR but have very few SB. Conversely, the sample of 2B players steal more bases but generally have fewer HR, although some 2B players have many HR and many SB. Finally, 1B players tend to have higher batting averages (AVG) than 2B players. We may infer from Figure 3.33 that the traits of 1B players may be different from those of 2B players. In general, this statement is true. Players at 1B tend to be offensive stars who hit for power and average, whereas players at 2B are often faster and more agile in order to handle the defensive responsibilities of the position (traits that are not common in strong HR hitters). Parallel-coordinates plots, in which you can differentiate categorical variable values using color as in Figure 3.33, can be very helpful in identifying common traits across multiple dimensions.

FIGURE 3.33 Parallel Coordinates Plot for Baseball Data

[Parallel-coordinates plot with three vertical axes: HR (0 to 39), SB (1 to 30), and AVG (0.222 to 0.338); each line is one player, colored by position (1B or 2B)]

A treemap is useful for visualizing hierarchical data along multiple dimensions. SmartMoney’s Map of the Market, shown in Figure 3.34, is a treemap for analyzing stock market performance. In the Map of the Market, each rectangle represents a particular company (Apple, Inc. is highlighted in Figure 3.34). The color of the rectangle represents the overall performance of the company’s stock over the previous 52 weeks. The Map of the Market is also divided into market sectors (Health Care, Financials, Oil & Gas, etc.). The size of each company’s rectangle provides information on the company’s market capitalization size relative to the market sector and the entire market. Figure 3.34 shows that Apple has a very large market capitalization relative to other firms in the Technology sector and that it has performed exceptionally well over the previous 52 weeks. An investor can use the treemap in Figure 3.34 to quickly get an idea of the performance of individual companies relative to other companies in their market sector as well as the performance of entire market sectors relative to other sectors.

Excel allows the user to create treemap charts. The step-by-step directions below explain how to create a treemap in Excel for the top-100 global companies based on 2014 market value using data in the file Global100. In this file we provide the continent where the company is headquartered in column A, the country headquarters in column B, the name of the company in column C, and the market value in column D. For the treemap to display properly in Excel, the data should be sorted by column A, “Continent,” which is the highest level of the hierarchy. Note that the treemap chart is not available in older versions of Excel.

Data file: Global100

Step 1. Select cells A1:D101
Step 2. Click Insert on the Ribbon
        Click on the Insert Hierarchy Chart button in the Charts group
        Select Treemap from the drop-down menu
Step 3. When the treemap chart appears, right-click on the treemap portion of the chart
        Select Format Data Series… in the pop-up menu
        When the Format Data Series task pane opens, select Banner

The Map of the Market is based on work done by Professor Ben Shneiderman and students at the University of Maryland Human–Computer Interaction Lab.

FIGURE 3.34 SmartMoney’s Map of the Market as an Example of a Treemap

[Screenshot of the Map of the Market: rectangles grouped by market sector (Utilities, Oil & Gas, Health Care, Financials, Basic Materials, Industrials, Consumer Goods, Technology, Telecommunications, Consumer Services) and colored by percentage change on a scale from -55.6% to +55.6%; Apple Inc. is highlighted with a change of +58.21%]

Figure 3.35 shows the completed treemap created with Excel. Selecting Banner in Step 3 places the name of each continent as a banner title within the treemap. Each continent is also assigned a different color within the treemap. From this figure we can see that North America has more top-100 companies than any other continent, followed by Europe and then Asia. The size of the rectangle for each company in the treemap represents its relative market value. We can see that Apple, Exxon Mobil, Google, and Microsoft have the four highest market values. Australia has only four companies in the top 100 and South America has two. Africa and Antarctica have no companies in the top 100. Hovering your pointer over one of the companies in the treemap will display the market value for that company.

FIGURE 3.35 Treemap Created in Excel for Top 100 Global Companies Data
[Excel worksheet with columns Continent, Country, Company, and Market Value (Billions US $), shown beside a treemap titled "Top 100 Global Companies by Market Value" with banners for Asia, Australia, Europe, North America, and South America.]
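The sizing rule behind a treemap can be sketched in a few lines of Python. The sketch below is only an illustration of the idea, not a description of how Excel lays out its chart: it takes a handful of rows as listed in the Figure 3.35 worksheet (the full Global100 file has 100 rows), sums market value by continent, and reports the share of the total area that each continent block and each company rectangle would receive. The function name `treemap_shares` is our own.

```python
from collections import defaultdict

# A few (continent, company, market value in billions US $) rows as listed in
# the Figure 3.35 worksheet; this is a small subset of the Global100 data.
rows = [
    ("Asia", "ICBC", 215.6),
    ("Asia", "PetroChina", 202.0),
    ("Asia", "Toyota Motor", 193.5),
    ("Australia", "BHP Billiton", 182.3),
    ("Australia", "Commonwealth Bank", 114.5),
    ("Europe", "Anheuser-Busch InBev", 171.2),
    ("Europe", "Total", 149.8),
]


def treemap_shares(rows):
    """Return {continent: (share of total, {company: share of continent})}."""
    totals = defaultdict(float)
    for continent, _, value in rows:
        totals[continent] += value
    grand_total = sum(totals.values())

    shares = {}
    for continent, total in totals.items():
        # Each company's rectangle is sized relative to its continent block.
        within = {c: v / total for cont, c, v in rows if cont == continent}
        shares[continent] = (total / grand_total, within)
    return shares


shares = treemap_shares(rows)
for continent, (share, companies) in shares.items():
    print(f"{continent}: {share:.1%} of the total area")
    for company, part in companies.items():
        print(f"  {company}: {part:.1%} of the {continent} block")
```

Because every rectangle is a share of its parent block, the shares within each continent sum to 1, as do the continent shares themselves. This is what allows a treemap to show both levels of the hierarchy in a single chart.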

Geographic Information Systems Charts

A GIS chart such as that shown in Figure 3.36 is an example of geoanalytics, the use of data by geographical area or some other form of spatial referencing to generate insights.

Consider the case of the Cincinnati Zoo & Botanical Garden, which derives much of its revenue from selling annual memberships to customers. The Cincinnati Zoo would like to better understand where its current members are located. Figure 3.36 displays a map of the Cincinnati, Ohio, metropolitan area showing the relative concentrations of Cincinnati Zoo members. The more darkly shaded areas represent areas with a greater number of members. Figure 3.36 is an example of the output from a geographic information system (GIS), which merges maps and statistics to present data collected over different geographic areas. Displaying geographic data on a map can often help in interpreting data and observing patterns. The GIS chart in Figure 3.36 combines a heat map and a geographical map to help the reader analyze this data set. From the figure we can see a high concentration of zoo members in a band to the northeast of the zoo that includes the cities of Mason and Hamilton (circled). Also, a high concentration of zoo members lies to the southwest of the zoo around the city of Florence. These observations could prompt the zoo manager to identify the shared characteristics of the populations of Mason, Hamilton, and Florence to learn what is leading them to be zoo members. If these characteristics can be identified, the manager can then try to identify other nearby populations that share these characteristics as potential markets for increasing the number of zoo members.

FIGURE 3.36 GIS Chart for Cincinnati Zoo Member Data
[Map of the Cincinnati, Ohio, metropolitan area, including parts of Ohio, Indiana, and Kentucky, shaded by the concentration of zoo members.]

3D Maps is not available in older versions of Excel.

More recent versions of Excel have a feature called 3D Maps that allows the user to create interactive GIS-type charts. This tool is quite powerful, and the full capabilities are beyond the scope of this text. The step-by-step directions below show an example using data from the World Bank on gross domestic product (GDP) for countries around the world.

WorldGDP 2014

Step 1. Select cells A1:C191
Step 2. Click the Insert tab on the Ribbon
    Click the 3D Map button in the Tours group
    Select Open 3D Maps. This will open a new Excel window that displays a world map (see Figure 3.37)
Step 3. Drag GDP 2014 (Billions US $) from the Field List to the Height box in the Data area of the Layer 1 task pane
    Click the Change the visualization to Region button in the Data area of the Layer 1 task pane
Step 4. Click Layer Options in the Layer 1 task pane
    Change the Color to a dark red color to give the countries more differentiation on the world map

FIGURE 3.37 Initial Window Opened by Clicking on 3D Map Button in Excel for World GDP Data

FIGURE 3.38 Completed 3D Map Created in Excel for World GDP Data

The completed GIS chart is shown in Figure 3.38. You can now click and drag the world map to view different parts of the world. Figure 3.38 shows much of Europe and Asia. The countries with the darker shading have higher GDPs. We can see that China has a very dark shading, indicating a very high GDP relative to other countries. Russia and Germany have slightly darker shadings than other countries shown, indicating that Russia and Germany have higher GDPs than most other countries, but lower GDPs than China. If you hover over a country, it will display the Country Name and GDP 2014 (Billions US $) in a pop-up window. In Figure 3.38 we have hovered over China to display its GDP.
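The idea that darker shading indicates a higher GDP can be illustrated with a short Python sketch. We do not know the exact rule Excel's 3D Maps uses to choose shades, so the linear scaling below is an assumption made for illustration. The function name `shade_for`, the two endpoint colors, and the GDP values are all hypothetical.

```python
def shade_for(value, low, high):
    """Map a value to a red shade: higher values give a darker hex color."""
    if high == low:
        fraction = 1.0
    else:
        fraction = (value - low) / (high - low)
    fraction = min(max(fraction, 0.0), 1.0)  # keep the fraction in [0, 1]

    # Interpolate from a light red (255, 230, 230) to a dark red (139, 0, 0).
    light, dark = (255, 230, 230), (139, 0, 0)
    r, g, b = (round(l + fraction * (d - l)) for l, d in zip(light, dark))
    return f"#{r:02x}{g:02x}{b:02x}"


# Hypothetical GDP values in billions of US $, for illustration only.
gdp = {"Country A": 10000.0, "Country B": 4000.0, "Country C": 500.0}

low, high = min(gdp.values()), max(gdp.values())
colors = {name: shade_for(value, low, high) for name, value in gdp.items()}

for name, color in colors.items():
    print(f"{name}: GDP {gdp[name]:,.0f} -> {color}")
```

With a linear scale, one very large value pushes most other values toward the light end of the range, which is consistent with a map in which only a few countries appear noticeably darker than the rest.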

3.5 Data Dashboards

A data dashboard is a data-visualization tool that illustrates multiple metrics and automatically updates these metrics as new data become available. It is like an automobile’s dashboard instrumentation that provides information on the vehicle’s current speed, fuel level, and engine temperature so that a driver can assess current operating conditions and take effective action. Similarly, a data dashboard provides the important metrics that managers need to quickly assess the performance of their organization and react accordingly. In this section we provide guidelines for creating effective data dashboards and an example application.

Principles of Effective Data Dashboards

In an automobile dashboard, values such as current speed, fuel level, and oil pressure are displayed to give the driver a quick overview of current operating characteristics. In a business, the equivalent values are often indicative of the business’s current operating characteristics, such as its financial position, the inventory on hand, customer service metrics, and the like. These values are typically known as key performance indicators (KPIs).

Key performance indicators are sometimes referred to as key performance metrics (KPMs).

A data dashboard should provide timely summary information on KPIs that are important to the user, and it should do so in a manner that informs rather than overwhelms its user. Ideally, a data dashboard should present all KPIs on a single screen that a user can quickly scan to understand the business’s current state of operations. Rather than requiring the user to scroll vertically and horizontally to see the entire dashboard, it is better to create multiple dashboards so that each dashboard can be viewed on a single screen.

The KPIs displayed in the data dashboard should convey meaning to its user and be related to the decisions the user makes. For example, the data dashboard for a marketing manager may have KPIs related to current sales measures and sales by region, while the data dashboard for a Chief Financial Officer should provide information on the current financial standing of the company, including cash on hand, current debt obligations, and so on.

A data dashboard should call attention to unusual measures that may require attention, but not in an overwhelming way. Color should be used to call attention to specific values and to differentiate categorical variables, but the use of color should be restrained. Too many different or too bright colors make the presentation distracting and difficult to read.
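One simple way to call attention to unusual measures without overwhelming the user is to mark only the KPIs that fall outside an acceptable range. The Python sketch below illustrates this idea under our own assumptions: the `KPI` class, the KPI names, and the ranges are hypothetical and are not taken from any dashboard in this chapter.

```python
from dataclasses import dataclass


@dataclass
class KPI:
    name: str
    value: float
    low: float   # lower bound of the acceptable range
    high: float  # upper bound of the acceptable range

    def needs_attention(self) -> bool:
        """True when the current value falls outside the acceptable range."""
        return not (self.low <= self.value <= self.high)


# Hypothetical KPIs and ranges, chosen only for illustration.
kpis = [
    KPI("Cash on hand ($ millions)", 12.4, 10.0, 50.0),
    KPI("Inventory on hand (days)", 71.0, 20.0, 60.0),
    KPI("Customer satisfaction (%)", 91.0, 85.0, 100.0),
]

flagged = [k.name for k in kpis if k.needs_attention()]

# Mark only the out-of-range KPIs so that the display stays restrained.
for k in kpis:
    marker = "!" if k.needs_attention() else " "
    print(f"{marker} {k.name}: {k.value}")
```

In a real dashboard the marker would typically be a single accent color rather than a text symbol. The principle is the same: highlight the exceptions, and leave everything else visually quiet.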

Applications of Data Dashboards

To illustrate the use of a data dashboard in decision making, we discuss an application involving the Grogan Oil Company, which has offices located in three cities in Texas: Austin (its headquarters), Houston, and Dallas. Grogan’s Information Technology (IT) call center, located in Austin, handles calls from employees regarding computer-related problems involving software, Internet, and email issues. For example, if a Grogan employee in Dallas has a computer software problem, the employee can call the IT call center for assistance.

The data dashboard shown in Figure 3.39, developed to monitor the performance of the call center, combines several displays to track the call center’s KPIs. The data presented are for the current shift, which started at 8:00 a.m. The stacked-column chart in the upper left-hand corner shows the call volume for each type of problem (software, Internet, or email) over time. This chart shows that call volume is heavier during the first few hours of the shift, that calls concerning email issues appear to decrease over time, and that the volume of calls regarding software issues is highest at midmorning.

The column chart in the upper right-hand corner of the dashboard shows the percentage of time that call center employees spent on each type of problem or were idle (not working on a call). These top two charts are important displays in determining optimal staffing levels. For instance, knowing the call mix and how stressed the system is, as measured by percentage of idle time, can help the IT manager make sure that enough call center employees are available with the right level of expertise.

The clustered-bar chart in the middle right of the dashboard shows the call volume by type of problem for each of Grogan’s offices. This allows the IT manager to quickly identify whether there is a particular type of problem by location.
For example, the office in Austin seems to be reporting a relatively high number of issues with email. If the source of the problem can be identified quickly, then the problem might be resolved quickly for many users all at once. Also, note that a relatively high number of software problems are coming from the Dallas office. In this case, the Dallas office is installing new software, resulting in more calls to the IT call center. Having been alerted to this by the Dallas office last week, the IT manager knew that calls coming from the Dallas office would spike, so the manager proactively increased staffing levels to handle the expected increase in calls.

For each unresolved case that was received more than 15 minutes ago, the bar chart shown in the middle left of the data dashboard displays the length of time for which each case has been unresolved. This chart enables Grogan to quickly monitor the key problem cases and decide whether additional resources may be needed to resolve them. The worst case, T57, has been unresolved for over 300 minutes and is actually left over from the previous shift. Finally, the chart in the bottom panel shows the length of time required for resolved cases during the current shift. This chart is an example of a frequency distribution for quantitative data.

Chapter 2 discusses the construction of frequency distributions for quantitative and categorical data.

FIGURE 3.39 Data Dashboard for the Grogan Oil Information Technology Call Center
[Dashboard header: Grogan Oil IT Call Center, Shift 1, 19-Sep-12, 12:44:00 PM. Panels: Call Volume (number of calls by hour, stacked by Email, Internet, and Software); Time Breakdown This Shift (percentage of time for Email, Internet, Software, and Idle); Call Volume by Office (number of calls by problem type for Austin, Houston, and Dallas); Unresolved Cases Beyond 15 Minutes (minutes unresolved by case number); Time to Resolve a Case (frequency by minutes).]

Throughout the dashboard, a consistent color coding scheme is used for problem type (Email, Software, and Internet). Other dashboard designs are certainly possible, and improvements could certainly be made to the design shown in Figure 3.39. However, what is important is that information is clearly communicated so that managers can improve their decision making.

The Grogan Oil data dashboard presents data at the operational level, is updated in real time, and is used for operational decisions such as staffing levels. Data dashboards may also be used at the tactical and strategic levels of management. For example, a sales manager could monitor sales by salesperson, by region, by product, and by customer. This would alert the sales manager to changes in sales patterns. At the highest level, a more strategic dashboard would allow upper management to quickly assess the financial health of the company by monitoring more aggregate financial, service-level, and capacity-utilization information.
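Two of the displays described above are simple summaries of case-level data, and a short Python sketch can show how such summaries might be computed. This is a minimal illustration under our own assumptions, not the method used to build Figure 3.39. The function names are our own, and the case records and resolution times are hypothetical; only case T57 and its 300-plus minutes echo the text.

```python
from collections import Counter

# Hypothetical open cases: (case number, problem type, minutes unresolved).
# T57 at more than 300 minutes echoes the text; the other values are invented.
open_cases = [
    ("T57", "Software", 310),
    ("W24", "Internet", 45),
    ("W59", "Internet", 22),
    ("W5", "Email", 18),
    ("W61", "Software", 9),
]


def unresolved_beyond(cases, threshold_minutes=15):
    """Cases open longer than the threshold, longest first."""
    late = [c for c in cases if c[2] > threshold_minutes]
    return sorted(late, key=lambda c: c[2], reverse=True)


def resolution_frequencies(minutes, width=1):
    """Frequency distribution of resolution times in bins [k, k + width)."""
    counts = Counter(int(m // width) * width for m in minutes)
    return dict(sorted(counts.items()))


# Hypothetical resolution times (minutes) for cases closed during the shift.
resolved_minutes = [0.5, 2.2, 2.9, 3.1, 4.8, 4.0, 7.5, 12.3]

late_cases = unresolved_beyond(open_cases)
freq = resolution_frequencies(resolved_minutes)

for case, problem, minutes in late_cases:
    print(f"{case} ({problem}): unresolved for {minutes} minutes")
for start, count in freq.items():
    print(f"{start}-{start + 1} minutes: {count} case(s)")
```

Sorting the unresolved cases from longest to shortest puts the case that most needs attention at the top of the bar chart, and the one-minute bins mirror the classes used in the Time to Resolve a Case panel.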

Notes + Comments

1. The creation of data dashboards in Excel generally requires the use of macros written using Visual Basic for Applications (VBA). The use of VBA is beyond the scope of this textbook, but VBA is a powerful programming tool that can greatly increase the capabilities of Excel for analytics, including data visualization. Dedicated data visualization software packages, such as Tableau, make it much easier to create data dashboards.
2. The appendix to this chapter provides instructions for creating basic data visualizations in Tableau. Online appendices available for this text provide instructions for creating visualizations in other common analytics software.

Summary

In this chapter we covered techniques and tools related to data visualization. We discussed several important techniques for enhancing visual presentation, such as improving the clarity of tables and charts by removing unnecessary lines and presenting numerical values only to the precision necessary for analysis. We explained that tables can be preferable to charts for data visualization when the user needs to know exact numerical values. We introduced crosstabulation as a form of a table for two variables and explained how to use Excel to create a PivotTable. We presented many charts in detail for data visualization, including scatter charts, line charts, bar and column charts, bubble charts, and heat maps. We explained that pie charts and three-dimensional charts are almost never preferred tools for data visualization and that bar (or column) charts are usually much more effective than pie charts. We also discussed several advanced data-visualization charts, such as parallel-coordinates plots, treemaps, and GIS charts. We introduced data dashboards as a data-visualization tool that provides a summary of a firm’s operations in visual form to allow managers to quickly assess the current operating conditions and to aid decision making.

Many other types of charts can be used for specific forms of data visualization, but we have covered many of the most-popular and most-useful ones. Data visualization is very important for helping someone analyze data and identify important relations and patterns. The effective design of tables and charts is also necessary to communicate data analysis to others. Tables and charts should be only as complicated as necessary to help the user understand the patterns and relationships in the data.

Glossary

Bar chart A graphical presentation that uses horizontal bars to display the magnitude of quantitative data. Each bar typically represents a class of a categorical variable.
bubble chart A graphical presentation used to visualize three variables in a two-dimensional graph. The two axes represent two variables, and the magnitude of the third variable is given by the size of the bubble.
Chart A visual method for displaying data; also called a graph or a figure.
clustered-column (or clustered-bar) chart A special type of column (bar) chart in which multiple bars are clustered in the same class to compare multiple variables; also known as a side-by-side-column (bar) chart.
Column chart A graphical presentation that uses vertical bars to display the magnitude of quantitative data. Each bar typically represents a class of a categorical variable.
crosstabulation A tabular summary of data for two variables. The classes of one variable are represented by the rows; the classes for the other variable are represented by the columns.
data dashboard A data-visualization tool that updates in real time and gives multiple outputs.
data-ink ratio The ratio of the amount of ink used in a table or chart that is necessary to convey information to the total amount of ink used in the table and chart. Ink used that is not necessary to convey information reduces the data-ink ratio.
geographic information system (GIS) A system that merges maps and statistics to present data collected over different geographies.
heat map A two-dimensional graphical presentation of data in which color shadings indicate magnitudes.
key performance indicator (KPI) A metric that is crucial for understanding the current performance of an organization; also known as a key performance metric (KPM).
Line chart A graphical presentation of time series data in which the data points are connected by a line.
parallel-coordinates plot A graphical presentation used to examine more than two variables in which each variable is represented by a different vertical axis. Each observation in a data set is plotted in a parallel-coordinates plot by drawing a line between the values of each variable for the observation.
Pie chart A graphical presentation used to compare categorical data. Because of difficulties in comparing relative areas on a pie chart, these charts are not recommended. Bar or column charts are generally superior to pie charts for comparing categorical data.
PivotChart A graphical presentation created in Excel that functions similarly to a PivotTable.
PivotTable An interactive crosstabulation created in Excel.
scatter chart A graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other on the vertical axis.
scatter-chart matrix A graphical presentation that uses multiple scatter charts arranged as a matrix to illustrate the relationships among multiple variables.
sparkline A special type of line chart that indicates the trend of data but not magnitude. A sparkline does not include axes or labels.
stacked-column chart A special type of column (bar) chart in which multiple variables appear on the same bar.
treemap A graphical presentation that is useful for visualizing hierarchical data along multiple dimensions. A treemap groups data according to the classes of a categorical variable and uses rectangles whose size relates to the magnitude of a quantitative variable.
trendline A line that provides an approximation of the relationship between variables in a chart.

Problems

1. Sales Performance Bonuses. A sales manager is trying to determine appropriate sales performance bonuses for her team this year. The following table contains the data relevant to determining the bonuses, but it is not easy to read and interpret. Reformat the table to improve readability and to help the sales manager make her decisions about bonuses.

SalesBonuses

Salesperson         Total Sales ($)   Average Performance Bonus   Customer    Years with
                                      Previous Years ($)          Accounts    Company
Smith, Michael      325,000.78        12,499.3452                 124         14
Yu, Joe             13,678.21         239.9434                    9           7
Reeves, Bill        452,359.19        21,987.2462                 175         21
Hamilton, Joshua    87,423.91         7,642.9011                  28          3
Harper, Derek       87,654.21         1,250.1393                  21          4
Quinn, Dorothy      234,091.39        14,567.9833                 48          9
Graves, Lorrie      379,401.94        27,981.4432                 121         12
Sun, Yi             31,733.59         672.9111                    7           1
Thompson, Nicole    127,845.22        13,322.9713                 17          3

2. Gross Domestic Product Values. The following table shows an example of gross domestic product values for five countries over six years in equivalent U.S. dollars ($).

Gross Domestic Product (in US $)
Country      Year 1            Year 2            Year 3            Year 4            Year 5            Year 6
Albania      7,385,937,423     8,105,580,293     9,650,128,750     11,592,303,225    10,781,921,975    10,569,204,154
Argentina    169,725,491,092   198,012,474,920   241,037,555,661   301,259,040,110   285,070,994,754   339,604,450,702
Australia    704,453,444,387   758,320,889,024   916,931,817,944   982,991,358,955   934,168,969,952   1,178,776,680,167
Austria      272,865,358,404   290,682,488,352   336,840,690,493   375,777,347,214   344,514,388,622   341,440,991,770
Belgium      335,571,307,765   355,372,712,266   408,482,592,257   451,663,134,614   421,433,351,959   416,534,140,346

a. How could you improve the readability of this table?
b. The file GDPYears contains sample data from the United Nations Statistics Division on 30 countries and their GDP values from Year 1 to Year 6 in US $. Create a table that provides all these data for a user. Format the table to make it as easy to read as possible.

GDPYears

Hint: It is generally not important for the user to know GDP to an exact dollar figure. It is typical to present GDP values in millions or billions of dollars.

3. Monthly Revenue Data. The following table provides monthly revenue values for Tedstar, Inc., a company that sells valves to large industrial firms. The monthly revenue data have been graphed using a line chart in the following figure.

Tedstar

Month    Revenue ($)
Jan      145,869
Feb      123,576
Mar      143,298
Apr      178,505
May      186,850
Jun      192,850
Jul      134,500
Aug      145,286
Sep      154,285
Oct      148,523
Nov      139,600
Dec      148,235

[Line chart of the monthly revenue data, with Revenue ($) on the vertical axis, marked in increments of 10,000 from 0 to 210,000, and Months (Jan through Dec) on the horizontal axis.]

a. What are the problems with the layout and display of this line chart?
b. Create a new line chart for the monthly revenue data at Tedstar, Inc. Format the chart to make it easy to read and interpret.

MajorSalary

4. Business Graduates Salaries. In the file MajorSalary, data have been collected from 111 College of Business graduates on their monthly starting salaries. The graduates include students majoring in management, finance, accounting, information systems, and marketing. Create a PivotTable in Excel to display the number of graduates in each major and the average monthly starting salary for students in each major.
a. Which major has the greatest number of graduates?

b. Which major has the highest average starting monthly salary?
c. Use the PivotTable to determine the major of the student with the highest overall starting monthly salary. What is the major of the student with the lowest overall starting monthly salary?

Franchises

5. Top U.S. Franchises. Entrepreneur magazine ranks franchises. Among the factors that the magazine uses in its rankings are growth rate, number of locations, start-up costs, and financial stability. A recent ranking listed the top 20 U.S. franchises and the number of locations as follows:

Franchise                     Number of U.S.   Franchise                        Number of U.S.
                              Locations                                         Locations
Hampton Inns                  1,864            Jan-Pro Franchising Intl. Inc.   12,394
ampm                          3,183            Hardee’s                         1,901
McDonald’s                    32,805           Pizza Hut Inc.                   13,281
7-Eleven Inc.                 37,496           Kumon Math & Reading Centers     25,199
Supercuts                     2,130            Dunkin’ Donuts                   9,947
Days Inn                      1,877            KFC Corp.                        16,224
Vanguard Cleaning Systems     2,155            Jazzercise Inc.                  7,683
Servpro                       1,572            Anytime Fitness                  1,618
Subway                        34,871           Matco Tools                      1,431
Stratus Building Solutions    5,018            Denny’s Inc.                     1,668

These data can be found in the file Franchises. Create a PivotTable to summarize these data using classes 0–9,999, 10,000–19,999, 20,000–29,999, and 30,000–39,999 to answer the following questions. (Hint: Use Number of U.S. Locations as the COLUMNS, and use Count of Number of U.S. Locations as the VALUES in the PivotTable.)
a. How many franchises have between 0 and 9,999 locations?
b. How many franchises have more than 30,000 locations?

MutualFunds

6. Mutual Funds Data. The file MutualFunds contains a data set with information for 45 mutual funds that are part of the Morningstar Funds 500. The data set includes the following five variables:
Fund Type: The type of fund, labeled DE (Domestic Equity), IE (International Equity), and FI (Fixed Income)
Net Asset Value ($): The closing price per share
Five-Year Average Return (%): The average annual return for the fund over the past five years
Expense Ratio (%): The percentage of assets deducted each fiscal year for fund expenses
Morningstar Rank: The risk adjusted star rating for each fund; Morningstar ranks go from a low of 1 Star to a high of 5 Stars.
a. Prepare a PivotTable that gives the frequency count of the data by Fund Type (rows) and the five-year average annual return (columns). Use classes of 0–9.99, 10–19.99, 20–29.99, 30–39.99, 40–49.99, and 50–59.99 for the Five-Year Average Return (%).
b. What conclusions can you draw about the fund type and the average return over the past five years?

Note that Excel may display the column headings as 0–10, 10–20, 20–30, etc., but they should be interpreted as 0–9.99, 10–19.99, 20–29.99, etc.

TaxData

7. Tax Data by County. The file TaxData contains information from federal tax returns filed in 2007 for all counties in the United States (3,142 counties in total). Create a PivotTable in Excel to answer the questions below. The PivotTable should have State Abbreviation as Row Labels. The Values in the PivotTable should be the sum of adjusted gross income for each state.
a. Sort the PivotTable data to display the states with the smallest sum of adjusted gross income on top and the largest on the bottom. Which state had the smallest sum of adjusted gross income? What is the total adjusted gross income for federal tax returns filed in this state with the smallest total adjusted gross income? (Hint: To sort data in a PivotTable in Excel, right-click any cell in the PivotTable that contains the data you want to sort, and select Sort.)
b. Add the County Name to the Row Labels in the PivotTable. Sort the County Names by Sum of Adjusted Gross Income with the lowest values on the top and the highest values on the bottom. Filter the Row Labels so that only the state of Texas is displayed. Which county had the smallest sum of adjusted gross income in the state of Texas? Which county had the largest sum of adjusted gross income in the state of Texas?
c. Click on Sum of Adjusted Gross Income in the Values area of the PivotTable in Excel. Click Value Field Settings…. Click the tab for Show Values As. In the Show values as box, select % of Parent Row Total. Click OK. This displays the adjusted gross income reported by each county as a percentage of the total state adjusted gross income. Which county has the highest percentage adjusted gross income in the state of Texas? What is this percentage?
d. Remove the filter on the Row Labels to display data for all states. What percentage of total adjusted gross income in the United States was provided by the state of New York?

FDICBankFailures

8. Federally Insured Bank Failures. The file FDICBankFailures contains data on failures of federally insured banks between 2000 and 2012. Create a PivotTable in Excel to answer the following questions. The PivotTable should group the closing dates of the banks into yearly bins and display the counts of bank closures each year in columns of Excel. Row labels should include the bank locations and allow for grouping the locations into states or viewing by city. You should also sort the PivotTable so that the states with the greatest number of total bank failures between 2000 and 2012 appear at the top of the PivotTable.
a. Which state had the greatest number of federally insured bank closings between 2000 and 2012?
b. How many bank closings occurred in the state of Nevada (NV) in 2010? In what cities did these bank closings occur?
c. Use the PivotTable’s filter capability to view only bank closings in California (CA), Florida (FL), Texas (TX), and New York (NY) for the years 2009 through 2012. What is the total number of bank closings in these states between 2009 and 2012?
d. Using the filtered PivotTable from part c, what city in Florida had the greatest number of bank closings between 2009 and 2012? How many bank closings occurred in this city?
e. Create a PivotChart to display a column chart that shows the total number of bank closings in each year from 2000 through 2012 in the state of Florida. Adjust the formatting of this column chart so that it best conveys the data. What does this column chart suggest about bank closings between 2000 and 2012 in Florida? Discuss. (Hint: You may have to switch the row and column labels in the PivotChart to get the best presentation for your PivotChart.)

9. Scatter Chart and Trendline. The following 20 observations are for two quantitative variables, x and y.

Scatter

Observation    x      y        Observation    x      y
1              −22    22       11             −37    48
2              −33    49       12             34     −29
3              2      8        13             9      −18
4              29     −16      14             −33    31
5              −13    10       15             20     −16
6              21     −28      16             −3     14
7              −13    27       17             −15    18
8              −23    35       18             12     17
9              14     −5       19             −20    −11
10             3      −3       20             −7     −22

Fortune500

a. Create a scatter chart for these 20 observations.
b. Fit a linear trendline to the 20 observations. What can you say about the relationship between the two quantitative variables?

10. Profits and Market Capitalizations. The file Fortune500 contains data for profits and market capitalizations from a recent sample of firms in the Fortune 500.
a. Prepare a scatter diagram to show the relationship between the variables Market Capitalization and Profit in which Market Capitalization is on the vertical axis and Profit is on the horizontal axis. Comment on any relationship between the variables.
b. Create a trendline for the relationship between Market Capitalization and Profit. What does the trendline indicate about this relationship?

11. Vehicle Production Data. The International Organization of Motor Vehicle Manufacturers (officially known as the Organisation Internationale des Constructeurs d’Automobiles, OICA) provides data on worldwide vehicle production by manufacturer. The following table shows vehicle production numbers for four different manufacturers for five recent years. Data are in millions of vehicles.

AutoProduction

Production (Millions of vehicles)
Manufacturer    Year 1    Year 2    Year 3    Year 4    Year 5
Toyota          8.04      8.53      9.24      7.23      8.56
GM              8.97      9.35      8.28      6.46      8.48
Volkswagen      5.68      6.27      6.44      6.07      7.34
Hyundai         2.51      2.62      2.78      4.65      5.76

a. Construct a line chart for the time series data for years 1 through 5 showing the number of vehicles manufactured by each automotive company. Show the time series for all four manufacturers on the same graph.
b. What does the line chart indicate about vehicle production amounts from years 1 through 5? Discuss.
c. Construct a clustered-bar chart showing vehicles produced by automobile manufacturer using the year 1 through 5 data. Represent the years of production along the horizontal axis, and cluster the production amounts for the four manufacturers in each year. Which company is the leading manufacturer in each year?

12. Price of Gasoline. The following table contains time series data for regular gasoline prices in the United States for 36 consecutive months:

Data file: GasPrices

Month   Price ($)     Month   Price ($)     Month   Price ($)
  1       2.27          13      2.84          25      3.91
  2       2.63          14      2.73          26      3.68
  3       2.53          15      2.73          27      3.65
  4       2.62          16      2.73          28      3.64
  5       2.55          17      2.71          29      3.61
  6       2.55          18      2.80          30      3.45
  7       2.65          19      2.86          31      3.38
  8       2.61          20      2.99          32      3.27
  9       2.72          21      3.10          33      3.38
 10       2.64          22      3.21          34      3.58
 11       2.77          23      3.56          35      3.85
 12       2.85          24      3.80          36      3.90

a. Create a line chart for these time series data. What interpretations can you make about the average price per gallon of conventional regular gasoline over these 36 months?
b. Fit a linear trendline to the data. What does the trendline indicate about the price of gasoline over these 36 months?

13. Term Life Insurance. The following table contains sales totals for the top six term life insurance salespeople at American Insurance.

Salesperson    Contracts Sold
Harish               24
David                41
Kristina             19
Steven               23
Tim                  53
Mona                 39

a. Create a column chart to display the information in the table above. Format the column chart to best display the data by adding axes labels, a chart title, etc.
b. Sort the values in Excel so that the column chart is ordered from most contracts sold to fewest.
c. Insert data labels to display the number of contracts sold for each salesperson above the columns in the column chart created in part a.

14. Pie Chart Alternatives. The total number of term life insurance contracts sold in Problem 13 is 199. The following pie chart shows the percentages of contracts sold by each salesperson.

[Pie chart of contracts sold by salesperson: Harish 12.1%, David 20.6%, Kristina 9.5%, Steven 11.6%, Tim 26.6%, Mona 19.6%]

a. What are the problems with using a pie chart to display these data?
b. What type of chart would be preferred for displaying the data in this pie chart?
c. Use a different type of chart to display the percentage of contracts sold by each salesperson that conveys the data better than the pie chart. Format the chart and add data labels to improve the chart’s readability.

15. Engine Type Preference. An automotive company is considering the introduction of a new model of sports car that will be available in four-cylinder and six-cylinder engine types. A sample of customers who were interested in this new model were asked to indicate their preference for an engine type for the new model of automobile. The customers were also asked to indicate their preference for exterior color from four choices: red, black, green, and white. Consider the following data regarding the customer responses:

Data file: NewAuto

Color     Four Cylinders    Six Cylinders
Red            143               857
Black          200               800
Green          321               679
White          420               580

a. Construct a clustered-column chart with exterior color as the horizontal variable.
b. What can we infer from the clustered-column chart in part a?

16. Smartphone Ownership. Consider the following survey results regarding smartphone ownership by age:

Data file: SmartPhone

Age Category    Smartphone (%)    Other Cell Phone (%)    No Cell Phone (%)
18–24                 49                  46                      5
25–34                 58                  35                      7
35–44                 44                  45                     11
45–54                 28                  58                     14
55–64                 22                  59                     19
65+                   11                  45                     44

a. Construct a stacked-column chart to display the survey data on type of cell-phone ownership. Use Age Category as the variable on the horizontal axis.
b. Construct a clustered column chart to display the survey data. Use Age Category as the variable on the horizontal axis.
c. What can you infer about the relationship between age and smartphone ownership from the column charts in parts a and b? Which column chart (stacked or clustered) is best for interpreting this relationship? Why?

17. Store Manager Tasks. The Northwest regional manager of Logan Outdoor Equipment Company has conducted a study to determine how her store managers are allocating their time. A study was undertaken over three weeks that collected the following data related to the percentage of time each store manager spent on the tasks of attending required meetings, preparing business reports, customer interaction, and being idle. The results of the data collection appear in the following table:

Data file: Logan

             Attending Required    Tasks Preparing           Customer           Idle
Locations    Meetings (%)          Business Reports (%)      Interaction (%)    (%)
Seattle            32                    17                        37            14
Portland           52                    11                        24            13
Bend               18                    11                        52            19
Missoula           21                     6                        43            30
Boise              12                    14                        64            10
Olympia            17                    12                        54            17

a. Create a stacked-bar chart with locations along the vertical axis. Reformat the bar chart to best display these data by adding axis labels, a chart title, and so on.
b. Create a clustered-bar chart with locations along the vertical axis and clusters of tasks. Reformat the bar chart to best display these data by adding axis labels, a chart title, and the like.
c. Create multiple bar charts in which each location becomes a single bar chart showing the percentage of time spent on tasks. Reformat the bar charts to best display these data by adding axis labels, a chart title, and so forth.
d. Which form of bar chart (stacked, clustered, or multiple) is preferable for these data? Why?
e. What can we infer about the differences among how store managers are allocating their time at the different locations?

18. R&D Project Portfolio. The Ajax Company uses a portfolio approach to manage their research and development (R&D) projects. Ajax wants to keep a mix of projects to balance the expected return and risk profiles of their R&D activities. Consider a situation in which Ajax has six R&D projects as characterized in the table. Each project is given an expected rate of return and a risk assessment, which is a value between 1 and 10, where 1 is the least risky and 10 is the most risky. Ajax would like to visualize their current R&D projects to keep track of the overall risk and return of their R&D portfolio.

Data file: Ajax

Project    Expected Rate of Return (%)    Risk Estimate    Capital Invested (Millions $)
   1                12.6                       6.8                    6.4
   2                14.8                       6.2                   45.8
   3                 9.2                       4.2                    9.2
   4                 6.1                       6.2                   17.2
   5                21.4                       8.2                   34.2
   6                 7.5                       3.2                   14.8

a. Create a bubble chart in which the expected rate of return is along the horizontal axis, the risk estimate is on the vertical axis, and the size of the bubbles represents the amount of capital invested. Format this chart for best presentation by adding axis labels and labeling each bubble with the project number.
b. The efficient frontier of R&D projects represents the set of projects that have the highest expected rate of return for a given level of risk. In other words, any project that has a smaller expected rate of return for an equivalent, or higher, risk estimate cannot be on the efficient frontier. From the bubble chart in part a, which projects appear to be located on the efficient frontier?

19. Marketing Survey Results. Heat maps can be very useful for identifying missing data values in moderate to large data sets. The file SurveyResults contains the responses from a marketing survey: 108 individuals responded to the survey of 10 questions. Respondents provided answers of 1, 2, 3, 4, or 5 to each question, corresponding to the overall satisfaction on 10 different dimensions of quality. However, not all respondents answered every question.
a. To find the missing data values, create a heat map in Excel that shades the empty cells a different color. Use Excel’s Conditional Formatting function to create this heat map. Hint: Click on Conditional Formatting in the Styles group in the Home tab. Select Highlight Cells Rules and click More Rules…. Then enter Blanks in the Format only cells with: box. Select a format for these blank cells that will make them obviously stand out.
b. For each question, which respondents did not provide answers? Which question has the highest nonresponse rate?

20. Revenues of Web Development Companies. The following table shows monthly revenue for six different web development companies.

Data file: WebDevelop

                                          Revenue ($)
Company                    Jan       Feb       Mar       Apr       May       Jun
Blue Sky Media            8,995     9,285    11,555     9,530    11,230    13,600
Innovate Technologies    18,250    16,870    19,580    17,260    18,290    16,250
Timmler Company           8,480     7,650     7,023     6,540     5,700     4,930
Accelerate, Inc.         28,325    27,580    23,450    22,500    20,800    19,800
Allen and Davis, LLC      4,580     6,420     6,780     7,520     8,370    10,100
Smith Ventures           17,500    16,850    20,185    18,950    17,520    18,580

a. Use Excel to create sparklines for sales at each company.
b. Which companies have generally decreasing revenues over the six months? Which company has exhibited the most consistent growth over the six months? Which companies have revenues that are both increasing and decreasing over the six months?
c. Use Excel to create a heat map for the revenue of the six companies. Do you find the heat map or the sparklines to be better at communicating the trend of revenues over the six months for each company? Why?
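For readers who want to see what a sparkline encodes before building one in Excel, the following Python sketch draws a text sparkline for each row of the Problem 20 table. It is only an illustration, not the Excel procedure: the names REVENUE, BLOCKS, and sparkline are our own, the rows are assigned to companies in the order the table lists them, and scaling each series to its own minimum and maximum is an assumption about the usual default behavior of sparklines.

```python
# Revenue ($) for Jan through Jun, transcribed from the Problem 20 table.
REVENUE = {
    "Blue Sky Media": [8995, 9285, 11555, 9530, 11230, 13600],
    "Innovate Technologies": [18250, 16870, 19580, 17260, 18290, 16250],
    "Timmler Company": [8480, 7650, 7023, 6540, 5700, 4930],
    "Accelerate, Inc.": [28325, 27580, 23450, 22500, 20800, 19800],
    "Allen and Davis, LLC": [4580, 6420, 6780, 7520, 8370, 10100],
    "Smith Ventures": [17500, 16850, 20185, 18950, 17520, 18580],
}

BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline(values):
    """Map each value to a bar height scaled to the series' own min and max."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A flat series has no variation to show; use the lowest bar throughout.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[int((v - low) / span * top)] for v in values)


for company, series in REVENUE.items():
    print(f"{company:<22} {sparkline(series)}")
```

Because each series is scaled independently, a sparkline shows the shape of one company's revenue over time but not how its level compares with the other companies; a heat map built on a common color scale trades some of that shape for comparability of levels.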

21. NFL Attendance. Below is a sample of the data in the file NFLAttendance, which contains the 32 teams in the National Football League, their conference affiliation, their division, and their average home attendance.

Conference    Division    Team                    Average Home Attendance
AFC           West        Oakland                        54,584
AFC           West        Los Angeles Chargers           57,024
NFC           North       Chicago                        60,368
AFC           North       Cincinnati                     60,511
NFC           South       Tampa Bay                      60,624
NFC           North       Detroit                        60,792
AFC           South       Jacksonville                   61,915

a. Create a treemap using these data that separates the teams into their conference affiliations (NFC and AFC) and uses size to represent each team’s average home attendance. Note that you will need to sort the data in Excel by Conference to properly create a treemap.
b. Create a sorted bar chart that compares the average home attendance for each team.
c. Comment on the advantages and disadvantages of each type of chart for these data. Which chart best displays these data and why?

22. Global 100 Companies. For this problem we will use the data in the file Global100 that was referenced in Section 3.4 as an example for creating a treemap. Here we will use these data to create a GIS chart. A portion of the data contained in Global100 is shown below.

Continent    Country      Company                       Market Value (Billions US $)
Asia         China        Agricultural Bank of China              141.1
Asia         China        Bank of China                           124.2
Asia         China        China Construction Bank                 174.4
Asia         China        ICBC                                    215.6
Asia         China        PetroChina                              202.0
Asia         China        Sinopec-China Petroleum                  94.7
Asia         China        Tencent Holdings                        135.4
Asia         Hong Kong    China Mobile                            184.6
Asia         Japan        Softbank                                 91.2
Asia         Japan        Toyota Motor                            193.5

Use Excel to create a GIS chart that (1) displays the Market Value of companies in different countries as a heat map; (2) allows you to filter the results so that you can choose to add and remove specific continents in your GIS chart; and (3) uses text labels to display which companies are located in each country. To do this you will need to create a 3D Map in Excel. You will then need to click the Change the visualization to Region button, and then add Country to the Location box (and remove Continent from the Location box if it appears there), add Continent to the Filters box, and add Market Value (Billions US $) to the Value box. Under Layer Options, you will also need to Customize the Data Card to include Company as a Field for the CustomTooltip.
a. Display the results of the GIS chart for companies in Europe only. Which country in Europe has the highest total Market Value for Global 100 companies in that country? What is the total market value for Global 100 companies in that country?
b. Add North America in addition to Europe for continents to be displayed. How does the heat map for Europe change? Why does it change in this way?

23. Online Customers versus In-Store Customers. Zeitler’s Department Stores sells its products online and through traditional brick-and-mortar stores. The following parallel-coordinates plot displays data from a sample of 20 customers who purchased clothing from Zeitler’s either online or in-store. The data include variables for the customer’s age, annual income, and the distance from the customer’s home to the nearest Zeitler’s store. According to the parallel-coordinates plot, how are online customers differentiated from in-store customers?

[Parallel-coordinates plot. Vertical axes: Age (23 to 64), Annual Income ($000) (14 to 154), Distance from Nearest Store (miles) (6 to 120). Line colors distinguish In-store and Online customers.]

Problems 24 and 26 require the use of software outside native Excel.

24. Customers Who Purchase Electronic Equipment. The file ZeitlersElectronics contains data on customers who purchased electronic equipment either online or in-store from Zeitler’s Department Stores.
a. Create a parallel-coordinates plot for these data. Include vertical axes for the customer’s age, annual income, and distance from nearest store. Color the lines by the type of purchase made by the customer (online or in-store).
b. How does this parallel-coordinates plot compare to the one shown in Problem 23 for clothing purchases? Does the division between online and in-store purchasing habits for customers buying electronics equipment appear to be the same as for customers buying clothing?
c. Parallel-coordinates plots are very useful for interacting with your data to perform analysis. Filter the parallel-coordinates plot so that only customers whose homes are more than 40 miles from the nearest store are displayed. What do you learn from the parallel-coordinates plot about these customers?

25. Radiological Imaging Services Clinics. Aurora Radiological Services is a health care clinic that provides radiological imaging services (such as MRIs, X-rays, and CAT scans) to patients. It is part of Front Range Medical Systems that operates clinics throughout the state of Colorado.
a. What type of key performance indicators and other information would be appropriate to display on a data dashboard to assist the Aurora clinic’s manager in making daily staffing decisions for the clinic?
b. What type of key performance indicators and other information would be appropriate to display on a data dashboard for the CEO of Front Range Medical Systems who oversees the operation of multiple radiological imaging clinics?

26. Customers Ordering by Phone. Bravman Clothing sells high-end clothing products online and through phone orders. Bravman Clothing has taken a sample of 25 customers who placed orders by phone. The file Bravman contains data for each customer purchase, including the wait time the customer experienced when he or she called, the customer’s purchase amount, the customer’s age, and the customer’s credit score. Bravman Clothing would like to analyze these data to try to learn more about their phone customers.
a. Create a scatter-chart matrix for these data. Include the variables wait time, purchase amount, customer age, and credit score.
b. What can you infer about the relationships between these variables from the scatter-chart matrix?

Case Problem 1: Pelican Stores

Pelican Stores, a division of National Clothing, is a chain of women’s apparel stores operating throughout the country. The chain recently ran a promotion in which discount coupons were sent to customers of other National Clothing stores. Data collected for a sample of 100 in-store credit card transactions at Pelican Stores during one day while the promotion was running are contained in the file PelicanStores. Table 3.13 shows a portion of the data set. The Proprietary Card method of payment refers to charges made using a National Clothing charge card. Customers who made a purchase using a discount coupon are referred to as promotional customers, and customers who made a purchase but did not use a discount coupon are referred to as regular customers. Because the promotional coupons were not sent to regular Pelican Stores customers, management considers the sales made to people presenting the promotional coupons as sales it would not otherwise make. Of course, Pelican also hopes that the promotional customers will continue to shop at its stores.

Most of the variables shown in Table 3.13 are self-explanatory, but two of the variables require some clarification.

Items        The total number of items purchased
Net Sales    The total amount ($) charged to the credit card

Pelican’s management would like to use these sample data to learn about its customer base and to evaluate the promotion involving discount coupons.

Table 3.13    Data for a Sample of 100 Credit Card Purchases at Pelican Stores

Data file: PelicanStores

Customer    Type of Customer    Items    Net Sales    Method of Payment    Gender    Marital Status    Age
    1       Regular               1        39.50      Discover             Male      Married            32
    2       Promotional           1       102.40      Proprietary Card     Female    Married            36
    3       Regular               1        22.50      Proprietary Card     Female    Married            32
    4       Promotional           5       100.40      Proprietary Card     Female    Married            28
    5       Regular               2        54.00      MasterCard           Female    Married            34
    .       .                     .          .        .                    .         .                   .
    .       .                     .          .        .                    .         .                   .
    .       .                     .          .        .                    .         .                   .
   96       Regular               1        39.50      MasterCard           Female    Married            44
   97       Promotional           9       253.00      Proprietary Card     Female    Married            30
   98       Promotional          10       287.59      Proprietary Card     Female    Married            52
   99       Promotional           2        47.60      Proprietary Card     Female    Married            30
  100       Promotional           1        28.44      Proprietary Card     Female    Married            44

Managerial Report

Use the tabular and graphical methods of descriptive statistics to help management develop a customer profile and to evaluate the promotional campaign. At a minimum, your report should include the following:

1. Percent frequency distributions for each of the key variables: number of items purchased, net sales, method of payment, gender, marital status, and age.
2. A sorted bar chart showing the number of customer purchases attributable to the method of payment.
3. A crosstabulation of type of customer (regular or promotional) versus net sales. Comment on any similarities or differences observed.
4. A scatter chart to explore the relationship between net sales and customer age.
5. A chart to examine whether the relationship between net sales and age depends on the marital status of the customer.
6. A side-by-side bar chart to examine the method of payment by customer type (regular or promotional). Comment on any differences you observe between the methods of payments used by the different types of customers.

Percent frequency distributions were introduced in Section 2.4. You can use Excel to create percent frequency distributions using PivotTables.

Case Problem 2: Movie Theater Releases

The movie industry is a competitive business. More than 50 studios produce hundreds of new movies for theater release each year, and the financial success of each movie varies considerably. The opening weekend gross sales ($ millions), the total gross sales ($ millions), the number of theaters the movie was shown in, and the number of weeks the movie was in release are common variables used to measure the success of a movie released to theaters. Data collected for the top 100 theater movies released in 2016 are contained in the file Movies2016 (Box Office Mojo website). Table 3.14 shows the data for the first 10 movies in this file.

Managerial Report

Use the tabular and graphical methods of descriptive statistics to learn how these variables contribute to the success of a motion picture. Include the following in your report.

1. Tabular and graphical summaries for each of the four variables along with a discussion of what each summary tells us about the movies that are released to theaters.
2. A scatter diagram to explore the relationship between Total Gross Sales and Opening Weekend Gross Sales. Discuss.
3. A scatter diagram to explore the relationship between Total Gross Sales and Number of Theaters. Discuss.
4. A scatter diagram to explore the relationship between Total Gross Sales and Number of Weeks in Release. Discuss.

Table 3.14    Performance Data for Ten 2016 Movies Released to Theaters

Data file: Movies2016

                                        Opening Gross Sales    Total Gross Sales    Number of    Weeks in
Movie Title                             ($ Million)            ($ Million)          Theaters     Release
Rogue One: A Star Wars Story                 155.08                 532.18            4,157         20
Finding Dory                                 135.06                 486.30            4,305         25
Captain America: Civil War                   179.14                 408.08            4,226         20
The Secret Life of Pets                      104.35                 368.38            4,381         25
The Jungle Book                              103.26                 364.00            4,144         24
Deadpool                                     132.43                 363.07            3,856         18
Zootopia                                      75.06                 341.27            3,959         22
Batman v Superman: Dawn of Justice           166.01                 330.36            4,256         12
Suicide Squad                                133.6                  325.10            4,255         14
Sing                                          35.26                 270.40            4,029         20

Data Visualization in Tableau Appendix

In this appendix, we introduce the use of Tableau Desktop software for visualizing data. Tableau allows for easy creation of a variety of charts and is particularly useful for creating interactive visualizations that allow a user to sort, filter, and otherwise explore data.

Connecting to a Data File in Tableau

Tableau can connect with many different types of data files for use in creating data visualizations, including Excel files, text files, database files, and many others. When you open Tableau Desktop, you should see a screen similar to Figure Tableau 3.1. This is the Tableau Desktop Home Screen. The Connect section allows you to connect to many different data file types. The Open section allows you to open sample data sets provided by Tableau, and the Discover section provides information on ways to use Tableau. We will use the steps below and the file Electronics to illustrate how we can connect to an Excel file in Tableau.

FIGURE TABLEAU 3.1

Tableau Desktop Home Screen

FIGURE TABLEAU 3.2

Tableau Data Source Screen for the Electronics Data

Each worksheet contained in the Excel file will be listed in the Sheets area on the left of the dialog box. Tableau can connect to multiple data locations, but you must combine the data locations into a table using the New Union function.

Step 1. Click the File tab in the Tableau Ribbon and select Open…
Step 2. When the Open dialog box appears, navigate to the location of the Electronics.xlsx file
        Select the Electronics.xlsx file, and click Open

Steps 1 and 2 above will open the Tableau Data Source screen shown in Figure Tableau 3.2. From the top of Figure Tableau 3.2 we see that Tableau is connected to the data file Electronics using a Live Connection. This means that as the data file is updated, these updates will be reflected in any visualizations created using Tableau. Alternatively, Tableau can use an Extract Connection, in which case all data from the file would be extracted, and any visualizations created in Tableau would not be updated as the data in the file are changed. The lower portion of the Tableau Data Source screen shows a preview of the data file to which Tableau is connected. From Figure Tableau 3.2, we see that the columns from the file Electronics are titled “Week”, “No. of Commercials”, and “Sales Volume”, and we see the first 8 observations for these data.

NOTES + COMMENTS

1. To update the visualizations created in Tableau from a Live Connection, first save any changes in the original data file. Then click Data in the Tableau Ribbon and select Data (name of data file), and click Refresh.
2. Tableau saves files as Tableau Workbooks with the file extension .twb. To save a file in Tableau, click the File tab in the Ribbon and select Save. Tableau workbooks can be viewed by users who either have a licensed copy of Tableau or the free Tableau Reader that allows users to only view visualizations created with Tableau.
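The difference between a Live Connection and an Extract Connection can be pictured with a small Python sketch. This is a loose analogy only, not how Tableau is implemented: the file name electronics_demo.csv, the helper functions, and the sales values are all hypothetical, and a CSV file stands in for the Excel workbook.

```python
import csv
import os
import tempfile

FIELDS = ["Week", "Sales Volume"]  # column names borrowed from the Electronics preview


def write_rows(path, rows):
    """Overwrite the stand-in data file with the given rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read every row currently stored in the file."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "electronics_demo.csv")  # hypothetical file name
    write_rows(path, [{"Week": 1, "Sales Volume": 50}])  # illustrative values

    extract_rows = read_rows(path)  # extract: copied once, then left alone

    # The source file gains a second week after the extract was taken.
    write_rows(path, [{"Week": 1, "Sales Volume": 50},
                      {"Week": 2, "Sales Volume": 57}])

    live_rows = read_rows(path)  # live: the file is read again at refresh time

print("rows seen through the live connection:", len(live_rows))
print("rows held in the extract:", len(extract_rows))
```

In this sketch the live read picks up the added week while the extract still holds only the rows that existed when it was taken, which mirrors the behavior described above.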

Creating a Scatter Chart in Tableau

We will use the file Electronics and the steps below to create a scatter chart similar to the one shown in Figure 3.17.

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen (see Figure Tableau 3.2). This will open a Tableau sheet as shown in Figure Tableau 3.3
Step 2. Drag No. of Commercials from the Measures area to the Columns area to set number of commercials as the horizontal axis value
        Drag Sales Volume from the Measures area to the Rows area to set sales volume as the vertical axis value

Note that after Step 2, Tableau will display only a single circle in the chart as shown in Figure Tableau 3.4. This is because Tableau is currently plotting the SUM of the No. of Commercials versus the SUM of the Sales Volume. In Step 3, we will force Tableau to plot the values for No. of Commercials and Sales Volume by week. We do this by indicating that the level of detail that Tableau should plot is given by the Dimension of Week.

Step 3. Drag Week from the Dimensions area to the Detail box in the Marks area to set the level of detail to weeks
Step 4. Right-click the scatter chart for Sheet 1
        Select Trend Lines and click Show Trend Lines

Steps 1 through 4 will create the scatter chart with trendline shown in Figure Tableau 3.5.
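The single circle that appears after Step 2, and the effect of adding Week to the level of detail in Step 3, can be reproduced in a few lines of Python. The sketch below is illustrative only: the weekly values are made up rather than taken from the Electronics file, the function name linear_trendline is our own, and we assume the trend line is an ordinary least-squares fit of a straight line.

```python
# Illustrative (week, commercials, sales) records; NOT the Electronics data.
records = [(1, 2, 50), (2, 5, 57), (3, 1, 41), (4, 3, 54), (5, 4, 54)]

# With no dimension on Detail, every row collapses into one (SUM, SUM) mark.
single_mark = (sum(r[1] for r in records), sum(r[2] for r in records))

# With Week on Detail, the sums are taken within each week: one mark per week.
marks_by_week = {}
for week, commercials, sales in records:
    x, y = marks_by_week.get(week, (0, 0))
    marks_by_week[week] = (x + commercials, y + sales)


def linear_trendline(points):
    """Return the least-squares slope and intercept for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = linear_trendline(list(marks_by_week.values()))
print("single aggregated mark:", single_mark)
print("marks at week level:", len(marks_by_week))
print(f"trendline: y = {slope:.2f}x + {intercept:.2f}")
```

A trendline needs more than one point, which is why the fit only becomes meaningful once the level of detail produces a separate mark for each week.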

FIGURE TABLEAU 3.3

Blank Tableau Sheet for the Electronics Data

The green # signs next to Week, No. of Commercials, Sales Volume, etc. indicate that Tableau has identified these variables as continuous numerical values. Blue # signs indicate discrete numerical values; Abc indicates text values. These value types can be changed by right-clicking on the variable and selecting Convert or Change Data Type.


FIGURE TABLEAU 3.4

Scatter Plot of Sum of No. of Commercials Versus Sum of Sales Volume

FIGURE TABLEAU 3.5

Scatter Chart with Trendline for the Electronics Data

Note the setting of Automatic in the Marks area. Tableau attempts to choose the best type of visualization for your data based on the Dimensions and Measures that you include. Tableau generally does a very good job of choosing the best visualization, but you can change this by clicking on Automatic in the Marks area and choosing a different type of visualization from the drop-down menu.

Creating a Line Chart in Tableau

We will now show how to create a line chart in Tableau similar to that shown in Figure 3.21 using the file KirklandRegional and the steps below. You will first need to connect to the KirklandRegional Excel file by opening a new Tableau sheet (click File in the Tableau Ribbon and select New) and then following the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix.

Once you connect to the KirklandRegional Excel file, the Tableau Data Source screen will appear similar to Figure Tableau 3.6. Note that the current column titles are incorrect: the preview shows “F1” as the title of Column 1, “Sales ($100s)” as the title of Column 2, and “F3” as the title of Column 3. The actual column titles appear in the second row: “Month”, “North”, and “South”. Tableau provides an easy-to-use tool known as Data Interpreter that can clean many common data errors such as these. Click the check box for Use Data Interpreter in the Sheets area. This will change the column names to “Month”, “North”, and “South”. We can then use the steps below to create a line chart similar to Figure 3.21.
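Before turning to the steps, the kind of header clean-up described above can be pictured with a short Python sketch. It is not Tableau's Data Interpreter and makes no claim about how that tool works: the function promote_header is hypothetical, the sales figures are made up, and we assume that “F1” and “F3” are placeholders for blank cells in the first row of the sheet.

```python
import csv
import io

# Hypothetical text version of a sheet whose real headers sit in the second row.
raw = io.StringIO(
    ",Sales ($100s),\n"
    "Month,North,South\n"
    "Jan,95,40\n"
    "Feb,100,45\n"
)

rows = list(csv.reader(raw))


def promote_header(rows):
    """Use the first fully populated row as the header and drop the rows above it."""
    for i, row in enumerate(rows):
        if row and all(cell.strip() for cell in row):
            return row, rows[i + 1:]
    raise ValueError("no fully populated row found")


header, data = promote_header(rows)
print(header)
print(data[0])
```

The first row is skipped because two of its cells are blank, so the second row supplies the column names and the remaining rows are treated as data.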

KirklandRegional

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Month from the Dimensions area to the Columns area to set month as the horizontal axis value
        Drag North from the Measures area to the Rows area to set sales amounts in the North region as the vertical axis value
        Drag South from the Measures area to the Rows area to set sales amounts in the South region as the vertical axis value
Step 3. Change the chart type in the drop-down menu of the Marks area from Automatic to Line

This creates the line charts shown in Figure Tableau 3.7. Step 4 will put both line charts on the same axis.

Step 4. Drag SUM(South) from the Rows area to the North vertical axis area of the line chart (see Figure Tableau 3.7) to put both North and South on the same axis

FIGURE TABLEAU 3.6

Tableau Data Source Screen for the KirklandRegional Data


FIGURE TABLEAU 3.7

Separate Line Charts Created for the KirklandRegional Data

Step 5. Right-click on the Value label on the vertical axis of the line chart
        Select Edit Axis…
Step 6. When the Edit Axis [Measure Values] dialog box appears, change the Title in the Axis Titles area to Sales ($100s)
        Click Apply and then click OK

Steps 1 through 6 create the line chart shown in Figure Tableau 3.8, which is similar to Figure 3.21.

FIGURE TABLEAU 3.8

Line Chart Created in Tableau for the KirklandRegional Data

Creating a Bar Chart in Tableau

We will now show how to create a bar chart in Tableau similar to that shown in Figure 3.25 using the file AccountsManaged and the steps below. You will first need to connect to the AccountsManaged Excel file by opening a new Tableau sheet (click File in the Tableau Ribbon and select New) and then following the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix.

AccountsManaged

If you reverse Step 2 to put Manager in the Columns area and Accounts Managed in the Rows area, this will create a vertical column chart rather than the horizontal bar chart.

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Accounts Managed from the Measures area to the Columns area to set this as the horizontal axis value
        Drag Manager from the Dimensions area to the Rows area to set this as the vertical axis value
Step 3. Click the Sort Manager descending by Accounts Managed button just below the Tableau Ribbon to sort the bar chart by decreasing number of accounts managed
Step 4. Drag Accounts Managed from the Measures area to the Label button in the Marks area to add data labels to the bars corresponding to the number of accounts managed for each manager

Steps 1 through 4 create the bar chart shown in Figure Tableau 3.9.
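The two formatting choices made in Steps 3 and 4, sorting the bars in descending order and attaching a data label to each bar, can be sketched in plain Python as a text chart. The manager names and account counts below are hypothetical placeholders (the AccountsManaged file is not reproduced here), and sorted_bars is our own helper rather than anything provided by Tableau.

```python
# Hypothetical managers and account counts, used only for illustration.
accounts = {"Manager A": 12, "Manager B": 30, "Manager C": 7, "Manager D": 21}


def sorted_bars(data, width=30):
    """Return text bars, largest value first, each ending with its data label."""
    largest = max(data.values())
    ordered = sorted(data.items(), key=lambda item: item[1], reverse=True)
    lines = []
    for name, value in ordered:
        bar = "#" * round(value / largest * width)  # bar length proportional to value
        lines.append(f"{name:<10} {bar} {value}")
    return lines


for line in sorted_bars(accounts):
    print(line)
```

Sorting makes the ranking readable at a glance, and the label at the end of each bar gives the exact count without requiring the reader to estimate it from the axis.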

FIGURE TABLEAU 3.9

Sorted Bar Chart Created in Tableau for the AccountsManaged Data

Creating a Bubble Chart in Tableau

We will now show how to create a bubble chart in Tableau similar to that shown in Figure 3.27 using the file Billionaires and the steps below. You will first need to connect to the Billionaires Excel file by opening a new Tableau sheet (click File in the Tableau Ribbon and select New) and then following the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix.

FIGURE TABLEAU 3.10

Bubble Chart Created in Tableau for the Billionaires Data

You may notice that Tableau has added variables in the Measures section for Latitude (generated) and Longitude (generated). Whenever Tableau recognizes a geographic variable (country name, state name, etc.) it automatically generates the corresponding latitudes and longitudes so that the values can be plotted on maps.

Billionaires

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Billionaires per 10M Residents from the Measures area to the Columns area to set this as the horizontal axis value
        Drag Per Capita Income from the Measures area to the Rows area to set this as the vertical axis value
Step 3. Drag Country from the Dimensions area to the Detail button in the Marks area to set the level of detail to countries
        Drag Number of Billionaires from the Measures area to the Size button in the Marks area to size the bubbles based on the number of billionaires in each country
        Drag Country from the Dimensions area to the Label button in the Marks area to label each bubble by name of country

Steps 1 through 3 create the bubble chart shown in Figure Tableau 3.10, which is similar to the chart shown in Figure 3.27.

Creating Clustered and Stacked Column Charts in Tableau

We will now show how to create clustered and stacked column (or bar) charts in Tableau similar to those shown in Figure 3.30 using the file KirklandRegional and the steps below. First, open a new Tableau sheet (click File in the Tableau Ribbon and select New), and then follow the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix to connect to the KirklandRegional Excel file.

KirklandRegional

Step 1. Click the check box for Use Data Interpreter in the Sheets area of the Tableau Data Source screen to alter the column names to be “Month”, “North” and “South”.


If you reverse Step 3 to put Month in the Rows area and North and South in the Columns area, this will create a horizontal bar chart rather than a vertical column chart.


Step 2. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 3. Drag Month from the Dimensions area to the Columns area to set month as the horizontal axis value
        Drag North from the Measures area to the Rows area to set sales amounts in the North region as the vertical axis value
        Drag South from the Measures area to the Rows area to set sales amounts in the South region as the vertical axis value
Step 4. Drag SUM(South) from the Rows area to the North label of the vertical bar chart (see Figure Tableau 3.11) to put both column charts on the same axis
Step 5. Drag Measure Names from the Dimensions area to the Color button in the Marks area to change the color of the columns based on the Measure names
Step 6. Right-click on the Value label on the vertical axis of the column chart
        Select Edit Axis…
Step 7. When the Edit Axis [Measure Values] dialog box appears:
        Change the Title in the Axis Titles area to Sales ($100s)
        Click Apply and then click OK

This creates the clustered column chart as shown in Figure Tableau 3.12. Step 8 changes the clustered column chart to a stacked column chart.

Step 8. Drag Measure Names from the Columns area back to the Dimensions area

Step 8 removes Measure Names from Columns and forces Tableau to display both North and South regions on the same column as shown in Figure Tableau 3.13.

FIGURE TABLEAU 3.11

Creating Clustered and Stacked Column Charts in Tableau


FIGURE TABLEAU 3.12

Clustered Column Chart Created in Tableau for the KirklandRegional Data

FIGURE TABLEAU 3.13

Stacked Column Chart Created in Tableau for the KirklandRegional Data


Creating a Scatter-Chart Matrix in Tableau

We will now show how to create a scatter-chart matrix in Tableau similar to that shown in Figure 3.31 using the file NYCityData and the steps below. First, open a new Tableau sheet (click File in the Tableau Ribbon and select New), and then follow the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix to connect to the NYCityData Excel file.

NYCityData

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Median Monthly Rent ($), Percentage College Graduates (%), Poverty Rate (%), and Travel Time (min) from the Measures area to the Columns area to add each of these variables to the horizontal axis
        Drag Median Monthly Rent ($), Percentage College Graduates (%), Poverty Rate (%), and Travel Time (min) from the Measures area to the Rows area to add each of these variables to the vertical axis
Step 3. Drag Area (Sub-Borough) from the Dimensions area to the Detail button in the Marks area to set the level of detail to Area
Step 4. Click the Shape button in the Marks area and select the filled circle to replace the empty circles with filled circles in the scatter charts
        Click the Size button in the Marks area and adjust the slider to make the filled circles smaller in the scatter charts

Steps 1 through 4 create the scatter-chart matrix shown in Figure Tableau 3.14, which is similar to that shown in Figure 3.31.

FIGURE TABLEAU 3.14

Scatter-Chart Matrix Created in Tableau using the NYCityData

Creating a Treemap in Tableau

We will now show how to create a treemap in Tableau similar to that shown in Figure 3.35 using the file Global100 and the steps below. You will first need to connect to the Global100 Excel file by opening a new Tableau sheet (click File in the Tableau Ribbon and select New) and then following the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix.

Global100

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Market Value (Billions US $) from the Measures area to the Size button in the Marks area to use the relative market value as the rectangle size measure in the treemap
Step 3. Drag Company from the Dimensions area to the Label button in the Marks area to label each rectangle with the name of the company
Step 4. Drag Continent from the Dimensions area to the Color button in the Marks area to color the treemap by continent location of each company

You can edit the appearance of the Tooltip by clicking on the Tooltip button in the Marks area.

Steps 1 through 4 create the treemap shown in Figure Tableau 3.15, which is similar to that shown in Figure 3.35. Note that hovering the pointer over any rectangle (even those where space constraints prevent the company name from appearing) shows the Tooltip, which contains the Company Name, Continent of Location, and Market Value.

FIGURE TABLEAU 3.15

Treemap Created in Tableau for the Global100 Data

Creating a GIS Chart in Tableau

We will now show how to create a GIS chart (map) in Tableau similar to that shown in Figure 3.38 using the file WorldGDP2014 and the steps below. First, open a new Tableau sheet (click File in the Tableau Ribbon and select New), and then follow the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix to connect to the WorldGDP2014 Excel file.

WorldGDP2014

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Longitude (generated) from the Measures area to the Columns area
        Drag Latitude (generated) from the Measures area to the Rows area

This will generate a map in the chart area.


FIGURE TABLEAU 3.16


GIS Chart Created in Tableau for the WorldGDP2014 Data

You can also generate a map using the Tableau Show Me tool by dragging Country Name from the Dimensions area to the Rows area, clicking the Show Me button, and then selecting the Filled Map icon.

Step 3. Drag Country Name from the Dimensions area to the Detail button in the Marks area to set the level of detail to Country
Step 4. Drag GDP 2014 (Billions US $) from the Measures area to the Color button in the Marks area to color the map based on the relative GDP 2014 values in each country
Step 5. Click the Color button and choose Edit Colors…
        Change the color drop-down from Automatic to Red to more closely match the shadings in Figure 3.38
        Click Apply and then click OK

Steps 1 through 5 create the GIS chart shown in Figure Tableau 3.16, which is similar to Figure 3.38.

Creating Histograms and Boxplots in Tableau

We can also use Tableau to create several of the visualizations introduced in Chapter 2 for descriptive statistics, namely histograms and boxplots. We will begin by showing how to create a histogram in Tableau using the file AuditTime and the steps below. First, open a new Tableau sheet (click File in the Tableau Ribbon and select New), and then follow the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix to connect to the AuditTime Excel file.

AuditTime

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Audit Times (in Days) from the Measures area to the Rows area
Step 3. Click the Show Me button and select the Histogram icon to generate the default histogram
Step 4. Right-click Audit Times (in Days) (bin) in the Dimensions area and select Edit…
Step 5. When the Edit Bins [Audit Times (in Days)] dialog box appears:
        Change the Size of bins: to 5
        Click OK
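The effect of the bin size chosen in Step 5 can be sketched outside Tableau. The Python snippet below is a minimal illustration, assuming bins that begin at multiples of the bin size and include their lower edge. The audit times are illustrative values typed in by hand rather than read from the AuditTime file, and the function names are hypothetical.

```python
from collections import Counter

# Illustrative audit times in days (assumed values, not read from the AuditTime file).
audit_times = [12, 15, 20, 22, 14, 14, 15, 27, 21, 18,
               19, 18, 22, 33, 16, 18, 17, 23, 28, 13]

BIN_SIZE = 5  # same bin width entered in Step 5


def bin_start(value, size=BIN_SIZE):
    """Return the lower edge of the fixed-width bin that contains value."""
    return (value // size) * size


def histogram_counts(values, size=BIN_SIZE):
    """Count how many values fall in each bin [start, start + size)."""
    counts = Counter(bin_start(v, size) for v in values)
    return dict(sorted(counts.items()))


counts = histogram_counts(audit_times)

# Text rendering of the histogram; the "10-14" style labels assume whole-day values.
for start, n in counts.items():
    print(f"{start:>2}-{start + BIN_SIZE - 1:<2} | {'#' * n}")
```

Changing BIN_SIZE and rerunning shows how a wider or narrower bin changes the shape of the histogram, which is the same trade-off made in the Edit Bins dialog box.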


FIGURE TABLEAU 3.17

Histogram Created in Tableau for the AuditTime Data

Steps 1 through 5 create the histogram shown in Figure Tableau 3.17, which matches the histogram in Figure 2.12.

We will now show how to create boxplots in Tableau for multiple variables similar to those shown in Figure 2.25 using the file HomeSalesStacked and the steps below. Note that we are using what is known as a “stacked” version of the home sales comparison data to connect to Tableau.¹ First, open a new Tableau sheet (click File in the Tableau Ribbon and select New), and then follow the steps in the Connecting to a Data File in Tableau section at the beginning of this chapter appendix to connect to the HomeSalesStacked Excel file.

HomeSalesStacked

Step 1. Click the Sheet 1 tab at the bottom of the Tableau Data Source screen to open a new Tableau Sheet
Step 2. Drag Selling Price ($) from the Measures area to the Rows area

Steps 1 and 2 create the bar chart shown in Figure Tableau 3.18. To create a boxplot, we need to disaggregate the data, which is what we do in Step 3.

Step 3. Click the Analysis tab in the Tableau Ribbon and uncheck Aggregate Measures to disaggregate the Selling Price ($) values (see Figure Tableau 3.19)

We can now use Tableau’s Show Me tool to create the boxplot.

Step 4. Click the Show Me button and select the box-and-whisker icon to generate the default boxplot
Step 5. Drag Location from the Dimensions area to the Columns area

Steps 1 through 5 create the completed multiple variable boxplot shown in Figure Tableau 3.20. This figure is similar to the boxplots shown in Figure 2.25, but you may notice that the whiskers are different in Figure Tableau 3.20 than in Figure 2.25. This is because Tableau uses slightly different definitions for these values than Excel.

¹ A “stacked” data file means that the values for all groups (for example, locations in these data) are in a single column and each row represents a single observation (or record). This type of data file is common in databases, and most statistical analysis software expects data in this format. However, to create the multiple boxplots in Excel in Chapter 2 we had to use the “unstacked” data file HomeSalesComparison. The data are the same in files HomeSalesComparison and HomeSalesStacked, but arranged differently.
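The difference between “stacked” and “unstacked” layouts can be illustrated with a short Python sketch. The location names and prices below are made up for illustration and are not taken from HomeSalesComparison or HomeSalesStacked; only the column names Location and Selling Price ($) come from the steps above, and the helper functions are hypothetical.

```python
# Hypothetical "unstacked" layout: one column of selling prices per location.
unstacked = {
    "Location A": [302_000, 265_000, 280_000],
    "Location B": [336_000, 398_000],
}


def stack(columns, group_name="Location", value_name="Selling Price ($)"):
    """Convert {group: [values]} into one record (row) per observation."""
    return [
        {group_name: group, value_name: value}
        for group, values in columns.items()
        for value in values
    ]


def unstack(records, group_name="Location", value_name="Selling Price ($)"):
    """Rebuild the {group: [values]} layout from stacked records."""
    columns = {}
    for record in records:
        columns.setdefault(record[group_name], []).append(record[value_name])
    return columns


stacked = stack(unstacked)
for row in stacked:
    print(row)
```

Each printed row is a single observation with its group label, which is the arrangement Tableau uses when Location is dragged to the Columns area. Calling unstack(stacked) returns the original one-column-per-location arrangement, showing that the two layouts carry the same data.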


FIGURE TABLEAU 3.18

First Step in Creating a Multiple Variable Boxplot in Tableau for the HomeSalesStacked Data

FIGURE TABLEAU 3.19

Disaggregating the HomeSalesStacked Data


FIGURE TABLEAU 3.20

Completed Multiple Variable Boxplot Created in Tableau for the HomeSalesStacked Data

NOTES + COMMENTS

1. Tableau can create many additional visualizations for data including bullet graphs, violin plots, Gantt charts, and many others. These are beyond the scope of this textbook, but the References section in the textbook contains several excellent references for learning more about using Tableau for data visualization.
2. Tableau makes it very easy to create Data Dashboards from multiple charts. To create a Dashboard in Tableau, click Dashboard in the Tableau Ribbon and select New Dashboard. You can then drag individual Tableau Sheets from the Sheets area to the Drop sheets here area and arrange them into a single Dashboard.
3. Tableau includes a Presentation Mode that provides better visualizations of the charts and Dashboards created. To access Presentation Mode, click Window in the Tableau Ribbon and select Presentation Mode. To exit Presentation Mode, press the Esc key.

Chapter 4 Probability: An Introduction to Modeling Uncertainty

CONTENTS

Analytics in Action: National Aeronautics and Space Administration

4.1 EVENTS AND PROBABILITIES

4.2 SOME BASIC RELATIONSHIPS OF PROBABILITY
    Complement of an Event
    Addition Law

4.3 CONDITIONAL PROBABILITY
    Independent Events
    Multiplication Law
    Bayes’ Theorem

4.4 RANDOM VARIABLES
    Discrete Random Variables
    Continuous Random Variables

4.5 DISCRETE PROBABILITY DISTRIBUTIONS
    Custom Discrete Probability Distribution
    Expected Value and Variance
    Discrete Uniform Probability Distribution
    Binomial Probability Distribution
    Poisson Probability Distribution

4.6 CONTINUOUS PROBABILITY DISTRIBUTIONS
    Uniform Probability Distribution
    Triangular Probability Distribution
    Normal Probability Distribution
    Exponential Probability Distribution

Summary
Glossary
Problems

Available in the MindTap Reader:
    Appendix: Discrete Probability Distributions with R
    Appendix: Continuous Probability Distributions with R


Analytics in Action

National Aeronautics and Space Administration*
Washington, D.C.

The National Aeronautics and Space Administration (NASA) is the U.S. government agency that is responsible for the U.S. civilian space program and for aeronautics and aerospace research. NASA is best known for its manned space exploration; its mission statement is to “drive advances in science, technology, aeronautics, and space exploration to enhance knowledge, education, innovation, economic vitality and stewardship of Earth.” With more than 17,000 employees, NASA oversees many different space-based missions including work on the International Space Station, exploration beyond our solar system with the Hubble telescope, and planning for possible future astronaut missions to the moon and Mars.

Although NASA’s primary mission is space exploration, its expertise has been called on in assisting countries and organizations throughout the world in nonspace endeavors. In one such situation, the San José copper and gold mine in Copiapó, Chile, caved in, trapping 33 men more than 2,000 feet underground. It was important to bring the men safely to the surface as quickly as possible, but it was also imperative that the rescue effort be carefully designed and implemented to save as many miners as possible. The Chilean government asked NASA to provide assistance in developing a rescue method. NASA sent a four-person team consisting of an engineer with expertise in vehicle design, two physicians, and a psychologist with knowledge about issues of long-term confinement.

The probability of success and the failure of various other rescue methods was prominent in the thoughts of everyone involved. Since no historical data were available to apply to this unique rescue situation, NASA scientists developed subjective probability estimates for the success and failure of various rescue methods based on similar circumstances experienced by astronauts returning from short- and long-term space missions. The probability estimates provided by NASA guided officials in the selection of a rescue method and provided insight as to how the miners would survive the ascent in a rescue cage. The rescue method designed by the Chilean officials in consultation with the NASA team resulted in the construction of a 13-foot-long, 924-pound steel rescue capsule that would be used to bring up the miners one at a time. All miners were rescued, with the last emerging 68 days after the cave-in occurred.

In this chapter, you will learn about probability as well as how to compute and interpret probabilities for a variety of situations. The basic relationships of probability, conditional probability, and Bayes’ theorem will be covered. We will also discuss the concepts of random variables and probability distributions and illustrate the use of some of the more common discrete and continuous probability distributions.

*The authors are indebted to Dr. Michael Duncan and Clinton Cragg at NASA for providing this Analytics in Action.

The concept of identifying uncertainty in data was introduced in Chapters 2 and 3 through descriptive statistics and data visualization techniques, respectively. In this chapter, we expand on our discussion of modeling uncertainty by formalizing the concept of probability and introducing the concept of probability distributions.

Uncertainty is an ever-present fact of life for decision makers, and much time and effort are spent trying to plan for, and respond to, uncertainty. Consider the CEO who has to make decisions about marketing budgets and production amounts using forecasted demands. Or consider the financial analyst who must determine how to build a client’s portfolio of stocks and bonds when the rates of return for these investments are not known with certainty. In many business scenarios, data are available to provide information on possible outcomes for some decisions, but the exact outcome from a given decision is almost never known with certainty because many factors are outside the control of the decision maker (e.g., actions taken by competitors, the weather).

Probability is the numerical measure of the likelihood that an event will occur.¹ Therefore, it can be used as a measure of the uncertainty associated with an event. This measure of uncertainty is often communicated through a probability distribution. Probability distributions are extremely helpful in providing additional information about an event, and as we will see in later chapters in this textbook, they can be used to help a decision maker evaluate possible actions and determine the best course of action.

¹ Note that there are several different possible definitions of probability, depending on the method used to assign probabilities. This includes the classical definition, the relative frequency definition, and the subjective definition of probability. In this text, we most often use the relative frequency definition of probability, which assumes that probabilities are based on empirical data. For a more thorough discussion of the different possible definitions of probability see Chapter 4 of Anderson, Sweeney, Williams, Camm, Cochran, Fry, and Ohlmann, An Introduction to Statistics for Business and Economics, 14e (2020).

4.1 Events and Probabilities

In discussing probabilities, we start by defining a random experiment as a process that generates well-defined outcomes. Several examples of random experiments and their associated outcomes are shown in Table 4.1.

By specifying all possible outcomes, we identify the sample space for a random experiment. Consider the first random experiment in Table 4.1, a coin toss. The possible outcomes are head and tail. If we let S denote the sample space, we can use the following notation to describe the sample space.

$$S = \{\text{Head}, \text{Tail}\}$$

Suppose we consider the second random experiment in Table 4.1, rolling a die. The possible experimental outcomes, defined as the number of dots appearing on the upward face of the die, are the six points in the sample space for this random experiment.

$$S = \{1, 2, 3, 4, 5, 6\}$$

Outcomes and events form the foundation of the study of probability. Formally, an event is defined as a collection of outcomes. For example, consider the case of an expansion project being undertaken by California Power & Light Company (CP&L). CP&L is starting a project designed to increase the generating capacity of one of its plants in Southern California. An analysis of similar construction projects indicates that the possible completion times for the project are 8, 9, 10, 11, and 12 months. Each of these possible completion times represents a possible outcome for this project. Table 4.2 shows the number of past construction projects that required 8, 9, 10, 11, and 12 months.

Let us assume that the CP&L project manager is interested in completing the project in 10 months or less. Referring to Table 4.2, we see that three possible outcomes (8 months, 9 months, and 10 months) provide completion times of 10 months or less. Letting C denote the event that the project is completed in 10 months or less, we write:

$$C = \{8, 9, 10\}$$

Event C is said to occur if any one of these outcomes occurs.

A variety of additional events can be defined for the CP&L project:

L = the event that the project is completed in less than 10 months = $\{8, 9\}$
M = the event that the project is completed in more than 10 months = $\{11, 12\}$

In each case, the event must be identified as a collection of outcomes for the random experiment.

Table 4.1

Random Experiments and Experimental Outcomes

| Random Experiment | Experimental Outcomes |
|---|---|
| Toss a coin | Head, tail |
| Roll a die | 1, 2, 3, 4, 5, 6 |
| Conduct a sales call | Purchase, no purchase |
| Hold a particular share of stock for one year | Price of stock goes up, price of stock goes down, no change in stock price |
| Reduce price of product | Demand goes up, demand goes down, no change in demand |


Table 4.2 Completion Times for 40 CP&L Projects

| Completion Time (months) | No. of Past Projects Having This Completion Time | Probability of Outcome |
|---|---|---|
| 8 | 6 | 6/40 = 0.15 |
| 9 | 10 | 10/40 = 0.25 |
| 10 | 12 | 12/40 = 0.30 |
| 11 | 6 | 6/40 = 0.15 |
| 12 | 6 | 6/40 = 0.15 |
| Total | 40 | 1.00 |

The probability of an event is equal to the sum of the probabilities of outcomes for the event. Using this definition and given the probabilities of outcomes shown in Table 4.2, we can now calculate the probability of the event $C = \{8, 9, 10\}$. The probability of event C, denoted P(C), is given by

$$P(C) = P(8) + P(9) + P(10) = 0.15 + 0.25 + 0.30 = 0.70$$

Similarly, because the event that the project is completed in less than 10 months is given by $L = \{8, 9\}$, the probability of this event is given by

$$P(L) = P(8) + P(9) = 0.15 + 0.25 = 0.40$$

Finally, for the event that the project is completed in more than 10 months, we have $M = \{11, 12\}$ and thus

$$P(M) = P(11) + P(12) = 0.15 + 0.15 = 0.30$$

Using these probability results, we can now tell CP&L management that there is a 0.70 probability that the project will be completed in 10 months or less, a 0.40 probability that it will be completed in less than 10 months, and a 0.30 probability that it will be completed in more than 10 months.
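The calculations for events C, L, and M can be reproduced with a few lines of Python. This is a minimal sketch using the counts from Table 4.2 and the relative frequency definition of probability; the variable and function names are hypothetical, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

# Number of past projects by completion time in months (counts from Table 4.2).
past_projects = {8: 6, 9: 10, 10: 12, 11: 6, 12: 6}
total = sum(past_projects.values())  # 40 projects

# Relative-frequency probability of each outcome.
p_outcome = {months: Fraction(count, total) for months, count in past_projects.items()}


def prob(event):
    """Probability of an event = sum of the probabilities of its outcomes."""
    return sum(p_outcome[outcome] for outcome in event)


C = {8, 9, 10}  # completed in 10 months or less
L = {8, 9}      # completed in less than 10 months
M = {11, 12}    # completed in more than 10 months

for name, event in [("C", C), ("L", L), ("M", M)]:
    print(f"P({name}) = {float(prob(event)):.2f}")
```

Representing each event as a set of outcomes mirrors the definition of an event as a collection of outcomes, so any other event for the CP&L project can be evaluated by passing a different set to prob.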

4.2 Some Basic Relationships of Probability

Complement of an Event

The complement of event A is sometimes written as $\bar{A}$ or $A'$ in other textbooks.

Given an event A, the complement of A is defined to be the event consisting of all outcomes that are not in A. The complement of A is denoted by $A^C$. Figure 4.1 shows what is known as a Venn diagram, which illustrates the concept of a complement. The rectangular area represents the sample space for the random experiment and, as such, contains all possible outcomes. The circle represents event A and contains only the outcomes that belong to A. The shaded region of the rectangle contains all outcomes not in event A and is by definition the complement of A.

In any probability application, either event A or its complement $A^C$ must occur. Therefore, we have

$$P(A) + P(A^C) = 1$$

Solving for P(A), we obtain the following result:

Computing Probability Using the Complement

$$P(A) = 1 - P(A^C) \tag{4.1}$$


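FIGURE 4.1 Venn Diagram for Event A

[Venn diagram: a rectangle labeled Sample Space S containing a circle labeled Event A; the shaded region outside the circle is labeled $A^C$, Complement of Event A.]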

Equation (4.1) shows that the probability of an event A can be computed easily if the probability of its complement, $P(A^C)$, is known. As an example, consider the case of a sales manager who, after reviewing sales reports, states that 80% of new customer contacts result in no sale. By allowing A to denote the event of a sale and $A^C$ to denote the event of no sale, the manager is stating that $P(A^C) = 0.80$. Using equation (4.1), we see that

$$P(A) = 1 - P(A^C) = 1 - 0.80 = 0.20$$

We can conclude that a new customer contact has a 0.20 probability of resulting in a sale.
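Equation (4.1) can be checked with a short Python sketch. It applies the complement rule to the sales example and, as an additional illustration, to the CP&L events from Section 4.1, where M = {11, 12} contains exactly the outcomes not in C = {8, 9, 10}. The function name is hypothetical.

```python
from fractions import Fraction


def complement_probability(p_event):
    """Return 1 - p_event, the probability of the complementary event."""
    if not 0 <= p_event <= 1:
        raise ValueError("a probability must lie between 0 and 1")
    return 1 - p_event


# Sales example: P(no sale) = 0.80, so P(sale) = 1 - 0.80.
p_no_sale = Fraction(80, 100)
p_sale = complement_probability(p_no_sale)

# CP&L example: M is the complement of C, and P(C) = 0.70.
p_C = Fraction(70, 100)
p_M = complement_probability(p_C)

print(float(p_sale))  # 0.2
print(float(p_M))     # 0.3
```

The second result agrees with the value P(M) = 0.30 obtained earlier by summing P(11) and P(12), which is the point of the complement rule: whichever of the two probabilities is easier to compute gives the other for free.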

Addition Law

The addition law is helpful when we are interested in knowing the probability that at least one of two events will occur. That is, with events A and B we are interested in knowing the probability that event A or event B occurs or both events occur.

Before we present the addition law, we need to discuss two concepts related to the combination of events: the union of events and the intersection of events. Given two events A and B, the union of A and B is defined as the event containing all outcomes belonging to A or B or both. The union of A and B is denoted by $A \cup B$.

The Venn diagram in Figure 4.2 depicts the union of A and B. Note that one circle contains all the outcomes in A and the other all the outcomes in B. The fact that the circles overlap indicates that some outcomes are contained in both A and B.

FIGURE 4.2 Venn Diagram for the Union of Events A and B

[Venn diagram: a rectangle labeled Sample Space S containing two overlapping circles labeled Event A and Event B.]


Given two events A and B, the intersection of A and B is defined as the event containing the outcomes that belong to both A and B. The intersection of A and B is denoted by $A \cap B$. The Venn diagram depicting the intersection of A and B is shown in Figure 4.3. The area in which the two circles overlap is the intersection; it contains outcomes that are in both A and B.

The addition law provides a way to compute the probability that event A or event B occurs or both events occur. In other words, the addition law is used to compute the probability of the union of two events. The addition law is written as follows:

Addition Law

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{4.2}$$

To understand the addition law intuitively, note that the first two terms in the addition law, $P(A) + P(B)$, account for all the sample points in $A \cup B$. However, because the sample points in the intersection $A \cap B$ are in both A and B, when we compute $P(A) + P(B)$, we are in effect counting each of the sample points in $A \cap B$ twice. We correct for this double counting by subtracting $P(A \cap B)$.

As an example of the addition law, consider a study conducted by the human resources manager of a major computer software company. The study showed that 30% of the employees who left the firm within two years did so primarily because they were dissatisfied with their salary, 20% left because they were dissatisfied with their work assignments, and 12% of the former employees indicated dissatisfaction with both their salary and their work assignments. What is the probability that an employee who leaves within two years does so because of dissatisfaction with salary, dissatisfaction with the work assignment, or both? Let

S = the event that the employee leaves because of salary
W = the event that the employee leaves because of work assignment

We can also think of this probability in the following manner: What proportion of employees either left because of salary or left because of work assignment?

From the survey results, we have $P(S) = 0.30$, $P(W) = 0.20$, and $P(S \cap W) = 0.12$. Using the addition law from equation (4.2), we have

$$P(S \cup W) = P(S) + P(W) - P(S \cap W) = 0.30 + 0.20 - 0.12 = 0.38$$

This calculation tells us that there is a 0.38 probability that an employee will leave for salary or work assignment reasons.

FIGURE 4.3 Venn Diagram for the Intersection of Events A and B

[Venn diagram: a rectangle labeled Sample Space S containing two overlapping circles labeled Event A and Event B; the overlap is the intersection.]

Before we conclude our discussion of the addition law, let us consider a special case that arises for mutually exclusive events. Events A and B are mutually exclusive if the occurrence of one event precludes the occurrence of the other. Thus, a requirement for A and B to be mutually exclusive is that their intersection must contain no sample points. The Venn diagram depicting two mutually exclusive events A and B is shown in Figure 4.4. In this case $P(A \cap B) = 0$ and the addition law can be written as follows:

Addition Law for Mutually Exclusive Events

$$P(A \cup B) = P(A) + P(B)$$

FIGURE 4.4 Venn Diagram for Mutually Exclusive Events

[Venn diagram: a rectangle labeled Sample Space S containing two circles, labeled Event A and Event B, that do not overlap.]
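The double-counting argument behind the addition law can be made concrete with Python sets. The sketch below builds a hypothetical group of 100 former employees whose proportions match the study (30% salary, 20% work assignment, 12% both); the employee ids and the helper name prob are assumptions made only for illustration.

```python
from fractions import Fraction

# Hypothetical group of 100 former employees (ids 0-99) built to match the study.
employees = set(range(100))
salary = set(range(0, 30))   # event S: 30 employees
work = set(range(18, 38))    # event W: 20 employees, 12 of whom (ids 18-29) are also in S


def prob(event, sample_space=employees):
    """Probability as the share of equally likely sample points in the event."""
    return Fraction(len(event), len(sample_space))


# Union computed directly and through the addition law, equation (4.2).
p_union_direct = prob(salary | work)
p_union_law = prob(salary) + prob(work) - prob(salary & work)

# Mutually exclusive special case: no shared sample points, so nothing is subtracted.
event_a = set(range(0, 30))
event_b = set(range(50, 70))
p_exclusive = prob(event_a | event_b)

print(float(p_union_direct), float(p_union_law))  # 0.38 0.38
print(float(p_exclusive))                         # 0.5
```

Counting the union directly and applying equation (4.2) give the same 0.38, because the 12 employees in both sets are counted once in the union but twice in P(S) + P(W). For event_a and event_b the intersection is empty, so the simpler form for mutually exclusive events applies.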

More generally, two events are said to be mutually exclusive if the events have no outcomes in common.

NOTES + COMMENTS

The addition law can be extended beyond two events. For example, the addition law for three events A, B, and C is

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$

Similar logic can be used to derive the expressions for the addition law for more than three events.
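The pattern for three or more events (add the single events, subtract the pairwise intersections, add the three-way intersections, and so on) is commonly called inclusion-exclusion, a name not used in the text above. The following Python sketch applies it to an arbitrary list of events and checks it against a direct count of the union. The sample space of the integers 1 through 20 and the three events are assumptions chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction
from itertools import combinations


def prob(event, sample_space):
    """Probability as the share of equally likely sample points in the event."""
    return Fraction(len(event), len(sample_space))


def union_probability(events, sample_space):
    """Addition law for any number of events: alternate adding and subtracting
    the probabilities of intersections of 1, 2, 3, ... events."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(events, k):
            total += sign * prob(set.intersection(*group), sample_space)
    return total


# Hypothetical equally likely sample space and three overlapping events.
sample_space = set(range(1, 21))
A = {n for n in sample_space if n % 2 == 0}  # multiples of 2
B = {n for n in sample_space if n % 3 == 0}  # multiples of 3
C = {n for n in sample_space if n % 5 == 0}  # multiples of 5

p_law = union_probability([A, B, C], sample_space)
p_direct = prob(A | B | C, sample_space)
print(p_law, p_direct)
```

With these events the individual terms are 10/20 + 6/20 + 4/20 - 3/20 - 2/20 - 1/20 + 0/20, which matches the three-event formula term by term and agrees with the direct count of the union.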

4.3 Conditional Probability

Often, the probability of one event is dependent on whether some related event has already occurred. Suppose we have an event A with probability P(A). If we learn that a related event, denoted by B, has already occurred, we take advantage of this information by calculating a new probability for event A. This new probability of event A is called a conditional probability and is written P(A | B). The notation | indicates that we are considering the probability of event A given the condition that event B has occurred. Hence, the notation P(A | B) reads “the probability of A given B.”

To illustrate the idea of conditional probability, consider a bank that is interested in the mortgage default risk for its home mortgage customers. Table 4.3 shows the first 25 records of the 300 home mortgage customers at Lancaster Savings and Loan, a company that specializes in high-risk subprime lending. Some of these home mortgage customers have defaulted on their mortgages and others have continued to make on-time payments. These data include the age of the customer at the time of mortgage origination, the marital status of the customer (single or married), the annual income of the customer, the mortgage amount, the number of payments made by the customer per year on the mortgage, the total amount paid by the customer over the lifetime of the mortgage, and whether or not the customer defaulted on her or his mortgage.


Chapter 4 Probability: An Introduction to Modeling Uncertainty

Table 4.3 Subset of Data from 300 Home Mortgages of Customers at Lancaster Savings and Loan

| Customer No. | Age | Marital Status | Annual Income | Mortgage Amount | Payments per Year | Total Amount Paid | Default on Mortgage? |
|---|---|---|---|---|---|---|---|
| 1 | 37 | Single | $172,125.70 | $473,402.96 | 24 | $581,885.13 | Yes |
| 2 | 31 | Single | $108,571.04 | $300,468.60 | 12 | $489,320.38 | No |
| 3 | 37 | Married | $124,136.41 | $330,664.24 | 24 | $493,541.93 | Yes |
| 4 | 24 | Married | $79,614.04 | $230,222.94 | 24 | $449,682.09 | Yes |
| 5 | 27 | Single | $68,087.33 | $282,203.53 | 12 | $520,581.82 | No |
| 6 | 30 | Married | $59,959.80 | $251,242.70 | 24 | $356,711.58 | Yes |
| 7 | 41 | Single | $99,394.05 | $282,737.29 | 12 | $524,053.46 | No |
| 8 | 29 | Single | $38,527.35 | $238,125.19 | 12 | $468,595.99 | No |
| 9 | 31 | Married | $112,078.62 | $297,133.24 | 24 | $399,617.40 | Yes |
| 10 | 36 | Single | $224,899.71 | $622,578.74 | 12 | $1,233,002.14 | No |
| 11 | 31 | Married | $27,945.36 | $215,440.31 | 24 | $285,900.10 | Yes |
| 12 | 40 | Single | $48,929.74 | $252,885.10 | 12 | $336,574.63 | No |
| 13 | 39 | Married | $82,810.92 | $183,045.16 | 12 | $262,537.23 | No |
| 14 | 31 | Single | $68,216.88 | $165,309.34 | 12 | $253,633.17 | No |
| 15 | 40 | Single | $59,141.13 | $220,176.18 | 12 | $424,749.80 | No |
| 16 | 45 | Married | $72,568.89 | $233,146.91 | 12 | $356,363.93 | No |
| 17 | 32 | Married | $101,140.43 | $245,360.02 | 24 | $388,429.41 | Yes |
| 18 | 37 | Married | $124,876.53 | $320,401.04 | 4 | $360,783.45 | Yes |
| 19 | 32 | Married | $133,093.15 | $494,395.63 | 12 | $861,874.67 | No |
| 20 | 32 | Single | $85,268.67 | $159,010.33 | 12 | $308,656.11 | No |
| 21 | 37 | Single | $92,314.96 | $249,547.14 | 24 | $342,339.27 | Yes |
| 22 | 29 | Married | $120,876.13 | $308,618.37 | 12 | $472,668.98 | No |
| 23 | 24 | Single | $86,294.13 | $258,321.78 | 24 | $380,347.56 | Yes |
| 24 | 32 | Married | $216,748.68 | $634,609.61 | 24 | $915,640.13 | Yes |
| 25 | 44 | Single | $46,389.75 | $194,770.91 | 12 | $385,288.86 | No |

Lancaster Savings and Loan is interested in whether the probability of a customer defaulting on a mortgage differs by marital status. Let

S = event that a customer is single
M = event that a customer is married
D = event that a customer defaulted on his or her mortgage
$D^C$ = event that a customer did not default on his or her mortgage

Chapter 3 discusses PivotTables in more detail.

Table 4.4 shows a crosstabulation for two events that can be derived from the Lancaster Savings and Loan mortgage data. Note that we can easily create Table 4.4 in Excel using a PivotTable by using the following steps:

Step 1. In the Values worksheet of the MortgageDefaultData file, click the Insert tab on the Ribbon
Step 2. Click PivotTable in the Tables group
Step 3. When the Create PivotTable dialog box appears:
    Choose Select a Table or Range
    Enter A1:H301 in the Table/Range: box
    Select New Worksheet as the location for the PivotTable Report
    Click OK
Step 4. In the PivotTable Fields area go to Drag fields between areas below:
    Drag the Marital Status field to the ROWS area
    Drag the Default on Mortgage? field to the COLUMNS area
    Drag the Customer Number field to the VALUES area
Step 5. Click on Sum of Customer Number in the VALUES area and select Value Field Settings
Step 6. When the Value Field Settings dialog box appears:
    Under Summarize value field by, select Count

These steps produce the PivotTable shown in Figure 4.5.
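For readers working outside Excel, the counting that the PivotTable performs can be sketched in a few lines of Python. This is an illustration only: it uses just the 25 records listed in Table 4.3 (the full 300-record file is not reproduced here), so the counts it prints are for that subset and will not match the 300-customer crosstabulation. The variable names are assumptions.

```python
from collections import Counter

# (marital status, default on mortgage?) for customers 1-25 in Table 4.3.
records = [
    ("Single", "Yes"), ("Single", "No"), ("Married", "Yes"), ("Married", "Yes"),
    ("Single", "No"), ("Married", "Yes"), ("Single", "No"), ("Single", "No"),
    ("Married", "Yes"), ("Single", "No"), ("Married", "Yes"), ("Single", "No"),
    ("Married", "No"), ("Single", "No"), ("Single", "No"), ("Married", "No"),
    ("Married", "Yes"), ("Married", "Yes"), ("Married", "No"), ("Single", "No"),
    ("Single", "Yes"), ("Married", "No"), ("Single", "Yes"), ("Married", "Yes"),
    ("Single", "No"),
]

# Count how many customers fall into each (status, default) cell.
crosstab = Counter(records)

for status in ("Married", "Single"):
    no_default = crosstab[(status, "No")]
    default = crosstab[(status, "Yes")]
    print(f"{status:8s} No Default={no_default:2d}  Default={default:2d}  "
          f"Total={no_default + default:2d}")
```

Rows play the role of the ROWS area (marital status), and the two counted outcomes play the role of the COLUMNS area (default or no default).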

Table 4.4 Crosstabulation of Marital Status and if Customer Defaults on Mortgage

| Marital Status | No Default | Default | Total |
|---|---|---|---|
| Married | 64 | 79 | 143 |
| Single | 116 | 41 | 157 |
| Total | 180 | 120 | 300 |

FIGURE 4.5

PivotTable for Marital Status and Whether Customer Defaults on Mortgage

From Table 4.4 or Figure 4.5, the probability that a customer defaults on his or her mortgage is $120/300 = 0.4$. The probability that a customer does not default on his or her mortgage is $1 - 0.4 = 0.6$ (or $180/300 = 0.6$). But is this probability different for married customers as compared with single customers? Conditional probability allows us to answer this question. But first, let us answer a related question: What is the probability that a randomly selected customer defaults on his or her mortgage and the customer is married?

The probability that a randomly selected customer is married and the customer defaults on his or her mortgage is written as $P(M \cap D)$. This probability is calculated as $P(M \cap D) = \frac{79}{300} = 0.2633$.

We can also think of this joint probability in the following manner: What proportion of all customers is both married and defaulted on their loans?

Similarly,

$P(M \cap D^C) = \frac{64}{300} = 0.2133$ is the probability that a randomly selected customer is married and that the customer does not default on his or her mortgage.

$P(S \cap D) = \frac{41}{300} = 0.1367$ is the probability that a randomly selected customer is single and that the customer defaults on his or her mortgage.

$P(S \cap D^C) = \frac{116}{300} = 0.3867$ is the probability that a randomly selected customer is single and that the customer does not default on his or her mortgage.

Because each of these values gives the probability of the intersection of two events, these probabilities are called joint probabilities. Table 4.5, which provides a summary of the probability information for customer defaults on mortgages, is referred to as a joint probability table.

The values in the Total column and Total row (the margins) of Table 4.5 provide the probabilities of each event separately. That is, $P(M) = 0.4766$, $P(S) = 0.5234$, $P(D^C) = 0.6000$, and $P(D) = 0.4000$. These probabilities are referred to as marginal probabilities because of their location in the margins of the joint probability table. The marginal probabilities are found by summing the joint probabilities in the corresponding row or column of the joint probability table. From the marginal probabilities, we see that 60% of customers do not default on their mortgage, 40% of customers default on their mortgage, 47.66% of customers are married, and 52.34% of customers are single.

Let us begin the conditional probability analysis by computing the probability that a customer defaults on his or her mortgage given that the customer is married. In conditional probability notation, we are attempting to determine $P(D \mid M)$, which is read as “the probability that the customer defaults on the mortgage given that the customer is married.” To calculate $P(D \mid M)$, first we note that we are concerned only with the 143 customers who are married (M). Because 79 of the 143 married customers defaulted on their mortgages, the probability of a customer defaulting given that the customer is married is $79/143 = 0.5524$. In other words, given that a customer is married, there is a 55.24% chance that he or she will default. Note also that the conditional probability $P(D \mid M)$ can be computed as the ratio of the joint probability $P(D \cap M)$ to the marginal probability $P(M)$.

$P(D \mid M) = \frac{P(D \cap M)}{P(M)} = \frac{0.2633}{0.4766} = 0.5524$

We can use the PivotTable from Figure 4.5 to easily create the joint probability table in Excel. To do so, right-click on any of the numerical values in the PivotTable, select Show Values As, and choose % of Grand Total. The resulting values, which are percentages of the total, can then be divided by 100 to create the probabilities in the joint probability table.

TABLE 4.5 Joint Probability Table for Customer Mortgage Prepayments

| | No Default ($D^C$) | Default (D) | Total |
|---|---|---|---|
| Married (M) | 0.2133 | 0.2633 | 0.4766 |
| Single (S) | 0.3867 | 0.1367 | 0.5234 |
| Total | 0.6000 | 0.4000 | 1.0000 |

The four values in the body of the table are the joint probabilities; the values in the Total row and Total column are the marginal probabilities.
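The joint probability table can also be rebuilt directly from the counts in Table 4.4. The following is a minimal sketch with assumed variable and function names. One detail to be aware of: computed from the counts, $P(M) = 143/300$ rounds to 0.4767, whereas the table's 0.4766 is the sum of the two rounded joint probabilities (0.2133 + 0.2633).

```python
# Counts from Table 4.4, keyed by (marital status, default outcome).
counts = {
    ("Married", "No Default"): 64,
    ("Married", "Default"): 79,
    ("Single", "No Default"): 116,
    ("Single", "Default"): 41,
}
total = sum(counts.values())  # 300 customers

# Joint probabilities: each cell count divided by the grand total.
joint = {cell: n / total for cell, n in counts.items()}

def marginal(label):
    """Sum the joint probabilities of every cell whose key contains the label."""
    return sum(p for cell, p in joint.items() if label in cell)

for cell, p in joint.items():
    print(cell, round(p, 4))
for label in ("Married", "Single", "No Default", "Default"):
    print(label, round(marginal(label), 4))
```

Because `label in cell` tests membership in the key tuple, "Default" matches only the Default column and not "No Default".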


The fact that a conditional probability can be computed as the ratio of a joint probability to a marginal probability provides the following general formula for conditional probability calculations for two events A and B.

Conditional Probability

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ (4.3)

or

$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$ (4.4)

We have already determined that the probability a customer who is married will default is 0.5524. How does this compare to a customer who is single? That is, we want to find $P(D \mid S)$. From equation (4.3), we can compute $P(D \mid S)$ as

$P(D \mid S) = \frac{P(D \cap S)}{P(S)} = \frac{0.1367}{0.5234} = 0.2611$

In other words, the chance that a customer will default if the customer is single is 26.11%. This is substantially less than the chance of default if the customer is married.

Note that we could also answer this question using the Excel PivotTable in Figure 4.5. We can calculate these conditional probabilities by right-clicking on any numerical value in the body of the PivotTable and then selecting Show Values As and choosing % of Row Total. The modified Excel PivotTable is shown in Figure 4.6.
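As a rough check on these figures, equation (4.3) can be applied to the counts in Table 4.4. The helper name `conditional` below is an assumption made for this sketch, not something defined in the text.

```python
# Counts from Table 4.4.
married_default, married_total = 79, 143
single_default, single_total = 41, 157
grand_total = 300

def conditional(p_joint, p_condition):
    """P(A | B) = P(A and B) / P(B), as in equation (4.3)."""
    if p_condition == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return p_joint / p_condition

# Joint and marginal probabilities are counts divided by the grand total.
p_d_given_m = conditional(married_default / grand_total,
                          married_total / grand_total)
p_d_given_s = conditional(single_default / grand_total,
                          single_total / grand_total)

print(round(p_d_given_m, 4), round(p_d_given_s, 4))
```

The grand total cancels in each ratio, which is why dividing 79 by 143 (or 41 by 157) directly gives the same answers.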

FIGURE 4.6

Using Excel PivotTable to Calculate Conditional Probabilities


By calculating the % of Row Total, the Excel PivotTable in Figure 4.6 shows that 55.24% of married customers defaulted on mortgages, but only 26.11% of single customers defaulted.

Independent Events

Note that in our example, $P(D) = 0.4000$, $P(D \mid M) = 0.5524$, and $P(D \mid S) = 0.2611$. So the probability that a customer defaults is influenced by whether the customer is married or single. Because $P(D \mid M) \neq P(D)$, we say that events D and M are dependent. However, if the probability of event D is not changed by the existence of event M, that is, if $P(D \mid M) = P(D)$, then we would say that events D and M are independent events. This is summarized for two events A and B as follows:

Independent Events

Two events A and B are independent if

$P(A \mid B) = P(A)$ (4.5)

or

$P(B \mid A) = P(B)$ (4.6)

Otherwise, the events are dependent.
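The independence condition in equation (4.5) amounts to a single comparison. The sketch below applies it to the mortgage counts and, for contrast, to a made-up crosstabulation in which marital status has no effect on default. The function name, the tolerance, and the hypothetical counts are all assumptions for illustration.

```python
import math

def is_independent(p_a_given_b, p_a, tol=1e-9):
    """Equation (4.5): A and B are independent if P(A | B) equals P(A)."""
    return math.isclose(p_a_given_b, p_a, abs_tol=tol)

# Lancaster Savings and Loan (counts from Table 4.4).
p_d = 120 / 300          # P(D)
p_d_given_m = 79 / 143   # P(D | M)
mortgage_independent = is_independent(p_d_given_m, p_d)

# Hypothetical data: 60 of 150 married customers default,
# and 120 of 300 customers default overall.
hypothetical_independent = is_independent(60 / 150, 120 / 300)

print(mortgage_independent, hypothetical_independent)
```

With sample data, an exact equality is rare, so in practice a looser tolerance or a formal statistical test would be needed; the strict comparison here only mirrors the definition.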

Multiplication Law

The multiplication law can be used to calculate the probability of the intersection of two events. The multiplication law is based on the definition of conditional probability. Solving equations (4.3) and (4.4) for $P(A \cap B)$, we obtain the multiplication law.

Multiplication Law

$P(A \cap B) = P(B)P(A \mid B)$ (4.7)

or

$P(A \cap B) = P(A)P(B \mid A)$ (4.8)

To illustrate the use of the multiplication law, we will calculate the probability that a customer defaults on his or her mortgage and the customer is married, $P(D \cap M)$. From equation (4.7), this is calculated as $P(D \cap M) = P(M)P(D \mid M)$. From Table 4.5 we know that $P(M) = 0.4766$, and from our previous calculations we know that the conditional probability $P(D \mid M) = 0.5524$. Therefore,

$P(D \cap M) = P(M)P(D \mid M) = (0.4766)(0.5524) = 0.2633$

This value matches the value shown for $P(D \cap M)$ in Table 4.5. The multiplication law is useful when we know conditional probabilities but do not know the joint probabilities.

Consider the special case in which events A and B are independent. From equations (4.5) and (4.6), $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. Using these equations to simplify equations (4.7) and (4.8) for this special case, we obtain the following multiplication law for independent events.

Multiplication Law for Independent Events

$P(A \cap B) = P(A)P(B)$ (4.9)

To compute the probability of the intersection of two independent events, we simply multiply the probabilities of each event.
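Both forms of the multiplication law can be tried out with exact fractions. The first calculation uses the counts from Table 4.4; the second uses a hypothetical pair of independent events (a fair coin showing heads and a fair die showing 6) that is not part of the text.

```python
from fractions import Fraction

# Equation (4.7) with B = M and A = D: P(D and M) = P(M) * P(D | M).
p_m = Fraction(143, 300)          # P(M) from Table 4.4
p_d_given_m = Fraction(79, 143)   # P(D | M) from Table 4.4
p_d_and_m = p_m * p_d_given_m

# Equation (4.9) for two independent events (hypothetical example).
p_heads = Fraction(1, 2)
p_six = Fraction(1, 6)
p_heads_and_six = p_heads * p_six

print(p_d_and_m, round(float(p_d_and_m), 4))
print(p_heads_and_six)
```

Using `Fraction` avoids the small rounding differences that arise when the four-decimal values 0.4766 and 0.5524 are multiplied.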


Bayes’ Theorem

Bayes’ theorem is also discussed in Chapter 15 in the context of decision analysis.

Revising probabilities when new information is obtained is an important aspect of probability analysis. Often, we begin the analysis with initial or prior probability estimates for specific events of interest. Then, from sources such as a sample survey or a product test, we obtain additional information about the events. Given this new information, we update the prior probability values by calculating revised probabilities, referred to as posterior probabilities. Bayes’ theorem provides a means for making these probability calculations.

As an application of Bayes’ theorem, consider a manufacturing firm that receives shipments of parts from two different suppliers. Let $A_1$ denote the event that the part is from supplier 1 and let $A_2$ denote the event that a part is from supplier 2. Currently, 65% of the parts purchased by the company are from supplier 1 and the remaining 35% are from supplier 2. Hence, if a part is selected at random, we would assign the prior probabilities $P(A_1) = 0.65$ and $P(A_2) = 0.35$. The quality of the purchased parts varies according to their source. Historical data suggest that the quality ratings of the two suppliers are as shown in Table 4.6. If we let G be the event that a part is good and we let B be the event that a part is bad, the information in Table 4.6 enables us to calculate the following conditional probability values:

$P(G \mid A_1) = 0.98$    $P(B \mid A_1) = 0.02$
$P(G \mid A_2) = 0.95$    $P(B \mid A_2) = 0.05$

Figure 4.7 shows a diagram that depicts the process of the firm receiving a part from one of the two suppliers and then discovering that the part is good or bad as a two-step random experiment. We see that four outcomes are possible; two correspond to the part being good and two correspond to the part being bad. Each of the outcomes is the intersection of two events, so we can use the multiplication rule to compute the probabilities. For instance,

$P(A_1, G) = P(A_1 \cap G) = P(A_1)P(G \mid A_1)$

The process of computing these joint probabilities can be depicted in what is called a probability tree (see Figure 4.8). From left to right through the tree, the probabilities for each branch at step 1 are prior probabilities and the probabilities for each branch at step 2 are conditional probabilities. To find the probability of each experimental outcome, simply multiply the probabilities on the branches leading to the outcome. Each of these joint probabilities is shown in Figure 4.8 along with the known probabilities for each branch.

Now suppose that the parts from the two suppliers are used in the firm’s manufacturing process and that a machine breaks down while attempting the process using a bad part. Given the information that the part is bad, what is the probability that it came from supplier 1 and what is the probability that it came from supplier 2? With the information in the probability tree (Figure 4.8), Bayes’ theorem can be used to answer these questions. For the case in which there are only two events ($A_1$ and $A_2$), Bayes’ theorem can be written as follows:

Bayes’ Theorem (Two-Event Case)

$P(A_1 \mid B) = \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)}$ (4.10)

$P(A_2 \mid B) = \frac{P(A_2)P(B \mid A_2)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)}$ (4.11)


Table 4.6 Historical Quality Levels for Two Suppliers

| | % Good Parts | % Bad Parts |
|---|---|---|
| Supplier 1 | 98 | 2 |
| Supplier 2 | 95 | 5 |

FIGURE 4.7

Diagram for Two-Supplier Example

[Tree diagram: Step 1 (Supplier) branches to $A_1$ and $A_2$; Step 2 (Condition) branches from each supplier to G and B; the four outcomes are $(A_1, G)$, $(A_1, B)$, $(A_2, G)$, and $(A_2, B)$.]

Note: Step 1 shows that the part comes from one of two suppliers and Step 2 shows whether the part is good or bad.

FIGURE 4.8

Probability Tree for Two-Supplier Example

[Probability tree: the Step 1 (Supplier) branches are $P(A_1) = 0.65$ and $P(A_2) = 0.35$; the Step 2 (Condition) branches are $P(G \mid A_1) = 0.98$, $P(B \mid A_1) = 0.02$, $P(G \mid A_2) = 0.95$, and $P(B \mid A_2) = 0.05$. The probability of each outcome is:]

$P(A_1 \cap G) = P(A_1)P(G \mid A_1) = (0.65)(0.98) = 0.6370$
$P(A_1 \cap B) = P(A_1)P(B \mid A_1) = (0.65)(0.02) = 0.0130$
$P(A_2 \cap G) = P(A_2)P(G \mid A_2) = (0.35)(0.95) = 0.3325$
$P(A_2 \cap B) = P(A_2)P(B \mid A_2) = (0.35)(0.05) = 0.0175$
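The probability tree is a set of branch products, which a nested dictionary can represent compactly. The sketch below uses the probabilities from the two-supplier example; the data layout and names are assumptions chosen for illustration.

```python
import math

# Step 1 branches: prior probabilities for each supplier.
priors = {"A1": 0.65, "A2": 0.35}

# Step 2 branches: conditional probabilities of a good (G) or bad (B) part.
conditionals = {
    "A1": {"G": 0.98, "B": 0.02},
    "A2": {"G": 0.95, "B": 0.05},
}

# Each leaf probability is the product of the branches leading to it.
leaves = {
    (supplier, condition): priors[supplier] * p
    for supplier, branch in conditionals.items()
    for condition, p in branch.items()
}

for outcome, p in leaves.items():
    print(outcome, round(p, 4))
print("sum of leaves:", round(sum(leaves.values()), 4))
```

Because the four outcomes are mutually exclusive and cover every possibility, their probabilities add to 1, which is a convenient check on the arithmetic.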


Using equation (4.10) and the probability values provided in Figure 4.8, we have

$P(A_1 \mid B) = \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} = \frac{(0.65)(0.02)}{(0.65)(0.02) + (0.35)(0.05)} = \frac{0.0130}{0.0130 + 0.0175} = \frac{0.0130}{0.0305} = 0.4262$

Using equation (4.11), we find $P(A_2 \mid B)$ as

$P(A_2 \mid B) = \frac{P(A_2)P(B \mid A_2)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} = \frac{(0.35)(0.05)}{(0.65)(0.02) + (0.35)(0.05)} = \frac{0.0175}{0.0130 + 0.0175} = \frac{0.0175}{0.0305} = 0.5738$
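The two posterior probabilities can be reproduced with a few lines of arithmetic. This is a sketch of equations (4.10) and (4.11) only; the function name `bayes_two_event` is an assumption.

```python
def bayes_two_event(p_a1, p_b_given_a1, p_a2, p_b_given_a2):
    """Return (P(A1 | B), P(A2 | B)) using equations (4.10) and (4.11)."""
    joint_1 = p_a1 * p_b_given_a1   # P(A1 and B)
    joint_2 = p_a2 * p_b_given_a2   # P(A2 and B)
    p_b = joint_1 + joint_2         # P(B), the shared denominator
    return joint_1 / p_b, joint_2 / p_b

# Two-supplier example: priors 0.65 and 0.35, bad-part rates 0.02 and 0.05.
post_1, post_2 = bayes_two_event(0.65, 0.02, 0.35, 0.05)
print(round(post_1, 4), round(post_2, 4))
```

The two posteriors share the same denominator, $P(B)$, so they necessarily sum to 1.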

If the union of events is the entire sample space, the events are said to be collectively exhaustive.

Note that in this application we started with a probability of 0.65 that a part selected at random was from supplier 1. However, given information that the part is bad, the probability that the part is from supplier 1 drops to 0.4262. In fact, if the part is bad, the chance is better than 50–50 that it came from supplier 2; that is, $P(A_2 \mid B) = 0.5738$.

Bayes’ theorem is applicable when events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space. For the case of n mutually exclusive events $A_1, A_2, \ldots, A_n$, whose union is the entire sample space, Bayes’ theorem can be used to compute any posterior probability $P(A_i \mid B)$ as shown in equation (4.12).

Bayes’ Theorem

$P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \cdots + P(A_n)P(B \mid A_n)}$ (4.12)

Notes + Comments

By applying basic algebra we can derive the multiplication law from the definition of conditional probability. For two events A and B, the probability of A given B is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. If we multiply both sides of this expression by $P(B)$, the $P(B)$ in the numerator and denominator on the right side of the expression will cancel and we are left with $P(A \mid B)P(B) = P(A \cap B)$, which is the multiplication law.
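Returning to equation (4.12), the general form of Bayes’ theorem extends naturally to a list of n events. The sketch below is one possible implementation; the function name, the input checks, and the three-supplier numbers are hypothetical and are not taken from the text.

```python
import math

def posteriors(priors, likelihoods):
    """Equation (4.12): return P(A_i | B) for each of n events.

    priors[i] is P(A_i); likelihoods[i] is P(B | A_i). The events are assumed
    to be mutually exclusive and collectively exhaustive.
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if not math.isclose(sum(priors), 1.0):
        raise ValueError("prior probabilities must sum to 1")
    joints = [p * l for p, l in zip(priors, likelihoods)]  # P(A_i and B)
    p_b = sum(joints)                                      # P(B)
    if p_b == 0:
        raise ValueError("P(B) is 0, so the posteriors are undefined")
    return [j / p_b for j in joints]

# Two-supplier example from the text.
two = posteriors([0.65, 0.35], [0.02, 0.05])

# Hypothetical three-supplier example (numbers invented for illustration).
three = posteriors([0.5, 0.3, 0.2], [0.02, 0.05, 0.10])

print([round(p, 4) for p in two])
print([round(p, 4) for p in three])
```

The check that the priors sum to 1 reflects the requirement that the events' union be the entire sample space.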

4.4 Random Variables

Chapter 2 introduces the concept of random variables and the use of data to describe them.

In probability terms, a random variable is a numerical description of the outcome of a random experiment. Because the outcome of a random experiment is not known with certainty, a random variable can be thought of as a quantity whose value is not known with certainty. A random variable can be classified as being either discrete or continuous depending on the numerical values it can assume.

Discrete Random Variables

A random variable that can take on only specified discrete values is referred to as a discrete random variable. Table 4.7 provides examples of discrete random variables.

Returning to our example of Lancaster Savings and Loan, we can define a random variable x to indicate whether or not a customer defaults on his or her mortgage. As previously stated, the values of a random variable must be numerical, so we can define random variable x such that $x = 1$ if the customer defaults on his or her mortgage and $x = 0$ if the customer does not default on his or her mortgage. An additional random variable, y, could indicate whether the customer is married or single. For instance, we can define random variable y such that $y = 1$ if the customer is married and $y = 0$ if the customer is single. Yet another random variable, z, could be defined as the number of mortgage payments per year made by the customer. For instance, a customer who makes monthly payments would make $z = 12$ payments per year, and a customer who makes payments quarterly would make $z = 4$ payments per year. Table 4.8 repeats the joint probability table for the Lancaster Savings and Loan data, but this time with the values labeled as random variables.

Table 4.7 Examples of Discrete Random Variables

| Random Experiment | Random Variable (x) | Possible Values for the Random Variable |
|---|---|---|
| Flip a coin | Face of coin showing | 1 if heads; 0 if tails |
| Roll a die | Number of dots showing on top of die | 1, 2, 3, 4, 5, 6 |
| Contact five customers | Number of customers who place an order | 0, 1, 2, 3, 4, 5 |
| Operate a health care clinic for one day | Number of patients who arrive | 0, 1, 2, 3, … |
| Offer a customer the choice of two products | Product chosen by customer | 0 if none; 1 if choose product A; 2 if choose product B |

Continuous Random Variables

A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. Technically, relatively few random variables are truly continuous; these include values related to time, weight, distance, and temperature. An example of a continuous random variable is x = the time between consecutive incoming calls to a call center. This random variable can take on any value $x > 0$ such as $x = 1.26$ minutes, $x = 2.571$ minutes, or $x = 4.3333$ minutes. Table 4.9 provides examples of continuous random variables.

As illustrated by the final example in Table 4.9, many discrete random variables have a large number of potential outcomes and so can be effectively modeled as continuous random variables. Consider our Lancaster Savings and Loan example. We can define a random variable x = total amount paid by customer over the lifetime of the mortgage. Because we typically measure financial values only to two decimal places, one could consider this a discrete random variable. However, because in any practical interval there are many possible values for this random variable, it is usually appropriate to model the amount as a continuous random variable.

Table 4.8 Joint Probability Table for Customer Mortgage Prepayments

| | No Default ($x = 0$) | Default ($x = 1$) | $f(y)$ |
|---|---|---|---|
| Married ($y = 1$) | 0.2133 | 0.2633 | 0.4766 |
| Single ($y = 0$) | 0.3867 | 0.1367 | 0.5234 |
| $f(x)$ | 0.6000 | 0.4000 | 1.0000 |


Table 4.9 Examples of Continuous Random Variables

| Random Experiment | Random Variable (x) | Possible Values for the Random Variable |
|---|---|---|
| Customer visits a web page | Time customer spends on web page in minutes | $x \geq 0$ |
| Fill a soft drink can (max capacity = 12.1 ounces) | Number of ounces | $0 \leq x \leq 12.1$ |
| Test a new chemical process (min temperature = 150°F; max temperature = 212°F) | Temperature when the desired reaction takes place | $150 \leq x \leq 212$ |
| Invest $10,000 in the stock market | Value of investment after one year | $x \geq 0$ |

Notes + Comments

1. In this section we again use the relative frequency method to assign probabilities for the Lancaster Savings and Loan example. Technically, the concept of random variables applies only to populations; probabilities that are found using sample data are only estimates of the true probabilities. However, larger samples generate more reliable estimated probabilities, so if we have a large enough data set (as we are assuming here for the Lancaster Savings and Loan data), then we can treat the data as if they are from a population and the relative frequency method is appropriate to assign probabilities to the outcomes.
2. Random variables can be used to represent uncertain future values. Chapter 11 explains how random variables can be used in simulation models to evaluate business decisions in the presence of uncertainty.

4.5 Discrete Probability Distributions

The probability distribution for a random variable describes the range and relative likelihood of possible values for a random variable. For a discrete random variable x, the probability distribution is defined by a probability mass function, denoted by $f(x)$. The probability mass function provides the probability for each value of the random variable.

Returning to our example of mortgage defaults, consider the data shown in Table 4.3 for Lancaster Savings and Loan and the associated joint probability table in Table 4.8. From Table 4.8, we see that $f(0) = 0.6$ and $f(1) = 0.4$. Note that these values satisfy the required conditions of a discrete probability distribution that (1) $f(x) \geq 0$ and (2) $\sum f(x) = 1$. We can also present probability distributions graphically. In Figure 4.9, the values of the random variable x are shown on the horizontal axis and the probability associated with these values is shown on the vertical axis.

Custom Discrete Probability Distribution

A probability distribution that is generated from observations such as that shown in Figure 4.9 is called an empirical probability distribution. This particular empirical probability distribution is considered a custom discrete distribution because it is discrete and the possible values of the random variable have different values.

A custom discrete probability distribution is very useful for describing different possible scenarios that have different probabilities of occurring. The probabilities associated with each scenario can be generated using either the subjective method or the relative frequency method. Using a subjective method, probabilities are based on experience or intuition when little relevant data are available. If sufficient data exist, the relative frequency method can be used to determine probabilities.

FIGURE 4.9

Graphical Representation of the Probability Distribution for Whether a Customer Defaults on a Mortgage

[Bar chart: vertical axis is Probability, $f(x)$; horizontal axis is the Mortgage Default Random Variable, x, with values 0 and 1.]

Consider the random variable describing the number of payments made per year by a randomly chosen customer. Table 4.10 presents a summary of the number of payments made per year by the 300 home mortgage customers. This table shows us that 45 customers made quarterly payments ($x = 4$), 180 customers made monthly payments ($x = 12$), and 75 customers made two payments each month ($x = 24$).

Table 4.10 Summary Table of Number of Payments Made per Year

| Number of Payments Made per Year | $x = 4$ | $x = 12$ | $x = 24$ | Total |
|---|---|---|---|---|
| Number of observations | 45 | 180 | 75 | 300 |
| $f(x)$ | 0.15 | 0.60 | 0.25 | |

We can then calculate $f(4) = 45/300 = 0.15$, $f(12) = 180/300 = 0.60$, and $f(24) = 75/300 = 0.25$. In other words, the probability that a randomly selected customer makes 4 payments per year is 0.15, the probability that a randomly selected customer makes 12 payments per year is 0.60, and the probability that a randomly selected customer makes 24 payments per year is 0.25. We can write this probability distribution as a function in the following manner:

$$f(x) = \begin{cases} 0.15 & \text{if } x = 4 \\ 0.60 & \text{if } x = 12 \\ 0.25 & \text{if } x = 24 \\ 0 & \text{otherwise} \end{cases}$$

This probability mass function tells us in a convenient way that $f(x) = 0.15$ when $x = 4$ (the probability that the random variable $x = 4$ is 0.15); $f(x) = 0.60$ when $x = 12$ (the probability that the random variable $x = 12$ is 0.60); $f(x) = 0.25$ when $x = 24$ (the probability that the random variable $x = 24$ is 0.25); and $f(x) = 0$ when x is any other value (there is zero probability that the random variable x is some value other than 4, 12, or 24).

Note that we can also create Table 4.10 in Excel using a PivotTable as shown in Figure 4.10.
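A probability mass function of this kind maps naturally onto a dictionary lookup. The sketch below builds the function from the observation counts for the number of payments per year (45, 180, and 75 out of 300) and checks the two conditions every discrete probability distribution must satisfy. The names `PMF` and `f` are assumptions for illustration.

```python
import math

# Number of customers observed at each value of x (payments per year).
observations = {4: 45, 12: 180, 24: 75}
n = sum(observations.values())  # 300 customers

# Relative frequency method: f(x) = count / total.
PMF = {x: count / n for x, count in observations.items()}

def f(x):
    """Probability mass function: f(x) for listed values, 0 otherwise."""
    return PMF.get(x, 0.0)

# Required conditions: (1) f(x) >= 0 and (2) the probabilities sum to 1.
all_nonnegative = all(p >= 0 for p in PMF.values())
sums_to_one = math.isclose(sum(PMF.values()), 1.0)

print(f(4), f(12), f(24), f(6))
print(all_nonnegative, sums_to_one)
```

Returning 0 for any value not in the dictionary corresponds to the "otherwise" branch of the function.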


FIGURE 4.10

Excel PivotTable for Number of Payments Made per Year

Expected Value and Variance

Chapter 2 discusses the computation of the mean of a random variable based on data.

The expected value, or mean, of a random variable is a measure of the central location for the random variable. It is the weighted average of the values of the random variable, where the weights are the probabilities. The formula for the expected value of a discrete random variable x follows:

Expected Value of a Discrete Random Variable

$E(x) = \mu = \sum x f(x)$ (4.13)

Both the notations $E(x)$ and $\mu$ are used to denote the expected value of a random variable. Equation (4.13) shows that to compute the expected value of a discrete random variable, we must multiply each value of the random variable by the corresponding probability $f(x)$ and then add the resulting products. Table 4.11 calculates the expected value of the number of payments made by a mortgage customer in a year. The sum of the entries in the $xf(x)$ column shows that the expected value is 13.8 payments per year. Therefore, if Lancaster Savings and Loan signs up a new mortgage customer, the expected number of payments per year made by this new customer is 13.8. Obviously, no customer will make exactly 13.8 payments per year, but this value represents our expectation for the number of payments per year made by a new customer absent any other information about the new customer. Some customers will make fewer payments (4 or 12 per year), some customers will make more payments (24 per year), but 13.8 represents the expected number of payments per year based on the probabilities calculated in Table 4.10.

TABLE 4.11 Calculation of the Expected Value for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

| x | $f(x)$ | $xf(x)$ |
|---|---|---|
| 4 | 0.15 | $(4)(0.15) = 0.6$ |
| 12 | 0.60 | $(12)(0.60) = 7.2$ |
| 24 | 0.25 | $(24)(0.25) = 6.0$ |
| | | $E(x) = \mu = \sum x f(x) = 13.8$ |

The SUMPRODUCT function in Excel can easily be used to calculate the expected value for a discrete random variable. This is illustrated in Figure 4.11. We can also calculate the expected value of the random variable directly from the Lancaster Savings and Loan data using the Excel function AVERAGE, as shown in Figure 4.12. Column F contains the data on the number of payments made per year by each mortgage customer in the data set. Using the Excel formula =AVERAGE(F2:F301) gives us a value of 13.8 for the expected value, which is the same as the value we calculated in Table 4.11.

Note that we cannot simply use the AVERAGE function on the x values for a custom discrete random variable. If we did, this would give us a calculated value of $(4 + 12 + 24)/3 = 13.333$, which is not the correct expected value in this scenario. This is because using the AVERAGE function in this way assumes that each value of the random variable x is equally likely. But in this case, we know that $x = 12$ is much more likely than $x = 4$ or $x = 24$. Therefore, we must use equation (4.13) to calculate the expected value of a custom discrete random variable, or we can use the Excel function AVERAGE on the entire data set, as shown in Figure 4.12.
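The three calculations discussed here (the probability-weighted sum, the average of the full data set, and the misleading average of the distinct x values) can be compared side by side. In this sketch the Payments per Year column is rebuilt from the observation counts (45, 180, and 75) rather than read from the MortgageDefaultData file, which is an assumption made so the example is self-contained.

```python
import math
from statistics import mean

pmf = {4: 0.15, 12: 0.60, 24: 0.25}

# Equation (4.13): weight each value by its probability (like SUMPRODUCT).
expected_value = sum(x * p for x, p in pmf.items())

# Average over all 300 observations (like AVERAGE on the full column).
payments = [4] * 45 + [12] * 180 + [24] * 75
data_mean = mean(payments)

# Average of the three distinct x values: treats them as equally likely.
naive_mean = mean(list(pmf))

print(round(expected_value, 4), round(data_mean, 4), round(naive_mean, 3))
```

The first two results agree because the relative frequencies in the data are exactly the probabilities in the mass function; the third differs because it ignores those frequencies.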

FIGURE 4.11

Using Excel SUMPRODUCT Function to Calculate the Expected Value for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

FIGURE 4.12

Excel Calculation of the Expected Value for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

Chapter 2 discusses the computation of the variance of a random variable based on data.

Variance is a measure of variability in the values of a random variable. It is a weighted average of the squared deviations of a random variable from its mean where the weights are the probabilities. Below we define the formula for calculating the variance of a discrete random variable. Variance of a Discrete Random Variable

Chapter 2 discusses the computation of the standard deviation of a random variable based on data.

Var( x ) 5 s 2 5 S( x 2 m )2 f ( x )

(4.14)

As equation (4.14) shows, an essential part of the variance formula is the deviation, x 2 m, which measures how far a particular value of the random variable is from the expected value, or mean, m . In computing the variance of a random variable, the deviations are squared and then weighted by the corresponding value of the probability mass function. The sum of these weighted squared deviations for all values of the random variable is referred to as the variance. The notations Var(x) and s 2 are both used to denote the variance of a random variable. The calculation of the variance of the number of payments made per year by a mortgage customer is summarized in Table 4.12. We see that the variance is 42.360. The standard deviation, s , is defined as the positive square root of the variance. Thus, the standard deviation for the number of payments made per year by a mortgage customer is 42.360 5 6.508. The Excel function SUMPRODUCT can be used to easily calculate equation (4.14) for a custom discrete random variable. We illustrate the use of the SUMPRODUCT function to calculate variance in Figure 4.13. We can also use Excel to find the variance directly from the data when the values in the data occur with relative frequencies that correspond to the probability distribution of the random variable. Cell F305 in Figure 4.12 shows that we use the Excel formula 5VAR.P(F2:F301)


Chapter 4 Probability: An Introduction to Modeling Uncertainty

Table 4.12

Calculation of the Variance for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

x     x − μ                   f(x)    (x − μ)² f(x)
4     4 − 13.8 = −9.8         0.15    (−9.8)² × 0.15 = 14.406
12    12 − 13.8 = −1.8        0.60    (−1.8)² × 0.60 = 1.944
24    24 − 13.8 = 10.2        0.25    (10.2)² × 0.25 = 26.010
                                      σ² = Σ(x − μ)² f(x) = 42.360

FIGURE 4.13

Excel Calculation of the Variance for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

Note that here we are using the Excel functions VAR.P and STDEV.P rather than VAR.S and STDEV.S. This is because we are assuming that the sample of 300 Lancaster Savings and Loan mortgage customers is a perfect representation of the population.

to calculate the variance from the complete data. This formula gives us a value of 42.360, which is the same as that calculated in Table 4.12 and Figure 4.13. Similarly, we can use the formula =STDEV.P(F2:F301) to calculate the standard deviation of 6.508.

As with the AVERAGE function and expected value, we cannot use the Excel functions VAR.P and STDEV.P directly on the x values to calculate the variance and standard deviation of a custom discrete random variable if the x values are not equally likely to occur. Instead we must either use the formula from equation (4.14) or use the Excel functions on the entire data set as shown in Figure 4.12.
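The probability-weighted calculation in equation (4.14) can also be sketched outside Excel. The short Python example below is only an illustration, not part of the Lancaster worksheet; it assumes the three payment counts are 4, 12, and 24 per year with probabilities 0.15, 0.60, and 0.25 (the values consistent with the mean of 13.8), and the variable names are our own.

```python
import math

# Assumed values of the random variable and their probabilities
# (chosen to be consistent with the stated mean of 13.8).
x_values = [4, 12, 24]
probabilities = [0.15, 0.60, 0.25]

# Expected value: probability-weighted average of the values.
mean = sum(x * p for x, p in zip(x_values, probabilities))

# Variance, equation (4.14): probability-weighted average of squared deviations.
variance = sum((x - mean) ** 2 * p for x, p in zip(x_values, probabilities))

# Standard deviation: positive square root of the variance.
std_dev = math.sqrt(variance)

print(round(mean, 3), round(variance, 3), round(std_dev, 3))  # 13.8 42.36 6.508
```

The two `sum(...)` expressions play the same role as SUMPRODUCT in the worksheet: each multiplies a column of values by the column of probabilities and adds the products.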

Discrete Uniform Probability Distribution

When the possible values of the probability mass function, f(x), are all equal, then the probability distribution is a discrete uniform probability distribution. For instance, the values that result from rolling a single fair die are an example of a discrete uniform distribution


4.5 Discrete Probability Distributions

because the possible outcomes $x = 1$, $x = 2$, $x = 3$, $x = 4$, $x = 5$, and $x = 6$ all have the same values $f(1) = f(2) = f(3) = f(4) = f(5) = f(6) = 1/6$. The general form of the probability mass function for a discrete uniform probability distribution is as follows:

Discrete Uniform Probability Mass Function

$$f(x) = 1/n \qquad (4.15)$$

where $n$ = the number of unique values that may be assumed by the random variable.
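As a small illustration of equation (4.15), the sketch below builds the probability mass function for the fair-die example and checks that the probabilities sum to 1. It is a minimal example with variable names of our own choosing; exact fractions are used so that 1/6 is not rounded.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # faces of a single fair die
n = len(outcomes)              # number of unique values

# Equation (4.15): every outcome has the same probability 1/n.
pmf = {x: Fraction(1, n) for x in outcomes}

total = sum(pmf.values())                             # should equal 1
expected_value = sum(x * p for x, p in pmf.items())   # weighted average

print(pmf[3], total, expected_value)  # 1/6 1 7/2
```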

Binomial Probability Distribution

Whether or not a customer clicks on the link is an example of what is known as a Bernoulli trial: a trial in which (1) there are two possible outcomes, success or failure, and (2) the probability of success is the same every time the trial is executed. The probability distribution related to the number of successes in a set of n independent Bernoulli trials can be described by a binomial probability distribution.

As an example of the use of the binomial probability distribution, consider an online specialty clothing company called Martin’s. Martin’s commonly sends out targeted e-mails to its best customers notifying them about special discounts that are available only to the recipients of the e-mail. The e-mail contains a link that takes the customer directly to a web page for the discounted item. The exact number of customers who will click on the link is obviously unknown, but from previous data, Martin’s estimates that the probability that a customer clicks on the link in the e-mail is 0.30. Martin’s is interested in knowing more about the probabilities associated with one, two, three, or more customers clicking on the link in the targeted e-mail.

The probability distribution related to the number of customers who click on the targeted e-mail link can be described using a binomial probability distribution. A binomial probability distribution is a discrete probability distribution that can be used to describe many situations in which a fixed number (n) of repeated identical and independent trials has two, and only two, possible outcomes. In general terms, we refer to these two possible outcomes as either a success or a failure. A success occurs with probability $p$ in each trial and a failure occurs with probability $1 - p$ in each trial. In the Martin’s example, the “trial” refers to a customer receiving the targeted e-mail. We will define a success as a customer clicking on the e-mail link ($p = 0.30$) and a failure as a customer not clicking on the link ($1 - p = 0.70$). The binomial probability distribution can then be used to calculate the probability of a given number of successes (customers who click on the e-mail link) out of a given number of independent trials (number of e-mails sent to customers).

Other examples that can often be described by a binomial probability distribution include counting the number of heads resulting from flipping a coin 20 times, the number of customers who click on a particular advertisement link on a web site in a day, the number of days on which a particular financial stock increases in value over a month, and the number of nondefective parts produced in a batch. Equation (4.16) provides the probability mass function for a binomial random variable that calculates the probability of x successes in n independent trials.

Binomial Probability Mass Function

$$f(x) = \binom{n}{x} p^x (1 - p)^{(n - x)} \qquad (4.16)$$

where

$x$ = the number of successes
$p$ = the probability of a success on one trial
$n$ = the number of trials
$f(x)$ = the probability of $x$ successes in $n$ trials

and

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$

Here $n!$ is read as “n factorial,” and $n! = n \times (n - 1) \times (n - 2) \times \cdots \times 2 \times 1$. For example, $4! = 4 \times 3 \times 2 \times 1 = 24$. The Excel formula =FACT(n) can be used to calculate n factorial.


In the Martin’s example, use equation (4.16) to compute the probability that out of three customers who receive the e-mail (1) no customer clicks on the link; (2) exactly one customer clicks on the link; (3) exactly two customers click on the link; and (4) all three customers click on the link. The calculations are summarized in Table 4.13, which gives the probability distribution of the number of customers who click on the targeted e-mail link. Figure 4.14 is a graph of this probability distribution. Table 4.13 and Figure 4.14 show that the highest probability is associated with exactly one customer clicking on the Martin’s targeted e-mail link and the lowest probability is associated with all three customers clicking on the link.

Because the outcomes in the Martin’s example are mutually exclusive, we can easily use these results to answer interesting questions about various events. For example, using the information in Table 4.13, the probability that no more than one customer clicks on the link is $P(x \le 1) = P(x = 0) + P(x = 1) = 0.343 + 0.441 = 0.784$.

Probability Distribution for the Number of Customers Who Click on the Link in the Martin’s Targeted E-Mail

Table 4.13

x        f(x)
0        $\frac{3!}{0!\,3!}(0.30)^0(0.70)^3 = 0.343$
1        $\frac{3!}{1!\,2!}(0.30)^1(0.70)^2 = 0.441$
2        $\frac{3!}{2!\,1!}(0.30)^2(0.70)^1 = 0.189$
3        $\frac{3!}{3!\,0!}(0.30)^3(0.70)^0 = 0.027$
Total    1.000

FIGURE 4.14

Graphical Representation of the Probability Distribution for the Number of Customers Who Click on the Link in the Martin’s Targeted E-Mail

(Bar chart. Vertical axis: Probability, f(x), from .00 to .50. Horizontal axis: Number of Customers Who Click on Link, x.)


If we consider a scenario in which 10 customers receive the targeted e-mail, the binomial probability mass function given by equation (4.16) is still applicable. If we want to find the probability that exactly 4 of the 10 customers click on the link and $p = 0.30$, then we calculate:

$$f(4) = \frac{10!}{4!\,6!}(0.30)^4(0.70)^6 = 0.2001$$
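Equation (4.16) translates almost directly into code. The following Python sketch is one possible way to evaluate it; the function name `binomial_pmf` is our own, and `math.comb` (available in Python 3.8 and later) supplies the number of combinations.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials, equation (4.16)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Three customers receive the e-mail, each clicking with probability 0.30.
three_customers = [round(binomial_pmf(x, 3, 0.30), 3) for x in range(4)]
print(three_customers)  # [0.343, 0.441, 0.189, 0.027]

# Ten customers receive the e-mail; probability that exactly four click.
print(round(binomial_pmf(4, 10, 0.30), 4))  # 0.2001
```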

In Excel we can use the BINOM.DIST function to compute binomial probabilities. Figure 4.15 reproduces the Excel calculations from Table 4.13 for the Martin’s problem with three customers. The BINOM.DIST function in Excel has four input values: the first is the value of x, the second is the value of n, the third is the value of p, and the fourth is FALSE or TRUE. We choose FALSE for the fourth input if a probability mass function value f(x) is desired, and TRUE if a cumulative probability is desired. The formula =BINOM.DIST(A5,$D$1,$D$2,FALSE) has been entered into cell B5 to compute the probability of 0 successes in three trials, f(0). Figure 4.15 shows that this value is 0.343, the same as in Table 4.13.

Cells C5:C8 show the cumulative probability distribution values for this example. Note that these values are computed in Excel by entering TRUE as the fourth input in the BINOM.DIST function. The cumulative probability for x using a binomial distribution is the probability of x or fewer successes out of n trials. Cell C5 computes the cumulative probability for $x = 0$, which is the same as the probability for $x = 0$ because the probability of 0 successes is the same as the probability of 0 or fewer successes. Cell C7 computes the cumulative probability for $x = 2$ using the formula =BINOM.DIST(A7,$D$1,$D$2,TRUE). This value is 0.973, meaning that the probability that two or fewer customers click on the targeted e-mail link is 0.973. Note that the value 0.973 simply corresponds to $f(0) + f(1) + f(2) = 0.343 + 0.441 + 0.189 = 0.973$ because it is the probability of two or fewer customers clicking on the link, which could be zero customers, one customer, or two customers.

FIGURE 4.15

Excel Worksheet for Computing Binomial Probabilities of the Number of Customers Who Make a Purchase at Martin’s
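For readers working outside Excel, the behavior of the fourth input can be imitated in a few lines of Python. The helper `binom_dist` below is a hypothetical function written for this illustration; it simply mirrors the argument order described above and is not an Excel or library API.

```python
from math import comb


def binom_dist(x: int, n: int, p: float, cumulative: bool) -> float:
    """Mimic the four inputs described above: x, n, p, and FALSE/TRUE."""

    def pmf(k: int) -> float:
        # Equation (4.16) for exactly k successes.
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    if cumulative:
        # Probability of x or fewer successes: f(0) + f(1) + ... + f(x).
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)


print(round(binom_dist(0, 3, 0.30, False), 3))  # 0.343
print(round(binom_dist(2, 3, 0.30, True), 3))   # 0.973
```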


Poisson Probability Distribution

In this section, we consider a discrete random variable that is often useful in estimating the number of occurrences of an event over a specified interval of time or space. For example, the random variable of interest might be the number of patients who arrive at a health care clinic in 1 hour, the number of computer-server failures in a month, the number of repairs needed in 10 miles of highway, or the number of leaks in 100 miles of pipeline. If the following two properties are satisfied, the number of occurrences is a random variable that is described by the Poisson probability distribution: (1) the probability of an occurrence is the same for any two intervals (of time or space) of equal length; and (2) the occurrence or nonoccurrence in any interval (of time or space) is independent of the occurrence or nonoccurrence in any other interval. The Poisson probability mass function is defined by equation (4.17).

Poisson Probability Mass Function

$$f(x) = \frac{\mu^x e^{-\mu}}{x!} \qquad (4.17)$$

where

$f(x)$ = the probability of $x$ occurrences in an interval
$\mu$ = expected value or mean number of occurrences in an interval
$e \approx 2.71828$

The number e is a mathematical constant that is the base of the natural logarithm. Although it is an irrational number, 2.71828 is a sufficient approximation for our purposes.

For the Poisson probability distribution, x is a discrete random variable that indicates the number of occurrences in the interval. Since there is no stated upper limit for the number of occurrences, the probability mass function f(x) is applicable for values $x = 0, 1, 2, \ldots$ without limit. In practical applications, x will eventually become large enough so that f(x) is approximately zero and the probability of any larger values of x becomes negligible.

Suppose that we are interested in the number of patients who arrive at the emergency room of a large hospital during a 15-minute period on weekday mornings. Obviously, we do not know exactly how many patients will arrive at the emergency room in any defined interval of time, so the value of this variable is uncertain. It is important for administrators at the hospital to understand the probabilities associated with the number of arriving patients, as this information will have an impact on staffing decisions such as how many nurses and doctors to hire. It will also provide insight into possible wait times for patients to be seen once they arrive at the emergency room.

If we can assume that the probability of a patient arriving is the same for any two periods of equal length during this 15-minute period and that the arrival or nonarrival of a patient in any period is independent of the arrival or nonarrival in any other period during the 15-minute period, the Poisson probability mass function is applicable. Suppose these assumptions are satisfied and an analysis of historical data shows that the average number of patients arriving during a 15-minute period of time is 10; in this case, the following probability mass function applies:

$$f(x) = \frac{10^x e^{-10}}{x!}$$

The random variable here is $x$ = number of patients arriving at the emergency room during any 15-minute period. If the hospital’s management team wants to know the probability of exactly five arrivals during 15 minutes, we would set $x = 5$ and obtain:

Probability of exactly 5 arrivals in 15 minutes $= f(5) = \frac{10^5 e^{-10}}{5!} = 0.0378$

In the preceding example, the mean of the Poisson distribution is $\mu = 10$ arrivals per 15-minute period. A property of the Poisson distribution is that the mean of the distribution


and the variance of the distribution are always equal. Thus, the variance for the number of arrivals during all 15-minute periods is $\sigma^2 = 10$, and so the standard deviation is $\sigma = \sqrt{10} = 3.16$.

Our illustration involves a 15-minute period, but other amounts of time can be used. Suppose we want to compute the probability of one arrival during a 3-minute period. Because 10 is the expected number of arrivals during a 15-minute period, we see that $10/15 = 2/3$ is the expected number of arrivals during a 1-minute period and that $(2/3)(3\text{ minutes}) = 2$ is the expected number of arrivals during a 3-minute period. Thus, the probability of x arrivals during a 3-minute period with $\mu = 2$ is given by the following Poisson probability mass function:

$$f(x) = \frac{2^x e^{-2}}{x!}$$

The probability of one arrival during a 3-minute period is calculated as follows:

Probability of exactly 1 arrival in 3 minutes $= f(1) = \frac{2^1 e^{-2}}{1!} = 0.2707$

One might expect that because (5 arrivals)/5 = 1 arrival and (15 minutes)/5 = 3 minutes, we would get the same probability for one arrival during a 3-minute period as we do for five arrivals during a 15-minute period. Earlier we computed the probability of five arrivals during a 15-minute period as 0.0378. However, note that the probability of one arrival during a 3-minute period is 0.2707, which is not the same. When computing a Poisson probability for a different time interval, we must first convert the mean arrival rate to the period of interest and then compute the probability.

In Excel we can use the POISSON.DIST function to compute Poisson probabilities. Figure 4.16 shows how to calculate the probabilities of patient arrivals at the emergency room if patients arrive at a mean rate of 10 per 15-minute interval.

FIGURE 4.16

Excel Worksheet for Computing Poisson Probabilities of the Number of Patients Arriving at the Emergency Room

Cell C1: Mean Number of Occurrences:        Cell D1: 10

Column B contains =POISSON.DIST(A4,$D$1,FALSE) in row 4 through =POISSON.DIST(A24,$D$1,FALSE) in row 24. Column C contains =POISSON.DIST(A4,$D$1,TRUE) in row 4 through =POISSON.DIST(A24,$D$1,TRUE) in row 24.

Row    Number of Arrivals (A)    Probability, f(x) (B)    Cumulative Probability (C)
4      0                         0.0000                   0.0000
5      1                         0.0005                   0.0005
6      2                         0.0023                   0.0028
7      3                         0.0076                   0.0103
8      4                         0.0189                   0.0293
9      5                         0.0378                   0.0671
10     6                         0.0631                   0.1301
11     7                         0.0901                   0.2202
12     8                         0.1126                   0.3328
13     9                         0.1251                   0.4579
14     10                        0.1251                   0.5830
15     11                        0.1137                   0.6968
16     12                        0.0948                   0.7916
17     13                        0.0729                   0.8645
18     14                        0.0521                   0.9165
19     15                        0.0347                   0.9513
20     16                        0.0217                   0.9730
21     17                        0.0128                   0.9857
22     18                        0.0071                   0.9928
23     19                        0.0037                   0.9965
24     20                        0.0019                   0.9984

(Chart: Poisson Probabilities. Vertical axis: Probability, f(x), from 0.0000 to 0.1400. Horizontal axis: Number of Arrivals, from 0 to 20.)
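The emergency-room calculations, including the conversion of the mean to a different interval length, can be sketched in Python as follows. This is a minimal illustration of equation (4.17); the function name `poisson_pmf` and the variable names are our own.

```python
from math import exp, factorial


def poisson_pmf(x: int, mean: float) -> float:
    """Probability of exactly x occurrences in an interval, equation (4.17)."""
    return mean**x * exp(-mean) / factorial(x)


mean_per_15_min = 10  # average arrivals in a 15-minute period

# Probability of exactly five arrivals in 15 minutes.
print(round(poisson_pmf(5, mean_per_15_min), 4))  # 0.0378

# Convert the mean to a 3-minute period before computing the probability.
mean_per_3_min = mean_per_15_min * 3 / 15  # 2.0 arrivals per 3 minutes
print(round(poisson_pmf(1, mean_per_3_min), 4))  # 0.2707
```

Rescaling the mean first, rather than rescaling x, is the step that makes the 3-minute probability differ from the 15-minute one.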


The POISSON.DIST function in Excel has three input values: the first is the value of x, the second is the mean of the Poisson distribution, and the third is FALSE or TRUE. We choose FALSE for the third input if a probability mass function value f(x) is desired, and TRUE if a cumulative probability is desired. The formula =POISSON.DIST(A4,$D$1,FALSE) has been entered into cell B4 to compute the probability of 0 occurrences, f(0). Figure 4.16 shows that this value (to four decimal places) is 0.0000, which means that it is highly unlikely (probability near 0) that we will have 0 patient arrivals during a 15-minute interval. The value in cell B12 shows that the probability that there will be exactly eight arrivals during a 15-minute interval is 0.1126.

The cumulative probability for x using a Poisson distribution is the probability of x or fewer occurrences during the interval. Cell C4 computes the cumulative probability for $x = 0$, which is the same as the probability for $x = 0$ because the probability of 0 occurrences is the same as the probability of 0 or fewer occurrences. Cell C12 computes the cumulative probability for $x = 8$ using the formula =POISSON.DIST(A12,$D$1,TRUE). This value is 0.3328, meaning that the probability that eight or fewer patients arrive during a 15-minute interval is 0.3328. This value corresponds to

$$f(0) + f(1) + f(2) + \cdots + f(7) + f(8) = 0.0000 + 0.0005 + 0.0023 + \cdots + 0.0901 + 0.1126 = 0.3328$$

Let us illustrate an application not involving time intervals in which the Poisson distribution is useful. Suppose we want to determine the occurrence of major defects in a highway one month after it has been resurfaced. We assume that the probability of a defect is the same for any two highway intervals of equal length and that the occurrence or nonoccurrence of a defect in any one interval is independent of the occurrence or nonoccurrence of a defect in any other interval. Hence, the Poisson distribution can be applied.

Suppose we learn that major defects one month after resurfacing occur at the average rate of two per mile. Let us find the probability of no major defects in a particular 3-mile section of the highway. Because we are interested in an interval with a length of 3 miles, $\mu = (2\text{ defects/mile})(3\text{ miles}) = 6$ represents the expected number of major defects over the 3-mile section of highway. Using equation (4.17), the probability of no major defects is $f(0) = \frac{6^0 e^{-6}}{0!} = 0.0025$. Thus, it is unlikely that no major defects will occur in the 3-mile section. In fact, this example indicates a $1 - 0.0025 = 0.9975$ probability of at least one major defect in the 3-mile highway section.
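Both the cumulative probability and the highway example can be checked with a short Python sketch. The helper names `poisson_pmf` and `poisson_cdf` are our own and are only meant to illustrate equation (4.17) and the meaning of a cumulative probability.

```python
from math import exp, factorial


def poisson_pmf(x: int, mean: float) -> float:
    """Probability of exactly x occurrences, equation (4.17)."""
    return mean**x * exp(-mean) / factorial(x)


def poisson_cdf(x: int, mean: float) -> float:
    """Probability of x or fewer occurrences: f(0) + f(1) + ... + f(x)."""
    return sum(poisson_pmf(k, mean) for k in range(x + 1))


# Eight or fewer patient arrivals when the mean is 10 per 15 minutes.
print(round(poisson_cdf(8, 10), 4))  # 0.3328

# Highway defects: 2 per mile over a 3-mile section gives a mean of 6.
mean_defects = 2 * 3
p_no_defects = poisson_pmf(0, mean_defects)
print(round(p_no_defects, 4), round(1 - p_no_defects, 4))  # 0.0025 0.9975
```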

Notes + Comments

1. If sample data are used to estimate the probabilities of a custom discrete distribution, equation (4.13) yields the sample mean $\bar{x}$ rather than the population mean $\mu$. However, as the sample size increases, the sample generally becomes more representative of the population and the sample mean $\bar{x}$ converges to the population mean $\mu$. In this chapter we have assumed that the sample of 300 Lancaster Savings and Loan mortgage customers is sufficiently large to be representative of the population of mortgage customers at Lancaster Savings and Loan.

2. We can use the Excel function AVERAGE only to compute the expected value of a custom discrete random variable when the values in the data occur with relative frequencies that correspond to the probability distribution of the random variable. If this assumption is not satisfied, then the estimate of the expected value with the AVERAGE function will be inaccurate. In practice, this assumption is satisfied with an increasing degree of accuracy as the size of the sample is increased. Otherwise, we must use equation (4.13) to calculate the expected value for a custom discrete random variable.

3. If sample data are used to estimate the probabilities for a custom discrete distribution, equation (4.14) yields the sample variance $s^2$ rather than the population variance $\sigma^2$. However, as the sample size increases, the sample generally becomes more representative of the population and the sample variance $s^2$ converges to the population variance $\sigma^2$.


4.6 Continuous Probability Distributions

In the preceding section we discussed discrete random variables and their probability distributions. In this section we consider continuous random variables. Specifically, we discuss some of the more useful continuous probability distributions for analytics models: the uniform, the triangular, the normal, and the exponential.

A fundamental difference separates discrete and continuous random variables in terms of how probabilities are computed. For a discrete random variable, the probability mass function f(x) provides the probability that the random variable assumes a particular value. With continuous random variables, the counterpart of the probability mass function is the probability density function, also denoted by f(x). The difference is that the probability density function does not directly provide probabilities. However, the area under the graph of f(x) corresponding to a given interval does provide the probability that the continuous random variable x assumes a value in that interval. So when we compute probabilities for continuous random variables, we are computing the probability that the random variable assumes any value in an interval. Because the area under the graph of f(x) at any particular point is zero, one of the implications of the definition of probability for continuous random variables is that the probability of any particular value of the random variable is zero.

Uniform Probability Distribution

Consider the random variable x representing the flight time of an airplane traveling from Chicago to New York. The exact flight time from Chicago to New York is uncertain because it can be affected by weather (headwinds or storms), flight traffic patterns, and other factors that cannot be known with certainty. It is important to characterize the uncertainty associated with the flight time because this can have an impact on connecting flights and how we construct our overall flight schedule. Suppose the flight time can be any value in the interval from 120 minutes to 140 minutes. Because the random variable x can assume any value in that interval, x is a continuous rather than a discrete random variable.

Let us assume that sufficient actual flight data are available to conclude that the probability of a flight time within any interval of a given length is the same as the probability of a flight time within any other interval of the same length that is contained in the larger interval from 120 to 140 minutes. With every interval of a given length being equally likely, the random variable x is said to have a uniform probability distribution. The probability density function, which defines the uniform distribution for the flight-time random variable, is:

$$f(x) = \begin{cases} 1/20 & \text{for } 120 \le x \le 140 \\ 0 & \text{elsewhere} \end{cases}$$

Figure 4.17 shows a graph of this probability density function.

FIGURE 4.17

Uniform Probability Distribution for Flight Time

(Graph: f(x) is constant at 1/20 from x = 120 to x = 140. Horizontal axis: Flight Time in Minutes, x, marked at 120, 125, 130, 135, and 140.)


In general, the uniform probability density function for a random variable x is defined by the following formula:

Uniform Probability Density Function

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases} \qquad (4.18)$$

For the flight-time random variable, $a = 120$ and $b = 140$. For a continuous random variable, we consider probability only in terms of the likelihood that a random variable assumes a value within a specified interval. In the flight time example, an acceptable probability question is: What is the probability that the flight time is between 120 and 130 minutes? That is, what is $P(120 \le x \le 130)$? To answer this question, consider the area under the graph of f(x) in the interval from 120 to 130 (see Figure 4.18). The area is rectangular, and the area of a rectangle is simply the width multiplied by the height. With the width of the interval equal to $130 - 120 = 10$ and the height equal to the value of the probability density function $f(x) = 1/20$, we have area = width × height = $10(1/20) = 10/20 = 0.50$.

The area under the graph of f(x) and probability are identical for all continuous random variables. Once a probability density function f(x) is identified, the probability that x takes a value between some lower value $x_1$ and some higher value $x_2$ can be found by computing the area under the graph of f(x) over the interval from $x_1$ to $x_2$. Given the uniform distribution for flight time and using the interpretation of area as probability, we can answer any number of probability questions about flight times. For example:

● What is the probability of a flight time between 128 and 136 minutes? The width of the interval is $136 - 128 = 8$. With the uniform height of $f(x) = 1/20$, we see that $P(128 \le x \le 136) = 8(1/20) = 0.40$.

● What is the probability of a flight time between 118 and 123 minutes? The width of the interval is $123 - 118 = 5$, but the height is $f(x) = 0$ for $118 \le x < 120$ and $f(x) = 1/20$ for $120 \le x \le 123$, so we have that $P(118 \le x \le 123) = P(118 \le x < 120) + P(120 \le x \le 123) = 2(0) + 3(1/20) = 0.15$.
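The width-times-height reasoning above can be expressed as a small Python function. The sketch below is illustrative only; the function name `uniform_interval_probability` is our own, and it handles intervals that extend beyond the range from a to b by counting only the part where the density is nonzero.

```python
def uniform_interval_probability(lower: float, upper: float, a: float, b: float) -> float:
    """P(lower <= x <= upper) for a random variable that is uniform on [a, b]."""
    # Only the part of the interval that overlaps [a, b] has nonzero height.
    overlap_width = max(0.0, min(upper, b) - max(lower, a))
    height = 1 / (b - a)  # equation (4.18)
    return overlap_width * height


print(uniform_interval_probability(120, 130, 120, 140))  # 0.5
print(uniform_interval_probability(128, 136, 120, 140))  # 0.4
print(uniform_interval_probability(118, 123, 120, 140))  # 0.15
```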

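FIGURE 4.18

The Area Under the Graph Provides the Probability of a Flight Time Between 120 and 130 Minutes

(Graph: shaded rectangle of width 10 and height 1/20 under f(x) from x = 120 to x = 130, labeled $P(120 \le x \le 130)$ = Area = $1/20(10) = 10/20 = 0.50$. Horizontal axis: Flight Time in Minutes, x, marked at 120, 125, 130, 135, and 140.)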


Note that $P(120 \le x \le 140) = 20(1/20) = 1$; that is, the total area under the graph of f(x) is equal to 1. This property holds for all continuous probability distributions and is the analog of the condition that the sum of the probabilities must equal 1 for a discrete probability mass function.

Note also that because we know that the height of the graph of f(x) for a uniform distribution is $\frac{1}{b - a}$ for $a \le x \le b$, then the area under the graph of f(x) for a uniform distribution evaluated from a to a point $x_0$ when $a \le x_0 \le b$ is width × height = $(x_0 - a) \times \frac{1}{b - a}$. This value provides the cumulative probability of obtaining a value for a uniform random variable of less than or equal to some specific value denoted by $x_0$, and the formula is given in equation (4.19).

Uniform Distribution: Cumulative Probabilities

$$P(x \le x_0) = \frac{x_0 - a}{b - a} \quad \text{for } a \le x_0 \le b \qquad (4.19)$$

The calculation of the expected value and variance for a continuous random variable is analogous to that for a discrete random variable. However, because the computational procedure involves integral calculus, we do not show the calculations here. For the uniform continuous probability distribution introduced in this section, the formulas for the expected value and variance are as follows:

$$E(x) = \frac{a + b}{2}$$

$$\mathrm{Var}(x) = \frac{(b - a)^2}{12}$$

In these formulas, a is the minimum value and b is the maximum value that the random variable may assume. Applying these formulas to the uniform distribution for flight times from Chicago to New York, we obtain

$$E(x) = \frac{(120 + 140)}{2} = 130$$

$$\mathrm{Var}(x) = \frac{(140 - 120)^2}{12} = 33.33$$

The standard deviation of flight times can be found by taking the square root of the variance. Thus, for flight times from Chicago to New York, $\sigma = \sqrt{33.33} = 5.77$ minutes.
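The cumulative probability in equation (4.19) and the mean and variance formulas can be collected in a short Python sketch. The function name `uniform_cdf` is our own; returning 0 below a and 1 above b is an assumption added here for completeness, since equation (4.19) is stated only for values between a and b.

```python
import math


def uniform_cdf(x0: float, a: float, b: float) -> float:
    """P(x <= x0) for a uniform random variable on [a, b], equation (4.19)."""
    if x0 < a:
        return 0.0  # assumption: no probability lies below the minimum
    if x0 > b:
        return 1.0  # assumption: all probability lies at or below the maximum
    return (x0 - a) / (b - a)


a, b = 120, 140  # flight-time limits in minutes

mean = (a + b) / 2            # E(x)
variance = (b - a) ** 2 / 12  # Var(x)
std_dev = math.sqrt(variance)

print(uniform_cdf(130, a, b))                        # 0.5
print(mean, round(variance, 2), round(std_dev, 2))   # 130.0 33.33 5.77
```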

Triangular Probability Distribution

The triangular probability distribution is useful when only subjective probability estimates are available. There are many situations for which we do not have sufficient data and only subjective estimates of possible values are available. In the triangular probability distribution, we need only to specify the minimum possible value a, the maximum possible value b, and the most likely value (or mode) of the distribution m. If these values can be knowledgeably estimated for a continuous random variable by a subject-matter expert, then as an approximation of the actual probability density function, we can assume that the triangular distribution applies.

Consider a situation in which a project manager is attempting to estimate the time that will be required to complete an initial assessment of the capital project of constructing a new corporate headquarters. The assessment process includes completing environmental-impact studies, procuring the required permits, and lining up all the contractors and


subcontractors needed to complete the project. There is considerable uncertainty regarding the duration of these tasks, and generally little or no historical data are available to help estimate the probability distribution for the time required for this assessment process. Suppose that we are able to discuss this project with several subject-matter experts who have worked on similar projects. From these expert opinions and our own experience, we estimate that the minimum required time for the initial assessment phase is six months and that the worst-case estimate is that this phase could require 24 months if we are delayed in the permit process or if the results from the environmental-impact studies require additional action. While a time of six months represents a best case and 24 months a worst case, the consensus is that the most likely amount of time required for the initial assessment phase of the project is 12 months. From these estimates, we can use a triangular distribution as an approximation for the probability density function for the time required for the initial assessment phase of constructing a new corporate headquarters. Figure 4.19 shows the probability density function for this triangular distribution. Note that the probability density function is a triangular shape. The general form of the triangular probability density function is as follows:

Triangular Probability Density Function

$$f(x) = \begin{cases} \dfrac{2(x - a)}{(b - a)(m - a)} & \text{for } a \le x \le m \\[2ex] \dfrac{2(b - x)}{(b - a)(b - m)} & \text{for } m < x \le b \end{cases} \qquad (4.20)$$

where

a = minimum value
b = maximum value
m = mode

In the example of the time required to complete the initial assessment phase of constructing a new corporate headquarters, the minimum value a is six months, the maximum value b is 24 months, and the mode m is 12 months. As with the explanation given for the uniform distribution above, we can calculate probabilities by using the area under the graph of f(x). We can calculate the probability that the time required is less than 12 months by finding the area under the graph of f(x) from $x = 6$ to $x = 12$ as shown in Figure 4.19.

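FIGURE 4.19

Triangular Probability Distribution for Time Required for Initial Assessment of Corporate Headquarters Construction

(Graph: triangular f(x) rising from a = 6 to a peak of 1/9 at m = 12 and falling to b = 24, with the area for $P(6 \le x \le 12)$ shaded. Horizontal axis: x.)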


The geometry required to find this area for any given value is slightly more complex than that required to find the area for a uniform distribution, but the resulting formula for a triangular distribution is relatively simple:

Triangular Distribution: Cumulative Probabilities

$$P(x \le x_0) = \begin{cases} \dfrac{(x_0 - a)^2}{(b - a)(m - a)} & \text{for } a \le x_0 \le m \\[2ex] 1 - \dfrac{(b - x_0)^2}{(b - a)(b - m)} & \text{for } m < x_0 \le b \end{cases} \qquad (4.21)$$

Equation (4.21) provides the cumulative probability of obtaining a value for a triangular random variable of less than or equal to some specific value denoted by x₀. To calculate P(x ≤ 12) we use equation (4.21) with a = 6, b = 24, m = 12, and x₀ = 12.

\[
P(x \le 12) = \frac{(12 - 6)^2}{(24 - 6)(12 - 6)} = 0.3333
\]

Thus, the probability that the assessment phase of the project requires 12 months or less is 0.3333. We can also calculate the probability that the project requires more than 10 months, but less than or equal to 18 months, by subtracting P(x ≤ 10) from P(x ≤ 18). This is shown graphically in Figure 4.20. Because 18 lies above the mode and 10 lies below it, P(x ≤ 18) uses the second case of equation (4.21) and P(x ≤ 10) uses the first case. The calculations are as follows:

\[
P(x \le 18) - P(x \le 10) = \left[ 1 - \frac{(24 - 18)^2}{(24 - 6)(24 - 12)} \right] - \frac{(10 - 6)^2}{(24 - 6)(12 - 6)} = 0.8333 - 0.1481 = 0.6852
\]

Thus, the probability that the assessment phase of the project requires more than 10 months but no more than 18 months is 0.6852.
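As a rough check on the arithmetic, equation (4.21) can be evaluated in a few lines of Python instead of by hand. This is only an illustrative sketch and not part of the text's Excel workflow; the function name triangular_cdf and the treatment of values outside the interval from a to b (returning 0 or 1) are assumptions made here.

```python
def triangular_cdf(x0, a, m, b):
    """P(x <= x0) for a triangular distribution, following equation (4.21)."""
    if x0 <= a:
        return 0.0  # assumption: no probability below the minimum
    if x0 <= m:
        # First case of (4.21): x0 between the minimum and the mode
        return (x0 - a) ** 2 / ((b - a) * (m - a))
    if x0 <= b:
        # Second case of (4.21): x0 between the mode and the maximum
        return 1.0 - (b - x0) ** 2 / ((b - a) * (b - m))
    return 1.0  # assumption: all probability lies at or below the maximum


# Initial assessment phase: minimum 6, mode 12, maximum 24 months
a, m, b = 6, 12, 24

p_le_12 = triangular_cdf(12, a, m, b)
p_10_to_18 = triangular_cdf(18, a, m, b) - triangular_cdf(10, a, m, b)

print(f"P(x <= 12)      = {p_le_12:.4f}")     # 0.3333
print(f"P(10 < x <= 18) = {p_10_to_18:.4f}")  # 0.6852
```

Note that each cumulative probability is computed with the case of (4.21) that matches the side of the mode on which x₀ falls, which is why the interval probability mixes both cases.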

FIGURE 4.20

Triangular Distribution to Determine P(10 ≤ x ≤ 18) = P(x ≤ 18) − P(x ≤ 10)

[Figure: triangular density f(x) plotted against x, with a = 6, m = 12, b = 24, and peak height 1/9; the shaded area between x = 10 and x = 18 is P(10 ≤ x ≤ 18).]

Normal Probability Distribution

One of the most useful probability distributions for describing a continuous random variable is the normal probability distribution. The normal distribution has been used in a wide variety of practical applications in which the random variables are heights and weights of people, test scores, scientific measurements, amounts of rainfall, and other similar values. It is also widely used in business applications to describe uncertain quantities such as demand for products, the rate of return for stocks and bonds, and the time it takes to manufacture a part or complete many types of service-oriented activities such as medical surgeries and consulting engagements.

Chapter 4 Probability: An Introduction to Modeling Uncertainty

The form, or shape, of the normal distribution is illustrated by the bell-shaped normal curve in Figure 4.21. The probability density function that defines the bell-shaped curve of the normal distribution follows.

Normal Probability Density Function

\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}
\tag{4.22}
\]

where

μ = mean
σ = standard deviation

Although π and e are irrational numbers, 3.14159 and 2.71828, respectively, are sufficient approximations for our purposes.
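To make equation (4.22) concrete, the short Python sketch below evaluates the density directly and compares it with the standard library's statistics.NormalDist. The values μ = 20 and σ = 5 are chosen here purely for illustration, and the function name normal_pdf is an assumption, not something defined in the text.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x, following equation (4.22)."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 20.0, 5.0  # illustrative values only

peak = normal_pdf(mu, mu, sigma)          # height at the mean
left = normal_pdf(mu - 7.5, mu, sigma)    # 1.5 standard deviations below the mean
right = normal_pdf(mu + 7.5, mu, sigma)   # 1.5 standard deviations above the mean

print(f"f(mean)       = {peak:.4f}")      # 0.0798
print(f"f(mean - 7.5) = {left:.4f}")
print(f"f(mean + 7.5) = {right:.4f}")     # same height as the left side

# The library implementation should agree with the hand-coded formula.
library_peak = NormalDist(mu, sigma).pdf(mu)
```

The equal heights on either side of the mean reflect the symmetry of the curve. As with any density function, these heights are not probabilities; probabilities come from areas under the curve.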

We make several observations about the characteristics of the normal distribution.

1. The entire family of normal distributions is differentiated by two parameters: the mean μ and the standard deviation σ. The mean and standard deviation are often referred to as the location and shape parameters of the normal distribution, respectively.
2. The highest point on the normal curve is at the mean, which is also the median and mode of the distribution.
3. The mean of the distribution can be any numerical value: negative, zero, or positive. Three normal distributions with the same standard deviation but three different means (−10, 0, and 20) are shown in Figure 4.22.
4. The normal distribution is symmetric, with the shape of the normal curve to the left of the mean a mirror image of the shape of the normal curve to the right of the mean.
5. The tails of the normal curve extend to infinity in both directions and theoretically never touch the horizontal axis. Because it is symmetric, the normal distribution is not skewed; its skewness measure is zero.
6. The standard deviation determines how flat and wide the normal curve is. Larger values of the standard deviation result in wider, flatter curves, showing more variability in the data. More variability corresponds to greater uncertainty. Two normal distributions with the same mean but with different standard deviations are shown in Figure 4.23.
7. Probabilities for the normal random variable are given by areas under the normal curve. The total area under the curve for the normal distribution is 1. Because the distribution is symmetric, the area under the curve to the left of the mean is 0.50 and the area under the curve to the right of the mean is 0.50.
8. The percentages of values in some commonly used intervals are as follows:
   a. 68.3% of the values of a normal random variable are within plus or minus one standard deviation of its mean.
   b. 95.4% of the values of a normal random variable are within plus or minus two standard deviations of its mean.
   c. 99.7% of the values of a normal random variable are within plus or minus three standard deviations of its mean.
   Figure 4.24 shows properties (a), (b), and (c) graphically. These percentages are the basis for the empirical rule discussed in Section 2.7.

FIGURE 4.21

Bell-Shaped Curve for the Normal Distribution

[Figure: bell-shaped normal curve plotted against x, centered at the mean μ, with the standard deviation σ marked.]

FIGURE 4.22

Three Normal Distributions with the Same Standard Deviation but Different Means (μ = −10, μ = 0, μ = 20)

[Figure: three identically shaped normal curves plotted against x, centered at −10, 0, and 20.]

We turn now to an application of the normal probability distribution. Suppose Grear Aircraft Engines sells aircraft engines to commercial airlines. Grear is offering a new performance-based sales contract in which Grear will guarantee that its engines will provide a certain amount of lifetime flight hours subject to the airline purchasing a preventive-maintenance service plan that is also provided by Grear. Grear believes that this performance-based contract will lead to additional sales as well as additional income from providing the associated preventive maintenance and servicing.

From extensive flight testing and computer simulations, Grear’s engineering group has estimated that if their engines receive proper parts replacement and preventive maintenance, the lifetime flight hours achieved are normally distributed with a mean μ = 36,500 hours and standard deviation σ = 5,000 hours. Grear would like to know what percentage of its aircraft engines will be expected to last more than 40,000 hours. In other words, what is the probability that the aircraft lifetime flight hours x will exceed 40,000? This question can be answered by finding the area of the darkly shaded region in Figure 4.25.

FIGURE 4.23

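Two Normal Distributions with the Same Mean but Different Standard Deviations (σ = 5, σ = 10)

[Figure: two normal curves plotted against x, both centered at μ; the curve with σ = 5 is taller and narrower than the curve with σ = 10.]

FIGURE 4.24

Areas Under the Curve for Any Normal Distribution

[Figure: normal curve plotted against x; 68.3% of the area lies between μ − 1σ and μ + 1σ, 95.4% between μ − 2σ and μ + 2σ, and 99.7% between μ − 3σ and μ + 3σ.]

FIGURE 4.25

Grear Aircraft Engines Lifetime Flight Hours Distribution

[Figure: normal curve plotted against x with μ = 36,500 and σ = 5,000; the darkly shaded region to the right of x = 40,000 is P(x ≥ 40,000) = ?]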

The Excel function NORM.DIST can be used to compute the area under the curve for a normal probability distribution. The NORM.DIST function has four input values. The first is the value of interest corresponding to the probability you want to calculate, the second is the mean of the normal distribution, the third is the standard deviation of the normal distribution, and the fourth is TRUE or FALSE. We enter TRUE for the fourth input if we want the cumulative distribution function and FALSE if we want the probability density function. Figure 4.26 shows how we can answer the question of interest for Grear using Excel: in cell B5, we use the formula =NORM.DIST(40000, $B$1, $B$2, TRUE). Cell B1 contains the mean of the normal distribution and cell B2 contains the standard deviation. Because we want to know the area under the curve, we want the cumulative distribution function, so we use TRUE as the fourth input value in the formula. This formula provides a value of 0.7580 in cell B5. But note that this corresponds to P(x ≤ 40,000) = 0.7580. In other words, this gives us the area under the curve to the left of x = 40,000 in Figure 4.25, and we are interested in the area under the curve to the right of x = 40,000. To find this value, we simply use 1 − 0.7580 = 0.2420 (cell B6). Thus, 0.2420 is the probability that x will exceed 40,000 hours. We can conclude that about 24.2% of aircraft engines will exceed 40,000 lifetime flight hours.
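For readers who want to cross-check the spreadsheet result outside Excel, the same calculation can be sketched in Python with the standard library's statistics.NormalDist, whose cdf method plays the role of NORM.DIST with TRUE as the fourth input. The variable names are assumptions made for illustration; the parameter values are those of the Grear example.

```python
from statistics import NormalDist

# Lifetime flight hours: mean 36,500 and standard deviation 5,000
lifetime = NormalDist(mu=36500, sigma=5000)

# Counterpart of =NORM.DIST(40000, $B$1, $B$2, TRUE): area to the left of 40,000
p_at_most_40000 = lifetime.cdf(40000)

# Area to the right of 40,000 is the complement
p_exceeds_40000 = 1 - p_at_most_40000

print(f"P(x <= 40,000) = {p_at_most_40000:.4f}")  # 0.7580
print(f"P(x >  40,000) = {p_exceeds_40000:.4f}")  # 0.2420
```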


FIGURE 4.26

Excel Calculations for Grear Aircraft Engines Example

Let us now assume that Grear is considering a guarantee that will provide a discount on a replacement aircraft engine if the original engine does not meet the lifetime-flight-hour guarantee. How many lifetime flight hours should Grear guarantee if Grear wants no more than 10% of aircraft engines to be eligible for the discount guarantee? This question is interpreted graphically in Figure 4.27. According to Figure 4.27, the area under the curve to the left of the unknown guarantee on lifetime flight hours must be 0.10. To find the appropriate value using Excel, we use the function NORM.INV. The NORM.INV function has three input values. The first is the probability of interest, the second is the mean of the normal distribution, and the third is the standard deviation of the normal distribution. Figure 4.26 shows how we can use Excel to answer Grear’s question about a guarantee on lifetime flight hours. In cell B8 we use the formula =NORM.INV(0.10, $B$1, $B$2), where the mean of the normal distribution is contained in cell B1 and the standard deviation in cell B2. This provides a value of 30,092.24. Thus, a guarantee of 30,092 hours will meet the requirement that approximately 10% of the aircraft engines will be eligible for the guarantee. This information could be used by Grear’s analytics team to suggest a lifetime flight hours guarantee of 30,000 hours.

With the guarantee set at 30,000 hours, the actual percentage eligible for the guarantee will be =NORM.DIST(30000, 36500, 5000, TRUE) = 0.0968, or 9.68%.

FIGURE 4.27

Grear’s Discount Guarantee

[Figure: normal curve plotted against x with μ = 36,500 and σ = 5,000; the shaded left tail, containing the 10% of engines eligible for the discount guarantee, ends at the unknown guaranteed lifetime flight hours.]

Perhaps Grear is also interested in knowing the probability that an engine will have a lifetime of flight hours greater than 30,000 hours but less than 40,000 hours. How do we calculate this probability? First, we can restate this question as follows. What is P(30,000 ≤ x ≤ 40,000)? Figure 4.28 shows the area under the curve needed to answer this question. The area that corresponds to P(30,000 ≤ x ≤ 40,000) can be found by subtracting the area corresponding to P(x ≤ 30,000) from the area corresponding to P(x ≤ 40,000). In other words, P(30,000 ≤ x ≤ 40,000) = P(x ≤ 40,000) − P(x ≤ 30,000). Figure 4.29 shows how we can find the value for P(30,000 ≤ x ≤ 40,000) using Excel. We calculate P(x ≤ 40,000) in cell B5 and P(x ≤ 30,000) in cell B6 using the NORM.DIST function. We then calculate P(30,000 ≤ x ≤ 40,000) in cell B8 by subtracting the value in cell B6 from the value in cell B5. This tells us that P(30,000 ≤ x ≤ 40,000) = 0.7580 − 0.0968 = 0.6612. In other words, the probability that the lifetime flight hours for an aircraft engine will be between 30,000 hours and 40,000 hours is 0.6612.

Note that we can calculate P(30,000 ≤ x ≤ 40,000) in a single cell using the formula =NORM.DIST(40000, $B$1, $B$2, TRUE) - NORM.DIST(30000, $B$1, $B$2, TRUE).
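The guarantee and interval calculations above can also be reproduced in Python as a sanity check. In this sketch, the inv_cdf method of statistics.NormalDist stands in for Excel's NORM.INV and cdf stands in for NORM.DIST with TRUE; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

# Lifetime flight hours: mean 36,500 and standard deviation 5,000
lifetime = NormalDist(mu=36500, sigma=5000)

# Counterpart of =NORM.INV(0.10, $B$1, $B$2): hours with 10% of the area to the left
guarantee_hours = lifetime.inv_cdf(0.10)

# Share of engines eligible if the guarantee is rounded down to 30,000 hours
p_eligible_at_30000 = lifetime.cdf(30000)

# Interval probability as a difference of two cumulative probabilities
p_30000_to_40000 = lifetime.cdf(40000) - lifetime.cdf(30000)

print(f"Guarantee for 10% eligibility = {guarantee_hours:,.2f}")    # 30,092.24
print(f"P(x <= 30,000)                = {p_eligible_at_30000:.4f}")  # 0.0968
print(f"P(30,000 <= x <= 40,000)      = {p_30000_to_40000:.4f}")     # 0.6612
```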

FIGURE 4.28

Graph Showing the Area Under the Curve Corresponding to P(30,000 ≤ x ≤ 40,000) in the Grear Aircraft Engines Example

[Figure: normal curve plotted against x with μ = 36,500 and σ = 5,000; the shaded area between x = 30,000 and x = 40,000 is P(30,000 ≤ x ≤ 40,000), shown as the area P(x ≤ 40,000) minus the area P(x ≤ 30,000).]

FIGURE 4.29

Using Excel to Find P(30,000 ≤ x ≤ 40,000) in the Grear Aircraft Engines Example

Row   Column A                                                  Column B (formula)                      Column B (value)
1     Mean:                                                     36500                                   36500
2     Standard Deviation:                                       5000                                    5000
5     P(x ≤ 40,000) =                                           =NORM.DIST(40000, $B$1, $B$2, TRUE)     0.7580
6     P(x ≤ 30,000) =                                           =NORM.DIST(30000, $B$1, $B$2, TRUE)     0.0968
8     P(30,000 ≤ x ≤ 40,000) = P(x ≤ 40,000) − P(x ≤ 30,000) =  =B5-B6                                  0.6612

Exponential Probability Distribution

The exponential probability distribution may be used for random variables such as the time between patient arrivals at an emergency room, the distance between major defects in a highway, and the time until default in certain credit-risk models. The exponential probability density function is as follows:

Exponential Probability Density Function

\[
f(x) = \frac{1}{\mu} e^{-x/\mu} \quad \text{for } x \ge 0
\tag{4.23}
\]

where

μ = expected value or mean
e = 2.71828

As an example, suppose that x represents the time between business loan defaults for a particular lending agency. If the mean, or average, time between loan defaults is 15 months (μ = 15), the appropriate density function for x is

\[
f(x) = \frac{1}{15} e^{-x/15}
\]

Figure 4.30 is the graph of this probability density function. As with any continuous probability distribution, the area under the curve corresponding to an interval provides the probability that the random variable assumes a value in that interval. In the time between loan defaults example, the probability that the time between two defaults is six months or less, P(x ≤ 6), is defined to be the area under the curve in Figure 4.30 from x = 0 to x = 6. Similarly, the probability that the time between defaults will be 18 months or less, P(x ≤ 18), is the area under the curve from x = 0 to x = 18. Note also that the probability that the time between defaults will be between 6 months and 18 months, P(6 ≤ x ≤ 18), is given by the area under the curve from x = 6 to x = 18.


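FIGURE 4.30

Exponential Distribution for the Time Between Business Loan Defaults Example

[Figure: exponential density f(x), with vertical axis from 0 to 0.07, plotted against x, the time between defaults, from 0 to 30 months; the shaded areas mark P(x ≤ 6) and P(6 ≤ x ≤ 18).]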

To compute exponential probabilities such as those just described, we use the following formula, which provides the cumulative probability of obtaining a value for the exponential random variable of less than or equal to some specific value denoted by x₀.

Exponential Distribution: Cumulative Probabilities

\[
P(x \le x_0) = 1 - e^{-x_0/\mu}
\tag{4.24}
\]

For the time between defaults example, x = time between business loan defaults in months and μ = 15 months. Using equation (4.24),

\[
P(x \le x_0) = 1 - e^{-x_0/15}
\]

Hence, the probability that the time between two defaults is six months or less is

\[
P(x \le 6) = 1 - e^{-6/15} = 0.3297
\]

Using equation (4.24), we calculate the probability that the time between defaults is 18 months or less:

\[
P(x \le 18) = 1 - e^{-18/15} = 0.6988
\]

Thus, the probability that the time between two business loan defaults is between 6 months and 18 months is equal to 0.6988 − 0.3297 = 0.3691. Probabilities for any other interval can be computed similarly. Figure 4.31 shows how we can calculate these values for an exponential distribution in Excel using the function EXPON.DIST. The EXPON.DIST function has three inputs: the first input is x, the second input is 1/μ, and the third input is TRUE or FALSE. An input of TRUE for the third input provides the cumulative distribution function value and FALSE provides the probability density function value. Cell B3 calculates P(x ≤ 18) using the formula =EXPON.DIST(18, 1/$B$1, TRUE), where cell B1 contains the mean of the exponential distribution. Cell B4 calculates the value for P(x ≤ 6) and cell B5 calculates the value for P(6 ≤ x ≤ 18) = P(x ≤ 18) − P(x ≤ 6) by subtracting the value in cell B4 from the value in cell B3.

We can calculate P(6 ≤ x ≤ 18) in a single cell using the formula =EXPON.DIST(18, 1/$B$1, TRUE) - EXPON.DIST(6, 1/$B$1, TRUE).
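Equation (4.24) is simple enough to evaluate directly, so a small Python sketch can serve as a check on both the hand calculations and the EXPON.DIST results. The function name exponential_cdf and its parameterization by the mean (rather than by 1/μ, as Excel expects) are assumptions made for illustration.

```python
import math


def exponential_cdf(x0, mean):
    """P(x <= x0) for an exponential distribution, following equation (4.24)."""
    if x0 < 0:
        return 0.0  # the exponential random variable takes only values x >= 0
    return 1.0 - math.exp(-x0 / mean)


mean_months = 15  # mean time between business loan defaults

p_le_6 = exponential_cdf(6, mean_months)
p_le_18 = exponential_cdf(18, mean_months)
p_6_to_18 = p_le_18 - p_le_6

print(f"P(x <= 6)       = {p_le_6:.4f}")     # 0.3297
print(f"P(x <= 18)      = {p_le_18:.4f}")    # 0.6988
print(f"P(6 <= x <= 18) = {p_6_to_18:.4f}")  # 0.3691
```

Passing the mean to the function mirrors the way the text states the distribution; in Excel the second input to EXPON.DIST is the rate 1/μ, which is why the worksheet formulas use 1/$B$1.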


FIGURE 4.31

Using Excel to Calculate P(6 ≤ x ≤ 18) for the Time Between Business Loan Defaults Example

Row   Column A                                  Column B (formula)                Column B (value)
1     Mean, µ =                                 15                                15
3     P(x ≤ 18) =                               =EXPON.DIST(18, 1/$B$1, TRUE)     0.6988
4     P(x ≤ 6) =                                =EXPON.DIST(6, 1/$B$1, TRUE)      0.3297
5     P(6 ≤ x ≤ 18) = P(x ≤ 18) − P(x ≤ 6) =    =B3-B4                            0.3691

Notes + Comments

1. The way we describe probabilities is different for a discrete random variable than it is for a continuous random variable. For discrete random variables, we can talk about the probability of the random variable assuming a particular value. For continuous random variables, we can only talk about the probability of the random variable assuming a value within a given interval.

2. To see more clearly why the height of a probability density function is not a probability, think about a random variable with the following uniform probability distribution:

\[
f(x) =
\begin{cases}
2 & \text{for } 0 \le x \le 0.5 \\
0 & \text{elsewhere}
\end{cases}
\]

The height of the probability density function, f(x), is 2 for values of x between 0 and 0.5. However, we know that probabilities can never be greater than 1. Thus, we see that f(x) cannot be interpreted as the probability of x.

3. The standard normal distribution is the special case of the normal distribution for which the mean is 0 and the standard deviation is 1. This is useful because probabilities for all normal distributions can be computed using the standard normal distribution. We can convert any normal random variable x with mean μ and standard deviation σ to the standard normal random variable z by using the formula z = (x − μ)/σ. We interpret z as the number of standard deviations that the normal random variable x is from its mean μ. Then we can use z and a standard normal probability table to find the area under the curve. Excel contains special functions for the standard normal distribution: NORM.S.DIST and NORM.S.INV. The function NORM.S.DIST is similar to the function NORM.DIST, but it requires only two input values: the value of interest for calculating the probability and TRUE or FALSE, depending on whether you are interested in finding the probability density or the cumulative distribution function. NORM.S.INV is similar to the NORM.INV function, but it requires only the single input of the probability of interest. Neither NORM.S.DIST nor NORM.S.INV needs the additional parameters because they assume a mean of 0 and standard deviation of 1 for the standard normal distribution.

4. A property of the exponential distribution is that the mean and the standard deviation are equal to each other.

5. The continuous exponential distribution is related to the discrete Poisson distribution. If the Poisson distribution provides an appropriate description of the number of occurrences per interval, the exponential distribution provides a description of the length of the interval between occurrences. This relationship often arises in queueing applications in which, if arrivals follow a Poisson distribution, the time between arrivals must follow an exponential distribution.

6. Chapter 11 explains how values for discrete and continuous random variables can be generated in Excel for use in simulation models.
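The standardization described in note 3 can be illustrated with a brief Python sketch. It converts the Grear value x = 40,000 to a z value, confirms that the standard normal distribution gives the same cumulative probability as the original normal distribution, and recovers the 68.3%, 95.4%, and 99.7% intervals from the standard normal curve. Here statistics.NormalDist() with no arguments plays the role of NORM.S.DIST; the variable names are assumptions made for illustration.

```python
from statistics import NormalDist

mu, sigma = 36500, 5000   # Grear lifetime flight hours
x = 40000

# Number of standard deviations that x lies from its mean
z = (x - mu) / sigma      # 0.7

standard_normal = NormalDist()        # mean 0, standard deviation 1
p_via_z = standard_normal.cdf(z)      # area to the left of z
p_direct = NormalDist(mu, sigma).cdf(x)

print(f"z = {z:.2f}")
print(f"P(x <= 40,000) via z    = {p_via_z:.4f}")   # 0.7580
print(f"P(x <= 40,000) directly = {p_direct:.4f}")  # 0.7580

# Share of values within k standard deviations of the mean, for k = 1, 2, 3
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

Because every normal random variable standardizes to the same z scale, the three percentages printed at the end hold for any mean and standard deviation.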


Summary

In this chapter we introduced the concept of probability as a means of understanding and measuring uncertainty. Uncertainty is a factor in virtually all business decisions; thus, an understanding of probability is essential to modeling such decisions and improving the decision-making process. We introduced some basic relationships in probability including the concepts of outcomes, events, and calculations of related probabilities. We introduced the concept of conditional probability and discussed how to calculate posterior probabilities from prior probabilities using Bayes’ theorem. We then discussed both discrete and continuous random variables as well as some of the more common probability distributions related to these types of random variables. These probability distributions included the custom discrete, discrete uniform, binomial, and Poisson probability distributions for discrete random variables, as well as the uniform, triangular, normal, and exponential probability distributions for continuous random variables. We also discussed the concepts of the expected value (mean) and variance of a random variable.

Probability is used in many chapters that follow in this textbook. In Chapter 5, various measures for showing the strength of association rules are based on probability and conditional probability concepts. Random variables and probability distributions will be seen again in Chapter 6 when we discuss the use of statistical inference to draw conclusions about a population from sample data. In Chapter 7, we will see that the normal distribution is fundamentally involved when we discuss regression analysis as a way of estimating relationships between variables. Chapter 11 demonstrates the use of a variety of probability distributions in simulation models to evaluate the impact of uncertainty on decision-making. Conditional probability and Bayes’ theorem will be discussed again in Chapter 15 in the context of decision analysis. It is very important to have a basic understanding of probability, such as is provided in this chapter, as you continue to improve your skills in business analytics.

Glossary

Addition law: A probability law used to compute the probability of the union of events. For two events A and B, the addition law is P(A ∪ B) = P(A) + P(B) − P(A ∩ B). For two mutually exclusive events, P(A ∩ B) = 0, so P(A ∪ B) = P(A) + P(B).

Bayes’ theorem: A method used to compute posterior probabilities.

Binomial probability distribution: A probability distribution for a discrete random variable showing the probability of x successes in n trials.

Complement of A: The event consisting of all outcomes that are not in A.

Conditional probability: The probability of an event given that another event has already occurred. The conditional probability of A given B is P(A | B) = P(A ∩ B)/P(B).

Continuous random variable: A random variable that may assume any numerical value in an interval or collection of intervals. An interval can include negative and positive infinity.

Custom discrete probability distribution: A probability distribution for a discrete random variable for which each value xᵢ that the random variable assumes is associated with a defined probability f(xᵢ).

Discrete random variable: A random variable that can take on only specified discrete values.

Discrete uniform probability distribution: A probability distribution in which each possible value of the discrete random variable has the same probability.

Empirical probability distribution: A probability distribution for which the relative frequency method is used to assign probabilities.

Event: A collection of outcomes.


Expected value: A measure of the central location, or mean, of a random variable.

Exponential probability distribution: A continuous probability distribution that is useful in computing probabilities for the time it takes to complete a task or the time between arrivals. The mean and standard deviation for an exponential probability distribution are equal to each other.

Independent events: Two events A and B are independent if P(A | B) = P(A) or P(B | A) = P(B); the events do not influence each other.

Intersection of A and B: The event containing the outcomes belonging to both A and B. The intersection of A and B is denoted A ∩ B.

Joint probabilities: The probability of two events both occurring; in other words, the probability of the intersection of two events.

Marginal probabilities: The values in the margins of a joint probability table that provide the probabilities of each event separately.

Multiplication law: A law used to compute the probability of the intersection of events. For two events A and B, the multiplication law is P(A ∩ B) = P(B)P(A | B) or P(A ∩ B) = P(A)P(B | A). For two independent events, it reduces to P(A ∩ B) = P(A)P(B).

Mutually exclusive events: Events that have no outcomes in common; A ∩ B is empty and P(A ∩ B) = 0.

Normal probability distribution: A continuous probability distribution in which the probability density function is bell shaped and determined by its mean μ and standard deviation σ.

Poisson probability distribution: A probability distribution for a discrete random variable showing the probability of x occurrences of an event over a specified interval of time or space.

Posterior probabilities: Revised probabilities of events based on additional information.

Prior probability: Initial estimate of the probabilities of events.

Probability: A numerical measure of the likelihood that an event will occur.

Probability density function: A function used to compute probabilities for a continuous random variable. The area under the graph of a probability density function over an interval represents probability.

Probability distribution: A description of how probabilities are distributed over the values of a random variable.

Probability mass function: A function, denoted by f(x), that provides the probability that x assumes a particular value for a discrete random variable.

Probability of an event: Equal to the sum of the probabilities of outcomes for the event.

Random experiment: A process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined by chance.

Random variables: A numerical description of the outcome of an experiment.

Sample space: The set of all outcomes.

Standard deviation: Positive square root of the variance.

Triangular probability distribution: A continuous probability distribution in which the probability density function is shaped like a triangle defined by the minimum possible value a, the maximum possible value b, and the most likely value m. A triangular probability distribution is often used when only subjective estimates are available for the minimum, maximum, and most likely values.

Uniform probability distribution: A continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length.

Union of A and B: The event containing the outcomes belonging to A or B or both. The union of A and B is denoted by A ∪ B.

Variance: A measure of the variability, or dispersion, of a random variable.

Venn diagram: A graphical representation of the sample space and operations involving events, in which the sample space is represented by a rectangle and events are represented as circles within the sample space.


Problems

1. Airline Performance Measures. On-time arrivals, lost baggage, and customer complaints are three measures that are typically used to measure the quality of service being offered by airlines. Suppose that the following values represent the on-time arrival percentage, amount of lost baggage, and customer complaints for 10 U.S. airlines.

Airline              On-Time Arrivals (%)   Mishandled Baggage per 1,000 Passengers   Customer Complaints per 1,000 Passengers
Virgin America       83.5                   0.87                                      1.50
JetBlue              79.1                   1.88                                      0.79
AirTran Airways      87.1                   1.58                                      0.91
Delta Air Lines      86.5                   2.10                                      0.73
Alaska Airlines      87.5                   2.93                                      0.51
Frontier Airlines    77.9                   2.22                                      1.05
Southwest Airlines   83.1                   3.08                                      0.25
US Airways           85.9                   2.14                                      1.74
American Airlines    76.9                   2.92                                      1.80
United Airlines      77.4                   3.87                                      4.24

a. Based on the data above, if you randomly choose a Delta Air Lines flight, what is the probability that this individual flight will have an on-time arrival?
b. If you randomly choose 1 of the 10 airlines for a follow-up study on airline quality ratings, what is the probability that you will choose an airline with less than two mishandled baggage reports per 1,000 passengers?
c. If you randomly choose 1 of the 10 airlines for a follow-up study on airline quality ratings, what is the probability that you will choose an airline with more than one customer complaint per 1,000 passengers?
d. What is the probability that a randomly selected AirTran Airways flight will not arrive on time?

2. Rolling a Pair of Dice. Consider the random experiment of rolling a pair of dice. Suppose that we are interested in the sum of the face values showing on the dice.
a. How many outcomes are possible?
b. List the outcomes.
c. What is the probability of obtaining a value of 7?
d. What is the probability of obtaining a value of 9 or greater?

3. Ivy League College Admissions. Suppose that for a recent admissions class, an Ivy League college received 2,851 applications for early admission. Of this group, it admitted 1,033 students early, rejected 854 outright, and deferred 964 to the regular admission pool for further consideration. In the past, this school has admitted 18% of the deferred early admission applicants during the regular admission process. Counting the students admitted early and the students admitted during the regular admission process, the total class size was 2,375. Let E, R, and D represent the events that a student who applies for early admission is admitted early, rejected outright, or deferred to the regular admissions pool.
a. Use the data to estimate P(E), P(R), and P(D).
b. Are events E and D mutually exclusive? Find P(E ∩ D).
c. For the 2,375 students who were admitted, what is the probability that a randomly selected student was accepted during early admission?
d. Suppose a student applies for early admission. What is the probability that the student will be admitted for early admission or be deferred and later admitted during the regular admission process?


4. Two Events, A and B. Suppose that we have two events, A and B, with P(A) = 0.50, P(B) = 0.60, and P(A ∩ B) = 0.40.
a. Find P(A | B).
b. Find P(B | A).
c. Are A and B independent? Why or why not?

5. Intent to Pursue MBA. Students taking the Graduate Management Admissions Test (GMAT) were asked about their undergraduate major and intent to pursue their MBA as a full-time or part-time student. A summary of their responses is as follows:

                             Undergraduate Major
Intended Enrollment Status   Business   Engineering   Other   Totals
Full-Time                    352        197           251     800
Part-Time                    150        161           194     505
Totals                       502        358           445     1,305

a. Develop a joint probability table for these data.
b. Use the marginal probabilities of undergraduate major (business, engineering, or other) to comment on which undergraduate major produces the most potential MBA students.
c. If a student intends to attend classes full time in pursuit of an MBA degree, what is the probability that the student was an undergraduate engineering major?
d. If a student was an undergraduate business major, what is the probability that the student intends to attend classes full time in pursuit of an MBA degree?
e. Let F denote the event that the student intends to attend classes full time in pursuit of an MBA degree, and let B denote the event that the student was an undergraduate business major. Are events F and B independent? Justify your answer.

6. Student Loans and College Degrees. More than 40 million Americans are estimated to have at least one outstanding student loan to help pay college expenses (CNNMoney web site). Not all of these graduates pay back their debt in satisfactory fashion. Suppose that the following joint probability table shows the probabilities of student loan status and whether or not the student had received a college degree.

               College Degree
Loan Status    Yes     No
Satisfactory   0.26    0.24    0.50
Delinquent     0.16    0.34    0.50
               0.42    0.58

a. What is the probability that a student with a student loan had received a college degree?
b. What is the probability that a student with a student loan had not received a college degree?
c. Given that the student has received a college degree, what is the probability that the student has a delinquent loan?
d. Given that the student has not received a college degree, what is the probability that the student has a delinquent loan?
e. What is the impact of dropping out of college without a degree for students who have a student loan?

7. Senior Data Scientist Position Applicants. The Human Resources Manager for Optilytics LLC is evaluating applications for the position of Senior Data Scientist. The file OptilyticsLLC presents summary data of the applicants for the position.


Chapter 4 Probability: An Introduction to Modeling Uncertainty

a. Use a PivotTable in Excel to create a joint probability table showing the probabilities associated with a randomly selected applicant’s sex and highest degree achieved. Use this joint probability table to answer the questions below. b. What are the marginal probabilities? What do they tell you about the probabilities associated with the sex of applicants and highest degree completed by applicants? c. If the applicant is female, what is the probability that the highest degree completed by the applicant is a PhD? d. If the highest degree completed by the applicant is a bachelor’s degree, what is the probability that the applicant is male? e. What is the probability that a randomly selected applicant will be a male whose highest completed degree is a PhD? 8. U.S. Household Incomes. The U.S. Census Bureau is a leading source of quantitative data related to the people and economy of the United States. The crosstabulation below represents the number of households (thousands) and the household income by the highest level of education for the head of household (U.S. Census Bureau web site). Use this crosstabulation to answer the following questions.

                                        Household Income
Highest Level            Under      $25,000 to   $50,000 to   $100,000
of Education             $25,000    $49,999      $99,999      and Over     Total
High school graduate      9,880       9,970        9,441        3,482      32,773
Bachelor’s degree         2,484       4,164        7,666        7,817      22,131
Master’s degree             685       1,205        3,019        4,094       9,003
Doctoral degree              79         160          422        1,076       1,737
Total                    13,128      15,499       20,548       16,469      65,644

a. Develop a joint probability table. b. What is the probability the head of one of these households has a master’s degree or higher education? c. What is the probability a household is headed by someone with a high school diploma earning $100,000 or more? d. What is the probability one of these households has an income below $25,000? e. What is the probability a household is headed by someone with a bachelor’s degree earning less than $25,000? f. Are household income and educational level independent? 9. Probability of Homes Selling. Cooper Realty is a small real estate company located in Albany, New York, that specializes primarily in residential listings. The company recently became interested in determining the likelihood of one of its listings being sold within a certain number of days. An analysis of company sales of 800 homes in previous years produced the following data.

                              Days Listed Until Sold
Initial Asking Price      Under 30    31–90    Over 90    Total
Under $150,000                50        40        10       100
$150,000–$199,999             20       150        80       250
$200,000–$250,000             20       280       100       400
Over $250,000                 10        30        10        50
Total                        100       500       200       800

a. If A is defined as the event that a home is listed for more than 90 days before being sold, estimate the probability of A.


Problems

b. If B is defined as the event that the initial asking price is under $150,000, estimate the probability of B. c. What is the probability of A ∩ B? d. Assuming that a contract was just signed to list a home with an initial asking price of less than $150,000, what is the probability that the home will take Cooper Realty more than 90 days to sell? e. Are events A and B independent? 10. Computing Probabilities. The prior probabilities for events A1 and A2 are P(A1) = 0.40 and P(A2) = 0.60. It is also known that P(A1 ∩ A2) = 0. Suppose P(B | A1) = 0.20 and P(B | A2) = 0.05. a. Are A1 and A2 mutually exclusive? Explain. b. Compute P(A1 ∩ B) and P(A2 ∩ B). c. Compute P(B). d. Apply Bayes’ theorem to compute P(A1 | B) and P(A2 | B). 11. Credit Card Defaults. A local bank reviewed its credit-card policy with the intention of recalling some of its credit cards. In the past, approximately 5% of cardholders defaulted, leaving the bank unable to collect the outstanding balance. Hence, management established a prior probability of 0.05 that any particular cardholder will default. The bank also found that the probability of missing a monthly payment is 0.20 for customers who do not default. Of course, the probability of missing a monthly payment for those who default is 1. a. Given that a customer missed a monthly payment, compute the posterior probability that the customer will default. b. The bank would like to recall its credit card if the probability that a customer will default is greater than 0.20. Should the bank recall its credit card if the customer misses a monthly payment? Why or why not? 12. Prostate Cancer Screening. According to a 2018 article in Esquire magazine, approximately 70% of males over age 70 will develop cancerous cells in their prostate. Prostate cancer is second only to skin cancer as the most common form of cancer for males in the United States. One of the most common tests for the detection of prostate cancer is the prostate-specific antigen (PSA) test. However, this test is known to have a high false-positive rate (tests that come back positive for cancer when no cancer is present). Suppose there is a .02 probability that a male patient has prostate cancer before testing. The probability of a false-positive test is .75, and the probability of a false-negative (no indication of cancer when cancer is actually present) is .20. a. What is the probability that the male patient has prostate cancer if the PSA test comes back positive? b. What is the probability that the male patient has prostate cancer if the PSA test comes back negative? c. For older men, the prior probability of having cancer increases. Suppose that the prior probability of the male patient is .3 rather than .02. What is the probability that the male patient has prostate cancer if the PSA test comes back positive? What is the probability that the male patient has prostate cancer if the PSA test comes back negative? d. What can you infer about the PSA test from the results of parts (a), (b), and (c)? 13. Finding Oil in Alaska. An oil company purchased an option on land in Alaska. Preliminary geologic studies assigned the following prior probabilities: P(high-quality oil) = 0.50, P(medium-quality oil) = 0.20, P(no oil) = 0.30. a. What is the probability of finding oil? b. After 200 feet of drilling on the first well, a soil test is taken. The probabilities of finding the particular type of soil identified by the test are as follows: P(soil | high-quality oil) = 0.20, P(soil | medium-quality oil) = 0.80, P(soil | no oil) = 0.20.


How should the firm interpret the soil test? What are the revised probabilities, and what is the new probability of finding oil? 14. Unemployment Data. Suppose the following data represent the number of persons unemployed for a given number of months in Killeen, Texas. The values in the first column show the number of months unemployed and the values in the second column show the corresponding number of unemployed persons.

Months Unemployed    Number Unemployed
        1                 1,029
        2                 1,686
        3                 2,269
        4                 2,675
        5                 3,487
        6                 4,652
        7                 4,145
        8                 3,587
        9                 2,325
       10                 1,120

Let x be a random variable indicating the number of months a randomly selected person is unemployed. a. Use the data to develop an empirical discrete probability distribution for x. b. Show that your probability distribution satisfies the conditions for a valid discrete probability distribution. c. What is the probability that a person is unemployed for two months or less? Unemployed for more than two months? d. What is the probability that a person is unemployed for more than six months? 15. Information Systems Job Satisfaction. The percent frequency distributions of job satisfaction scores for a sample of information systems (IS) senior executives and middle managers are as follows. The scores range from a low of 1 (very dissatisfied) to a high of 5 (very satisfied).

Job Satisfaction Score    IS Senior Executives (%)    IS Middle Managers (%)
          1                          5                          4
          2                          9                         10
          3                          3                         12
          4                         42                         46
          5                         41                         28

a. Develop a probability distribution for the job satisfaction score of a randomly selected senior executive. b. Develop a probability distribution for the job satisfaction score of a randomly selected middle manager. c. What is the probability that a randomly selected senior executive will report a job satisfaction score of 4 or 5? d. What is the probability that a randomly selected middle manager is very satisfied? e. Compare the overall job satisfaction of senior executives and middle managers.
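Problems 14 and 15 both turn observed frequencies into a discrete probability distribution. The short Python sketch below illustrates one way to do this with the unemployment counts given in Problem 14. The function names are our own and are used only for illustration; the sketch is not the only way to organize the calculation.

```python
# Counts from the table in Problem 14: months unemployed -> number of persons.
counts = {1: 1029, 2: 1686, 3: 2269, 4: 2675, 5: 3487,
          6: 4652, 7: 4145, 8: 3587, 9: 2325, 10: 1120}


def empirical_distribution(freq):
    """Convert frequency counts into relative frequencies f(x)."""
    total = sum(freq.values())
    return {x: n / total for x, n in freq.items()}


def is_valid_distribution(f, tol=1e-9):
    """Check the two conditions: f(x) >= 0 for every x, and the f(x) sum to 1."""
    nonnegative = all(p >= 0 for p in f.values())
    sums_to_one = abs(sum(f.values()) - 1.0) <= tol
    return nonnegative and sums_to_one


f = empirical_distribution(counts)

# Cumulative probabilities are sums of f(x) over the values of interest.
p_two_or_less = sum(p for x, p in f.items() if x <= 2)
p_more_than_two = 1 - p_two_or_less

print(is_valid_distribution(f))
print(round(p_two_or_less, 4), round(p_more_than_two, 4))
```

The same two functions apply to the percent frequencies in Problem 15, because dividing each percentage by the column total of 100 gives the relative frequency.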


16. Expectation and Variance of a Random Variable. The following table provides a probability distribution for the random variable y.

  y      f(y)
  2      0.20
  4      0.30
  7      0.40
  8      0.10

a. Compute E(y). b. Compute Var(y) and σ. 17. Damage Claims at an Insurance Company. The probability distribution for damage claims paid by the Newton Automobile Insurance Company on collision insurance is as follows.

Payment ($)    Probability
      0           0.85
    500           0.04
  1,000           0.04
  3,000           0.03
  5,000           0.02
  8,000           0.01
 10,000           0.01

a. Use the expected collision payment to determine the collision insurance premium that would enable the company to break even. b. The insurance company charges an annual rate of $520 for the collision coverage. What is the expected value of the collision policy for a policyholder? (Hint: It is the expected payments from the company minus the cost of coverage.) Why does the policyholder purchase a collision policy with this expected value? 18. Plant Expansion Decision. The J.R. Ryland Computer Company is considering a plant expansion to enable the company to begin production of a new computer product. The company’s president must determine whether to make the expansion a medium- or large-scale project. Demand for the new product is uncertain, which for planning purposes may be low demand, medium demand, or high demand. The probability estimates for demand are 0.20, 0.50, and 0.30, respectively. Letting x and y indicate the annual profit in thousands of dollars, the firm’s planners developed the following profit forecasts for the medium- and large-scale expansion projects.

              Medium-Scale            Large-Scale
              Expansion Profit        Expansion Profit
Demand          x       f(x)            y       f(y)
Low             50      0.20             0      0.20
Medium         150      0.50           100      0.50
High           200      0.30           300      0.30

a. Compute the expected value for the profit associated with the two expansion alternatives. Which decision is preferred for the objective of maximizing the expected profit? b. Compute the variance for the profit associated with the two expansion alternatives. Which decision is preferred for the objective of minimizing the risk or uncertainty?
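Problems 16 through 18 rely on the expected value and variance of a discrete random variable, E(x) = Σ x f(x) and Var(x) = Σ (x − μ)² f(x). The following Python sketch shows these two formulas applied to the distribution of y given in Problem 16. The function names are hypothetical and chosen only for this illustration.

```python
import math


def expected_value(dist):
    """E(x): sum of each value times its probability."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(x): probability-weighted squared deviations from the mean."""
    mu = expected_value(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())


# Distribution from Problem 16: value of y -> f(y).
dist_y = {2: 0.20, 4: 0.30, 7: 0.40, 8: 0.10}

mean_y = expected_value(dist_y)
var_y = variance(dist_y)
sd_y = math.sqrt(var_y)  # standard deviation is the square root of the variance

print(round(mean_y, 2), round(var_y, 2), round(sd_y, 4))
```

For a decision such as the one in Problem 18, the same functions can be applied once to each alternative's profit distribution and the resulting means and variances compared.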


19. Binomial Distribution Calculations. Consider a binomial experiment with n = 10 and p = 0.10. a. Compute f(0). b. Compute f(2). c. Compute P(x ≤ 2). d. Compute P(x ≥ 1). e. Compute E(x). f. Compute Var(x) and σ. 20. Acceptance Sampling. Many companies use a quality control technique called acceptance sampling to monitor incoming shipments of parts, raw materials, and so on. In the electronics industry, component parts are commonly shipped from suppliers in large lots. Inspection of a sample of n components can be viewed as the n trials of a binomial experiment. The outcome for each component tested (trial) will be that the component is classified as good or defective. Reynolds Electronics accepts a lot from a particular supplier if the defective components in the lot do not exceed 1%. Suppose a random sample of five items from a recent shipment is tested. a. Assume that 1% of the shipment is defective. Compute the probability that no items in the sample are defective. b. Assume that 1% of the shipment is defective. Compute the probability that exactly one item in the sample is defective. c. What is the probability of observing one or more defective items in the sample if 1% of the shipment is defective? d. Would you feel comfortable accepting the shipment if one item was found to be defective? Why or why not? 21. Introductory Statistics Course Withdrawals. A university found that 20% of its students withdraw without completing the introductory statistics course. Assume that 20 students registered for the course. a. Compute the probability that two or fewer will withdraw. b. Compute the probability that exactly four will withdraw. c. Compute the probability that more than three will withdraw. d. Compute the expected number of withdrawals. 22. Poisson Distribution Calculations. Consider a Poisson distribution with μ = 3. a. Write the appropriate Poisson probability mass function. b. Compute f(2). c. Compute f(1). d. Compute P(x ≥ 2). 23. 911 Calls. Emergency 911 calls to a small municipality in Idaho come in at the rate of one every 2 minutes. Assume that the number of 911 calls is a random variable that can be described by the Poisson distribution. a. What is the expected number of 911 calls in 1 hour? b. What is the probability of three 911 calls in 5 minutes? c. What is the probability of no 911 calls during a 5-minute period? 24. Small Business Failures. A regional director responsible for business development in the state of Pennsylvania is concerned about the number of small business failures. If the mean number of small business failures per month is 10, what is the probability that exactly 4 small businesses will fail during a given month? Assume that the probability of a failure is the same for any two months and that the occurrence or nonoccurrence of a failure in any month is independent of failures in any other month. 25. Uniform Distribution Calculations. The random variable x is known to be uniformly distributed between 10 and 20. a. Show the graph of the probability density function. b. Compute P(x < 15). c. Compute P(12 ≤ x ≤ 18). d. Compute E(x). e. Compute Var(x).
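The binomial and Poisson calculations in Problems 19 through 24 all evaluate a probability mass function at one or more values of x. As a rough sketch, the Python code below writes out both mass functions and evaluates them with the parameters from Problems 19 and 22. The function and variable names are our own, and Excel's BINOM.DIST and POISSON.DIST functions are an equally reasonable way to obtain these values.

```python
import math


def binomial_pmf(x, n, p):
    """f(x) = C(n, x) * p^x * (1 - p)^(n - x) for x = 0, 1, ..., n."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, mu):
    """f(x) = mu^x * e^(-mu) / x! for x = 0, 1, 2, ..."""
    return mu**x * math.exp(-mu) / math.factorial(x)


# Parameters from Problem 19 (binomial) and Problem 22 (Poisson).
n, p = 10, 0.10
mu = 3

binom_f0 = binomial_pmf(0, n, p)
binom_at_most_2 = sum(binomial_pmf(k, n, p) for k in range(3))  # x = 0, 1, 2
binom_at_least_1 = 1 - binom_f0  # complement of "no successes"

poisson_f2 = poisson_pmf(2, mu)
poisson_at_least_2 = 1 - poisson_pmf(0, mu) - poisson_pmf(1, mu)

print(round(binom_f0, 4), round(binom_at_most_2, 4), round(binom_at_least_1, 4))
print(round(poisson_f2, 4), round(poisson_at_least_2, 4))
```

When a Poisson rate is stated for one time interval and the question concerns another, as in Problem 23, the mean μ should first be rescaled to the interval of interest before calling the mass function.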


26. RAND Function in Excel. Most computer languages include a function that can be used to generate random numbers. In Excel, the RAND function can be used to generate random numbers between 0 and 1. If we let x denote a random number generated using RAND, then x is a continuous random variable with the following probability density function: f(x) = 1 for 0 ≤ x ≤ 1, and f(x) = 0 elsewhere. a. Graph the probability density function. b. What is the probability of generating a random number between 0.25 and 0.75? c. What is the probability of generating a random number with a value less than or equal to 0.30? d. What is the probability of generating a random number with a value greater than 0.60? e. Generate 50 random numbers by entering =RAND() into 50 cells of an Excel worksheet. f. Compute the mean and standard deviation for the random numbers in part (e). 27. Bidding on a Piece of Land. Suppose we are interested in bidding on a piece of land and we know one other bidder is interested. The seller announced that the highest bid in excess of $10,000 will be accepted. Assume that the competitor’s bid x is a random variable that is uniformly distributed between $10,000 and $15,000. a. Suppose you bid $12,000. What is the probability that your bid will be accepted? b. Suppose you bid $14,000. What is the probability that your bid will be accepted? c. What amount should you bid to maximize the probability that you get the property? d. Suppose you know someone who is willing to pay you $16,000 for the property. Would you consider bidding less than the amount in part (c)? Why or why not? 28. Triangular Distribution Calculations. A random variable has a triangular probability density function with a = 50, b = 375, and m = 250. a. Sketch the probability distribution function for this random variable. Label the points a = 50, b = 375, and m = 250 on the x-axis. b. What is the probability that the random variable will assume a value between 50 and 250? c. What is the probability that the random variable will assume a value greater than 300? 29. Project Completion Time. The Siler Construction Company is about to bid on a new industrial construction project. To formulate their bid, the company needs to estimate the time required for the project. Based on past experience, management expects that the project will require at least 24 months, and could take as long as 48 months if there are complications. The most likely scenario is that the project will require 30 months. a. Assume that the actual time for the project can be approximated using a triangular probability distribution. What is the probability that the project will take less than 30 months? b. What is the probability that the project will take between 28 and 32 months? c. To submit a competitive bid, the company believes that if the project takes more than 36 months, then the company will lose money on the project. Management does not want to bid on the project if there is greater than a 25% chance that they will lose money on this project. Should the company bid on this project? 30. Large-Cap Stock Fund Returns. Suppose that the return for a particular large-cap stock fund is normally distributed with a mean of 14.4% and standard deviation of 4.4%. a. What is the probability that the large-cap stock fund has a return of at least 20%? b. What is the probability that the large-cap stock fund has a return of 10% or less?
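Normal probabilities such as those in Problem 30 come from the cumulative distribution function, and percentile questions come from its inverse. The sketch below uses the NormalDist class from Python's standard statistics module (available in Python 3.8 and later) with the mean and standard deviation stated in Problem 30. The variable names are our own, and Excel's NORM.DIST and NORM.INV functions serve the same purpose.

```python
from statistics import NormalDist

# Return of the large-cap stock fund in Problem 30, in percent.
fund_return = NormalDist(mu=14.4, sigma=4.4)

# P(return >= 20) is the upper-tail area, 1 minus the cumulative probability.
p_at_least_20 = 1 - fund_return.cdf(20)

# P(return <= 10) is the cumulative probability itself.
p_at_most_10 = fund_return.cdf(10)

# inv_cdf answers the reverse question: which value has a given
# cumulative probability? Here, the return exceeded only 2% of the time.
top_2_percent_cutoff = fund_return.inv_cdf(0.98)

print(round(p_at_least_20, 4), round(p_at_most_10, 4))
print(round(top_2_percent_cutoff, 2))
```

The 2% cutoff is included only to illustrate the inverse calculation; it is not part of Problem 30.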


31. IQ Scores and Mensa. A person must score in the upper 2% of the population on an IQ test to qualify for membership in Mensa, the international high IQ society. If IQ scores are normally distributed with a mean of 100 and a standard deviation of 15, what score must a person have to qualify for Mensa? 32. Web Site Traffic. Assume that the traffic to the web site of Smiley’s People, Inc., which sells customized T-shirts, follows a normal distribution, with a mean of 4.5 million visitors per day and a standard deviation of 820,000 visitors per day. a. What is the probability that the web site has fewer than 5 million visitors in a single day? b. What is the probability that the web site has 3 million or more visitors in a single day? c. What is the probability that the web site has between 3 million and 4 million visitors in a single day? d. Assume that 85% of the time, the Smiley’s People web servers can handle the daily web traffic volume without purchasing additional server capacity. What is the amount of web traffic that will require Smiley’s People to purchase additional server capacity? 33. Probability of Defect. Suppose that Motorola uses the normal distribution to determine the probability of defects and the number of defects in a particular production process. Assume that the production process manufactures items with a mean weight of 10 ounces. Calculate the probability of a defect and the suspected number of defects for a 1,000-unit production run in the following situations. a. The process standard deviation is 0.15, and the process control is set at plus or minus one standard deviation. Units with weights less than 9.85 or greater than 10.15 ounces will be classified as defects. b. Through process design improvements, the process standard deviation can be reduced to 0.05. Assume that the process control remains the same, with weights less than 9.85 or greater than 10.15 ounces being classified as defects. c. What is the advantage of reducing process variation, thereby causing process control limits to be at a greater number of standard deviations from the mean? 34. Exponential Distribution Calculations. Consider the following exponential probability density function:

f(x) = (1/3) e^(−x/3)   for x ≥ 0

a. Write the formula for P(x ≤ x0). b. Find P(x ≤ 2). c. Find P(x ≥ 3). d. Find P(x ≤ 5). e. Find P(2 ≤ x ≤ 5). 35. Vehicle Arrivals at an Intersection. The time between arrivals of vehicles at a particular intersection follows an exponential probability distribution with a mean of 12 seconds. a. Sketch this exponential probability distribution. b. What is the probability that the arrival time between vehicles is 12 seconds or less? c. What is the probability that the arrival time between vehicles is 6 seconds or less? d. What is the probability of 30 or more seconds between vehicle arrivals? 36. Time Spent Playing World of Warcraft. Suppose that the time spent by players in a single session on the World of Warcraft multiplayer online role-playing game follows an exponential distribution with a mean of 38.3 minutes. a. Write the exponential probability distribution function for the time spent by players on a single session of World of Warcraft. b. What is the probability that a player will spend between 20 and 40 minutes on a single session of World of Warcraft? c. What is the probability that a player will spend more than 1 hour on a single session of World of Warcraft?
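For an exponential random variable with mean μ, cumulative probabilities follow from P(x ≤ x0) = 1 − e^(−x0/μ), and interval probabilities are differences of two such values. The Python sketch below applies this formula with μ = 3, the mean implied by the density in Problem 34. The function name is our own and is used only for illustration.

```python
import math


def exponential_cdf(x0, mean):
    """P(x <= x0) for an exponential distribution with the given mean."""
    if x0 < 0:
        return 0.0  # the distribution has no probability below zero
    return 1 - math.exp(-x0 / mean)


mean = 3  # from the density f(x) = (1/3) e^(-x/3) in Problem 34

p_at_most_2 = exponential_cdf(2, mean)
p_at_least_3 = 1 - exponential_cdf(3, mean)  # upper tail
p_between_2_and_5 = exponential_cdf(5, mean) - exponential_cdf(2, mean)

print(round(p_at_most_2, 4), round(p_at_least_3, 4), round(p_between_2_and_5, 4))
```

Because the exponential distribution is continuous, P(x ≥ 3) and P(x > 3) are equal, so the upper tail can be computed as one minus the cumulative probability.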

Case Problem 1: Hamilton County Judges

Hamilton County judges try thousands of cases per year. In an overwhelming majority of the cases disposed, the verdict stands as rendered. However, some cases are appealed, and of those appealed, some of the cases are reversed. Kristen DelGuzzi of the Cincinnati Enquirer newspaper conducted a study of cases handled by Hamilton County judges over a three-year period. Shown in the table below are the results for 182,908 cases handled (disposed) by 38 judges in Common Pleas Court, Domestic Relations Court, and Municipal Court. Two of the judges (Dinkelacker and Hogan) did not serve in the same court for the entire three-year period. The purpose of the newspaper’s study was to evaluate the performance of the judges. Appeals are often the result of mistakes made by judges, and the newspaper wanted to know which judges were doing a good job and which were making too many mistakes. You are called in to assist in the data analysis. Use your knowledge of probability and conditional probability to help with the ranking of the judges. You also may be able to analyze the likelihood of appeal and reversal for cases handled by different courts.
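As a starting point for this analysis, the Python sketch below computes three probabilities for a single judge from the counts of disposed, appealed, and reversed cases. It uses the first row of the Common Pleas Court table (Fred Cartolano) and assumes that every reversed case was first appealed, so that the event "reversed" is contained in the event "appealed." The function name and the layout of the result are our own choices for illustration, not a prescribed method for the report.

```python
def judge_probabilities(disposed, appealed, reversed_cases):
    """Return P(appeal), P(reversal), and P(reversal | appeal) for one judge.

    Assumes a case can be reversed only if it was appealed, so that
    P(reversal and appeal) = P(reversal).
    """
    p_appeal = appealed / disposed
    p_reversal = reversed_cases / disposed
    # Conditional probability: P(R | A) = P(R and A) / P(A).
    p_reversal_given_appeal = p_reversal / p_appeal
    return p_appeal, p_reversal, p_reversal_given_appeal


# Fred Cartolano, Common Pleas Court: 3,037 disposed, 137 appealed, 12 reversed.
p_a, p_r, p_r_given_a = judge_probabilities(3037, 137, 12)

print(round(p_a, 4), round(p_r, 4), round(p_r_given_a, 4))
```

Applying the same function to every row, and to the court totals, gives the inputs needed to compare judges within a court. Which of the three probabilities should drive the ranking is a judgment left to the analyst.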

Total Cases Disposed, Appealed, and Reversed in Hamilton County Courts

Common Pleas Court

Judge                    Total Cases Disposed    Appealed Cases    Reversed Cases
Fred Cartolano                  3,037                 137                12
Thomas Crush                    3,372                 119                10
Patrick Dinkelacker             1,258                  44                 8
Timothy Hogan                   1,954                  60                 7
Robert Kraft                    3,138                 127                 7
William Mathews                 2,264                  91                18
William Morrissey               3,032                 121                22
Norbert Nadel                   2,959                 131                20
Arthur Ney, Jr.                 3,219                 125                14
Richard Niehaus                 3,353                 137                16
Thomas Nurre                    3,000                 121                 6
John O’Connor                   2,969                 129                12
Robert Ruehlman                 3,205                 145                18
J. Howard Sundermann              955                  60                10
Ann Marie Tracey                3,141                 127                13
Ralph Winkler                   3,089                  88                 6
Total                          43,945               1,762               199

Domestic Relations Court

Judge                    Total Cases Disposed    Appealed Cases    Reversed Cases
Penelope Cunningham             2,729                   7                 1
Patrick Dinkelacker             6,001                  19                 4
Deborah Gaines                  8,799                  48                 9
Ronald Panioto                 12,970                  32                 3
Total                          30,499                 106                17


Municipal Court

Judge                    Total Cases Disposed    Appealed Cases    Reversed Cases
Mike Allen                      6,149                  43                 4
Nadine Allen                    7,812                  34                 6
Timothy Black                   7,954                  41                 6
David Davis                     7,736                  43                 5
Leslie Isaiah Gaines            5,282                  35                13
Karla Grady                     5,253                   6                 0
Deidra Hair                     2,532                   5                 0
Dennis Helmick                  7,900                  29                 5
Timothy Hogan                   2,308                  13                 2
James Patrick Kenney            2,798                   6                 1
Joseph Luebbers                 4,698                  25                 8
William Mallory                 8,277                  38                 9
Melba Marsh                     8,219                  34                 7
Beth Mattingly                  2,971                  13                 1
Albert Mestemaker               4,975                  28                 9
Mark Painter                    2,239                   7                 3
Jack Rosen                      7,790                  41                13
Mark Schweikert                 5,403                  33                 6
David Stockdale                 5,371                  22                 4
John A. West                    2,797                   4                 2
Total                         108,464                 500               104

Managerial Report

Prepare a report with your rankings of the judges. Also, include an analysis of the likelihood of appeal and case reversal in the three courts. At a minimum, your report should include the following: 1. The probability of cases being appealed and reversed in the three different courts. 2. The probability of a case being appealed for each judge. 3. The probability of a case being reversed for each judge. 4. The probability of reversal given an appeal for each judge. 5. Rank the judges within each court. State the criteria you used and provide a rationale for your choice.

Case Problem 2: McNeil’s Auto Mall

Harriet McNeil, proprietor of McNeil’s Auto Mall, believes that it is good business for her automobile dealership to have more customers on the lot than can be served, as she believes this creates an impression that demand for the automobiles on her lot is high. However, she also understands that if there are far more customers on the lot than can be served by her salespeople, her dealership may lose sales to customers who become frustrated and leave without making a purchase. Ms. McNeil is primarily concerned about the staffing of salespeople on her lot on Saturday mornings (8:00 a.m. to noon), which is the busiest time of the week for McNeil’s Auto Mall. On Saturday mornings, an average of 6.8 customers arrive per hour. The customers arrive randomly at a constant rate throughout the morning, and a salesperson spends an average of one hour with a customer. Ms. McNeil’s experience has led her to conclude that if there are two more customers on her lot than can be served at any time on a Saturday morning, her automobile dealership achieves the optimal balance of creating an impression of high demand without losing too many customers who become frustrated and leave without making a purchase. Ms. McNeil now wants to determine how many salespeople she should have on her lot on Saturday mornings in order to achieve her goal of having two more customers on her lot than can be served at any time. She understands that occasionally the number of customers on her lot will exceed the number of salespersons by more than two, and she is willing to accept such an occurrence no more than 10% of the time.

Managerial Report

Ms. McNeil has asked you to determine the number of salespersons she should have on her lot on Saturday mornings in order to satisfy her criteria. In answering Ms. McNeil’s question, consider the following three questions: 1. How is the number of customers who arrive in the lot on a Saturday morning distributed? 2. Suppose Ms. McNeil currently uses five salespeople on her lot on Saturday morning. Using the probability distribution you identified in (1), what is the probability that the number of customers who arrive on her lot will exceed the number of salespersons by more than two? Does her current Saturday morning employment strategy satisfy her stated objective? Why or why not? 3. What is the minimum number of salespeople Ms. McNeil should have on her lot on Saturday mornings to achieve her objective?

Case Problem 3: Gebhardt Electronics

Gebhardt Electronics produces a wide variety of transformers that it sells directly to manufacturers of electronics equipment. For one component used in several models of its transformers, Gebhardt uses a 1-meter length of 0.20 mm diameter solid wire made of pure Oxygen-Free Electronic (OFE) copper. A flaw in the wire reduces its conductivity and increases the likelihood it will break, and this critical component is difficult to reach and repair after a transformer has been constructed. Therefore, Gebhardt wants to use primarily flawless lengths of wire in making this component. The company is willing to accept no more than a 1 in 20 chance that a 1-meter length taken from a spool will be flawless. Gebhardt also occasionally uses smaller pieces of the same wire in the manufacture of other components, so the 1-meter segments to be used for this component are essentially taken randomly from a long spool of 0.20 mm diameter solid OFE copper wire. Gebhardt is now considering a new supplier for copper wire. This supplier claims that its spools of 0.20 mm diameter solid OFE copper wire average 127 centimeters between flaws. Gebhardt now must determine whether the new supply will be satisfactory if the supplier’s claim is valid.

Managerial Report

In making this assessment for Gebhardt Electronics, consider the following four questions: 1. If the new supplier does provide spools of 0.20 mm solid OFE copper wire that average 127 centimeters between flaws, how is the length of wire between two consecutive flaws distributed? 2. Using the probability distribution you identified in (1), what is the probability that Gebhardt’s criteria will be met (i.e., a 1 in 20 chance that a randomly selected 1-meter segment of wire provided by the new supplier will be flawless)? 3. In centimeters, what is the minimum mean length between consecutive flaws that would result in satisfaction of Gebhardt’s criteria? 4. In centimeters, what is the minimum mean length between consecutive flaws that would result in a 1 in 100 chance that a randomly selected 1-meter segment of wire provided by the new supplier will be flawless?
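One possible way to set up the calculations is sketched below in Python. The sketch assumes that flaws occur at random along the wire at a constant average rate, so that the length between consecutive flaws can be modeled with an exponential distribution. This modeling choice is an assumption made for illustration and is itself the subject of question 1. The function names and the 0.50 target used at the end are hypothetical.

```python
import math

SEGMENT_CM = 100  # a 1-meter segment, expressed in centimeters


def prob_flawless(mean_between_flaws_cm, segment_cm=SEGMENT_CM):
    """P(no flaw in a segment) = e^(-segment / mean) under the exponential model."""
    return math.exp(-segment_cm / mean_between_flaws_cm)


def mean_for_flawless_prob(target_prob, segment_cm=SEGMENT_CM):
    """Mean length between flaws that makes P(no flaw in a segment) equal target_prob.

    Obtained by solving target_prob = e^(-segment / mean) for the mean.
    """
    return -segment_cm / math.log(target_prob)


# Supplier's claim: an average of 127 centimeters between flaws.
p_flawless = prob_flawless(127)
p_flawed = 1 - p_flawless

# Illustrative target only: mean length giving a 0.50 chance of a flawless segment.
mean_for_half = mean_for_flawless_prob(0.50)

print(round(p_flawless, 4), round(p_flawed, 4), round(mean_for_half, 2))
```

The second function simply inverts the first, so any target probability from the managerial report can be substituted for the illustrative value of 0.50.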

Chapter 5 Descriptive Data Mining

CONTENTS
Analytics in Action: Advice from a Machine
5.1 CLUSTER ANALYSIS
Measuring Distance Between Observations
k-Means Clustering
Hierarchical Clustering and Measuring Dissimilarity Between Clusters
Hierarchical Clustering Versus k-Means Clustering
5.2 ASSOCIATION RULES
Evaluating Association Rules
5.3 TEXT MINING
Voice of the Customer at Triad Airline
Preprocessing Text Data for Analysis
Movie Reviews
Computing Dissimilarity Between Documents
Word Clouds
Summary
Glossary
Problems

AVAILABLE IN THE MINDTAP READER:
APPENDIX: GETTING STARTED WITH RATTLE IN R
APPENDIX: k-MEANS CLUSTERING WITH R
APPENDIX: HIERARCHICAL CLUSTERING WITH R
APPENDIX: ASSOCIATION RULES WITH R
APPENDIX: TEXT MINING WITH R
APPENDIX: R/RATTLE SETTINGS TO SOLVE CHAPTER 5 PROBLEMS
APPENDIX: OPENING AND SAVING EXCEL FILES IN JMP PRO
APPENDIX: k-MEANS CLUSTERING WITH JMP PRO
APPENDIX: HIERARCHICAL CLUSTERING WITH JMP PRO
APPENDIX: ASSOCIATION RULES WITH JMP PRO
APPENDIX: TEXT MINING WITH JMP PRO
APPENDIX: JMP PRO SETTINGS TO SOLVE CHAPTER 5 PROBLEMS

Analytics in Action

Advice from a Machine¹

The proliferation of data and increase in computing power have sparked the development of automated recommender systems, which provide consumers with suggestions for movies, music, books, clothes, restaurants, dating, and whom to follow on Twitter. The sophisticated, proprietary algorithms guiding recommender systems measure the degree of similarity between users or items to identify recommendations of potential interest to a user. Netflix, a company that provides media content via DVD-by-mail and Internet streaming, provides its users with recommendations for movies and television shows based on each user’s expressed interests and feedback on previously viewed content. As its business has shifted from renting DVDs by mail to streaming content online, Netflix has been able to track its customers’ viewing behavior more closely. This allows Netflix’s recommendations to account for differences in viewing behavior based on the day of the week, the time of day, the device used (computer, phone, television), and even the viewing location.

The use of recommender systems is prevalent in e-commerce. Using attributes detailed by the Music Genome Project, Pandora Internet Radio plays songs with properties similar to songs that a user “likes.” In the online dating world, web sites such as eHarmony, Match.com, and OKCupid use different “formulas” to take into account hundreds of different behavioral traits to propose date “matches.” Stitch Fix, a personal shopping service, combines recommendation algorithms and human input from its fashion experts to match its inventory of fashion items to its clients.

¹ “The Science Behind the Netflix Algorithms that Decide What You’ll Watch Next,” http://www.wired.com/2013/08/qq_netflix-algorithm. Retrieved on August 7, 2013; E. Colson, “Using Human and Machine Processing in Recommendation Systems,” First AAAI Conference on Human Computation and Crowdsourcing (2013); K. Zhao, X. Wang, M. Yu, and B. Gao, “User Recommendation in Reciprocal and Bipartite Social Networks—A Case Study of Online Dating,” IEEE Intelligent Systems 29, no. 2 (2014).

Predictive data mining is discussed in Chapter 9.

Over the past few decades, technological advances have led to a dramatic increase in the amount of recorded data. The use of smartphones, radio-frequency identification (RFID) tags, electronic sensors, credit cards, and the Internet has facilitated the collection of data from phone conversations, e-mails, business transactions, product and customer tracking, and web browsing. The increase in the use of data-mining techniques in business has been caused largely by three events: the explosion in the amount of data being produced and electronically tracked, the ability to electronically warehouse these data, and the affordability of computer power to analyze the data. In this chapter, we discuss the analysis of large quantities of data in order to gain insight on customers and to uncover patterns to improve business processes.

We define an observation, or record, as the set of recorded values of variables associated with a single entity. An observation is often displayed as a row of values in a spreadsheet or database in which the columns correspond to the variables. For example, in a university’s database of alumni, an observation may consist of an alumnus’s age, gender, marital status, employer, and position title, as well as the size and frequency of donations to the university.

In this chapter, we focus on descriptive data-mining methods, also called unsupervised learning techniques. In an unsupervised learning application, there is no outcome variable to predict; rather, the goal is to use the variable values to identify relationships between observations. Unsupervised learning approaches can be thought of as high-dimensional descriptive analytics because they are designed to describe patterns and relationships in large data sets with many observations of many variables. Without an explicit outcome (or one that is objectively known), there is no definitive measure of accuracy. Instead, qualitative assessments, such as how well the results match expert judgment, are used to assess and compare the results from an unsupervised learning method.


5.1 Cluster Analysis

The goal of clustering is to organize observations into similar groups based on the observed variables. As part of the data preparation step of a larger data analysis project, clustering can be employed to identify variables or observations that can be aggregated or removed from consideration. Cluster analysis is commonly used in marketing to divide consumers into different homogeneous groups, a process known as market segmentation. Identifying different clusters of consumers allows a firm to tailor marketing strategies for each segment. Cluster analysis can also be used to identify outliers, which in a manufacturing setting may represent quality-control problems and in financial transactions may represent fraudulent activity.

In this section, we consider the use of cluster analysis to assist a company called Know Thy Customer (KTC), a financial advising company that provides personalized financial advice to its clients. As a basis for developing this tailored advising, KTC would like to segment its customers into several groups (or clusters) so that the customers within a group are similar with respect to key characteristics and are dissimilar to customers that are not in the group. For each customer, KTC has an observation consisting of the following variables:

Age = age of the customer in whole years
Female = 1 if female, 0 if not
Income = annual income in dollars
Married = 1 if married, 0 if not
Children = number of children
Loan = 1 if customer has a car loan, 0 if not
Mortgage = 1 if customer has a mortgage, 0 if not

Data file: DemoKTC

We present two clustering methods using a small sample of data from KTC. First, we consider k-means clustering, a method that iteratively assigns each observation to one of k clusters in an attempt to achieve clusters that contain observations as similar to each other as possible. Second, we consider agglomerative hierarchical clustering, which starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters. Because both methods rely upon measuring the dissimilarity between observations, we first discuss how to calculate distance between observations.

Measuring Distance Between Observations

The goal of cluster analysis is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. Therefore, to formalize this process, we need explicit measurements of dissimilarity or, conversely, similarity. Some metrics track similarity between observations, and a clustering method using such a metric would seek to maximize the similarity between observations. Other metrics measure dissimilarity, or distance, between observations, and a clustering method using one of these metrics would seek to minimize the distance between observations in a cluster.

When observations include numerical variables, Euclidean distance is a common method to measure dissimilarity between observations. Let observations \(u = (u_1, u_2, \ldots, u_q)\) and \(v = (v_1, v_2, \ldots, v_q)\) each comprise measurements of q variables. The Euclidean distance between observations u and v is

\[ d_{uv}^{\text{euclid}} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2} \]

Figure 5.1 depicts Euclidean distance for two observations consisting of two variables (q = 2). Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values.

[FIGURE 5.1 Euclidean Distance. Horizontal axis: First Variable; vertical axis: Second Variable; plotted points u = (u1, u2) and v = (v1, v2), with the distance between them labeled \(d_{uv}^{\text{euclid}}\).]

Euclidean distance is highly influenced by the scale on which variables are measured. For example, consider the task of clustering customers on the basis of the variables Age and Income. Let observation u = (23, $20,375) correspond to a 23-year-old customer with an annual income of $20,375 and observation v = (48, $19,475) correspond to a 48-year-old with an annual income of $19,475. As measured by Euclidean distance, the dissimilarity between these two observations is

\[ d_{uv}^{\text{euclid}} = \sqrt{(23 - 48)^2 + (20{,}375 - 19{,}475)^2} = \sqrt{625 + 810{,}000} \approx 900 \]

Refer to Chapter 2 for a discussion of z-scores.
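The calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `euclidean_distance` and the variable `income_share` are illustrative choices, not part of the text. The second printed value shows what fraction of the squared distance comes from the Income variable alone.

```python
from math import sqrt

def euclidean_distance(u, v):
    """Square root of the summed squared differences across all variables."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Observations from the text: (Age in years, Income in dollars).
u = (23, 20375)
v = (48, 19475)

distance = euclidean_distance(u, v)

# Squared difference contributed by each variable: [625, 810000].
squared_terms = [(a - b) ** 2 for a, b in zip(u, v)]
income_share = squared_terms[1] / sum(squared_terms)

print(round(distance, 2))      # 900.35
print(round(income_share, 4))  # 0.9992
```

Almost all of the squared distance is attributable to Income, even though the two customers differ by 25 years in age.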

Thus, we see that when using the raw variable values, the amount of dissimilarity between observations is dominated by the Income variable because of the difference in the magnitude of the measurements. Therefore, it is common to standardize the units of each variable j of each observation u. One common standardization technique is to replace the variable values of each observation with the respective z-scores. For example, \(u_j\), the value of the jth variable in observation u, is replaced with

\[ z(u_j) = \frac{u_j - \text{average value of } j\text{th variable}}{\text{standard deviation of } j\text{th variable}} \]

Suppose that the Age variable has a sample mean of 46 and a sample standard deviation of 13. Also, suppose that the Income variable has a sample mean of 28,012 and a sample standard deviation of 13,703. Then, the standardized (or normalized) values of observations u and v are (−1.77, −0.56) and (0.15, −0.62), respectively. The Euclidean distance between these two observations based on standardized values is

\[ d_{uv}^{\text{euclid}}\ (\text{standardized}) = \sqrt{(-1.77 - 0.15)^2 + (-0.56 - (-0.62))^2} = \sqrt{3.6864 + 0.0036} = 1.92 \]
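As a check on the arithmetic, the standardization and the resulting distance can be sketched in Python using the sample means and standard deviations quoted above. The helper names `z_score` and `standardize` are assumptions made for this illustration. The sketch keeps the z-scores unrounded, which gives the same distance to two decimal places as the rounded values used in the text.

```python
from math import sqrt

def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

def euclidean_distance(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Sample statistics quoted in the text.
AGE_MEAN, AGE_STD = 46, 13
INCOME_MEAN, INCOME_STD = 28012, 13703

def standardize(observation):
    """Convert an (Age, Income) observation to a pair of z-scores."""
    age, income = observation
    return (z_score(age, AGE_MEAN, AGE_STD),
            z_score(income, INCOME_MEAN, INCOME_STD))

u_std = standardize((23, 20375))
v_std = standardize((48, 19475))

print([round(z, 2) for z in u_std])                # [-1.77, -0.56]
print([round(z, 2) for z in v_std])                # [0.15, -0.62]
print(round(euclidean_distance(u_std, v_std), 2))  # 1.92
```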

Based on standardized variable values, we observe that observations u and v are actually much more different in age than in income. The conversion of the data to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations due to the squaring of the differences in variable values under the square root. Depending on the business goal of the clustering task and the cause of the outlier value, the identification of an outlier may suggest the removal of the corresponding observation, the correction of the outlier value, or the uncovering of an interesting insight corresponding to the outlier observation.

Manhattan distance is a dissimilarity measure that is more robust to outliers than Euclidean distance. The Manhattan distance between observations u and v is

\[ d_{uv}^{\text{man}} = |u_1 - v_1| + |u_2 - v_2| + \cdots + |u_q - v_q| \]
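The following sketch applies the Manhattan formula to the standardized observations u = (−1.77, −0.56) and v = (0.15, −0.62) given earlier. It then uses a hypothetical pair of three-variable observations (not from the KTC data) to illustrate the robustness claim: the outlying difference in the third variable accounts for a smaller share of the Manhattan distance than of the squared Euclidean distance.

```python
from math import sqrt

def manhattan_distance(u, v):
    """Sum of absolute differences across all variables."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean_distance(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Standardized (Age, Income) observations from the text.
u_std = (-1.77, -0.56)
v_std = (0.15, -0.62)
print(round(manhattan_distance(u_std, v_std), 2))  # 1.98

# Hypothetical pair whose third variable differs by an outlying amount.
a = (0.0, 0.0, 0.0)
b = (1.0, 1.0, 10.0)
manhattan_share = 10.0 / manhattan_distance(a, b)            # 10 / 12
euclidean_share = 10.0 ** 2 / euclidean_distance(a, b) ** 2  # 100 / 102
print(round(manhattan_share, 2), round(euclidean_share, 2))  # 0.83 0.98
```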

[FIGURE 5.2 Manhattan Distance. Horizontal axis: First Variable; vertical axis: Second Variable; plotted points u = (u1, u2) and v = (v1, v2), with the distance between them labeled \(d_{uv}^{\text{man}}\).]

Figure 5.2 depicts the Manhattan distance for two observations consisting of two variables (q = 2). From Figure 5.2, we observe that the Manhattan distance between two observations is the sum of the lengths of the perpendicular line segments connecting observations u and v. In contrast to Euclidean distance, which corresponds to the straight-line “as the crow flies” segment between two observations, Manhattan distance corresponds to the distance as if travelled along rectangular city blocks. The Manhattan distance between the standardized observations u = (−1.77, −0.56) and v = (0.15, −0.62) is

\[ d_{uv}^{\text{man}} = |-1.77 - 0.15| + |-0.56 - (-0.62)| = 1.92 + 0.06 = 1.98 \]

After conversion to z-scores, unequal weighting of variables can also be considered by multiplying the variables of each observation by a selected set of weights. For instance, after standardizing the units on customer observations so that income and age are expressed as their respective z-scores (instead of expressed in dollars and years), we can multiply the income z-scores by 2 if we wish to treat income with twice the importance of age. In other words, standardizing removes bias due to the difference in measurement units, and variable weighting allows the analyst to introduce any desired bias based on the business context.

When clustering observations solely on the basis of categorical variables encoded as 0–1 (or dummy variables), a better measure of similarity between two observations can be achieved by counting the number of variables with matching values. The simplest overlap measure is called the matching coefficient and is computed as follows:

MATCHING COEFFICIENT

\[ \frac{\text{number of variables with matching values for observations } u \text{ and } v}{\text{total number of variables}} \]

Subtracting the matching coefficient from 1 results in a distance measure for binary variables. The matching distance between observations u and v (consisting entirely of binary variables) is

\[
\begin{aligned}
d_{uv}^{\text{match}} &= 1 - \text{matching coefficient} \\
&= \frac{\text{total number of variables}}{\text{total number of variables}} - \frac{\text{number of variables with matching values}}{\text{total number of variables}} \\
&= \frac{\text{number of variables with mismatching values}}{\text{total number of variables}}
\end{aligned}
\]


One weakness of the matching coefficient is that if two observations both have a “0” value for a categorical variable, this is counted as a sign of similarity between the two observations. However, matching “0” values do not necessarily imply similarity. For instance, if the categorical variable is Own A Minivan, then a “0” value in two different observations does not mean that these two people own the same type of car; it means only that neither owns a minivan. The analyst must then determine if “not owning a minivan” constitutes a meaningful notion of similarity between observations in the business context. To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching “0” values and is computed as follows:

JACCARD’S COEFFICIENT

\[ \frac{\text{number of variables with matching “1” values for observations } u \text{ and } v}{(\text{total number of variables}) - (\text{number of variables with matching “0” values for observations } u \text{ and } v)} \]

Subtracting Jaccard’s coefficient from 1 results in the Jaccard distance measure for binary variables. That is, the Jaccard distance between observations u and v (consisting entirely of binary variables) is

\[
\begin{aligned}
d_{uv}^{\text{jac}} &= 1 - \text{Jaccard’s coefficient} \\
&= \frac{(\text{total number of variables}) - (\text{number of variables with matching “0”})}{(\text{total number of variables}) - (\text{number of variables with matching “0”})} - \frac{\text{number of variables with matching “1”}}{(\text{total number of variables}) - (\text{number of variables with matching “0”})} \\
&= \frac{\text{number of variables with mismatching values}}{(\text{total number of variables}) - (\text{number of variables with matching “0”})}
\end{aligned}
\]

For five customer observations from the file DemoKTC, Table 5.1 contains observations of the binary variables Female, Married, Loan, and Mortgage and the dissimilarity matrixes based on the matching distance and Jaccard’s distance, respectively. Based on the matching distance, Observation 1 and Observation 4 are more similar (0.25) than Observation 2 and Observation 3 (0.5) because 3 out of 4 variable values match between Observation 1 and Observation 4 versus just 2 matching values out of 4 for Observation 2 and Observation 3. However, based on Jaccard’s distance, Observation 1 and Observation 4 are exactly as similar (0.5) as Observation 2 and Observation 3 (0.5) because Jaccard’s coefficient discards the matching zero values for the Loan and Mortgage variables for Observation 1 and Observation 4. In the context of this example, the choice of the matching distance or Jaccard’s distance depends on whether KTC believes that matching 0 entries imply similarity. That is, KTC must gauge whether meaningful similarity is implied if both customers in a pair are not female, are not married, do not have a car loan, or do not have a mortgage.
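The two binary distance measures can be sketched in Python as follows. The five rows are the Table 5.1 observations with their 0 entries written out explicitly, as implied by the reported distances and the descriptions of the customers in the text. The function names are illustrative, and the handling of a pair whose variables are all matching zeros (where Jaccard’s coefficient is undefined) is an assumption of this sketch.

```python
# Table 5.1 observations: (Female, Married, Loan, Mortgage).
observations = {
    1: (1, 0, 0, 0),
    2: (0, 1, 1, 1),
    3: (1, 1, 1, 0),
    4: (1, 1, 0, 0),
    5: (1, 1, 0, 0),
}

def matching_distance(u, v):
    """Share of variables whose values differ between u and v."""
    mismatches = sum(a != b for a, b in zip(u, v))
    return mismatches / len(u)

def jaccard_distance(u, v):
    """Like matching distance, but matching 0s are removed from the denominator."""
    mismatches = sum(a != b for a, b in zip(u, v))
    matching_zeros = sum(a == 0 and b == 0 for a, b in zip(u, v))
    denominator = len(u) - matching_zeros
    # Assumption: if every variable is a matching 0, report a distance of 0.
    return mismatches / denominator if denominator else 0.0

for i, j in [(1, 4), (2, 3), (1, 3)]:
    u, v = observations[i], observations[j]
    print(i, j, matching_distance(u, v), round(jaccard_distance(u, v), 3))
# 1 4 0.25 0.5
# 2 3 0.5 0.5
# 1 3 0.5 0.667
```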

Table 5.1 Comparison of Distance Matrixes for Observations with Binary Variables

Observation   Female   Married   Loan   Mortgage
1             1        0         0      0
2             0        1         1      1
3             1        1         1      0
4             1        1         0      0
5             1        1         0      0

Matrix of Matching Distances
Observation   1       2       3       4       5
1             0
2             1       0
3             0.5     0.5     0
4             0.25    0.75    0.25    0
5             0.25    0.75    0.25    0       0

Matrix of Jaccard Distances
Observation   1       2       3       4       5
1             0
2             1       0
3             0.667   0.5     0
4             0.5     0.75    0.333   0
5             0.5     0.75    0.333   0       0

k-Means Clustering

When considering observations consisting entirely of numerical variables, an approach called k-means clustering is commonly used to organize observations into similar groups. In k-means clustering, the analyst must specify the number of clusters, k. Given a value of k, the k-means algorithm begins by randomly assigning each observation to one of the k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated (these cluster centroids are the “means” of k-means clustering). Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid (where Euclidean distance is the standard metric). The algorithm repeats this process (calculate cluster centroids, assign each observation to the cluster with the nearest centroid) until there is no change in the clusters or a specified maximum number of iterations is reached.

As an unsupervised learning technique, cluster analysis is not guided by any explicit measure of accuracy, and thus the notion of a “good” clustering is subjective and is dependent on what the analyst hopes the cluster analysis will uncover. Regardless, one can measure the strength of a cluster by comparing the average distance between observations within the same cluster to the average distance between observations in different pairs of clusters. One rule of thumb is that the ratio of average between-cluster distance to average within-cluster distance should exceed 1.0 for useful clusters. If there is a wide disparity in the cluster strength across a collection of k clusters, it may be possible to find a better clustering of the data by removing all the observations of the strong clusters, and then continuing the clustering process on the remaining observations.

To illustrate k-means clustering, we consider a 3-means clustering of a small sample of KTC’s customer data in the file DemoKTC. Figure 5.3 shows three clusters based on customer income and age. Cluster 1 is characterized by relatively younger, lower-income customers (Cluster 1’s centroid is at [32.58, $20,364]). Cluster 2 is characterized by relatively older, higher-income customers (Cluster 2’s centroid is at [57.88, $47,729]). Cluster 3 is characterized by relatively older, lower-income customers (Cluster 3’s centroid is at [52.50, $21,416]). As visually corroborated by Figure 5.3, Table 5.2 shows that Cluster 2 is the smallest, but most heterogeneous, cluster. We also observe that Cluster 1 is the largest cluster and Cluster 3 is the most homogeneous cluster.
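The assign-and-update loop of the k-means algorithm described above can be sketched in plain Python. This is a minimal illustration, not the software used to produce the KTC results: the data points are hypothetical standardized (Age, Income) values, the fixed random seed is only for repeatability, and re-seeding an empty cluster with a randomly chosen observation is an assumption of this sketch that the text does not address.

```python
import random
from math import dist  # Euclidean distance, available in Python 3.8+

def mean_point(members):
    """Centroid: the variable-by-variable average of the member observations."""
    return tuple(sum(values) / len(members) for values in zip(*members))

def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iterations):
        # Step 2: calculate the centroid of each cluster.
        centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster is re-seeded with a random observation.
            centroids.append(mean_point(members) if members else rng.choice(points))
        # Step 3: reassign each observation to the cluster with the nearest centroid.
        new_labels = [nearest_centroid(p, centroids) for p in points]
        if new_labels == labels:  # no change in the clusters: stop
            break
        labels = new_labels
    return labels, centroids

# Hypothetical standardized (Age, Income) observations.
points = [(-1.2, -0.8), (-1.0, -0.6), (-1.1, -0.7),
          (0.9, 1.5), (1.1, 1.7), (1.0, 1.4),
          (0.6, -0.5), (0.8, -0.6), (0.7, -0.4)]

labels, centroids = k_means(points, k=3)
print(labels)
print(centroids)
```

Because the initial assignment is random, different seeds can lead to different final clusters, so in practice the algorithm is often run from several starting assignments.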
[FIGURE 5.3 Clustering Observations by Age and Income Using k-Means Clustering with k = 3. Horizontal axis: Age (years), 20 to 70; vertical axis: Income, $5,000 to $65,000; legend: Cluster 1, Cluster 2, Cluster 3.]

Cluster centroids are depicted by circles in Figure 5.3. Although Figure 5.3 is plotted in the original scale of the variables, the clustering was based on the variables after standardizing their values.

Table 5.2 Average Distances Within Clusters

            No. of Observations   Average Distance Between Observations in Cluster
Cluster 1   12                    0.886
Cluster 2   8                     1.051
Cluster 3   10                    0.731

Table 5.3 Average Distances Between Clusters

            Cluster 1   Cluster 2   Cluster 3
Cluster 1   -           2.812       1.629
Cluster 2   2.812       -           2.054
Cluster 3   1.629       2.054       -

Tables 5.2 and 5.3 are expressed in terms of standardized coordinates in order to eliminate any distortion resulting from differences in the scale of the input variables.

Table 5.3 displays the average distance between observations in different pairs of clusters to demonstrate how distinct the clusters are from each other. Cluster 1 and Cluster 2 are the most distinct from each other. To evaluate the strength of the clusters, we compare the average distance within each cluster (Table 5.2) to the average distances between clusters (Table 5.3). For example, although Cluster 2 is the most heterogeneous, with an average distance between observations of 1.051, comparing this to the average distance between Cluster 2 observations and Cluster 3 observations (2.054) reveals that on average an observation in Cluster 2 is approximately 1.95 times closer to Cluster 2 observations than to Cluster 3 observations (2.054/1.051 ≈ 1.95). In general, a clustering becomes more distinct as the ratio of the average between-cluster distance to the average within-cluster distance increases. Although qualitative considerations should take priority in evaluating clusters, using the ratios of the average between-cluster distance and the average within-cluster distance provides some guidance in evaluating a set of clusters.

If the number of clusters, k, is not clearly established by the context of the business problem, the k-means clustering algorithm can be repeated for several values of k to identify promising values. A common approach to quickly compare the effect of k is to consider its impact on the total sum of squared deviations from the observations to their assigned cluster centroid.

[FIGURE 5.4 Total Sum of Squared Deviations for Various Number of Clusters]

In Figure 5.4, we visualize the effect of varying the number of clusters (k = 1, ..., 6). In this figure, the blue curve connects the total sums of squared deviations for the various values of k. The red curve depicts the decrease in the total sum of squared deviations that occurs when increasing the number of clusters from k − 1 to k. We observe from Figure 5.4 that as the number of clusters increases, the total sum of squared deviations decreases. Indeed, if we allow the number of clusters to be equal to the number of observations, this sum of squared deviations is zero. Of course, placing each observation in its own cluster provides no insight into the similarity between observations, so minimizing the total sum of squared deviations is not the goal.

Figure 5.4 shows there is a large decrease in the total sum of squared deviations when k increases from 1 to 2, but the marginal decrease in the total sum of squared deviations decreases for further increases in k. From this plot, we see that the most promising values of k are 2, 3, or 4. The sets of clusters for these values of k should be examined more closely. In particular, an “elbow” occurs at k = 3, as this is the point beyond which the marginal decrease in the total sum of squared deviations flattens, suggesting this may be a good choice.
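The quantity plotted in Figure 5.4 can be computed as in the sketch below. The six points and the cluster assignments for each k are hypothetical and hand-chosen for illustration (they are not the KTC data and were not produced by running k-means); the function name `total_squared_deviation` is likewise an assumption.

```python
def total_squared_deviation(points, labels):
    """Sum of squared Euclidean distances from each observation to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(values) / len(members) for values in zip(*members)]
        total += sum((x - c) ** 2 for point in members for x, c in zip(point, centroid))
    return total

# Hypothetical observations forming three tight pairs.
points = [(0, 0), (0, 2), (10, 0), (10, 2), (20, 0), (20, 2)]

# Hand-chosen cluster assignments for several values of k.
assignments = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    6: [0, 1, 2, 3, 4, 5],  # every observation in its own cluster
}

for k, labels in assignments.items():
    print(k, total_squared_deviation(points, labels))
# 1 406.0
# 2 106.0
# 3 6.0
# 6 0.0
```

In this toy example the decrease is large from k = 1 to k = 2 and from k = 2 to k = 3, after which little remains to be gained, which is the kind of flattening the elbow heuristic looks for.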

Hierarchical Clustering and Measuring Dissimilarity Between Clusters

An alternative to partitioning observations with the k-means approach is an agglomerative hierarchical clustering approach that starts with each observation in its own cluster and then iteratively combines the two clusters that are the least dissimilar (most similar) into a single cluster. Each iteration corresponds to an increased level of aggregation by decreasing the number of distinct clusters. Hierarchical clustering determines the dissimilarity of two clusters by considering the distance between the observations in the first cluster and the observations in the second cluster. Given a way to measure distance between observations (e.g., Euclidean distance, Manhattan distance, matching distance, Jaccard distance), there are several agglomeration methods for comparing observations in two clusters to obtain a cluster dissimilarity measure. Using Euclidean distance to illustrate, Figure 5.5 provides a two-dimensional depiction of four agglomeration methods we will discuss.

[FIGURE 5.5 Two-dimensional depiction of four agglomeration methods for two clusters containing observations 1, 2, 3 and observations 4, 5, 6, respectively. Panels: Single Linkage, \(d_{3,4}\); Complete Linkage, \(d_{1,6}\); Group Average Linkage, \((d_{1,4} + d_{1,5} + d_{1,6} + d_{2,4} + d_{2,5} + d_{2,6} + d_{3,4} + d_{3,5} + d_{3,6})/9\); Centroid Linkage, \(d_{c_1,c_2}\).]

When using the single linkage agglomeration method, the dissimilarity between two clusters is defined by the distance between the pair of observations (one from each cluster) that are the most similar. Thus, single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster. However, a cluster formed by merging two clusters that are close with respect to single linkage may also consist of pairs of observations that are very different. The reason is that there is no consideration of how different an observation may be from other observations in a cluster as long as it is similar to at least one observation in that cluster. Thus, in two dimensions (variables), single linkage clustering can result in elongated clusters rather than compact, circular clusters.

The complete linkage agglomeration method defines the dissimilarity between two clusters as the distance between the pair of observations (one from each cluster) that are the most different. Thus, complete linkage will consider two clusters to be close if their most-different pair of observations are close. This method produces clusters such that all member observations of a cluster are relatively close to each other. The clusters produced by complete linkage have approximately equal diameters. However, complete linkage clustering can be distorted by outlier observations.

The single linkage and complete linkage methods define between-cluster dissimilarity based on the single pair of observations in two different clusters that are most similar or least similar. In contrast, the group average linkage agglomeration method defines the dissimilarity between two clusters to be the average distance computed over all pairs of observations between the two clusters. If Cluster 1 consists of \(n_1\) observations and Cluster 2 consists of \(n_2\) observations, the dissimilarity of these clusters would be the average of \(n_1 \times n_2\) distances. This method produces clusters that are less dominated by the dissimilarity between single pairs of observations. The median linkage method is analogous to group average linkage except that it uses the median distance (not the average) computed over all pairs of observations between the two clusters. The use of the median reduces the effect of outliers.
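The four pairwise-based agglomeration methods described so far differ only in how they summarize the \(n_1 \times n_2\) between-cluster distances. The sketch below makes this explicit for two hypothetical clusters of three observations each, using Euclidean distance; the data and function names are illustrative assumptions.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+
from statistics import mean, median

def pairwise_distances(cluster_a, cluster_b):
    """All n1 x n2 distances between one observation from each cluster."""
    return [dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))     # most similar pair

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))     # most different pair

def group_average_linkage(cluster_a, cluster_b):
    return mean(pairwise_distances(cluster_a, cluster_b))    # average over all pairs

def median_linkage(cluster_a, cluster_b):
    return median(pairwise_distances(cluster_a, cluster_b))  # median over all pairs

# Hypothetical clusters placed on a line so the distances are easy to verify by hand.
cluster_1 = [(0, 0), (1, 0), (2, 0)]
cluster_2 = [(4, 0), (6, 0), (9, 0)]

print(single_linkage(cluster_1, cluster_2))                   # 2.0
print(complete_linkage(cluster_1, cluster_2))                 # 9.0
print(round(group_average_linkage(cluster_1, cluster_2), 3))  # 5.333
print(median_linkage(cluster_1, cluster_2))                   # 5.0
```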


Centroid linkage uses the averaging concept of cluster centroids to define between-cluster dissimilarity. The centroid for cluster k, denoted as \(c_k\), is found by calculating the average value for each variable across all observations in a cluster; that is, a centroid is the average observation of a cluster. The dissimilarity between cluster k and cluster j is then defined as the distance between the centroids \(c_k\) and \(c_j\).

Ward’s method for merging clusters is based on the notion that representing a cluster with its centroid can be viewed as a loss of information in the sense that the individual differences in the observations within the cluster are not captured by the cluster centroid. For a pair of clusters under consideration for aggregation, Ward’s method computes the centroid of the resulting merged cluster and then calculates the sum of squared distances between this centroid and each observation in the union of the two clusters. At each iteration, Ward’s method merges the pair of clusters with the smallest value of this dissimilarity measure. As a result, hierarchical clustering using Ward’s method results in a sequence of aggregated clusters that minimizes this loss of information between the individual observation level and the cluster centroid level.

Similar to group average linkage, McQuitty’s method for merging clusters also bases the dissimilarity between two clusters on averaging, but computes the average in a different manner. To illustrate, suppose that at an iteration cluster A and cluster B are the most similar over the entire set of clusters and are therefore merged into cluster AB. For the next iteration, the dissimilarity between cluster AB and any other cluster C is updated as

\[ \frac{(\text{dissimilarity between A and C}) + (\text{dissimilarity between B and C})}{2} \]

This is different from group average linkage because this calculation is a simple average of two dissimilarity measures rather than the average dissimilarity over all pairs of observations between cluster AB and cluster C. By always computing the average dissimilarity between two clusters as a simple average of the two component dissimilarity measures, McQuitty’s method is implicitly placing different weights on the distances between the individual observations, whereas in group average linkage the distance between each pair of observations between two clusters is weighted equally.

Returning to our example, KTC is interested in developing customer segments based on gender, marital status, whether the customer is repaying a car loan, and whether the customer is repaying a mortgage. Using data in the file DemoKTC, we base the clusters on a collection of 0–1 categorical variables (Female, Married, Loan, and Mortgage). We use the matching distance to measure dissimilarity between observations and the group average linkage agglomeration method to measure dissimilarity between clusters. The choice of the matching distance (over Jaccard’s distance) is reasonable because a pair of customers who both have an entry of zero for any of these four variables share some degree of similarity. For example, two customers who both have zero entries for Mortgage have in common that neither has the significant debt associated with a mortgage.

Figure 5.6 depicts a dendrogram to visually summarize the output from a hierarchical clustering using matching distance to measure dissimilarity between observations and the group average linkage agglomeration method to measure dissimilarity between clusters. A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation. The horizontal axis of the dendrogram lists the observation indexes.
The vertical axis of the dendrogram represents the dissimilarity (distance) resulting from a merger of two different groups of observations. Each blue horizontal line in the dendrogram represents a merger of two (or more) clusters, where the observations composing the merged clusters are connected to the blue horizontal line with a blue vertical line. For example, the blue horizontal line connecting observations 4, 5, 6, 11, 19, and 28 conveys that these six observations are grouped together and the resulting cluster has a dissimilarity measure of 0. A dissimilarity of 0 results from this merger because these six observations have identical values for the Female, Married, Loan, and Mortgage variables. In this case, each of these six observations corresponds to a married female with no car loan and no mortgage. Following the blue vertical line up from the cluster of {4, 5, 6, 11, 19, 28}, another blue horizontal line connects this cluster with the cluster consisting solely of observation 1. Thus, the cluster {4, 5, 6, 11, 19, 28} and cluster {1} are merged, resulting in a dissimilarity of 0.25. The dissimilarity of 0.25 results from this merger because observation 1 differs in one out of the four categorical variable values; observation 1 is an unmarried female with no car loan and no mortgage.

[FIGURE 5.6 Dendrogram for KTC Using Matching Distance and Group Average Linkage]

To interpret a dendrogram at a specific level of aggregation, it is helpful to visualize a horizontal line such as one of the black dashed lines we have drawn across Figure 5.6. The bottom horizontal black dashed line intersects with the vertical branches in the dendrogram three times; each intersection corresponds to a cluster containing the observations connected by the vertical branch that is intersected. The composition of these three clusters is

Cluster 1: {4, 5, 6, 11, 19, 28, 1, 7, 21, 22, 23, 30, 13, 17, 18, 15, 27} = 10 out of 17 female, 15 out of 17 married, no car loans, 5 out of 17 with mortgages
Cluster 2: {2, 26, 8, 10, 20, 25} = all males with car loans, 5 out of 6 married, 2 out of 6 with mortgages
Cluster 3: {3, 9, 14, 16, 12, 24, 29} = all females with car loans, 4 out of 7 married, 5 out of 7 with mortgages

These clusters segment KTC’s customers into three groups that could possibly indicate varying levels of responsibility, an important factor to consider when providing financial advice.

The nested construction of the hierarchical clusters allows KTC to identify different numbers of clusters and assess (often qualitatively) the implications. By sliding a horizontal line up or down the vertical axis of a dendrogram and observing the intersection of the horizontal line with the vertical dendrogram branches, an analyst can extract varying numbers of clusters. Note that sliding up to the position of the top horizontal black line in Figure 5.6 results in merging Cluster 2 with Cluster 3 into a single, more dissimilar, cluster. The vertical distance between the points of agglomeration is the “cost” of merging clusters in terms of decreased homogeneity within clusters. Thus, vertically elongated portions of the dendrogram represent mergers of more dissimilar clusters, and vertically compact portions of the dendrogram represent mergers of more similar clusters. A cluster’s durability (or strength) can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster. Figure 5.6 shows that the cluster consisting of {12, 24, 29} (single females with car loans and mortgages) is a very durable cluster in this example because the vertical line for this cluster is very long before it is merged with another cluster.

Hierarchical Clustering versus k-Means Clustering

Hierarchical clustering is a good choice in situations in which you want to easily examine solutions with a wide range of clusters. Hierarchical clusters are also convenient if you want to observe how clusters are nested. However, hierarchical clustering can be very sensitive to outliers, and clusters may change dramatically if observations are eliminated from (or added to) the data set. Hierarchical clustering may be a less appropriate option as the number of observations in the data set grows large because the procedure is relatively computationally expensive (it starts with each observation in its own cluster).

The k-means approach is a good option for clustering data on the basis of numerical variables, and is computationally efficient enough to handle an increasingly large number of observations. Recall that k-means clustering partitions the observations, which is appropriate if you are trying to summarize the data with k “average” observations that describe the data with the minimum amount of error. However, k-means clustering is generally not appropriate for categorical or ordinal data, for which an “average” is not meaningful.

For both hierarchical and k-means clustering, the selection of variables on which to base the clustering process is a critical aspect. Clustering should be based on a parsimonious set of variables, determined through a combination of context knowledge and experimentation with various variable combinations, that reveals interesting patterns in the data. As the number of variables upon which distance between observations is computed increases, all observations tend to become equidistant. That is, as the number of variables considered in a clustering approach increases, the distances between pairs of observations get larger (distance can only increase when adding variables to the calculation), but the relative differences in the distances between pairs of observations tend to get smaller.
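To see the tendency toward equidistance described above, the following minimal Python sketch generates random observations and compares pairwise Euclidean distances as the number of variables grows. The simulated data (uniform random values), the function names, and the "relative spread" summary are our own illustrative assumptions, not part of the KTC example.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean distance for every pair of observations."""
    return [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]


def relative_spread(num_variables, num_observations=30, seed=1):
    """Return (mean distance, (max - min) / mean) for simulated observations."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    points = [
        [rng.random() for _ in range(num_variables)]
        for _ in range(num_observations)
    ]
    distances = pairwise_distances(points)
    mean_distance = sum(distances) / len(distances)
    return mean_distance, (max(distances) - min(distances)) / mean_distance


for q in (2, 10, 100, 1000):
    mean_distance, spread = relative_spread(q)
    print(f"{q:>5} variables: mean distance {mean_distance:6.2f}, "
          f"relative spread {spread:5.2f}")
```

In runs of this sketch, the mean distance grows with the number of variables while the relative spread shrinks, which is one reason to favor a parsimonious set of clustering variables.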

Notes + Comments

1. Clustering observations based on both numerical and categorical variables (mixed data) can be challenging. Dissimilarity between observations with numerical variables is commonly computed using Euclidean distance. However, Euclidean distance is not well defined for categorical variables as the magnitude of the Euclidean distance measure between two category values will depend on the numerical encoding of the categories. There are elaborate methods beyond the scope of this book to try to address the challenge of clustering mixed data. Using the methods introduced in this section, there are two alternative approaches to clustering mixed data. The first approach is to decompose the clustering into two steps. The first step applies hierarchical clustering of the observations only on categorical variables using an appropriate measure (matching distance or Jaccard’s distance) to identify a set of “first-step” clusters. The second step is to apply k-means clustering (or hierarchical clustering again) separately to each of these “first-step” clusters using only the numerical variables. This decomposition approach is not fail-safe as it fixes clusters with respect to one variable type before clustering with respect to the other variable type, but it does allow the analyst to identify how the observations are similar or different with respect to the two variable types. A second approach to clustering mixed data is to numerically encode the categorical values (e.g., binary coding, ordinal coding) and then to standardize both the categorical and numerical variable values. To reflect relative importance of the variables, the analyst may experiment with various weightings of the variables and apply hierarchical or k-means clustering. This approach is very experimental and the variable weights are subjective.

2. When dealing with mixed data, instead of standardizing the variable values by replacing them with the corresponding z-scores, it is common to scale the numerical variable values between 0 and 1 so that they have values on the same scale as the binary-encoded categorical variables. To achieve this, u_j, the value of the jth variable in observation u, is replaced with:

\[ b(u_j) = \frac{u_j - \text{minimum value of } j\text{th variable}}{\text{maximum value of } j\text{th variable} - \text{minimum value of } j\text{th variable}} \]
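The 0-to-1 scaling in the second note can be sketched in a few lines of Python. The function name `min_max_scale`, the sample ages, and the convention of returning zeros for a variable with no variation are illustrative assumptions of ours.

```python
def min_max_scale(values):
    """Rescale a numerical variable so its values lie between 0 and 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumed convention: a constant variable carries no information.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


ages = [25, 40, 55, 70]        # hypothetical numerical variable
has_mortgage = [0, 1, 1, 0]    # hypothetical binary-encoded categorical variable

scaled_ages = min_max_scale(ages)
print(scaled_ages)             # smallest age maps to 0.0, largest to 1.0
```

After scaling, the numerical variable and the binary-encoded variable both range from 0 to 1, so neither dominates a distance calculation simply because of its units.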


5.2 Association Rules

Support is also sometimes expressed as the number (or count) of total transactions in the data containing an item set.

The data in Table 5.4 are in item list format; that is, each transaction row corresponds to a list of item names. Alternatively, the data can be represented in binary matrix format, in which each row is a transaction record and the columns correspond to each distinct item. A third approach is to store the data in stacked form in which each row is an ordered pair; the first entry is the transaction number and the second entry is the item.

In marketing, analyzing consumer behavior can lead to insights regarding the placement and promotion of products. Specifically, marketers are interested in examining transaction data on customer purchases to identify the products commonly purchased together. Bar-code scanners facilitate the collection of retail transaction data, and membership in a customer loyalty program can further associate the transaction with a specific customer. In this section, we discuss the development of probabilistic if–then statements, called association rules, which convey the likelihood of certain items being purchased together. Although association rules are an important tool in market basket analysis, they are also applicable to disciplines other than marketing. For example, association rules can assist medical researchers in understanding which treatments have been commonly prescribed for certain patient symptoms (and the resulting effects).

Hy-Vee grocery store would like to gain insight into its customers’ purchase patterns to possibly improve its in-aisle product placement and cross-product promotions. Table 5.4 contains a small sample of data in which each transaction comprises the items purchased by a shopper in a single visit to a Hy-Vee. An example of an association rule from this data would be “if {bread, jelly}, then {peanut butter},” meaning that “if a transaction includes bread and jelly, then it also includes peanut butter.” The collection of items (or item set) corresponding to the if portion of the rule, {bread, jelly}, is called the antecedent. The item set corresponding to the then portion of the rule, {peanut butter}, is called the consequent. Typically, only association rules for which the consequent consists of a single item are considered because these are more actionable.
Table 5.4 Shopping-Cart Transactions

Transaction   Shopping Cart
1             bread, peanut butter, milk, fruit, jelly
2             bread, jelly, soda, potato chips, milk, fruit, vegetables, peanut butter
3             whipped cream, fruit, chocolate sauce, beer
4             steak, jelly, soda, potato chips, bread, fruit
5             jelly, soda, peanut butter, milk, fruit
6             jelly, soda, potato chips, milk, bread, fruit
7             fruit, soda, potato chips, milk
8             fruit, soda, peanut butter, milk
9             fruit, cheese, yogurt
10            yogurt, vegetables, beer

Although the number of possible association rules can be overwhelming, we typically investigate only association rules that involve antecedent and consequent item sets that occur together frequently. To formalize the notion of “frequent,” we define the support of an item set as the percentage of transactions in the data that include that item set. In Table 5.4, the support of {bread, jelly} is 4/10 = 0.4. For a transaction randomly selected from the data set displayed in Table 5.4, the probability of it containing the item set {bread, jelly} is 0.4. The potential impact of an association rule is often governed by the number of transactions it may affect, which is measured by computing the support of the item set consisting of the union of its antecedent and consequent. Investigating the rule “if {bread, jelly}, then {peanut butter}” from Table 5.4, we see the support of {bread, jelly, peanut butter} is 0.2. For a transaction randomly selected from the data set displayed in Table 5.4, the probability of it containing the item set {bread, jelly, peanut butter} is 0.2. By only considering rules involving item sets with a support above a minimum level, inexplicable rules capturing random noise in the data can generally be avoided. A rule of thumb is to
consider only association rules with a support of at least 20% of the total number of transactions. If an item set is particularly valuable and represents a lucrative opportunity, then the minimum support used to filter the rules can be lowered.

A property of a reliable association rule is that, given a transaction contains the antecedent item set, there is a high probability that it contains the consequent item set. This conditional probability, P(consequent item set | antecedent item set), is called the confidence of a rule. The definition of confidence follows from the definition of conditional probability discussed in Chapter 4. Confidence is computed as

CONFIDENCE

\[ P(\text{consequent} \mid \text{antecedent}) = \frac{P(\text{consequent and antecedent})}{P(\text{antecedent})} = \frac{\text{support of \{consequent and antecedent\}}}{\text{support of antecedent}} \]

Although a high value of confidence suggests a rule in which the consequent is frequently true when the antecedent is true, a high value of confidence can be misleading. For example, if the support of the consequent is high (that is, the item set corresponding to the then part is very frequent), then the confidence of the association rule could be high even if there is little or no association between the items. In Table 5.4, the rule “if {cheese}, then {fruit}” has a confidence of 1.0 (or 100%). This is misleading because {fruit} is a frequent item; almost any rule with {fruit} as the consequent will have high confidence. Therefore, to evaluate the efficiency of a rule, we need to compare P(consequent | antecedent) to P(consequent) to determine how much more likely the consequent item set is given the antecedent item set versus just the overall (unconditional) likelihood that a transaction contains the consequent. The ratio of P(consequent | antecedent) to P(consequent) is called the lift ratio of the rule and is computed as:

LIFT RATIO

\[ \frac{P(\text{consequent} \mid \text{antecedent})}{P(\text{consequent})} = \frac{P(\text{consequent and antecedent})}{P(\text{consequent}) \times P(\text{antecedent})} = \frac{\text{confidence of rule}}{\text{support of consequent}} \]

Thus, the lift ratio represents how effective an association rule is at identifying transactions in which the consequent item set occurs versus a randomly selected transaction. A lift ratio greater than one suggests that there is some usefulness to the rule and that it is better at identifying cases when the consequent occurs than having no rule at all. From the definition of lift ratio, we see that the denominator contains the probability of a transaction containing the consequent set multiplied by the probability of a transaction containing the antecedent set. This product of probabilities is equivalent to the expected likelihood of a transaction containing both the consequent item set and antecedent item set if these item sets were independent. In other words, a lift ratio greater than one suggests that the level of association between the antecedent and consequent is higher than would be expected if these item sets were independent. For the data in Table 5.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and lift ratio = 0.5/0.4 = 1.25. In other words, a customer who purchased both bread and jelly is 25% more likely to have purchased peanut butter than a randomly selected customer.

The utility of a rule depends on both its support and its lift ratio. Although a high lift ratio suggests that the rule is very efficient at finding when the consequent occurs, if it has a very low support, the rule may not be as useful as another rule that has a lower lift ratio but affects a large number of transactions (as demonstrated by a high support). However, an association rule with a high lift ratio and low support may still be useful if the consequent represents a very valuable opportunity.

Based on the data in Table 5.4, Table 5.5 shows the list of association rules that achieve a lift ratio of at least 1.39 while satisfying a minimum support of 40% and a minimum confidence of 50%. The top rules in Table 5.5 suggest that bread, fruit, and jelly are commonly associated items. For example, the fourth rule listed in Table 5.5 states, “If Fruit and Jelly are purchased, then Bread is also purchased.” Perhaps Hy-Vee could consider a promotion and/or product placement to leverage this perceived relationship.

Table 5.5 Association Rules for Hy-Vee

Antecedent (A)         Consequent (C)         Support for A   Support for C   Support for A & C   Confidence   Lift Ratio
Bread                  Fruit, Jelly           0.40            0.50            0.40                1.00         2.00
Bread                  Jelly                  0.40            0.50            0.40                1.00         2.00
Bread, Fruit           Jelly                  0.40            0.50            0.40                1.00         2.00
Fruit, Jelly           Bread                  0.50            0.40            0.40                0.80         2.00
Jelly                  Bread                  0.50            0.40            0.40                0.80         2.00
Jelly                  Bread, Fruit           0.50            0.40            0.40                0.80         2.00
Fruit, Potato Chips    Soda                   0.40            0.60            0.40                1.00         1.67
Peanut Butter          Milk                   0.40            0.60            0.40                1.00         1.67
Peanut Butter          Milk, Fruit            0.40            0.60            0.40                1.00         1.67
Peanut Butter, Fruit   Milk                   0.40            0.60            0.40                1.00         1.67
Potato Chips           Fruit, Soda            0.40            0.60            0.40                1.00         1.67
Potato Chips           Soda                   0.40            0.60            0.40                1.00         1.67
Fruit, Soda            Potato Chips           0.60            0.40            0.40                0.67         1.67
Milk                   Peanut Butter          0.60            0.40            0.40                0.67         1.67
Milk                   Peanut Butter, Fruit   0.60            0.40            0.40                0.67         1.67
Milk, Fruit            Peanut Butter          0.60            0.40            0.40                0.67         1.67
Soda                   Fruit, Potato Chips    0.60            0.40            0.40                0.67         1.67
Soda                   Potato Chips           0.60            0.40            0.40                0.67         1.67
Fruit, Soda            Milk                   0.60            0.60            0.50                0.83         1.39
Milk                   Fruit, Soda            0.60            0.60            0.50                0.83         1.39
Milk                   Soda                   0.60            0.60            0.50                0.83         1.39
Milk, Fruit            Soda                   0.60            0.60            0.50                0.83         1.39
Soda                   Milk                   0.60            0.60            0.50                0.83         1.39
Soda                   Milk, Fruit            0.60            0.60            0.50                0.83         1.39

Data files: HyVeeDemoBinary, HyVeeDemoStacked
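The support, confidence, and lift ratio calculations for the Hy-Vee data can be checked with a short Python sketch. The transactions are those of Table 5.4 stored in item list format; the function names are our own, and this is only one simple way to organize the arithmetic, not a full rule-mining procedure.

```python
# Table 5.4 in item list format: one set of items per transaction.
transactions = [
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "soda", "potato chips", "milk", "fruit",
     "vegetables", "peanut butter"},
    {"whipped cream", "fruit", "chocolate sauce", "beer"},
    {"steak", "jelly", "soda", "potato chips", "bread", "fruit"},
    {"jelly", "soda", "peanut butter", "milk", "fruit"},
    {"jelly", "soda", "potato chips", "milk", "bread", "fruit"},
    {"fruit", "soda", "potato chips", "milk"},
    {"fruit", "soda", "peanut butter", "milk"},
    {"fruit", "cheese", "yogurt"},
    {"yogurt", "vegetables", "beer"},
]


def support(item_set):
    """Fraction of transactions that contain every item in item_set."""
    item_set = set(item_set)
    return sum(item_set <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A and C) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift_ratio(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)


rule = ({"bread", "jelly"}, {"peanut butter"})
print(f"support    = {support(rule[0] | rule[1]):.2f}")   # 0.20
print(f"confidence = {confidence(*rule):.2f}")            # 0.50
print(f"lift ratio = {lift_ratio(*rule):.2f}")            # 1.25
```

Calling `lift_ratio({"fruit", "jelly"}, {"bread"})` in the same way gives 2.00, matching the fourth rule in Table 5.5, while `lift_ratio({"cheese"}, {"fruit"})` is only about 1.11 despite that rule's confidence of 1.0.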

Evaluating Association Rules

Although explicit measures such as support, confidence, and lift ratio can help filter association rules, an association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets. For example, suppose Walmart mined its transactional data to uncover strong evidence of the association rule, “If a customer purchases a Barbie doll, then a customer also purchases a candy bar.” Walmart could leverage this relationship in product placement decisions as well as in advertisements and promotions, perhaps by placing a high-margin candy-bar display near the Barbie dolls. However, we must be aware that association rule analysis often results in obvious relationships such as “If a customer purchases hamburger patties, then a customer also purchases hamburger buns,” which may be true but provide no new insight. Association rules with a weak support measure often are inexplicable. For an association rule to be useful, it must be well supported and explain an important previously unknown relationship. The support of an association rule can generally be improved by basing it on less specific antecedent and consequent item sets. Unfortunately, association rules based on less specific item sets tend to yield less insight. Adjusting the data by aggregating items into more general categories (or splitting items into more specific categories) so that items occur in roughly the same number of transactions often yields better association rules.

5.3 Text Mining

Every day, nearly 500 million tweets are published on the online social network service Twitter. Many of these tweets contain important clues about how Twitter users value a company’s products and services. Some tweets might sing the praises of a product; others might complain about low-quality service. Furthermore, Twitter users vary greatly in the number of followers (some have thousands of followers and others just a few) and therefore these users have varying degrees of influence. Data-savvy companies can use social media data to improve their products and services. Online reviews on web sites such as Amazon and Yelp provide data on how customers feel about products and services. However, the data in these examples are not numerical. The data are text: words, phrases, sentences, and paragraphs. Text, like numerical data, may contain information that can help solve problems and lead to better decisions. Text mining is the process of extracting useful information from text data. In this section, we discuss text mining, how it is different from data mining of numerical data, and how it can be useful for decision making.

Text data is often referred to as unstructured data because in its raw form, it cannot be stored in a traditional structured database (with observations in rows and variables in columns). Audio and video data are also examples of unstructured data. Data mining with text data is more challenging than data mining with traditional numerical data, because it requires more preprocessing to convert the text to a format amenable for analysis. However, once the text data has been converted to numerical data, we can apply the data mining methods discussed earlier in this chapter. We begin with a small example that illustrates how text data can be converted to numerical data and then analyzed. Then we will provide more in-depth discussion of text-mining concepts and preprocessing procedures.

Voice of the Customer at Triad Airline

Triad Airlines is a regional commuter airline. Through its voice of the customer program, Triad solicits feedback from its customers through a follow-up e-mail the day after the customer has completed a flight. The e-mail survey asks the customer to rate various aspects of the flight and asks the respondent to type comments into a dialog box in the e-mail. In addition to the quantitative feedback from the ratings, the comments entered by the respondents need to be analyzed so that Triad can better understand its customers’ specific concerns and respond in an appropriate manner. Table 5.6 contains a small training sample of these comments we will use to illustrate how descriptive text mining can be used in this business context.

TABLE 5.6 Ten Respondents’ Comments for Triad Airlines

Document   Comments
1          The wi-fi service was horrible. It was slow and cut off several times.
2          My seat was uncomfortable.
3          My flight was delayed 2 hours for no apparent reason.
4          My seat would not recline.
5          The man at the ticket counter was rude. Service was horrible.
6          The flight attendant was rude. Service was bad.
7          My flight was delayed with no explanation.
8          My drink spilled when the guy in front of me reclined his seat.
9          My flight was canceled.
10         The arm rest of my seat was nasty.

Data file: Triad

In the text mining domain, a contiguous piece of text is referred to as a document. A document can be a single sentence or an entire book, depending on how the text is organized for analysis. Each document is composed of individual terms, which often correspond to words. In general, a collection of text documents to be analyzed is called a corpus. In the Triad Airline example, our corpus consists of 10 documents, where each document is a single customer’s comments. Triad’s management would like to categorize these customer comments into groups whose member comments share similar characteristics so that a focused solution team can be assigned to each group of comments.

Preprocessing text can be viewed as representation engineering. To be analyzed, text data needs to be converted to structured data (rows and columns of numerical data) so that the tools of descriptive statistics, data visualization, and data mining can be applied. Considering each document as an observation (row in a data set), we wish to represent the text in the document with variables (or columns in a data set). One common approach, called bag of words, treats every document as just a collection of individual words (or terms). Bag of words is a simple (but often effective) approach that ignores natural language processing aspects such as grammar, word order, and sentence structure. In bag of words, we can think of converting a group of documents into a matrix of rows and columns where the rows correspond to a document and the columns correspond to a particular word. In Triad’s case, a document is a single respondent’s comment. A presence/absence or binary document-term matrix is a matrix with the rows representing documents and the columns representing words, and the entries in the columns indicating either the presence or the absence of a particular word in a particular document (1 = present and 0 = not present).

If the document-term matrix is transposed (so that the terms are in the rows and the documents are in the columns), the resulting matrix is referred to as a term-document matrix.

Creating the list of terms to use in the presence/absence matrix can be a complicated matter. Too many terms result in a matrix with many columns, which may be difficult to manage and could yield meaningless results. Too few terms may miss important relationships. Often, term frequency along with the problem context are used as a guide. We discuss this in more detail in the next section. In Triad’s case, management used word frequency and the context of having a goal of satisfied customers to come up with the following list of terms they feel are relevant for categorizing the respondents’ comments: delayed, flight, horrible, recline, rude, seat, and service.

As shown in Table 5.7, these seven terms correspond to the columns of the presence/absence document-term matrix and the rows correspond to the 10 documents. Each matrix entry indicates whether or not a column’s term appears in the document corresponding to the row. For example, a 1 entry in the first row and third column means that the term “horrible” appears in document 1.
A zero entry in the third row and fourth column means that the term “recline” does not appear in document 3.

Having converted the text to numerical data, we can apply clustering. In this case, because we have binary presence-absence data, we apply hierarchical clustering. Observing that the absence of a term in two different documents does not imply similarity between the documents, we select Jaccard’s distance to measure dissimilarity between observations (documents). To measure dissimilarity between clusters, we use the complete linkage agglomeration method. At the level of three clusters, hierarchical clustering results in the following groups of documents:

Cluster 1: {1, 5, 6} = documents discussing service issues
Cluster 2: {2, 4, 8, 10} = documents discussing seat issues
Cluster 3: {3, 7, 9} = documents discussing schedule issues

With these three clusters defined, management can assign an expert team to each of these clusters to directly address the concerns of its customers.

TABLE 5.7 The Presence/Absence Document-Term Matrix for Triad Airlines

                                    Term
Document   Delayed   Flight   Horrible   Recline   Rude   Seat   Service
1          0         0        1          0         0      0      1
2          0         0        0          0         0      1      0
3          1         1        0          0         0      0      0
4          0         0        0          1         0      1      0
5          0         0        1          0         1      0      1
6          0         1        0          0         1      0      1
7          1         1        0          0         0      0      0
8          0         0        0          1         0      1      0
9          0         1        0          0         0      0      0
10         0         0        0          0         0      1      0
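The construction of the presence/absence document-term matrix and the Jaccard distances between documents can be sketched in Python as follows. The comments and term list are taken from the Triad example; the prefix-matching rule (used as a crude stand-in for stemming so that “reclined” counts as “recline”) and the function names are our own assumptions. Jaccard's distance is computed here as one minus the number of terms present in both documents divided by the number of terms present in at least one of them.

```python
import re

comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]
terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]


def presence_row(comment):
    """Return one 0/1 row of the document-term matrix for a comment."""
    words = re.findall(r"[a-z]+", comment.lower())
    # Assumed shortcut: a term is present if any word starts with it.
    return [int(any(w.startswith(t) for w in words)) for t in terms]


matrix = [presence_row(c) for c in comments]


def jaccard_distance(u, v):
    """1 - (terms present in both) / (terms present in at least one)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1 - both / either if either else 0.0


for doc_number, row in enumerate(matrix, start=1):
    print(doc_number, row)

print(round(jaccard_distance(matrix[0], matrix[4]), 3))  # documents 1 and 5: 0.333
print(jaccard_distance(matrix[2], matrix[6]))            # documents 3 and 7: 0.0
```

Documents 3 and 7 contain exactly the same terms, so their distance is zero, whereas documents 1 and 2 share no terms and are at the maximum distance of one.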

Preprocessing Text Data for Analysis

In general, the text mining process converts unstructured text into numerical data and applies data mining techniques. For the Triad example, we converted the text documents into a document-term matrix and then applied hierarchical clustering to gain insight on the different types of comments. In this section, we present a more detailed discussion of terminology and methods used in preprocessing text data into numerical data for analysis.

Converting documents to a document-term matrix is not a simple task. Obviously, which terms become the headers of the columns of the document-term matrix can greatly impact the analysis. Tokenization is the process of dividing text into separate terms, referred to as tokens. The process of identifying tokens is not straightforward and involves term normalization, a set of natural language processing techniques to map text into a standardized form. First, symbols and punctuation must be removed from the document and all letters should be converted to lowercase. For example, “Awesome!”, “awesome,” and “#Awesome” should all be converted to “awesome.” Likewise, different forms of the same word, such as “stacking,” “stacked,” and “stack” probably should not be considered as distinct terms. Stemming, the process of converting a word to its stem or root word, would drop the “ing” and “ed” suffixes and place only “stack” in the list of terms to be tracked.

The goal of preprocessing is to generate a list of the most relevant terms that is sufficiently small so as to lend itself to analysis. In addition to stemming, frequency can be used to eliminate words from consideration as tokens. For example, if a term occurs very frequently in every document in the corpus, then it probably will not be very useful and can be eliminated from consideration.
Text mining software contains procedures to automatically remove stopwords, very common words in English (or whatever language is being analyzed), such as “the,” “and,” and “of.” Similarly, low-frequency words probably will not be very useful as tokens. Another technique for reducing the consideration set for tokens is to consolidate a set of words that are synonyms. For example, “courteous,” “cordial,” and “polite” might be best represented as a single token, “polite.” In addition to automated stemming and text reduction via frequency and synonyms, most text-mining software gives the user the ability to manually specify terms to include or exclude as tokens. Also, the use of slang, humor, irony, and sarcasm can cause interpretation problems and might require more sophisticated data cleansing and subjective intervention on the part of the analyst to avoid misinterpretation. Data preprocessing parses the original text data down to the set of tokens deemed relevant for the topic being studied. Based on these tokens, a presence/absence document-term matrix such as in Table 5.7 can be generated.
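The preprocessing steps described in this section (lowercasing, removing symbols and punctuation, removing stopwords, stemming, and consolidating synonyms) can be strung together in a minimal Python sketch. The tiny stopword list, the suffix-stripping rule, and the synonym map below are illustrative assumptions only; text-mining software uses far more complete lists and more careful stemming algorithms.

```python
import re

STOPWORDS = {"the", "and", "of", "a", "was", "is", "to"}   # tiny illustrative list
SYNONYMS = {"courteous": "polite", "cordial": "polite"}    # map synonyms to one token


def stem(word):
    """Crude stemming: drop an 'ing' or 'ed' suffix if a stem of 3+ letters remains."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Convert raw text into a list of normalized tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())        # lowercase, strip symbols
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    tokens = [stem(t) for t in tokens]                  # reduce words to stems
    return [SYNONYMS.get(t, t) for t in tokens]         # consolidate synonyms


print(preprocess("Awesome! awesome, #Awesome"))
print(preprocess("stacking, stacked, and stack"))
print(preprocess("The crew was courteous and cordial."))
```

A rule this simple will also produce awkward stems (for example, “reclined” becomes “reclin”), which is one reason most software lets the analyst manually include or exclude terms.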


When the documents in a corpus contain many more words than the brief comments in the Triad Airline example, and when the frequency of word occurrence is important to the context of the business problem, preprocessing can be used to develop a frequency document-term matrix. A frequency document-term matrix is a matrix whose rows represent documents and columns represent tokens, and the entries in the matrix are the frequency of occurrence of each token in each document. We illustrate this in the following example.

Movie Reviews

A new action film has been released and we now have a sample of 10 reviews from movie critics. Using preprocessing techniques, including text reduction by synonyms, we have reduced the number of tokens to only two: “great” and “terrible.” Table 5.8 displays the corresponding frequency document-term matrix.

TABLE 5.8 The Frequency Document-Term Matrix for Movie Reviews

                 Term
Document   Great   Terrible
1          (entries 5 and 0; column assignment not legible in the source)
2          5       1
3          5       1
4          3       3
5          5       1
6          (entries 5 and 0; column assignment not legible in the source)
7          4       1
8          5       3
9          1       3
10         1       2

FIGURE 5.7 Two Clusters Using k-Means Clustering on Movie Reviews (scatter plot of the 10 reviews; horizontal axis: Great; vertical axis: Terrible; legend: Cluster 1, Cluster 2)

As Table 5.8 shows, the token “great” appears four times in Document 7. Reviewing the entire table, we observe that five is the maximum frequency of a token in a document and zero is the minimum frequency. To demonstrate the analysis of a frequency document-term matrix with descriptive data mining, we apply k-means clustering with k = 2 to the frequency document-term matrix to obtain the two clusters in Figure 5.7. Cluster 1 contains reviews that tend to be negative
and Cluster 2 contains reviews that tend to be positive. We note that the observation (3, 3) corresponds to the balanced review of Document 4; based on this small corpus, the balanced review is more similar to the positive reviews than the negative reviews, suggesting that the negative reviews may tend to be more extreme.

Table 5.8 shows the raw counts (frequencies) of terms. When documents in a corpus substantially vary in length (number of terms), it is common to adjust for document length by dividing the raw term frequencies by the total number of terms in the document. Term frequency (whether based on raw count or relative frequency to account for document length) is a text mining measure that pertains to an individual document. For a particular document, an analyst may also be interested in how unique a term is relative to the other documents in the corpus. As previously mentioned, it is common to impose lower and upper limits on the term frequencies to filter out terms that are extremely rare or extremely common. In addition to those mechanisms, an analyst may be interested in weighting terms based on their distribution over the corpus. The logic is that the fewer the documents in which a term occurs, the more likely it is to be potentially insightful to the analysis of the documents it does occur in.

A popular measure for weighting terms based on frequency and uniqueness is term frequency times inverse document frequency (TFIDF). The TFIDF of a term t in document d is computed as follows:

\[ \mathrm{TFIDF}(t, d) = \text{term frequency} \times \text{inverse document frequency} = (\text{number of times term } t \text{ appears in document } d) \times \left[ 1 + \log\left( \frac{\text{total number of documents in corpus}}{\text{number of documents containing term } t} \right) \right] \]

The inverse document frequency portion of the TFIDF calculation gives a boost to terms for being unique. Thus, TFIDF(t, d) scores a term t more highly when it appears frequently in document d and does not appear frequently in other documents. Consequently, basing a document-term matrix on TFIDF can make the unique terms (with respect to their frequency) more pronounced. For five different possible term frequencies of term t in a document d (represented by TF(t,d)), Figure 5.8 displays the TFIDF(t,d) value as the number of total documents containing the term t ranges from one document to 100 documents (in a corpus of 100 documents).

FIGURE 5.8 TFIDF(t,d) for Varying Levels of Term Frequency and Document Sparseness (line chart; horizontal axis: Number of Documents Containing Term t, from 1 to 100; vertical axis: TFIDF(t, d); one curve for each of TF(t,d) = 1, 2, 3, 4, and 5)
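The TFIDF calculation can be sketched as a small Python function. We assume here that the logarithm is the natural logarithm, which appears consistent with the vertical range of Figure 5.8 (other software may use a different base or a slightly different formula); the function and argument names are our own.

```python
import math


def tfidf(term_count, docs_with_term, total_docs):
    """Term frequency times inverse document frequency.

    term_count:     number of times term t appears in document d
    docs_with_term: number of documents in the corpus containing term t
    total_docs:     total number of documents in the corpus
    Assumes the natural logarithm.
    """
    return term_count * (1 + math.log(total_docs / docs_with_term))


# A corpus of 100 documents, as in Figure 5.8.
for docs_with_term in (1, 10, 50, 100):
    values = [round(tfidf(tf, docs_with_term, 100), 2) for tf in range(1, 6)]
    print(f"{docs_with_term:>3} documents contain t: {values}")
```

When every document contains the term, the logarithm is zero and TFIDF reduces to the raw term frequency; the rarer the term is across the corpus, the larger the boost.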

Computing Dissimilarity Between Documents

After preprocessing text and the conversion of a corpus into a frequency document-term matrix (based on raw frequency, relative frequency, or TFIDF), the notion of distance between observations discussed earlier in this chapter applies. In this case, the distance between observations corresponds to the dissimilarity between documents. To measure the dissimilarity between text documents, cosine distance is commonly used. Cosine distance is computed by:

\[ d^{\cos}_{uv} = 1 - \frac{u_1 v_1 + \cdots + u_q v_q}{\sqrt{u_1^2 + \cdots + u_q^2}\,\sqrt{v_1^2 + \cdots + v_q^2}} \]

The cosine distance between document 3 and document 10 in Table 5.8 is:

\[ d^{\cos}_{uv} = 1 - \frac{5 \times 1 + 1 \times 2}{\sqrt{5^2 + 1^2}\,\sqrt{1^2 + 2^2}} = 0.386 \]

To visualize the cosine distance between two observations, in this case (5, 1) and (1, 2), Figure 5.9 represents these observations as vectors emanating from the origin. The cosine distance between two observations is equal to one minus the cosine of the angle between their corresponding vectors. Cosine distance can be particularly useful for analyzing a frequency document-term matrix because the angle between two observation vectors does not depend on the magnitude of the variables (making it different from the distance measures discussed earlier in this chapter). This allows cosine distance to measure dissimilarity in frequency patterns rather than frequency magnitude. For instance, the cosine distance between the observation (10, 2) and observation (1, 2) is the same as the cosine distance between (5, 1) and (1, 2), because (10, 2) points in the same direction as (5, 1).
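The cosine distance calculation, one minus the dot product of two observation vectors divided by the product of their lengths, can be checked with a short Python sketch. The function name `cosine_distance` is our own choice for illustration; the observations are the ones used in the text.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

d_original = cosine_distance((5, 1), (1, 2))   # observations from the text
d_scaled = cosine_distance((10, 2), (1, 2))    # first vector doubled in magnitude

print(round(d_original, 3))  # 0.386
print(round(d_scaled, 3))    # 0.386
```

Doubling every frequency in one document leaves the distance unchanged, which is the scale-invariance property described above.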

Word Clouds

A word cloud is a visual representation of a document or set of documents in which the size of the word is proportional to the frequency with which the word appears. Figure 5.10 displays a word cloud of Chapter 1 of this textbook.

FIGURE 5.9 Visualization of Cosine Distance
[Vector plot. Horizontal axis: Frequency of Term “Great” (0 to 5). Vertical axis: Frequency of Term “Terrible” (0 to 2).]


FIGURE 5.10 A Word Cloud of Chapter 1 of this Text
[Word cloud. The largest words include analytics, data, decision, models, prescriptive, and predictive.]

As can be seen from this word cloud, analytics and data are used most frequently in Chapter 1. Other frequently mentioned words are decision, models, prescriptive, and predictive. The word cloud gives a quick visual sense of the document's content. Using word clouds on tweets, for example, can provide insight into trending topics.
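The quantity behind a word cloud is simply a count of how often each term appears. The following Python sketch computes those counts for a short made-up passage; the sample text, the function name `term_frequencies`, and the simple lowercase tokenization rule are all assumptions for illustration, and a full text-mining workflow would also remove stopwords and apply stemming as described earlier.

```python
from collections import Counter
import re

def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical sample text (not taken from Chapter 1).
sample = "Data drives analytics. Analytics turns data into decisions, and data wins."
freqs = term_frequencies(sample)
top = freqs.most_common(2)
print(top)  # [('data', 3), ('analytics', 2)]
```

A word cloud tool would then draw each word at a size proportional to its count.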

Summary

We have introduced descriptive data-mining methods and related concepts. After introducing how to measure the similarity of individual observations, we presented two methods for grouping observations based on the similarity of their respective variable values: hierarchical clustering and k-means clustering. Agglomerative hierarchical clustering begins with each observation in its own cluster and iteratively aggregates clusters using a specified agglomeration method. We described several of these agglomeration methods and discussed their features. In k-means clustering, the analyst specifies k, the number of clusters, and observations then are placed into these clusters in an attempt to minimize the dissimilarity within the clusters. We concluded our discussion of clustering with a comparison of hierarchical clustering and k-means clustering.

We then introduced association rules and explained their use for identifying patterns across transactions, particularly in retail data. We defined the concepts of support count, confidence, and lift ratio, and we described their utility in gleaning actionable insight from association rules.

Finally, we discussed the text-mining process. Text is first preprocessed by deriving a smaller set of tokens from the larger set of words contained in a collection of documents. The tokenized text data is then converted into a presence/absence document-term matrix or a frequency document-term matrix. We then demonstrated the application of hierarchical clustering on a binary document-term matrix and k-means clustering on a frequency document-term matrix to glean insight from the underlying text data.

Glossary

Antecedent The item set corresponding to the if portion of an if–then association rule.
Association rule An if–then statement describing the relationship between item sets.


Bag of words An approach for processing text into a structured row-column data format in which documents correspond to row observations and words (or more specifically, terms) correspond to column variables.
Binary document-term matrix A matrix with the rows representing documents (units of text) and the columns representing terms (words or word roots), and the entries in the columns indicating either the presence or absence of a particular term in a particular document (1 = present and 0 = not present).
Centroid linkage Method of calculating dissimilarity between clusters by considering the two centroids of the respective clusters.
Complete linkage Measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations between the two clusters.
Confidence The conditional probability that the consequent of an association rule occurs given the antecedent occurs.
Consequent The item set corresponding to the then portion of an if–then association rule.
Corpus A collection of documents to be analyzed.
Cosine distance A measure of dissimilarity between two observations often used on frequency data derived from text because it is unaffected by the magnitude of the frequency and instead measures differences in frequency patterns.
Dendrogram A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering.
Document A piece of text, which can range from a single sentence to an entire book depending on the scope of the corresponding corpus.
Euclidean distance Geometric measure of dissimilarity between observations based on the Pythagorean theorem.
Frequency document-term matrix A matrix whose rows represent documents (units of text) and columns represent terms (words or word roots), and the entries in the matrix are the number of times each term occurs in each document.
Group average linkage Measure of calculating dissimilarity between clusters by considering the distance between each pair of observations between two clusters.
Hierarchical clustering Process of agglomerating observations into a series of nested groups based on a measure of similarity.
Jaccard’s coefficient Measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries.
Jaccard distance Measure of dissimilarity between observations based on Jaccard’s coefficient.
k-means clustering Process of organizing observations into one of k groups based on a measure of similarity (typically Euclidean distance).
Lift ratio The ratio of the performance of a data mining model measured against the performance of a random choice. In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction.
Manhattan distance Measure of dissimilarity between two observations based on the sum of the absolute differences in each variable dimension.
Market basket analysis Analysis of items frequently co-occurring in transactions (such as purchases).
Market segmentation The partitioning of customers into groups that share common characteristics so that a business may target customers within a group with a tailored marketing strategy.
Matching coefficient Measure of similarity between observations based on the number of matching values of categorical variables.
Matching distance Measure of dissimilarity between observations based on the matching coefficient.


McQuitty’s method Measure that computes the dissimilarity introduced by merging clusters A and B by, for each other cluster C, averaging the distance between A and C and the distance between B and C and summing these average distances.
Median linkage Method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters.
Observation (record) A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.
Presence/absence document-term matrix A matrix with the rows representing documents and the columns representing words, and the entries in the columns indicating either the presence or the absence of a particular word in a particular document (1 = present and 0 = not present).
Sentiment analysis The process of clustering/categorizing comments or reviews as positive, negative, or neutral.
Single linkage Measure of calculating dissimilarity between clusters by considering only the two most similar observations between the two clusters.
Stemming The process of converting a word to its stem or root word.
Stopwords Common words in a language that are removed in the preprocessing of text.
Support The percentage of transactions in which a collection of items occurs together in a transaction data set.
Term The most basic unit of text comprising a document, typically corresponding to a word or word stem.
Term frequency times inverse document frequency (TFIDF) Text mining measure that accounts for term frequency and the uniqueness of a term in a document relative to other documents in a corpus.
Term normalization A set of natural language processing techniques to map text into a standardized form.
Text mining The process of extracting useful information from text data.
Tokenization The process of dividing text into separate terms, referred to as tokens.
Unsupervised learning Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process.
Unstructured data Data, such as text, audio, or video, that cannot be stored in a traditional structured database.
Ward’s method Procedure that partitions observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.
Word cloud A visualization of text data based on word frequencies in a document or set of documents.

Problems

Problems 1 through 10 and Case Problem 1 do not require the use of data mining software and focus on knowledge of concepts and basic calculations. Problems 11 through 23 and Case Problem 2 require the use of data mining software. If using R/Rattle to solve these problems, refer to Appendix: R/Rattle Settings to Solve Chapter 5 Problems. If using JMP Pro to solve these problems, refer to Appendix: JMP Pro Settings to Solve Chapter 5 Problems.

1. k-Means Clustering of Wines. Amanda Boleyn, an entrepreneur who recently sold her start-up for a multi-million-dollar sum, is looking for alternate investments for her newfound fortune. She is considering an investment in wine, similar to how some people invest in rare coins and fine art. To educate herself on the properties of fine wine, she has collected data on 13 different characteristics of 178 wines. Amanda has applied k-means clustering to this data for k = 1, ..., 10 and generated the following plot of total sums of squared deviations. After analyzing this plot, Amanda generates summaries for k = 2, 3, and 4. Which value of k is the most appropriate to categorize these wines? Justify your choice with calculations.


k = 2

Inter-Cluster Distances
             Cluster 1   Cluster 2
Cluster 1    0           5.640
Cluster 2    5.640       0

Within-Cluster Summary
             Size   Average Distance
Cluster 1    87     4.003
Cluster 2    91     4.260
Total        178    4.134

k = 3

Inter-Cluster Distances
             Cluster 1   Cluster 2   Cluster 3
Cluster 1    0           5.147       6.078
Cluster 2    5.147       0           5.432
Cluster 3    6.078       5.432       0

Within-Cluster Summary
             Size   Average Distance
Cluster 1    62     3.355
Cluster 2    65     3.999
Cluster 3    51     3.483
Total        178    3.627

k = 4

Inter-Cluster Distances
             Cluster 1   Cluster 2   Cluster 3   Cluster 4
Cluster 1    0           5.255       6.070       4.853
Cluster 2    5.255       0           5.136       4.789
Cluster 3    6.070       5.136       0           6.074
Cluster 4    4.853       4.789       6.074       0

Within-Cluster Summary
             Size   Average Distance
Cluster 1    56     3.024
Cluster 2    45     3.490
Cluster 3    49     3.426
Cluster 4    28     4.580
Total        178    3.498

2. Distance to Centroid Calculation for Wine Clusters. Jay Gatsby categorizes wines into one of three clusters. The centroids of these clusters, describing the average characteristics of a wine in each cluster, are listed in the following table.

Characteristic     Cluster 1   Cluster 2   Cluster 3
Alcohol             0.819       0.164      -0.937
Malic Acid         -0.329       0.869      -0.368
Ash                 0.248       0.186      -0.393
Alkalinity         -0.677       0.523       0.249
Magnesium           0.643      -0.075      -0.573
Phenols             0.825       0.977      -0.034
Flavanoids          0.896      -1.212       0.083
Nonflavanoids      -0.595       0.724       0.009
Proanthocyanins     0.619      -0.778       0.010
Color Intensity     0.135       0.939      -0.881
Hue                 0.497      -1.162       0.437
Dilution            0.744      -1.289       0.295
Proline             1.117      -0.406      -0.776

Jay has recently discovered a new wine from the Piedmont region of Italy with the following characteristics. In which cluster of wines should he place this new wine? Justify your choice with appropriate calculations.

Characteristic     New Wine
Alcohol            -1.023
Malic Acid         -0.480
Ash                 0.049
Alkalinity          0.600
Magnesium          -1.242
Phenols             1.094
Flavanoids          0.001
Nonflavanoids       0.548
Proanthocyanins    -0.229
Color Intensity    -0.797
Hue                 0.711
Dilution           -0.425
Proline             0.010
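As a reminder of the mechanics behind assigning an observation to a cluster, the sketch below computes the Euclidean distance from an observation to each centroid and selects the closest one. The two-variable centroids and the observation are hypothetical values chosen for illustration (they are not the wine data above), and the function names are our own.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two observations of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_centroid(observation, centroids):
    """Return the label of the closest centroid and all computed distances."""
    distances = {label: euclidean(observation, c) for label, c in centroids.items()}
    return min(distances, key=distances.get), distances

# Hypothetical standardized centroids on two variables.
centroids = {
    "Cluster 1": (1.0, 0.0),
    "Cluster 2": (-1.0, 1.0),
    "Cluster 3": (0.0, -2.0),
}
label, distances = nearest_centroid((0.5, 0.5), centroids)
print(label)  # Cluster 1
```

The same approach extends to any number of variables, provided the observation and the centroids are expressed in the same (for example, standardized) units.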


3. Outliers’ Impact on Clustering. Sol & Nieve is a sporting goods and outdoor gear retailer that operates in North America and Central America. In an attempt to characterize its stores (and re-assess Sol & Nieve’s supply chain operations), Gustavo Esposito is analyzing sales in 22 regions for its two primary product lines: sol (beach-oriented apparel) and nieve (mountain-oriented apparel). Gustavo has generated the following output for k-means clustering for k = 2, 3, and 4 (output reported in standardized units). Which value of k is the most appropriate for these data? How should Gustavo interpret the results to characterize the clusters?

k = 2

Inter-Cluster Distances
             Cluster 1   Cluster 2
Cluster 1    0           2.215
Cluster 2    2.215       0

Within-Cluster Summary
                                        Centroid (Original Units)
             Size   Average Distance    Sol       Nieve
Cluster 1    11     0.655               25.148    6.695
Cluster 2    11     0.685               6.340     25.276

k = 3

Inter-Cluster Distances
             Cluster 1   Cluster 2   Cluster 3
Cluster 1    0           4.603       1.977
Cluster 2    4.603       0           3.076
Cluster 3    1.977       3.076       0

Within-Cluster Summary
                                        Centroid (Original Units)
             Size   Average Distance    Sol       Nieve
Cluster 1    11     0.655               25.148    6.695
Cluster 2    1      0                   6.500     60.000
Cluster 3    10     0.154               6.324     22.932

k = 4

Inter-Cluster Distances
             Cluster 1   Cluster 2   Cluster 3   Cluster 4
Cluster 1    0           4.458       3.060       6.064
Cluster 2    4.458       0           1.728       3.076
Cluster 3    3.060       1.728       0           4.457
Cluster 4    6.064       3.076       4.457       0

Within-Cluster Summary
                                        Centroid (Original Units)
             Size   Average Distance    Sol       Nieve
Cluster 1    1      0                   60.000    6.500
Cluster 2    10     0.154               6.324     22.932
Cluster 3    10     0.120               20.099    6.715
Cluster 4    1      0                   6.500     60.000

4. Cluster Shapes for k-Means versus Single Linkage. Heidi Zahn is a human resources manager currently reviewing data on 98 employees. In the data, each observation consists of an employee’s age and an employee’s performance rating.
a. Heidi applied k-means clustering with k = 2 to the data and generated the following plot to visualize the clusters. Based on this plot, qualitatively characterize the two clusters of employees categorized by the k-means approach.

[Scatter plot: k-Means Clustering Solution with k = 2. Axes: Age and Performance Rating.]

b. For a comparison, Heidi applied hierarchical clustering with the Euclidean distance and single linkage to the data and generated the following plot based on the level of agglomeration with two clusters. Based on this plot, qualitatively characterize the two clusters of employees categorized by the hierarchical approach.

[Scatter plot: Hierarchical Clustering Solution Using Euclidean Distance and Single Linkage. Axes: Age and Performance Rating.]

c. Which of the two approaches (k-means clustering from part (a) or hierarchical clustering from part (b)) would you recommend?

5. Dendrogram of Utility Companies. The regulation of electric and gas utilities is an important public policy question affecting consumers’ choice and cost of energy provider. To inform deliberation on public policy, data on eight numerical variables have been collected for a group of energy companies. To summarize the data, hierarchical clustering has been executed using Euclidean distance to measure dissimilarity between observations and Ward’s method as the agglomeration method. Based on the following dendrogram, what is the most appropriate number of clusters to organize these utility companies?

[Dendrogram. Vertical axis: Distance (0 to 11.0). Horizontal axis: Cluster, with one leaf for each of the 21 utility companies.]

6. Complete Linkage Clustering of Utility Companies. In an effort to inform political leaders and economists discussing the deregulation of electric and gas utilities, data on eight numerical variables from utility companies have been grouped using hierarchical clustering based on Euclidean distance to measure dissimilarity between observations and complete linkage as the agglomeration method.
a. Based on the following dendrogram, what is the most appropriate number of clusters to organize these utility companies?

[Dendrogram. Vertical axis: Distance (0 to 6.5). Horizontal axis: Cluster, with one leaf for each of the 21 utility companies. One merge is annotated "Between-Cluster Distance 2.577".]

b. Using the following data on Observations 10, 13, 4, and 20, confirm that the complete linkage distance between the cluster containing {10, 13} and the cluster containing {4, 20} is 2.577 units as displayed in the dendrogram.

                    Observation
Variable            10        13        4         20
Income/Debt          0.032     0.195    -0.510     0.466
Return               0.741     0.875     0.207     0.474
Cost                 0.700     0.748    -0.004    -0.490
Load                -0.892    -0.735    -0.219     0.655
Peak                -0.173     1.013    -0.943     0.083
Sales               -0.693    -0.489    -0.702    -0.458
Percent Nuclear      1.620     2.275     1.328     1.733
Total Fuel Costs    -0.863    -1.035    -0.724    -0.721
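For reference, complete linkage measures the dissimilarity between two clusters as the largest Euclidean distance over all pairs consisting of one observation from each cluster, whereas single linkage uses the smallest. The sketch below illustrates both on two small hypothetical clusters in two dimensions; the data and function names are our own and are not the utility data above.

```python
import math
from itertools import product

def euclidean(u, v):
    """Euclidean distance between two observations of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def complete_linkage(cluster_a, cluster_b):
    """Largest distance between any observation in A and any observation in B."""
    return max(euclidean(u, v) for u, v in product(cluster_a, cluster_b))

def single_linkage(cluster_a, cluster_b):
    """Smallest distance between any observation in A and any observation in B."""
    return min(euclidean(u, v) for u, v in product(cluster_a, cluster_b))

# Hypothetical clusters with two observations each.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0)]

print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(single_linkage(cluster_a, cluster_b))    # 3.0
```

With two observations in each cluster there are four pairwise distances to compare, which is the same number of distances needed for a by-hand check of a merge between two two-member clusters.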

7. Interpreting Merge Steps from a Dendrogram. From 1946 to 1990, the Big Ten Conference consisted of the University of Illinois, Indiana University, University of Iowa, University of Michigan, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Purdue University, and University of Wisconsin. In 1990, the conference added Pennsylvania State University. In 2011, the conference added the University of Nebraska. In 2014, the University of Maryland and Rutgers University were added to the conference with speculation of more schools being added in the future. Based on the football stadium capacity, latitude, longitude, and enrollment, the Big Ten commissioner is curious how a clustering algorithm would suggest the conference expand beyond its current 14 members. To represent the 14 member schools as an entity, each variable value for the 14 schools of the Big Ten Conference has been replaced with the respective variable median over these 14 schools. Using Euclidean distance to measure dissimilarity between observations, hierarchical clustering with complete linkage generates the following dendrogram. Describe the next three stages of the conference expansion plan suggested by the dendrogram.

[Dendrogram. Vertical axis: Height (0 to 6). Leaf labels: Notre Dame, Vanderbilt, Duke, Virginia, Charlotte, Old Dominion, Wake Forest, Marshall, Syracuse, Buffalo, Pittsburgh, West Virginia, Temple, Army, Navy, Bowling Green, Western Michigan, Eastern Michigan, Toledo, Central Michigan, Northern Illinois, Ball State, Miami University, Cincinnati, Ohio, Akron, Kent State, Tulsa, Arkansas State, Middle Tennessee, Western Kentucky, Tennessee, Memphis, Louisville, Kentucky, North Carolina, East Carolina, North Carolina State, Virginia Tech, Oklahoma, Oklahoma State, Iowa State, Kansas, Kansas State, Missouri, Rutgers, Maryland, Penn State, Nebraska, Wisconsin, Purdue, Ohio State, Northwestern, Minnesota, Michigan, Michigan State, Iowa, Illinois, Indiana.]

8. Comparing Different Linkage Methods. The Football Bowl Subdivision (FBS) level of the National Collegiate Athletic Association (NCAA) consists of over 100 schools. Most of these schools belong to one of several conferences, or collections of schools, that compete with each other on a regular basis in collegiate sports. Suppose the NCAA has commissioned a study that will propose the formation of conferences based on the similarities of constituent schools. If the NCAA cares only about geographic proximity of schools when determining conferences, it could use hierarchical clustering based on the schools’ latitude and longitude values and Euclidean distance to compute dissimilarity between observations. The following charts and tables illustrate the 10-cluster solutions when using various linkage methods (group average, Ward’s method, complete, and centroid) to determine clusters of schools to be assigned to conferences using hierarchical clustering. Compare and contrast the resulting clusters created using each of these different linkage methods for hierarchical clustering.

Group Average Linkage Clusters
[Scatter plot of schools by cluster (Cluster 1 through Cluster 10). Horizontal axis: longitude (-160 to -60). Vertical axis: latitude (19 to 54).]

Cluster       Count of School   Min of Latitude   Max of Latitude   Min of Longitude   Max of Longitude
1             7                 38.9              41.7              -111.9             -104.8
2             6                 36.2              39.4              -122.3             -115.3
3             8                 31.8              35.1              -118.4             -106.4
4             6                 43.6              47.6              -123.1             -116.2
5             43                38.0              43.6              -96.7              -71.0
6             1                 45.0              45.0              -93.3              -93.3
7             30                31.8              38.0              -97.5              -76.2
8             18                29.5              33.2              -98.5              -88.1
9             7                 25.8              30.5              -84.3              -80.1
10            1                 19.7              19.7              -155.1             -155.1
Grand Total   127               19.7              47.6              -155.1             -71.0

Ward’s Method Clusters
[Scatter plot of schools by cluster (Cluster 1 through Cluster 10). Horizontal axis: longitude (-160 to -60). Vertical axis: latitude (19 to 54).]

Cluster       Count of School   Min of Latitude   Max of Latitude   Min of Longitude   Max of Longitude
1             12                36.8              41.7              -122.3             -104.8
2             9                 31.8              36.2              -118.4             -106.4
3             6                 43.6              47.6              -123.1             -116.2
4             1                 19.7              19.7              -155.1             -155.1
5             28                38.0              43.0              -88.3              -71.0
6             13                10.8              45.0              -96.7              -83.6
7             20                31.8              38.0              -90.7              -70.2
8             7                 35.5              39.2              -97.5              -92.3
9             18                29.5              33.2              -98.5              -88.1
10            7                 25.8              30.5              -84.3              -80.1
Grand Total   127               19.7              47.6              -155.1             -71.0

Complete Linkage
[Scatter plot of schools by cluster (Cluster 1 through Cluster 10). Horizontal axis: longitude (-160 to -60). Vertical axis: latitude (19 to 54).]

Cluster       Count of School   Min of Latitude   Max of Latitude   Min of Longitude   Max of Longitude
1             7                 38.9              41.7              -111.9             -104.8
2             5                 36.8              39.2              -122.3             -119.7
3             9                 31.8              36.2              -118.4             -106.4
4             6                 43.6              47.6              -123.1             -116.2
5             28                38.0              43.0              -88.3              -71.0
6             16                39.0              45.0              -96.7              -83.6
7             28                29.5              36.1              -98.5              -85.5
8             20                33.8              38.0              -90.7              -76.2
9             7                 25.8              30.5              -84.3              -80.1
10            1                 19.7              19.7              -155.1             -155.1
Grand Total   127               19.7              47.6              -155.1             -71.0

Centroid Linkage Clusters
[Scatter plot of schools by cluster (Cluster 1 through Cluster 10). Horizontal axis: longitude (-160 to -60). Vertical axis: latitude (19 to 54).]

Cluster       Count of School   Min of Latitude   Max of Latitude   Min of Longitude   Max of Longitude
1             7                 38.9              41.7              -111.9             -104.8
2             8                 31.8              35.1              -118.4             -106.4
3             6                 36.2              39.4              -122.3             -115.3
4             6                 43.6              47.6              -123.1             -116.2
5             39                38.0              43.6              -93.6              -71.0
6             34                31.8              40.8              -97.5              -76.2
7             1                 45.0              45.0              -93.3              -93.3
8             18                29.5              33.2              -98.5              -88.1
9             7                 25.8              30.5              -84.3              -80.1
10            1                 19.7              19.7              -155.1             -155.1
Grand Total   127               19.7              47.6              -155.1             -71.0

9. Association Rules for Bookstore Transactions. Leggere, an Internet book retailer, is interested in better understanding the purchase decisions of its customers. For a set of 2,000 customer transactions, it has categorized the individual book purchases comprising those transactions into one or more of the following categories: Novels, Willa Bean series, Cooking Books, Bob Villa Do-It-Yourself, Youth Fantasy, Art Books, Biography, Cooking Books by Mossimo Bottura, Harry Potter series, Florence Art Books, and Titian Art Books. Leggere has conducted association rules analysis on this data set and would like to analyze the output. Based on a minimum support of 200 transactions and a minimum confidence of 50%, the table below shows the top 10 rules with respect to lift ratio.
a. Explain why the top rule “If customer buys a Bottura cooking book, then they buy a cooking book,” is not helpful even though it has the largest lift and 100% confidence.
b. Explain how the confidence of 52.99% and lift ratio of 2.20 were computed for the rule “If a customer buys a cooking book and a biography book, then they buy an art book.” Interpret these quantities.
c. Based on these top 10 rules, what general insight can Leggere gain on the purchase habits of these customers?
d. What will be the effect on the rules generated if Leggere decreases the minimum support and reruns the association rules analysis?
e. What will be the effect on the rules generated if Leggere decreases the minimum confidence and reruns the association rules analysis?

Antecedent            Consequent        Support for A&C   Confidence   Lift Ratio
BotturaCooking        Cooking           0.227             1.00         2.32
Cooking, BobVilla     Art               0.205             0.54         2.24
Cooking, Art          Biography         0.204             0.61         2.20
Cooking, Biography    Art               0.204             0.53         2.20
Youth Fantasy         Novels, Cooking   0.245             0.55         2.15
Cooking, Art          BobVilla          0.205             0.61         2.11
Cooking, BobVilla     Biography         0.218             0.58         2.08
Biography             Novels, Cooking   0.293             0.53         2.07
Novels, Cooking       Biography         0.293             0.57         2.07
Art                   Novels, Cooking   0.249             0.52         2.02
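The three rule measures reported in tables such as this one follow directly from transaction counts: support is the share of transactions containing both antecedent and consequent, confidence is the share of antecedent transactions that also contain the consequent, and lift ratio is confidence divided by the overall share of transactions containing the consequent. The sketch below uses hypothetical counts (not Leggere's data), and the function and parameter names are our own.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift ratio) for the rule antecedent -> consequent.

    n_total:      number of transactions in the data set
    n_antecedent: transactions containing the antecedent item set
    n_consequent: transactions containing the consequent item set
    n_both:       transactions containing both item sets
    """
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift

# Hypothetical counts: 1,000 transactions, 200 with the antecedent,
# 250 with the consequent, and 100 with both.
support, confidence, lift = rule_metrics(1000, 200, 250, 100)
print(support, confidence, lift)  # 0.1 0.5 2.0
```

A lift ratio of 2.0 here means that a transaction containing the antecedent is twice as likely to contain the consequent as a randomly selected transaction.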

10. Association Rules on Congressional Voting Records. Freelance reporter Irwin Fletcher is examining the historical voting records of members of the U.S. Congress. For 175 representatives, Irwin has collected the voting record (yes or no) on 16 pieces of legislation. To examine the relationship between representatives’ votes on different issues, Irwin has conducted an association rules analysis with a minimum support of 40% and a minimum confidence of 90%. The data included the following bills:

Budget: approve federal budget resolution
Contras: aid for Nicaraguan contra rebels
El_Salvador: aid to El Salvador
Missile: funding for M-X missile program
Physician: freeze physician fees
Religious: equal access to all religious groups at schools
Satellite: ban on anti-satellite weapons testing

The following table shows the top five rules with respect to lift ratio. The table displays representatives’ decisions in a “bill-vote” format. For example, “Contras-y” indicates that the representative voted yes on a bill to support the Nicaraguan Contra rebels and “Physician-n” indicates a no vote on a bill to freeze physician fees.

Antecedent                            Consequent      Support for A&C   Confidence   Lift Ratio
Contras-y, Physician-n, Satellite-y   El_Salvador-n   0.40              0.95         1.98
Contras-y, Missile-y                  El_Salvador-n   0.40              0.91         1.90
Contras-y, Physician-n                El_Salvador-n   0.44              0.91         1.90
Missile-n, Religious-y                El_Salvador-y   0.40              0.93         1.90
Budget-y, Contras-y, Physician-n      El_Salvador-n   0.41              0.90         1.89


a. Interpret the lift ratio of the first rule in the table.
b. What is the probability that a representative votes no on El Salvador aid given that they vote yes to aid to Nicaraguan Contra rebels and yes to the M-X missile program?
c. What is the probability that a representative votes no on El Salvador aid given that they vote no to the M-X missile program and yes to equal access to religious groups in schools?
d. What is the probability that a randomly selected representative votes yes on El Salvador aid?

Problems 11 through 23 require the use of data mining software such as R or JMP Pro to solve. There are two versions (.csv and .xlsx) of the DATAfiles for these problems. Use the .csv file as input if using R and use the .xlsx file as input if using JMP Pro.

11. k-Means Clustering of Bank Customers. Apply k-means clustering with values of k = 2, 3, 4, and 5 to cluster the data in DemoKTC based on the Age, Income, and Children variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. How many clusters do you recommend? Why?

12. Hierarchical Clustering on Binary Variables. Using matching distance to compute dissimilarity between observations, apply hierarchical clustering employing group average linkage to the data in DemoKTC to create three clusters based on the Female, Married, Loan, and Mortgage variables. Report the characteristics of each cluster including the total number of customers in each cluster as well as the number of customers who are female, the number of customers who are married, the number of customers with a car loan, and the number of customers with a mortgage in each cluster. How would you describe each cluster?

13. Clustering Colleges with k-Means. The Football Bowl Subdivision (FBS) level of the National Collegiate Athletic Association (NCAA) consists of over 100 schools. Most of these schools belong to one of several conferences, or collections of schools, that compete with each other on a regular basis in collegiate sports. Suppose the NCAA has commissioned a study that will propose the formation of conferences based on the similarities of the constituent schools.
The file FBS contains data on schools that belong to the Football Bowl Subdivision. Each row in this file contains information on a school. The variables include football stadium capacity, latitude, longitude, athletic department revenue, endowment, and undergraduate enrollment.
a. Apply k-means clustering with k = 10 using football stadium capacity, latitude, longitude, endowment, and enrollment as variables. Normalize the input variables to adjust for the different magnitudes of the variables. Analyze the resultant clusters. What is the smallest cluster? What is the least dense cluster (as measured by the average distance in the cluster)? What makes the least dense cluster so diverse?
b. What problems do you see with the plan for defining the school membership of the 10 conferences directly with the 10 clusters?
c. Repeat part (a), but this time do not normalize the values of the input variables. Analyze the resultant clusters. How and why do they differ from those in part (a)? Identify the dominating factor(s) in the formation of these new clusters.

14. Grouping Colleges with Hierarchical Clustering. Refer to the clustering problem involving the file FBS described in Problem 13. Using Euclidean distance to compute dissimilarity between observations, apply hierarchical clustering employing Ward’s method with 10 clusters using football stadium capacity, latitude, longitude, endowment, and enrollment as variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables.
a. Compute the cluster centers for the clusters created by the hierarchical clustering.
b. Identify the cluster with the largest average football stadium capacity. Using all the variables, how would you characterize this cluster?
c. Examine the smallest cluster. What makes this cluster unique?

15. Cluster Comparison of Single Linkage to Group Average Linkage. Refer to the clustering problem involving the file FBS described in Problem 13.
Using Euclidean distance to compute dissimilarity between observations, apply hierarchical clustering with 10 clusters using latitude and longitude as variables. Execute the clustering two times: once with single linkage and once with group average linkage. Compute the cluster sizes and visualize the clusters by creating a scatter plot with longitude as the x-variable and latitude as the y-variable. Compare the results of the two approaches.

16. k-Means Clustering of Employees. IBM employs a network of expert analytics consultants for various projects. To help it determine how to distribute its bonuses, IBM wants to form groups of employees with similar performance according to key performance metrics. Each observation (corresponding to an employee) in the file BigBlue consists of values for UsageRate, which corresponds to the proportion of time that the employee has been actively working on high-priority projects; Recognition, which is the number of projects for which the employee was specifically requested; and Leader, which is the number of projects on which the employee has served as project leader. Apply k-means clustering with values of k = 2 to 7. Normalize the values of the input variables to adjust for the different magnitudes of the variables. How many clusters do you recommend to categorize the employees? Why?

17. k-Means Clustering of Sandler Movies. Attracted by the possible returns from a portfolio of movies, hedge funds have invested in the movie industry by financially backing individual films and/or studios. The hedge fund Star Ventures is currently conducting some research on movies featuring Adam Sandler, an American actor, screenwriter, and film producer. As a first step, Star Ventures would like to cluster Adam Sandler movies based on their gross box office returns and movie critic ratings. Using the data in the file Sandler, apply k-means clustering with k = 3 to characterize three different types of Adam Sandler movies. Base the clusters on the variables Rating and Box. Rating corresponds to movie ratings provided by critics (a higher score represents a movie receiving better reviews). Box represents the gross box office earnings in 2015 dollars.
Normalize the values of the input variables to adjust for the different magnitudes of the variables. Report the characteristics of each cluster including a count of movies, the average rating of movies, and the average box office earnings of movies in each cluster. How would you describe the movies in each cluster?

18. k-Means Clustering of Trader Joe’s Stores. Josephine Mater works for the supply-chain analytics division of Trader Joe’s, a national chain of specialty grocery stores. Trader Joe’s is considering a redesign of its supply chain. Josephine knows that Trader Joe’s uses frequent truck shipments from its distribution centers to its retail stores. To keep costs low, retail stores are typically located near a distribution center. The file TraderJoes contains data on the location of Trader Joe’s retail stores. Josephine would like to use k-means clustering with k = 8 to estimate the preferred locations for a proposal to use eight distribution centers to support its retail stores. If Trader Joe’s establishes eight distribution centers, how many retail stores does the k-means approach suggest assigning to each distribution center? What are the drawbacks to directly applying this solution to assign retail stores to distribution centers?

19. Association Rules of iStore Transactions. Apple Inc. tracks online transactions at its iStore and is interested in learning about the purchase patterns of its customers in order to provide recommendations as a customer browses its web site. A sample of the “shopping cart” data resides in the files AppleCartBinary and AppleCartStacked. Use a minimum support of 10% of the total number of transactions and a minimum confidence of 50% to generate a list of association rules. a. Interpret what the rule with the largest lift ratio is saying about the relationship between the antecedent item set and consequent item set. b. Interpret the confidence of the rule with the largest lift ratio. c.
Interpret the lift ratio of the rule with the largest lift ratio. d. Review the top 15 rules and summarize what the rules suggest.

20. Association Rules of Browser Histories. Cookie Monster Inc. is a company that specializes in the development of software that tracks web browsing history of individuals. Cookie Monster Inc. is interested in analyzing its data to gain insight on the online behavior of individuals. A sample of browser histories is provided in the files CookieMonsterBinary and CookieMonsterStacked that indicate which websites were visited by which customers. Use a minimum support of 4% of the transactions (800 of the 20,000 total transactions) and a minimum confidence of 50% to generate a list of association rules. a. Based on the top 14 rules, which three web sites appear in the association rules with the largest lift ratio? b. Identify the association rule with the largest lift ratio that also has Pinterest as the antecedent. What is the consequent web site in this rule? c. Interpret the confidence of the rule from part (b). While the antecedent and consequent are not necessarily chronological, what does this rule suggest? d. Identify the association rule with the largest lift ratio that also has TheEveryGirl as the antecedent. What is the consequent web site in this rule? e. Interpret the lift ratio of the rule from part (d).

21. Association Rules of Grocery Store Transactions. A grocery store introducing items from Italy is interested in analyzing buying trends of these new “international” items, namely prosciutto, Peroni, risotto, and gelato. The files GroceryStoreList and GroceryStoreStacked provide data on a collection of transactions in item-list format. a. Use a minimum support of 10% of the transactions (100 of the 1,000 total transactions) and a minimum confidence of 50% to generate a list of association rules. How many rules satisfy this criterion? b. Use a minimum support of 25% of the transactions (250 of the 1,000 total transactions) and a minimum confidence of 50% to generate a list of association rules. How many rules satisfy this criterion? Why may the grocery store want to increase the minimum support required for their analysis? What is the risk of increasing the minimum support required? c. Using the list of rules from part (b), consider the rule with the largest lift ratio that also involves an Italian item. Interpret what this rule is saying about the relationship between the antecedent item set and consequent item set. d.
Interpret the confidence of the rule with the largest lift ratio that also involves an Italian item. e. Interpret the lift ratio of the rule with the largest lift ratio that also involves an Italian item. f. What insight can the grocery store obtain about its purchasers of the Italian fare?

22. Text Mining of Tweets. Companies can learn a lot about customer experiences by monitoring the social media web site Twitter. The file AirlineTweets contains a sample of 36 tweets of an airline’s customers. Normalize the terms by using stemming and generate frequency and binary document-term matrices. a. What are the five most common terms occurring in these tweets? How often does each term appear? b. Using Jaccard’s distance to compute dissimilarity between observations, apply hierarchical clustering employing Ward’s linkage method to yield three clusters on the binary document-term matrix using the following tokens as variables: agent, attend, bag, damag, and rude. How many documents are in each cluster? Give a description of each cluster. c. How could management use the results obtained in part (b)? (Source: Kaggle web site)

23. Text Mining of Yelp Reviews. The online review service Yelp helps millions of consumers find the goods and services they seek. To help consumers make more-informed choices, Yelp includes over 120 million reviews. The file YelpItalian contains a sample of 21 reviews for an Italian restaurant. Normalize the terms by using stemming and generate frequency and binary document-term matrices. a. What are the five most common terms in these reviews? How often does each term appear? b. Using Jaccard’s distance to compute dissimilarity between observations, apply hierarchical clustering employing Ward’s linkage method to yield two clusters from the binary document-term matrix using all five of the most common terms from the reviews. How many documents are in each cluster? Give a description of each cluster.

Case Problem 1 does not require the use of data mining software (such as R/Rattle or JMP Pro), but the use of Excel is recommended.

Case Problem 1: Big Ten Expansion

From 1946 to 1990, the Big Ten Conference consisted of the University of Illinois, Indiana University, University of Iowa, University of Michigan, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Purdue University, and University of Wisconsin. In 1990, the conference added Pennsylvania State University. In 2011, the conference added the University of Nebraska. In 2014, the University of Maryland and Rutgers University were added to the conference with speculation of more schools being added in the future. The file BigTenExpand contains data on the football stadium capacity, latitude, longitude, endowment, and enrollment of 59 National Collegiate Athletic Association (NCAA) Football Bowl Subdivision (FBS) schools. Treat the 10 schools that were members of the Big Ten from 1946 to 1990 as being in a cluster and the other 49 schools as each being in their own cluster. 1. Using Euclidean distance to measure dissimilarity between observations, determine which school (in its own cluster of one) that hierarchical clustering with complete linkage would recommend integrating into the Big Ten Conference. That is, which school is the most similar with respect to complete linkage to the cluster of ten schools that were members of the Big Ten from 1946 to 1990? 2. Add the single school identified in (1) to create a cluster of 11 schools representing a hypothetical Big Ten Conference. Repeat the calculations to identify the school most similar with respect to complete linkage to this new cluster of 11 schools. 3. Add the school identified in (2) to create a cluster of 12 schools representing a hypothetical Big Ten Conference. Repeat the calculations to identify the school most similar with respect to complete linkage to this new cluster of 12 schools. 4. Add the school identified in (3) to create a cluster of 13 schools representing a hypothetical Big Ten Conference. 
Repeat the calculations to identify the school most similar with respect to complete linkage to this new cluster of 13 schools. Add this school to create a 14-school cluster. 5. How does the hypothetical 14-team cluster created in (4) compare to the actual 14-team Big Ten Conference? For both the hypothetical 14-team Big Ten Conference and the actual 14-team Big Ten Conference, compute the cluster centroid, the distance from each cluster member to the cluster centroid, and average distance between the observations in the cluster. What do you observe when comparing these two clusters? Which cluster has the smaller average distance between observations? Is this surprising? Explain.

Case Problem 2: Know Thy Customer

Know Thy Customer (KTC) is a financial consulting company that provides personalized financial advice to its clients. As a basis for developing this tailored advising, KTC would like to segment its customers into several representative groups based on key characteristics. Peyton Blake, the director of KTC’s fledgling analytics division, plans to establish the set of representative customer profiles based on 600 customer records in the file KnowThyCustomer. Each customer record contains data on age, gender, annual income, marital status, number of children, whether the customer has a car loan, and whether the customer has a home mortgage. KTC’s market research staff has determined that these seven characteristics should form the basis of the customer clustering. Peyton has invited a summer intern, Danny Riles, into her office so they can discuss how to proceed. As they review the data on the computer screen, Peyton’s brow furrows as she realizes that this task may not be trivial. The data contain both categorical variables (Female, Married, Car, and Mortgage) and numerical variables (Age, Income, and Children). 1. Using Manhattan distance to compute dissimilarity between observations, apply hierarchical clustering on all seven variables, experimenting with using complete linkage and group average linkage. Normalize the values of the input variables. Recommend a set of customer profiles (clusters). Describe these clusters according to their “average” characteristics. Why might hierarchical clustering not be a good method to use for these seven variables? 2. Apply a two-step approach: a. Using matching distance to compute dissimilarity between observations, employ hierarchical clustering with group average linkage to produce four clusters using the variables Female, Married, Loan, and Mortgage. b. Based on the clusters from part (a), split the original 600 observations into four separate data sets as suggested by the four clusters from part (a). For each of these four data sets, apply k-means clustering with k = 2 using Age, Income, and Children as variables. Normalize the values of the input variables. This will generate a total of eight clusters. Describe these eight clusters according to their “average” characteristics. What benefit does this two-step clustering approach have over just using hierarchical clustering on all seven variables as in part (1) or just using k-means clustering on all seven variables? What weakness does it have?

Chapter 6 Statistical Inference

CONTENTS
Analytics in Action: John Morrell & Company
6.1 SELECTING A SAMPLE
    Sampling from a Finite Population
    Sampling from an Infinite Population
6.2 POINT ESTIMATION
    Practical Advice
6.3 SAMPLING DISTRIBUTIONS
    Sampling Distribution of x̄
    Sampling Distribution of p̄
6.4 INTERVAL ESTIMATION
    Interval Estimation of the Population Mean
    Interval Estimation of the Population Proportion
6.5 HYPOTHESIS TESTS
    Developing Null and Alternative Hypotheses
    Type I and Type II Errors
    Hypothesis Test of the Population Mean
    Hypothesis Test of the Population Proportion
6.6 BIG DATA, STATISTICAL INFERENCE, AND PRACTICAL SIGNIFICANCE
    Sampling Error
    Nonsampling Error
    Big Data
    Understanding What Big Data Is
    Big Data and Sampling Error
    Big Data and the Precision of Confidence Intervals
    Implications of Big Data for Confidence Intervals
    Big Data, Hypothesis Testing, and p Values
    Implications of Big Data in Hypothesis Testing
SUMMARY
GLOSSARY
PROBLEMS
AVAILABLE IN THE MINDTAP READER:
    APPENDIX: RANDOM SAMPLING WITH R
    APPENDIX: INTERVAL ESTIMATION WITH R
    APPENDIX: HYPOTHESIS TESTING WITH R

ANALYTICS IN ACTION

John Morrell & Company*
Cincinnati, Ohio

John Morrell & Company, which was established in England in 1827, is considered the oldest continuously operating meat manufacturer in the United States. It is a wholly owned and independently managed subsidiary of Smithfield Foods, Smithfield, Virginia. John Morrell & Company offers an extensive product line of processed meats and fresh pork to consumers under 13 regional brands, including John Morrell, E-Z-Cut, Tobin’s First Prize, Dinner Bell, Hunter, Kretschmar, Rath, Rodeo, Shenson, Farmers Hickory Brand, Iowa Quality, and Peyton’s. Each regional brand enjoys high brand recognition and loyalty among consumers. Market research at Morrell provides management with up-to-date information on the company’s various products and how the products compare with competing brands of similar products. In order to compare a beef pot roast made by Morrell to similar beef products from two major competitors, Morrell asked a random sample of consumers to indicate how the products rated in terms of taste, appearance, aroma, and overall preference. In Morrell’s independent taste-test study, a sample of 224 consumers in Cincinnati, Milwaukee, and Los Angeles was chosen. Of these 224 consumers, 150 preferred the beef pot roast made by Morrell. Based on these results, Morrell estimates that the population proportion that prefers Morrell’s beef pot roast is p̄ = 150/224 = 0.67. Recognizing that this estimate is subject to sampling error, Morrell calculates the 95% confidence interval for the population proportion that prefers Morrell’s beef pot roast to be 0.6080 to 0.7312.
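The interval Morrell reports can be reproduced with a few lines of arithmetic. The sketch below assumes the usual large-sample (normal approximation) interval for a proportion, p̄ ± z·sqrt(p̄(1 − p̄)/n); the passage does not state which formula Morrell used, and the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

n = 224          # consumers sampled
preferred = 150  # consumers who preferred Morrell's beef pot roast

p_bar = preferred / n                      # sample proportion
z = NormalDist().inv_cdf(0.975)            # z value for 95% confidence (assumed method)
std_error = sqrt(p_bar * (1 - p_bar) / n)  # estimated standard error of p_bar
margin = z * std_error                     # margin of error

lower, upper = p_bar - margin, p_bar + margin
print(f"p_bar = {p_bar:.4f}")
print(f"95% interval: {lower:.4f} to {upper:.4f}")
```

Under this assumption the computed endpoints round to the 0.6080 and 0.7312 quoted above.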

Morrell then turned its attention to whether these sample data support the conclusion that Morrell’s beef pot roast is the preferred choice of more than 50% of the consumer population. Letting p indicate the proportion of the population that prefers Morrell’s product, the hypothesis test for the research question is as follows:

H0: p ≤ 0.50
Ha: p > 0.50

The null hypothesis H0 indicates the preference for Morrell’s product is less than or equal to 50%. If the sample data support rejecting H0 in favor of the alternative hypothesis Ha, Morrell will draw the research conclusion that in a three-product comparison, its beef pot roast is preferred by more than 50% of the consumer population. Using statistical hypothesis testing procedures, the null hypothesis H0 was rejected. The study provided statistical evidence supporting Ha and the conclusion that the Morrell product is preferred by more than 50% of the consumer population.

In this chapter, you will learn about simple random sampling and the sample selection process. In addition, you will learn how statistics such as the sample mean and sample proportion are used to estimate parameters such as the population mean and population proportion. The concept of a sampling distribution will be introduced and used to compute the margins of error associated with sample estimates. You will then learn how to use this information to construct and interpret interval estimates of a population mean and a population proportion. We then discuss how to formulate hypotheses and how to conduct tests such as the one used by Morrell. You will learn how to use sample data to determine whether or not a hypothesis should be rejected.
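The passage reports only that H0 was rejected. One way to see why is the large-sample z test for a proportion, sketched below. It assumes the normal approximation, the boundary value p0 = 0.50, and a 0.05 level of significance; none of these details is given in the passage, so treat them as illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

n = 224
p_bar = 150 / n   # sample proportion preferring Morrell's product
p0 = 0.50         # hypothesized proportion at the boundary of H0
alpha = 0.05      # assumed level of significance

# Standard error computed as if H0 held with p = p0
std_error = sqrt(p0 * (1 - p0) / n)
z_stat = (p_bar - p0) / std_error

# Upper-tail p value, because Ha is p > 0.50
p_value = 1 - NormalDist().cdf(z_stat)

reject_h0 = p_value <= alpha
print(f"z = {z_stat:.2f}, p value = {p_value:.2e}, reject H0: {reject_h0}")
```

The test statistic is roughly five standard errors above 0.50, so the p value is far below any conventional level of significance.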

*The authors are indebted to Marty Butler, Vice President of Marketing, John Morrell, for providing this Analytics in Action.

Refer to Chapter 2 for a fundamental overview of data and descriptive statistics.

When collecting data, we usually want to learn about some characteristic(s) of the population, the collection of all the elements of interest, from which we are collecting that data. In order to know about some characteristic of a population with certainty, we must collect data from every element in the population of interest; such an effort is referred to as a census. However, there are many potential difficulties associated with taking a census:

• A census may be expensive; if resources are limited, it may not be feasible to take a census.
• A census may be time consuming; if the data need to be collected quickly, a census may not be suitable.
• A census may be misleading; if the population is changing quickly, by the time a census is completed the data may be obsolete.
• A census may be unnecessary; if perfect information about the characteristic(s) of the population of interest is not required, a census may be excessive.
• A census may be impractical; if observations are destructive, taking a census would destroy the population of interest.

A sample that is similar to the population from which it has been drawn is said to be representative of the population.

In order to overcome the potential difficulties associated with taking a census, we may decide to take a sample (a subset of the population) and subsequently use the sample data we collect to make inferences and answer research questions about the population of interest. Therefore, the objective of sampling is to gather data from a subset of the population that is as similar as possible to the entire population, so that what we learn from the sample data accurately reflects what we want to understand about the entire population. When we use the sample data we have collected to make estimates of or draw conclusions about one or more characteristics of a population (the value of one or more parameters), we are using the process of statistical inference.

Sampling is done in a wide variety of research settings. Let us begin our discussion of statistical inference by citing two examples in which sampling was used to answer a research question about a population.

1. Members of a political party in Texas are considering giving their support to a particular candidate for election to the U.S. Senate, and party leaders want to estimate the proportion of registered voters in the state that favor the candidate. A sample of 400 registered voters in Texas is selected, and 160 of those voters indicate a preference for the candidate. Thus, an estimate of the proportion of the population of registered voters who favor the candidate is 160/400 = 0.40.
2. A tire manufacturer is considering production of a new tire designed to provide an increase in lifetime mileage over the firm’s current line of tires. To estimate the mean useful life of the new tires, the manufacturer produced a sample of 120 tires for testing. The test results provided a sample mean of 36,500 miles. Hence, an estimate of the mean useful life for the population of new tires is 36,500 miles.

A sample mean provides an estimate of a population mean, and a sample proportion provides an estimate of a population proportion. With estimates such as these, some estimation error can be expected. This chapter provides the basis for determining how large that error might be.

It is important to realize that sample results provide only estimates of the values of the corresponding population characteristics. We do not expect exactly 0.40, or 40%, of the population of registered voters to favor the candidate, nor do we expect the sample mean of 36,500 miles to exactly equal the mean lifetime mileage for the population of all new tires produced. The reason is simply that the sample contains only a portion of the population and cannot be expected to perfectly replicate the population. Some error, or deviation of the sample from the population, is to be expected. With proper sampling methods, the sample results will provide “good” estimates of the population parameters. But how good can we expect the sample results to be? Fortunately, statistical procedures are available for answering this question. Let us define some of the terms used in sampling. The sampled population is the population from which the sample is drawn, and a frame is a list of the elements from which the sample will be selected. In the first example, the sampled population is all registered voters in Texas, and the frame is a list of all the registered voters. Because the number of registered voters in Texas is a finite number, the first example is an illustration of sampling from a finite population. The sampled population for the tire mileage example is more difficult to define because the sample of 120 tires was obtained from a production process at a particular point in time. We can think of the sampled population as the conceptual population of all the tires that could have been made by the production process at that particular point in time. In this sense, the sampled population is considered infinite, making it impossible to construct a frame from which to draw the sample. 
In this chapter, we show how simple random sampling can be used to select a sample from a finite population and we describe how a random sample can be taken from an infinite population that is generated by an ongoing process. We then discuss how data obtained from a sample can be used to compute estimates of a population mean, a population standard deviation, and a population proportion. In addition, we introduce the important concept of a sampling distribution. As we will show, knowledge of the appropriate sampling distribution enables us to make statements about how close the sample estimates are to the corresponding population parameters, to compute the margins of error associated with these sample estimates, and to construct and interpret interval estimates. We then discuss how to formulate hypotheses and how to use sample data to conduct tests of a population mean and a population proportion.

6.1 Selecting a Sample

Chapter 2 discusses the computation of the mean and standard deviation of a population.

Often the cost of collecting information from a sample is substantially less than the cost of taking a census, especially when personal interviews must be conducted to collect the information.

The director of personnel for Electronics Associates, Inc. (EAI) has been assigned the task of developing a profile of the company’s 2,500 employees. The characteristics to be identified include the mean annual salary for the employees and the proportion of employees having completed the company’s management training program. Using the 2,500 employees as the population for this study, we can find the annual salary and the training program status for each individual by referring to the firm’s personnel records. The data set containing this information for all 2,500 employees in the population is in the file EAI. A measurable factor that defines a characteristic of a population, process, or system is called a parameter. For EAI, the population mean annual salary μ, the population standard deviation of annual salaries σ, and the population proportion p of employees who completed the training program are of interest to us. Using the EAI data, we compute the population mean and the population standard deviation for the annual salary data.

Population mean: μ = $71,800
Population standard deviation: σ = $4,000

The data for the training program status show that 1,500 of the 2,500 employees completed the training program. Letting p denote the proportion of the population that completed the training program, we see that p = 1,500/2,500 = 0.60. The population mean annual salary (μ = $71,800), the population standard deviation of annual salary (σ = $4,000), and the population proportion that completed the training program (p = 0.60) are parameters of the population of EAI employees. Now suppose that the necessary information on all the EAI employees was not readily available in the company’s database. The question we must consider is how the firm’s director of personnel can obtain estimates of the population parameters by using a sample of employees rather than all 2,500 employees in the population. Suppose that a sample of 30 employees will be used.
Clearly, the time and the cost of developing a profile would be substantially less for 30 employees than for the entire population. If the personnel director could be assured that a sample of 30 employees would provide adequate information about the population of 2,500 employees, working with a sample would be preferable to working with the entire population. Let us explore the possibility of using a sample for the EAI study by first considering how we can identify a sample of 30 employees.
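The three parameters above are ordinary descriptive measures computed over every element of the population. As a minimal sketch, the tiny population below is hypothetical (it is not the EAI file) and serves only to show the calculations; note that the population standard deviation divides by N, which is what `statistics.pstdev` does.

```python
from statistics import mean, pstdev

# Hypothetical five-employee population (NOT the EAI data)
salaries = [68_000, 70_000, 72_000, 74_000, 76_000]
completed_training = [True, False, True, True, False]

mu = mean(salaries)       # population mean
sigma = pstdev(salaries)  # population standard deviation (divides by N)
p = sum(completed_training) / len(completed_training)  # population proportion

print(f"mu = {mu:,.2f}, sigma = {sigma:,.2f}, p = {p:.2f}")
```

With the real EAI file, the same three calculations over all 2,500 rows would give the parameter values quoted in the text.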

Sampling from a Finite Population

Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical inferences about the population. The simplest type of probability sample is one in which each sample of size n has the same probability of being selected. It is called a simple random sample. A simple random sample of size n from a finite population of size N is defined as follows.

Simple Random Sample (Finite Population)

A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected.

The random numbers generated using Excel’s RAND function follow a uniform probability distribution between 0 and 1.

Procedures used to select a simple random sample from a finite population are based on the use of random numbers. We can use Excel’s RAND function to generate a random number between 0 and 1 by entering the formula =RAND() into any cell in a worksheet. The number generated is called a random number because the mathematical procedure used by the RAND function guarantees that every number between 0 and 1 has the same probability of being selected. Let us see how these random numbers can be used to select a simple random sample. Our procedure for selecting a simple random sample of size n from a population of size N involves two steps.

Step 1. Assign a random number to each element of the population.
Step 2. Select the n elements corresponding to the n smallest random numbers.

Excel’s Sort procedure is especially useful for identifying the n elements assigned the n smallest random numbers.

Because each set of n elements in the population has the same probability of being assigned the n smallest random numbers, each set of n elements has the same probability of being selected for the sample. If we select the sample using this two-step procedure, every sample of size n has the same probability of being selected; thus, the sample selected satisfies the definition of a simple random sample. Let us consider the process of selecting a simple random sample of 30 EAI employees from the population of 2,500. We begin by generating 2,500 random numbers, one for each employee in the population. Then we select 30 employees corresponding to the 30 smallest random numbers as our sample. Refer to Figure 6.1 as we describe the steps involved.

Step 1. In cell D1, enter the text Random Numbers
Step 2. In cells D2:D2501, enter the formula =RAND()
Step 3. Select the cell range D2:D2501
Step 4. In the Home tab in the Ribbon:
        Click Copy in the Clipboard group
        Click the arrow below Paste in the Clipboard group. When the Paste window appears, click Values in the Paste Values area
        Press the Esc key
Step 5. Select cells A1:D2501
Step 6. In the Data tab on the Ribbon, click Sort in the Sort & Filter group
Step 7. When the Sort dialog box appears:
        Select the check box for My data has headers
        In the first Sort by dropdown menu, select Random Numbers
        Click OK

The random numbers generated by executing these steps will vary; therefore, results will not match Figure 6.1.

After completing these steps we obtain a worksheet like the one shown on the right in Figure 6.1. The employees listed in rows 2–31 are the ones corresponding to the 30 smallest random numbers that were generated. Hence, this group of 30 employees is a simple random sample. Note that the random numbers shown on the right in Figure 6.1 are in ascending order, and that the employees are not in their original order. For instance, employee 812 in the population is associated with the smallest random number and is the first element in the sample, and employee 13 in the population (see row 14 of the worksheet on the left) has been included as the 22nd observation in the sample (row 23 of the worksheet on the right).
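The same two-step procedure carries over directly to code. The sketch below is an illustration in Python rather than Excel: `random.random()` plays the role of RAND, employee numbers 1 to 2,500 stand in for the rows of the EAI file, and the seed is an arbitrary choice made only so that a run can be repeated.

```python
import random

N = 2500   # population size
n = 30     # sample size

rng = random.Random(2021)  # arbitrary seed, for repeatability only

employees = range(1, N + 1)

# Step 1: assign a random number between 0 and 1 to each element of the population
assigned = [(rng.random(), employee) for employee in employees]

# Step 2: select the n elements corresponding to the n smallest random numbers
assigned.sort()
sample = [employee for _, employee in assigned[:n]]

print(sample)
```

In practice, `random.sample(range(1, N + 1), n)` draws a simple random sample without replacement in a single call; the longer version is shown only because it mirrors the worksheet steps.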

Sampling from an Infinite Population

Sometimes we want to select a sample from a population, but the population is infinitely large or the elements of the population are being generated by an ongoing process for which there is no limit on the number of elements that can be generated. Thus, it is not possible to develop a list of all the elements in the population. This is considered the infinite population case. With an infinite population, we cannot select a simple random sample because we cannot construct a frame consisting of all the elements. In the infinite population case, statisticians recommend selecting what is called a random sample.

Random Sample (Infinite Population)

A random sample of size n from an infinite population is a sample selected such that the following conditions are satisfied.
1. Each element selected comes from the same population.
2. Each element is selected independently.


Chapter 6 Statistical Inference

FIGURE 6.1  Using Excel to Select a Simple Random Sample

Worksheet before sorting (left). The formula in cells D2:D2501 is =RAND().

Row   A: Employee   B: Annual Salary   C: Training Program   D: Random Numbers
2     1             75769.50           No                    0.613872
3     2             70823.00           Yes                   0.473204
4     3             68408.20           No                    0.549011
5     4             69787.50           No                    0.047482
6     5             72801.60           Yes                   0.531085
7     6             71767.70           No                    0.994296
8     7             78346.60           Yes                   0.189065
9     8             66670.20           No                    0.020714
10    9             70246.80           Yes                   0.647318
11    10            71255.00           No                    0.524341
12    11            72546.60           No                    0.764998
13    12            69512.50           Yes                   0.255244
14    13            71753.00           Yes                   0.010923
15    14            73547.10           No                    0.238003
16    15            68052.20           No                    0.635675
17    16            64652.50           Yes                   0.177294
18    17            71764.90           Yes                   0.415097
19    18            65187.80           Yes                   0.883440
20    19            69867.50           Yes                   0.476824
21    20            73706.30           Yes                   0.101065
22    21            72039.50           Yes                   0.775323
23    22            72973.60           No                    0.011729
24    23            73372.50           No                    0.762026
25    24            74592.00           Yes                   0.066344
26    25            75738.10           Yes                   0.776766
27    26            72975.10           Yes                   0.828493
28    27            72386.20           Yes                   0.841532
29    28            71051.60           Yes                   0.899427
30    29            72095.60           Yes                   0.486284
31    30            64956.50           No                    0.264628

Worksheet after sorting on the random numbers (right).

Row   A: Employee   B: Annual Salary   C: Training Program   D: Random Numbers
2     812           69094.30           Yes                   0.000193
3     1411          73263.90           Yes                   0.000484
4     1795          69643.50           Yes                   0.002641
5     2095          69894.90           Yes                   0.002763
6     1235          67621.60           No                    0.002940
7     744           75924.00           Yes                   0.002977
8     470           69092.30           Yes                   0.003182
9     1606          71404.40           Yes                   0.003448
10    1744          70957.70           Yes                   0.004203
11    179           75109.70           Yes                   0.005293
12    1387          65922.60           Yes                   0.005709
13    1782          77268.40           No                    0.005729
14    1006          75688.80           Yes                   0.005796
15    278           71564.70           No                    0.005966
16    1850          76188.20           No                    0.006250
17    844           71766.00           Yes                   0.006708
18    2028          72541.30           No                    0.007767
19    1654          64980.00           Yes                   0.008095
20    444           71932.60           Yes                   0.009686
21    556           72973.00           Yes                   0.009711
22    2449          65120.90           Yes                   0.010595
23    13            71753.00           Yes                   0.010923
24    2187          74391.80           No                    0.011364
25    1633          70164.20           No                    0.011603
26    22            72973.60           No                    0.011729
27    1530          70241.30           No                    0.013570
28    820           72793.90           No                    0.013669
29    1258          70979.40           Yes                   0.014042
30    2349          75860.90           Yes                   0.014532
31    1698          77309.10           No                    0.014539

Note: Rows 32–2501 are not shown.

Care and judgment must be exercised in implementing the selection process for obtaining a random sample from an infinite population. Each case may require a different selection procedure. Let us consider two examples to see what we mean by the conditions: (1) each element selected comes from the same population, and (2) each element is selected independently.

A common quality-control application involves a production process for which there is no limit on the number of elements that can be produced. The conceptual population from which we are sampling is all the elements that could be produced (not just the ones that are produced) by the ongoing production process. Because we cannot develop a list of all the elements that could be produced, the population is considered infinite. To be more specific, let us consider a production line designed to fill boxes with breakfast cereal to a mean weight of 24 ounces per box. Samples of 12 boxes filled by this process are periodically selected by a quality-control inspector to determine if the process is operating properly or whether, perhaps, a machine malfunction has caused the process to begin underfilling or overfilling the boxes.

With a production operation such as this, the biggest concern in selecting a random sample is to make sure that condition 1, the sampled elements are selected from the same population, is satisfied. To ensure that this condition is satisfied, the boxes must be selected at approximately the same point in time. This way the inspector avoids the possibility of selecting some boxes when the process is operating properly and other boxes when the process is not operating properly and is underfilling or overfilling the boxes. With a production process such as this, the second condition, each element is selected independently, is satisfied by designing the production process so that each box of cereal is filled independently. With this assumption, the quality-control inspector need only worry about satisfying the same population condition.

As another example of selecting a random sample from an infinite population, consider the population of customers arriving at a fast-food restaurant. Suppose an employee is asked to select and interview a sample of customers in order to develop a profile of customers who visit the restaurant. The customer-arrival process is ongoing, and there is no way to obtain a list of all customers in the population. So, for practical purposes, the population for this ongoing process is considered infinite. As long as a sampling procedure is designed so that all the elements in the sample are customers of the restaurant and they are selected independently, a random sample will be obtained. In this case, the employee collecting the sample needs to select the sample from people who come into the restaurant and make a purchase to ensure that the same population condition is satisfied. If, for instance, the person selected for the sample is someone who came into the restaurant just to use the restroom, that person would not be a customer and the same population condition would be violated. So, as long as the interviewer selects the sample from people making a purchase at the restaurant, condition 1 is satisfied.

Ensuring that the customers are selected independently can be more difficult. The purpose of the second condition of the random sample selection procedure (each element is selected independently) is to prevent selection bias. In this case, selection bias would occur if the interviewer were free to select customers for the sample arbitrarily.
The interviewer might feel more comfortable selecting customers in a particular age group and might avoid customers in other age groups. Selection bias would also occur if the interviewer selected a group of five customers who entered the restaurant together and asked all of them to participate in the sample. Such a group of customers would be likely to exhibit similar characteristics, which might provide misleading information about the population of customers. Selection bias such as this can be avoided by ensuring that the selection of a particular customer does not influence the selection of any other customer. In other words, the elements (customers) are selected independently.

McDonald’s, a fast-food restaurant chain, implemented a random sampling procedure for this situation. The sampling procedure was based on the fact that some customers presented discount coupons. Whenever a customer presented a discount coupon, the next customer served was asked to complete a customer profile questionnaire. Because arriving customers presented discount coupons randomly and independently of other customers, this sampling procedure ensured that customers were selected independently. As a result, the sample satisfied the requirements of a random sample from an infinite population.

Situations involving sampling from an infinite population are usually associated with a process that operates over time. Examples include parts being manufactured on a production line, repeated experimental trials in a laboratory, transactions occurring at a bank, telephone calls arriving at a technical support center, and customers entering a retail store. In each case, the situation may be viewed as a process that generates elements from an infinite population. As long as the sampled elements are selected from the same population and are selected independently, the sample is considered a random sample from an infinite population.

Notes + Comments

1. In this section we have been careful to define two types of samples: a simple random sample from a finite population and a random sample from an infinite population. In the remainder of the text, we will generally refer to both of these as either a random sample or simply a sample. We will not make a distinction of the sample being a “simple” random sample unless it is necessary for the exercise or discussion.

2. Statisticians who specialize in sample surveys from finite populations use sampling methods that provide probability samples. With a probability sample, each possible sample has a known probability of selection and a random process is used to select the elements for the sample. Simple random sampling is one of these methods. We use the term simple in simple random sampling to clarify that this is the probability sampling method that ensures that each sample of size n has the same probability of being selected.

3. The number of different simple random samples of size n that can be selected from a finite population of size N is:

\[ \frac{N!}{n!\,(N - n)!} \]

In this formula, N! and n! are the factorial formulas. For the EAI problem with \(N = 2{,}500\) and \(n = 30\), this expression can be used to show that approximately \(2.75 \times 10^{69}\) different simple random samples of 30 EAI employees can be obtained.

4. In addition to simple random sampling, other probability sampling methods include the following:
• Stratified random sampling: a method in which the population is first divided into homogeneous subgroups or strata and then a simple random sample is taken from each stratum.
• Cluster sampling: a method in which the population is first divided into heterogeneous subgroups or clusters and then simple random samples are taken from some or all of the clusters.
• Systematic sampling: a method in which we sort the population based on an important characteristic, randomly select one of the first k elements of the population, and then select every kth element from the population thereafter.
The calculation of sample statistics such as the sample mean \(\bar{x}\), the sample standard deviation \(s\), and the sample proportion \(\bar{p}\) differs depending on which method of probability sampling is used. See specialized books on sampling such as Elementary Survey Sampling (2011) by Scheaffer, Mendenhall, and Ott for more information.

5. Nonprobability sampling methods include the following:
• Convenience sampling: a method in which sample elements are selected on the basis of accessibility.
• Judgment sampling: a method in which sample elements are selected based on the opinion of the person doing the study.
Although nonprobability samples have the advantages of relatively easy sample selection and data collection, no statistically justified procedure allows a probability analysis or inference about the quality of nonprobability sample results. Statistical methods designed for probability samples should not be applied to a nonprobability sample, and we should be cautious in interpreting the results when a nonprobability sample is used to make inferences about a population.
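The count of possible simple random samples quoted in the notes for the EAI problem can be checked with a short calculation. This is a minimal sketch using Python's standard library; the variable names are illustrative only, and `math.comb` requires Python 3.8 or later.

```python
import math

N = 2500  # population size in the EAI problem
n = 30    # sample size

# N! / (n! (N - n)!) computed exactly with integer arithmetic.
num_samples = math.comb(N, n)

# The same quantity written out with factorials, for comparison.
by_factorials = math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

print(f"{num_samples:.2e}")
```

The exact integer has 70 digits, which is why the notes report it only to three significant figures.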

6.2 Point Estimation

Chapter 2 discusses the computation of the mean and standard deviation of a sample.

Now that we have described how to select a simple random sample, let us return to the EAI problem. A simple random sample of 30 employees and the corresponding data on annual salary and management training program participation are as shown in Table 6.1. The notation \(x_1\), \(x_2\), and so on is used to denote the annual salary of the first employee in the sample, the annual salary of the second employee in the sample, and so on. Participation in the management training program is indicated by Yes in the management training program column.

To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. For example, to estimate the population mean \(\mu\) and the population standard deviation \(\sigma\) for the annual salary of EAI employees, we use the data in Table 6.1 to calculate the corresponding sample statistics: the sample mean \(\bar{x}\) and the sample standard deviation \(s\). The sample mean is

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{2{,}154{,}420}{30} = \$71{,}814 \]

and the sample standard deviation is

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{325{,}009{,}260}{29}} = \$3{,}348 \]

To estimate \(p\), the proportion of employees in the population who completed the management training program, we use the corresponding sample proportion \(\bar{p}\). Let \(x\) denote the number of employees in the sample who completed the management training program. The data in Table 6.1 show that \(x = 19\). Thus, with a sample size of \(n = 30\), the sample proportion is

\[ \bar{p} = \frac{x}{n} = \frac{19}{30} = 0.63 \]
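The three point estimates above can be reproduced with a few lines of code. The sketch below is illustrative rather than part of the EAI study: it assumes the 30 salaries and training responses exactly as listed in the sorted worksheet of Figure 6.1 (rows 2–31), and the variable names are hypothetical.

```python
import statistics

# Annual salaries of the 30 sampled employees, in sample order (Figure 6.1, right).
salaries = [
    69094.30, 73263.90, 69643.50, 69894.90, 67621.60, 75924.00,
    69092.30, 71404.40, 70957.70, 75109.70, 65922.60, 77268.40,
    75688.80, 71564.70, 76188.20, 71766.00, 72541.30, 64980.00,
    71932.60, 72973.00, 65120.90, 71753.00, 74391.80, 70164.20,
    72973.60, 70241.30, 72793.90, 70979.40, 75860.90, 77309.10,
]

# Management training program responses in the same order.
training = (
    "Yes Yes Yes Yes No Yes Yes Yes Yes Yes "
    "Yes No Yes No No Yes No Yes Yes Yes "
    "Yes Yes No No No No No Yes Yes No"
).split()

n = len(salaries)
x_bar = statistics.mean(salaries)      # sample mean
s = statistics.stdev(salaries)         # sample standard deviation (n - 1 divisor)
p_bar = training.count("Yes") / n      # sample proportion

print(round(x_bar), round(s), round(p_bar, 2))
```

Note that `statistics.stdev` divides by n - 1, matching the sample standard deviation formula above; `statistics.pstdev` would divide by n and give a slightly smaller value.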


Table 6.1  Annual Salary and Training Program Status for a Simple Random Sample of 30 EAI Employees

Annual Salary ($)    Management Training Program    Annual Salary ($)    Management Training Program
x1 = 69,094.30       Yes                            x16 = 71,766.00      Yes
x2 = 73,263.90       Yes                            x17 = 72,541.30      No
x3 = 69,643.50       Yes                            x18 = 64,980.00      Yes
x4 = 69,894.90       Yes                            x19 = 71,932.60      Yes
x5 = 67,621.60       No                             x20 = 72,973.00      Yes
x6 = 75,924.00       Yes                            x21 = 65,120.90      Yes
x7 = 69,092.30       Yes                            x22 = 71,753.00      Yes
x8 = 71,404.40       Yes                            x23 = 74,391.80      No
x9 = 70,957.70       Yes                            x24 = 70,164.20      No
x10 = 75,109.70      Yes                            x25 = 72,973.60      No
x11 = 65,922.60      Yes                            x26 = 70,241.30      No
x12 = 77,268.40      No                             x27 = 72,793.90      No
x13 = 75,688.80      Yes                            x28 = 70,979.40      Yes
x14 = 71,564.70      No                             x29 = 75,860.90      Yes
x15 = 76,188.20      No                             x30 = 77,309.10      No

By making the preceding computations, we perform the statistical procedure called point estimation. We refer to the sample mean \(\bar{x}\) as the point estimator of the population mean \(\mu\), the sample standard deviation \(s\) as the point estimator of the population standard deviation \(\sigma\), and the sample proportion \(\bar{p}\) as the point estimator of the population proportion \(p\). The numerical value obtained for \(\bar{x}\), \(s\), or \(\bar{p}\) is called the point estimate. Thus, for the simple random sample of 30 EAI employees shown in Table 6.1, $71,814 is the point estimate of \(\mu\), $3,348 is the point estimate of \(\sigma\), and 0.63 is the point estimate of \(p\). Table 6.2 summarizes the sample results and compares the point estimates to the actual values of the population parameters.

Table 6.2  Summary of Point Estimates Obtained from a Simple Random Sample of 30 EAI Employees

Population Parameter                                                           Parameter Value    Point Estimator                                                                      Point Estimate
\(\mu\) = Population mean annual salary                                        $71,800            \(\bar{x}\) = Sample mean annual salary                                              $71,814
\(\sigma\) = Population standard deviation for annual salary                   $4,000             \(s\) = Sample standard deviation for annual salary                                  $3,348
\(p\) = Population proportion completing the management training program       0.60               \(\bar{p}\) = Sample proportion having completed the management training program     0.63


In Chapter 7, we will show how to construct an interval estimate in order to provide information about how close the point estimate is to the population parameter.

As is evident from Table 6.2, the point estimates differ somewhat from the values of corresponding population parameters. This difference is to be expected because a sample, and not a census of the entire population, is being used to develop the point estimates.

Practical Advice

The subject matter of most of the rest of the book is concerned with statistical inference, of which point estimation is a form. We use a sample statistic to make an inference about a population parameter. When making inferences about a population based on a sample, it is important to have a close correspondence between the sampled population and the target population. The target population is the population about which we want to make inferences, while the sampled population is the population from which the sample is actually taken. In this section, we have described the process of drawing a simple random sample from the population of EAI employees and making point estimates of characteristics of that same population. So the sampled population and the target population are identical, which is the desired situation. But in other cases, it is not as easy to obtain a close correspondence between the sampled and target populations.

Consider the case of an amusement park selecting a sample of its customers to learn about characteristics such as age and time spent at the park. Suppose all the sample elements were selected on a day when park attendance was restricted to employees of a large company. Then the sampled population would be composed of employees of that company and members of their families. If the target population we wanted to make inferences about were typical park customers over a typical summer, then we might encounter a significant difference between the sampled population and the target population. In such a case, we would question the validity of the point estimates being made. Park management would be in the best position to know whether a sample taken on a particular day was likely to be representative of the target population.

In summary, whenever a sample is used to make inferences about a population, we should make sure that the study is designed so that the sampled population and the target population are in close agreement. Good judgment is a necessary ingredient of sound statistical practice.

6.3 Sampling Distributions

In the preceding section we said that the sample mean \(\bar{x}\) is the point estimator of the population mean \(\mu\), and the sample proportion \(\bar{p}\) is the point estimator of the population proportion \(p\). For the simple random sample of 30 EAI employees shown in Table 6.1, the point estimate of \(\mu\) is \(\bar{x}\) = $71,814 and the point estimate of \(p\) is \(\bar{p}\) = 0.63. Suppose we select another simple random sample of 30 EAI employees and obtain the following point estimates:

Sample mean: \(\bar{x}\) = $72,670
Sample proportion: \(\bar{p}\) = 0.70

Note that different values of \(\bar{x}\) and \(\bar{p}\) were obtained. Indeed, a second simple random sample of 30 EAI employees cannot be expected to provide the same point estimates as the first sample.

Now, suppose we repeat the process of selecting a simple random sample of 30 EAI employees over and over again, each time computing the values of \(\bar{x}\) and \(\bar{p}\). Table 6.3 contains a portion of the results obtained for 500 simple random samples, and Table 6.4 shows the frequency and relative frequency distributions for the 500 values of \(\bar{x}\). Figure 6.2 shows the relative frequency histogram for the \(\bar{x}\) values.

A random variable is a quantity whose values are not known with certainty. Because the sample mean \(\bar{x}\) is a quantity whose values are not known with certainty, the sample mean \(\bar{x}\) is a random variable. As a result, just like other random variables, \(\bar{x}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \(\bar{x}\) are the result of different simple random samples, the probability distribution of \(\bar{x}\) is called the sampling distribution of \(\bar{x}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \(\bar{x}\) is to the population mean \(\mu\).

Chapter 2 introduces the concept of a random variable, and Chapter 4 discusses properties of random variables and their relationship to probability concepts.

The ability to understand the material in subsequent sections of this chapter depends heavily on the ability to understand and use the sampling distributions presented in this section.

Table 6.3  Values of \(\bar{x}\) and \(\bar{p}\) from 500 Simple Random Samples of 30 EAI Employees

Sample Number    Sample Mean (\(\bar{x}\))    Sample Proportion (\(\bar{p}\))
1                71,814                       0.63
2                72,670                       0.70
3                71,780                       0.67
4                71,588                       0.53
·                ·                            ·
·                ·                            ·
·                ·                            ·
500              71,752                       0.50

Table 6.4  Frequency and Relative Frequency Distributions of \(\bar{x}\) from 500 Simple Random Samples of 30 EAI Employees

Mean Annual Salary ($)    Frequency    Relative Frequency
69,500.00–69,999.99       2            0.004
70,000.00–70,499.99       16           0.032
70,500.00–70,999.99       52           0.104
71,000.00–71,499.99       101          0.202
71,500.00–71,999.99       133          0.266
72,000.00–72,499.99       110          0.220
72,500.00–72,999.99       54           0.108
73,000.00–73,499.99       26           0.052
73,500.00–73,999.99       6            0.012
Totals:                   500          1.000

FIGURE 6.2  Relative Frequency Histogram of \(\bar{x}\) Values from 500 Simple Random Samples of Size 30 Each (horizontal axis: values of \(\bar{x}\) from 70,000 to 74,000; vertical axis: relative frequency)

Let us return to Figure 6.2. We would need to enumerate every possible sample of 30 employees and compute each sample mean to completely determine the sampling distribution of \(\bar{x}\). However, the histogram of 500 values of \(\bar{x}\) gives an approximation of this sampling distribution. From the approximation we observe the bell-shaped appearance of the distribution. We note that the largest concentration of the \(\bar{x}\) values and the mean of the 500 values of \(\bar{x}\) is near the population mean \(\mu\) = $71,800. We will describe the properties of the sampling distribution of \(\bar{x}\) more fully in the next section.

FIGURE 6.3  Relative Frequency Histogram of \(\bar{p}\) Values from 500 Simple Random Samples of Size 30 Each (horizontal axis: values of \(\bar{p}\) from 0.32 to 0.88; vertical axis: relative frequency)

The 500 values of the sample proportion \(\bar{p}\) are summarized by the relative frequency histogram in Figure 6.3. As in the case of \(\bar{x}\), \(\bar{p}\) is a random variable. If every possible sample of size 30 were selected from the population and if a value of \(\bar{p}\) were computed for each sample, the resulting probability distribution would be the sampling distribution of \(\bar{p}\). The relative frequency histogram of the 500 sample values in Figure 6.3 provides a general idea of the appearance of the sampling distribution of \(\bar{p}\).

In practice, we select only one simple random sample from the population. We repeated the sampling process 500 times in this section simply to illustrate that many different samples are possible and that the different samples generate a variety of values for the sample statistics \(\bar{x}\) and \(\bar{p}\). The probability distribution of any particular sample statistic is called the sampling distribution of the statistic. Next we discuss the characteristics of the sampling distributions of \(\bar{x}\) and \(\bar{p}\).
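The repeated-sampling experiment described above can be imitated with a short simulation. The sketch below does not use the EAI data, which are not listed in full here. Instead it assumes a hypothetical stand-in population of 2,500 salaries drawn from a normal distribution with mean 71,800 and standard deviation 4,000, with each employee independently having a 0.60 chance of having completed the training program. The seed and all names are illustrative.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed so the run is repeatable

# Hypothetical stand-in population (NOT the actual EAI data).
POP_SIZE = 2500
population = [rng.gauss(71800, 4000) for _ in range(POP_SIZE)]
trained = [rng.random() < 0.60 for _ in range(POP_SIZE)]


def sample_estimates(n):
    """Draw one simple random sample of size n; return (x_bar, p_bar)."""
    idx = rng.sample(range(POP_SIZE), n)  # sampling without replacement
    x_bar = statistics.mean([population[i] for i in idx])
    p_bar = sum(trained[i] for i in idx) / n
    return x_bar, p_bar


# Repeat the sampling process 500 times, as in the text.
results = [sample_estimates(30) for _ in range(500)]
x_bars = [x for x, _ in results]
p_bars = [p for _, p in results]

print("mean of the 500 sample means:", round(statistics.mean(x_bars)))
print("std. dev. of the 500 sample means:", round(statistics.stdev(x_bars)))
print("mean of the 500 sample proportions:", round(statistics.mean(p_bars), 3))
```

A histogram of `x_bars` or `p_bars` from such a run plays the same role as Figures 6.2 and 6.3: it approximates the sampling distribution without enumerating every possible sample.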

Sampling Distribution of \(\bar{x}\)

In the previous section we said that the sample mean \(\bar{x}\) is a random variable and that its probability distribution is called the sampling distribution of \(\bar{x}\).

Sampling Distribution of \(\bar{x}\)

The sampling distribution of \(\bar{x}\) is the probability distribution of all possible values of the sample mean \(\bar{x}\).

This section describes the properties of the sampling distribution of \(\bar{x}\). Just as with other probability distributions we studied, the sampling distribution of \(\bar{x}\) has an expected value or mean, a standard deviation, and a characteristic shape or form. Let us begin by considering the mean of all possible \(\bar{x}\) values, which is referred to as the expected value of \(\bar{x}\).

Expected Value of \(\bar{x}\)

In the EAI sampling problem we saw that different simple random samples result in a variety of values for the sample mean \(\bar{x}\). Because many different values of the random variable \(\bar{x}\) are possible, we are often interested in the mean of all possible values of \(\bar{x}\) that can be generated by the various simple random samples. The mean of the \(\bar{x}\) random variable is the expected value of \(\bar{x}\). Let \(E(\bar{x})\) represent the expected value of \(\bar{x}\) and \(\mu\) represent the mean of the population from which we are selecting a simple random sample. It can be shown that with simple random sampling, \(E(\bar{x})\) and \(\mu\) are equal.

The expected value of \(\bar{x}\) equals the mean of the population from which the sample is selected.

Expected Value of \(\bar{x}\)

\[ E(\bar{x}) = \mu \tag{6.1} \]

where
\(E(\bar{x})\) = the expected value of \(\bar{x}\)
\(\mu\) = the population mean

The term standard error is used in statistical inference to refer to the standard deviation of a point estimator.

This result states that with simple random sampling, the expected value or mean of the sampling distribution of \(\bar{x}\) is equal to the mean of the population. In Section 6.1 we saw that the mean annual salary for the population of EAI employees is \(\mu\) = $71,800. Thus, according to equation (6.1), if we considered all possible samples of size n from the population of EAI employees, the mean of all the corresponding sample means for the EAI study would be equal to $71,800, the population mean.

When the expected value of a point estimator equals the population parameter, we say the point estimator is unbiased. Thus, equation (6.1) states that \(\bar{x}\) is an unbiased estimator of the population mean \(\mu\).

Standard Deviation of \(\bar{x}\)

Let us define the standard deviation of the sampling distribution of \(\bar{x}\). We will use the following notation:

\(\sigma_{\bar{x}}\) = the standard deviation of \(\bar{x}\), or the standard error of the mean
\(\sigma\) = the standard deviation of the population
\(n\) = the sample size
\(N\) = the population size


It can be shown that the formula for the standard deviation of \(\bar{x}\) depends on whether the population is finite or infinite. The two formulas for the standard deviation of \(\bar{x}\) follow.

Standard Deviation of \(\bar{x}\)

Finite Population

\[ \sigma_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{\sigma}{\sqrt{n}} \right) \]

Infinite Population

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{6.2} \]

In comparing the two formulas in equation (6.2), we see that the factor \(\sqrt{(N - n)/(N - 1)}\) is required for the finite population case but not for the infinite population case. This factor is commonly referred to as the finite population correction factor. In many practical sampling situations, we find that the population involved, although finite, is large relative to the sample size. In such cases the finite population correction factor \(\sqrt{(N - n)/(N - 1)}\) is close to 1. As a result, the difference between the values of the standard deviation of \(\bar{x}\) for the finite and infinite populations becomes negligible. Then, \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\) becomes a good approximation to the standard deviation of \(\bar{x}\) even though the population is finite. In cases where \(n/N > 0.05\), the finite population version of equation (6.2) should be used in the computation of \(\sigma_{\bar{x}}\). Unless otherwise noted, throughout the text we will assume that the population size is large relative to the sample size, i.e., \(n/N \le 0.05\).

Observe from equation (6.2) that we need to know \(\sigma\), the standard deviation of the population, in order to compute \(\sigma_{\bar{x}}\). That is, the sample-to-sample variability in the point estimator \(\bar{x}\), as measured by the standard error \(\sigma_{\bar{x}}\), depends on the standard deviation of the population from which the sample is drawn. However, when we are sampling to estimate the population mean with \(\bar{x}\), usually the population standard deviation is also unknown. Therefore, we need to estimate the standard deviation of \(\bar{x}\) with \(s_{\bar{x}}\) using the sample standard deviation \(s\) as shown in equation (6.3).

Estimated Standard Deviation of \(\bar{x}\)

Finite Population

\[ s_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{s}{\sqrt{n}} \right) \]

Infinite Population

\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \tag{6.3} \]

Let us now return to the EAI example and compute the estimated standard error (standard deviation) of the mean associated with simple random samples of 30 EAI employees. Recall from Table 6.2 that the standard deviation of the sample of 30 EAI employees is \(s = 3{,}348\). In this case, the population is finite (\(N = 2{,}500\)), but because \(n/N = 30/2{,}500 = 0.012 < 0.05\), we can ignore the finite population correction factor and compute the estimated standard error as

\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{3{,}348}{\sqrt{30}} = 611.3 \]

In this case, we happen to know that the standard deviation of the population is actually \(\sigma = 4{,}000\), so the true standard error is

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4{,}000}{\sqrt{30}} = 730.3 \]
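The standard error calculations above, together with the rule for when to apply the finite population correction factor, can be collected into a small helper. This is a sketch under the text's own rule of thumb (apply the correction only when n/N exceeds 0.05). The function names `fpc` and `standard_error` are hypothetical.

```python
import math


def fpc(N, n):
    """Finite population correction factor sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))


def standard_error(std_dev, n, N=None):
    """Standard error of the mean for a sample of size n.

    std_dev may be the population value (sigma) or the sample value (s).
    If a population size N is given and n/N > 0.05, the finite
    population correction factor is applied, following the text's rule.
    """
    se = std_dev / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= fpc(N, n)
    return se


estimated_se = standard_error(3348, 30, N=2500)  # uses s; n/N = 0.012, no correction
true_se = standard_error(4000, 30, N=2500)       # uses sigma
print(round(estimated_se, 1), round(true_se, 1))
print(round(fpc(2500, 30), 4))  # close to 1, which is why it can be ignored here
```

For the EAI numbers the correction factor is within about 1% of 1, so applying it would change the standard error only slightly. With a much larger sample, say n = 500 from N = 2,500, the helper applies the correction automatically.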

The difference between \(s_{\bar{x}}\) and \(\sigma_{\bar{x}}\) is due to sampling error, or the error that results from observing a sample of 30 rather than the entire population of 2,500.

Form of the Sampling Distribution of \(\bar{x}\)

The preceding results concerning the expected value and standard deviation for the sampling distribution of \(\bar{x}\) are applicable for any population. The final step in identifying the characteristics of the sampling distribution of \(\bar{x}\) is to determine the form or shape of the sampling distribution. We will consider two cases: (1) the population has a normal distribution; and (2) the population does not have a normal distribution.

Population Has a Normal Distribution

In many situations it is reasonable to assume that the population from which we are selecting a random sample has a normal, or nearly normal, distribution. When the population has a normal distribution, the sampling distribution of \(\bar{x}\) is normally distributed for any sample size.

Population Does Not Have a Normal Distribution

When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \(\bar{x}\). A statement of the central limit theorem as it applies to the sampling distribution of \(\bar{x}\) follows.

Central Limit Theorem

In selecting random samples of size n from a population, the sampling distribution of the sample mean \(\bar{x}\) can be approximated by a normal distribution as the sample size becomes large.
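A small simulation can make the theorem concrete. The sketch below assumes a right-skewed population (an exponential distribution with mean 1, chosen only for illustration) and measures how the skewness of the simulated sample means shrinks as n grows from 2 to 5 to 30. The helper names, the seed, and the number of repetitions are all arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary seed for repeatability


def skewness(values):
    """Moment-based skewness: about 0 for a symmetric, bell-shaped distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


def simulated_sample_means(n, reps=5000):
    """Means of `reps` random samples of size n from an exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


skew_by_n = {n: skewness(simulated_sample_means(n)) for n in (2, 5, 30)}
for n, sk in skew_by_n.items():
    print(f"n = {n:2d}: skewness of sample means = {sk:.2f}")
```

The skewness moves toward 0 as n increases, but some right skew is still present at n = 30 for a population this skewed. This is in line with the usual caution that highly skewed populations may call for larger samples.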

Figure 6.4 shows how the central limit theorem works for three different populations; each column refers to one of the populations. The top panel of the figure shows that none of the populations are normally distributed. Population I follows a uniform distribution. Population II is often called the rabbit-eared distribution. It is symmetric, but the more likely values fall in the tails of the distribution. Population III is shaped like the exponential distribution; it is skewed to the right.

The bottom three panels of Figure 6.4 show the shape of the sampling distribution for samples of size \(n = 2\), \(n = 5\), and \(n = 30\). When the sample size is 2, we see that the shape of each sampling distribution is different from the shape of the corresponding population distribution. For samples of size 5, we see that the shapes of the sampling distributions for populations I and II begin to look similar to the shape of a normal distribution. Even though the shape of the sampling distribution for population III begins to look similar to the shape of a normal distribution, some skewness to the right is still present. Finally, for a sample size of 30, the shapes of each of the three sampling distributions are approximately normal.

From a practitioner’s standpoint, we often want to know how large the sample size needs to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal. Statistical researchers have investigated this question by studying the sampling distribution of \(\bar{x}\) for a variety of populations and a variety of sample sizes. General statistical practice is to assume that, for most applications, the sampling distribution of \(\bar{x}\) can be approximated by a normal distribution whenever the sample size is 30 or more. In cases in which the population is highly skewed or outliers are present, sample sizes of 50 may be needed.

Sampling Distribution of \(\bar{x}\) for the EAI Problem

Let us return to the EAI problem where we previously showed that \(E(\bar{x})\) = $71,800 and \(\sigma_{\bar{x}} = 730.3\). At this point, we do not have any information about the population distribution; it may or may not be normally distributed. If the population has a normal distribution, the sampling distribution of \(\bar{x}\) is normally distributed. If the population does not have a normal distribution, the simple random sample of 30 employees and the central limit theorem enable us to conclude that the sampling distribution of \(\bar{x}\) can be approximated by a normal distribution. In either case, we are comfortable proceeding with the conclusion that the sampling distribution of \(\bar{x}\) can be described by the normal distribution shown in Figure 6.5. In other words, Figure 6.5 illustrates the distribution of the sample means corresponding to all possible samples of size 30 for the EAI study.


FIGURE 6.4  Illustration of the Central Limit Theorem for Three Populations (columns: Population I, Population II, Population III; rows: population distribution, then the sampling distribution of \(\bar{x}\) for \(n = 2\), \(n = 5\), and \(n = 30\); horizontal axes: values of \(x\) for the populations and values of \(\bar{x}\) for the sampling distributions)

FIGURE 6.5 Sampling Distribution of x̄ for the Mean Annual Salary of a Simple Random Sample of 30 EAI Employees
[Figure not reproduced. Normal curve for the sampling distribution of x̄, centered at E(x̄) = 71,800, with σ_x̄ = σ/√n = 4,000/√30 = 730.3.]

The sampling distribution in Figure 6.5 is a theoretical construct, as typically the population mean and the population standard deviation are not known. Instead, we must estimate these parameters with the sample mean and the sample standard deviation, respectively.

Relationship Between the Sample Size and the Sampling Distribution of x̄

Suppose that in the EAI sampling problem we select a simple random sample of 100 EAI employees instead of the 30 originally considered. Intuitively, it would seem that because the larger sample size provides more data, the sample mean based on n = 100 would provide a better estimate of the population mean than the sample mean based on n = 30. To see how much better, let us consider the relationship between the sample size and the sampling distribution of x̄. First, note that E(x̄) = μ regardless of the sample size. Thus, the mean of all possible values of x̄ is equal to the population mean μ regardless of the sample size n. However, note that the standard error of the mean, σ_x̄ = σ/√n, is related to the square root of the sample size. Whenever the sample size is increased, the standard error of the mean σ_x̄ decreases. With n = 30, the standard error of the mean for the EAI problem is 730.3. However, with the increase in the sample size to n = 100, the standard error of the mean is decreased to

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4{,}000}{\sqrt{100}} = 400 \]

The sampling distributions of x̄ with n = 30 and n = 100 are shown in Figure 6.6. Because the sampling distribution with n = 100 has a smaller standard error, the values of x̄ with n = 100 have less variation and tend to be closer to the population mean than the values of x̄ with n = 30.
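The effect of the sample size on the standard error can be checked numerically. The Python sketch below uses the EAI population standard deviation of 4,000; the $500 distance used for the probability comparison is an assumption chosen only for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def standard_error_of_mean(sigma, n):
    """Standard error of the mean, sigma / sqrt(n), for a large (infinite) population."""
    return sigma / math.sqrt(n)

def prob_mean_within(sigma, n, distance):
    """P(|x_bar - mu| <= distance), assuming a normal sampling distribution of x_bar."""
    z = distance / standard_error_of_mean(sigma, n)
    return 2 * NormalDist().cdf(z) - 1

SIGMA = 4_000     # EAI population standard deviation
DISTANCE = 500    # illustrative distance from the population mean (an assumption)

for n in (30, 100):
    se = standard_error_of_mean(SIGMA, n)
    p = prob_mean_within(SIGMA, n, DISTANCE)
    print(f"n = {n}: standard error = {se:.1f}, P(within ${DISTANCE}) = {p:.4f}")
```

The larger sample gives both the smaller standard error and the higher probability that x̄ falls within the chosen distance of μ.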

FIGURE 6.6 A Comparison of the Sampling Distributions of x̄ for Simple Random Samples of n = 30 and n = 100 EAI Employees
[Figure not reproduced. Two normal curves centered at 71,800 on the x̄ axis: with n = 100, σ_x̄ = 400; with n = 30, σ_x̄ = 730.3.]


The important point in this discussion is that as the sample size increases, the standard error of the mean decreases. As a result, a larger sample size will provide a higher probability that the sample mean falls within a specified distance of the population mean. The practical reason we are interested in the sampling distribution of x is that it can be used to provide information about how close the sample mean is to the population mean. The concepts of interval estimation and hypothesis testing discussed in Sections 6.4 and 6.5 rely on the properties of sampling distributions.

Sampling Distribution of p̄

The sample proportion p̄ is the point estimator of the population proportion p. The formula for computing the sample proportion is

\[ \bar{p} = \frac{x}{n} \]

where

x = the number of elements in the sample that possess the characteristic of interest
n = sample size

As previously noted in this section, the sample proportion p̄ is a random variable and its probability distribution is called the sampling distribution of p̄.

Sampling Distribution of p̄
The sampling distribution of p̄ is the probability distribution of all possible values of the sample proportion p̄.

To determine how close the sample proportion p̄ is to the population proportion p, we need to understand the properties of the sampling distribution of p̄: the expected value of p̄, the standard deviation of p̄, and the shape or form of the sampling distribution of p̄.

Expected Value of p̄

The expected value of p̄, the mean of all possible values of p̄, is equal to the population proportion p.

Expected Value of p̄

\[ E(\bar{p}) = p \tag{6.4} \]

where

E(p̄) = the expected value of p̄
p = the population proportion

Because E(p̄) = p, p̄ is an unbiased estimator of p. In Section 6.1, we noted that p = 0.60 for the EAI population, where p is the proportion of the population of employees who participated in the company's management training program. Thus, the expected value of p̄ for the EAI sampling problem is 0.60. That is, if we considered the sample proportions corresponding to all possible samples of size n for the EAI study, the mean of these sample proportions would be 0.6.

Standard Deviation of p̄

Just as we found for the standard deviation of x̄, the standard deviation of p̄ depends on whether the population is finite or infinite. The two formulas for computing the standard deviation of p̄ follow.

Standard Deviation of p̄

Finite Population

\[ \sigma_{\bar{p}} = \sqrt{\frac{N-n}{N-1}}\,\sqrt{\frac{p(1-p)}{n}} \]

Infinite Population

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} \tag{6.5} \]


6.3 Sampling Distributions

Comparing the two formulas in equation (6.5), we see that the only difference is the use of the finite population correction factor √((N − n)/(N − 1)). As was the case with the sample mean x̄, the difference between the expressions for the finite population and the infinite population becomes negligible if the size of the finite population is large in comparison to the sample size. We follow the same rule of thumb that we recommended for the sample mean. That is, if the population is finite with n/N ≤ 0.05, we will use σ_p̄ = √(p(1 − p)/n). However, if the population is finite with n/N > 0.05, the finite population correction factor should be used. Again, unless specifically noted, throughout the text we will assume that the population size is large in relation to the sample size and thus the finite population correction factor is unnecessary.

Earlier in this section, we used the term standard error of the mean to refer to the standard deviation of x̄. We stated that in general the term standard error refers to the standard deviation of a point estimator. Thus, for proportions we use standard error of the proportion to refer to the standard deviation of p̄. From equation (6.5), we observe that the sample-to-sample variability in the point estimator p̄, as measured by the standard error σ_p̄, depends on the population proportion p. However, when we are sampling to compute p̄, typically the population proportion is unknown. Therefore, we need to estimate the standard deviation of p̄ with s_p̄ using the sample proportion as shown in equation (6.6).

Estimated Standard Deviation of p̄

Finite Population

\[ s_{\bar{p}} = \sqrt{\frac{N-n}{N-1}}\,\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \]

Infinite Population

\[ s_{\bar{p}} = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \tag{6.6} \]
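Equations (6.5) and (6.6) share the same form, so one small function can cover both. The Python sketch below is a minimal illustration: the function name and the `threshold` parameter are hypothetical, and the inputs are the EAI figures used in this chapter (N = 2,500, n = 30, p̄ = 0.63, p = 0.60). The finite population correction is applied only when n/N exceeds 0.05, following the rule of thumb above.

```python
import math

def standard_error_of_proportion(p, n, N=None, threshold=0.05):
    """Standard error of a sample proportion.

    Pass the population proportion p for equation (6.5) or the sample
    proportion p_bar for equation (6.6). If a finite population size N is
    given and n/N exceeds `threshold`, the finite population correction
    factor sqrt((N - n)/(N - 1)) is applied.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None and n / N > threshold:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# EAI values: n/N = 30/2,500 = 0.012, so no correction is applied.
estimated_se = standard_error_of_proportion(0.63, 30, N=2_500)  # uses p_bar
true_se = standard_error_of_proportion(0.60, 30, N=2_500)       # uses p
print(round(estimated_se, 4), round(true_se, 4))

# A hypothetical small population (N = 200) where the correction does apply.
corrected_se = standard_error_of_proportion(0.60, 30, N=200)
print(round(corrected_se, 4))
```

With the EAI inputs the two results differ only because p̄ differs from p; the hypothetical N = 200 case shows the correction reducing the standard error.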

Let us now return to the EAI example and compute the estimated standard error (standard deviation) of the proportion associated with simple random samples of 30 EAI employees. Recall from Table 6.2 that the sample proportion of EAI employees who completed the management training program is p̄ = 0.63. Because n/N = 30/2,500 = 0.012 < 0.05, we can ignore the finite population correction factor and compute the estimated standard error as

\[ s_{\bar{p}} = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} = \sqrt{\frac{0.63(1-0.63)}{30}} = 0.0881 \]

In the EAI example, we actually know that the population proportion is p = 0.6, so we know that the true standard error is

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.6(1-0.6)}{30}} = 0.0894 \]

The difference between s_p̄ and σ_p̄ is due to sampling error.

Form of the Sampling Distribution of p̄

Now that we know the mean and standard deviation of the sampling distribution of p̄, the final step is to determine the form or shape of the sampling distribution. The sample proportion is p̄ = x/n. For a simple random sample from a large population, x is a binomial random variable indicating the number of elements in the sample with the characteristic of interest. Because n is a constant, the probability of x/n is the same as the binomial probability of x, which means that the sampling distribution of p̄ is also a discrete probability distribution and that the probability for each value of x/n is the same as the binomial probability of the corresponding value of x. Statisticians have shown that a binomial distribution can be approximated by a normal distribution whenever the sample size is large enough to satisfy the following two conditions:

np ≥ 5 and n(1 − p) ≥ 5


Because the population proportion p is typically unknown in a study, the test to see whether the sampling distribution of p̄ can be approximated by a normal distribution is often based on the sample proportion: np̄ ≥ 5 and n(1 − p̄) ≥ 5.

Assuming that these two conditions are satisfied, the probability distribution of x in the sample proportion, p̄ = x/n, can be approximated by a normal distribution. And because n is a constant, the sampling distribution of p̄ can also be approximated by a normal distribution. This approximation is stated as follows: The sampling distribution of p̄ can be approximated by a normal distribution whenever np ≥ 5 and n(1 − p) ≥ 5.

In practical applications, when an estimate of a population proportion is desired, we find that sample sizes are almost always large enough to permit the use of a normal approximation for the sampling distribution of p̄. Recall that for the EAI sampling problem the sample proportion of employees who participated in the training program is p̄ = 0.63. With a simple random sample of size 30, we have np̄ = 30(0.63) = 18.9 and n(1 − p̄) = 30(0.37) = 11.1. Thus, the sampling distribution of p̄ can be approximated by the normal distribution shown in Figure 6.7.

Relationship Between Sample Size and the Sampling Distribution of p̄

Suppose that in the EAI sampling problem we select a simple random sample of 100 EAI employees instead of the 30 originally considered. Intuitively, it would seem that because the larger sample size provides more data, the sample proportion based on n = 100 would provide a better estimate of the population proportion than the sample proportion based on n = 30. To see how much better, recall that the standard error of the proportion is 0.0894 when the sample size is n = 30. If we increase the sample size to n = 100, the standard error of the proportion becomes

\[ \sigma_{\bar{p}} = \sqrt{\frac{0.60(1-0.60)}{100}} = 0.0490 \]

FIGURE 6.7 Sampling Distribution of p̄ for the Proportion of EAI Employees Who Participated in the Management Training Program
[Figure not reproduced. Normal curve for the sampling distribution of p̄, centered at E(p̄) = 0.60, with σ_p̄ = 0.0894.]

The sampling distribution in Figure 6.7 is a theoretical construct, as typically the population proportion is not known. Instead, we must estimate it with the sample proportion.

As we observed with the standard deviation of the sampling distribution of x̄, increasing the sample size decreases the sample-to-sample variability of the sample proportion. As a result, a larger sample size will provide a higher probability that the sample proportion falls within a specified distance of the population proportion. The practical reason we are interested in the sampling distribution of p̄ is that it can be used to provide information about how close the sample proportion is to the population proportion. The concepts of interval estimation and hypothesis testing discussed in Sections 6.4 and 6.5 rely on the properties of sampling distributions.
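The claim that a larger sample raises the probability of landing near p can be made concrete with the normal approximation. The Python sketch below is an illustration only: the 0.05 distance is an assumed value, the function names are hypothetical, and no continuity correction is applied, so the probabilities are approximate.

```python
import math
from statistics import NormalDist

def normal_approximation_ok(n, p):
    """Check the two conditions n*p >= 5 and n*(1 - p) >= 5."""
    return n * p >= 5 and n * (1 - p) >= 5

def prob_proportion_within(p, n, distance):
    """Approximate P(|p_bar - p| <= distance) using the normal approximation."""
    se = math.sqrt(p * (1 - p) / n)
    return 2 * NormalDist().cdf(distance / se) - 1

P = 0.60          # EAI population proportion
DISTANCE = 0.05   # illustrative distance (an assumption)

for n in (30, 100):
    ok = normal_approximation_ok(n, P)
    prob = prob_proportion_within(P, n, DISTANCE)
    print(f"n = {n}: conditions met = {ok}, P(within {DISTANCE}) = {prob:.4f}")
```

Under these assumptions the probability of falling within 0.05 of p rises from roughly 0.42 with n = 30 to roughly 0.69 with n = 100.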

6.4 Interval Estimation

In Section 6.2, we stated that a point estimator is a sample statistic used to estimate a population parameter. For instance, the sample mean x̄ is a point estimator of the population mean μ and the sample proportion p̄ is a point estimator of the population proportion p. Because a point estimator cannot be expected to provide the exact value of the population parameter, interval estimation is frequently used to generate an estimate of the value of a population parameter. An interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate:

Point estimate ± Margin of error

The purpose of an interval estimate is to provide information about how close the point estimate, provided by the sample, is to the value of the population parameter. In this section, we show how to compute interval estimates of a population mean μ and a population proportion p.

Interval Estimation of the Population Mean

The general form of an interval estimate of a population mean is

x̄ ± Margin of error

The sampling distribution of x̄ plays a key role in computing this interval estimate. In Section 6.3 we showed that the sampling distribution of x̄ has a mean equal to the population mean (E(x̄) = μ) and a standard deviation equal to the population standard deviation divided by the square root of the sample size (σ_x̄ = σ/√n). We also showed that for a sufficiently large sample or for a sample taken from a normally distributed population, the sampling distribution of x̄ follows a normal distribution. These results for samples of 30 EAI employees are illustrated in Figure 6.5. Because the sampling distribution of x̄ shows how values of x̄ are distributed around the population mean μ, the sampling distribution of x̄ provides information about the possible differences between x̄ and μ.

For any normally distributed random variable, 90% of the values lie within 1.645 standard deviations of the mean, 95% of the values lie within 1.960 standard deviations of the mean, and 99% of the values lie within 2.576 standard deviations of the mean. Thus, when the sampling distribution of x̄ is normal, 90% of all values of x̄ must be within ±1.645σ_x̄ of the mean μ, 95% of all values of x̄ must be within ±1.96σ_x̄ of the mean μ, and 99% of all values of x̄ must be within ±2.576σ_x̄ of the mean μ.

Figure 6.8 shows what we would expect for values of sample means for 10 independent random samples when the sampling distribution of x̄ is normal. Because 90% of all values of x̄ are within ±1.645σ_x̄ of the mean μ, we expect 9 of the values of x̄ for these 10 samples to be within ±1.645σ_x̄ of the mean μ. If we repeat this process of collecting 10 samples, our results may not include 9 sample means with values that are within 1.645σ_x̄ of the mean μ, but on average, the values of x̄ will be within ±1.645σ_x̄ of the mean μ for 9 of every 10 samples.
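The multipliers 1.645, 1.960, and 2.576 come from the standard normal distribution. As a quick check, the Python sketch below recovers them with the standard library's `NormalDist`; the helper name `z_multiplier` is hypothetical.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Number of standard deviations that capture the central `confidence`
    share of a normal distribution (the upper-tail area is (1 - confidence)/2)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%}: {z_multiplier(confidence):.3f}")
```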
FIGURE 6.8 Sampling Distribution of the Sample Mean
[Figure not reproduced. Normal sampling distribution of x̄ with σ_x̄ = σ/√n, centered at μ, with μ − 1.645σ_x̄ and μ + 1.645σ_x̄ marked on the x̄ axis; the sample means x̄_1 through x̄_10 from 10 independent samples are plotted below the curve.]

We now want to use what we know about the sampling distribution of x̄ to develop an interval estimate of the population mean μ. However, when developing an interval estimate of a population mean μ, we generally do not know the population standard deviation σ, and therefore, we do not know the standard error of x̄, σ_x̄ = σ/√n. In this case, we must use the same sample data to estimate both μ and σ, so we use s_x̄ = s/√n to estimate the standard error of x̄. When we estimate σ_x̄ with s_x̄, we introduce an additional source of uncertainty about the distribution of values of x̄. If the sampling distribution of x̄ follows a normal distribution, we address this additional source of uncertainty by using a probability distribution known as the t distribution.

The t distribution is a family of similar probability distributions; the shape of each specific t distribution depends on a parameter referred to as the degrees of freedom. The t distribution with 1 degree of freedom is unique, as is the t distribution with 2 degrees of freedom, the t distribution with 3 degrees of freedom, and so on. These t distributions are similar in shape to the standard normal distribution but are wider; this reflects the additional uncertainty that results from using s_x̄ to estimate σ_x̄. As the degrees of freedom increase, the difference between s_x̄ and σ_x̄ decreases and the t distribution narrows. Furthermore, because the area under any distribution curve is fixed at 1.0, a narrower t distribution will have a higher peak. Thus, as the degrees of freedom increase, the t distribution narrows, its peak becomes higher, and it becomes more similar to the standard normal distribution. We can see this in Figure 6.9, which shows t distributions with 10 and 20 degrees of freedom as well as the standard normal probability distribution. Note that as with the standard normal distribution, the mean of the t distribution is zero.

The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one. Chapter 5 contains a discussion of the normal distribution and the special case of the standard normal distribution.

Although the mathematical development of the t distribution is based on the assumption that the population from which we are sampling is normally distributed, research shows that the t distribution can be successfully applied in many situations in which the population deviates substantially from a normal distribution.

FIGURE 6.9 Comparison of the Standard Normal Distribution with t Distributions with 10 and 20 Degrees of Freedom
[Figure not reproduced. Curves for the standard normal distribution, the t distribution with 20 degrees of freedom, and the t distribution with 10 degrees of freedom; horizontal axis: z, t.]

To use the t distribution to compute the margin of error for the EAI example, we consider the t distribution with n − 1 = 30 − 1 = 29 degrees of freedom. Figure 6.10 shows that for a t-distributed random variable with 29 degrees of freedom, 90% of the values are within ±1.699 standard deviations of the mean and 10% of the values are more than 1.699 standard deviations away from the mean. Thus, 5% of the values are more than 1.699 standard deviations below the mean and 5% of the values are more than 1.699 standard deviations above the mean. This leads us to use t_{0.05} to denote the value of t for which the area in the upper tail of a t distribution is 0.05. For a t distribution with 29 degrees of freedom, t_{0.05} = 1.699.

FIGURE 6.10 t Distribution with 29 Degrees of Freedom
[Figure not reproduced. t distribution with 90% of the area between −1.699 and t_{0.05} = 1.699 and 5% of the area in each tail.]

We can use Excel's T.INV.2T function to find the value from a t distribution such that a given percentage of the distribution is included in the interval ±t for any degrees of freedom. For example, suppose again that we want to find the value of t from the t distribution with 29 degrees of freedom such that 90% of the t distribution is included in the interval −t to +t. Excel's T.INV.2T function has two inputs: (1) 1 − the proportion of the t distribution that will fall between −t and +t, and (2) the degrees of freedom (which in this case is equal to the sample size − 1). For our example, we would enter the formula =T.INV.2T(1 - 0.90, 30 - 1), which computes the value of 1.699. This confirms the data shown in Figure 6.10; for the t distribution with 29 degrees of freedom, t_{0.05} = 1.699 and 90% of all values for the t distribution with 29 degrees of freedom will lie between −1.699 and 1.699.

To see how the difference between the t distribution and the standard normal distribution decreases as the degrees of freedom increase, use Excel's T.INV.2T function to compute t_{0.05} for increasingly larger degrees of freedom (n − 1) and watch the value of t_{0.05} approach 1.645.
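For readers working outside Excel, the calculation performed by T.INV.2T can be approximated with a few lines of Python. The sketch below is not Excel's algorithm: it is a standard-library-only stand-in, assumed adequate for the moderate degrees of freedom used here, that integrates the t density with Simpson's rule and finds the cutoff by bisection. The function names mirror the Excel function only for readability.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_inv_2t(probability, df):
    """Value t such that P(|T| > t) = probability (two-tailed), found by bisection."""
    target = 1 - probability / 2
    lo, hi = 0.0, 50.0  # assumed search range; wide enough for the cases shown here
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# t_0.05 for increasing degrees of freedom; the values drift toward 1.645.
for df in (5, 29, 100, 1000):
    print(df, round(t_inv_2t(0.10, df), 3))
```

The value for 29 degrees of freedom should agree with the 1.699 shown in Figure 6.10, and the values for larger degrees of freedom illustrate the approach to the standard normal value of 1.645.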


At the beginning of this section, we stated that the general form of an interval estimate of the population mean μ is x̄ ± margin of error. To provide an interpretation for this interval estimate, let us consider the values of x̄ that might be obtained if we took 10 independent simple random samples of 30 EAI employees. The first sample might have the mean x̄_1 and standard deviation s_1. Figure 6.11 shows that the interval formed by subtracting 1.699s_1/√30 from x̄_1 and adding 1.699s_1/√30 to x̄_1 includes the population mean μ. Now consider what happens if the second sample has the mean x̄_2 and standard deviation s_2. Although this sample mean differs from the first sample mean, we see in Figure 6.11 that the interval formed by subtracting 1.699s_2/√30 from x̄_2 and adding 1.699s_2/√30 to x̄_2 also includes the population mean μ. However, consider the third sample, which has the mean x̄_3 and standard deviation s_3. As we see in Figure 6.11, the interval formed by subtracting 1.699s_3/√30 from x̄_3 and adding 1.699s_3/√30 to x̄_3 does not include the population mean μ. Because we are using t_{0.05} = 1.699 to form this interval, we expect that 90% of the intervals for our samples will include the population mean μ, and we see in Figure 6.11 that the results for our 10 samples of 30 EAI employees are what we would expect; the intervals for 9 of the 10 samples of n = 30 observations in this example include the mean μ. However, it is important to note that if we repeat this process of collecting 10 samples of n = 30 EAI employees, we may find that fewer than 9 of the resulting intervals x̄ ± 1.699s_x̄ include the mean μ or all 10 of the resulting intervals x̄ ± 1.699s_x̄ include the mean μ. However, on average, the resulting intervals x̄ ± 1.699s_x̄ for 9 of 10 samples of n = 30 observations will include the mean μ.

FIGURE 6.11 Intervals Formed Around Sample Means from 10 Independent Random Samples
[Figure not reproduced. Below the sampling distribution of x̄, centered at μ, each of the 10 sample means x̄_i is drawn with its interval from x̄_i − 1.699s_i/√30 to x̄_i + 1.699s_i/√30; only the interval for the third sample does not include μ.]

Now recall that the sample of n = 30 EAI employees from Section 6.2 had a sample mean salary of x̄ = $71,814 and a sample standard deviation of s = $3,340. Using x̄ ± 1.699(3,340/√30) to construct the interval estimate, we obtain 71,814 ± 1,036. Thus, the specific interval estimate of μ based on this specific sample is $70,778 to $72,850. Because approximately 90% of all the intervals constructed using x̄ ± 1.699(s/√30) will contain the population mean, we say that we are approximately 90% confident that the interval $70,778 to $72,850 includes the population mean μ. We also say that this interval has been established at the 90% confidence level. The value of 0.90 is referred to as the confidence coefficient, and the interval $70,778 to $72,850 is called the 90% confidence interval.

Another term sometimes associated with an interval estimate is the level of significance. The level of significance associated with an interval estimate is denoted by the Greek letter α. The level of significance and the confidence coefficient are related as follows:
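The "9 of every 10 intervals" interpretation can be checked by simulation. The Python sketch below is an illustration under stated assumptions: it treats the EAI salaries as normally distributed with μ = 71,800 and σ = 4,000 (the text does not state the population's shape), takes the value t_{0.05} = 1.699 as given, and uses an arbitrary seed and number of repetitions.

```python
import math
import random
import statistics

MU, SIGMA = 71_800, 4_000   # EAI population mean and standard deviation
N = 30                      # sample size
T_05_29 = 1.699             # t value for 90% confidence with 29 degrees of freedom

def t_interval_from_sample(sample, t_value):
    """Return (lower, upper) for x_bar +/- t_value * s / sqrt(n)."""
    x_bar = statistics.fmean(sample)
    margin = t_value * statistics.stdev(sample) / math.sqrt(len(sample))
    return x_bar - margin, x_bar + margin

rng = random.Random(1)      # fixed seed (arbitrary) so the run is repeatable
reps = 5_000
hits = 0
for _ in range(reps):
    # Assumption: the population is normal; other shapes could be substituted.
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    lower, upper = t_interval_from_sample(sample, T_05_29)
    hits += lower <= MU <= upper

coverage = hits / reps
print(f"Share of intervals containing the population mean: {coverage:.3f}")
```

The printed share should be close to 0.90, although any single batch of 10 intervals may contain the mean more or fewer than 9 times.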

α = level of significance = 1 − confidence coefficient

The level of significance is the probability that the interval estimation procedure will generate an interval that does not contain μ (such as the third sample in Figure 6.11). For example, the level of significance corresponding to a 0.90 confidence coefficient is α = 1 − 0.90 = 0.10. In general, we use the notation t_{α/2} to represent the value such that there is an area of α/2 in the upper tail of the t distribution (see Figure 6.12). If the sampling distribution of x̄ is normal, the margin of error for an interval estimate of a population mean μ is

\[ t_{\alpha/2}\, s_{\bar{x}} = t_{\alpha/2}\,\frac{s}{\sqrt{n}} \]

So if the sampling distribution of x̄ is normal, we find the interval estimate of the mean μ by subtracting this margin of error from the sample mean x̄ and adding this margin of error to the sample mean x̄. Using the notation we have developed, equation (6.7) can be used to find the confidence interval or interval estimate of the population mean μ.

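FIGURE 6.12 t Distribution with α/2 Area or Probability in the Upper Tail
[Figure not reproduced. t distribution centered at 0 with an area of α/2 in the upper tail to the right of t_{α/2}; horizontal axis: t.]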


Observe that the margin of error, t_{α/2}(s/√n), varies from sample to sample. This variation occurs because the sample standard deviation s varies depending on the sample selected. A large value for s results in a larger margin of error, while a small value for s results in a smaller margin of error.

Interval Estimate of a Population Mean

\[ \bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}, \tag{6.7} \]

where s is the sample standard deviation, α is the level of significance, and t_{α/2} is the t value providing an area of α/2 in the upper tail of the t distribution with n − 1 degrees of freedom.
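Equation (6.7) translates directly into code. The Python sketch below is a minimal illustration with a hypothetical function name; the t value is supplied by the caller (for example, from a t table or from Excel's T.INV.2T) rather than computed. The inputs are the 90% EAI figures used earlier in this section.

```python
import math

def t_interval(x_bar, s, n, t_value):
    """Interval estimate of a population mean: x_bar +/- t_value * s / sqrt(n).

    Returns (lower, upper, margin). The caller supplies t_value for the
    desired confidence level and n - 1 degrees of freedom.
    """
    margin = t_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin

# EAI sample: x_bar = 71,814, s = 3,340, n = 30, and t_0.05 = 1.699 (29 df).
lower, upper, margin = t_interval(71_814, 3_340, 30, 1.699)
print(f"margin of error = {margin:,.0f}")
print(f"90% confidence interval: {lower:,.0f} to {upper:,.0f}")
```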

If we want to find a 95% confidence interval for the mean μ in the EAI example, we again recognize that the degrees of freedom are 30 − 1 = 29 and then use Excel's T.INV.2T function to find t_{0.025} = 2.045. We have seen that s_x̄ = 611.3 in the EAI example, so the margin of error at the 95% level of confidence is t_{0.025}s_x̄ = 2.045(611.3) = 1,250. We also know that x̄ = 71,814 for the EAI example, so the 95% confidence interval is 71,814 ± 1,250, or $70,564 to $73,064.

It is important to note that a 95% confidence interval does not have a 95% probability of containing the population mean μ. Once constructed, a confidence interval will either contain the population parameter (μ in this EAI example) or not contain the population parameter. If we take several independent samples of the same size from our population and construct a 95% confidence interval for each of these samples, we would expect 95% of these confidence intervals to contain the mean μ. Our 95% confidence interval for the EAI example, $70,564 to $73,064, does indeed contain the population mean $71,800; however, if we took many independent samples of 30 EAI employees and developed a 95% confidence interval for each, we would expect that 5% of these confidence intervals would not include the population mean $71,800.

To further illustrate the interval estimation procedure, we will consider a study designed to estimate the mean credit card debt for the population of U.S. households. A sample of n = 70 households provided the credit card balances shown in Table 6.5. For this situation, no previous estimate of the population standard deviation σ is available. Thus, the sample data must be used to estimate both the population mean and the population standard deviation. Using the data in Table 6.5, we compute the sample mean x̄ = $9,312 and the sample standard deviation s = $4,007. We can use Excel's T.INV.2T function to compute the value of t_{α/2} to use in finding this confidence interval.
With a 95% confidence level and n − 1 = 69 degrees of freedom, we have that T.INV.2T(1 − 0.95, 69) = 1.995, so t_{α/2} = t_{(1−0.95)/2} = t_{0.025} = 1.995 for this confidence interval. We use equation (6.7) to compute an interval estimate of the population mean credit card balance.

\[ 9{,}312 \pm 1.995\,\frac{4{,}007}{\sqrt{70}} \]

\[ 9{,}312 \pm 955 \]

The point estimate of the population mean is $9,312, the margin of error is $955, and the 95% confidence interval is 9,312 − 955 = $8,357 to 9,312 + 955 = $10,267. Thus, we are 95% confident that the mean credit card balance for the population of all households is between $8,357 and $10,267.

Using Excel

We will use the credit card balances in Table 6.5 to illustrate how Excel can be used to construct an interval estimate of the population mean. We start by summarizing the data using Excel's Descriptive Statistics tool. Refer to Figure 6.13 as we describe the tasks involved. The formula worksheet is on the left; the value worksheet is on the right.

Step 1. Click the Data tab on the Ribbon
Step 2. In the Analysis group, click Data Analysis


TABLE 6.5 Credit Card Balances for a Sample of 70 Households (data file: NewBalance)

 9,430   14,661    7,159    9,071    9,691   11,032
 7,535   12,195    8,137    3,603   11,448    6,525
 4,078   10,544    9,467   16,804    8,279    5,239
 5,604   13,659   12,595   13,479    5,649    6,195
 5,179    7,061    7,917   14,044   11,298   12,584
 4,416    6,245   11,346    6,817    4,353   15,415
10,676   13,021   12,806    6,845    3,467   15,917
 1,627    9,719    4,972   10,493    6,191   12,591
10,112    2,200   11,356      615   12,851    9,743
 6,567   10,746    7,117   13,627    5,337   10,324
13,627   12,744    9,465   12,557    8,372
18,719    5,742   19,263    6,232    7,445

FIGURE 6.13 95% Confidence Interval for Credit Card Balances
[Figure not reproduced. Excel formula worksheet (left) and value worksheet (right). Column A holds the label NewBalance in cell A1 and the 70 balances in cells A2:A71. Columns C and D hold the Descriptive Statistics output and the interval calculations summarized below.]

NewBalance (cells C1:D16)
Mean                        9312
Standard Error              478.9281
Median                      9466
Mode                        13627
Standard Deviation          4007
Sample Variance             16056048
Kurtosis                    −0.2960
Skewness                    0.1879
Range                       18648
Minimum                     615
Maximum                     19263
Sum                         651840
Count                       70
Confidence Level(95.0%)     955

Interval calculations (cells C18:D20)
Point Estimate     =D3         9312
Lower Limit        =D18-D16    8357
Upper Limit        =D18+D16    10267

Note: Rows 21–69 are hidden.

If you can’t find Data Analysis on the Data tab, you may need to install the Analysis Toolpak add-in (which is included with Excel).

Step 3. When the Data Analysis dialog box appears, choose Descriptive Statistics from the list of Analysis Tools
Step 4. When the Descriptive Statistics dialog box appears:
    Enter A1:A71 in the Input Range box
    Select Grouped By Columns
    Select Labels in First Row
    Select Output Range:
    Enter C1 in the Output Range box
    Select Summary Statistics
    Select Confidence Level for Mean
    Enter 95 in the Confidence Level for Mean box
    Click OK


The margin of error using the t distribution can also be computed with the Excel function CONFIDENCE.T(alpha, s, n), where alpha is the level of significance, s is the sample standard deviation, and n is the sample size.

As Figure 6.13 illustrates, the sample mean (x̄) is in cell D3. The margin of error, labeled "Confidence Level(95.0%)," appears in cell D16. The value worksheet shows x̄ = 9,312 and a margin of error equal to 955. Cells D18:D20 provide the point estimate and the lower and upper limits for the confidence interval. Because the point estimate is just the sample mean, the formula =D3 is entered into cell D18. To compute the lower limit of the 95% confidence interval, x̄ − (margin of error), we enter the formula =D18-D16 into cell D19. To compute the upper limit of the 95% confidence interval, x̄ + (margin of error), we enter the formula =D18+D16 into cell D20. The value worksheet shows a lower limit of 8,357 and an upper limit of 10,267. In other words, the 95% confidence interval for the population mean is from 8,357 to 10,267.
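The same summary can be produced outside Excel. The Python sketch below is an illustrative alternative to the Descriptive Statistics tool, not a reproduction of it: it uses the 70 balances from Table 6.5, takes t_{0.025} = 1.995 for 69 degrees of freedom as given, and the variable names are hypothetical.

```python
import math
import statistics

# The 70 credit card balances from Table 6.5, read row by row.
balances = [
    9430, 14661, 7159, 9071, 9691, 11032,
    7535, 12195, 8137, 3603, 11448, 6525,
    4078, 10544, 9467, 16804, 8279, 5239,
    5604, 13659, 12595, 13479, 5649, 6195,
    5179, 7061, 7917, 14044, 11298, 12584,
    4416, 6245, 11346, 6817, 4353, 15415,
    10676, 13021, 12806, 6845, 3467, 15917,
    1627, 9719, 4972, 10493, 6191, 12591,
    10112, 2200, 11356, 615, 12851, 9743,
    6567, 10746, 7117, 13627, 5337, 10324,
    13627, 12744, 9465, 12557, 8372,
    18719, 5742, 19263, 6232, 7445,
]

T_025_69 = 1.995  # t value for 95% confidence with 69 degrees of freedom (given)

n = len(balances)
x_bar = statistics.fmean(balances)      # point estimate (cell D3 in Figure 6.13)
s = statistics.stdev(balances)          # sample standard deviation
std_error = s / math.sqrt(n)            # estimated standard error of the mean
margin = T_025_69 * std_error           # margin of error (cell D16)

print(f"n = {n}, mean = {x_bar:,.0f}, s = {s:,.0f}, standard error = {std_error:,.2f}")
print(f"95% confidence interval: {x_bar - margin:,.0f} to {x_bar + margin:,.0f}")
```

The printed values can be compared with the Mean, Standard Deviation, Standard Error, and Confidence Level(95.0%) entries in Figure 6.13.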

Interval Estimation of the Population Proportion

The general form of an interval estimate of a population proportion p is

p̄ ± Margin of error

The sampling distribution of p̄ plays a key role in computing the margin of error for this interval estimate. In Section 6.3 we said that the sampling distribution of p̄ can be approximated by a normal distribution whenever np ≥ 5 and n(1 − p) ≥ 5. Figure 6.14 shows the normal approximation of the sampling distribution of p̄. The mean of the sampling distribution of p̄ is the population proportion p, and the standard error of p̄ is

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} \tag{6.8} \]

The notation z_{α/2} represents the value such that there is an area of α/2 in the upper tail of the standard normal distribution (a normal distribution with a mean of zero and standard deviation of one).

Because the sampling distribution of p̄ is approximately normal, if we choose z_{α/2}σ_p̄ as the margin of error in an interval estimate of a population proportion, we know that 100(1 − α)% of the intervals generated will contain the true population proportion. But σ_p̄ cannot be used directly in the computation of the margin of error because p will not be known; p is what we are trying to estimate. So we estimate σ_p̄ with s_p̄, and then the margin of error for an interval estimate of a population proportion is given by

\[ \text{Margin of error} = z_{\alpha/2}\, s_{\bar{p}} = z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \tag{6.9} \]

FIGURE 6.14 Normal Approximation of the Sampling Distribution of p̄
[Figure not reproduced. Normal curve for the sampling distribution of p̄, centered at p, with σ_p̄ = √(p(1 − p)/n) and an area of α/2 in each tail, below p − z_{α/2}σ_p̄ and above p + z_{α/2}σ_p̄.]


With this margin of error, the general expression for an interval estimate of a population proportion is as follows.

Interval Estimate of a Population Proportion

$$\bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}, \tag{6.10}$$

where $\alpha$ is the level of significance and $z_{\alpha/2}$ is the $z$ value providing an area of $\alpha/2$ in the upper tail of the standard normal distribution.

TeeTimes

The Excel formula =NORM.S.INV(1 - α/2) computes the value of $z_{\alpha/2}$. For example, for $\alpha = 0.05$, $z_{0.025}$ =NORM.S.INV(1 - 0.05/2) = 1.96.
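For readers working outside Excel, the same lookup can be sketched in Python. This is a minimal illustration that assumes the standard library's `statistics.NormalDist` as a stand-in for NORM.S.INV; the function name `z_upper_tail` is hypothetical and not part of the text.

```python
from statistics import NormalDist


def z_upper_tail(alpha: float) -> float:
    """Return z such that the area in the upper tail of the
    standard normal distribution equals alpha / 2."""
    # inv_cdf plays the role of Excel's NORM.S.INV here.
    return NormalDist().inv_cdf(1 - alpha / 2)


# alpha = 0.05 corresponds to a 95% confidence level.
print(round(z_upper_tail(0.05), 2))  # 1.96
```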

The following example illustrates the computation of the margin of error and interval estimate for a population proportion. A national survey of 900 women golfers was conducted to learn how women golfers view their treatment at golf courses in the United States. The survey found that 396 of the women golfers were satisfied with the availability of tee times. Thus, the point estimate of the proportion of the population of women golfers who are satisfied with the availability of tee times is 396/900 = 0.44. Using equation (6.10) and a 95% confidence level:

$$\bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$

$$0.44 \pm 1.96\sqrt{\frac{0.44(1 - 0.44)}{900}}$$

$$0.44 \pm 0.0324$$

Thus, the margin of error is 0.0324 and the 95% confidence interval estimate of the population proportion is 0.4076 to 0.4724. Using percentages, the survey results enable us to state with 95% confidence that between 40.76% and 47.24% of all women golfers are satisfied with the availability of tee times.
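The arithmetic in equation (6.10) can also be checked with a short Python sketch. The function name `proportion_interval` and the synthetic `responses` list are assumptions made for illustration; the list simply stands in for the 900 survey responses (396 Yes, 504 No) and is not the actual TeeTimes file.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes: int, n: int, confidence: float = 0.95):
    """Return (point estimate, margin of error, lower limit, upper limit)
    for a population proportion, following equation (6.10)."""
    p_bar = successes / n                      # sample proportion
    alpha = 1 - confidence                     # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z value for upper-tail area alpha/2
    std_error = sqrt(p_bar * (1 - p_bar) / n)  # estimated standard error
    margin = z * std_error                     # equation (6.9)
    return p_bar, margin, p_bar - margin, p_bar + margin


# Hypothetical stand-in for the survey data: 396 "Yes" and 504 "No".
responses = ["Yes"] * 396 + ["No"] * 504
count_yes = responses.count("Yes")

p_bar, margin, lower, upper = proportion_interval(count_yes, len(responses))
print(f"{p_bar:.2f} +/- {margin:.4f} -> ({lower:.4f}, {upper:.4f})")
# 0.44 +/- 0.0324 -> (0.4076, 0.4724)
```

Changing the `confidence` argument (for example, to 0.99) widens or narrows the interval in the same way that changing the confidence coefficient would in a worksheet.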

The file TeeTimes displayed in Figure 6.15 can be used as a template for developing confidence intervals about a population proportion $p$, by entering new problem data in column A and appropriately adjusting the formulas in column D.

Using Excel

Excel can be used to construct an interval estimate of the population proportion of women golfers who are satisfied with the availability of tee times. The responses in the survey were recorded as a Yes or No in the file TeeTimes for each woman surveyed. Refer to Figure 6.15 as we describe the tasks involved in constructing a 95% confidence interval. The formula worksheet is on the left; the value worksheet appears on the right.

The descriptive statistics we need and the response of interest are provided in cells D3:D6. Because Excel’s COUNT function works only with numerical data, we used the COUNTA function in cell D3 to compute the sample size. The response for which we want to develop an interval estimate, Yes or No, is entered into cell D4. Figure 6.15 shows that Yes has been entered into cell D4, indicating that we want to develop an interval estimate of the population proportion of women golfers who are satisfied with the availability of tee times. If we had wanted to develop an interval estimate of the population proportion of women golfers who are not satisfied with the availability of tee times, we would have entered No in cell D4. With Yes entered in cell D4, the COUNTIF function in cell D5 counts the number of Yes responses in the sample. The sample proportion is then computed in cell D6 by dividing the number of Yes responses in cell D5 by the sample size in cell D3.

Cells D8:D10 are used to compute the appropriate $z$ value. The confidence coefficient (0.95) is entered into cell D8 and the level of significance ($\alpha$) is computed in cell D9 by entering the formula =1-D8. The $z$ value corresponding to an upper-tail area of $\alpha/2$ is computed by entering the formula =NORM.S.INV(1-D9/2) into cell D10. The value worksheet shows that $z_{0.025} = 1.96$.

Cells D12:D13 provide the estimate of the standard error and the margin of error. In cell D12, we entered the formula =SQRT(D6*(1-D6)/D3) to compute the standard error using the sample proportion and the sample size as inputs. The formula =D10*D12 is entered into cell D13 to compute the margin of error corresponding to equation (6.9).

Cells D15:D17 provide the point estimate and the lower and upper limits for a confidence interval. The point estimate in cell D15 is the sample proportion. The lower and upper limits in cells D16 and D17 are obtained by subtracting and adding the margin of error to the point estimate. We note that the 95% confidence interval for the proportion of women golfers who are satisfied with the availability of tee times is 0.4076 to 0.4724.

FIGURE 6.15 95% Confidence Interval for Survey of Women Golfers

[Figure: column A of both worksheets lists the Yes/No responses (header "Response" in cell A1, data in cells A2:A901). The block titled "Interval Estimate of a Population Proportion" is summarized below. Annotation on the value worksheet: "Enter Yes as the Response of Interest".]

Cell   Label                            Formula worksheet        Value worksheet
D3     Sample Size                      =COUNTA(A2:A901)         900
D4     Response of Interest             Yes                      Yes
D5     Count for Response               =COUNTIF(A2:A901,D4)     396
D6     Sample Proportion                =D5/D3                   0.44
D8     Confidence Coefficient           0.95                     0.95
D9     Level of Significance (alpha)    =1-D8                    0.05
D10    z Value                          =NORM.S.INV(1-D9/2)      1.96
D12    Standard Error                   =SQRT(D6*(1-D6)/D3)      0.0165
D13    Margin of Error                  =D10*D12                 0.0324
D15    Point Estimate                   =D6                      0.44
D16    Lower Limit                      =D15-D13                 0.4076
D17    Upper Limit                      =D15+D13                 0.4724

Notes + Comments

1. The reason the number of degrees of freedom associated with the $t$ value in equation (6.7) is $n - 1$ concerns the use of $s$ as an estimate of the population standard deviation $\sigma$. The expression for the sample standard deviation is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}.$$

Degrees of freedom refer to the number of independent pieces of information that go into the computation of $\sum (x_i - \bar{x})^2$. The $n$ pieces of information involved in computing $\sum (x_i - \bar{x})^2$ are as follows: $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$. Note that $\sum (x_i - \bar{x}) = 0$ for any data set. Thus, only $n - 1$ of the $x_i - \bar{x}$ values are independent; that is, if we know $n - 1$ of the values, the remaining value can be determined exactly by using the condition that the sum of the $x_i - \bar{x}$ values must be 0. Thus, $n - 1$ is the number of degrees of freedom associated with $\sum (x_i - \bar{x})^2$ and hence the number of degrees of freedom for the $t$ distribution in equation (6.7).

2. In most applications, a sample size of $n \geq 30$ is adequate when using equation (6.7) to develop an interval estimate of a population mean. However, if the population distribution is highly skewed or contains outliers, most statisticians would recommend increasing the sample size to 50 or more. If the population is not normally distributed but is roughly symmetric, sample sizes as small as 15 can be expected to provide good approximate confidence intervals. With smaller sample sizes, equation (6.7) should be used if the analyst believes, or is willing to assume, that the population distribution is at least approximately normal.

3. What happens to confidence interval estimates of $\bar{x}$ when the population is skewed? Consider a population that is skewed to the right, with large data values stretching the distribution to the right. When such skewness exists, the sample mean $\bar{x}$ and the sample standard deviation $s$ are positively correlated. Larger values of $s$ tend to be associated with larger values of $\bar{x}$. Thus, when $\bar{x}$ is larger than the population mean, $s$ tends to be larger than $\sigma$. This skewness causes the margin of error, $t_{\alpha/2}(s/\sqrt{n})$, to be larger than it would be with $\sigma$ known. The confidence interval with the larger margin of error tends to include the population mean more often than it would if the true value of $\sigma$ were used. But when $\bar{x}$ is smaller than the population mean, the correlation between $\bar{x}$ and $s$ causes the margin of error to be small. In this case, the confidence interval with the smaller margin of error tends to miss the population mean more than it would if we knew $\sigma$ and used it. For this reason, we recommend using larger sample sizes with highly skewed population distributions.

4. We can find the sample size necessary to provide the desired margin of error at the chosen confidence level. Let $E$ = the desired margin of error. Then

• the sample size for an interval estimate of a population mean is $n = \dfrac{(z_{\alpha/2})^2 \sigma^2}{E^2}$, where $E$ is the margin of error that the user is willing to accept, and the value of $z_{\alpha/2}$ follows directly from the confidence level to be used in developing the interval estimate.

• the sample size for an interval estimate of a population proportion is $n = \dfrac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2}$, where the planning value $p^*$ can be chosen by use of (i) the sample proportion from a previous sample of the same or similar units, (ii) a pilot study to select a preliminary sample, (iii) judgment or a “best guess” for the value of $p^*$, or (iv) if none of the preceding alternatives apply, use of the planning value of $p^* = 0.50$.

5. The desired margin of error for estimating a population proportion is almost always 0.10 or less. In national public opinion polls conducted by organizations such as Gallup and Harris, a 0.03 or 0.04 margin of error is common. With such margins of error, the sample size found with $n = \dfrac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2}$ will almost always be sufficient to satisfy the requirements of $np \geq 5$ and $n(1 - p) \geq 5$ for using a normal distribution as an approximation for the sampling distribution of $\bar{p}$.
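The two sample-size formulas in note 4 can be sketched in Python as follows. The function names and the example inputs ($\sigma = 20$, $E = 2$) are hypothetical values chosen only for illustration; the 0.03 margin of error echoes the polling example in note 5. Rounding up to the next whole unit is an assumption of this sketch, made so that the resulting margin of error does not exceed $E$.

```python
from math import ceil
from statistics import NormalDist


def z_value(confidence: float) -> float:
    """z value with an upper-tail area of alpha/2, where alpha = 1 - confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_for_mean(sigma: float, margin: float, confidence: float = 0.95) -> int:
    """n = z^2 * sigma^2 / E^2, rounded up to the next whole unit."""
    z = z_value(confidence)
    return ceil(z**2 * sigma**2 / margin**2)


def sample_size_for_proportion(margin: float, p_star: float = 0.50,
                               confidence: float = 0.95) -> int:
    """n = z^2 * p*(1 - p*) / E^2, rounded up to the next whole unit."""
    z = z_value(confidence)
    return ceil(z**2 * p_star * (1 - p_star) / margin**2)


# Planning value p* = 0.50 and a 0.03 margin of error at 95% confidence.
print(sample_size_for_proportion(0.03))          # 1068
# Hypothetical planning value sigma = 20 with desired margin of error E = 2.
print(sample_size_for_mean(sigma=20, margin=2))  # 385
```

Using $p^* = 0.50$ gives the largest value of $p^*(1 - p^*)$, so it yields the most conservative (largest) sample size among the possible planning values.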

6.5 Hypothesis Tests

Throughout this chapter, we have shown how a sample could be used to develop point and interval estimates of population parameters such as the mean $\mu$ and the proportion $p$. In this section, we continue the discussion of statistical inference by showing how hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.

In hypothesis testing, we begin by making a tentative conjecture about a population parameter. This tentative conjecture is called the null hypothesis and is denoted by $H_0$. We then define another hypothesis, called the alternative hypothesis, which is the opposite of what is stated in the null hypothesis. The alternative hypothesis is denoted by $H_a$. The hypothesis testing procedure uses data from a sample to test the validity of the two competing statements about a population that are indicated by $H_0$ and $H_a$. This section shows how hypothesis tests can be conducted about a population mean and a population proportion. We begin by providing examples that illustrate approaches to developing null and alternative hypotheses.

Developing Null and Alternative Hypotheses

Learning to formulate hypotheses correctly will take some practice. Expect some initial confusion about the proper choice of the null and alternative hypotheses. The examples in this section are intended to provide guidelines.

It is not always obvious how the null and alternative hypotheses should be formulated. Care must be taken to structure the hypotheses appropriately so that the hypothesis testing conclusion provides the information the researcher or decision maker wants. The context of the situation is very important in determining how the hypotheses should be stated. All hypothesis testing applications involve collecting a random sample and using the sample results to provide evidence for drawing a conclusion. Good questions to consider when formulating the null and alternative hypotheses are, What is the purpose of collecting the sample? What conclusions are we hoping to make?

In the introduction to this section, we stated that the null hypothesis $H_0$ is a tentative conjecture about a population parameter such as a population mean or a population proportion. The alternative hypothesis $H_a$ is a statement that is the opposite of what is stated in the null hypothesis. In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis. In other situations, it is easier to identify the null hypothesis first and then develop the alternative hypothesis. We will illustrate these situations in the following examples.

The Alternative Hypothesis as a Research Hypothesis

Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis. In these situations, it is often best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support. Consider a particular automobile that currently attains a fuel efficiency of 24 miles per gallon for city driving. A product research group has developed a new fuel injection system designed to increase the miles-per-gallon rating. The group will run controlled tests with the new fuel injection system looking for statistical support for the conclusion that the new fuel injection system provides more miles per gallon than the current system.

Several new fuel injection units will be manufactured, installed in test automobiles, and subjected to research-controlled driving conditions. The sample mean miles per gallon for these automobiles will be computed and used in a hypothesis test to determine whether it can be concluded that the new system provides more than 24 miles per gallon. In terms of the population mean miles per gallon $\mu$, the research hypothesis $\mu > 24$ becomes the alternative hypothesis. Since the current system provides an average or mean of 24 miles per gallon, we will make the tentative conjecture that the new system is no better than the current system and choose $\mu \leq 24$ as the null hypothesis. The null and alternative hypotheses are as follows:

$$H_0: \mu \leq 24$$
$$H_a: \mu > 24$$

The conclusion that the research hypothesis is true is made if the sample data provide sufficient evidence to show that the null hypothesis can be rejected.

If the sample results lead to the conclusion to reject $H_0$, the inference can be made that $H_a: \mu > 24$ is true. The researchers have the statistical support to state that the new fuel injection system increases the mean number of miles per gallon. The production of automobiles with the new fuel injection system should be considered. However, if the sample results lead to the conclusion that $H_0$ cannot be rejected, the researchers cannot conclude that the new fuel injection system is better than the current system. Production of automobiles with the new fuel injection system on the basis of better gas mileage cannot be justified. Perhaps more research and further testing can be conducted.

Successful companies stay competitive by developing new products, new methods, and new services that are better than what is currently available. Before adopting something new, it is desirable to conduct research to determine whether there is statistical support for the conclusion that the new approach is indeed better. In such cases, the research hypothesis is stated as the alternative hypothesis. For example, a new teaching method is developed that is believed to be better than the current method. The alternative hypothesis is that the new method is better; the null hypothesis is that the new method is no better than the old method. A new sales force bonus plan is developed in an attempt to increase sales. The alternative hypothesis is that the new bonus plan increases sales; the null hypothesis is that the new bonus plan does not increase sales. A new drug is developed with the goal of lowering blood pressure more than an existing drug. The alternative hypothesis is that the new drug lowers blood pressure more than the existing drug; the null hypothesis is that the new drug does not provide lower blood pressure than the existing drug. In each case, rejection of the null hypothesis $H_0$ provides statistical support for the research hypothesis.
We will see many examples of hypothesis tests in research situations such as these throughout this chapter and in the remainder of the text.

The Null Hypothesis as a Conjecture to Be Challenged

Of course, not all hypothesis tests involve research hypotheses. In the following discussion we consider applications of hypothesis testing where we begin with a belief or a conjecture that a statement about the value of a population parameter is true. We will then use a hypothesis test to challenge the conjecture and determine whether there is statistical evidence to conclude that the conjecture is incorrect. In these situations, it is helpful to develop the null hypothesis first. The null hypothesis $H_0$ expresses the belief or conjecture about the value of the population parameter. The alternative hypothesis $H_a$ is that the belief or conjecture is incorrect.

As an example, consider the situation of a manufacturer of soft drink products. The label on a soft drink bottle states that it contains 67.6 fluid ounces. We consider the label correct provided the population mean filling weight for the bottles is at least 67.6 fluid ounces. With no reason to believe otherwise, we would give the manufacturer the benefit of the doubt and assume that the statement provided on the label is correct. Thus, in a hypothesis test about the population mean fluid weight per bottle, we would begin with the conjecture that the label is correct and state the null hypothesis as $\mu \geq 67.6$. The challenge to this conjecture would imply that the label is incorrect and the bottles are being underfilled. This challenge would be stated as the alternative hypothesis $\mu < 67.6$. Thus, the null and alternative hypotheses are as follows:

$$H_0: \mu \geq 67.6$$
$$H_a: \mu < 67.6$$

A manufacturer’s product information is usually assumed to be true and stated as the null hypothesis. The conclusion that the information is incorrect can be made if the null hypothesis is rejected.

A government agency with the responsibility for validating manufacturing labels could select a sample of soft drink bottles, compute the sample mean filling weight, and use the sample results to test the preceding hypotheses. If the sample results lead to the conclusion to reject $H_0$, the inference that $H_a: \mu < 67.6$ is true can be made. With this statistical support, the agency is justified in concluding that the label is incorrect and that the bottles are being underfilled. Appropriate action to force the manufacturer to comply with labeling standards would be considered. However, if the sample results indicate $H_0$ cannot be rejected, the conjecture that the manufacturer’s labeling is correct cannot be rejected. With this conclusion, no action would be taken.

Let us now consider a variation of the soft drink bottle-filling example by viewing the same situation from the manufacturer’s point of view. The bottle-filling operation has been designed to fill soft drink bottles with 67.6 fluid ounces as stated on the label. The company does not want to underfill the containers because that could result in complaints from customers or, perhaps, a government agency. However, the company does not want to overfill containers either because putting more soft drink than necessary into the containers would be an unnecessary cost. The company’s goal would be to adjust the bottle-filling operation so that the population mean filling weight per bottle is 67.6 fluid ounces as specified on the label.

Although this is the company’s goal, from time to time any production process can get out of adjustment. If this occurs in our example, underfilling or overfilling of the soft drink bottles will occur. In either case, the company would like to know about it in order to correct the situation by readjusting the bottle-filling operation to result in the designated 67.6 fluid ounces.
In this hypothesis testing application, we would begin with the conjecture that the production process is operating correctly and state the null hypothesis as $\mu = 67.6$ fluid ounces. The alternative hypothesis that challenges this conjecture is that $\mu \neq 67.6$, which indicates that either overfilling or underfilling is occurring. The null and alternative hypotheses for the manufacturer’s hypothesis test are as follows:

$$H_0: \mu = 67.6$$
$$H_a: \mu \neq 67.6$$

Suppose that the soft drink manufacturer uses a quality-control procedure to periodically select a sample of bottles from the filling operation and computes the sample mean filling weight per bottle. If the sample results lead to the conclusion to reject $H_0$, the inference is made that $H_a: \mu \neq 67.6$ is true. We conclude that the bottles are not being filled properly and the production process should be adjusted to restore the population mean to 67.6 fluid ounces per bottle. However, if the sample results indicate $H_0$ cannot be rejected, the conjecture that the manufacturer’s bottle-filling operation is functioning properly cannot be rejected. In this case, no further action would be taken and the production operation would continue to run.

The two preceding forms of the soft drink manufacturing hypothesis test show that the null and alternative hypotheses may vary depending on the point of view of the researcher or decision maker. To formulate hypotheses correctly, it is important to understand the context of the situation and to structure the hypotheses to provide the information the researcher or decision maker wants.

Summary of Forms for Null and Alternative Hypotheses

The hypothesis tests in this chapter involve two population parameters: the population mean and the population proportion. Depending on the situation, hypothesis tests about a population parameter may take one of three forms: Two use inequalities in the null hypothesis; the third uses an equality in the null hypothesis. For hypothesis tests involving a population mean, we let $\mu_0$ denote the hypothesized value of the population mean and we must choose one of the following three forms for the hypothesis test:

$$H_0: \mu \geq \mu_0 \qquad\qquad H_0: \mu \leq \mu_0 \qquad\qquad H_0: \mu = \mu_0$$
$$H_a: \mu < \mu_0 \qquad\qquad H_a: \mu > \mu_0 \qquad\qquad H_a: \mu \neq \mu_0$$

The three possible forms of hypotheses $H_0$ and $H_a$ are shown here. Note that the equality always appears in the null hypothesis $H_0$.

For reasons that will be clear later, the first two forms are called one-tailed tests. The third form is called a two-tailed test. In many situations, the choice of $H_0$ and $H_a$ is not obvious and judgment is necessary to select the proper form. However, as the preceding forms show, the equality part of the expression (either $\geq$, $\leq$, or $=$) always appears in the null hypothesis. In selecting the proper form of $H_0$ and $H_a$, keep in mind that the alternative hypothesis is often what the test is attempting to establish. Hence, asking whether the user is looking for evidence to support $\mu < \mu_0$, $\mu > \mu_0$, or $\mu \neq \mu_0$ will help determine $H_a$.

Type I and Type II Errors

The null and alternative hypotheses are competing statements about the population. Either the null hypothesis $H_0$ is true or the alternative hypothesis $H_a$ is true, but not both. Ideally the hypothesis testing procedure should lead to the acceptance of $H_0$ when $H_0$ is true and the rejection of $H_0$ when $H_a$ is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors. Table 6.6 illustrates the two kinds of errors that can be made in hypothesis testing.

The first row of Table 6.6 shows what can happen if the conclusion is to accept $H_0$. If $H_0$ is true, this conclusion is correct. However, if $H_a$ is true, we made a Type II error; that is, we accepted $H_0$ when it is false. The second row of Table 6.6 shows what can happen if the conclusion is to reject $H_0$. If $H_0$ is true, we made a Type I error; that is, we rejected $H_0$ when it is true. However, if $H_a$ is true, rejecting $H_0$ is correct.

Recall the hypothesis testing illustration in which an automobile product research group developed a new fuel injection system designed to increase the miles-per-gallon rating of a particular automobile. With the current model obtaining an average of 24 miles per gallon, the hypothesis test was formulated as follows:

$$H_0: \mu \leq 24$$
$$H_a: \mu > 24$$

The alternative hypothesis, $H_a: \mu > 24$, indicates that the researchers are looking for sample evidence to support the conclusion that the population mean miles per gallon with the new fuel injection system is greater than 24. In this application, the Type I error of rejecting $H_0$ when it is true corresponds to the researchers claiming that the new system improves the miles-per-gallon rating ($\mu > 24$) when in fact the new system is no better than the current system.

In contrast, the Type II error of accepting $H_0$ when it is false corresponds to the researchers concluding that the new system is no better than the current system ($\mu \leq 24$) when in fact the new system improves miles-per-gallon performance.

Table 6.6 Errors and Correct Conclusions in Hypothesis Testing

                                  Population Condition
Conclusion             H0 True                 Ha True
Do Not Reject H0       Correct conclusion      Type II error
Reject H0              Type I error            Correct conclusion
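The four cells of Table 6.6 can be restated as a small lookup, which may help when checking which error applies in a given scenario. This is only an illustrative sketch; the function name `classify_conclusion` is hypothetical.

```python
def classify_conclusion(h0_is_true: bool, reject_h0: bool) -> str:
    """Return the Table 6.6 outcome for a population condition and a test conclusion."""
    if reject_h0:
        # Rejecting a true H0 is a Type I error; rejecting a false H0 is correct.
        return "Type I error" if h0_is_true else "Correct conclusion"
    # Not rejecting a true H0 is correct; not rejecting a false H0 is a Type II error.
    return "Correct conclusion" if h0_is_true else "Type II error"


# Fuel injection example: the researchers claim an improvement (reject H0)
# when the new system is in fact no better (H0 true).
print(classify_conclusion(h0_is_true=True, reject_h0=True))   # Type I error
```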


For the miles-per-gallon rating hypothesis test, the null hypothesis is $H_0: \mu \leq 24$. Suppose the null hypothesis is true as an equality; that is, $\mu = 24$. The probability of making a Type I error when the null hypothesis is true as an equality is called the level of significance. Thus, for the miles-per-gallon rating hypothesis test, the level of significance is the probability of rejecting $H_0: \mu \leq 24$ when $\mu = 24$. Because of the importance of this concept, we now restate the definition of level of significance. The Greek symbol $\alpha$ (alpha) is used to denote the level of significance, and common choices for $\alpha$ are 0.05 and 0.01.

Level of Significance

The level of significance is the probability of making a Type I error when the null hypothesis is true as an equality.

If the sample data are consistent with the null hypothesis $H_0$, we will follow the practice of concluding “do not reject $H_0$.” This conclusion is preferred over “accept $H_0$,” because the conclusion to accept $H_0$ puts us at risk of making a Type II error.

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting $\alpha$, that person is controlling the probability of making a Type I error. If the cost of making a Type I error is high, small values of $\alpha$ are preferred. If the cost of making a Type I error is not too high, larger values of $\alpha$ are typically used. Applications of hypothesis testing that only control the Type I error are called significance tests. Many applications of hypothesis testing are of this type.

Although most applications of hypothesis testing control the probability of making a Type I error, they do not always control the probability of making a Type II error. Hence, if we decide to accept $H_0$, we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement “do not reject $H_0$” instead of “accept $H_0$.” Using the statement “do not reject $H_0$” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting $H_0$, the statistician avoids the risk of making a Type II error. Whenever the probability of making a Type II error has not been determined and controlled, we will not make the statement “accept $H_0$.” In such cases, only two conclusions are possible: do not reject $H_0$ or reject $H_0$.

Although controlling for a Type II error in hypothesis testing is not common, it can be done. Specialized texts describe procedures for determining and controlling the probability of making a Type II error.¹ If proper controls have been established for this error, action based on the “accept $H_0$” conclusion can be appropriate.
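The meaning of the level of significance can be illustrated with a small simulation. The sketch below is not the chapter's procedure: to stay within the Python standard library it assumes a known population standard deviation and uses a $z$ test rather than a $t$-based test. The values $\sigma = 3$, $n = 36$, the number of trials, and the random seed are all hypothetical; only $\mu_0 = 24$ comes from the fuel injection example. Samples are drawn with $H_0$ true as an equality, so every rejection is a Type I error.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(mu0=24.0, sigma=3.0, n=36, alpha=0.05,
                      trials=5000, seed=1):
    """Estimate how often an upper-tail z test rejects H0: mu <= mu0
    when the population mean really equals mu0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # upper-tail critical value
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 holds as an equality.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if z > z_crit:
            rejections += 1  # rejecting a true H0: a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f} (alpha = 0.05)")
```

With enough trials, the estimated rate should be close to the chosen $\alpha$, which is the sense in which selecting $\alpha$ controls the probability of a Type I error.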

Hypothesis Test of the Population Mean

In this section, we describe how to conduct hypothesis tests about a population mean for the practical situation in which the sample must be used to develop estimates of both $\mu$ and $\sigma$. Thus, to conduct a hypothesis test about a population mean, the sample mean $\bar{x}$ is used as an estimate of $\mu$ and the sample standard deviation $s$ is used as an estimate of $\sigma$.

One-Tailed Test

One-tailed tests about a population mean take one of the following two forms:

Lower-Tail Test: $H_0: \mu \geq \mu_0$, $H_a: \mu < \mu_0$
Upper-Tail Test: $H_0: \mu \leq \mu_0$, $H_a: \mu > \mu_0$

Let us consider an example involving a lower-tail test. The Federal Trade Commission (FTC) periodically conducts statistical studies designed to test the claims that manufacturers make about their products. For example, the label on a large can of Hilltop Coffee states that the can contains 3 pounds of coffee. The FTC knows that Hilltop’s production process cannot place exactly 3 pounds of coffee in each can, even if the mean filling weight for the population of all cans filled is 3 pounds per can. However, as long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the FTC interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can. We will show how the FTC can check Hilltop’s claim by conducting a lower-tail hypothesis test.

The first step is to develop the null and alternative hypotheses for the test. If the population mean filling weight is at least 3 pounds per can, Hilltop’s claim is correct. This establishes the null hypothesis for the test. However, if the population mean weight is less than 3 pounds per can, Hilltop’s claim is incorrect. This establishes the alternative hypothesis. With $\mu$ denoting the population mean filling weight, the null and alternative hypotheses are as follows:

$$H_0: \mu \geq 3$$
$$H_a: \mu < 3$$

¹ See, for example, D. R. Anderson, D. J. Sweeney, T. A. Williams, J. D. Camm, J. J. Cochran, M. J. Fry, and J. W. Ohlmann, Statistics for Business and Economics, 14th ed. (Mason, OH: Cengage Learning, 2020).

Coffee

The standard error of $\bar{x}$ is the standard deviation of the sampling distribution of $\bar{x}$.

Note that the hypothesized value of the population mean is $\mu_0 = 3$. If the sample data indicate that $H_0$ cannot be rejected, the statistical evidence does not support the conclusion that a label violation has occurred. Hence, no action should be taken against Hilltop. However, if the sample data indicate that $H_0$ can be rejected, we will conclude that the alternative hypothesis, $H_a: \mu < 3$, is true. In this case a conclusion of underfilling and a charge of a label violation against Hilltop would be justified.

Suppose a sample of 36 cans of coffee is selected and the sample mean $\bar{x}$ is computed as an estimate of the population mean $\mu$. If the value of the sample mean $\bar{x}$ is less than 3 pounds, the sample results will cast doubt on the null hypothesis. What we want to know is how much less than 3 pounds must $\bar{x}$ be before we would be willing to declare the difference significant and risk making a Type I error by falsely accusing Hilltop of a label violation. A key factor in addressing this issue is the value the decision maker selects for the level of significance.

As noted in the preceding section, the level of significance, denoted by $\alpha$, is the probability of making a Type I error by rejecting $H_0$ when the null hypothesis is true as an equality. The decision maker must specify the level of significance. If the cost of making a Type I error is high, a small value should be chosen for the level of significance. If the cost is not high, a larger value is more appropriate. In the Hilltop Coffee study, the director of the FTC’s testing program made the following statement: “If the company is meeting its weight specifications at $\mu = 3$, I do not want to take action against them. But I am willing to risk a 1% chance of making such an error.” From the director’s statement, we set the level of significance for the hypothesis test at $\alpha = 0.01$. Thus, we must design the hypothesis test so that the probability of making a Type I error when $\mu = 3$ is 0.01.
For the Hilltop Coffee study, by developing the null and alternative hypotheses and specifying the level of significance for the test, we have carried out the first two steps required in conducting every hypothesis test. We are now ready to perform the third step of hypothesis testing: collect the sample data and compute the value of what is called a test statistic.

Test Statistic

From the study of sampling distributions in Section 6.3 we know that as the sample size increases, the sampling distribution of $\bar{x}$ approaches a normal distribution. Figure 6.16 shows the sampling distribution of $\bar{x}$ when the null hypothesis is true as an equality, that is, when $\mu = \mu_0 = 3$.² Note that $\sigma_{\bar{x}}$, the standard error of $\bar{x}$, is estimated by $s_{\bar{x}} = s/\sqrt{n} = 0.17/\sqrt{36} = 0.028$.

Recall that in Section 6.4, we showed that an interval estimate of a population mean is based on a probability distribution known as the t distribution. The t distribution is similar to the standard normal distribution, but accounts for the additional variability introduced when using a sample to estimate both the population mean and population standard deviation. Hypothesis tests about a population mean are also based on the t distribution. Specifically, if $\bar{x}$ is normally distributed, the sampling distribution of

$$ t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{\bar{x} - 3}{0.028} $$

² In constructing sampling distributions for hypothesis tests, it is assumed that $H_0$ is satisfied as an equality.


FIGURE 6.16 Sampling Distribution of $\bar{x}$ for the Hilltop Coffee Study When the Null Hypothesis Is True as an Equality ($\mu = 3$)

[Figure: sampling distribution of $\bar{x}$, centered at $\mu = 3$]

Although the t distribution is based on an assumption that the population from which we are sampling is normally distributed, research shows that when the sample size is large enough, this assumption can be relaxed considerably.


is a t distribution with $n - 1$ degrees of freedom. The value of $t$ represents how far the sample mean lies above or below the hypothesized value of the population mean, measured in units of the standard error of the sample mean. A value of $t = -1$ means that the value of $\bar{x}$ is 1 standard error below the hypothesized value of the mean, a value of $t = -2$ means that the value of $\bar{x}$ is 2 standard errors below the hypothesized value of the mean, and so on.

For this lower-tail hypothesis test, we can use Excel to find the lower-tail probability corresponding to any $t$ value (as we show later in this section). For example, Figure 6.17 illustrates that the lower-tail area at $t = -3.00$ is 0.0025. Hence, the probability of obtaining a value of $t$ that is three or more standard errors below the mean is 0.0025. As a result, if the null hypothesis is true (i.e., if the population mean is 3), the probability of obtaining a value of $\bar{x}$ that is 3 or more standard errors below the hypothesized population mean $\mu_0 = 3$ is also 0.0025. Because such a result is unlikely if the null hypothesis is true, this leads us to doubt our null hypothesis.

We use the t-distributed random variable $t$ as a test statistic to determine whether $\bar{x}$ deviates from the hypothesized value of $\mu$ enough to justify rejecting the null hypothesis. With $s_{\bar{x}} = s/\sqrt{n}$, the test statistic is as follows:

Test Statistic for Hypothesis Tests about a Population Mean

$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \tag{6.11} $$
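As a minimal sketch of equation (6.11), the Python snippet below computes the estimated standard error and the test statistic using the Hilltop Coffee values given in the text (s = 0.17, n = 36, hypothesized mean 3). The function names and the illustrative sample mean are assumptions introduced here for demonstration; they are not part of the study data.

```python
import math


def standard_error(s, n):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)


def t_statistic(x_bar, mu_0, s, n):
    """Test statistic from equation (6.11): (x_bar - mu_0) / (s / sqrt(n))."""
    return (x_bar - mu_0) / standard_error(s, n)


# Values stated in the text for the Hilltop Coffee study.
s, n, mu_0 = 0.17, 36, 3.0
se = standard_error(s, n)  # roughly 0.028

# Hypothetical sample mean placed exactly two standard errors below mu_0,
# chosen only to illustrate that t counts standard errors.
x_bar_example = mu_0 - 2 * se
t_example = t_statistic(x_bar_example, mu_0, s, n)

print(f"standard error = {se:.3f}")
print(f"t = {t_example:.2f}")
```

Because the hypothetical sample mean sits two standard errors below 3, the computed test statistic should be very close to -2, matching the interpretation of $t$ as a count of standard errors.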

The key question for a lower-tail test is, How small must the test statistic $t$ be before we choose to reject the null hypothesis? We will draw our conclusion by using the value of the test statistic $t$ to compute a probability called a p value. A small p value indicates that the value of the test statistic is unusual given the assumption that $H_0$ is true.

p Value

A p value is the probability, assuming that $H_0$ is true, of obtaining a random sample of size $n$ that results in a test statistic at least as extreme as the one observed in the current sample. The p value measures the strength of the evidence provided by the sample against the null hypothesis. Smaller p values indicate stronger evidence against $H_0$ because they suggest that the sample is increasingly unlikely to occur if $H_0$ is true.

Let us see how the p value is computed and used. The value of the test statistic is used to compute the p value. The method used depends on whether the test is a lower-tail, an upper-tail, or a two-tailed test. For a lower-tail test, the p value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample. Thus, to compute the p value for the lower-tail test, we must use the t distribution to find


FIGURE 6.17 Lower-Tail Probability for $t = -3$ from a t Distribution with 35 Degrees of Freedom

[Figure: t distribution curve with a shaded lower-tail area of 0.0025 to the left of $t = -3$]

the probability that $t$ is less than or equal to the value of the test statistic. After computing the p value, we must then decide whether it is small enough to reject the null hypothesis; as we will show, this decision involves comparing the p value to the level of significance.

Using Excel

Excel can be used to conduct one-tailed and two-tailed hypothesis tests about a population mean. The sample data and the test statistic ($t$) are used to compute three p values: p value (lower tail), p value (upper tail), and p value (two tail). The user can then choose $\alpha$ and draw a conclusion using whichever p value is appropriate for the type of hypothesis test being conducted.

Let’s start by showing how to use Excel’s T.DIST function to compute a lower-tail p value. The T.DIST function has three inputs; its general form is as follows:

=T.DIST(test statistic, degrees of freedom, cumulative)
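For readers working outside Excel, the following Python sketch approximates the value returned by =T.DIST(test statistic, degrees of freedom, TRUE) by numerically integrating the t density with Simpson's rule. It is an illustrative approximation under stated assumptions, not Excel's algorithm, and the names `t_pdf`, `t_cdf`, and `p_values` are hypothetical helpers introduced here. In practice a library routine such as `scipy.stats.t.cdf` would normally be used. The upper-tail and two-tailed values use the standard relationships: 1 minus the lower-tail value, and twice the smaller of the two tails.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(t, df, steps=2000):
    """P(T <= t), via Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # approximates P(0 <= T <= |t|)
    return 0.5 + area if t >= 0 else 0.5 - area


def p_values(t, df):
    """Lower-tail, upper-tail, and two-tailed p values for test statistic t."""
    lower = t_cdf(t, df)
    upper = 1.0 - lower
    return lower, upper, 2.0 * min(lower, upper)


# t = -3 with 35 degrees of freedom, the case shown in Figure 6.17.
lower, upper, two_tail = p_values(-3.0, 35)
print(f"lower tail: {lower:.4f}")
print(f"upper tail: {upper:.4f}")
print(f"two tail:   {two_tail:.4f}")
```

The lower-tail value should come out close to the 0.0025 reported for Figure 6.17. For a lower-tail test such as the Hilltop Coffee study, only the first of the three values is relevant.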

CoffeeTest

For the first input, we enter the value of the test statistic; for the second input, we enter the degrees of freedom for the associated t distribution; for the third input, we enter TRUE to compute the cumulative probability corresponding to a lower-tail p value. Once the lower-tail p value has been computed, it is easy to compute the upper-tail and the two-tailed p values. The upper-tail p value is 1 minus the lower-tail p value, and the two-tailed p value is two times the smaller of the lower- and upper-tail p values.

Let us now compute the p value for the Hilltop Coffee lower-tail test. Refer to Figure 6.18 as we describe the tasks involved. The formula sheet is in the backgrou