# Data Analysis and Decision Making

This 1090-page Reader was uploaded by mpy_21193 on Thursday, February 6, 2014. The Reader belongs to a course at a university taught by a professor in Fall. Since its upload, it has received 9562 views.


Date Created: 02/06/14

This is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. The publisher reserves the right to remove content from this title at any time if subsequent rights restrictions require it. For valuable information on pricing, previous editions, changes to current editions, and alternate formats, please visit www.cengage.com/highered to search by ISBN, author, title, or keyword for materials in your areas of interest.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

To my wonderful wife Mary, my best friend and constant companion; to Sam, Lindsay, and Teddy, our new and adorable grandson; and to Bryn, our wild and crazy Welsh corgi, who can't wait for Teddy to be able to play ball with her. S.C.A.

To my wonderful family. W.L.W.

To my wonderful family, Jeannie, Matthew, and Jack. And to my late sister, Jenny, and son, Jake, who live eternally in our loving memories. C.J.Z.

Data Analysis and Decision Making, 4th Edition

- S. Christian Albright, Kelley School of Business, Indiana University
- Wayne L. Winston, Kelley School of Business, Indiana University
- Christopher J. Zappe, Bucknell University

With cases by:

- Mark Broadie, Graduate School of Business, Columbia University
- Peter Kolesar, Graduate School of Business, Columbia University
- Lawrence L. Lapin, San Jose State University
- William D. Whisler, California State University, Hayward

South-Western, Cengage Learning. Australia, Brazil, Japan, Korea, Mexico, Singapore, Spain, United Kingdom, United States.

Data Analysis and Decision Making, Fourth Edition. S. Christian Albright, Wayne L. Winston, Christopher Zappe.

- Vice President of Editorial, Business: Jack W. Calhoun
- Publisher: Joe Sabatino
- Sr. Acquisitions Editor: Charles McCormick, Jr.
- Sr. Developmental Editor: Laura Ansara
- Editorial Assistant: Nora Heink
- Marketing Manager: Adam Marsh
- Marketing Coordinator: Suellen Ruttkay
- Sr. Content Project Manager: Tim Bailey
- Media Editor: Chris Valentine
- Frontlist Buyer, Manufacturing: Miranda Klapper
- Sr. Marketing Communications Manager: Libby Shipp
- Production Service: MPS Limited, A Macmillan Company
- Sr. Art Director: Stacy Jenkins Shirley
- Cover Designer: Lou Ann Thesing
- Cover Image: iStock Photo

© 2011, 2009 South-Western, Cengage Learning

ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher.

For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1-800-354-9706. For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions. Further permissions questions can be emailed to permissionrequest@cengage.com.

ExamView is a registered trademark of eInstruction Corp. Microsoft and Excel spreadsheet software are registered trademarks of Microsoft Corporation used herein under license.

- Library of Congress Control Number: 2010930495
- Student Edition Package ISBN 13: 9780538476126
- Student Edition Package ISBN 10: 0538476125
- Student Edition ISBN 13: 9780538476102
- Student Edition ISBN 10: 0538476109

South-Western Cengage Learning, 5191 Natorp Boulevard, Mason, OH 45040, USA

Cengage Learning products are represented in Canada by Nelson Education, Ltd. For your course and learning solutions, visit www.cengage.com. Purchase any of our products at your local college store or at our preferred online store, www.cengagebrain.com.

Printed in the United States of America. 1 2 3 4 5 6 7 14 13 12 11 10

## About the Authors

S. Christian Albright got his BS degree in Mathematics from Stanford in 1968 and his PhD in Operations Research from Stanford in 1972. Since then he has been teaching in the Operations & Decision Technologies Department in the Kelley School of Business at Indiana University (IU). He has taught courses in management science, computer simulation, statistics, and computer programming to all levels of business students: undergraduates, MBAs, and doctoral students. In addition, he has taught simulation modeling at General Motors and Whirlpool, and he has taught database analysis for the Army. He has published over 20 articles in leading operations research journals in the area of applied probability, and he has authored the books *Statistics for Business and Economics*, *Practical Management Science*, *Spreadsheet Modeling and Applications*, *Data Analysis for Managers*, and *VBA for Modelers*. He also works with the Palisade Corporation on the commercial version, StatTools, of his statistical StatPro add-in for Excel. His current interests are in spreadsheet modeling, the development of VBA applications in Excel, and programming in the .NET environment.

On the personal side, Chris has been married for 39 years to his wonderful wife, Mary, who retired several years ago after teaching 7th grade English for 30 years and is now working as a supervisor for student teachers at IU. They have one son, Sam, who lives in Philadelphia with his wife Lindsay and their newly born son Teddy. Chris has many interests outside the academic area. They include activities with his family (especially traveling with Mary), going to cultural events at IU, power walking while listening to books on his iPod, and reading. And although he earns his livelihood from statistics and management science, his real passion is for playing classical piano music.

Wayne L. Winston is Professor of Operations & Decision Technologies in the Kelley School of
Business at Indiana University, where he has taught since 1975. Wayne received his BS degree in Mathematics from MIT and his PhD degree in Operations Research from Yale. He has written the successful textbooks *Operations Research: Applications and Algorithms*, *Mathematical Programming: Applications and Algorithms*, *Simulation Modeling Using @RISK*, *Practical Management Science*, *Data Analysis and Decision Making*, and *Financial Models Using Simulation and Optimization*. Wayne has published over 20 articles in leading journals and has won many teaching awards, including the schoolwide MBA award four times. He has taught classes at Microsoft, GM, Ford, Eli Lilly, Bristol-Myers Squibb, Arthur Andersen, Roche, PricewaterhouseCoopers, and NCR. His current interest is showing how spreadsheet models can be used to solve business problems in all disciplines, particularly in finance and marketing.

Wayne enjoys swimming and basketball, and his passion for trivia won him an appearance several years ago on the television game show Jeopardy, where he won two games. He is married to the lovely and talented Vivian. They have two children, Gregory and Jennifer.

Christopher J. Zappe earned his BA in Mathematics from DePauw University in 1983 and his MBA and PhD in Decision Sciences from Indiana University in 1987 and 1988, respectively. Between 1988 and 1993, he performed research and taught various decision sciences courses at the University of Florida in the College of Business Administration. From 1993 until 2010, Professor Zappe taught decision sciences in the Department of Management at Bucknell University, and in 2010, he was named provost at Gettysburg College. Professor Zappe has taught undergraduate courses in business statistics, decision modeling and analysis, and computer simulation. He also developed and taught a number of interdisciplinary Capstone Experience courses and Foundation Seminars in support of the Common Learning Agenda at Bucknell. Moreover, he has taught advanced seminars in applied game theory, system
dynamics, risk assessment, and mathematical economics. He has published articles in scholarly journals such as *Managerial and Decision Economics*, *OMEGA*, *Naval Research Logistics*, and *Interfaces*.

## Brief Contents

- Preface
- 1 Introduction to Data Analysis and Decision Making
- Part 1: Exploring Data
  - 2 Describing the Distribution of a Single Variable
  - 3 Finding Relationships among Variables
- Part 2: Probability and Decision Making under Uncertainty
  - 4 Probability and Probability Distributions
  - 5 Normal, Binomial, Poisson, and Exponential Distributions
  - 6 Decision Making under Uncertainty
- Part 3: Statistical Inference
  - 7 Sampling and Sampling Distributions
  - 8 Confidence Interval Estimation
  - 9 Hypothesis Testing
- Part 4: Regression Analysis and Time Series Forecasting
  - 10 Regression Analysis: Estimating Relationships
  - 11 Regression Analysis: Statistical Inference
  - 12 Time Series Analysis and Forecasting
- Part 5: Optimization and Simulation Modeling
  - 13 Introduction to Optimization Modeling
  - 14 Optimization Models
  - 15 Introduction to Simulation Modeling
  - 16 Simulation Models
- Part 6: Online Bonus Material
  - 2 Online: Using the Advanced Filter and Database Functions
  - 17 Importing Data into Excel
- Appendix A: Statistical Reporting
- References
- Index

## Contents

- Preface
- 1 Introduction to Data Analysis and Decision Making
  - 1.1 Introduction
  - 1.2 An Overview of the Book
    - 1.2.1 The Methods
    - 1.2.2 The Software
  - 1.3 Modeling and Models
    - 1.3.1 Graphical Models
    - 1.3.2 Algebraic Models
    - 1.3.3 Spreadsheet Models
    - 1.3.4 A Seven-Step Modeling Process
  - 1.4 Conclusion
  - CASE 1.1 Entertainment on a Cruise Ship
- Part 1: Exploring Data
  - 2 Describing the Distribution of a Single Variable
    - 2.1 Introduction
    - 2.2 Basic Concepts
      - 2.2.1 Populations and Samples
      - 2.2.2 Data Sets, Variables, and Observations
      - 2.2.3 Types of Data
    - 2.3 Descriptive Measures for Categorical Variables
    - 2.4 Descriptive Measures for Numerical Variables
      - 2.4.1 Numerical Summary Measures
      - 2.4.2 Numerical Summary Measures with StatTools
      - 2.4.3 Charts for Numerical Variables
    - 2.5 Time Series Data
    - 2.6 Outliers and Missing Values
      - 2.6.1 Outliers
      - 2.6.2 Missing Values
    - 2.7 Excel Tables for Filtering, Sorting, and Summarizing
      - 2.7.1 Filtering
    - 2.8 Conclusion
    - CASE 2.1 Correct Interpretation of Means
    - CASE 2.2 The Dow Jones Industrial Average
    - CASE 2.3 Home and Condo Prices
  - 3 Finding Relationships among Variables
    - 3.1 Introduction
    - 3.2 Relationships among Categorical Variables
    - 3.3 Relationships among Categorical Variables and a Numerical Variable
      - 3.3.1 Stacked and Unstacked Formats
    - 3.4 Relationships among Numerical Variables
      - 3.4.1 Scatterplots
      - 3.4.2 Correlation and Covariance
    - 3.5 Pivot Tables
    - 3.6 An Extended Example
    - 3.7 Conclusion
    - CASE 3.1 Customer Arrivals at Bank98
    - CASE 3.2 Saving, Spending, and Social Climbing
    - CASE 3.3 Churn in the Cellular Phone Market
- Part 2: Probability and Decision Making under Uncertainty
  - 4 Probability and Probability Distributions
    - 4.1 Introduction
    - 4.2 Probability Essentials
      - 4.2.1 Rule of Complements
      - 4.2.2 Addition Rule
      - 4.2.3 Conditional Probability and the Multiplication Rule
      - 4.2.4 Probabilistic Independence
      - 4.2.5 Equally Likely Events
      - 4.2.6 Subjective Versus Objective Probabilities
    - 4.3 Distribution of a Single Random Variable
      - 4.3.1 Conditional Mean and Variance
    - 4.4 An Introduction to Simulation
    - 4.5 Distribution of Two Random Variables: Scenario Approach
    - 4.6 Distribution of Two Random Variables: Joint Probability Approach
      - 4.6.1 How to Assess Joint Probability Distributions
    - 4.7 Independent Random Variables
    - 4.8 Weighted Sums of Random Variables
    - 4.9 Conclusion
    - CASE 4.1 Simpson's Paradox
  - 5 Normal, Binomial, Poisson, and Exponential Distributions
    - 5.1 Introduction
    - 5.2 The Normal Distribution
      - 5.2.1 Continuous Distributions and Density Functions
      - 5.2.2 The Normal Density
      - 5.2.3 Standardizing: Z-Values
      - 5.2.4 Normal Tables and Z-Values
      - 5.2.5 Normal Calculations in Excel
      - 5.2.6 Empirical Rules Revisited
    - 5.3 Applications of the Normal Distribution
    - 5.4 The Binomial Distribution
      - 5.4.1 Mean and Standard Deviation of the Binomial Distribution
      - 5.4.2 The Binomial Distribution in the Context of Sampling
      - 5.4.3 The Normal Approximation to the Binomial
    - 5.5 Applications of the Binomial Distribution
    - 5.6 The Poisson and Exponential Distributions
      - 5.6.1 The Poisson Distribution
      - 5.6.2 The Exponential Distribution
    - 5.7 Fitting a Probability Distribution to Data with @RISK
    - 5.8 Conclusion
    - CASE 5.1 EuroWatch Company
    - CASE 5.2 Cashing in on the Lottery
  - 6 Decision Making under Uncertainty
    - 6.1 Introduction
    - 6.2 Elements of Decision Analysis
      - 6.2.1 Payoff Tables
      - 6.2.2 Possible Decision Criteria
      - 6.2.3 Expected Monetary Value (EMV)
      - 6.2.4 Sensitivity Analysis
      - 6.2.5 Decision Trees
      - 6.2.6 Risk Profiles
    - 6.3 The PrecisionTree Add-In
    - 6.4 Bayes' Rule
    - 6.5 Multistage Decision Problems
      - 6.5.1 The Value of Information
    - 6.6 Incorporating Attitudes Toward Risk
      - 6.6.1 Utility Functions
      - 6.6.2 Exponential Utility
      - 6.6.3 Certainty Equivalents
      - 6.6.4 Is Expected Utility Maximization Used?
    - 6.7 Conclusion
    - CASE 6.1 Jogger Shoe Company
    - CASE 6.2 Westhouser Paper Company
    - CASE 6.3 Biotechnical Engineering
- Part 3: Statistical Inference
  - 7 Sampling and Sampling Distributions
    - 7.1 Introduction
    - 7.2 Sampling Terminology
    - 7.3 Methods for Selecting Random Samples
      - 7.3.1 Simple Random Sampling
      - 7.3.2 Systematic Sampling
      - 7.3.3 Stratified Sampling
      - 7.3.4 Cluster Sampling
      - 7.3.5 Multistage Sampling Schemes
    - 7.4 An Introduction to Estimation
      - 7.4.1 Sources of Estimation Error
      - 7.4.2 Key Terms in Sampling
      - 7.4.3 Sampling Distribution of the Sample Mean
      - 7.4.4 The Central Limit Theorem
      - 7.4.5 Sample Size Determination
      - 7.4.6 Summary of Key Ideas for Simple Random Sampling
    - 7.5 Conclusion
    - CASE 7.1 Sampling from DVD Movie Renters
  - 8 Confidence Interval Estimation
    - 8.1 Introduction
    - 8.2 Sampling Distributions
      - 8.2.1 The t Distribution
      - 8.2.2 Other Sampling Distributions
    - 8.3 Confidence Interval for a Mean
    - 8.4 Confidence Interval for a Total
    - 8.5 Confidence Interval for a Proportion
    - 8.6 Confidence Interval for a Standard Deviation
    - 8.7 Confidence Interval for the Difference between Means
      - 8.7.1 Independent Samples
      - 8.7.2 Paired Samples
    - 8.8 Confidence Interval for the Difference between Proportions
    - 8.9 Controlling Confidence Interval Length
      - 8.9.1 Sample Size for Estimation of the Mean
      - 8.9.2 Sample Size for Estimation of Other Parameters
    - 8.10 Conclusion
    - CASE 8.1 Harrigan University Admissions
    - CASE 8.2 Employee Retention at D&Y
    - CASE 8.3 Delivery Times at SnowPea Restaurant
    - CASE 8.4 The Bodfish Lot Cruise
  - 9 Hypothesis Testing
    - 9.1 Introduction
    - 9.2 Concepts in Hypothesis Testing
      - 9.2.1 Null and Alternative Hypotheses
      - 9.2.2 One-Tailed Versus Two-Tailed Tests
      - 9.2.3 Types of Errors
      - 9.2.4 Significance Level and Rejection Region
      - 9.2.5 Significance from p-values
      - 9.2.6 Type II Errors and Power
      - 9.2.7 Hypothesis Tests and Confidence Intervals
      - 9.2.8 Practical Versus Statistical Significance
    - 9.3 Hypothesis Tests for a Population Mean
    - 9.4 Hypothesis Tests for Other Parameters
      - 9.4.1 Hypothesis Tests for a Population Proportion
      - 9.4.2 Hypothesis Tests for Differences between Population Means
      - 9.4.3 Hypothesis Test for Equal Population Variances
      - 9.4.4 Hypothesis Tests for Differences between Population Proportions
    - 9.5 Tests for Normality
    - 9.6 Chi-Square Test for Independence
    - 9.7 One-Way ANOVA
    - 9.8 Conclusion
    - CASE 9.1 Regression Toward the Mean
    - CASE 9.2 Baseball Statistics
    - CASE 9.3 The Wichita Anti-Drunk Driving Advertising Campaign
    - CASE 9.4 Deciding Whether to Switch to a New Toothpaste Dispenser
    - CASE 9.5 Removing Vioxx from the Market
- Part 4: Regression Analysis and Time Series Forecasting
  - 10 Regression Analysis: Estimating Relationships
    - 10.1 Introduction
    - 10.2 Scatterplots: Graphing Relationships
      - 10.2.1 Linear Versus Nonlinear Relationships
      - 10.2.2 Outliers
      - 10.2.3 Unequal Variance
      - 10.2.4 No Relationship
    - 10.3 Correlations: Indicators of Linear Relationships
    - 10.4 Simple Linear Regression
      - 10.4.1 Least Squares Estimation
      - 10.4.2 Standard Error of Estimate
      - 10.4.3 The Percentage of Variation Explained: R²
    - 10.5 Multiple Regression
      - 10.5.1 Interpretation of Regression Coefficients
      - 10.5.2 Interpretation of Standard Error of Estimate and R²
    - 10.6 Modeling Possibilities
      - 10.6.1 Dummy Variables
      - 10.6.2 Interaction Variables
      - 10.6.3 Nonlinear Transformations
    - 10.7 Validation of the Fit
    - 10.8 Conclusion
    - CASE 10.1 Quantity Discounts at the Firm Chair Company
    - CASE 10.2 Housing Price Structure in Mid City
    - CASE 10.3 Demand for French Bread at Howie's Bakery
    - CASE 10.4 Investing for Retirement
  - 11 Regression Analysis: Statistical Inference
    - 11.1 Introduction
    - 11.2 The Statistical Model
    - 11.3 Inferences about the Regression Coefficients
      - 11.3.1 Sampling Distribution of the Regression Coefficients
      - 11.3.2 Hypothesis Tests for the Regression Coefficients and p-Values
      - 11.3.3 A Test for the Overall Fit: The ANOVA Table
    - 11.4 Multicollinearity
    - 11.5 Include/Exclude Decisions
    - 11.6 Stepwise Regression
    - 11.7 The Partial F Test
    - 11.8 Outliers
    - 11.9 Violations of Regression Assumptions
      - 11.9.1 Nonconstant Error Variance
      - 11.9.2 Nonnormality of Residuals
      - 11.9.3 Autocorrelated Residuals
    - 11.10 Prediction
    - 11.11 Conclusion
    - CASE 11.1 The Artsy Corporation
    - CASE 11.2 Heating Oil at Dupree Fuels Company
    - CASE 11.3 Developing a Flexible Budget at the Gunderson Plant
    - CASE 11.4 Forecasting Overhead at Wagner Printers
  - 12 Time Series Analysis and Forecasting
    - 12.1 Introduction
    - 12.2 Forecasting Methods: An Overview
      - 12.2.1 Extrapolation Methods
      - 12.2.2 Econometric Models
      - 12.2.3 Combining Forecasts
      - 12.2.4 Components of Time Series Data
      - 12.2.5 Measures of Accuracy
    - 12.3 Testing for Randomness
      - 12.3.1 The Runs Test
      - 12.3.2 Autocorrelation
    - 12.4 Regression-Based Trend Models
      - 12.4.1 Linear Trend
      - 12.4.2 Exponential Trend
    - 12.5 The Random Walk Model
    - 12.6 Autoregression Models
    - 12.7 Moving Averages
    - 12.8 Exponential Smoothing
      - 12.8.1 Simple Exponential Smoothing
      - 12.8.2 Holt's Model for Trend
    - 12.9 Seasonal Models
      - 12.9.1 Winters' Exponential Smoothing Model
      - 12.9.2 Deseasonalizing: The Ratio-to-Moving-Averages Method
      - 12.9.3 Estimating Seasonality with Regression
    - 12.10 Conclusion
    - CASE 12.1 Arrivals at the Credit Union
    - CASE 12.2 Forecasting Weekly Sales at Amanta
- Part 5: Optimization and Simulation Modeling
  - 13 Introduction to Optimization Modeling
    - 13.1 Introduction
    - 13.2 Introduction to Optimization
    - 13.3 A Two-Variable Product Mix Model
    - 13.4 Sensitivity Analysis
      - 13.4.1 Solver's Sensitivity Report
      - 13.4.2 SolverTable Add-In
      - 13.4.3 Comparison of Solver's Sensitivity Report and SolverTable
    - 13.5 Properties of Linear Models
      - 13.5.1 Proportionality
      - 13.5.2 Additivity
      - 13.5.3 Divisibility
      - 13.5.4 Discussion of Linear Properties
      - 13.5.5 Linear Models and Scaling
    - 13.6 Infeasibility and Unboundedness
      - 13.6.1 Infeasibility
      - 13.6.2 Unboundedness
      - 13.6.3 Comparison of Infeasibility and Unboundedness
    - 13.7 A Larger Product Mix Model
    - 13.8 A Multiperiod Production Model
    - 13.9 A Comparison of Algebraic and Spreadsheet Models
    - 13.10 A Decision Support System
    - 13.11 Conclusion
    - CASE 13.1 Shelby Shelving
    - CASE 13.2 Sonoma Valley Wines
  - 14 Optimization Models
    - 14.1 Introduction
    - 14.2 Worker Scheduling Models
    - 14.3 Blending Models
    - 14.4 Logistics Models
      - 14.4.1 Transportation Models
      - 14.4.2 Other Logistics Models
    - 14.5 Aggregate Planning Models
    - 14.6 Financial Models
    - 14.7 Integer Programming Models
      - 14.7.1 Capital Budgeting Models
      - 14.7.2 Fixed-Cost Models
      - 14.7.3 Set-Covering Models
    - 14.8 Nonlinear Programming Models
      - 14.8.1 Basic Ideas of Nonlinear Optimization
      - 14.8.2 Managerial Economics Models
      - 14.8.3 Portfolio Optimization Models
    - 14.9 Conclusion
    - CASE 14.1 Giant Motor Company
    - CASE 14.2 GMS Stock Hedging
  - 15 Introduction to Simulation Modeling
    - 15.1 Introduction
    - 15.2 Probability Distributions for Input Variables
      - 15.2.1 Types of Probability Distributions
      - 15.2.2 Common Probability Distributions
      - 15.2.3 Using @RISK to Explore Probability Distributions
    - 15.3 Simulation and the Flaw of Averages
    - 15.4 Simulation with Built-In Excel Tools
    - 15.5 Introduction to the @RISK Add-in
      - 15.5.1 @RISK Features
      - 15.5.2 Loading @RISK
      - 15.5.3 @RISK Models with a Single Random Input Variable
      - 15.5.4 Some Limitations of @RISK
      - 15.5.5 @RISK Models with Several Random Input Variables
    - 15.6 The Effects of Input Distributions on Results
      - 15.6.1 Effect of the Shape of the Input Distribution(s)
      - 15.6.2 Effect of Correlated Input Variables
    - 15.7 Conclusion
    - CASE 15.1 Ski Jacket Production
    - CASE 15.2 Ebony Bath Soap
  - 16 Simulation Models
    - 16.1 Introduction
    - 16.2 Operations Models
      - 16.2.1 Bidding for Contracts
      - 16.2.2 Warranty Costs
      - 16.2.3 Drug Production with Uncertain Yield
    - 16.3 Financial Models
      - 16.3.1 Financial Planning Models
      - 16.3.2 Cash Balance Models
      - 16.3.3 Investment Models
    - 16.4 Marketing Models
      - 16.4.1 Models of Customer Loyalty
      - 16.4.2 Marketing and Sales Models
    - 16.5 Simulating Games of Chance
      - 16.5.1 Simulating the Game of Craps
      - 16.5.2 Simulating the NCAA Basketball Tournament
    - 16.6 An Automated Template for @RISK Models
    - 16.7 Conclusion
    - CASE 16.1 College Fund Investment
    - CASE 16.2 Bond Investment Strategy
- Part 6: Online Bonus Material
  - 2 Online: Using the Advanced Filter and Database Functions
  - 17 Importing Data into Excel
    - 17.1 Introduction
    - 17.2 Rearranging Excel Data
    - 17.3 Importing Text Data
    - 17.4 Importing Relational Database Data
      - 17.4.1 A Brief Introduction to Relational Databases
      - 17.4.2 Using Microsoft Query
      - 17.4.3 SQL Statements
    - 17.5 Web Queries
    - 17.6 Cleansing Data
    - 17.7 Conclusion
    - CASE 17.1 EduToys, Inc.
- Appendix A: Statistical Reporting
  - A.1 Introduction
  - A.2 Suggestions for Good Statistical Reporting
    - A.2.1 Planning
    - A.2.2 Developing a Report
    - A.2.3 Be Clear
    - A.2.4 Be Concise
    - A.2.5 Be Precise
  - A.3 Examples of Statistical Reports
  - A.4 Conclusion
- References
- Index

## Preface

With today's technology, companies are able to collect tremendous amounts of data with relative ease. Indeed, many companies now have more data than they can handle. However, the data are usually meaningless until they are analyzed for trends, patterns, relationships, and other useful information. This book illustrates in a practical way a variety of methods, from simple to complex, to help you analyze data sets and uncover important information. In many business contexts, data analysis is only the first step in the solution of a problem. Acting on the solution and the information it provides to make good decisions is a critical next step. Therefore, there is a heavy emphasis throughout this book on analytical methods that are useful in decision making. Again, the methods vary considerably, but the objective is always the same: to equip you with decision-making tools that you can apply in your business careers.

We recognize that the majority of students in this type of course are not majoring in a quantitative area. They are typically business majors in finance, marketing, operations management, or some other business discipline who will need to analyze data and make quantitative-based decisions in their jobs. We offer a hands-on, example-based approach and introduce fundamental concepts as they are needed. Our vehicle is spreadsheet software, specifically Microsoft Excel. This is a package that most students already know and will undoubtedly use in their careers. Our MBA students at Indiana University are so turned on by the required course that is based on this book that almost all of them (mostly finance and marketing majors) take at least one of our follow-up elective courses in spreadsheet modeling. We are convinced that students see value in quantitative analysis when the course is taught in a practical and example-based approach.

### Rationale for writing this book

*Data Analysis and Decision Making* is
different from the many fine textbooks written for statistics and management science. Our rationale for writing this book is based on three fundamental objectives.

1. Integrated coverage and applications: The book provides a unified approach to business-related problems by integrating methods and applications that have been traditionally taught in separate courses, specifically statistics and management science.
2. Practical in approach: The book emphasizes realistic business examples and the processes managers actually use to analyze business problems. The emphasis is not on abstract theory or computational methods.
3. Spreadsheet-based: The book provides students with the skills to analyze business problems with tools they have access to and will use in their careers. To this end, we have adopted Excel and commercial spreadsheet add-ins.

### Integrated coverage and applications

In the past, many business schools, including ours at Indiana University, have offered a required statistics course, a required decision-making course, and a required management science course, or some subset of these. One current trend, however, is to have only one required course that covers the basics of statistics, some regression analysis, some decision making under uncertainty, some linear programming, some simulation, and possibly others. Essentially, we faculty in the quantitative area get one opportunity to teach all business students, so we attempt to cover a variety of useful quantitative methods. We are not necessarily arguing that this trend is ideal, but rather that it is a
reflection of the reality at our university and, we suspect, at many others. After several years of teaching this course, we have found it to be a great opportunity to attract students to the subject and more advanced study.

The book is also integrative in another important aspect. It not only integrates a number of analytical methods, but it also applies them to a wide variety of business problems; that is, it analyzes realistic examples from many business disciplines. We include examples, problems, and cases that deal with portfolio optimization, workforce scheduling, market share analysis, capital budgeting, new product analysis, and many others.

Practical in approach

We want this book to be very example-based and practical. We strongly believe that students learn best by working through examples, and they appreciate the material most when the examples are realistic and interesting. Therefore, our approach in the book differs in two important ways from many competitors. First, there is just enough conceptual development to give students an understanding and appreciation for the issues raised in the examples. We often introduce important concepts, such as multicollinearity in regression, in the context of examples, rather than discussing them in the abstract. Our experience is that students gain greater intuition and understanding of the concepts and applications through this approach.

Second, we place virtually no emphasis on hand calculations. We believe it is more important for students to understand why they are conducting an analysis and what it means than to emphasize the tedious calculations associated with many analytical techniques. Therefore, we illustrate how powerful software can be used to create graphical and numerical outputs in a matter of seconds, freeing the rest of the time for in-depth interpretation of the output, sensitivity analysis, and alternative modeling approaches. In our own courses, we move directly into a discussion of examples, where we focus almost exclusively on
interpretation and modeling issues and let the software perform the number crunching.

Spreadsheet-based teaching

We are strongly committed to teaching spreadsheet-based, example-driven courses, regardless of whether the basic area is data analysis or management science. We have found tremendous enthusiasm for this approach, both from students and from faculty around the world who have used our books. Students learn and remember more, and they appreciate the material more. In addition, instructors typically enjoy teaching more, and they usually receive immediate reinforcement through better teaching evaluations. We were among the first to move to spreadsheet-based teaching almost two decades ago, and we have never regretted the move.

What we hope to accomplish in this book

Condensing the ideas in the above paragraphs, we hope to:

- Reverse negative student attitudes about statistics and quantitative methods by making these topics real, accessible, and interesting;
- Give students lots of hands-on experience with real problems and challenge them to develop their intuition, logic, and problem-solving skills;
- Expose students to real problems in many business disciplines and show them how these problems can be analyzed with quantitative methods;
- Develop spreadsheet skills, including experience with powerful spreadsheet add-ins, that add immediate value in students' other courses and their future careers.

New in the fourth edition

There are two major changes in this edition.

- We have completely rewritten and reorganized Chapters 2 and 3. Chapter 2 now focuses on the description of one variable at a time, and Chapter 3 focuses on relationships between variables. We believe this reorganization is more logical. In addition, both of these chapters have more coverage of categorical variables, and they have new examples with more interesting data sets.
- We have made major changes in the problems, particularly in Chapters 2 and 3. Many of the problems in previous editions were either uninteresting or outdated,
so in most cases we deleted or updated such problems, and we added a number of brand-new problems. We also created a file, essentially a database of problems, that is available to instructors. This file, Problem Database.xlsx, indicates the context of each of the problems, and it also shows the correspondence between problems in this edition and problems in the previous edition.

Besides these two major changes, there are a number of smaller changes, including the following:

- Due to the length of the book, we decided to delete the old Chapter 4 (Getting the Right Data) from the printed book and make it available online as Chapter 17. This chapter, now called "Importing Data into Excel," has been completely rewritten, and its section on Excel tables is now in Chapter 2. (The old Chapters 5–17 were renumbered 4–16.)
- The book is still based on Excel 2007, but where it applies, notes about changes in Excel 2010 have been added. Specifically, there is a small section on the new slicers for pivot tables, and there are several mentions of the new statistical functions (although the old functions still work).
- Each chapter now has 10–20 "Conceptual Questions" in the end-of-chapter section. There were a few "Conceptual Exercises" in some chapters in previous editions, but the new versions are more numerous, consistent, and relevant.
- The first two linear programming (LP) examples in Chapter 13 (the old Chapter 14) have been replaced by two product mix models, where the second is an extension of the first. Our thinking was that the previous diet-themed model was overly complex
as a first LP example.
- Several of the chapter-opening vignettes have been replaced by newer and more interesting ones.
- There are now many short "fundamental insights" throughout the chapters. We hope these allow the students to step back from the details and see the really important ideas.

Software

This book is based entirely on Microsoft Excel, the spreadsheet package that has become the standard analytical tool in business. Excel is an extremely powerful package, and one of our goals is to convert casual users into power users who can take full advantage of its features. If we accomplish no more than this, we will be providing a valuable skill for the business world. However, Excel has some limitations. Therefore, this book includes several Excel add-ins that greatly enhance Excel's capabilities. As a group, these add-ins comprise what is arguably the most impressive assortment of spreadsheet-based software accompanying any book on the market.

DecisionTools add-in

The textbook Web site for Data Analysis and Decision Making provides a link to the powerful DecisionTools Suite by Palisade Corporation. This suite includes seven separate add-ins, the first three of which we use extensively:

- @RISK, an add-in for simulation
- StatTools, an add-in for statistical data analysis
- PrecisionTree, a graphical-based add-in for creating and analyzing decision trees
- TopRank, an add-in for performing what-if analyses
- RISKOptimizer, an add-in for performing optimization on simulation models
- NeuralTools, an add-in for finding complex nonlinear relationships
- Evolver™, an add-in for performing optimization on complex "nonsmooth" models

Online access to the DecisionTools Suite, available with new copies of the book, is an academic version, slightly scaled down from the professional version that sells for hundreds of dollars and is used by many leading companies. It functions for two years when properly installed, and it puts only modest limitations on the size of data sets or models that can be analyzed. (Visit www.kelley.iu.edu/albrightbooks for specific details on these limitations.) We use @RISK and PrecisionTree extensively in the chapters on simulation and decision making under uncertainty, and we use StatTools throughout all of the data analysis chapters.

SolverTable add-in

We also include SolverTable, a supplement to Excel's built-in Solver for optimization. If you have ever had difficulty understanding Solver's sensitivity reports, you will appreciate SolverTable. It works like Excel's data tables, except that for each input (or pair of inputs), the add-in runs Solver and reports the optimal output values. SolverTable is used extensively in the optimization chapters. The version of SolverTable included in this book has been revised for Excel 2007. (Although SolverTable is available on this textbook's Web site, it is also available for free from the first author's Web site, www.kelley.iu.edu/albrightbooks.)

Possible sequences of topics

Although we use the book for our own required one-semester course, there is admittedly more material than can be covered adequately in one semester. We have tried to make the book as modular as possible, allowing an instructor to cover, say, simulation before optimization or vice versa, or to omit either of these topics. The one exception is statistics. Due to the natural progression of statistical topics, the basic topics in the early chapters should be covered before the more advanced topics (regression and time series analysis) in the later chapters. With this in mind, there are several possible ways to cover the topics.

- For a
one-semester required course with no statistics prerequisite (or where MBA students have forgotten whatever statistics they learned years ago): If data analysis is the primary focus of the course, then Chapters 2–5, 7–11, and possibly the online Chapter 17 (all statistics and probability topics) should be covered. Depending on the time remaining, any of the topics in Chapters 6 (decision making under uncertainty), 12 (time series analysis), 13–14 (optimization), or 15–16 (simulation) can be covered in practically any order.
- For a one-semester required course with a statistics prerequisite: Assuming that students know the basic elements of statistics (up through hypothesis testing, say), the material in Chapters 2–5 and 7–9 can be reviewed quickly, primarily to illustrate how Excel and add-ins can be used to do the number crunching. Then the instructor can choose among any of the topics in Chapters 6, 10–11, 12, 13–14, or 15–16 (in practically any order) to fill the remainder of the course.
- For a two-semester required sequence: Given the luxury of spreading the topics over two semesters, the entire book can be covered. The statistics topics in Chapters 2–5 and 7–9 should be covered in order before other statistical topics (regression and time series analysis), but the remaining chapters can be covered in practically any order.

Custom publishing

If you want to use only a subset of the text, or add chapters from the authors' other texts or your own materials, you can do so through Cengage Learning Custom Publishing. Contact your local Cengage Learning representative for more details.

Student ancillaries

Textbook Web Site

Every new student edition of this book comes with an Instant Access Code (bound inside the book). The code provides access to the Data Analysis and Decision Making, 4e textbook Web site that links to all of the following files and tools:

- DecisionTools Suite software by Palisade Corporation (described earlier)
- Excel files for the examples in the chapters, usually two versions of each, a template or data-only version
and a finished version
- Data files required for the problems and cases
- Excel Tutorial.xlsx, which contains a useful tutorial for getting up to speed in Excel 2007

Students who do not have a new book can purchase access to the textbook Web site at www.CengageBrain.com.

Student Solutions

Student Solutions to many of the odd-numbered problems (indicated in the text with a colored box on the problem number) are available in Excel format. Students can purchase access to Student Solutions files on www.CengageBrain.com (ISBN-10: 1-111-52905-1; ISBN-13: 978-1-111-52905-5).

Instructor ancillaries

Adopting instructors can obtain the Instructors' Resource CD (IRCD) from your regional Cengage Learning Sales Representative. The IRCD includes:

- Problem Database.xlsx file, which contains information about all problems in the book and the correspondence between them and those in the previous edition
- Example files for all examples in the book, including annotated versions with additional explanations and a few extra examples that extend the examples in the book
- Solution files (in Excel format) for all of the problems and cases in the book and solution shells (templates) for selected problems in the modeling chapters
- PowerPoint presentation files for all of the examples in the book
- Test Bank in Word format and now also in ExamView Testing Software (new to this edition)

The book's password-protected instructor Web site, www.cengage.com/decisionsciences/albright, includes the above items (Test Bank in Word format only), as well as software updates, errata, additional problems and
solutions, and additional resources for both students and faculty. The first author also maintains his own Web site at www.kelley.iu.edu/albrightbooks.

Acknowledgments

The authors would like to thank several people who helped make this book a reality. First, the authors are indebted to Peter Kolesar, Mark Broadie, Lawrence Lapin, and William Whisler for contributing some of the excellent case studies that appear throughout the book.

There are more people who helped to produce this book than we can list here. However, there are a few special people whom we were happy (and lucky) to have on our team. First, we would like to thank our editor, Charles McCormick. Charles stepped into this project after two editions had already been published, but the transition has been smooth and rewarding. We appreciate his tireless efforts to make the book a continued success.

We are also grateful to many of the professionals who worked behind the scenes to make this book a success: Adam Marsh, Marketing Manager; Laura Ansara, Senior Developmental Editor; Nora Heink, Editorial Assistant; Tim Bailey, Senior Content Project Manager; Stacy Shirley, Senior Art Director; and Gunjan Chandola, Senior Project Manager at MPS Limited.

We also extend our sincere appreciation to the reviewers who provided feedback on the authors' proposed changes that resulted in this fourth edition:

Henry F. Ander, Arizona State University; James D. Behel, Harding University; Dan Brooks, Arizona State University; Robert H. Burgess, Georgia Institute of Technology; George Cunningham III, Northwestern State
University; Rex Cutshall, Indiana University; Robert M. Escudero, Pepperdine University; Theodore S. Glickman, George Washington University; John Gray, The Ohio State University; Joe Hahn, Pepperdine University; Max Peter Hoefer, Pace University; Tim James, Arizona State University; Teresa Jostes, Capital University; Jeffrey Keisler, University of Massachusetts Boston; David Kelton, University of Cincinnati; Shreevardhan Lele, University of Maryland; Ray Nelson, Brigham Young University; William Pearce, Geneva College; Thomas R. Sexton, Stony Brook University; Malcolm T. Whitehead, Northwestern State University; Laura A. Wilson Gentry, University of Baltimore; Jay Zagorsky, Boston University

S. Christian Albright
Wayne L. Winston
Christopher J. Zappe
May 2010

CHAPTER 1

HOTTEST NEW JOBS: STATISTICS AND MATHEMATICS

[Photo: George Doyle/Jupiter Images]

Much of this book, as the title implies, is about data analysis. The term data analysis has long been synonymous with the term statistics, but in today's world, with massive amounts of data available in business and many other fields such as health and science, data analysis goes beyond the more narrowly focused area of traditional statistics. But regardless of what we call it, data analysis is currently a hot topic and promises to get even hotter in the future. The data analysis skills you learn here, and possibly in follow-up quantitative courses, might just land you a very interesting and lucrative job.

This is exactly the message in a recent New York Times article, "For Today's Graduate, Just One Word: Statistics," by Steve Lohr. (A similar article, "Math Will Rock Your World," by Stephen Baker, was the cover story for BusinessWeek. Both articles are available online by searching for their titles.) The statistics article begins by chronicling a Harvard anthropology and archaeology graduate, Carrie Grimes, who began her career by mapping the locations of Mayan artifacts in places like Honduras. As she states, "People think of field archaeology as Indiana Jones, but much of what you really do is data analysis." Since then, Grimes has leveraged her data analysis skills to get a job with Google, where she and many other people with a quantitative background are analyzing huge amounts of data to improve the company's search engine. As the chief economist at Google, Hal Varian, states, "I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding." The salaries for statisticians with doctoral degrees currently start at $125,000, and they will probably continue to increase. (The math article indicates that mathematicians are also in great demand.)

Why is this trend occurring? The reason is the explosion of digital data: data from sensor signals, surveillance tapes, Web clicks, bar scans, public records, financial transactions, and more. In years past, statisticians typically analyzed relatively small data sets, such as opinion polls with about 1,000 responses. Today's massive data sets require new statistical methods, new computer software, and most importantly for you, more young people trained in these methods and the corresponding software. Several particular areas mentioned in the articles include (1) improving Internet search and online advertising, (2) unraveling gene sequencing information for cancer research, (3) analyzing sensor and location data for optimal handling of food shipments, and (4) the recent Netflix contest for improving the company's recommendation system.

The statistics article mentions three specific organizations in need of data analysts, and lots of them. The first is government, where there is an increasing need to sift through mounds of data
as a first step toward dealing with long-term economic needs and key policy priorities. The second is IBM, which created a Business Analytics and Optimization Services group in April 2009. This group will use the more than 200 mathematicians, statisticians, and data analysts already employed by the company, but IBM intends to retrain or hire 4000 more analysts to meet its needs. The third is Google, which needs more data analysts to improve its search engine. You may think that today's search engines are unbelievably efficient, but Google knows they can be improved. As Ms. Grimes states, "Even an improvement of a percent or two can be huge, when you do things over the millions and billions of times we do things at Google."

Of course, these three organizations are not the only organizations that need to hire more skilled people to perform data analysis and other analytical procedures. It is a need faced by all large organizations. Various recent technologies, the most prominent by far being the Web, have given organizations the ability to gather massive amounts of data easily. Now they need people to make sense of it all and use it to their competitive advantage.

1.1 INTRODUCTION

We are living in the age of technology. This has two important implications for everyone entering the business world. First, technology has made it possible to collect huge amounts of data. Retailers collect point-of-sale data on products and customers every time a transaction occurs; credit agencies have all sorts of data on people who have or would like to obtain credit; investment companies have a limitless supply of data on the historical patterns of stocks, bonds, and other securities; and government agencies have data on economic trends, the environment, social welfare, consumer product safety, and virtually everything else imaginable. It has become relatively easy to collect the data. As a result, data are plentiful. However, as many organizations are now beginning to discover, it is quite a challenge to analyze and make sense of all the data they have collected.

A second important implication of technology is that it has given many more people the power and responsibility to analyze data and make decisions on the basis of quantitative analysis. People entering the business world can no longer pass all of the quantitative analysis to the "quant jocks," the technical specialists who have traditionally done the number crunching. The vast majority of employees now have a desktop or laptop computer at their disposal, access to relevant data, and training in easy-to-use software, particularly spreadsheet and database software. For these employees, statistics and other quantitative methods are no longer forgotten topics they once learned in college. Quantitative analysis is now an integral part of their daily jobs.

A large amount of data already exists, and it will only increase in the future. Many companies already complain of swimming in a sea of data. However, enlightened companies are seeing this expansion as a source of competitive advantage. By using quantitative methods to uncover the information in the data and then acting on this information, again guided by quantitative analysis, they are able to gain advantages that their less enlightened competitors are not able to gain. Several pertinent examples of this follow.

- Direct marketers analyze enormous customer databases to see which customers are likely to respond to various products and types of promotions. Marketers can then target different classes of
customers in different ways to maximize profits and give their customers what they want.
- Hotels and airlines also analyze enormous customer databases to see what their customers want and are willing to pay for. By doing this, they have been able to devise very clever pricing strategies, where different customers pay different prices for the same accommodations. For example, a business traveler typically makes a plane reservation closer to the time of travel than a vacationer. The airlines know this. Therefore, they reserve seats for these business travelers and charge them a higher price for the same seats. The airlines profit from clever pricing strategies, and the customers are happy.
- Financial planning services have a virtually unlimited supply of data about security prices, and they have customers with widely differing preferences for various types of investments. Trying to find a match of investments to customers is a very challenging problem. However, customers can easily take their business elsewhere if good decisions are not made on their behalf. Therefore, financial planners are under extreme competitive pressure to analyze masses of data so that they can make informed decisions for their customers.¹
- We all know about the pressures U.S. manufacturing companies have faced from foreign competition in the past couple of decades. The automobile companies, for example, have had to change the way they produce and market automobiles to stay in business. They have had to improve quality and cut costs by orders of magnitude. Although the struggle continues, much of the success they have had can be attributed to data analysis and wise decision making. Starting on the shop floor and moving up through the organization, these companies now measure almost everything, analyze these measurements, and then act on the results of their analysis.

¹ For a great overview of how quantitative techniques have been used in the financial world, read the book The Quants, by Scott Patterson (Random House, 2010). It
describes how quantitative models made millions for a lot of bright young analysts, but it also describes the dangers of relying totally on quantitative models, at least in the complex and global world of finance.

We talk about companies analyzing data and making decisions. However, companies don't really do this; people do it. And who will these people be in the future? They will be you! We know from experience that students in all areas of business, at both the undergraduate and graduate level, will soon be required to describe large complex data sets, run regression analyses, make quantitative forecasts, create optimization models, and run simulations. You are the person who will soon be analyzing data and making important decisions to help your company gain a competitive advantage. And if you are not willing or able to do so, there will be plenty of other technically trained people who will be more than happy to replace you.

Our goal in this book is to teach you how to use a variety of quantitative methods to analyze data and make decisions. We will do so in a very hands-on way. We will discuss a number of quantitative methods and illustrate their use in a large variety of realistic business situations. As you will see, this book includes many examples from finance, marketing, operations, accounting, and other areas of business. To analyze these examples, we will take advantage of the Microsoft Excel spreadsheet software, together with a number of powerful Excel add-ins. In each example we will provide step-by-step details of the method
and its implementation in Excel.

This is not a "theory" book. It is also not a book where you can lean comfortably back in your chair, prop your legs up on a table, and read about how other people use quantitative methods. It is a "get your hands dirty" book, where you will learn best by actively following the examples throughout the book at your own PC. In short, you will learn by doing. By the time you have finished, you will have acquired some very useful skills for today's business world.

1.2 AN OVERVIEW OF THE BOOK

This book is packed with quantitative methods and examples, probably more than can be covered in any single course. Therefore, we purposely intend to keep this introductory chapter brief so that you can get on with the analysis. Nevertheless, it is useful to introduce the methods you will be learning and the tools you will be using. In this section we provide an overview of the methods covered in this book and the software that is used to implement them. Then, in the next section, we present a brief discussion of models and the modeling process. Our primary purpose at this point is to stimulate your interest in what is to follow.

1.2.1 The Methods

This book is rather unique in that it combines topics from two separate fields: statistics and management science. In a nutshell, statistics is the study of data analysis, whereas management science is the study of model building, optimization, and decision making. In the academic arena these two fields have traditionally been separated, sometimes widely. Indeed, they are often housed in separate academic departments. However, from a user's standpoint it makes little sense to separate them. Both are useful in accomplishing what the title of this book promises: data analysis and decision making.

Therefore, we do not distinguish between the statistics and the management science parts of this book. Instead, we view the entire book as a collection of useful quantitative methods that can be used to analyze data and help make business decisions. In addition,
our choice of software helps to integrate the various topics. By using a single package, Excel, together with a number of add-ins, you will see that the methods of statistics and management science are similar in many important respects. Most importantly, their combination gives you the power and flexibility to solve a wide range of business problems.

Three important themes run through this book. Two of them are in the title: data analysis and decision making. The third is dealing with uncertainty.² Each of these themes has subthemes. Data analysis includes data description, data inference, and the search for relationships in data. Decision making includes optimization techniques for problems with no uncertainty, decision analysis for problems with uncertainty, and structured sensitivity analysis. Dealing with uncertainty includes measuring uncertainty and modeling uncertainty explicitly. There are obvious overlaps between these themes and subthemes. When you make inferences from data and search for relationships in data, you must deal with uncertainty. When you use decision trees to help make decisions, you must deal with uncertainty. When you use simulation models to help make decisions, you must deal with uncertainty, and then you often make inferences from the simulated data.

Figure 1.1 shows where you will find these themes and subthemes in the remaining chapters of this book. In the next few paragraphs we discuss the book's contents in more detail.

[Figure 1.1 Themes and Subthemes: a table with columns Themes, Subthemes, and Chapters Where Emphasized]

² The fact that the uncertainty theme did not find its way into the title of this book does not detract from its importance. We just wanted to keep the title reasonably short!

We begin in Chapters 2 and 3 by illustrating a number of ways to summarize the information in data sets. These include graphical and tabular summaries, as well as numerical summary measures such as means, medians, and standard deviations. The material in these two chapters is elementary from a mathematical point of view, but it is extremely important. As we stated at the beginning of this chapter, organizations are now able to collect huge amounts of raw data, but what does it all mean? Although there are very sophisticated methods for analyzing data sets, some of which we cover in later chapters, the simple methods in Chapters 2 and 3 are crucial for obtaining an initial understanding of the data. Fortunately, Excel and available add-ins now make what was once a very tedious task quite easy. For example, Excel's pivot table tool for "slicing and dicing" data is an analyst's dream come true. You will be amazed at the complex analysis pivot tables enable you to perform, with almost no effort.³

Uncertainty is a key aspect of most business problems. To deal with uncertainty, you need a basic understanding of probability. We provide this understanding in Chapters 4 and 5. Chapter 4 covers basic rules of probability and then discusses the
extremely important concept of probability distributions. Chapter 5 follows up this discussion by focusing on two of the most important probability distributions, the normal and binomial distributions. It also briefly discusses the Poisson and exponential distributions, which have many applications in probability models.

We have found that one of the best ways to make probabilistic concepts come alive and easier to understand is by using computer simulation. Therefore, simulation is a common theme that runs through this book, beginning in Chapter 4. Although the final two chapters of the book are devoted entirely to simulation, we do not hesitate to use simulation early and often to illustrate statistical concepts.

In Chapter 6 we apply our knowledge of probability to decision making under uncertainty. These types of problems, faced by all companies on a continual basis, are characterized by the need to make a decision now, even though important information, such as demand for a product or returns from investments, will not be known until later. The material in Chapter 6 provides a rational basis for making such decisions. The methods we illustrate do not guarantee perfect outcomes (the future could unluckily turn out differently than expected), but they do enable you to proceed rationally and make the best of the given circumstances. Additionally, the software used to implement these methods allows you, with very little extra work, to see how sensitive the optimal decisions are to inputs. This is crucial because the inputs to many business problems are, at best, educated guesses. Finally, we examine the role of risk aversion in these types of decision problems.

In Chapters 7, 8, and 9 we discuss sampling and statistical inference. Here the basic problem is to estimate one or more characteristics of a population. If it is too expensive or time consuming to learn about the entire population, and it usually is, we instead select a random sample from the population and then use the information in the
sample to infer the characteristics of the population. You see this continually on news shows that describe the results of various polls. You also see it in many business contexts. For example, auditors typically sample only a fraction of a company's records. Then they infer the characteristics of the entire population of records from the results of the sample to conclude whether the company has been following acceptable accounting standards.

In Chapters 10 and 11 we discuss the extremely important topic of regression analysis, which is used to study relationships between variables. The power of regression analysis is its generality. Every part of a business has variables that are related to one another, and regression can often be used to estimate possible relationships between these variables. In managerial accounting, regression is used to estimate how overhead costs depend on direct labor hours and production volume. In marketing, regression is used to estimate how sales volume depends on advertising and other marketing variables. In finance, regression is used to estimate how the return of a stock depends on the market return. In real estate studies, regression is used to estimate how the selling price of a house depends on the assessed valuation of the house and characteristics such as the number of bedrooms and square footage. Regression analysis finds perhaps as many uses in the business world as any method in this book.

From regression, we move to time series analysis and forecasting in Chapter 12. This topic is particularly important for providing inputs into business decision problems. For example, manufacturing companies must forecast demand for their products to make sensible decisions about quantities to order from their suppliers. Similarly, fast-food restaurants must forecast customer arrivals, sometimes down to the level of 15-minute intervals, so that they can staff their restaurants appropriately. There are many approaches to forecasting, ranging from simple to complex. Some involve regression-based methods, in which one or more time series variables are used to forecast the variable of interest, whereas other methods are based on extrapolation. In an extrapolation method, the historical patterns of a time series variable, such as product demand or customer arrivals, are studied carefully and are then extrapolated into the future to obtain forecasts. A number of extrapolation methods are available. In Chapter 12 we study both regression and extrapolation methods for forecasting.

[3] Users of the previous edition will notice that the old Chapter 4 (getting data into Excel) is no longer in the book. We did this to keep the book from getting even longer. However, an updated version of this chapter is available at this textbook's Web site. Go to www.cengage.com/decisionsciences/albright for access instructions.

Chapters 13 and 14 are devoted to spreadsheet optimization, with emphasis on linear programming. We assume a company must make several decisions, and there are constraints that limit the possible decisions. The job of the decision maker is to choose the decisions such that all of the constraints are satisfied and an objective, such as total profit or total cost, is optimized. The solution process consists of two steps. The first step is to build a spreadsheet model that relates the decision variables to other relevant quantities by means of logical formulas. In this first step there is no attempt to find the optimal solution; its only purpose is to relate all
relevant quantities in a logical way. The second step is then to find the optimal solution. Fortunately, Excel contains a Solver add-in that performs this step. All you need to do is specify the objective, the decision variables, and the constraints; Solver then uses powerful algorithms to find the optimal solution. As with regression, the power of this approach is its generality. An enormous variety of problems can be solved by spreadsheet optimization.

Finally, Chapters 15 and 16 illustrate a number of computer simulation models. This is not your first exposure to simulation (it is used in a number of previous chapters to illustrate statistical concepts), but here it is studied in its own right. As we discussed previously, most business problems have some degree of uncertainty. The demand for a product is unknown, future interest rates are unknown, the delivery lead time from a supplier is unknown, and so on. Simulation allows you to build this uncertainty explicitly into spreadsheet models. Essentially, some cells in the model contain random values with given probability distributions. Every time the spreadsheet recalculates, these random values change, which causes bottom-line output cells to change as well. The trick then is to force the spreadsheet to recalculate many times and keep track of interesting outputs. In this way you can see which output values are most likely, and you can see best-case and worst-case results.

Spreadsheet simulations can be performed entirely with Excel's built-in tools. However, this is quite tedious. Therefore, we use a spreadsheet add-in to streamline the process. In particular, you will learn how the @RISK add-in can be used to run replications of a simulation, keep track of outputs, create useful charts, and perform sensitivity analyses. With the inherent power of spreadsheets and the ease of using such add-ins as @RISK, spreadsheet simulation is becoming one of the most popular quantitative tools in the business world.

1.2.2 The Software

The quantitative methods in this
book can be used to analyze a wide variety of business problems. However, they are not of much practical use unless you have the software to do the number crunching. Very few business problems are small enough to be solved with pencil and paper. They require powerful software.

The software included in new copies of this book, together with Microsoft Excel, provides you with a powerful combination. This software is being used, and will continue to be used, by leading companies all over the world to analyze large, complex problems. We firmly believe that the experience you obtain with this software, through working the examples and problems in this book, will give you a key competitive advantage in the marketplace.

It all begins with Excel. All of the quantitative methods that we discuss are implemented in Excel. Specifically, in this edition we use Excel 2007.[4] We cannot forecast the state of computer software in the long-term future, but Excel is currently the most heavily used spreadsheet package on the market, and there is every reason to believe that this state will persist for many years. Most companies use Excel, most employees and most students have been trained in Excel, and Excel is a very powerful, flexible, and easy-to-use package.

Built-in Excel Features

Virtually everyone in the business world knows the basic features of Excel, but relatively few know some of its more powerful features. In short, relatively few people are the power users we expect you to become by working through this book. To get you started, the file Excel
Tutorial.xlsx explains some of the intermediate features of Excel, features that we expect you to be able to use (access this file on the textbook's Web site that accompanies new copies of this book). These include the SUMPRODUCT, VLOOKUP, IF, NPV, and COUNTIF functions. They also include range names, data tables, the Paste Special option, the Goal Seek tool, and many others. Finally, although we assume you can perform routine spreadsheet tasks such as copying and pasting, the tutorial includes many tips to help you perform these tasks more efficiently.

In the body of the book we describe several of Excel's advanced features in more detail. For example, we introduce pivot tables in Chapter 3. This Excel tool enables you to summarize data sets in an almost endless variety of ways. Excel has many useful tools, but we personally believe that pivot tables are the most ingenious and powerful of all. We won't be surprised if you agree. As another example, we introduce Excel's RAND and RANDBETWEEN functions for generating random numbers in Chapter 4. These functions are used in all spreadsheet simulations, at least those that do not take advantage of an add-in. In short, when an Excel tool is useful for a particular type of analysis, we provide step-by-step instructions on how to use it.

Solver Add-in

In Chapters 13 and 14 we make heavy use of Excel's Solver add-in. This add-in, developed by Frontline Systems (not Microsoft), uses powerful algorithms, all behind the scenes, to perform spreadsheet optimization. Before this type of spreadsheet optimization add-in was available, specialized (nonspreadsheet) software was required to solve optimization problems. Now you can do it all within a familiar spreadsheet environment.

SolverTable Add-in

An important theme throughout this book is sensitivity analysis: How do outputs change when inputs change? Typically these changes are made in spreadsheets with a data table, a built-in Excel tool. However, data tables don't work in optimization models, where we would like to see how the
optimal solution changes when certain inputs change. Therefore, we include an Excel add-in called SolverTable, which works almost exactly like Excel's data tables. (This add-in was developed by Albright.) In Chapters 13 and 14 we illustrate the use of SolverTable.

Decision Tools Suite

In addition to SolverTable and built-in Excel add-ins, we also have included on the textbook's Web site an educational version of Palisade Corporation's powerful Decision Tools suite. All of the programs in this suite are Excel add-ins, so the learning curve isn't very steep. There are seven separate add-ins in this suite: @RISK, StatTools, PrecisionTree, TopRank, RISKOptimizer, NeuralTools, and Evolver.[5] We will use only the first three in this book, but all are useful for certain tasks and are described briefly below.

[4] At the time we wrote this edition, Excel 2010 was in beta form and was about to be released. Fortunately, the changes, at least for our purposes, are not extensive, so users familiar with Excel 2007 will have no difficulty in moving to Excel 2010. Where relevant, we have pointed out changes in the new version.

@RISK

The simulation add-in @RISK enables you to run as many replications of a spreadsheet simulation as you like. As the simulation runs, @RISK automatically keeps track of the outputs you select, and it then displays the results in a number of tabular and graphical forms. @RISK also enables you to perform a sensitivity analysis, so that you can see which inputs have the most effect on the outputs. Finally,
@RISK provides a number of spreadsheet functions that enable you to generate random numbers from a variety of probability distributions.

StatTools

Much of this book discusses basic statistical analysis. Here we needed to make an important decision as we developed the book. A number of excellent statistical software packages are on the market, including Minitab, SPSS, SAS, JMP, Stata, and others. Although there are user-friendly Windows versions of these packages, they are not spreadsheet-based. We have found through our own experience that students resist the use of nonspreadsheet packages, regardless of their inherent quality, so we wanted to use Excel as our statistics package. Unfortunately, Excel's built-in statistical tools are rather limited, and the Analysis ToolPak (developed by a third party) that ships with Excel has significant limitations.

Fortunately, the Palisade suite includes a statistical add-in called StatTools. StatTools is powerful, easy to use, and capable of generating output quickly in an easily interpretable form. We do not believe you should have to spend hours each time you want to produce some statistical output. This might be a good learning experience the first time, but it acts as a strong incentive not to perform the analysis at all. We believe you should be able to generate output quickly and easily. This gives you the time to interpret the output, and it also allows you to try different methods of analysis.

A good illustration involves the construction of histograms, scatterplots, and time series graphs, discussed in Chapters 2 and 3. All of these extremely useful graphs can be created in a straightforward way with Excel's built-in tools. But by the time you perform all the necessary steps and dress up the charts exactly as you want them, you will not be very anxious to repeat the whole process. StatTools does it all quickly and easily. (You still might want to dress up the resulting charts, but that's up to you.) Therefore, if we advise you in a later chapter, say, to
look at several scatterplots as a prelude to a regression analysis, you can do so in a matter of seconds.

PrecisionTree

The PrecisionTree add-in is used in Chapter 6 to analyze decision problems with uncertainty. The primary method for performing this type of analysis is to draw a decision tree. Decision trees are inherently graphical, and they have always been difficult to implement in spreadsheets, which are based on rows and columns. However, PrecisionTree does this in a very clever and intuitive way. Equally important, once the basic decision tree has been built, it is easy to use PrecisionTree to perform a sensitivity analysis on the model's inputs.

[5] The Palisade suite has traditionally included two stand-alone programs, BestFit and RISKview. The functionality of both of these is now included in @RISK, so they are not included in the suite.

TopRank

TopRank is a what-if add-in used for sensitivity analysis. It starts with any spreadsheet model, where a set of inputs, along with a number of spreadsheet formulas, leads to one or more outputs. TopRank then performs a sensitivity analysis to see which inputs have the largest effect on a given output. For example, it might indicate which input affects after-tax profit the most: the tax rate, the risk-free rate for investing, the inflation rate, or the price charged by a competitor. Unlike @RISK, TopRank is used when uncertainty is not explicitly built into a spreadsheet model. However, it considers uncertainty implicitly by performing sensitivity analysis on the important model inputs.
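The kind of what-if ranking described for TopRank can be sketched in a few lines of Python. This is only an illustration of one-at-a-time sensitivity analysis, not TopRank's actual procedure: the profit model, the base-case values, the fixed selling price, and the plus-or-minus 10% range are all made-up assumptions, chosen to mirror the four inputs named in the text.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical base-case inputs (illustrative values, not from the book).
BASE_INPUTS: Dict[str, float] = {
    "tax_rate": 0.30,
    "risk_free_rate": 0.04,
    "inflation_rate": 0.03,
    "competitor_price": 100.0,
}


def after_tax_profit(x: Dict[str, float]) -> float:
    """Toy after-tax profit model; every formula here is an assumption."""
    our_price = 100.0                                    # assumed fixed selling price
    units = 10_000 * x["competitor_price"] / our_price   # demand rises with competitor's price
    unit_cost = 60.0 * (1 + x["inflation_rate"])         # unit cost grows with inflation
    interest = 500_000 * x["risk_free_rate"]             # return on invested cash
    pretax = units * (our_price - unit_cost) + interest
    return pretax * (1 - x["tax_rate"])


def rank_inputs(
    model: Callable[[Dict[str, float]], float],
    base: Dict[str, float],
    pct: float = 0.10,
) -> List[Tuple[str, float, float, float]]:
    """Vary one input at a time by +/- pct, holding the others at base values.

    Returns (name, output_at_low, output_at_high, swing) tuples,
    sorted so the input with the largest swing comes first.
    """
    rows = []
    for name, value in base.items():
        low = model({**base, name: value * (1 - pct)})
        high = model({**base, name: value * (1 + pct)})
        rows.append((name, low, high, abs(high - low)))
    return sorted(rows, key=lambda r: r[3], reverse=True)


if __name__ == "__main__":
    for name, low, high, swing in rank_inputs(after_tax_profit, BASE_INPUTS):
        print(f"{name:18s} low={low:12,.0f} high={high:12,.0f} swing={swing:10,.0f}")
```

With these assumed numbers the competitor's price produces the largest swing in profit, followed by the tax rate. A different model, or a different plausible range for each input, could easily reorder the list, which is the reason to run the analysis rather than guess.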
RISKOptimizer

RISKOptimizer combines optimization with simulation. There are often times when you want to use simulation to model some business problem, but you also want to optimize a summary measure, such as a mean, of an output distribution. This optimization can be performed in a trial-and-error fashion, where you try a few values of the decision variables and see which provides the best solution. However, RISKOptimizer provides a more automatic (and time-intensive) optimization procedure.

NeuralTools

In Chapters 10 and 11, we show how regression can be used to find a linear equation that quantifies the relationship between a dependent variable and one or more explanatory variables. Although linear regression is a powerful tool, it is not capable of quantifying all possible relationships. The NeuralTools add-in mimics the working of the human brain to find neural networks that quantify complex nonlinear relationships.

Evolver

In Chapters 13 and 14, we show how the built-in Solver add-in can optimize linear models and even some nonlinear models. But there are some non-smooth nonlinear models that Solver cannot handle. Fortunately, there are other optimization algorithms for such models, including genetic algorithms. The Evolver add-in implements these genetic algorithms.

Software Guide

Figure 1.2 provides a guide to the use of these add-ins throughout the book. We don't show Excel explicitly in this figure for the simple reason that Excel is used extensively in all chapters.

Figure 1.2 Software Guide (table columns: Developer, Add-In, Chapter(s) Where Used)

With Excel and the add-ins included in this book, you have a wealth of software at your disposal. The examples and step-by-step instructions throughout this book will help you become a power user of this software. Admittedly, this takes plenty of practice and a willingness to experiment, but it is certainly within your grasp. When you are finished, we will not be surprised if you rate improved software skills as the most valuable thing you have learned from this book.

1.3 MODELING AND MODELS

We have already used the term model several times in this chapter. Models and the modeling process are key elements throughout this book, so we explain them in more detail in this section.[6]

A model is an abstraction of a real problem. A model tries to capture the essence and key features of the problem without getting bogged down in relatively unimportant details. There are different types of models, and depending on an analyst's preferences and skills, each can be a valuable aid in solving a real problem. We briefly describe three types of models here: graphical models, algebraic models, and spreadsheet models.

1.3.1 Graphical Models

Graphical models are probably the most intuitive and least quantitative type of model. They attempt to portray graphically how different elements of a problem are related: what affects what. A very simple graphical model appears in Figure 1.3. It is called an influence diagram. (It can be constructed with the PrecisionTree add-in discussed in Chapter 6, but we will not use influence diagrams in this book.)

Figure 1.3 Influence Diagram (figure label: Supply)

This particular influence diagram is for a company that is trying to decide how many souvenirs to order for the upcoming Olympics. The essence of the problem is that the company will order a certain supply, customers will request a certain demand, and the combination of supply and demand will yield a certain payoff for the company. The diagram indicates fairly intuitively what affects
what. As it stands, the diagram does not provide enough quantitative details to solve the company's problem, but this is usually not the purpose of a graphical model. Instead, its purpose is usually to show the important elements of a problem and how they are related. For complex problems, this can be very helpful and enlightening information for management.

[6] Management scientists tend to use the terms model and modeling more than statisticians. However, many traditional statistics topics, such as regression analysis and forecasting, are clearly applications of modeling.

1.3.2 Algebraic Models

Algebraic models are at the opposite end of the spectrum. By means of algebraic equations and inequalities, they specify a set of relationships in a very precise way. Their preciseness and lack of ambiguity are very appealing to people with a mathematical background. In addition, algebraic models can usually be stated concisely and with great generality.

A typical example is the product mix problem in Chapter 13. A company can make several products, each of which contributes a certain amount to profit and consumes certain amounts of several scarce resources. The problem is to select the product mix that maximizes profit subject to the limited availability of the resources. All product mix problems can be stated algebraically as follows:

$$\max \sum_{j=1}^{n} p_j x_j \tag{1.1}$$

subject to

$$\sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad 1 \le i \le m \tag{1.2}$$

$$0 \le x_j \le u_j, \quad 1 \le j \le n \tag{1.3}$$

Here $x_j$ is the amount of product $j$ produced, $u_j$ is an upper limit on the amount of product $j$ that can be produced, $p_j$ is the unit
profit margin for product $j$, $a_{ij}$ is the amount of resource $i$ consumed by each unit of product $j$, $b_i$ is the amount of resource $i$ available, $n$ is the number of products, and $m$ is the number of scarce resources. This algebraic model states very concisely that we should maximize total profit [expression (1.1)], subject to consuming no more of each resource than is available [inequality (1.2)], and all production quantities should be between 0 and the upper limits [inequality (1.3)].

Algebraic models appeal to mathematically trained analysts. They are concise, they spell out exactly which data are required (the values of the $u_j$s, the $p_j$s, the $a_{ij}$s, and the $b_i$s would need to be estimated from company data), they scale well (a problem with 500 products and 100 resource constraints is just as easy to state as one with only five products and three resource constraints), and many software packages accept algebraic models in essentially the same form as shown here, so that no translation is required. Indeed, algebraic models were the preferred type of model for years, and still are, by many analysts. Their main drawback is that they require an ability to work with abstract mathematical symbols. Some people have this ability, but many perfectly intelligent people do not.

1.3.3 Spreadsheet Models

An alternative to algebraic modeling is spreadsheet modeling. Instead of relating various quantities with algebraic equations and inequalities, you relate them in a spreadsheet with cell formulas. In our experience, this process is much more intuitive to most people. One of the primary reasons for this is the instant feedback available from spreadsheets. If you enter a formula incorrectly, it is often immediately obvious (from error messages or unrealistic numbers) that you have made an error, which you can then go back and fix. Algebraic models provide no such immediate feedback.

A specific comparison might help at this point. We already saw a general algebraic model of the product mix problem. Figure 1.4, taken from Chapter 13, illustrates
a spreadsheet model for a specific example of this problem. The spreadsheet model should be fairly self-explanatory. All quantities in shaded cells (other than in rows 16 and 25) are inputs to the model, the quantities in row 16 are the decision variables (they correspond to the $x_j$s in the algebraic model), and all other quantities are created through appropriate Excel formulas. To indicate constraints, inequality signs have been entered as labels in appropriate cells.

Figure 1.4 Optimal Solution for Product Mix Model

| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | Assembling and testing computers | | | |
| 3 | Cost per labor hour assembling | 11 | | |
| 4 | Cost per labor hour testing | 15 | | |
| 6 | Inputs for assembling and testing a computer | | | |
| 7 | | Basic | XP | |
| 8 | Labor hours for assembly | 5 | 6 | |
| 9 | Labor hours for testing | 1 | 2 | |
| 10 | Cost of component parts | 150 | 225 | |
| 11 | Selling price | 300 | 450 | |
| 12 | Unit margin | 80 | 129 | |
| 14 | Assembling, testing plan (# of computers) | | | |
| 15 | | Basic | XP | |
| 16 | Number to produce | 560 | 1200 | |
| 17 | | <= | <= | |
| 18 | Maximum sales | 600 | 1200 | |
| 20 | Constraints (hours per month) | Hours used | | Hours available |
| 21 | Labor availability for assembling | 10000 | <= | 10000 |
| 22 | Labor availability for testing | 2960 | <= | 3000 |
| 24 | Net profit (this month) | Basic | XP | Total |
| 25 | | 44800 | 154800 | 199600 |

Range names used: Hours_available = Model!D21:D22; Hours_used = Model!B21:B22; Maximum_sales = Model!B18:C18; Number_to_produce = Model!B16:C16; Total_profit = Model!D25.

Although a well-designed and well-documented spreadsheet model such as the one in Figure 1.4 is undoubtedly more intuitive for most people than its algebraic counterpart, the art of
developing good spreadsheet models is not easy. Obviously, they must be correct. The formulas relating the various quantities must have the correct syntax, the correct cell references, and the correct logic. In complex models this can be quite a challenge.

However, we do not believe that correctness is enough. If spreadsheet models are to be used in the business world, they must also be well designed and well documented. Otherwise, no one other than you (and maybe not even you after a few weeks have passed) will be able to understand what your models do or how they work. The strength of spreadsheets is their flexibility; you are limited only by your imagination. However, this flexibility can be a liability in spreadsheet modeling unless you design your models carefully.

Note the clear design in Figure 1.4. Most of the inputs are grouped at the top of the spreadsheet. All of the financial calculations are done at the bottom. When there are constraints, the two sides of the constraints are placed next to each other (as in the range B21:D22). Colored backgrounds (which appear on the screen but not in this book) are used for added clarity, and descriptive labels are used liberally. Excel itself imposes none of these rules, but you should impose them on yourself.

We have made a conscious effort to establish good habits for you to follow throughout this book. We have designed and redesigned our spreadsheet models so that they are as clear as possible. This does not mean that you have to copy everything we do; everyone tends to develop their own
spreadsheet style, but our models should give you something to emulate. Just remember that in the business world, you typically start with a blank spreadsheet. It is then up to you to develop a model that is not only correct but is also intelligible to you and others. This takes a lot of practicing and a lot of editing, but it is a skill well worth developing.

1.3.4 A Seven-Step Modeling Process

Most of the modeling you will do in this book is only part of an overall modeling process typically done in the business world. We portray it as a seven-step process, as discussed here. But not all problems require all seven steps. For example, the analysis of survey data might entail primarily steps 2 (data analysis) and 5 (decision making) of the process, without the formal model building discussed in steps 3 and 4.

The Modeling Process

1. Define the problem. Typically, a company does not develop a model unless it believes it has a problem. Therefore, the modeling process really begins by identifying an underlying problem. Perhaps the company is losing money, perhaps its market share is declining, or perhaps its customers are waiting too long for service. Any number of problems might be evident. However, as several people have warned (see Miser, 1993, and Volkema, 1995, for example), this step is not always as straightforward as it might appear. The company must be sure that it has identified the correct problem before it spends time, effort, and money trying to solve it.

For example, Miser cites the experience of an analyst who was hired by the military to investigate overly long turnaround times between fighter planes landing and taking off again to rejoin the battle. The military was convinced that the problem was caused by inefficient ground crews; if they were faster, turnaround times would decrease. The analyst nearly accepted this statement of the problem and was about to do classical time-and-motion studies on the ground crew to pinpoint the sources of their inefficiency. However, by snooping around, he found that
the problem obviously lay elsewhere. The trucks that refueled the planes were frequently late, which in turn was due to the inefficient way they were refilled from storage tanks at another location. Once this latter problem was solved (and its solution was embarrassingly simple), the turnaround times decreased to an acceptable level without any changes on the part of the ground crews. If the analyst had accepted the military's statement of the problem, the real problem might never have been located or solved.

2. Collect and summarize data. This crucial step in the process is often the most tedious. All organizations keep track of various data on their operations, but these data are often not in the form an analyst requires. They are also typically scattered in different places throughout the organization, in all kinds of different formats. Therefore, one of the first jobs of an analyst is to gather exactly the right data and summarize the data appropriately, as we discuss in detail in Chapters 2 and 3, for use in the model. Collecting the data typically requires asking questions of key people (such as the accountants) throughout the organization, studying existing organizational databases, and performing time-consuming observational studies of the organization's processes. In short, it entails a lot of legwork. Fortunately, many companies have understood the need for good, clean data and have spent large amounts of time and money to build data warehouses for quantitative analysis.

3. Develop a
model. This is the step we emphasize, especially in the latter chapters of the book. The form of the model varies from one situation to another. It could be a graphical model, an algebraic model, or a spreadsheet model. The key is that the model should capture the important elements of the business problem in such a way that it is understandable by all parties involved. This latter requirement is why we favor spreadsheet models, especially when they are well designed and well documented.

4. Verify the model. Here the analyst tries to determine whether the model developed in the previous step is an accurate representation of reality. A first step in determining how well the model fits reality is to check whether the model is valid for the current situation. This verification can take several forms. For example, the analyst could use the model with the company's current values of the input parameters. If the model's outputs are then in line with the outputs currently observed by the company, the analyst has at least shown that the model can duplicate the current situation.

A second way to verify a model is to enter a number of input parameters (even if they are not the company's current inputs) and see whether the outputs from the model are reasonable. One common approach is to use extreme values of the inputs to see whether the outputs behave as they should. If they do, this is another piece of evidence that the model is reasonable.

If certain inputs are entered in the model and the model's outputs are not as expected, there could be two causes. First, the model could simply be a poor representation of reality. In this case it is up to the analyst to refine the model so that it is more realistic. The second possible cause is that the model is fine but our intuition is not very good. In this case the fault lies with us, not the model. An interesting example of faulty intuition occurs with random sequences of 0s and 1s, such as might occur with successive flips of a fair coin. Most people expect that heads
and tails will alternate and that there will be very few sequences of, say, four or more heads (or tails) in a row. However, a perfectly accurate simulation model of these flips will show, contrary to what most people expect, that fairly long runs of heads or tails are not at all uncommon. In fact, one or two long runs should be expected if there are enough flips.

The fact that outcomes sometimes defy intuition is an important reason why models are important. These models prove that your ability to predict outcomes in complex environments is often not very good.

5. Select one or more suitable decisions. Many, but not all, models are decision models. For any specific decisions, the model indicates the amount of profit obtained, the amount of cost incurred, the level of risk, and so on. If the model is working correctly, as discussed in step 4, then it can be used to see which decisions produce the best outputs.

6. Present the results to the organization. In a classroom setting you are typically finished when you have developed a model that correctly solves a particular problem. In the business world a correct model, even a useful one, is not always enough. An analyst typically has to "sell" the model to management. Unfortunately, the people in management are sometimes not as well trained in quantitative methods as the analyst, so they are not always inclined to trust complex models.

There are two ways to mitigate this problem. First, it is helpful to include relevant people throughout the company in the modeling process, from beginning to end, so that
everyone has an understanding of the model and feels an ownership of it. Second, it helps to use a spreadsheet model whenever possible, especially if it is designed and documented properly. Almost everyone in today's business world is comfortable with spreadsheets, so spreadsheet models are more likely to be accepted.

7. Implement the model and update it over time. Again, there is a big difference between a classroom situation and a business situation. When you turn in a classroom assignment, you are typically finished with that assignment and can await the next one. In contrast, an analyst who develops a model for a company usually cannot pack up his bags and leave. If the model is accepted by management, the company will then need to implement it company-wide. This can be very time consuming and politically difficult, especially if the model's prescriptions represent a significant change from the past. At the very least, employees must be trained how to use the model on a day-to-day basis. In addition, the model will probably have to be updated over time, either because of changing conditions or because the company sees more potential uses for the model as it gains experience using it. This presents one of the greatest challenges for a model developer, namely, the ability to develop a model that can be modified as the need arises.

1.4 CONCLUSION

In this chapter we have tried to convince you that the skills in this book are important for you to know as you enter the business world. The methods we discuss are no longer the sole province of the "quant jocks." By having a PC on your desk that is loaded with powerful software, you incur a responsibility to use this software to analyze business problems. We have described the types of problems you will learn to analyze in this book, along with the software you will use to analyze them. We also discussed the modeling process, a theme that runs throughout this book. Now it is time for you to get started.
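Step 4 of the modeling process in this chapter claims that long runs of heads or tails are not at all uncommon in a sequence of fair coin flips. The claim can be checked with a small simulation. What follows is only a minimal sketch in Python (our choice of language here; the book itself works in Excel). The function names, the sequence length of 100 flips, the run threshold of 5, and the number of trials are all illustrative assumptions, not values taken from the text.

```python
import random


def longest_run(flips):
    """Return the length of the longest run of identical consecutive values."""
    if not flips:
        return 0
    best = current = 1
    for prev, cur in zip(flips, flips[1:]):
        # Extend the current run if the value repeats, otherwise start a new run.
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best


def share_with_long_run(n_flips=100, min_run=5, n_trials=10_000, seed=0):
    """Estimate the fraction of simulated sequences of fair coin flips
    that contain at least one run of min_run or more heads (or tails)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    hits = 0
    for _ in range(n_trials):
        flips = [rng.randint(0, 1) for _ in range(n_flips)]
        if longest_run(flips) >= min_run:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print(f"Share of sequences with a run of 5 or more: {share_with_long_run():.3f}")
```

With these assumed settings, the estimated share should come out far higher than most people's intuition suggests, which is the point made in step 4. Changing `n_flips` or `min_run` shows how quickly long runs become likely as the number of flips grows.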
ENTERTAINMENT ON A CRUISE SHIP

Cruise ship traveling has become big business. Many cruise lines are now competing for customers of all age groups and socioeconomic levels. They offer all types of cruises, from relatively inexpensive 3- to 4-day cruises in the Caribbean, to 12- to 15-day cruises in the Mediterranean, to several-month around-the-world cruises. Cruises have several features that attract customers, many of whom book six months or more in advance: (1) they offer a relaxing, everything-done-for-you way to travel; (2) they serve food that is plentiful, usually excellent, and included in the price of the cruise; (3) they stop at a number of interesting ports and offer travelers a way to see the world; and (4) they provide a wide variety of entertainment, particularly in the evening.

This last feature, the entertainment, presents a difficult problem for a ship's staff. A typical cruise might have well over 1000 passengers, including elderly singles and couples, middle-aged people with or without children, and young people, often honeymooners. These various types of passengers have varied tastes in terms of their after-dinner preferences in entertainment. Some want traditional dance music, some want comedians, some want rock music, some want movies, some want to go back to their cabins and read, and so on. Obviously, cruise entertainment directors want to provide the variety of entertainment their customers desire, within a reasonable budget, because satisfied customers tend to be repeat customers. The question is how to provide the right mix of entertainment.

On a
cruise one of the authors and his wife took a few years ago, the entertainment was of high quality and there was plenty of variety. A seven-piece show band played dance music nightly in the largest lounge, two other small musical combos played nightly at two smaller lounges, a pianist played nightly at a piano bar in an intimate lounge, a group of professional singers and dancers played Broadway-type shows about twice weekly, and various professional singers and comedians played occasional single-night performances.[7] Although this entertainment was free to all of the passengers, much of it had embarrassingly low attendance. The nightly show band and musical combos, who were contracted to play nightly until midnight, often had fewer than a half dozen people in the audience, sometimes literally none. The professional singers, dancers, and comedians attracted larger audiences, but there were still plenty of empty seats. In spite of this, the cruise staff posted a weekly schedule, and they stuck to it regardless of attendance. In a short-term financial sense, it didn't make much difference. The performers got paid the same whether anyone was in the audience or not, the passengers had already paid (indirectly) for the entertainment as part of the cost of the cruise, and the only possible opportunity cost to the cruise line (in the short run) was the loss of liquor sales from the lack of passengers in the entertainment lounges. The morale of the entertainers was not great (entertainers love packed houses), but they usually argued, philosophically, that their hours were relatively short and they were still getting paid to see the world.

If you were in charge of entertainment on this ship, how would you describe the problem with entertainment? Is it a problem with deadbeat passengers, low-quality entertainment, or a mismatch between the entertainment offered and the entertainment desired? How might you try to solve the problem? What constraints might you have to work within? Would you keep a strict schedule, such as the one
followed by this cruise director, or would you play it more by ear? Would you gather data to help solve the problem? What data would you gather? How much would financial considerations dictate your decisions? Would they be long-term or short-term considerations?

[7] There was also a moderately large onboard casino, but it tended to attract the same people every night, and it was always closed when the ship was in port.
CHAPTER 2 Describing the Distribution of a Single Variable

RECENT PRESIDENTIAL ELECTIONS

Photo by Alex Wong/Newsmakers/Getty

Presidential elections in the United States are scrutinized more than ever. It hardly seems that one is over before we start hearing plans and polls for the next. There is thorough coverage of the races leading up to the elections, but it is also interesting to analyze the results after the elections have been held. This is not difficult, given the many informative Web sites that appear immediately with election results. For example, a Web search for "2008 presidential election results" finds many sites with in-depth results, interactive maps, and more. In addition, the resulting data can often be imported into Excel rather easily for further analysis.

The file contains such downloaded data for the 2000 (Bush versus Gore), 2004 (Bush versus Kerry), and 2008 (Obama versus McCain) elections. The results of the 2000 election are particularly interesting. As you probably remember, this was one of the closest elections of all time, with Bush defeating Gore by a very narrow margin in the electoral vote, 271 to 266, following a disputed recount in Florida. In fact, Gore actually beat Bush in the total count of U.S. votes, 50,999,897 to 50,456,002. However, because of the all-or-nothing nature of electoral votes in each state, Bush's narrow margin of victory in many closely contested states won
him a lot of electoral votes. In contrast, Gore outdistanced Bush by a wide margin in several large states, winning him the same electoral votes he would have won even if these races had been much closer.

A closer analysis of the state-by-state results shows how this actually happened. In the Excel file, we created two new columns: Bush Votes minus Gore Votes and Pct for Bush minus Pct for Gore, with a value for each state (including the District of Columbia). We then created column charts of these two variables, as shown in Figures 2.1 and 2.2.

[Figure 2.1 Chart of Vote Differences: column chart of Votes for Bush minus Votes for Gore, by state]

[Figure 2.2 Chart of Percent Differences: column chart of Pct for Bush minus Pct for Gore, by state]

Each of these charts tells the same story, but in slightly different ways. From Figure 2.1, we see how Gore won big (large vote difference) in several large states, most notably California, Massachusetts, and New York. Bush's only comparable margin of victory was in his home state of Texas. However, Bush won a lot of close races in states with relatively few electoral votes, but enough to add up to an overall win. As Figure 2.2 indicates, many of these "close" races, such as Alaska and Idaho for Bush and District of Columbia for Gore, were not that close after all, at least not from a percentage standpoint. This is one case of many where multiple charts can be created to "tell a story." Perhaps an argument can be made that Figure 2.1
tells the story best, but Figure 2.2 is also interesting.

The bottom line is that the election could easily have gone the other way. With one more swing state, particularly Florida, Al Gore would have been president. On the other hand, Gore won some very close races as well, particularly in Iowa, Minnesota, New Mexico, and Oregon. If these had gone the other way, the popular vote would still have been very close, but Bush's victory in the electoral vote would have been more impressive.

2.1 INTRODUCTION

It is customary to refer to the raw numbers as data and the output of a statistical analysis as information. You start with the data, and you hope to end with information that an organization can use for competitive advantage. The goal of this chapter and the next is very simple: to make sense out of data by constructing appropriate summary measures, tables, and graphs. Our purpose here is to take a set of data that at first glance might have little meaning and to present the data in a form that makes sense to people. There are numerous ways to do this, limited only by your imagination, but there are several tools used most often: (1) a variety of graphs, including bar charts, pie charts, histograms, scatterplots, and time series graphs; (2) numerical summary measures such as counts, percentages, averages, and measures of variability; and (3) tables of summary measures such as totals, averages, and counts, grouped by categories. These terms might not all be familiar to you at this point, but you have undoubtedly seen examples of them in newspapers, magazine articles, and books.

The material in these two chapters is simple, complex, and important. It is simple because there are no difficult mathematical concepts. With the possible exception of variance, standard deviation, and correlation, all of the numerical measures, graphs, and tables are natural and easy to understand. It used to be a tedious chore to produce them, but with the advances in statistical software, including add-ins for spreadsheet packages such as Excel, they
can now be produced quickly and easily.

If it is so easy, why do we also claim that the material in this chapter is complex? The data sets available to companies in today's computerized world tend to be extremely large and filled with "unstructured" data. As you will see, even in data sets that are quite small in comparison to those that real companies face, it is a challenge to summarize the data so that the important information stands out clearly. It is easy to produce summary measures, graphs, and tables, but our goal is to produce the most appropriate ones.

The typical employees of today, not just the managers and technical specialists, have a wealth of easy-to-use tools at their disposal, and it is frequently up to them to summarize data in a way that is both meaningful and useful to their constituents: people within their company, their company's suppliers, and their company's customers. It takes some training and practice to do this effectively.

Because today's companies are inundated with data, and because virtually every employee in the company must summarize data to some extent, the material in this chapter and the next one is arguably the most important material in the book. There is sometimes a tendency to race through the "descriptive statistics" chapters to get to the more "interesting" material in later chapters as quickly as possible. We want to resist this tendency. The material covered in these two chapters deserves close examination, and this takes some time.

Data analysis in the real world is never done in a vacuum. It is done to solve a problem. Typically, there are four steps that are followed, whether the context is business, medical
science, or any other field. The first step is to recognize a problem that needs to be solved. Perhaps a retail company is experiencing decreased sales in a particular region or for a particular product. Why is this happening? The second step is to gather data to help understand and then solve the problem. This might be done through a survey of customers, by assembling data from already-existing company systems, by finding relevant data on the Web, or other means. Once the data is gathered, the third step is to analyze the data using the tools you will learn in the book. The fourth step is to act on this analysis by changing policies, undertaking initiatives, publishing reports, and so on. Of course, the analysis can sometimes repeat steps. For example, once a given set of data is analyzed, it might be apparent that even more data needs to be collected.

Margin note: Use your imagination to ask interesting questions about the [...]. We will supply you with the tools to answer these questions.

As we discuss the tools for analyzing data, we will often jump into the third step directly, providing you with a data set to analyze. Although this data set may not be directly connected to the goal of solving some company's problem, you should still strive to ask interesting questions of the data. (We have tried to include interesting data sets, often containing real data, that make this possible.) If the data set contains salaries, you might ask what drives these salaries. Does it depend on the industry a person is in? Does it depend on gender? Does it depend on educational background? Is the salary structure, whatever it is, changing over time? If the data set contains cost-of-living indexes, there are also a lot of interesting questions you can ask. How are the indexes changing over time? Does this behavior vary in different geographical regions? Does this behavior vary across different items such as housing, food,
and automobiles? These early chapters provide you with many tools to answer such questions, but it is up to you to ask good questions and then take advantage of the most appropriate tools to answer them.

The material in these chapters is organized as follows. In this chapter we present a number of ways for analyzing one variable at a time. In the next chapter we look at ways of discovering relationships between variables. In addition, there is a bonus Chapter 17 on importing data from external sources into Excel, a natural companion to Chapters 2 and 3. This bonus chapter is available on this textbook's Web site.

2.2 BASIC CONCEPTS

We begin with a short discussion of several important concepts: populations and samples, data sets, variables and observations, and types of data.

2.2.1 Populations and Samples

First, we distinguish between a population and a sample. A population includes all of the entities of interest, whether they be people, households, machines, or whatever. The following are three typical populations:

- All potential voters in a presidential election
- All subscribers to cable television
- All invoices submitted for Medicare reimbursement by nursing homes

In these situations and many others, it is virtually impossible to obtain information about all members of the population. For example, it is far too costly to ask all potential voters which presidential candidates they prefer. Therefore, we often try to gain insights into the characteristics of a population by examining a sample, or subset, of the population. In later chapters we will examine populations and samples in some depth, but for now it is
enough to know that we typically want samples to be representative of the population so that observed characteristics of the sample can be generalized to the population as a whole.

A population includes all of the entities of interest in a study. A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.

A famous example where a sample was not representative is the case of the Literary Digest fiasco of 1936. In the 1936 presidential election, subscribers to the Literary Digest, a highbrow literary magazine, were asked to mail in a ballot with their preference for president. Overwhelmingly, these ballots favored the Republican candidate, Alf Landon, over the Democratic candidate, Franklin D. Roosevelt. Despite this, FDR was a landslide winner. The discrepancy arose because the readers of the Literary Digest were not at all representative of most voters in 1936. Most voters in 1936 could barely make ends meet, let alone subscribe to a literary magazine. Thus, the typical lower-to-middle-income voter had almost no chance of being chosen in this sample.

Today, Gallup, Harris, and other pollsters make a conscious effort to ensure that their samples, which usually include about 1000 to 1500 people, are representative of the population. It is truly remarkable, for example, that a sample of 1500 voters can almost surely predict a candidate's actual percentage of votes correctly to within 3%. We explain why this is possible in Chapters 7 and 8. The important point is that a representative sample of reasonable size can provide a lot of important information about the population of interest.

We use the terms population and sample a few times in this chapter, which is why we have defined them here. However, the distinction is not really important until later chapters. Our intent in this chapter is to focus entirely on the data in a
given data set, not to generalize beyond it. Therefore, the given data set could be a population or a sample from a population. For now, the distinction is largely irrelevant.

2.2.2 Data Sets, Variables, and Observations

We now discuss the types of data sets we will examine. Although the focus of this book is Excel, virtually all statistical software packages use the same concept of a data set: A data set is generally a rectangular array of data where the columns contain variables, such as height, gender, and income, and each row contains an observation. Each observation includes the attributes of a particular member of the population, whether it be a person, a company, a city, a machine, or other entity. This terminology is common, but other terms are often used. A variable (column) is often called a field or an attribute, and an observation (row) is often called a case or a record. Also, data sets are occasionally rearranged, so that the variables are in rows and the observations are in columns. However, the most common arrangement by far is to have variables in columns, with variable names in the top row, and observations in the remaining rows.

A data set is usually a rectangular array of data, with variables in columns and observations in rows. A variable (or field or attribute) is a characteristic of members of a population, such as height, gender, or salary. An observation (or case or record) is a list of all variable values for a single member of a population.

Figure 2.3 Environmental Survey Data

EXAMPLE 2.1 DATA FROM AN ENVIRONMENTAL SURVEY

The data set
shown in Figure 2.3 represents 30 responses from a questionnaire concerning the president's environmental policies. (See the file Questionnaire Data.xlsx.) Identify the variables and observations.

(Spreadsheet columns A through G; variable names are in row 1.)

| Person | Age | Gender | State | Children | Salary | Opinion |
|---|---|---|---|---|---|---|
| 1 | 35 | Male | Minnesota | 1 | 65400 | 5 |
| 2 | 61 | Female | Texas | 2 | 62000 | 1 |
| 3 | 35 | Male | Ohio | 0 | 63200 | 3 |
| 4 | 37 | Male | Florida | 2 | 52000 | 5 |
| 5 | 32 | Female | California | 3 | 81400 | 1 |
| 6 | 33 | Female | New York | 3 | 46300 | 5 |
| 7 | 65 | Female | Minnesota | 2 | 49600 | 1 |
| 8 | 45 | Male | New York | 1 | 45900 | 5 |
| 9 | 40 | Male | Texas | 3 | 47700 | 4 |
| 10 | 32 | Female | Texas | 1 | 59900 | 4 |
| 11 | 57 | Male | New York | 1 | 48100 | 4 |
| 12 | 38 | Female | Virginia | 0 | 58100 | 3 |
| 13 | 37 | Female | Illinois | 2 | 56000 | 1 |
| 14 | 42 | Female | Virginia | 2 | 53400 | 1 |
| 15 | 38 | Female | New York | 2 | 39000 | 2 |
| 16 | 48 | Male | Michigan | 1 | 61500 | 2 |
| 17 | 40 | Male | Ohio | 0 | 37700 | 1 |
| 18 | 57 | Female | Michigan | 2 | 36700 | 4 |
| 19 | 44 | Male | Florida | 2 | 45200 | 3 |
| 20 | 40 | Male | Michigan | 0 | 59000 | 4 |
| 21 | 21 | Female | Minnesota | 2 | 54300 | 2 |
| 22 | 49 | Male | New York | 1 | 62100 | 4 |
| 23 | 34 | Male | New York | 0 | 78000 | 3 |
| 24 | 49 | Male | Arizona | 0 | 43200 | 5 |
| 25 | 40 | Male | Arizona | 1 | 44500 | 3 |
| 26 | 38 | Male | Ohio | 1 | 43300 | 1 |
| 27 | 27 | Male | Illinois | 3 | 45400 | 2 |
| 28 | 63 | Male | Michigan | 2 | 53900 | 1 |
| 29 | 52 | Male | California | 1 | 44100 | 3 |
| 30 | 48 | Female | New York | 2 | 31000 | 4 |

Objective: To illustrate variables and observations in a typical data set.

Solution: This data set provides observations on 30 people who responded to the questionnaire. Each observation lists the person's age, gender, state of residence, number of children, annual salary, and opinion of the president's environmental policies. These six pieces of
information represent the variables. It is customary to include a row (row 1 in this case) that lists variable names. These variable names should be concise but meaningful. Note that an index of the observation is often included in column A. If you start sorting on other variables, you can always sort on the index to get back to the original sort order.

As you will see shortly when we begin to use a very powerful statistical add-in for Excel called StatTools, the concept of a data set is crucial. Before you can perform any statistical analysis on a data set with StatTools, you must designate a rectangular range as a StatTools data set. This is easy, yet it must be done. As you will also see, StatTools allows several layouts for data sets, including one where the variables are in rows and the observations are in columns. However, the default layout, the one you will see over 99% of the time, is the one shown in Figure 2.3, where variables are in columns, observations are in rows, and the top row contains variable names.

2.2.3 Types of Data

There are several ways to categorize data, as we explain in the context of Example 2.1. A basic distinction is between numerical and categorical data. The distinction here is whether you intend to do any arithmetic on the data. It makes sense to do arithmetic on numerical data, but not on categorical data. (Actually, there is a third data type, a date variable. As you may know, Excel stores dates as numbers, but for obvious reasons, dates are treated differently from typical numbers.)

A variable is numerical if meaningful arithmetic can be performed on it. Otherwise, the variable is categorical.

Margin note: Three variables that appear to be numerical but are usually treated as categorical are phone numbers, zip codes, and Social Security numbers. Do you see why? Can you think of others?

In the questionnaire data, Age, Children, and Salary are clearly numerical. For example, it makes perfect sense to sum or average any
of these. In contrast, Gender and State are clearly categorical because they are expressed as text, not numbers.

The Opinion variable is less obvious. It is expressed numerically, on a 1-to-5 scale. However, these numbers are really only codes for the categories strongly disagree, disagree, neutral, agree, and strongly agree. We never intend to perform arithmetic on these numbers; in fact, it is not really appropriate to do so. Therefore, it is most appropriate to treat the Opinion variable as categorical. Note, too, that there is a definite ordering of its categories, whereas there is no natural ordering of the categories for the Gender or State variables. When there is a natural ordering of categories, we classify the variable as ordinal. If there is no natural ordering, as with the Gender and State variables, we classify the variables as nominal. However, both ordinal and nominal variables are categorical.

A categorical variable is ordinal if there is a natural ordering of its possible values. If there is no natural ordering, it is nominal.

Excel Tip: How do you remember, for example, that 1 stands for strongly disagree in the Opinion variable? You can enter a comment (a reminder to yourself and others) in any cell. To do so, right-click on a cell and select the Insert Comment item. A small red tag appears in any cell with a comment. Moving the cursor over that cell causes the comment to appear. You will see numerous comments in the files that accompany this book.

Categorical variables can be coded numerically or left uncoded. In Figure 2.3, Gender has
not been coded, whereas Opinion has been coded. This is largely a matter of taste, so long as you realize that coding a truly categorical variable does not make it numerical and open to arithmetic operations. An alternative way of displaying the data appears in Figure 2.4. Now Opinion has been replaced by text, and Gender has been coded as 1 for males and 0 for females. This 0-1 coding for a categorical variable is very common. Such a variable is called a dummy variable, and it often simplifies the analysis. You will see dummy variables often throughout the book.

A dummy variable is a 0-1 coded variable for a specific category. It is coded as 1 for all observations in that category and 0 for all observations not in that category.

Figure 2.4 Environmental Data Using a Different Coding

| Person | Age | Gender | State | Children | Salary | Opinion |
|---|---|---|---|---|---|---|
| 1 | Middle-aged | 1 | Minnesota | 1 | $65,400 | Strongly agree |
| 2 | Elderly | 0 | Texas | 2 | $62,000 | Strongly disagree |
| 3 | Middle-aged | 1 | Ohio | 0 | $53,200 | Neutral |
| 4 | Middle-aged | 1 | Florida | 2 | $52,000 | Strongly agree |
| 5 | Young | 0 | California | 3 | $81,400 | Strongly disagree |
| 6 | Young | 0 | New York | 3 | $46,300 | Strongly agree |
| 7 | Elderly | 0 | Minnesota | 2 | $49,600 | Strongly disagree |
| 8 | Middle-aged | 1 | New York | 1 | $45,900 | Strongly agree |
| 9 | Middle-aged | 1 | Texas | 3 | $47,700 | Agree |
| 10 | Young | 0 | Texas | 1 | $59,900 | Agree |
| 11 | Middle-aged | 1 | New York | 1 | $48,100 | Agree |
| 12 | Middle-aged | 0 | Virginia | 0 | $58,100 | Neutral |
| 13 | Middle-aged | 0 | Illinois | 2 | $56,000 | Strongly disagree |
| 14 | Middle-aged | 0 | Virginia | 2 | $53,400 | Strongly disagree |
| 15 | Middle-aged | 0 | New York | 2 | $39,000 | Disagree |
| 16 | Middle-aged | 1 | Michigan | 1 | $61,500 | Disagree |
| 17 | Middle-aged | 1 | Ohio | 0 | $37,700 | Strongly disagree |
| 18 | Middle-aged | 0 | Michigan | 2 | $36,700 | Agree |
| 19 | Middle-aged | 1 | Florida | 2 | $45,200 | Neutral |
| 20 | Middle-aged | 1 | Michigan | 0 | $59,000 | Agree |
| 21 | Young | 0 | Minnesota | 2 | $54,300 | Disagree |
| 22 | Middle-aged | 1 | New York | 1 | $62,100 | Agree |
| 23 | Young | 1 | New York | 0 | $78,000 | Neutral |
| 24 | Middle-aged | 1 | Arizona | 0 | $43,200 | Strongly agree |
| 25 | Middle-aged | 1 | Arizona | 1 | $44,500 | Neutral |
| 26 | Middle-aged | 1 | Ohio | 1 | $43,300 | Strongly disagree |
| 27 | Young | 1 | Illinois | 3 | $45,400 | Disagree |
| 28 | Elderly | 1 | Michigan | 2 | $53,900 | Strongly disagree |
| 29 | Middle-aged | 1 | California | 1 | $44,100 | Neutral |
| 30 | Middle-aged | 0 | New York | 2 | $31,000 | Agree |

Note in the figure: the formulas in columns B, C, and G produce this recoded data. The formulas for Age and Opinion are based on the lookup tables below.

Age lookup table (range name AgeLookup)

| Lower limit | Category |
|---|---|
| 0 | Young |
| 35 | Middle-aged |
| 60 | Elderly |

Opinion lookup table (range name OpinionLookup)

| Code | Category |
|---|---|
| 1 | Strongly disagree |
| 2 | Disagree |
| 3 | Neutral |
| 4 | Agree |
| 5 | Strongly agree |

In addition, we have categorized the Age variable as young (34 years or younger), middle-aged (from 35 to 59 years), and elderly (60 years or older). This method of taking a numerical variable and making it categorical is called binning (putting the data into discrete bins), and it is also very common. (It is also called discretizing.) The purpose of the study dictates whether age should be treated numerically or should be binned; there is no absolute right or wrong way.

A binned (or discretized) variable corresponds to a numerical variable that has been categorized into discrete categories. These categories are typically called bins.

Excel Tip: As Figure 2.4 indicates, we used lookup tables, along with the very important VLOOKUP function, to transform the data set from Figure 2.3 to Figure 2.4. Take a look at these functions in the questionnaire file. There is arguably no more important Excel function than VLOOKUP, so you should definitely learn how to use it.

Numerical variables can be classified as
discrete or continuous. The basic distinction is whether the data arise from counts or continuous measurements. The variable Children is clearly a count (that is, discrete), whereas the variable Salary is best treated as continuous. This distinction between discrete and continuous variables is sometimes important because it dictates the type of analysis that is most natural.

A numerical variable is discrete if it results from a count, such as the number of children. A continuous variable is the result of an essentially continuous measurement, such as weight or height.

Finally, data sets can be categorized as cross-sectional or time series. The opinion data set in Example 2.1 is cross-sectional. A pollster evidently sampled a cross section of people at one particular point in time. In contrast, time series data occur when we track one or more variables through time. A typical example of a time series variable is the series of daily closing values of the Dow Jones Index. Very different types of analyses are appropriate for cross-sectional and time series data, as becomes apparent in this and later chapters.

Cross-sectional data are data on a cross section of a population at a distinct point in time. Time series data are data collected over time.

A time series data set generally has the same layout (variables in columns and observations in rows), but now each variable is a time series. Also, one of the columns usually indicates the time period. A typical example appears in Figure 2.5. (See the file Toy Revenues.xlsx.) It has quarterly observations on revenues from toy sales over a four-year period in column B, with the time periods listed chronologically in column A. Of course, there could be other related time series variables to the right of column B.

Figure 2.5 Typical Time Series Data Set (all monetary values are in thousands of dollars)

| Quarter | Revenue |
|---|---|
| Q1 2007 | 1026 |
| Q2 2007 | 1056 |
| Q3 2007 | 1182 |
| Q4 2007 | 2861 |
| Q1 2008 | 1172 |
| Q2 2008 | 1249 |
| Q3 2008 | 1346 |
| Q4 2008 | 3402 |
| Q1 2009 | 1286 |
| Q2 2009 | 1317 |
| Q3 2009 | 1449 |
| Q4 2009 | 3893 |
| Q1 2010 | 1462 |
| Q2 2010 | 1452 |
| Q3 2010 | 1631 |
| Q4 2010 | 4200 |

2.3 DESCRIPTIVE MEASURES FOR CATEGORICAL VARIABLES

In this section we indicate methods for describing a categorical variable. Because it is not appropriate to perform arithmetic on the actual values of the variable, there are only a few possibilities for describing the variable, and these are all based on counting. First, you can count the number of categories. Many categorical variables, such as Gender, have only two categories. Others, such as Region, can have more than two categories. As you count the categories, you can also give the categories names, such as Male and Female. Keep in mind that categorical variables, such as Opinion in Example 2.1, can be coded numerically. In these cases, it is still a good idea to supply text descriptions of these categories, such as "strongly agree," and it is often useful to substitute these meaningful descriptions for the numerical codes, as in Figure 2.4. This is especially useful for statistical reports.

The only meaningful way to summarize categorical data is with counts of observations in its categories.

Once you know the number of categories and their names, the only thing left to do is count the number of observations in each category.¹ The resulting counts can be reported as raw counts, or they can be transformed into percentages. For example, if there are 1000 observations, you can report that there are 560 males and 440 females, or you can report that 56% of the observations are males and 44% are females.
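The count-then-percentage idea can also be sketched outside Excel. The following minimal Python illustration is not from the text: the function name `summarize_categories` is hypothetical, and the data are made up to match the 560 males and 440 females mentioned above.

```python
from collections import Counter


def summarize_categories(values):
    """Return {category: (count, percent)} for a list of categorical values.

    Hypothetical helper for illustration only; only counting is performed,
    never arithmetic on the category labels themselves.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


# Made-up data matching the example in the text: 560 males, 440 females.
genders = ["M"] * 560 + ["F"] * 440
summary = summarize_categories(genders)

for category, (count, percent) in summary.items():
    print(f"{category}: count={count}, percent={percent:.1f}%")
```

Reporting both numbers side by side, as the loop does, mirrors the two ways of reporting counts described above.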
In fact, it is often useful to report the counts in both of these ways. Finally, once you have the counts, you can display them graphically, usually in a column chart or a pie chart. The following example illustrates how to do this in Excel.

EXAMPLE 2.2 SUPERMARKET SALES

The file Supermarket Transactions.xlsx contains over 14,000 transactions made by supermarket customers over a period of approximately two years. (The data are not real, but real supermarket chains have huge data sets just like this one.) A small sample of the data appears in Figure 2.6. Column B contains the date of the purchase, column C is a unique identifier for each customer, columns D-H contain information about the customer, columns I-K contain the location of the store, columns L-N contain information about the product purchased, and the last two columns indicate the number of items purchased and the amount paid. Which of the variables are categorical, and how can these categorical variables be summarized?

Figure 2.6 Supermarket Data Set

| Transaction | Purchase Date | Customer ID | Gender | Marital Status | Homeowner | Children | Annual Income | City | State or Province | Country | Product Family | Product Department | Product Category | Units Sold | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 12/18/2007 | 7223 | F | S | Y | 2 | 30K-50K | Los Angeles | CA | USA | Food | Snack Foods | Snack Foods | 5 | 27.38 |
| 2 | 12/20/2007 | 7341 | M | M | Y | 5 | 70K-90K | Los Angeles | CA | USA | Food | Produce | Vegetables | 5 | 14.90 |
| 3 | 12/21/2007 | 8374 | F | M | N | 2 | 50K-70K | Bremerton | WA | USA | Food | Snack Foods | Snack Foods | 3 | 5.52 |
| 4 | 12/21/2007 | 9619 | M | M | Y | 3 | 30K-50K | Portland | OR | USA | Food | Snacks | Candy | 4 | 4.44 |
| 5 | 12/22/2007 | 1900 | F | S | Y | 3 | 130K-150K | Beverly Hills | CA | USA | Drink | Beverages | Carbonated Bev | 4 | 14.00 |
| 6 | 12/22/2007 | 6696 | F | M | Y | 3 | 10K-30K | Beverly Hills | CA | USA | Food | Deli | Side Dishes | 3 | 4.37 |
| 7 | 12/23/2007 | 9673 | M | S | Y | 2 | 30K-50K | Salem | OR | USA | Food | Frozen Foods | Breakfast Foods | 4 | 13.78 |
| 8 | 12/25/2007 | 354 | F | M | Y | 2 | 150K+ | Yakima | WA | USA | Food | Canned Foods | Canned Soup | 6 | 7.34 |
| 9 | 12/25/2007 | 1293 | M | M | Y | 3 | 10K-30K | Bellingham | WA | USA | NonConsume | Household | Cleaning Supplie | 1 | 2.41 |
| 10 | 12/25/2007 | 7938 | M | S | N | 1 | 50K-70K | San Diego | CA | USA | NonConsume | Health and Hyg | Pain Relievers | 2 | 8.96 |

¹ Researchers have devised some very sophisticated tools for dealing with categorical variables. However, we plan to keep it simple by focusing solely on counts of categories.

Objective: To summarize categorical variables in a large data set.

Solution

Most of the variables in this data set are categorical. Only Children, Units Sold, and Revenue are numerical. Purchase Date is a date variable, and Customer ID is used only to identify customers. All of the others are categorical. This includes Annual Income, which has been binned into categories. Three of these categorical variables (Gender, Marital Status, and Homeowner) have only two categories. The others have more than two categories.

The first question is how you can discover all of the categories for a variable such as Product Department. Without good tools, this is not a trivial problem. One option is to sort on this variable and then manually go through the list, looking for the different categories. Fortunately, there are much easier ways, using Excel's built-in table and pivot table tools. We will postpone these for later and deal for now only with the "easy" categorical variables.

Figure 2.7 shows summaries of Gender, Marital Status, Homeowner, and Annual Income, along with several corresponding charts for Gender. Each of the counts in column S can be obtained with Excel's COUNTIF function. For example, the formula in cell S3 is
=COUNTIF($D$2:$D$14060,R3)

This function takes two arguments, the data range and a criterion, so it is perfect for counting observations in a category. Then, to get the percentages in column T, each count is divided by the total number of observations. (As a check, it is a good idea to sum these percentages. They should sum to 100% for each variable, as they do here.)

Figure 2.7 Summaries of Categorical Variables

| Variable | Category | Count | Percent |
|---|---|---|---|
| Gender | M | 6889 | 49.0% |
| Gender | F | 7170 | 51.0% |
| Gender | Total | | 100.0% |
| Marital Status | S | 7193 | 51.2% |
| Marital Status | M | 6866 | 48.8% |
| Marital Status | Total | | 100.0% |
| Homeowner | Y | 8444 | 60.1% |
| Homeowner | N | 5615 | 39.9% |
| Homeowner | Total | | 100.0% |
| Annual Income | 10K-30K | 3090 | 22.0% |
| Annual Income | 30K-50K | 4601 | 32.7% |
| Annual Income | 50K-70K | 2370 | 16.9% |
| Annual Income | 70K-90K | 1709 | 12.2% |
| Annual Income | 90K-110K | 613 | 4.4% |
| Annual Income | 110K-130K | 643 | 4.6% |
| Annual Income | 130K-150K | 760 | 5.4% |
| Annual Income | 150K+ | 273 | 1.9% |
| Annual Income | Total | | 100.0% |

[Charts in Figure 2.7: column charts of Gender Count and Gender Percent with default vertical scales (top), the same two column charts with vertical scales starting at 0, labeled "different scale" (middle), and pie charts of Gender Count and Gender Percent (bottom).]

When you have a choice between a simple chart and a more fancy chart, keep it simple. Simple charts tend to reveal the information in the data more clearly.

As the charts indicate, you get essentially the same chart regardless of whether you graph the counts or the percentages. However, be careful with misleading scales. If you highlight the range R2:S4 and then insert a column chart, you get the top left chart by default. Its vertical scale starts well above
6000, which makes it appear that there are many more females than males. By resetting the vertical scale to start at 0, as in the two middle charts, you see more accurately that there are almost as many males as females.

Finally, which is preferable, a column chart or a pie chart? We tend to prefer column charts, but this is entirely a matter of taste. (We also tend to prefer column charts to horizontal bar charts, but this is again a matter of taste.) Our only recommendation in general is to keep charts simple so that the information they contain emerges as clearly as possible.

Excel Tip: If you are new to Excel charts, particularly in Excel 2007, you should try creating the charts in Figure 2.7 on your own. One way is to put your cursor in a blank cell, select a desired chart type from the Insert ribbon, and then designate the data to be included in the chart. However, it is usually more efficient to select the data to be charted and then insert the chart. For example, try highlighting the range R2:S4 and then inserting a column chart. Except for a little cleanup (deleting the legend, changing the chart title, and possibly changing the vertical scale), you get almost exactly what you want with little work.

If this example of summarizing categorical variables appears to be overly tedious, be patient. As we indicated earlier, Excel has some powerful tools, especially pivot tables, that make this summarization much easier. We will discuss pivot tables in depth in the next chapter. For now, just remember that the only meaningful way to summarize a categorical variable is to count observations in each of its categories.

Before leaving this section, we mention one other efficient way to find the counts and percentages for a categorical variable. This method uses dummy (0-1) variables. To see how it works, focus on any category of some categorical variable, such as M for Gender. Recode the variable so that each M is replaced by a 1 and all other values
are replaced by 0. (This can be done in Excel in a new column, using a simple IF formula. See column E of Figure 2.8.) Now you can find the count of males by summing the 0s and 1s, and you can find the percentage of males by averaging the 0s and 1s. That is, the formulas in cells E14061 and E14062 use the SUM and AVERAGE functions on the data in column E. You should convince yourself why this works (for example, what arithmetic are you really doing when you average 0s and 1s?), and you should remember this method. It is one reason why dummy variables are used so frequently in spreadsheet data analysis.

Figure 2.8 Summarizing a Category with a Dummy Variable

[Spreadsheet excerpt: columns A-E hold Transaction, Purchase Date, Customer ID, Gender, and the Gender dummy for M, with the sum and average of the dummy column in the last two rows. The cell values are not recoverable from the scan.]

PROBLEMS

Note: Student solutions for problems whose numbers appear within a color box are available for purchase at www.cengagebrain.com.

Level A

1. The file P02_01.xlsx
indicates the gender and nationality of the MBA incoming class in two successive years at the Kelley School of Business at Indiana University.
   a. For each year, create tables of counts of gender and of nationality. Then create column charts of these counts. Do they indicate any noticeable change in the composition of the two classes?
   b. Repeat part a for nationality, but recode this variable so that all nationalities that have counts of 1 or 2 are classified as Other.

2. The file P02_02.xlsx contains information on over 200 movies that came out during 2006 and 2007.
   a. Create two column charts of counts, one of the different genres and one of the different distributors.
   b. Recode the Genre column so that all genres with count 10 or less are lumped into a category called Other. Then create a column chart of counts for this recoded variable. Repeat similarly for the Distributor variable.

3. The file P02_03.xlsx contains data from a survey of 399 people regarding a government environmental policy.
   a. Which of the variables in this data set are categorical? Which of these are nominal; which are ordinal?
   b. For each categorical variable, create a column chart of counts.
   c. Recode the data into a new data set, making four transformations: (1) change Gender to list Male or Female; (2) change Children to list No children or At least one child; (3) change Salary to be categorical with categories Less than $40K, Between $40K and $70K, Between $70K and $100K, and Greater than $100K (where you can treat the breakpoints however you like); and (4) change Opinion to be a numerical code from 1 to 5 for Strongly Disagree to Strongly Agree. Then create a column chart of counts for the new Salary variable.

4. The file P02_04.xlsx contains salary data on all Major League Baseball players for each year from 2002 to 2009. (The 2009 sheet is used for examples later in this chapter.) For each year, create a table of counts of the various positions, expressed as percentages of all players for the year. Then create a column chart of these percentages for each year. Do
they remain fairly constant from year to year?

Level B

5. The file DJIA Monthly Close.xlsx contains monthly values of the Dow Jones Industrial Average from 1950 through 2009. It also contains the percentage changes from month to month. (This file will be used for an example later in this chapter.) Create a new column for recoding the percentage changes into six categories: Large negative (< -3%), Medium negative (< -1%, ≥ -3%), Small negative (< 0%, ≥ -1%), Small positive (< 1%, ≥ 0%), Medium positive (< 3%, ≥ 1%), and Large positive (≥ 3%). Then create a column chart of the counts of this categorical variable. Comment on its shape.

2.4 DESCRIPTIVE MEASURES FOR NUMERICAL VARIABLES

There are many ways to summarize numerical variables, both with numerical summary measures and with charts, and we will discuss the most common ways in this section. But before we get into details, it is important to understand the basic goal of this section. We begin with a numerical variable such as Salary, where there is one observation for each person. Our basic goal is to learn how these salaries are distributed across the different people. To do this, we can ask a number of questions, including the following:

(1) What are the most typical salaries?
(2) How spread out are the salaries?
(3) What are the extreme salaries on either end?
(4) Is a chart of the salaries symmetric about some middle value, or is it skewed in some direction?
(5) Does the chart of salaries have any other peculiar features besides possible skewness?

In the next chapter we will explore methods for
checking whether a variable such as Salary is related to other variables, but for now we simply want to explore the distribution of values in the Salary column.

As always in this book, our main tool is Excel. Excel has a number of built-in tools for summarizing numerical variables, and we will discuss these. However, even better tools are available in Excel add-ins, and in this section we will introduce a very powerful add-in from Palisade Corporation called StatTools. There are two important advantages of StatTools over other statistical software. First, it works inside Excel, which is an obvious advantage for the many users who prefer to work in Excel. Second, it is extremely easy to learn, with virtually no learning curve. However, keep in mind that StatTools is not part of Microsoft Office. You get the academic version of StatTools free with this book, but if you eventually want to use StatTools in your job, you will have to persuade your company to purchase it. (Many of our graduates have done exactly that.)

2.4.1 Numerical Summary Measures

Throughout this section we will focus on a Salary variable. Specifically, we examine the 2009 salaries for Major League Baseball players, as described in the following example.

CHANGES IN EXCEL 2010

Microsoft modified many of the statistical functions and added a few new ones in Excel 2010. Although Microsoft advertises the superiority of the new functions, all of the old functions can still be used. When a modified or new function is relevant, we will indicate this in the text.

EXAMPLE 2.3 BASEBALL SALARIES

The file Baseball Salaries 2009.xlsx contains data on 818 Major League Baseball (MLB) players as of May 2009. There are four variables, as shown in Figure 2.9: the player's name, team, position, and salary. How can these 818 salaries be summarized?

Figure 2.9 Baseball Salaries

| Player | Team | Position | Salary |
|---|---|---|---|
| Aardsma, Dave | Seattle Mariners | Pitcher | $419,000 |
| Abreu, Bobby | Los Angeles Angels | Outfielder | $5,000,000 |
| Adams, Mike | San Diego Padres | Pitcher | $414,800 |
| Adenhart, Nick | Los Angeles Angels | Pitcher | $400,000 |
| Affeldt, Jeremy | San Francisco Giants | Pitcher | $3,500,000 |
| Albaladejo, Jon | New York Yankees | Pitcher | $403,075 |
| Albers, Matt | Baltimore Orioles | Pitcher | $410,000 |
| Amezaga, Alfredo | Florida Marlins | Shortstop | $1,300,000 |
| Anderson, Brett | Oakland Athletics | Pitcher | $400,000 |
| Anderson, Brian Nikoli | Chicago White Sox | Outfielder | $440,000 |
| Anderson, Garret | Atlanta Braves | Outfielder | $2,500,000 |
| Anderson, Josh | Detroit Tigers | Outfielder | $400,000 |
| Anderson, Marlon | New York Mets | Second Baseman | $1,150,000 |

Objective: To learn how salaries are distributed across all 2009 MLB players.

Solution

The various numerical summary measures can be categorized into several groups: measures of central tendency; minimum, maximum, percentiles, and quartiles; measures of variability; and measures of shape. We will explain each of these in this extended example.

Measures of Central Tendency

There are three common measures of central tendency, all of which try to answer the basic question of which value is most "typical." These are the mean, the median, and the mode.

The mean is the average of all values of a variable. If the data set represents a sample from some larger population, we call this measure the sample mean and denote it by $\overline{X}$ (pronounced "X-bar"). If the data set represents the entire population, we call it the population mean and denote it by $\mu$ (the Greek letter mu). This distinction is not important in this chapter, but it will become relevant in later chapters when we discuss statistical inference. In either
case, the formula for the mean is given by Equation (2.1).

Formula for the Mean

$$\text{Mean} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{2.1}$$

Here, $n$ is the number of observations and $X_i$ is the value of observation $i$. Equation (2.1) says to add all the observations and divide by $n$, the number of observations. The $\Sigma$ (Greek capital sigma) symbol means to sum from $i = 1$ to $i = n$, that is, to sum over all observations.

For Excel data sets, you can calculate the mean with the AVERAGE function. This is shown for the baseball data (along with a lot of other summary measures we will discuss shortly) in Figure 2.10. Specifically, the average salary for all players is a whopping $3,260,059. Is this a "typical" salary? Keep reading.

Figure 2.10 Summary Measures of Baseball Salaries Using Excel Functions

Measures of central tendency

| Measure | Value |
|---|---|
| Mean | $3,260,059 |
| Median | $1,151,000 |
| Mode | $400,000 (occurs 70 times) |

Min, max, percentiles, quartiles

| Measure | Value | Second argument (column C) |
|---|---|---|
| Min | $400,000 | |
| Max | $33,000,000 | |
| P01 | $400,000 | 0.01 |
| P05 | $400,000 | 0.05 |
| P10 | $401,000 | 0.10 |
| P20 | $411,200 | 0.20 |
| P50 | $1,151,000 | 0.50 |
| P80 | $5,500,000 | 0.80 |
| P90 | $10,000,000 | 0.90 |
| P95 | $13,000,000 | 0.95 |
| P99 | $18,707,500 | 0.99 |
| Q1 | $419,550 | 1 |
| Q2 | $1,151,000 | 2 |
| Q3 | $4,237,500 | 3 |

Measures of variability

| Measure | Value |
|---|---|
| Range | $32,600,000 |
| Interquartile range | $3,817,950 |
| Variance | 19,045,050,733,784 |
| Standard deviation | $4,364,064 |
| Mean absolute deviation | $3,205,753 |

Measures of shape

| Measure | Value |
|---|---|
| Skewness | 2.0996 |
| Kurtosis | 5.1266 |

Percentages of values less than given values

| Value | Percentage less than |
|---|---|
| $1,000,000 | 46.70% |
| $1,500,000 | 53.67% |
| $2,000,000 | 58.56% |
| $2,500,000 | 63.45% |
| $3,000,000 | 67.85% |

The median is the middle observation when the data are arranged from smallest to
largest. If the number of observations is odd, the median is literally the middle observation. For example, if there are nine observations, the median is the fifth smallest (or fifth largest). If the number of observations is even, the median is usually defined as the average of the two middle observations (although there are some slight variations of this definition). For example, if there are 10 observations, the median is usually defined to be the average of the fifth and sixth smallest values.

For highly skewed data, the median is typically a better measure of central tendency. The median is unaffected by the extreme values, whereas the mean is very sensitive to extreme values.

In any case, the median can be calculated in Excel with the MEDIAN function. Figure 2.10 shows that the median salary is $1,151,000. In words, half of the players make less than this, and half make more. Why is the median in this example so much smaller than the mean, and which is more appropriate? These are important questions, and they are relevant for many real-world data sets. As you might expect, the vast majority of baseball players have relatively modest salaries that are dwarfed by the astronomical salaries of a few stars. Because it is an average, the mean is strongly influenced by these really large values, so it is quite high. In contrast, the median is completely unaffected by the magnitude of the really large salaries, so it is much smaller. For example, the median would not change by a single cent if Alex Rodriguez made $33 trillion instead of his "measly" $33 million, but the mean would increase to more than $40 billion (an extra $33 trillion spread over 818 players).

In many situations like this, where the data are skewed to the right (a few extremely large salaries not balanced by any extremely small salaries), most people would argue that the median is a more representative measure of central tendency than the mean. However, both are often quoted. And for variables that are not skewed in one direction or the other, the mean and median are often quite close to one another.

The mode is the value
that appears most often and it can be calculated in Excel with the MODE function In most cases where a variable is essentially continuous the mode is not very interesting because it is often the result of a few lucky ties However the mode for the salary data in Figure 210 is not a result of luck Its value 400000 is evidently the minimum possible salary set by the league As shown in cell C4 with a COUNTIF formula this value occurred 70 times In other words close to 10 of the players earn the minimum possible salary This is a good example of learning something you probably didn t know simply by exploring the data CHANGES IN EXCEL 20l0 There are two new versions of the MODE function in Excel 20O MODEMULT and MODESNGL The latter is the same as the current MODE functionThe MULT version returns multiple modes if there are multiple modes Minimum Maximum Percentiles and Quartiles As you look at the values of some variable it is natural to ask how many values are lower than a particular value For example you might ask how many salaries are less than 1 million In this subsection you will come back to this question but we rst answer a slightly different question Given a certain percentage such as 25 what is the salary value such that this percentage of salaries is below it This leads to percentiles and quartiles Speci cally for any percentage 19 the pth percentile is the value such that a percentage 19 of all values are less than it Similarly the rst second and third quartiles are the percentiles corresponding top 25 p 50 and p 75 These three values divide the data into four groups each with approximately a quarter of all observations 36 Chapter 2 Describing the Distribution of a SingeVariabe Copyright 2010 Cengage Leaming All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially 
affect the overall leaming experience Cengage Leaming reserves the right to remove additional content at any time if subsequent rights restrictions require it Note that the second quartile is equal to the median by de nition To complete this group of descriptive measures we add the minimum and maximum values with the obvious meanings You are probably aware of percentiles from standardized tests For example if you learn that your score in the verbal SAT test was at the 93rd percentile this means that you scored better than 93 of those taking the test The minimum and maximum can be calculated with Excel s MIN and MAX func tions For the percentiles and quartiles you can use Excel s PERCENTILE and QUAR TILE functions The PERCENTILE function takes two arguments the data range and a value of 19 between 0 and 1 It has to be between 0 and 1 For example if you want the 95th percentile you must enter the second argument as 095 not as 95 The QUARTILE function also takes two arguments the data range and 1 2 or 3 depending on which quar tile you want Figure 210 shows the minimum maximum the three quartiles and several commonly requested percentiles for the baseball data Note that at least 25 of the players make within 20000 of the league minimum and more than a quarter of all players make more than 4 million In fact more than 1 of the players make well over 18 million with Alex Rodriguez topping the list at 33 million And they say it s just a game Excel Tip Note the values in column C ofFigure 2I0for percentiles and quartiles These allow you to enter one formula for the percentiles and one for quartiles that can then be copied down Specifically the formulas in cells B9 and B18 are PERCENTILEData D2D819 C9 and QUARTILEData D2D819 C18 Here Data is a reference to the worksheet that contains the data Always look for ways to make your Excelformulas copyable It saves time and it limits errors And ifyou don t want the values in column C to be visible just color them white CHANGES IN 
EXCEL 2010: Excel's PERCENTILE and QUARTILE functions can give strange results when there are only a few observations. For this reason, Microsoft added new functions in Excel 2010: PERCENTILE.EXC, PERCENTILE.INC, QUARTILE.EXC, and QUARTILE.INC, where EXC and INC stand for exclusive and inclusive. The INC functions work just like the old PERCENTILE and QUARTILE functions. The EXC versions are recommended, especially for a small number of observations.

Before continuing, let's revisit the first question asked in this subsection. If you are given a certain salary figure such as $1 million, how can you find the percentage of all salaries less than this? This is essentially the opposite of a percentile question. In a percentile question, you are given a percentage and you want to find a value. Now you are given a value and you want to find a percentage. You can find this percentage in Excel by dividing a COUNTIF by the total number of observations. A few such values are shown in the bottom right of Figure 2.10. The typical formula in cell F14, which is then copied down, is

=COUNTIF(Data!$D$2:$D$819,"<"&E14)/COUNT(Data!$D$2:$D$819)

The following Excel tip explains this formula in more detail.

Excel Tip: The condition in this COUNTIF formula is a bit tricky. You literally want it to be "<1000000", but you want the formula to refer to the values in column E to enable copying. Therefore, you can concatenate (or string together) the literal part, "<", and the variable part, the reference to cell E14. The ampersand symbol (&) in the middle is the symbol used to
concatenate in Excel. This use of concatenation to join literal and variable parts is especially useful in functions like COUNTIF that require a condition, so don't be afraid to use it!

#### Measures of Variability

If you learn that the mean (or median) salary in some company is $100,000, this tells you something about the typical salary, but it tells you nothing about how spread out the salaries are, that is, their variability. The percentiles and quartiles discussed in the previous section certainly tell you something about variability. In fact, by knowing a lot of percentiles, you know almost exactly how the data are spread out. (Just look at the list of percentiles in Figure 2.10 and add a few more if you want to fill in the gaps.) In this subsection we list a few measures that summarize variability even more. These include the range, the interquartile range, the variance and standard deviation, and the mean absolute deviation. None of these says as much about variability as a complete list of percentiles, but they are very useful.

The range is a fairly crude measure of variability. It is defined as the maximum value minus the minimum value. For the baseball salaries, this range is $32.6 million. It certainly tells us how spread out the salaries are, but it is too sensitive to the extremes. For example, if Alex Rodriguez's salary increased to $43 million, the range would increase by $10 million just because of one player.

A less sensitive measure is the interquartile range (abbreviated IQR). It is defined as the third quartile minus the first quartile, so it is really the range of the middle 50% of the data. For the baseball data, the IQR is $3,817,950. If you excluded the 25% of players with the lowest salaries and the 25% with the highest salaries, this IQR would be the range of the remaining salaries.

The range or a modified range such as the IQR probably seems like a natural measure of variability, but there is another measure that is quoted much more frequently: the standard deviation. Actually, there are two totally
related measures, variance and standard deviation, and we will begin with a definition of variance. The variance is essentially the average of the squared deviations from the mean, where if $X_i$ is a typical observation and $\overline{X}$ is the mean, its squared deviation from the mean is $(X_i - \overline{X})^2$. As in our discussion of the mean, there is a sample variance, denoted by $s^2$, and a population variance, denoted by $\sigma^2$ (where $\sigma$ is the Greek letter sigma). They are defined by the following formulas.

Formula for Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{n - 1} \tag{2.2}$$

Formula for Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{n} \tag{2.3}$$

Technical note: It is traditional to use the capital letter N for the population size and n for the sample size, but we won't worry about this distinction in this chapter. Furthermore, there is a technical reason why the sample variance uses n − 1 in the denominator, not n, and this will be explained in a later chapter. However, the difference is negligible when n is large. Excel implements both of these formulas. You can use the VAR function to obtain the sample variance (denominator n − 1), and you can use the VARP function to obtain the population variance (denominator n).

To understand why the variance is indeed a measure of variability, look at either formula. If all of the observations are close to the mean, then their squared deviations from the mean will be relatively small, and the variance will be relatively small. On the other hand, if at least a few of the observations are far from the mean, then their squared deviations from the mean
will be large, and this will cause the variance to be large. Note that because deviations from the mean are squared, an observation a certain amount below the mean contributes the same amount to variance as an observation that same amount above the mean.

There is a fundamental problem with variance as a measure of variability: It is hard to interpret the variance numerically because it is in squared units. For example, if the observations are measured in dollars, then variance is in squared dollars. To obtain a more natural measure, we take the square root of variance. The result is called standard deviation. Again, there are two versions of standard deviation. The sample standard deviation, denoted by s, is the square root of the quantity in Equation (2.2). The population standard deviation, denoted by σ, is the square root of the quantity in Equation (2.3). To calculate either standard deviation in Excel, you can first find the variance with the VAR or VARP function and then take its square root, or you can find it directly with the STDEV (sample) or STDEVP (population) function.

CHANGES IN EXCEL 2010: The functions for variance and standard deviation have been renamed in Excel 2010 to VAR.S, VAR.P, STDEV.S, and STDEV.P. However, they work exactly like the old versions.

The data in Figure 2.11 should help clarify these concepts. It is in the file Variability.xlsx. (It will help if you open this file and look at its formulas as you read this.) The variable Diameter1 on the left has relatively low variability; its 10 values hover closely around its mean of approximately 100 (found in cell A16 with the AVERAGE function). To show how variance is calculated, we explicitly calculated the 10 squared deviations from the mean in column B. Then either variance, sample or population, can be calculated (in cells A19 and A22) as the sum of squared deviations divided by 9 or 10. Alternatively, they can be calculated more directly (in cells B19 and B22) with Excel's VAR and VARP functions. Then either standard deviation, sample or population,
can be calculated as the square root of the corresponding variance or with Excel's STDEV or STDEVP functions.

The calculations are exactly the same for Diameter2 on the right. It also has mean approximately equal to 100, but its observations vary much more around 100 than the observations for Diameter1. As expected, this increased variability is obvious in a comparison of the variances and standard deviations.

Figure 2.11 Calculating Variance and Standard Deviation

| Diameter1 (low variability supplier) | Sq dev from mean | Diameter2 (high variability supplier) | Sq dev from mean |
|---|---|---|---|
| 102.61 | 6.610041 | 103.21 | 9.834496 |
| 103.25 | 10.310521 | 93.66 | 41.139396 |
| 96.34 | 13.682601 | 120.87 | 432.473616 |
| 96.27 | 14.205361 | 110.26 | 103.754596 |
| 103.77 | 13.920361 | 117.31 | 297.079696 |
| 97.45 | 6.702921 | 110.23 | 103.144336 |
| 98.22 | 3.308761 | 70.54 | 872.257156 |
| 102.76 | 7.403841 | 39.53 | 3665.575936 |
| 101.56 | 2.313441 | 133.22 | 1098.657316 |
| 98.16 | 3.530641 | 101.91 | 3.370896 |

| Measure | Diameter1 | Diameter2 |
|---|---|---|
| Mean | 100.039 | 100.074 |
| Sample variance | 9.1098 | 736.3653 |
| Population variance | 8.1988 | 662.7287 |
| Sample standard deviation | 3.0182 | 27.1361 |
| Population standard deviation | 2.8634 | 25.7435 |

(In the worksheet, each variance and standard deviation appears twice, once from the step-by-step calculation and once from the corresponding Excel function. The two values agree.)

This example also indicates why variability, along with measures of it, is important. Imagine that you are about to buy 10 parts from one of two suppliers, and you want each part's diameter to be close to 100 centimeters. Furthermore, suppose that Diameter1 in the example represents 10 randomly selected parts from Supplier 1, whereas Diameter2 represents 10 randomly selected parts from Supplier 2. You can see that both suppliers are very close to the target of 100 on average, but the increased variability for Supplier 2 makes this supplier much less attractive. There is a famous saying in operations management: Variability is the enemy. This example illustrates exactly what this saying means.

Margin note: Variability is usually the enemy. Being close to a target value on average is not good enough if there is a lot of variability around this target.

#### Empirical Rules for Interpreting Standard Deviation

Now you know how to calculate the standard deviation, but there is a more important question: How do you interpret its value? Fortunately, the standard deviation often has a very natural interpretation, which is why it is quoted so frequently. This interpretation can be stated as three empirical rules. Namely, if the values of this variable are approximately normally distributed (symmetric and bell-shaped), then the following rules hold:

1. Approximately 68% of the observations are within one standard deviation of the mean, that is, within the interval $\overline{X} \pm s$.
2. Approximately 95% of the observations are within two standard deviations of the mean, that is, within the interval $\overline{X} \pm 2s$.
3. Approximately 99.7% of the observations (almost all of them) are within three standard deviations of the mean, that is, within the interval $\overline{X} \pm 3s$.

Margin note: These empirical rules give a concrete meaning to standard deviation for symmetric, bell-shaped distributions. However, they tend to be much less accurate for skewed distributions.

Fortunately, many variables in real-world data are indeed approximately normally distributed, so these empirical rules correctly apply. (We will study the normal distribution in much more depth in Chapter 5.)

FUNDAMENTAL INSIGHT: Usefulness of Standard Deviation

Variability is clearly an important property of any numerical variable, and there are several measures for quantifying the amount of variability. However, standard deviation is by far the most popular such measure. It is measured in the same units as the variable, it has a long tradition, and, at least for many data sets, it obeys the empirical rules discussed here. These empirical rules give a very concrete meaning to a standard deviation.

As an example, if the parts supplied by the suppliers in Figure 2.11 have diameters that are approximately normally distributed, then the intervals in the empirical rules for Supplier 1 are about 100 ± 3, 100 ± 6, and 100 ± 9. Therefore, about 68% of this supplier's parts should have diameters from 97 to 103, 95% should have diameters from 94 to 106, and almost none should have diameters below 91 or above 109. Obviously, the situation for Supplier 2 is much worse. With a standard deviation slightly larger than 25, the second empirical rule implies that about 1 out of every 20 of this supplier's parts will be below 50 or above 150. It is clear that Supplier 2 has to do something to reduce its variability. In fact, this is exactly what almost all suppliers are continuously trying to do: reduce variability.

Returning to the baseball data, Figure 2.10 indicates that the standard deviation of salaries is slightly above $4.36 million. (The variance is shown, but because it is in squared dollars, it is a huge value without a meaningful interpretation.) Can the empirical rules be applied to these baseball salaries? The answer is that you can always try, but if the salaries are not at
least approximately normally distributed, the rules won't be very accurate. And because of obvious skewness in the salary data (due to the stars with astronomical salaries), the assumption of a normal distribution is not a good one.

Nevertheless, the rules are checked in Figure 2.12. For each of the three rules, the lower and upper endpoints of the corresponding interval are found in columns I and J. Right away there are problems. Because the standard deviation is larger than the mean, all three lower endpoints are negative, which automatically means that there can be no salaries below them. But continuing, the COUNTIF function was used (again with concatenation) to find the number of salaries above the upper endpoints in column L, and the corresponding percentages appear in column N. Finally, subtracting columns M and N from 100% gives the percentages between the endpoints in column O. These three percentages, according to the empirical rules, should be about 68%, 95%, and 99.7%. Rules 2 and 3 are not way off, but rule 1 isn't even close.

Figure 2.12 Empirical Rules for Baseball Salaries (Do empirical rules apply?)

| | Lower endpoint (I) | Upper endpoint (J) | # below lower (K) | # above upper (L) | % below lower (M) | % above upper (N) | % between (O) |
|---|---|---|---|---|---|---|---|
| Rule 1 | -$1,104,004 | $7,624,123 | 0 | 120 | 0% | 14.67% | 85.33% |
| Rule 2 | -$5,468,068 | $11,988,186 | 0 | 61 | 0% | 7.46% | 92.54% |
| Rule 3 | -$9,832,131 | $16,352,249.96 | 0 | 16 | 0% | 1.96% | 98.04% |

The point of these calculations is that even though the empirical rules give substantive meaning to the standard deviation for many variables, they should be applied with caution, especially
when the data are clearly skewed.

Before leaving variance and standard deviation, you might ask why the deviations from the mean are squared in the definition of variance. Why not simply take the absolute deviation from the mean? For example, if the mean is 100 and two observations have values 95 and 105, then each has a squared deviation of 25, but each has an absolute deviation of only 5. Wouldn't this latter value be a more natural measure of variability? Intuitively, it would, but there is a long history in the field of statistics of using squared deviations. They have many nice theoretical properties that are not shared by absolute deviations. Still, some analysts quote the mean absolute deviation (abbreviated as MAD) as another measure of variability, particularly in time series analysis. It is defined as the average of the absolute deviations.

Formula for Mean Absolute Deviation

$$\text{MAD} = \frac{\sum_{i=1}^{n} |X_i - \overline{X}|}{n} \tag{2.4}$$

Here $X_i$ is a typical observation and $\overline{X}$ is the mean. There is another empirical rule for MAD: For many (but not all) variables, the standard deviation is approximately 25% larger than MAD, that is, s ≈ 1.25 MAD. Fortunately, Excel has a little-known function, AVEDEV, that performs the calculation in Equation (2.4). Using it for the baseball salaries in Figure 2.10, you can see that MAD is slightly above $3.2 million. If this is multiplied by 1.25, the result is slightly over $4 million, which is indeed fairly close to the standard deviation.

#### Measures of Shape

There are two final measures of a distribution you will hear occasionally: skewness and kurtosis. Each of these has not only an intuitive meaning but also a specific numeric measure. We have already mentioned skewness in terms of the baseball salaries. It occurs because of a lack of symmetry. A few stars have really large salaries, and no players have really small salaries. Alternatively, the largest salaries are much farther to the right of the mean than the smallest salaries are to the left of the mean. This lack of symmetry will be apparent from a histogram of the salaries in the next section. We say that these
salaries are skewed to the right (or positively skewed) because the skewness is due to the really large salaries. If the skewness were due to really small values (as might occur if we were examining temperature lows in Antarctica), then we would call it skewness to the left (or negatively skewed). In either case, there is a measure of skewness that can be calculated with Excel's SKEW function. For the baseball data, it is approximately 2.1, as shown in Figure 2.10. You don't need to know exactly what this value means. Simply remember that (1) it is positive when there is skewness to the right, (2) it is negative when there is skewness to the left, (3) it is approximately zero when there is no skewness (the symmetric case), and (4) its magnitude increases as the degree of skewness increases.

Margin note: Kurtosis is all about extreme events, the kind that occurred in late 2008 and sent Wall Street into a panic.

The other measure, kurtosis, has to do with the fatness of the tails of the distribution, relative to the tails of a normal distribution. Remember from the third empirical rule that a normal distribution has almost all of its observations within three standard deviations of the mean. In contrast, a distribution with high kurtosis has many more extreme observations.

Is this important in reality? It certainly is. For example, many researchers believe the Wall Street meltdown in late 2008 was at least partly due to financial analysts relying on the normal distribution, whereas in reality the actual distribution had much
fatter tails. More specifically, financial analysts followed complex mathematical models that indicated really extreme events would virtually never occur. Unfortunately, a number of extreme events did occur, and they sent the economy into a deep recession.²

Although kurtosis can be calculated in Excel with the KURT function (it is about 5.1 for the baseball salaries), we won't have any use for this measure in the book. Nevertheless, when you hear the word kurtosis, think fat tails and extreme events. And if you plan to work on Wall Street, you should definitely learn more about kurtosis.

#### Numerical Summary Measures in the Status Bar

You might have noticed that summary measures sometimes appear automatically in the status bar at the bottom of your Excel window. The rule is that if you select multiple cells (in a single column or even in multiple columns), selected summary measures appear for the selected cells. (Nothing appears if only a single cell is selected.) These can be very handy for quick lookups. Also, you can control the summary measures that appear by right-clicking on the status bar and selecting your favorites.

### 2.4.2 Numerical Summary Measures with StatTools

In the previous subsection, we used Excel's built-in functions (AVERAGE, STDEV, and others) to calculate a number of summary measures. A much quicker way is to use Palisade's StatTools add-in. As we promised earlier, StatTools requires almost no learning curve. After you go through this section, you will know everything you need to know to continue using StatTools like a professional.

EXAMPLE 2.3 BASEBALL SALARIES (CONTINUED)

Use the StatTools add-in to generate the same summary measures that were calculated in the previous subsection.

Objective: To learn the fundamentals of StatTools and use this add-in to generate summary measures of baseball salaries.

Solution: Because this is your first exposure to StatTools, we must first explain how to get started. StatTools is part of the Palisade DecisionTools Suite, and you have the free academic version of this
suite as a result of purchasing the book. The explanations and screenshots in the book are based on version 5.5 of the suite. (It is possible that by the time you are reading this, you might have a later version.) In any case, you must install the suite before you can use StatTools.

Once the suite is installed, you can load StatTools by double-clicking on the StatTools item in the list of programs on the Windows Start menu. (It is in the Palisade group.) If Excel is already running, this will load StatTools on top of Excel. If Excel isn't running, this will launch Excel and load StatTools as well. You will know that StatTools is loaded when you see the StatTools tab and ribbon, as shown in Figure 2.13.

Figure 2.13 StatTools Ribbon

The buttons in the Analyses group on this ribbon are for performing the various statistical analyses, many of which will be explained in the book. But before you can use these, you need to know a

² The popular book The Black Swan, by Nassim Nicholas Taleb (Random House, 2007), is all about extreme events and the trouble they can cause.
few basic features of StatTools.

#### Basic StatTools Features

1. There is an Application Settings item on the Utilities dropdown list. When you click it, you get the dialog box in Figure 2.14. (All of the other add-ins in the Palisade suite have a similar Application Settings item.) This is where you can change overall settings of StatTools. You can experiment with these settings, but the only one you will probably ever need to change is the Reports Placement setting, where your results are placed. The dropdown list in Figure 2.14 shows the four possibilities. We tend to prefer the Active Workbook option (this places the results on a new worksheet) or the Query for Starting Cell option (this lets you choose the cell where your results will start).
2. If you want to unload StatTools without closing Excel, you can choose the Unload StatTools Add-In item from the Utilities dropdown list.
3. Although you probably won't need it, there is plenty of online help, including example spreadsheets, on the Help dropdown list.
4. This is the important one. Before you can perform any statistical analysis, you must define a StatTools data set. You do this using the Data Set Manager button. Try it now. With the Baseball Salaries 2009.xlsx file open, make sure your cursor is anywhere within the data set, and click on the Data Set Manager button. You will first be asked whether you want to add the range $A$1:$D$819 as a new StatTools data set. Click on Yes. Then you will see the dialog box in Figure 2.15.

Figure 2.14 Application Settings Dialog Box

StatTools makes several guesses about your data set. They are generally correct, but you can always override them. First, it gives your data set a generic name such as Data Set #1. You can accept this or supply a more meaningful name. (The latter is especially useful if your file contains more than one data set.) Second, you can override the data range. (Note that this range should include the variable names in row 1.) Third, the default layout is that variables are in columns, with variable names in the top row. You should override these settings only in rare cases where your data set has the roles of rows and columns reversed. (The Multiple button is for very unusual cases. We will not discuss it here.) Finally, if you want to apply some color to your data set, you can check the Apply Cell Formatting option. (We generally don't.)

Figure 2.15 Data Set Manager Dialog Box
Margin note: You must create a StatTools data set with the Data Set Manager before you can perform any analysis with StatTools. But this generally takes only a few mouse clicks.

For now, simply click on OK. You now have a StatTools data set, and you can begin the analysis. Fortunately, this step has to be done only once. If you save the file and reopen it at a later date, StatTools still remembers this data set. So when we said that StatTools has a short learning curve, this is it: simply remember to designate a StatTools data set before you begin any analysis.

Now that the preliminaries are over, you can quickly get the summary measures for the Salary variable. To do so, select the One-Variable Summary item from the Summary Statistics dropdown list. You will see the dialog box in Figure 2.16. (If you see two columns of variables in the top pane, click on the Format button and select Stacked.) This is a typical StatTools dialog box. In the top section, you can select a StatTools data set and one or more variables. In the bottom section, you can select the measures you want. For this example, we have chosen all of the measures. (In addition, you can add other percentiles if you like.)

Figure 2.16 One-Variable Summary Dialog Box

Before you click on OK, click on the double-check button to the left of the OK button. This brings up the
Application Settings dialog box already shown in Figure 2.14. This is your last chance to designate where you want to place the results. (We chose Active Workbook, which means that the results are placed in a new worksheet automatically named One Var Summary.)

StatTools Tip: In general, you might want to choose only your favorite summary measures, such as mean, median, standard deviation, minimum, and maximum. This requires you to uncheck all of the others. To avoid all of this unchecking in future analyses, you can click on the Save button in the middle of the bottom left group. This saves your choices as the defaults from then on.

The results appear in Figure 2.17. If you compare these to the measures from Excel functions in Figure 2.10, you will see some slight discrepancies in the percentiles and quartiles. (The kurtosis is also quite different.) When Palisade developed StatTools, it did not fully trust Excel's statistical functions, so it developed its own based on best practices
from the statistical literature. In fact, if you click on any of the results, you will see functions such as StatMean, StatStdDev, StatPercentile, and so on. Don't be overly concerned that the percentiles and quartiles don't exactly match. Both sets provide the same basic picture of how the salaries are distributed.

Figure 2.17 Summary Measures for Salaries

| One Variable Summary | Salary (Data Set #1) |
|---|---|
| Mean | $3,260,059.28 |
| Variance | 19,045,050,733,784.30 |
| Std. Dev. | $4,364,063.56 |
| Skewness | 2.0996 |
| Kurtosis | 8.1266 |
| Median | $1,150,000.00 |
| Mean Abs. Dev. | $3,205,752.60 |
| Minimum | $400,000.00 |
| Maximum | $33,000,000.00 |
| Range | $32,600,000.00 |
| Count | 818 |
| Sum | $2,666,728,494.00 |
| 1st Quartile | $419,400.00 |
| 3rd Quartile | $4,250,000.00 |
| Interquartile Range | $3,830,600.00 |
| 1.00% | $400,000.00 |
| 2.50% | $400,000.00 |
| 5.00% | $400,000.00 |
| 10.00% | $401,000.00 |
| 20.00% | $411,000.00 |
| 80.00% | $5,500,000.00 |
| 90.00% | $10,000,000.00 |
| 95.00% | $13,000,000.00 |
| 97.50% | $15,000,000.00 |
| 99.00% | $18,750,000.00 |

Technical Note: Why is there a discrepancy at all in the percentiles and quartiles? Suppose, for example, that you want the 75th percentile (3rd quartile), and there are 818 observations. By definition, the 75th percentile is the value such that 75% of the values are below it and 25% are above it. Now, 75% of 818 is 613.50. This suggests that you should sort the 818 observations in increasing order and locate the 613th and 614th smallest. For the baseball data, these salaries are $4,200,000 and $4,250,000. Excel reports the 75th percentile as $4,237,500, whereas StatTools reports it as $4,250,000. In words, Excel interpolates and StatTools doesn't, but either is reasonable. As for kurtosis, Excel provides an index that is 0 for a normal distribution, whereas StatTools returns a value of 3 for a normal distribution. So the two indexes differ by 3. (For what it's worth, Wikipedia indicates that either definition of kurtosis is acceptable.)
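The percentile discrepancy in the technical note can be reproduced with a small Python sketch. This is only an illustration under stated assumptions: the two functions are hypothetical helpers, not Excel's or StatTools' actual code, and the non-interpolating rule shown (take the smallest value whose rank is at least p times n) is one common convention that happens to match the reported StatTools value. Only the count of 818 and the 613th and 614th salaries come from the text; the other values in the list are filler.

```python
import math

def percentile_interpolated(values, p):
    """Interpolate linearly between neighboring sorted values.

    Uses the zero-based position p * (n - 1), which is assumed here to
    mirror the behavior of Excel's PERCENTILE function.
    """
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])

def percentile_no_interpolation(values, p):
    """Return an actual data value: the one with one-based rank ceil(p * n)."""
    xs = sorted(values)
    rank = max(math.ceil(p * len(xs)), 1)
    return xs[rank - 1]

# 818 observations. Only the 613th and 614th smallest values are from the
# text; the 612 low values and 204 high values are hypothetical filler.
salaries = [400_000] * 612 + [4_200_000, 4_250_000] + [5_000_000] * 204

excel_style = percentile_interpolated(salaries, 0.75)
data_value_style = percentile_no_interpolation(salaries, 0.75)

print(len(salaries), excel_style, data_value_style)
```

The interpolated version lands three quarters of the way from the 613th to the 614th value, while the other version simply returns the 614th value. With hundreds of observations the two conventions rarely differ by much, which is why the note says that either is reasonable.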
There are three other things to note about the StatTools output. First, it formats the results according to its own rules. If you would like fewer or more decimals, or any other formatting changes, you can certainly reformat in the usual way. Second, the fact that there are formulas in these result cells indicates that they are live. If you go back to the data and change any of the salaries, the summary measures will update automatically. This is true for most, but not quite all, StatTools outputs. (Regression analysis, discussed in Chapters 10 and 11, is the most important situation where the StatTools results are not live.) Finally, if you open a file with StatTools outputs but StatTools is not loaded, you might see #VALUE! errors in the cells. These can be fixed by closing the file, loading StatTools, and opening the file again.

(Margin note: If you open a file with errors in StatTools outputs, close the file, load StatTools, and reopen the file.)

2.4.3 Charts for Numerical Variables

There are many graphical ways to indicate the distribution of a numerical variable, but the two we prefer, and will discuss in this subsection, are histograms and box plots (also called box-whisker plots). Each of these is useful primarily for cross-sectional variables. If they are used for time series variables, the time dimension gets buried. Therefore, we will discuss time series graphs for time series variables separately in the next section.

Histograms

A histogram is the most common type of chart for showing the distribution of a numerical variable. It is based on binning the variable, that is, dividing it up into discrete categories. The histogram is then a column chart of the counts in the various categories, with no gaps between the bars. In general, a histogram is great for showing the shape of a distribution. We are particularly interested in whether the distribution is symmetric or is skewed in one direction. The concept is a simple one, as illustrated in the following example with the baseball salary data.

(Margin note: The term distribution refers to the way the data are distributed in the various categories. It is common to refer to a skewed distribution, say, rather than a skewed histogram. However, either term can be used.)

FUNDAMENTAL INSIGHT: Histograms Versus Summary Measures
It is important to remember that each of the summary measures we have discussed for a numerical variable (the mean, the median, the standard deviation, and others) describes only one aspect of a numerical variable. In contrast, a histogram provides the complete picture. It indicates the center of the distribution, the variability, the skewness, and other aspects, all in one convenient chart.

EXAMPLE 2.3 BASEBALL SALARIES (CONTINUED)

(Margin note: A histogram can be created with Excel tools only, but the process is quite tedious. It is much easier to use StatTools.)

We have already mentioned that the baseball salaries are skewed to the right. How does this show up in a histogram of salaries?

Objective: To see the shape of the salary distribution through a histogram.

Solution: It is possible to create a histogram with Excel tools only (no add-ins), but it is a tedious process. First, the bins must be defined. If you do it yourself, you will probably choose nice bins such as $400,000 to $800,000, $800,000 to $1,200,000, and so on. But there is also the question of how many bins there should be and what their endpoints should be, and these are not always easy choices. In any case, once the bins have been selected, the number of observations in each bin must be counted. This can be done in
Excel with the COUNTIF function. (You can also use the COUNTIFS and FREQUENCY functions, but we won't discuss them here.) The resulting table of counts is usually called a frequency table. Finally, a column chart of the counts must be created. If you are interested, we have indicated the steps in the Histogram sheet of the finished version of the baseball file.

It is much easier to create a histogram with StatTools, as we now illustrate. As with all StatTools analyses, the first step is to designate a StatTools data set, which has already been done for the salary data. To create a histogram, select the Histogram item from the Summary Graphs dropdown list to obtain the dialog box in Figure 2.18. At this point, all you really need to do is select the Salary variable and click on OK. This gives you the default bins, indicated by "auto" values. Essentially, StatTools checks your data and chooses good settings for the bins. The resulting histogram, along with the bin data it is based on, appears in Figure 2.19. StatTools has used 11 bins, with the endpoints indicated in columns B and C. The histogram is then a column chart (with no gaps between the bars) of the counts in column E. These counts are also called frequencies.

(Margin note: In many situations, you can accept the StatTools defaults for histogram bins. They generally show the big picture quite well, which is the main goal.)

Figure 2.18 StatTools Histogram Dialog Box
[Dialog box: the Salary variable of Data Set #1 is selected (the other variables are Player, Team, and Position). The options are Number of Bins, Histogram Minimum, and Histogram Maximum.]
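The bin endpoints that StatTools reports are consistent with simply cutting the range from the minimum salary to the maximum salary into 11 equal pieces. The sketch below checks that arithmetic. It is not StatTools code: the function name is our own, the minimum and maximum come from Figure 2.17, the count of 574 for the first bin is read from Figure 2.19, and treating probability density as relative frequency divided by bin width is our assumption about how that column is computed. How StatTools decides on 11 bins is not covered here.

```python
def equal_width_bins(minimum, maximum, num_bins):
    """Return the common bin width and the list of num_bins + 1 bin endpoints."""
    width = (maximum - minimum) / num_bins
    edges = [minimum + i * width for i in range(num_bins + 1)]
    edges[-1] = maximum             # avoid rounding drift in the last endpoint
    return width, edges


# Minimum and maximum salary from Figure 2.17, 11 bins as in Figure 2.19.
width, edges = equal_width_bins(400_000, 33_000_000, 11)

first_bin_count = 574               # frequency of Bin 1, read from Figure 2.19
total_count = 818                   # number of players
rel_freq = first_bin_count / total_count
density = rel_freq / width          # assumed definition of "Prb. Density"

print(round(edges[1], 2))           # 3363636.36, upper limit of the first bin
print(round(rel_freq, 4))           # 0.7017
print(f"{density:.9f}")             # 0.000000237
```

The first endpoint after the minimum, $3,363,636.36, is awkward only because 32,600,000 is not evenly divisible by 11.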
You could argue that the bins chosen by StatTools aren't very nice. For example, the upper limit of the first bin is $3,363,636.36. If you want to fine-tune these, you can enter your own bins instead of the auto values in Figure 2.18. We will illustrate this in the next example, but it is largely beside the point for the main question about baseball salaries. The StatTools default histogram shows very clearly that the salaries are skewed to the right, and fine-tuning bins won't change this primary finding. The vast majority of the players are in the lowest two categories, and the salaries of the stars account for the long tail to the right. This big-picture finding is all you typically want from a histogram.

Figure 2.19 Histogram of Salaries

| Histogram (Salary, Data Set #1) | Bin Min | Bin Max | Bin Midpoint | Freq. | Rel. Freq. | Prb. Density |
|---|---|---|---|---|---|---|
| Bin #1 | $400,000.00 | $3,363,636.36 | $1,881,818.18 | 574 | 0.7017 | 0.000000237 |
| Bin #2 | $3,363,636.36 | $6,327,272.73 | $4,845,454.55 | 102 | 0.1247 | 0.000000042 |
| Bin #3 | $6,327,272.73 | $9,290,909.09 | $7,809,090.91 | 49 | 0.0599 | 0.000000020 |
| Bin #4 | $9,290,909.09 | $12,254,545.45 | $10,772,727.27 | 43 | 0.0526 | 0.000000018 |
| Bin #5 | $12,254,545.45 | $15,218,181.82 | $13,736,363.64 | 32 | 0.0391 | 0.000000013 |
| Bin #6 | $15,218,181.82 | $18,181,818.18 | $16,700,000.00 | 8 | 0.0098 | 0.000000003 |
| Bin #7 | $18,181,818.18 | $21,145,454.55 | $19,663,636.36 | 7 | 0.0086 | 0.000000003 |
| Bin #8 | $21,145,454.55 | $24,109,090.91 | $22,627,272.73 | 2 | 0.0024 | 0.000000001 |
| Bin #9 | $24,109,090.91 | $27,072,727.27 | $25,590,909.09 | 0 | 0.0000 | 0.000000000 |
| Bin #10 | $27,072,727.27 | $30,036,363.64 | $28,554,545.45 | 0 | 0.0000 | 0.000000000 |
| Bin #11 | $30,036,363.64 | $33,000,000.00 | $31,518,181.82 | 1 | 0.0012 | 0.000000000 |

[Chart: Histogram of Salary / Data Set #1. Column chart of frequency (vertical axis, 0 to 700) by bin midpoint (horizontal axis).]

When is it useful to fine-tune the StatTools histogram bins? One good example is when the values of the variable are integers, as illustrated next.

EXAMPLE 2.4 LOST OR LATE BAGGAGE AT AIRPORTS

The file Late or Lost Baggage.xlsx contains information on 456 flights into an airport. (This is not real data.) For each flight, it lists the number of bags that were either late or lost. A sample is shown in Figure 2.20. What is the most natural histogram for this data set?

Figure 2.20 Data on Late or Lost Baggage

| Flight | Bags late or lost |
|---|---|
| 1 | 0 |
| 2 | 3 |
| 3 | 5 |
| 4 | 0 |
| 5 | 2 |
| 6 | 2 |
| 7 | 1 |
| 8 | 5 |
| 9 | 1 |
| 10 | 3 |
| 11 | 3 |
| 12 | 4 |
| 13 | 5 |
| 14 | 4 |
| 15 | 3 |

Objective: To fine-tune a histogram for a variable with integer counts.

Solution: From a scan of the data (sort from lowest to highest), it is apparent that all flights had from 0 to 8 late or lost bags. Therefore, the most natural histogram is one that shows the count of each possible value. If you try using the default settings in StatTools, this is not what you will get. However, if you fill in the Histogram dialog box as shown in Figure 2.21, you
will get exactly what you want. The resulting histogram appears in Figure 2.22. Do you see the trick? When you request 9 bins and set the min and max to -0.5 and 8.5, StatTools divides the range from -0.5 to 8.5 into 9 equal-length bins: -0.5 to 0.5, 0.5 to 1.5, and so on, up to 7.5 to 8.5. Of course, each bin contains only one possible value, the integer in the middle. So you get the count of 0s, the count of 1s, and so on. As an extra benefit, StatTools always labels the horizontal axis with the midpoints of the bins, which are exactly the integers you want. (For an even nicer look, we formatted these horizontal axis values with no decimals.)

Figure 2.21 StatTools Histogram Dialog Box
[Dialog box: the Bags late or lost variable of Data Set #1 is selected (the other variable is Flight). Number of Bins is 9, Histogram Minimum is -0.5, and Histogram Maximum is 8.5.]

(Margin note: For a quick analysis, feel free to accept StatTools's automatic histogram options. However, don't be afraid to experiment with these options to make the histogram as meaningful and easy to read as possible.)

Figure 2.22 Histogram of Counts

| Histogram (Bags late or lost, Data Set #1) | Bin Min | Bin Max | Bin Midpoint | Freq. | Rel. Freq. | Prb. Density |
|---|---|---|---|---|---|---|
| Bin #1 | -0.500 | 0.500 | 0.000 | 16 | 0.0351 | 0.04 |
| Bin #2 | 0.500 | 1.500 | 1.000 | 67 | 0.1469 | 0.15 |
| Bin #3 | 1.500 | 2.500 | 2.000 | 113 | 0.2478 | 0.25 |
| Bin #4 | 2.500 | 3.500 | 3.000 | 101 | 0.2215 | 0.22 |
| Bin #5 | 3.500 | 4.500 | 4.000 | 77 | 0.1689 | 0.17 |
| Bin #6 | 4.500 | 5.500 | 5.000 | 44 | 0.0965 | 0.10 |
| Bin #7 | 5.500 | 6.500 | 6.000 | 23 | 0.0504 | 0.05 |
| Bin #8 | 6.500 | 7.500 | 7.000 | 13 | 0.0285 | 0.03 |
| Bin #9 | 7.500 | 8.500 | 8.000 | 2 | 0.0044 | 0.00 |

[Chart: Histogram of Bags late or lost / Data Set #1. Column chart of the frequencies, with the integers 0 through 8 on the horizontal axis.]
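The bin trick is easy to try outside StatTools. The sketch below counts the 15 sample values from Figure 2.20 (not the full 456 flights) into 9 equal-width bins. It follows the StatTools counting rule for this example: an observation belongs to a bin when it is greater than the left endpoint and less than or equal to the right endpoint. The function name is our own, and the code is a minimal illustration rather than a description of how StatTools is implemented.

```python
import math


def bin_counts(values, minimum, maximum, num_bins):
    """Count values into equal-width bins, treating each bin as (left, right]."""
    width = (maximum - minimum) / num_bins
    counts = [0] * num_bins
    for v in values:
        if not (minimum < v <= maximum):
            continue                # outside the histogram range
        # A value exactly on a right endpoint stays in that bin.
        index = math.ceil((v - minimum) / width) - 1
        counts[index] += 1
    return counts


# The 15 sample flights listed in Figure 2.20.
bags = [0, 3, 5, 0, 2, 2, 1, 5, 1, 3, 3, 4, 5, 4, 3]

centered = bin_counts(bags, -0.5, 8.5, 9)   # bins (-0.5, 0.5], ..., (7.5, 8.5]
shifted = bin_counts(bags, -1, 8, 9)        # bins (-1, 0], (0, 1], ..., (7, 8]

print(centered)   # [2, 2, 2, 4, 2, 3, 0, 0, 0]
print(shifted)    # [2, 2, 2, 4, 2, 3, 0, 0, 0]
```

Both settings give one bin per integer, so the counts agree. The centered version has the practical advantage that the bin midpoints are the integers themselves, which is what ends up on the horizontal axis.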
The point of this example is that you do have control over the histogram bins if you are not satisfied with the StatTools defaults. Just keep one technical detail in mind. If a bin extends, say, from 2.7 to 3.4, then its count is the number of observations greater than 2.7 and less than or equal to 3.4. In other words, observations equal to the right endpoint are counted, but observations equal to the left endpoint are not. (They would be counted in the previous bin.) So in this example, if we had designated the minimum and maximum as -1 and 8 in Figure 2.21, we would have gotten the same histogram.

Box Plots

A box plot (also called a box-whisker plot) is an alternative type of chart for showing the distribution of a variable. For the distribution of a single variable, a box plot is not nearly as popular as a histogram, but as you will see in the next chapter, side-by-side box plots are very popular for comparing distributions, such as salaries for men versus salaries for women. As with histograms, box plots are big-picture charts. They show you at a glance some of the key features of a distribution. We explain how they do this in the following continuation of the baseball salary example.

EXAMPLE 2.3 BASEBALL SALARIES (CONTINUED)

(Margin note: Excel has no built-in box plot chart type. In this case, you must rely on StatTools.)

Figure 2.23 StatTools Box-Whisker Plot Dialog Box

Figure 2.24 Box Plot of Salaries
The histogram of the salaries clearly indicated the skewness to the right. Does a box plot of salaries indicate the same behavior?

Objective: To illustrate the features of a box plot, particularly how it indicates skewness.

Solution: This time you must rely on StatTools. There is no easy way to create a box plot with Excel tools only. Fortunately, it is easy with StatTools. Select the Box-Whisker Plot item from the Summary Graphs dropdown list and fill in the resulting dialog box as in Figure 2.23. (There are no other choices to make.) The box plot appears in Figure 2.24. StatTools also lists some mysterious values below the box plot. You can ignore these, but don't delete them. They are the basis for the box plot itself.

[Figure 2.23 dialog box: the Salary variable of Data Set #1 is selected (the other variables are Player, Team, and Position). The only option is Include Key Describing Plot Elements.]

[Figure 2.24 chart: Box Plot of Salary / Data Set #1, drawn horizontally on an axis running from 0 to 35,000,000.]

To help you understand the elements of a box plot, StatTools provides the generic box plot shown in Figure 2.25. (It is not drawn to scale.) You can get this by checking the Include Key Describing Plot Elements option in Figure 2.23, although you will probably want to do this only once or twice. As this generic diagram indicates, the box itself extends left to right from the 1st quartile to the 3rd quartile. This means that it contains the
middle half of the data. The line inside the box is positioned at the median, and the x inside the box is positioned at the mean. The lines (whiskers) coming out either side of the box extend to 1.5 IQRs (interquartile ranges) from the quartiles. These generally include most of the data outside the box. More distant values, called outliers, are denoted separately with small squares. They are hollow for mild outliers and solid for extreme outliers, as indicated in the explanation.

Figure 2.25 Elements of a Generic Box Plot
[Diagram labels: Mean, Median, 1st Quartile, 3rd Quartile, Lower Whisker, Upper Whisker, Interquartile Range (IQR), Mild Outlier, Extreme Outlier. Key: Whiskers extend to the furthest observations that are no more than 1.5 IQR from the edges of the box. Mild outliers are observations between 1.5 IQR and 3 IQR from the edges of the box. Extreme outliers are greater than 3 IQR from the edges of the box.]

The box plot of salaries in Figure 2.24 should now make more sense. It is typical of an extremely right-skewed distribution. The mean is much larger than the median (as we explained earlier), there is virtually no whisker out of the left side of the box (because the first quartile is barely above the minimum value; remember all the players earning $400,000), and there are many outliers to the right (the stars). In fact, many of these outliers overlap one another. You can decide whether you prefer the histogram of salaries to this box plot or vice versa, but both are clearly telling the same story.

Box plots have been around for several decades, and they are probably more popular now than ever. The implementation of box plots in StatTools is just one version of what you might see. Some packages draw box plots vertically, not horizontally. Also, some vary the height of the box to indicate some other feature of the distribution. (The height of the box is irrelevant in StatTools's box plots.) Nevertheless, they all follow the same basic rules and
provide the same basic information.

FUNDAMENTAL INSIGHT: Box Plots Versus Histograms
Box plots and histograms are complementary ways of displaying the distribution of a numerical variable. Although histograms are much more popular and are arguably more intuitive, box plots are still informative. Besides, side-by-side box plots are very useful for comparing two or more populations.

PROBLEMS

Level A

6. The file P02_06.xlsx lists the average time (in minutes) it takes citizens of 379 metropolitan areas to travel to work and back home each day.
a. Create a histogram of the daily commute times.
b. Find the most representative average daily commute time across this distribution.
c. Find a useful measure of the variability of these average commute times around the mean.
d. The empirical rule for standard deviations indicates that approximately 95% of these average travel times will fall between which two values? For this particular data set, is this empirical rule at least approximately correct?

7. The file P02_07.xlsx includes data on 204 employees at the (fictional) company Beta Technologies.
a. Indicate the data type for each of the six variables included in this data set.
b. Create a histogram of the Age variable. How would you characterize the age distribution for these employees?
c. What proportion of these full-time Beta employees are female?
d. Find appropriate summary measures for each of the numerical variables in this data set.
e. For the Salary variable, explain why the empirical rules for standard deviations do
or do not apply.

8. The file P02_08.xlsx contains data on 500 shipments of one of the computer components that a company manufactures. Specifically, the proportion of items that are defective is listed for each shipment.
a. Create a histogram that will help a production manager understand the variation of the proportion of defective components in the company's shipments.
b. Is the mean or median the most appropriate measure of central location for this data set? Explain your reasoning.
c. Discuss whether the empirical rules for standard deviations apply. Can you tell, or at least make an educated guess, by looking at the shape of the histogram? Why?

9. The file P02_09.xlsx lists the times required to service 200 consecutive customers at a (fictional) fast-food restaurant.
a. Create a histogram of the customer service times. How would you characterize the distribution of service times?
b. Calculate the mean, median, and first and third quartiles of this distribution.
c. Which measure of central tendency, the mean or the median, is more appropriate in describing this distribution? Explain your reasoning.
d. Find and interpret the variance and standard deviation of these service times.
e. Are the empirical rules for standard deviations applicable for these service times? If not, explain why. Can you tell whether they apply, or at least make an educated guess, by looking at the shape of the histogram? Why?

10. The file P02_10.xlsx contains midterm and final exam scores for 96 students in a corporate finance course.
a. Create a histogram for each of the two sets of exam scores.
b. What are the mean and median scores on each of these exams?
c. Explain why the mean and median values are different for these data.
d. Based on your previous answers, how would you characterize this group's performance on the midterm and on the final exam?
e. Create a new column of differences (final exam score minus midterm score). A positive value means the student improved, and a negative value means the student did the opposite. What are the mean and median of the
differences? What does a histogram of the differences indicate?

11. The file P02_11.xlsx contains data on 148 houses that were recently sold in a (fictional) suburban community. The data set includes the selling price of each house, along with its appraised value, square footage, number of bedrooms, and number of bathrooms.
a. Which of these variables are continuous? Which are discrete?
b. Create histograms for the appraised values and selling prices of the houses. How are these two distributions similar? How are they different?
c. Find the maximum and minimum sizes (measured in square footage) of all sample houses.
d. Find the houses at the 80th percentile of all sample houses with respect to appraised value. Find the houses at the 80th percentile of all sample houses with respect to selling price.
e. What are the typical number of bedrooms and the typical number of bathrooms in this set of houses? How do you interpret the word "typical"?

12. The file P02_12.xlsx includes data on the 50 top graduate programs in the United States, according to a recent U.S. News & World Report survey.
a. Indicate the type of data for each of the 10 variables considered in the formulation of the overall ranking.
b. Create a histogram for each of the numerical variables in this data set. Indicate whether
each of these distributions is approximately symmetric or skewed. Which, if any, of these distributions are skewed to the right? Which, if any, are skewed to the left?
c. Identify the schools with the largest and smallest annual out-of-state tuition and fee levels.
d. Find the annual out-of-state tuition and fee levels at each of the 25th, 50th, and 75th percentiles for these schools.
e. For Excel 2010 users only, find these percentiles using both the PERCENTILE.INC and PERCENTILE.EXC functions. Can you explain how and why they are different (if they are indeed different)?
f. Create a box plot to characterize the distribution of these MBA salaries. Is this distribution essentially symmetric or skewed? If there are any outliers on either end, which schools do they correspond to? Are these same schools outliers in box plots of any of the other numerical variables (from columns E to L)?

13. The file P02_13.xlsx contains the thickness (in centimeters) of 252 mica pieces. A piece meets specifications if its thickness is between 7 and 15 centimeters.
a. What fraction of mica pieces meets specifications?
b. Are the empirical rules for standard deviations at least approximately valid for these data? Can you tell, or at least make an educated guess, by looking at a histogram of the data?
c. If the histogram of the data is approximately bell-shaped and you want about 95% of the observations to meet specifications, is it sufficient for the average and standard deviation to be, at least approximately, 11 and 2 centimeters, respectively?

14. Recall that the file Supermarket Transactions.xlsx contains over 14,000 transactions made by supermarket customers over a period of approximately two years. Using these data, create a box plot to characterize the distribution of revenues earned from the given transactions. Is this distribution essentially symmetric or skewed? What if you restrict the box plot to transactions in the food product family? (Hint: StatTools will not let you define a second data set that is a subset of an existing data set. But you can copy data for the second question to a second worksheet.)

15. Recall that the file Baseball Salaries 2009.xlsx contains data on 818 MLB players as of May 2009. Using these data, create a box plot to characterize the distribution of salaries of all pitchers. Do the same for non-pitchers. Summarize your findings. (See the hint in the previous problem.)

16. The file P02_16.xlsx contains traffic data from 256 weekdays on four variables. Each variable lists the number of vehicle arrivals to a tollbooth during a specific five-minute period of the day.
a. Create a histogram of each variable. How would you characterize and compare these distributions?
b. Find a table of summary measures for these variables that includes (at least) the means, medians, standard deviations, first and third quartiles, and 5th and 95th percentiles. Use these to compare the arrival process at the different times of day.

Level B

17. The file P02_17.xlsx contains salaries of 200 recent graduates from a (fictional) MBA program.
a. What salary level is most indicative of those earned by students graduating from this MBA program this year?
b. Do the empirical rules for standard deviations apply to these data? Can you tell, or at least make an educated guess, by looking at the shape of the histogram? Why?
c. If the empirical rules apply here, between which two numbers can you be about 68% sure that the salary of any one of these 200 students will fall?
d. If the MBA program wants to make a statement such as "Some of our recent graduates started out making X dollars or more, and almost all of them started out making at least Y dollars" for their promotional materials, what values of X and Y would you suggest they use? Defend your choice.
e. As an admissions officer of this MBA program, how would you proceed to use these findings to market the program to prospective students?

18. The file P02_18.xlsx contains daily values of the Standard & Poor's 500 Index from 1970 to 2009. It also contains percentage changes in the index from
each day to the next.
a. Create a histogram of the percentage changes and describe its shape.
b. Check the percentage of these percentage changes that are more than k standard deviations from the mean for k = 1, 2, 3, 4, and 5. Are these approximately what the empirical rules indicate, or are there "fat" tails? Do you think this has any real implications for the financial markets? (Note that we have discussed the empirical rules only for k = 1, 2, and 3. For k = 4 and 5, they indicate that only 0.006% and 0.0001% of the observations should be this distant from the mean.)

2.5 TIME SERIES DATA

If we are analyzing time series variables, summary measures such as means and standard deviations, and charts such as histograms and box plots, often don't make much sense. Our main interest in time series variables is how they change over time, and this information is lost in traditional summary measures and in histograms or box plots. Imagine, for example, that you are interested in daily closing prices of a stock that has historically been between $20 and $60. If you create a histogram with a bin such as $45 to $50, you will get a count of all daily closing prices in this interval, but you won't have a clue of when they occurred. The histogram is missing a key feature: time. Similarly, if you report the mean of a time series such as the monthly Dow Jones average over the past 40 years, you will get a measure that isn't very relevant for the current and future values of the Dow.

Therefore, we turn to a different but very intuitive type of chart called a time series graph. This is simply a graph of the values of one or more time series, using time on the horizontal axis, and it is always the place to start a time series analysis. We illustrate some possibilities in the following example.

EXAMPLE 2.5 CRIME IN THE U.S.

The file Crime in
US.xlsx contains annual data on violent and property crimes for the years 1960 to 2007. Part of the data is listed in Figure 2.26. This shows the number of crimes. The rates per 100,000 population are not shown, but they also appear in the file. Are there any apparent trends in this data? If so, are the trends the same for the different types of crimes?

Figure 2.26 Crime Data

| Year | Population | Violent crime total | Murder and nonnegligent manslaughter | Forcible rape | Robbery | Aggravated assault | Property crime total | Burglary | Larceny-theft | Motor vehicle theft |
|---|---|---|---|---|---|---|---|---|---|---|
| 1960 | 179,323,175 | 288,460 | 9,110 | 17,190 | 107,840 | 154,320 | 3,095,700 | 912,100 | 1,855,400 | 328,200 |
| 1961 | 182,992,000 | 289,390 | 8,740 | 17,220 | 106,670 | 156,760 | 3,198,600 | 949,600 | 1,913,000 | 336,000 |
| 1962 | 185,771,000 | 301,510 | 8,530 | 17,550 | 110,860 | 164,570 | 3,450,700 | 994,300 | 2,089,600 | 366,800 |
| 1963 | 188,483,000 | 316,970 | 8,640 | 17,650 | 116,470 | 174,210 | 3,792,500 | 1,086,400 | 2,297,800 | 408,300 |
| 1964 | 191,141,000 | 364,220 | 9,360 | 21,420 | 130,390 | 203,050 | 4,200,400 | 1,213,200 | 2,514,400 | 472,800 |
| 1965 | 193,526,000 | 387,390 | 9,960 | 23,410 | 138,690 | 215,330 | 4,352,000 | 1,282,500 | 2,572,600 | 496,900 |
| 1966 | 195,576,000 | 430,180 | 11,040 | 25,820 | 157,990 | 235,330 | 4,793,300 | 1,410,100 | 2,822,000 | 561,200 |
| 1967 | 197,457,000 | 499,930 | 12,240 | 27,620 | 202,910 | 257,160 | 5,403,500 | 1,632,100 | 3,111,600 | 659,800 |
| 1968 | 199,399,000 | 595,010 | 13,800 | 31,670 | 262,840 | 286,700 | 6,125,200 | 1,858,900 | 3,482,700 | 783,600 |

Excel Tip: Note the format of the variable names in row 1. If you have long variable names, one possibility is to align them vertically and check the Wrap Text option. (These are both available through the Format Cells command, which can be accessed by right-clicking any cell.) With these changes, the row 1 labels are neither too tall nor too wide.

Objective: To see how time series graphs help to detect trends in crime data.
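The rates per 100,000 population mentioned above are not listed in Figure 2.26, but the idea behind them can be sketched from the count and population columns. The Python below is a minimal illustration using the first three rows of the figure. The function name is our own, and we are assuming that a rate is simply the count divided by the population, times 100,000; the file's own rate columns may be rounded or sourced differently.

```python
def rate_per_100k(count, population):
    """Number of crimes per 100,000 residents."""
    return count / population * 100_000


# (year, population, violent crime total, property crime total),
# copied from the first three data rows of Figure 2.26.
rows = [
    (1960, 179_323_175, 288_460, 3_095_700),
    (1961, 182_992_000, 289_390, 3_198_600),
    (1962, 185_771_000, 301_510, 3_450_700),
]

# Map each year to its (violent rate, property rate) pair.
rates = {
    year: (rate_per_100k(violent, pop), rate_per_100k(prop, pop))
    for year, pop, violent, prop in rows
}

for year, (violent_rate, property_rate) in rates.items():
    print(year, round(violent_rate, 1), round(property_rate, 1))
```

For 1960 this gives a violent crime rate of roughly 161 per 100,000. Because a rate divides out population growth, a series of rates can trend differently from the corresponding series of raw counts.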
Solution: It is actually quite easy to create a time series graph with Excel tools only (no add-ins). We illustrate the process in the Time Series worksheet of the finished version of the crime file. But, as usual, StatTools is a bit quicker and easier. We will illustrate a few of the many time series graphs you could create from this data set. As usual, start by designating a StatTools data set. Then select the Time Series Graph item from the Time Series and Forecasting dropdown list. (Note that this item is not in the Summary Graphs group.) The resulting dialog box appears in Figure 2.27. At the top, you can choose between a graph with a label and one without a label. The label is for time, so if you have a time variable (in this case, Year), choose the "with label" option. This leads to two columns of variables, one for the label (Lbl) and one for values (Val). Check Year in the Lbl column and select one or more variables in the Val column. For this first graph, we selected the Violent crime total and Property crime total variables to get started.

Figure 2.27 StatTools Time Series Graph Dialog Box
[Dialog box: the variables of Data Set #1 are listed with Lbl and Val check boxes (Year, Population, Violent crime total, Murder and nonnegligent manslaughter, Forcible rape, Robbery, and so on). The graph format is Time Series with Label. The options are Plot All Variables on a Single Graph and Use Two Y-Axes for Graphs of Two Variables.]

When you select multiple Val variables, the first option at the bottom lets you plot all variables in a single chart or create a separate chart for each. We chose the former. Furthermore, when you select exactly two Val variables, you can use two different Y-axis scales for the two variables. This is useful when they are of very different
magnitudes, as is the case for violent and property crimes, so we checked this option. (This option isn't available if you select more than two Val variables. In this case, all are forced to share the same Y-axis scale.) The resulting time series graph appears in Figure 2.28. The graph shows that both types of crimes increased sharply until the early 1990s and have been gradually decreasing since then.

The whole purpose of time series graphs is to detect historical patterns in the data. In this crime example, we are looking for broad trends.

Figure 2.28 Total Violent and Property Crimes

Figure 2.29 Population Totals

However, the time series graph of population in Figure 2.29 indicates that the U.S. population has increased steadily since 1960, so it is possible that the trend in crime rates is different than the trends in
Figure 2.28. This is indeed true, as seen in Figure 2.30. It shows the good news that the crime rate has been falling fairly rapidly since its peak in the early 1990s.[3]

[3] Why did this occur? One compelling reason was suggested by Levitt and Dubner in their popular book Freakonomics. Read their somewhat controversial analysis to see if you agree.

Figure 2.30 Violent and Property Crime Rates

Think about interesting questions you might ask about crime in the U.S. These will lead naturally to particular time series graphs that help answer these questions.

StatTools Tip: StatTools remembers your previous choices for any particular type of analysis, such as time series graphs. Therefore, if you run another analysis of the same type, make sure to uncheck variables you don't want in the current analysis.

Because it is so easy, we also created two more time series graphs that appear in Figures 2.31 and 2.32. The first shows the crime rates for the various types of violent crimes, whereas the second does the same for property crimes. The
patterns (up, then down) are similar for each type of crime, but they are certainly not identical. For example, the larceny-theft and motor vehicle theft rates both peaked in the early 1990s, but the burglary rate was well in decline by this time. Finally, Figure 2.31 indicates one problem with having multiple time series variables on a single chart: any variable with small values can become swamped by variables with much larger values. It might be a good idea to create two separate charts for these four variables, with murder and rape on one and robbery and aggravated assault on the other. Then you could see the murder and rape patterns more clearly.

Figure 2.31 Rates of Violent Crime Types

Figure 2.32 Rates of Property Crime Types

CHANGES IN EXCEL 2010

One new feature in Excel 2010 is the sparkline. This is a
minichart embedded in a cell. Although it applies to any kind of data, it is especially useful for time series data. Try the following. Open a file, such as the problem file P03_30.xlsx, that has multiple time series, one per column. Highlight the cell below the last time series value of the first time series and click on the Line item in the Sparklines group on the Insert ribbon. In the resulting dialog box, highlight the data in the first time series. You will get a mini time series graph in the cell. Now copy this cell across for the other time series, and increase the row height to expand the graphs. Change any of the time series values to see how the sparklines change automatically. We suspect that these instant-graph sparklines will become very popular.

As we mentioned earlier, traditional summary measures such as means, medians, and standard deviations are often not very meaningful for time series data, at least not for the original data. However, it is often useful to find differences or percentage changes in the data from period to period and then report traditional measures of these. The following example illustrates these ideas.

EXAMPLE 2.6 THE DJIA INDEX

The Dow Jones Industrial Average (DJIA, or simply "the Dow") is an index of 30 large publicly traded U.S. stocks and is one of the most quoted stock indexes. The file DJIA Monthly Close.xlsx contains monthly values of the Dow from 1950 through 2009. What is a useful way to summarize the data in this file?

Objective: To find useful ways to summarize the monthly Dow data.

Figure 2.33 Summary Measures and Graph
of the Dow

| One Variable Summary | Closing Value (Data Set #1) |
|---|---|
| Mean | 3222.12 |
| Std. Dev. | 3840.15 |
| Median | 952.39 |
| 1st Quartile | 755.23 |
| 3rd Quartile | 3913.42 |

[Chart: Time Series of Closing Value, Data Set #1]

Solution

A time series graph and a few summary measures of the Dow appear in Figure 2.33. The graph clearly shows a gradual increase through the early 1990s (except for Black Monday in 1987), then a sharp increase through the rest of the 1990s, and finally huge swings in the past decade. The mean (3222), the median (952), and any of the other traditional summary measures are of historical interest at best.

In situations like this, it is useful to look at percentage changes in the Dow. These have been calculated in the file and have been used to create the summary measures and time series graph in Figure 2.34. The graph shows that these percentage changes have fluctuated around zero, sometimes with wild swings (like Black Monday). Actually, the mean and median of the percentage changes are slightly positive, about 0.64% and 0.85%, respectively.

Figure 2.34 Summary Measures and Graph of Percentage Changes of the Dow

| One Variable Summary | Percentage Change (Data Set #1) |
|---|---|
| Mean | 0.00638 |
| Std. Dev. | 0.04175 |
| Median | 0.00851 |
| 1st Quartile | -0.01648 |
| 3rd Quartile | 0.03289 |

[Chart: Time Series of Percentage Change, Data Set #1]
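The period-to-period percentage changes used in this example are simple to compute outside Excel as well. The sketch below is an illustration in Python under stated assumptions: the closing values are made-up numbers, not taken from DJIA Monthly Close.xlsx, and `pct_changes` is a hypothetical helper name.

```python
from statistics import mean, median

def pct_changes(values):
    """Return the fractional change from each period to the next.

    A result of 0.05 means a 5% increase over the previous period.
    The result has one fewer element than the input.
    """
    return [(curr - prev) / prev for prev, curr in zip(values, values[1:])]

# Made-up monthly closing values, for illustration only.
closes = [200.0, 210.0, 199.5, 219.45]

changes = pct_changes(closes)

# Traditional summary measures are then applied to the changes,
# not to the original closing values.
print("changes:", [round(c, 4) for c in changes])
print("mean change:", round(mean(changes), 4))
print("median change:", round(median(changes), 4))
```

The design choice mirrors the text: the original series trends upward, so its mean and median say little, whereas the changes fluctuate around a roughly constant level and can be summarized in the usual way.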
In addition, the quartiles show that 25% of the changes have been less than -1.65% and 25% have been greater than 3.29%. Finally, the empirical rules indicate, for example, that about 95% of the percentage changes over this period have been no more than two standard deviations (8.35%) from the mean. (You can check that the actual percentage within two standard deviations of the mean is 95.41%, so this empirical rule applies very well.)[4]

PROBLEMS

Level A

19. The file P02_19.xlsx lists annual percentage changes in the Consumer Price Index (CPI) from 1914 through 2008. Find and interpret the first and third quartiles and the interquartile range for these annual percentage changes. Discuss whether these are even meaningful summary measures for this time series data set. Suppose that the data set listed the actual CPI values, not percentage changes, for each year. Would the quartiles and interquartile range be meaningful in this case? Why or why not?

20. The Consumer Confidence Index (CCI) attempts to measure people's feelings about general business conditions, employment opportunities, and their own income prospects. Monthly average values of the CCI are listed in the file P02_20.xlsx.
a. Create a time series graph of the CCI values.
b. Have U.S. consumers become more or less confident through time?
c. How would you explain recent variations in the overall trend of the CCI?

21. The file P02_21.xlsx contains monthly interest rates on 30-year fixed-rate mortgages in the United States from 1977 to 2009. The file also contains rates on 15-year fixed-rate mortgages from late 1991 to 2009. What conclusions can you draw from a time series graph of these mortgage rates? Specifically, what has been happening to mortgage rates in general, and how does the behavior of the 30-year rates compare to the behavior of the 15-year rates?

22. The file
P02_22.xlsx contains annual trade balances (exports minus imports) from 1980 to 2008.
a. Create a time series graph for each of the three time series in this file.
b. Characterize recent trends in the U.S. balance of trade figures using your time series graphs.

23. What has happened to the total number and average size of farms in the United States since the middle of the 20th century? Answer this question by creating a time series graph of the data from the U.S. Department of Agriculture in the file P02_23.xlsx. Is the observed result consistent with your knowledge of the structural changes within the U.S. farming economy?

24. Is educational attainment in the United States on the rise? Explore this question by creating time series graphs for each of the variables in the file P02_24.xlsx. Comment on any observed trends in the annual educational attainment of the general U.S. population over the given period.

25. The monthly averages of the federal funds rate and the bank prime loan rate are listed in the file P02_25.xlsx.
a. Describe the time series behavior of these two variables. Can you discern any cyclical or other patterns in the time series graphs of these key interest rates?
b. Discuss whether it would make much sense, especially to a person at the present time, to quote traditional summary measures such as means or percentiles of these series.

Level B

26. In which months of the calendar year do U.S. gasoline service stations typically have their lowest retail sales levels? In which months of the calendar year do the service stations typically have their highest retail sales levels? Create time series graphs for the monthly data in the file P02_26.xlsx to respond to these two questions. There are really two series, one of actual values and one of seasonally adjusted values. The latter adjusts for any possible seasonality, such as higher values in June and lower values in January, so that any trends are more apparent.

27. The file P02_27.xlsx contains monthly data for total U.S. retail sales of building materials. There are really two
series, one of actual values and one of seasonally adjusted values. The latter adjusts for any possible seasonality, such as higher values in June and lower values in January, so that any trends are more apparent.
a. Is there an observable trend in these data? That is, do the values of the series tend to increase or decrease over time?
b. Is there a seasonal pattern in these data? If so, what is the seasonal pattern?

28. The file P02_28.xlsx contains total monthly U.S. retail sales data for a number of years. There are really two series, one of actual sales and one of seasonally adjusted sales. The latter adjusts for any possible seasonality, such as higher sales in December and lower sales in February, so that any trends are more apparent.
a. Create a graph of both time series and comment on any observable trends, including a possible seasonal pattern, in the data. Does seasonal adjustment make a difference? How?
b. Based on your time series graph of actual sales, make a qualitative projection about the total retail sales levels for the next 12 months. Specifically, in which months of the subsequent year do you expect retail sales levels to be highest? In which months of the subsequent year do you expect retail sales levels to be lowest?

[4] One of the problems asks you to check whether all three of the empirical rules apply to similar stock price data. The extreme tails are where there are some surprises.

2.6 OUTLIERS AND MISSING VALUES

Most textbooks on data analysis, including this one, tend to use example data sets that are "cleaned up."
Unfortunately, the data sets you are likely to encounter in your jobs are often not so clean. Two particular problems you will encounter are outliers and missing data, the topics of this section. There are no easy answers for dealing with these problems, but you should at least be aware of the issues.

2.6.1 Outliers

An outlier is literally a value or an entire observation that lies well outside of the norm. For the baseball data, Alex Rodriguez's salary of $33 million is definitely an outlier. This is indeed his correct salary (the number wasn't entered incorrectly), but it is way beyond what most players make. Actually, statisticians disagree on an exact definition of an outlier. Going by the third empirical rule, you might define an outlier as any value more than three standard deviations from the mean, but this is only a rule of thumb. Let's just agree to define outliers as extreme values, and then for any particular data set, you can decide how extreme a value needs to be to be labeled an outlier.

Sometimes an outlier is easy to detect and deal with. For example, this is often the case with data entry errors. Suppose a data set includes a Height variable, a person's height measured in inches, and you see a value of 720. This is certainly an outlier, and it is certainly an error. Once you spot it, you can go back and check this observation to see what the person's height should be. Maybe an extra 0 was accidentally appended and the true value is 72. In any case, this type of outlier can usually be fixed easily.

Sometimes a careful check of the variable values, one variable at a time, will not reveal any outliers, but there still might be unusual combinations of values. For example, it would be strange to find a person with Age equal to 10 and Height equal to 72. Neither of these values is unusual by itself, but the combination is certainly unusual. Again, this would probably be a result of a data entry error, but it would be harder to spot. The scatterplots discussed in the next chapter are useful for spotting
unusual combinations.

It isn't always easy to detect outliers, but an even more important issue is what to do about them. Of course, if they are due to data entry errors, they can be fixed, but what if they are legitimate values like Alex Rodriguez's salary? One or a few wild outliers like this one can dominate a statistical analysis. For example, they can make a mean or standard deviation much different than if the outliers were not present.

For this reason, some people argue, rather naively, that outliers should be eliminated before running statistical analyses. However, it is not appropriate to eliminate outliers simply because the resulting analysis comes out "nicer" without them. There has to be a legitimate reason for eliminating outliers, and such a reason sometimes exists. For example, suppose you want to analyze salaries of "typical" managers at your company. Then it is probably appropriate to eliminate the CEO and possibly other high-ranking executives from the analysis, arguing that they aren't really part of the population of interest and would just throw off the results. Or if you are interested in the selling prices of "typical" homes in some community, it is probably appropriate to eliminate the few homes that sell for over $2 million, again arguing that these are not the types of homes you are interested in.

One good way of dealing with outliers is to report results with the outliers and without them.

Probably the best advice we can give for dealing with outliers is to run the analyses two
ways: with the outliers and without them. This way, you can report the results both ways, and you are being honest.

2.6.2 Missing Values

There is no missing data in the baseball salary data set. All 818 observations have a value for each of the four variables. For real data sets, however, this is probably the exception rather than the rule. Most real data sets, unfortunately, have gaps in the data. This could be because a person didn't want to provide all the requested personal information (what business is it of yours how old I am or whether I drink alcohol?), it could be because data doesn't exist (stock prices in the 1990s for companies that went public after 2000), or it could be because some values are simply unknown. Whatever the reason, you will undoubtedly encounter data sets with varying degrees of missing values.

As with outliers, there are two issues: how to detect missing values and what to do about them. The first issue isn't as trivial as you might imagine. For an Excel data set, you might expect missing data to be obvious from blank cells. This is certainly one possibility, but there are others. Perhaps surprisingly, missing data are coded in a variety of strange ways. One common method is to code missing values with an unusual number such as -9999 or 9999. Another method is to code missing values with a symbol rather than a number. If you know the code (and it is often supplied in a footnote), then it is usually a good idea, at least in Excel, to perform a global search and replace, replacing all of the missing value codes with blanks.

The more important issue is what to do about missing values. One option is to simply ignore them. Then you will have to be aware of how the software deals with missing values. For example, if you use Excel's AVERAGE function on a column of data with some missing values, it will react the way you would hope and expect: it adds all the existing values and divides by the number of existing values. StatTools reacts in the same way for all of the measures discussed in this chapter
(after alerting you that there are indeed missing values). We will say more about how StatTools deals with missing data for other analyses in later chapters. If you are using other statistical software such as SPSS or SAS, you should read its online help to learn how its various statistical analyses deal with missing data.

Because this is such an important topic in real-world data analysis, researchers have studied many ways of filling in the gaps so that the missing data problem goes away (or is at least disguised). One possibility is to fill in all of the missing values in a column with the average of the existing values in that column. Indeed, this is an option in some software packages, but we don't believe it is usually a very good option. (Is there any reason to believe that missing values would be average values if they were known? Probably not.) Another possibility is to examine the existing values in the row of any missing value. It is possible that they provide some information on what the missing value should be. For example, if a person is male, is 55 years old, has an MBA degree from Harvard, and has been a manager at an oil company for 25 years, this should probably help to predict his missing salary. (It probably isn't below $100,000.) We will not discuss this issue any further here because it is quite complex, and there are no easy answers. But be aware that you will undoubtedly have to deal with missing data at some point in your jobs, either by ignoring the missing values or by filling in the gaps in some way.
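The steps discussed in this section (recode missing-value codes as blanks, average only the existing values, and optionally fill gaps with that average) can be sketched in Python. This is an illustration under stated assumptions, not a description of how Excel or StatTools is implemented: the salary figures and the particular codes are invented, `clean`, `mean_of_existing`, and `fill_with_mean` are hypothetical helper names, and Python's `None` plays the role of a blank cell.

```python
from statistics import mean, stdev

# Assumed missing-value codes for this illustration; a real data set
# documents its own codes, often in a footnote.
MISSING_CODES = {9999, -9999, "NA", ""}

def clean(column, missing_codes=MISSING_CODES):
    """Replace each missing-value code with None (the analogue of a blank cell)."""
    return [None if value in missing_codes else value for value in column]

def mean_of_existing(column):
    """Average the existing values only, ignoring blanks."""
    existing = [value for value in column if value is not None]
    return mean(existing)

def fill_with_mean(column):
    """Fill every blank with the average of the existing values."""
    average = mean_of_existing(column)
    return [average if value is None else value for value in column]

# Invented salary column with two differently coded missing values.
salaries = [52000, 9999, 61000, "NA", 58000]

cleaned = clean(salaries)
filled = fill_with_mean(cleaned)

existing = [value for value in cleaned if value is not None]
print("cleaned:", cleaned)
print("mean of existing values:", mean_of_existing(cleaned))
print("std dev of existing values:", round(stdev(existing), 1))
print("std dev after mean fill:", round(stdev(filled), 1))
```

In this small case, filling with the mean leaves the average unchanged but makes the column look less variable than the observed values are, which is one way to see why the text is skeptical of this option.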
PROBLEMS

Level A

29. The file P02_29.xlsx contains monthly percentages of on-time arrivals at several of the largest U.S. airports and all of the major airlines from 1988 to 2008. The "By Airline" sheet contains a lot of missing data, presumably because some of the airlines were not in existence in 1988 and some went out of business before 2008. The "By Airport" sheet contains missing data only for Atlantic City International Airport (and we're not sure why).
a. Use StatTools to calculate summary measures (means, medians, standard deviations, and any other measures you would like to report) for each airline and each airport. How does it deal with missing data?
b. Use StatTools to create histograms for a few of the airports and a few of the airlines, including Atlantic City International. How does it deal with missing data?
c. Use StatTools to create time series graphs for a few of the airports and a few of the airlines, including Atlantic City International. How does it deal with missing data?
d. Which airports and which airlines have done a good job? Which would you like to avoid?

30. The Wall Street Journal CEO Compensation Study analyzed CEO pay for many U.S. companies with fiscal year 2008 revenue of at least $5 billion that filed their proxy statements between October 2008 and March 2009. The data are in the file P02_30.xlsx. (Note: This data set is a somewhat different CEO compensation data set from the one used as an example in the next chapter.)
a. Create a new variable that is the sum of salary and bonus, and create a box plot of this new variable.
b. As the box plot key indicates, mild outliers are observations between 1.5 IQR (interquartile range) and 3.0 IQR from the edge of the box, whereas extreme outliers are greater than 3.0 IQR from the edge of the box. Use these definitions to identify the names of all CEOs who are mild outliers and all those who are extreme outliers.

Level B

31. There is no consistent way of defining an outlier that everyone agrees upon. For example, some people define an outlier as any observation more than three standard deviations from the mean. Other people use the box plot definition, where an outlier (moderate or extreme) is any observation more than 1.5 IQR from the edges of the box, and some people care only about the extreme box plot-type outliers, those that are more than 3.0 IQR from the edges of the box. The file P02_18.xlsx contains daily percentage changes in the S&P 500 index over a four-year period. Identify outliers (days when the percentage change was unusually large in either a negative or positive direction) according to each of these three definitions. Which definition produces the most outliers?

32. Sometimes it is possible that missing data are predictive in the sense that rows with missing data are somehow different from rows without missing data. Check this with the file P02_32.xlsx, which contains blood pressures for 1000 fictional people, along with variables that can be related to blood pressure. These other variables have a number of missing values, presumably because the people didn't want to report certain information.
a. For each of these other variables, find the mean and standard deviation of blood pressure for all people without missing values and for all people with missing values. Can you conclude that the presence or absence of data for any of these other variables has anything to do with blood pressure?
b. Some analysts suggest filling in missing data for a variable with the mean of the nonmissing values for that variable. Do this for the missing data in the blood pressure data. In general, do you think this is a valid way of filling in missing data? Why or why not?

2.7 EXCEL TABLES FOR FILTERING, SORTING, AND SUMMARIZING[5]

In this section we introduce a great tool that was introduced in Excel 2007: tables. Tables were somewhat available in previous versions of Excel, but they were never called tables before, and some of the really useful features of Excel 2007 tables are new.

It is useful to begin with some terminology and history. Earlier in this chapter, we discussed data
arranged in a rectangular range of rows and columns, where each row is an observation and each column is a variable, with variable names at the top of each column. Informally, we refer to such a range as a data set. In fact, this is the technical term used by StatTools. In previous versions of Excel, data sets were called lists, and Excel provided several tools for dealing with lists. In Excel 2007, recognizing the importance of data sets, Microsoft made them much more prominent and provided even better tools for analyzing them. Specifically, you now have the ability to designate a rectangular data set as a table and then employ a number of new and powerful tools for analyzing tables. These tools include filtering, sorting, and summarizing.

[5] This section indicates how powerful the Excel 2007 table filtering tools are. However, if you are interested in more advanced filters or database (D) functions, see Chapter 2's Advanced Filter and Database Functions on this textbook's Web site.

We illustrate Excel tables in the following example. Before proceeding, however, we mention one important caveat. Some of the tools discussed in this section will not work on an Excel file in the old .xls format. Therefore, we purposely illustrate them on files saved in the new .xlsx format (new to Excel 2007).

EXAMPLE 2.7 HYTEX'S CUSTOMER DATA

The file Catalog Marketing.xlsx contains data on 1000 customers of HyTex, a fictional direct marketing company, for the current year. A sample of the data appears in Figure 2.35. The variables are defined as follows.
Figure 2.35 HyTex Customer Data

| Person | Age | Gender | OwnHome | Married | Close | Salary | Children | History | Catalogs | Region | State | City | FirstPurchase | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 1 | 16400 | 1 | 1 | 12 | South | Florida | Orlando | 10/23/2003 | 218 |
| 2 | 2 | 0 | 1 | 1 | 0 | 108100 | 3 | 3 | 18 | Midwest | Illinois | Chicago | 5/25/2001 | 2632 |
| 3 | 2 | 1 | 1 | 1 | 1 | 97300 | 1 | NA | 12 | South | Florida | Orlando | 8/18/2007 | 3048 |
| 4 | 3 | 1 | 1 | 1 | 1 | 26800 | 0 | 1 | 12 | East | Ohio | Cleveland | 12/26/2004 | 435 |
| 5 | 1 | 1 | 0 | 0 | 1 | 11200 | 0 | NA | 6 | Midwest | Illinois | Chicago | 8/4/2007 | 106 |
| 6 | 2 | 0 | 0 | 0 | 1 | 42800 | 0 | 2 | 12 | West | Arizona | Phoenix | 3/4/2005 | 759 |
| 7 | 2 | 0 | 0 | 0 | 1 | 34700 | 0 | NA | 18 | Midwest | Kansas | Kansas City | 6/11/2007 | 1615 |
| 8 | 3 | 0 | 1 | 1 | 0 | 80000 | 0 | 3 | 6 | West | California | San Francisco | 8/17/2001 | 1985 |
| 9 | 2 | 1 | 1 | 0 | 1 | 60300 | 0 | NA | 24 | Midwest | Illinois | Chicago | 5/29/2007 | 2091 |
| 10 | 3 | 1 | 1 | 1 | 0 | 62300 | 0 | 3 | 24 | South | Florida | Orlando | 6/9/2003 | 2644 |

- Age: coded as 1 for 30 or younger, 2 for 31 to 55, 3 for 56 or older
- Gender: coded as 1 for males, 0 for females
- OwnHome: coded as 1 if the customer owns a home, 0 otherwise
- Married: coded as 1 if the customer is currently married, 0 otherwise
- Close: coded as 1 if the customer lives reasonably close to a shopping area that sells similar merchandise, 0 otherwise
- Salary: combined annual salary of the customer and spouse (if any)
- Children: number of children living with the customer
- History: coded as "NA" if the customer had no dealings with HyTex before this year, 1 if the customer was a low-spending customer last year, 2 if medium-spending, 3 if high-spending
- Catalogs: number of catalogs sent to the customer this year
- FirstPurchase: date of the customer's first purchase with HyTex
- AmountSpent: total amount of purchases made by the customer this year

In addition, the variables Region, State, and City indicate where the customer resides. HyTex wants to find some useful and quick information about its customers by using an Excel table. How can it proceed?

Objective: To illustrate Excel tables for analyzing the HyTex data.
Solution

The range A1:O1001 is in the form of a data set: it is a rectangular range bounded by blank rows and columns, where each row is an observation, each column is a variable, and variable names appear in the top row. Therefore, it is a candidate for an Excel table. However, it doesn't benefit from the new table tools until you actually designate it as a table. To do so, select any cell in the data set, click on the Table button in the left part of the Insert ribbon (see Figure 2.36), and accept the default options. (An alternative way to designate an Excel table is to select any of the options on the Format as Table dropdown list on the Home ribbon.) Two things happen. First, the data set is designated as a table, it is formatted nicely, and a dropdown arrow appears next to each variable name, as shown in Figure 2.37. Second, a new Table Tools Design ribbon becomes available (see Figure 2.38). This ribbon is available any time the active cell is inside a table. Note that the table is named Table1 by default (if this is the first table). However, you can change this to a more descriptive name if you like.

Figure 2.36 Insert Ribbon with Table Button
39 39 39 39 Ch arts l T l Boat 5 Footer Line 39 ll ram Joann 1 too Elme 1 Qaoenooemao Figure 23 7 Table with Dropdown Arrows Next to Variable Names 11111 lie lmcn bl JilE LL LEE ll1 l lZI l 1 blism l L l M l ii l of l 1 l 3 5 if Genoa DrwneH nr1uEton iiiifu Hi 5tt 39EEIliIEliIla Heginna tate p 1391 Fiquotr5t u39roha lin1oulntspe l Soul 1 1 o D or 1 1aaaoo 1 1 12 smith Floriola JD1r39vi37I39ll i2iICJ i ilJBiquot39J3lquot23 H 2 2 o 1 1 o 1oa1oo a 3 1a Wlidwe5t lllinoi5 lZZhioago 525l392oo1 2532 4 F a 1 1 1 1 1 913 1 Na 12 smith MFInridam 3Drlalnlilo ailarzoor G Sail 1 v 1 1 1 1 azraeoo o 1 12 Eaat ohie Celvear39ul 12i392ar2oo1 3435 i 5 1 1 ii iii 1 5111 olnn 5 Midwast lllinonie Ieiiiaaga lfaiarzoor 105 A TE 2 El El iii 1 t12BlEl i El 2 12 West Jtrizona Phoevnix 3l39 l392lT lIl39l5v T395 Fl ill 1 2 o o oi 1 34oo nine 1EMidweot Llitanaazs lCarra5lCit39ir gtliil3911aquot2llil J 1515 S ill 1 1 El lEilZlrElEl El 3 El West Cai39iornia San Fgra noisoo Ea 1Fi392 rl1 1g 9Es5 8 an 1 1 41 of 1 aoa nine 24 Miawest lllinnit Chiizagor srreaiaoorl 321 11 pl 1o P 4 1 1 1 o a2a oo o a aaaautri Florida Clrlantlo arar2oo3 521544 Figure 238 Table Tools Design Ribbon H 39 1 lj IQ 2 39339 quot 3 quot E13 El E 39T5ali eTool Catalog l larlre tingl51c Microsoft Excel Home Insert Page Layout Fornwlfars Data Review View Deweioper Deslgn Table lllan39Ie Svummarizewith Pivot39l39able ii r Praprl iEr Header Row lli First Column TalIle1 E393 Flemo1re Duplicates x Open in E i3933939E 5Ei quotf Total Row C Last Column 7 t 5 Export Refresh H p Cir ResizeTable 1J CCif1lI39ElquotiItD Range 7 7 35 IJnIInl Banded Rows l Banded Columns Properties Tools External Table Data Table St39le Options 68 Chapter 2 Describing the Distribution of a SingeVariabe Copyright 2010 Cengage Leaming All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed 
content does not materially affect the overall leaming experience Cengage Leaming reserves the right to remove additional content at any time if subsequent rights restrictions require it One handy feature of Excel tables is that the variable names remain visible even when you scroll down the screen Try it to see how it works When you scroll down far enough that the variable names would disappear the column headers A B C and so on change to the variable names Therefore you no longer need to freeze panes or split the screen to see the variable names However this works only when the active cell is within the table If you click outside the table the column headers revert back to A B C and so on The dropdown arrows next to the variable names allow you to filter in many different ways For example click on the OwnHome dropdown list uncheck the Select All option and check the 1 option This filters out all customers except those who own their own home Filtering is discussed in much more detail later on but at this point just be aware that filtering does not delete any observations it only hides them There are three indica tions that the table has been filtered 1 the row numbers are colored blue and some are missing 2 a message appears at the bottom of the screen indicating that only 516 out of 1000 records are visible and 3 there is a filter icon next to the OwnHome dropdown arrow It is easy to remove this filter by opening the OwnHome dropdown list and select ing Clear Filter but don t do so yet As illustrated in Figure 238 there are various options you can apply to tables includ ing the following I A number of table styles are available for making the table attractive You can experi ment with these including the various table styles and table style options Note the dropdown list in the Table Styles group It gives you many more styles than the seven originally visible In particular at the top left of options there is a no color style you might prefer I In the Tools group 
you can click on Convert to Range This undesignates the range as a table and the dropdown arrows disappear I In the Properties group you can change the name of the table You can also click on the Resize Table button to expand or contract the table range The Total row in an I A particularly useful option is the Total Row in the Table Style Options group If Excel table SUm you check this a new row is appended to the bottom of the table see Figure 239 mcmzes only the It creates a sum formula in the rightmost column6 This sum includes only the non visible data The data that has been ltered out is ignored hidden rows To prove this to youself clear the OwnHome filter and check the sum It increases to 1216768 This total row is quite exible First you can summarize the last column by a number of summary measures such as Average Max Min Count and others To do so select cell 01002 and click on the dropdown list that appears Second you can summarize any other column in the table in the same way For example if you select cell G1002 a dropdown list appears for Salary and you can then summarize Salary with the same summarizing options Figure 239 Total Row 1 Children Hiatunr Catalogs Region State City FiFS tF39UrCl IEiSE muunE tSipEvnt 11 E1194 El 3 IE W1idw39E5t Ehi EiIt IiiIrIIi39Ia39li 1iIl39i39r23Jiquotf2 Zl39 I 113539 9 r1 1 x ml391quotlE llaif ii p E119 ti 2 12 Went W39a5t1ingttin Seattlei Ei quot1alJquotf2Ir39 EiI3 9991 p 11 at i i1 EIsst 0 51001 1 3 31 WE5 it Utah Salt Lake Ci 3quot39EiiquotT239 2quot IEi393l 1IlIl2 5The actual formula is SUBTOTAL109AmountSpent where 109 is a code for summing However you never need to type any such formula you can choose the summary function you want from the dropdown list 27 ExceTabes for Filtering Sorting and Summarizing 69 Copyright 2010 Cengage Leaming All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters 
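The behavior of the total row, which sums only the rows left visible by a filter, can be sketched outside Excel. The following is a minimal Python illustration, not Excel's implementation: the four records, their values, and the helper name `visible_total` are all hypothetical and are not taken from the HyTex data set.

```python
# Hypothetical records; the values are illustrative only, not the HyTex data.
customers = [
    {"OwnHome": 1, "AmountSpent": 1318},
    {"OwnHome": 0, "AmountSpent": 296},
    {"OwnHome": 1, "AmountSpent": 2436},
    {"OwnHome": 0, "AmountSpent": 495},
]


def visible_total(rows, column, keep=lambda row: True):
    """Sum `column` over the rows that pass the filter `keep`.

    Rows that fail the filter are skipped, not deleted, which mirrors
    how a filtered total row ignores hidden rows.
    """
    return sum(row[column] for row in rows if keep(row))


# Total with an OwnHome = 1 filter in place (hypothetical helper usage).
filtered_total = visible_total(customers, "AmountSpent", lambda r: r["OwnHome"] == 1)

# Total after the filter is cleared: every row is visible again.
unfiltered_total = visible_total(customers, "AmountSpent")
```

In this sketch the list `customers` is never modified, so clearing the filter simply means calling the helper without a condition, and the total grows just as the sum in the total row does when the OwnHome filter is cleared.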
Filtering is certainly possible without using Excel tables, but there are definitely advantages to filtering with Excel tables. Excel tables have a lot of built-in intelligence. Although there is not enough space here to give a full account, try the following to see what we mean.

- In cell R2 (or any cell in row 2 outside the table), enter a formula by typing an equals sign, pointing to cell O2, typing a divide sign (/), and pointing to cell G2. You do not get the usual formula =O2/G2. Instead, you get =Table1[[#This Row],[AmountSpent]]/Table1[[#This Row],[Salary]]. This is certainly not the Excel syntax you are used to, and it is pretty ugly, but it makes perfect sense.
- Similarly, you can expand the table with a new variable, such as the ratio of AmountSpent to Salary. Start by typing the variable name Ratio in cell P1. Then in cell P2, enter a formula exactly as you did in the previous bullet. You will notice two things. First, as soon as you enter the Ratio label, column P becomes part of the table. Second, as soon as you enter the new formula in one cell, it is copied to all of column P. This is what we mean by table intelligence.
- We saved the best for last. Excel tables expand automatically as new rows are added to the bottom or new columns are added to the right. (You saw this latter behavior in the previous bullet.) To appreciate the benefit of this, suppose you have a monthly time series data set. You designate it as a table and then build a line chart from it to show the time series behavior. Later on, if you add new data to the bottom of the table, the chart will automatically update to include the new data. This is a great feature. In fact, when we discuss pivot tables in the next chapter, we will recommend always basing them on tables, not ranges. Then they too will update automatically when new data is added to the table.

2.7.1 Filtering

We now discuss ways of filtering data sets, that is, finding records that match particular criteria. Before getting into details, there are two aspects of filtering you should be aware of. First, this section is concerned with the types of filters called AutoFilter in previous versions of Excel (2003 and earlier). The term AutoFilter implied that these were very simple filters, easily learned in a few minutes. If you wanted to do any complex filtering, you had to move beyond AutoFilter to Excel's Advanced Filter tool. Excel 2007 still has Advanced Filter. However, the term AutoFilter has been changed to Filter to indicate that these easy filters are now more powerful than the old AutoFilter. Fortunately, they are just as easy as AutoFilter.

Second, one way to filter is to create an Excel table, as indicated in the previous subsection. This automatically provides the dropdown arrows next to the field names that allow you to filter. Indeed, this is the way we will filter in this section: on an existing table. However, a designated table is not required for filtering. You can filter on any rectangular data set with variable names. There are actually three ways to do so. For each method, the active cell should be a cell inside the data set.

1. Use the Filter button from the Sort & Filter dropdown list on the Home ribbon.
2. Use the Filter button from the Sort & Filter group on the Data ribbon.
3. Right-click on any cell in the table and choose the Filter option. You get several options, the most popular of which is Filter by Selected Cell's Value. For example, if the selected cell has value 1 and is in the Children column, then only customers with a single child will remain visible. (This behavior should be familiar to Access users.)

The point is that Microsoft realizes how important filtering is to Excel users. Therefore, they have made filtering a very prominent and powerful tool in Excel 2007.
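The idea behind Filter by Selected Cell's Value, keeping visible only the records whose field matches a chosen value, can be expressed as a short Python sketch. This is only an illustration of the concept under stated assumptions: the rows, their values, and the function name `filter_by_value` are hypothetical, and nothing here reproduces Excel's own mechanism.

```python
# Hypothetical rows; field names follow the text, values are made up.
rows = [
    {"Children": 1, "Region": "East"},
    {"Children": 0, "Region": "West"},
    {"Children": 1, "Region": "Midwest"},
    {"Children": 3, "Region": "South"},
]


def filter_by_value(data, field, value):
    """Return the positions of rows whose `field` equals `value`.

    Only positions are returned; `data` itself is left untouched, in the
    same way that filtering hides rows without deleting them.
    """
    return [i for i, row in enumerate(data) if row[field] == value]


# Selecting a cell with value 1 in the Children column corresponds to:
visible = filter_by_value(rows, "Children", 1)
```

Returning row positions rather than a trimmed copy is a deliberate choice in this sketch: the missing positions play the role of the missing row numbers that signal a filtered range.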
As far as we can tell, the two main advantages of filtering on a table, as opposed to the three options just listed, are the nice formatting (banded rows, for example) provided by tables and, more importantly, the totals row. If this totals row is showing, it summarizes only the visible records; the hidden rows are ignored.

We now continue Example 2.7 to illustrate a number of filtering possibilities. Unlike some how-to Excel books, we won't lead you through a lot of descriptions and screen shots. Once you know the possibilities that are available, you should find them extremely easy to use.

EXAMPLE 2.7 HYTEX'S CUSTOMER DATA (CONTINUED)

[Margin note: The number of ways you can filter with Excel's newest tools is virtually unlimited. Don't be afraid to experiment. You can always clear filters to get back to where you started.]

The HyTex company wants to analyze its customer data by applying one or more filters to the data. It has already designated the data set as an Excel table. What types of filters might be useful?

Objective: To investigate the types of filters that might be applied to the HyTex data.

Solution

There is almost no limit to the filters you can apply, but here are a few possibilities.

- Filter on one or more values in a field. Click on the Catalogs dropdown arrow. You will see five checkboxes, all checked: Select All, 6, 12, 18, and 24. To select one or more values, uncheck Select All and then check any values you want to filter on, such as 6 and 24. In this case, only customers who received 6 or 24 catalogs will remain visible. (In Excel 2003 and earlier, it wasn't possible to select more than one value this way. Now it's easy.)
- Filter on more than one field. With the Catalogs filter still in place, create a filter on some other field, such as customers with one child. When there are filters on multiple fields, only records that meet all of the criteria are visible, in this case customers with one child who received 6 or 24 catalogs.
- Filter on a continuous numerical field. The Salary and AmountSpent fields are basically continuous fields, so it would not make much sense to filter on one or a few particular values. However, it does make sense to filter on ranges of values, such as all salaries greater than $75,000. This is easy. Click on the dropdown arrow next to Salary and select Number Filters. You will see a number of obvious possibilities, including Greater Than.
- Top 10 and Above/Below Average filters. Continuing the previous bullet, the Number Filters include Top 10, Above Average, and Below Average options. These are particularly useful if you like to see the highs and the lows. The Above Average and Below Average filters do exactly what their names imply. The Top 10 filter is actually more flexible than its name implies. It can be used to select the top n items (where you can choose n), the bottom n items, the top n percent of items, or the bottom n percent of items. Note that if a Top 10 filter is used on a text field, the ordering is alphabetical. If it is used on a date field, the ordering is chronological.
- Filter on a text field. If you click on the dropdown arrow for a text field such as Region, you can choose one or more of its values, such as East and South, to filter on. You can also select the Text Filters item, which provides a number of choices, including Begins With, Ends With, Contains, and others. For example, if there were an Address field, you could use the Begins With option to find all addresses that begin with PO Box.
- Filter on a date field. Excel 2007 has great built-in intelligence for filtering on dates. If you click on the FirstPurchase arrow, you will see an item for each year in the data set with plus signs next to them. By clicking on the plus signs, you can drill down to months and then days for as much control as you need. Figure 2.40 shows one possibility, where we have filtered out all dates except the last part of July 2007. In addition, if you click on the Date Filters item, you get a number of possibilities, such as Yesterday, Next Week, Last Month, and many others. There aren't many possibilities regarding dates that Microsoft hasn't included.

  [Figure 2.40: Filtering on a Date Variable]

- Filter on color or icon. Excel 2007 has many ways to color cells or put icons in cells. Often the purpose is to denote the sizes of the numbers in the cells, such as red for small numbers and green for large numbers. We won't cover the possibilities in this book, but you can experiment with Conditional Formatting on the Home ribbon. The point is that cells are often colored in certain ways or contain certain icons. Therefore, Excel 2007 allows you to filter on background color, font color, or icon. For example, if certain salaries are colored yellow, you can isolate them by filtering on yellow. (We are not sure how often this feature will be used, but it is available and easy to use.)
- Use a custom filter. If nothing else works, you can try a custom filter, available at the bottom of the Number Filters, Text Filters, and Date Filters lists. Figures 2.41 and 2.42 illustrate two possibilities. The first of these filters out all salaries between 25,000 and 75,000. Without a custom filter, this wouldn't be possible. The second uses the * wildcard to find regions ending in "est" (West and Midwest). Admittedly, this is an awkward way to perform this filter, but it indicates how flexible custom filters can be.

  [Figure 2.41: Custom Filter for Salary]

  [Figure 2.42: Custom Filter for Region]

We remind you once again that if you filter on an Excel table and you have summary measures in a total row at the bottom of the table, these summary measures are based only on the filtered data; they ignore the hidden rows.

One final comment about filters is that when you click on the dropdown arrow for any variable, you always get three items at the top for sorting, not filtering (see Figure 2.40, for example). These allow you to perform the obvious sorts, from high to low or vice versa, and they even allow you to sort on color. As with filtering, you do not need to designate an Excel table to perform sorting (the popular A-Z and Z-A buttons work just fine without tables), but sorting is made even easier with tables.

[Figure 2.43: Results from a Typical Filter]

Now that you know the possibilities, here is one particular filter you can try. Suppose HyTex wants information about all middle-aged, married customers with at least two children who have above-average salaries, own their own home, and live in Indiana or Kentucky. We imagine that you can run this filter in a few seconds. The result, sorted in decreasing order of AmountSpent and shown in Figure 2.43, indicates that the average salary for these 10 customers is $84,750, and their total amount spent
at HyTex is $14,709. (We summarized Salary by average and AmountSpent by sum in the totals row.)

PROBLEMS

Level A

33. The file P02_03.xlsx contains data from a survey of 399 people regarding an environmental policy. Use filters for each of the following.
a. Identify all respondents who are female, middle-aged, and have two children. What is the average salary of these respondents?
b. Identify all respondents who are elderly and strongly disagree with the environmental policy. What is the average salary of these respondents?
c. Identify all respondents who strongly agree with the environmental policy. What proportion of these individuals are young?
d. Identify all respondents who are either (1) middle-aged men with at least one child and an annual salary of at least $50,000, or (2) middle-aged women with two or fewer children and an annual salary of at least $30,000. What are the mean and median salaries of the respondents who meet these conditions? What proportion of the respondents who satisfy these conditions agree or strongly agree with the environmental policy?

34. The file P02_07.xlsx includes data on 204 employees at the (fictional) company Beta Technologies. Use filters for each of the following.
a. Identify all employees who are male and have exactly 4 years of post-secondary education. What is the average salary of these employees?
b. Find the average salary of all female employees who have exactly 4 years of post-secondary education. How does this mean salary compare to the one obtained in part a?
c. Identify all employees who have more than 4 years of post-secondary education. What proportion of these employees are male?
d. Identify all full-time employees who are either (1) females between the ages of 30 and 50 (inclusive) with at least 5 years of prior work experience, at least 10 years of prior work experience at Beta, and at least 4 years of post-secondary education; or (2) males between the ages of 40 and 60 (inclusive) with at least 6 years of prior work experience, at least 12 years of prior work experience at Beta, and at least 4 years of post-secondary education.
e. For those employees who meet the conditions specified in part d, compare the mean salary of the females with that of the males. Also compare the median salary of the female employees with that of the male employees.
f. What proportion of the full-time employees identified in part d earns less than $50,000 per year?

35. The file P02_35.xlsx contains (fictional) data from a survey of 500 randomly selected households. Use Excel filters to answer the following questions.
a. What are the average monthly home mortgage payment, average monthly utility bill, and average total debt (excluding the home mortgage) of all homeowners residing in the southeast sector of the city?
b. What are the average monthly home mortgage payment, average monthly utility bill, and average total debt (excluding the home mortgage) of all homeowners residing in the northwest sector of the city? How do these results compare to those found in part a?
approximately two years Use Excel filters to answer the following questions customers who are married customers who do not own a home c What proportion of these transactions are made by customers who have at least one child What proportion of these transactions are made by What proportion of these transactions are made by 37 The file P0235XlsX contains fictional data from a survey of 500 randomly selected households Use Excel filters to answer the following questions 3 Identify households that own their home and have a monthly home mortgage payment in the top quar tile of the monthly payments for all households Identify households with monthly expenditures on utilities that are within two standard deviations of the mean monthly expenditure on utilities for all households Identify households with total indebtedness excluding home mortgage less than 10 of the household s primary annual income level 28 CONCLUSION The summary measures charts and tables we have discussed in this chapter are extremely useful for describing variables in data sets We call the methods in this chapter and the next chapter exploratory methods because they allow you to explore the characteristics of the data and at least tentatively answer interesting questions Most of these tools have been avail able for many years but with the powerful software now accessible to virtually everyone the tools can be applied quickly and easily to gain insights We can promise that you will be using many if not all of these tools in your jobs Indeed the knowledge you gain from these early chapters is arguably the most valuable knowledge you will gain from the book To help you remember which analyses are appropriate for different questions and dif ferent data types and which tools are useful for performing the various analyses we have created the taxonomy in the file Data Analysis Taxonomyxlsx It doesn t t nicely on the printed page Feel free to refer back to the diagram in this file as you learn the tools in this 
chapter and the next chapter.

Summary of Key Terms

| Term | Explanation | Excel | Pages | Equation |
|---|---|---|---|---|
| Population | Includes all objects of interest in a study: people, households, machines, etc. | | 24 | |
| Sample | Representative subset of population, usually chosen randomly | | 24 | |
| Variable (or field) | Attribute or measurement of members of a population, such as height, gender, or salary | | | |
| Observation (or record or case) | List of all variable values for a single member of a population | | 25 | |
| Data set | Usually a rectangular array of data, with variables in columns, observations in rows, and variable names in the top row | | 25 | |
| Data type | Several categorizations are possible: numerical versus categorical, discrete versus continuous, cross-sectional versus time series; categorical can be nominal or ordinal | | 25 | |
| Dummy variable | A variable coded 1 or 0: 1 for observations in a category, 0 for observations not in the category | | 28 | |
| Binned or discretized variable | Numerical variable that has been categorized into discrete categories called bins | | 28 | |
| Counts of categories | Numbers of observations in various categories | COUNTIF function | 30 | |
| StatTools | Palisade add-in for data analysis in Excel | StatTools ribbon | 34 | |
| Mean | Average of observations | AVERAGE or StatTools | 35 | 2.1 |
| Median | Middle observation after sorting | MEDIAN or StatTools | 35 | |
| Mode | Most frequent observation | MODE | 35 | |
| Percentiles | Values that have specified percentages of observations below them | PERCENTILE or StatTools | 36 | |
| Quartiles | Values that have 25%, 50%, or 75% of observations below them | QUARTILE or StatTools | 36 | |
| Minimum | Smallest observation | MIN or StatTools | 37 | |
| Maximum | Largest observation | MAX or StatTools | 37 | |
| Concatenate | String together two or more pieces of text | & character or CONCATENATE | 38 | |
| Range | Difference between largest and smallest observations | MAX − MIN or StatTools | 38 | |
| Interquartile range (IQR) | Difference between first and third quartiles | QUARTILE functions or StatTools | 38 | |
| Variance | Measure of variability; essentially the average of squared deviations from the mean | VAR or VARP or StatTools | 38 | 2.2, 2.3 |
| Standard deviation | Measure of variability in same units as observations; square root of variance | STDEV or STDEVP or StatTools | 39 | |
| Empirical rules | Rules that specify approximate percentage of observations within one, two, or three standard deviations of mean for bell-shaped distributions | | 41 | |
| Mean absolute deviation (MAD) | Another measure of variability; average of absolute deviations from the mean | AVEDEV or StatTools | 42 | 2.4 |
| Skewness | When one tail of a distribution is longer than the other | SKEW or StatTools | 42 | |
| Kurtosis | Measure of "fatness" of tails of a distribution | KURT or StatTools | 42 | |
| Histogram | Chart of bin counts for a numerical variable; shows shape of the distribution | StatTools | 48 | |
| Frequency table | Contains counts of observations in specified categories | COUNTIF or FREQUENCY | 49 | |
| Box plots | Alternative chart that shows the distribution of a numerical variable | StatTools | 49 | |
| Time series graph | Graph showing behavior through time of one or more time series variables | StatTools | 48 | |
| Outlier | Observation that lies outside of the general range of observations in a data set | | 64 | |
| Missing values | Values that are not reported in a data set | | 65 | |
| Excel tables | Rectangular ranges specified as tables; especially useful for sorting and filtering | Table from Insert ribbon | 67 | |

PROBLEMS

Conceptual Questions

C.1 An airline analyst wishes to estimate the proportion of all American adults who are afraid to fly in light of the thwarted terrorist attack on a U.S. commercial airliner on December 25, 2009. To estimate this percentage, the analyst decides to survey 1500 Americans from across the nation. Identify the relevant sample and population in this situation.

C.2 The number of children living in each of a large number of randomly selected households is an example of which data type? Be specific.

C.3 Does it make sense to construct a histogram for the state of residence of randomly selected individuals in a sample? Explain why or why not.

C.4 Characterize the likely shape of a histogram of the distribution of scores on a midterm exam in a graduate statistics course.

C.5 A researcher is interested in determining whether there is a relationship between the number of room air-conditioning units sold each week and the time of year. What type of descriptive chart would be most useful in performing this analysis? Explain your choice.

C.6 Suppose that the histogram of a given income distribution is positively skewed. What does this fact imply about the relationship between the mean and median of this distribution?

C.7 "The midpoint of the line segment joining the first quartile and third quartile of any distribution is the median." Is this statement true or false? Explain your answer.

C.8 Explain why the standard deviation would likely not be a reliable measure of variability for a distribution of data that includes at least one extreme outlier.

C.9 Explain how a box plot can be used to determine whether the associated distribution of values is essentially symmetric.

C.10 Suppose that you collect a random sample of 250 salaries for the salespersons employed by a large PC manufacturer. Furthermore, assume that you find that two of these salaries are considerably higher than the others in the sample. In cleansing this data set, should you delete the unusual observations? Explain why or why not.

Level A

38. The file P02_35.xlsx contains (fictional) data from a survey of 500 randomly selected households.
a. Indicate the type of data for each of the variables included in the survey.
b. For each of the categorical variables in the survey, indicate whether the variable is nominal or ordinal. Explain your reasoning in each case.
c. Create a histogram for each of the numerical variables in this data set. Indicate whether each of these distributions is approximately symmetric or skewed. Which, if any, of these distributions are skewed to the right? Which, if any, are skewed to the left?
d. Find the maximum and minimum debt levels for the households in this sample.
e. Find the indebtedness levels at each of the 25th, 50th, and 75th percentiles.
f. Find and interpret the interquartile range for the indebtedness levels of these selected households.

39. The file P02_39.xlsx contains SAT test scores (two verbal components, a mathematical component, and the sum of these three) for each state and Washington DC in 2009. It also lists the percentage of high school graduates taking the test in each of the states.
a. Create a histogram for each of the numerical variables. Are these distributions essentially symmetric or are they skewed?
b. Compare the distributions of the average verbal scores and average mathematical scores. In what ways are these distributions similar? In what ways are they different?
c. Find the mean, median, and mode of the set of percentages taking the test.
d. For each of the numerical variables, which is the most appropriate measure of central tendency? Explain the reasoning behind your choice.
e. How does the mean of the Combined variable relate to the means of the Critical Reading, Math, and Writing variables? Is the same true for medians?

40. The Wall Street Journal CEO Compensation Study analyzed CEO pay from many U.S. companies with fiscal year revenue of at least [amount illegible in source] that [filed] their proxy statements between October 2008 and March 2009. The data are in the file P02_30.xlsx. (Note: This data set is a somewhat different CEO compensation data set from the one used as an example in the next chapter.)
a. Create histograms to gain a clearer understanding of the distributions of annual base salaries and bonuses earned by the surveyed CEOs in fiscal 2008. How would you characterize these histograms?
b. Find the annual salary below which 75% of all given CEO salaries fall.
c. Find the annual bonus above which 55% of all given CEO bonuses fall.
d. Determine the range of the middle 50% of all given total direct compensation figures. For the 50% of the executives that do not fall into this middle 50% range, is there more variability in total direct compensation to the right than to the left? Explain.

41. The file P02_41.xlsx contains monthly returns on Barnes and Noble stock for several years. As the formulas in the file indicate, each return is the percentage change in the adjusted closing price from one month to the next. Do monthly stock returns appear to be skewed or symmetric? On average, do they tend to be positive, negative, or zero?

42. The file P02_42.xlsx contains monthly returns on Mattel stock for several years. As the formulas in the file indicate, each return is the percentage change in the adjusted closing price from one month to the next. Create a histogram of these returns and summarize what you learn from it. On average, do the returns tend to be positive, negative, or zero?

43. The file P02_43.xlsx contains U.S. Bureau of Labor Statistics data on the year-to-year percentage changes in the wages and salaries of workers in private industries, including both white-collar and blue-collar occupations.
a. Create box plots to summarize these distributions of annual percentage changes. Comparing the box plots for white-collar and blue-collar workers, discuss the similarities or differences you see.
b. Given that these are time series variables, what information is omitted from the box plots? Are box plots even relevant?

44. The file P02_44.xlsx contains annual data on the percentage of Americans under the age of 18 living below the poverty level.
a. In which years of the sample has the poverty rate for American children exceeded the rate that defines the third quartile of these data?
b. In which years of the sample has the poverty rate for American children fallen below the rate that defines the first quartile of these data?
c. What is the typical poverty rate for American children during this period?
d. Create and interpret a time series graph for these data. How successful have Americans been recently in their efforts to win the war against poverty for the nation's children?
e. Given that this data set is a time series, discuss whether the measures requested in parts a-c are very meaningful at the current time.

Level B

45. The file P02_45.xlsx contains the salaries of 135 business school professors at a (fictional) large state university.
a. If you increased every professor's salary by $1000, what would happen to the mean
and median salary?
b. If you increased every professor's salary by $1000, what would happen to the sample standard deviation of the salaries?
c. If you increased every professor's salary by 5%, what would happen to the sample standard deviation of the salaries?

The file P0246.xlsx lists the fraction of US men and women of various heights and weights. Use these data to estimate the mean and standard deviation of the height of American men and women. (Hint: Assume all heights in a group are concentrated at the group's midpoint.) Do the same for weights.

Recall that the HyTex Company is a direct marketer of technical products and that the file Catalog Marketing.xlsx contains recent data on 1000 HyTex customers.
a. Identify all customers in the data set who are 55 years of age or younger, female, single, and who have had at least some dealings with HyTex before this year. Find the average number of catalogs sent to these customers and the average amount spent by these customers this year.
b. Do any of the customers who satisfy the conditions stated in part a have salaries that fall in the bottom 10% of all 1000 combined salaries in the data set? If so, how many?
c. Identify all customers in the sample who are more than 30 years of age or younger, male, homeowners, married, and who have had little if any dealings with HyTex before this year. Find the average combined household salary and the average amount spent by these customers this year.
d. Do any of the customers who satisfy the conditions stated in part c have salaries that fall in the top 10% of all 1000 combined salaries in the data set? If so, how many?

Recall that the file Baseball Salaries 2009.xlsx contains data on 818 MLB players as of May 2009. Using this data set, answer the following questions.
a. Find the mean and median of the
salaries of all shortstops. Are any of these measures influenced significantly by one or more unusual observations?
b. Find the standard deviation, first and third quartiles, and 5th and 95th percentiles for the salaries of all shortstops. Are any of these measures influenced significantly by one or more unusual observations?
c. Create a histogram of the salaries of all shortstops. Are any of these measures influenced significantly by one or more unusual observations?

In 1969 and again in 1970, a lottery was held to determine who would be drafted and sent to Vietnam in the following year. For each date of the year, a ball was put into an urn. For example, in the first lottery, January 1 was number 305 and February 14 was number 4. Thus a person born on February 14 would be drafted before a person born on January 1. The file P0249.xlsx contains the draft number for each date for the two lotteries. Do you notice anything unusual about the results of either lottery? What do you think might have caused this result? (Hint: Create a box plot for each month's numbers.)

The file P0250.xlsx contains the average price of gasoline in each of the 50 states. (Note: You will need to manipulate the data to some extent before performing the analyses requested below.)
a. Compare the distributions of gasoline price data (one for each year) across states. Specifically, do you find the mean and standard deviation of these distributions to be changing over time? If so, how do you explain the trends?
b. In which regions of the country have gasoline prices changed the most?
c. In which regions of the country have gasoline prices remained relatively stable?

The file P0251.xlsx contains data on US home ownership rates.
a. Employ numerical summary measures to characterize the changes in homeownership rates across the country during this period.
b. Do the trends appear to be uniform across the US, or are they unique to certain regions of the country? Explain.

Recall that the HyTex Company is a direct marketer of technical products and that
the file Catalog Marketing.xlsx contains recent data on 1000 HyTex customers.
a. Identify all customers who are either (1) homeowners between the ages of 31 and 55 who live reasonably close to a shopping area that sells similar merchandise and have a combined salary between $40,000 and $90,000 (inclusive) and a history of being a medium or high spender at HyTex, or (2) homeowners greater than the age of 55 who live reasonably close to a shopping area that sells similar merchandise and have a combined salary between $40,000 and $90,000 (inclusive) and a history of being a medium or high spender at HyTex.
b. Characterize the subset of customers who satisfy the conditions specified in part a. In particular, what proportion of these customers are women? What proportion of these customers are married? On average, how many children do these customers have? Finally, how many catalogs do these customers typically receive, and how much do they typically spend each year at HyTex?
c. In what ways are the customers who satisfy condition 1 in part a different from those who satisfy condition 2 in part a? Be specific.

Recall that the file Supermarket Transactions.xlsx contains data on over 14,000 transactions. There are two numerical variables, Units Sold and Revenue. The first of these is discrete and the second is continuous. For each of the following, do whatever it takes to create a bar chart of counts for Units Sold and a histogram of Revenue for the given subpopulation of purchases.
a. All purchases made during January and February of 2008
b. All purchases made by married female homeowners
c. All purchases made in the state of California
d. All purchases made in the Produce product department

The file P0254.xlsx contains daily values of an EPA air quality index in Washington DC and Los Angeles from January 1980 through April 2009. For some unknown reason, the source provides slightly different dates for the two cities.
a. Starting in column G, create three new columns: Date, Wash DC Index, and LA Index. Fill the new date column with all dates from 1/1/1980 to 4/30/2009. Then use lookup functions to fill in the two new index columns, entering the observed index if available or a blank otherwise. (Hint: Use a combination of the VLOOKUP function with False as the last argument and the IFERROR function. Look up the latter in online help if you have never seen it before.)
b. Create a separate time series graph of each new index column. Because there are so many dates, it is difficult to see how the graph deals with missing data, but see if you can determine this (maybe by expanding the size of the graph or trying a smaller example). In spite of the few missing points, explain the patterns in the graphs and how Washington DC compares to Los Angeles. (Note: StatTools will not let you create a time series graph with missing data in the middle of the series, but you can create a line chart manually in Excel, without StatTools.)

55. The file P0255.xlsx contains monthly sales (in millions of dollars) of beer, wine, and liquor. The data have not been seasonally adjusted, so there might be seasonal patterns that can be discovered. For any month in any year, define that month's seasonal index as the ratio of its sales value to the average sales value over all months of that year.
a. Calculate these seasonal indexes, one for each month in the series. Do you see a consistent pattern from year to year? If so, what is it?
b. To deseasonalize the data and get the seasonally adjusted series often reported, divide each monthly sales value by the corresponding seasonal index from part a. Then create a
time series graph of both series, the actual sales and the seasonally adjusted sales. Explain how they are different and why the seasonally adjusted series might be of interest.

The file P0256.xlsx contains monthly values of indexes that measure the amount of energy necessary to heat or cool buildings due to outside temperatures. (See the explanation in the Source sheet of the file.) These are reported for each state in the US and also for several regions, as listed in the Locations sheet, from 1931 to 2000. Create summary measures and/or charts to see whether there is any indication of temperature changes (global warming?) through time, and report your findings.

The file P0257.xlsx contains data on mortgage loans in 2008 for each state in the US. The file is different from similar ones in this chapter in that each state has its own sheet with the same data laid out in the same format. Each state sheet breaks down all mortgage applications by loan purpose, applicant race, loan type, outcome, and denial reason (for those that were denied). The question is how a single data set for all states can be created for analysis. The Typical Data Set sheet indicates a simple way of doing this, using the powerful but little-known INDIRECT function. This sheet is basically a template for bringing in any pieces of data from the state sheet you would like to examine.
a. Create histograms and summary measures for the example data given in the Typical Data Set sheet and write a short report on your findings.
b. Create a copy of the Typical Data Set sheet and repeat part a on this copy for at least one other set of variables (of your choice) from the state sheets.

CASE 2.1 CORRECT INTERPRETATION OF MEANS

A mean, as defined in this chapter, is a pretty simple concept: it is the average of a set of numbers. But even this simple concept can cause confusion if you aren't careful. The data in Table 2.1 are typical of data presented by marketing researchers for a type of product, in this case beer. Each value is an average of the number of six-packs of beer purchased per customer during a month. For the individual brands, the value is the average only for the customers who purchased at least one six-pack of that brand. For example, the value for Miller is the average number of six-packs purchased of all of these brands for customers who purchased at least one six-pack of Miller. In contrast, the Any average is the average number of six-packs purchased of these brands for all customers in the population.

Is there a paradox in these averages? On first glance, it might appear unusual, or even impossible, that the Any average is less than each brand average. Make up your own (small) data set, where you list a number of customers, along with the number of six-packs of each brand of beer each customer purchased, and calculate the averages for your data that correspond to those in Table 2.1. Do you get the same result (that the Any average is lower than all of the others)? Are you guaranteed to get this result? Does it depend on the amount of brand loyalty in your population, where brand loyalty is greater when customers tend to stick to the same brand, rather than buying multiple brands? Write up your results in a concise report.

Table 2.1 Average Beer Purchases (table contents not recoverable)

CASE 2.2 THE DOW JONES INDUSTRIAL AVERAGE

The monthly closing values of the Dow Jones Industrial Average (DJIA) for the period beginning in January 1950 are given in the file DJIA Monthly Close.xlsx. According to Wikipedia (http://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average), the Dow Jones Industrial Average, also referred to as the Industrial Average, the Dow Jones, the Dow 30, or simply the Dow, is one of several stock market indices created by Wall Street Journal editor and Dow Jones & Company cofounder Charles Dow. The average is named after Dow and one of his business associates, statistician Edward Jones. It is an index that shows how 30 large, publicly owned companies based in the US have traded during a standard trading session in the stock market. It is the second oldest US market index after the Dow Jones Transportation Average, which Dow also created.

Currently, Dow Jones & Company, which regularly publishes the index, is a subsidiary of News Corporation. The Industrial portion of the name is largely historical, as many of the modern 30 components have little or nothing to do with traditional heavy industry. The average is price-weighted, and to compensate for the effects of stock splits and other adjustments, it is currently a scaled average. The value of the Dow is not the actual average of the prices of its component stocks, but rather the sum of the component prices divided by a divisor, which changes whenever one of the component stocks has a stock split or stock dividend, so as to generate a consistent value for the index.

Along with the NASDAQ Composite, the S&P 500 Index, and the Russell 2000 Index, the Dow is among the most closely watched benchmark indices tracking targeted stock market activity.
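The divisor mechanics described in the quoted passage can be illustrated with a minimal sketch. The three prices and the starting divisor below are made-up values for illustration, not actual Dow components, and the function names are hypothetical. The sketch assumes the divisor is rescaled so that the index value is unchanged at the moment of a split, which is the stated purpose of the adjustment.

```python
def price_weighted_index(prices, divisor):
    """Index value: sum of the component prices divided by the divisor."""
    return sum(prices) / divisor


def divisor_after_split(prices_before, prices_after, old_divisor):
    """New divisor chosen so the index is the same just before and just after a split."""
    return old_divisor * sum(prices_after) / sum(prices_before)


# Hypothetical three-stock index (illustrative prices only).
prices = [100.0, 50.0, 30.0]
divisor = 3.0
index_before = price_weighted_index(prices, divisor)

# The first stock splits 2-for-1, so its price is halved.
split_prices = [50.0, 50.0, 30.0]
new_divisor = divisor_after_split(prices, split_prices, divisor)
index_after = price_weighted_index(split_prices, new_divisor)

print(index_before, new_divisor, index_after)
```

Without the adjustment, the split alone would lower the index even though no company's market value changed. Rescaling the divisor removes that artificial drop, which is why the divisor gets smaller after a split.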
Although Dow compiled the index to gauge the performance of the industrial sector within the American economy, the index's performance continues to be influenced not only by corporate and economic reports, but also by domestic and foreign political events such as war and terrorism, as well as by natural disasters that could potentially lead to economic harm. Components of the Dow trade on both the NASDAQ OMX and the NYSE Euronext, two of the largest stock exchanges. Derivatives of the Dow trade on the Chicago Board Options Exchange and through the CME Group, the world's largest futures exchange company.

Using the summary measures and graphical tools from this chapter, analyze this important time series over the given period. Summarize in detail the behavior of the monthly closing values of the Dow and the associated monthly percentage changes in the closing values of the Dow.

CASE 2.3 HOME AND CONDO PRICES

The file Home Price Index.xlsx contains an index of home prices and a seasonally adjusted (SA) version of this index for several large US cities. It also contains a condo price index for several large cities and a national index. (The data are explained in the Source sheet.) Use the tools in this chapter to make sense out of these data and write a report of your findings. Some important questions you can answer are the following: Are there trends over time? Are there differences across cities? Are there differences across months? Do condo prices mirror home prices? Why are seasonally adjusted indexes published?

CHAPTER 3

PREDICTORS OF SUCCESSFUL MOVIES

The movie industry is a high-profile industry with a highly variable revenue stream. In 1998, US moviegoers spent close to $7 billion at the box office alone. Ten years later, the figure was slightly higher, despite the number of people watching DVDs at home. With this much money at stake, it is not surprising that movie studios are interested in knowing what variables are useful for predicting a movie's financial success. The article by Simonoff and Sparrow (2000) examines this issue for 311 movies released in 1998 and late 1997. (They obtained their data from a public Web site, www.imdb.com.) Although it is preferable to examine movie profits, the costs of making movies are virtually impossible to obtain. Therefore, the authors focused instead on revenues, specifically the total US domestic gross revenue for each film.

Simonoff and Sparrow obtained prerelease information on a number of variables that were thought to be possible predictors of gross revenue. (Prerelease means that this information is known about a film before the film is actually released.) These variables include:
(1) the genre of the film, categorized as action, children's, comedy, documentary, drama, horror, science fiction, or thriller;
(2) the Motion Picture Association of America (MPAA) rating of the film, categorized as G (general audiences), PG (parental guidance suggested), PG-13 (possibly unsuitable for children under 13), R (children not admitted unless accompanied by an adult), NC-17 (no one under 17 admitted), or U (unrated);
(3) the country of origin of the movie, categorized as United States, English-speaking but non-United States, or non-English-speaking;
(4) number of actors and actresses in the movie who were listed in Entertainment Weekly's lists of the 25 Best Actors and 25 Best Actresses, as of 1998;
(5) number of actors and actresses in the movie who were among the top 20 actors and top 20 actresses in average box office gross per movie in their careers;
(6) whether the movie was a sequel;
(7) whether the movie was released before a holiday weekend;
(8) whether the movie was released during the Christmas season; and
(9) whether the movie was released during the summer season.

To get a sense of whether these variables are related to gross revenue, we could calculate a lot of summary measures and create numerous tables. However, we agree with Simonoff and Sparrow that the information is best presented in a series of side-by-side box plots. (See Figure 3.1.) These box plots are slightly different from the versions introduced in the previous chapter, but they accomplish exactly the same thing. There are two differences. First, their box plots are vertical; ours are horizontal. Second, their box plots capture an
extra piece of information: the widths of their boxes are proportional to the square roots of the sample sizes, so that wide boxes correspond to categories with more movies. In contrast, the heights of our boxes carry no information about sample size. Basically, each box and the lines and points extending above and below it indicate the distribution of gross revenues for any category. Remember that the box itself, from bottom to top, captures the middle 50% of the revenues in the category, the line in the middle of the box represents the median revenue, and the lines and dots indicate possible skewness and outliers.

These particular box plots indicate some interesting and possibly surprising information about the movie business. First, almost all of the box plots indicate a high degree of variability and positive skewness, where there are a few movies that gross extremely large amounts compared to the typical movies in the category. Second, genre certainly makes a difference. There are more comedies and dramas (wider boxes), but they typically gross considerably less than action, children's, and science fiction films. Third, the same is true of R-rated movies compared to movies rated G, PG, or PG-13: there are more of them, but they typically gross much less. Fourth, US movies do considerably better than foreign movies. Fifth, it helps to have stars, although there are quite a few sleepers that succeed without having big-name stars. Sixth, sequels do better, presumably reflecting the success of the earlier films. Finally, the release date makes a big difference. Movies released before holidays, during the Christmas season, or during the summer season tend to have larger gross revenues. Indeed, as Simonoff and Sparrow discuss, movie studios compete fiercely for the best release dates.

Are these prerelease variables sufficient to predict gross revenues accurately? As you might expect from the amount of variability in most of the box plots in Figure 3.1, the answer is no. Many intangible factors evidently determine the
ultimate success of a movie, so that some, such as There's Something About Mary, do much better than expected, and others, such as Godzilla, do worse than expected. We will revisit this movie data set in the opener to a later chapter. There you will see how Simonoff and Sparrow use multiple regression to predict gross revenue, with limited success.

Figure 3.1 Box Plots of Domestic Gross Revenues for 1998 Movies (figure not reproduced; panels: Genre of movie, MPAA rating, Country, Number of best actors, Number of top dollar actors, Sequel, Holiday release, Summer release, Christmas release)

3.1 INTRODUCTION

In the previous chapter we introduced a number of summary measures, graphs,
and tables to describe the distribution of a single variable. For a variable such as baseball salary, our entire focus was on how salaries were distributed over some range. This is an important first step in any exploratory data analysis, to look closely at variables one at a time, but it is almost never the last step. We are almost always interested in relationships between variables. For example, it is natural to ask what drives baseball salaries. Does it depend on qualitative factors, such as the player's team or position? Does it depend on quantitative factors, such as the number of hits the player gets or the number of strikeouts? To answer these questions, we have to examine relationships between various variables and salary.

A key issue in this chapter is that different tools should be used to examine relationships, depending on whether the variables involved are numerical or categorical.

In this chapter we will again discuss several numerical summary measures, graphs, and tables, but they will now involve at least two variables at a time. The most useful numerical summary measure is correlation, a measure that applies primarily to numerical variables. The most useful graph is a scatterplot, which again applies primarily to numerical variables. For relationships involving categorical variables, we will introduce other tools. For example, to break down a numerical variable by a categorical variable, as in the chapter opener with movie gross revenues, it is often useful to create side-by-side box plots as in Figure 3.1. Finally, we will
introduce Excel's arguably most powerful tool, pivot tables. Pivot tables allow you to break down one variable by others so that all sorts of relationships can be uncovered in a matter of minutes.

As you read this chapter, remember that the diagram in the file Data Analysis Taxonomy.xlsx is available. This diagram gives you the big picture of which analyses are appropriate for which data types and which tools are best for performing the various analyses.

3.2 RELATIONSHIPS AMONG CATEGORICAL VARIABLES

Use a crosstabs, a table of counts of joint categories, to discover relationships between two categorical variables.

Consider a data set with at least two categorical variables, Smoking and Drinking. Each person is categorized into one of three smoking categories: nonsmoker (NS), occasional smoker (OS), and heavy smoker (HS). Similarly, each person is categorized into one of three drinking categories: nondrinker (ND), occasional drinker (OD), and heavy drinker (HD). We want to examine whether smoking and drinking habits are related. For example, do nondrinkers tend to be nonsmokers? Do heavy smokers tend to be heavy drinkers?

As we discussed in the previous chapter, the most meaningful way to describe a categorical variable is with counts, possibly expressed as percentages, and corresponding charts of the counts. The same is true of examining relationships between two categorical variables. We can find the counts of the categories of either variable separately, and more importantly, we can find counts of the joint categories of the two variables, such as the count of all nondrinkers who are also nonsmokers. Again, corresponding percentages and charts help tell the story.

It is customary to display all such counts in a table called a crosstabs (for cross-tabulations). This is also sometimes called a contingency table. We illustrate these tables in the following example.

EXAMPLE 3.1 RELATIONSHIP BETWEEN SMOKING AND DRINKING

The file Smoking Drinking.xlsx lists the smoking and drinking habits of 8761 adults. This is not real
data. The categories have been coded so that N, O, and H stand for Non, Occasional, and Heavy, and S and D stand for Smoker and Drinker. Is there any indication that smoking and drinking habits are related? If so, how are they related?

Objective: To use a crosstabs to explore the relationship between smoking and drinking.

Solution

The first question is the data format. If you are lucky, you will be given a table of counts. However, it is also possible that you will have to create these counts. In the file for this example, the data are in long columns, a small part of which is shown in Figure 3.2. Presumably, there could be other variables describing these people, but we are interested only in the Smoking and Drinking variables.

Figure 3.2 Smoking and Drinking Data

      A        B         C
 1    Person   Smoking   Drinking
 2    1        NS        OD
 3    2        NS        HD
 4    3        OS        HD
 5    4        HS        ND
 6    5        NS        OD
 7    6        NS        ND
 8    7        NS        OD
 9    8        NS        ND
10    9        OS        HD
11    10       HS        HD

Figure 3.3 Headings for Crosstabs

To create the crosstabs, start by entering the category headings in Figure 3.3. The goal is to fill in the box with counts of joint categories, along with totals as row and column sums. If you are thinking about using the COUNTIF function to obtain the joint counts, you are close. Unfortunately, the COUNTIF function lets you specify only a single criterion, but there are now two criteria, one for smoking and one for drinking. Fortunately, Excel has a function (new to Excel 2007) designed exactly for this: COUNTIFS. It allows you to specify any number of range-criterion pairs. In fact, you can
fill in the entire table with a single formula entered in cell F4 and copied to the range F4H6 COUNTIFSB2B8762F3C2C8762E4 The first two arguments are for the condition on smoking the last two are for the condition on drinking You can then sum across rows and down columns to get the totals E F G Crosstabs from COUNTIFS formulas NS OS ND OD HD Total The resulting counts appear in the top table in Figure 34 For example among the 8761 people 4912 are nonsmokers 2365 are heavy drinkers and 733 are nonsmokers and heavy drinkers Because the totals are far from equal there are many more non smokers than heavy smokers for example any relationship between smoking and 32 Relationships among CategoricaVariabes 89 Copyright 2010 Cengage Leaming All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall leaming experience Cengage Leaming reserves the right to remove additional content at any time if subsequent rights restrictions require it Figure 34 Crosstabs of Smoking and Drinking Relationships between the two variables are usually more evident when the counts are expressed as percentages of row or percentages of column EIF G HI 1 Crosstabs from COUNTIFS formulas 2 3 NS 05 HS Total 4 ND 2118 435 163 2716 5 OD 2061 1067 552 3680 6 HD 733 899 733 2365 7 Total 4912 2401 1448 8761 8 9 Shown as percentages of row 10 NS 05 HS Total 11 ND 780 160 60 1000 12 OD 560 290 150 1000 13 HD 310 380 310 1000 14 15 Shown as percentages of column 16 NS 05 HS 17 ND 431 181 113 18 OD 420 444 381 19 HD 149 374 506 20 Total 1000 1000 1000 drinking is difficult to detect in these raw counts Therefore it is useful to express the counts as percentages of row in the middle table and as percentages of column in the bottom table The latter two tables indicate in complementary ways that there 
is definitely a relationship between smoking and drinking. If there were no relationship, the rows in the middle table would be practically identical, as would the columns in the bottom table. (Convince yourself why this is true.) But they are far from identical. For example, the middle table indicates that only 6% of the nondrinkers are heavy smokers, whereas 31% of the heavy drinkers are heavy smokers. Similarly, the bottom table indicates that 43.1% of the nonsmokers are nondrinkers, whereas only 11.3% of the heavy smokers are nondrinkers. In short, these tables indicate that smoking and drinking habits tend to go with one another. These tendencies are reinforced by the column charts of the two percentage tables in Figure 3.5.

[Figure 3.5: Column Charts of Smoking and Drinking Percentages. Left panel: percentages of row. Right panel: percentages of column.]

FUNDAMENTAL INSIGHT

Counts Versus Percentages

There is no single correct way to display the data in a contingency table. Ultimately, the data are always counts, but they can be shown as raw counts, percentages of row totals, percentages of column totals, or even percentages of the overall total. However, when you are looking for relationships between two categorical variables, showing the counts as percentages of row totals or percentages of column totals usually makes any relationships stand out more clearly. Corresponding charts are also very useful.

Excel Tip: It takes almost no
work to create these charts. To get the one on the left, highlight the range E10:H13 and insert a column chart from the Insert ribbon. Do the same with the range E16:H19 to get the chart on the right, except that it will have smoking on the horizontal axis and drinking in the legend. To reverse their roles, simply click on Switch Row/Column from the Chart Tools Design ribbon.

Although this example illustrates that it doesn't take too much work to create crosstabs and corresponding charts, you will see a much quicker and easier way when pivot tables are discussed later in this chapter.

PROBLEMS

Note: Student solutions for problems whose numbers appear within a colored box are available for purchase at www.cengagebrain.com.

Level A

1. The file P02_01.xlsx indicates the gender and nationality of the MBA incoming class in two successive years at the Kelley School of Business at Indiana University.
   a. For each year separately, recode Nationality so that all nationalities with a count of 1 or 2 are listed as Other.
   b. For each year, create a crosstabs of Gender versus the recoded Nationality and an associated column chart. Does there seem to be any relationship between Gender and the recoded Nationality? Is the pattern about the same in the two years?

2. The file P02_03.xlsx contains data from a survey of 399 people regarding a government environmental policy.
   a. Create a crosstabs and an associated column chart for Gender versus Opinion. Express the counts as percentages so that for either gender, the percentages add to 100. Discuss your findings. Specifically, do the two genders tend to differ in their opinions about the environmental policy?
   b. Repeat part a with Age versus Opinion.
   c. Recode Salary to be categorical with categories Less than 40K, Between 40K and 70K, Between 70K and 100K, and Greater than 100K (where you can treat the breakpoints however you like). Then repeat part a with this new Salary variable versus Opinion.

3. The file P02_02.xlsx contains data about 211 movies released in 2006 and 2007.
   a. Recode
Distributor so that all distributors except for Paramount Pictures, Buena Vista, Fox Searchlight, Universal, Warner Bros., 20th Century Fox, and Sony Pictures are listed as Other. (Those in Other released fewer than 16 movies.) Similarly, recode Genre so that all genres except for Action, Adventure, Thriller/Suspense, Drama, and Comedy are listed as Other. (Again, those in Other are genres with fewer than 16 movies.)
   b. Create a crosstabs and an associated column chart for these two recoded variables. Express the counts as percentages so that for any distributor, the percentages add to 100. Discuss your findings.

4. Recall from Chapter 2 that the file Supermarket Transactions.xlsx contains over 14,000 transactions made by supermarket customers over a period of approximately two years. To understand which customers purchase which products, create a crosstabs and an associated column chart for each of the following. For each, express the counts as percentages so that for any value of the first variable listed, the percentages add to 100. Do any patterns stand out?
   a. Gender versus Product Department
   b. Marital Status versus Product Department
   c. Annual Income versus Product Department

5. Recall from Chapter 2 that the HyTex Company is a direct marketer of technical products and that the file Catalog Marketing.xlsx contains recent data on 1000 HyTex customers. To understand these customers, first recode Salary and AmountSpent as indicated in part a, and then create each of the requested crosstabs and an associated column chart in parts b to e. Express each count as a percentage, so that for any value of the first variable listed, the percentages add to 100. Do any patterns stand out?
   a. Find the first, second, and third quartiles of Salary, and then recode Salary as 1 to 4, depending on which quarter of the data each value falls into. For example, the first salary, 16400, is recoded as 1 because 16400 is less than the first quartile, 29975. Recode AmountSpent similarly, based on its quartiles. (Hint: The recoding can be done most easily with lookup tables.)
   b. Age versus the recoded AmountSpent
   c. OwnHome versus the recoded AmountSpent
   d. History versus the recoded AmountSpent
   e. The recoded Salary versus the recoded AmountSpent

Level B

6. In the smoking/drinking example in this section, we used the COUNTIFS function (new to Excel 2007) to find the counts of the joint categories. Without using this function or pivot tables, devise another way to get the counts. The raw data are in the file Smoking Drinking.xlsx. (Hint: One possibility is to concatenate the values in columns B and C into a new column D. But feel free to find the counts in any way you like.)

3.3 RELATIONSHIPS AMONG CATEGORICAL VARIABLES AND A NUMERICAL VARIABLE

The comparison problem, where a numerical variable is compared across two or more subpopulations, is one of the most important problems faced by data analysts in all fields of study.

This section describes a very common situation where you are interested in a numerical variable such as salary, and you would like to break it down by category of some categorical variable such as gender. This is precisely what pivot tables were built for, as we will discuss later in the chapter, but for now we will discuss the numerical and graphical tools offered by StatTools to explore this problem. This general problem, typically referred to as the comparison problem, is one of the most important problems in data analysis. It occurs whenever you want to compare a numerical measure across two or more subpopulations. Here are some examples:

- The
subpopulations are males and females, and the numerical measure is salary.
- The subpopulations are different regions of the country, and the numerical measure is the cost of living.
- The subpopulations are different days of the week, and the numerical measure is the number of customers going to a particular fast-food chain.
- The subpopulations are different machines in a manufacturing plant, and the numerical measure is the number of defective parts produced per day.
- The subpopulations are patients who have taken a new drug and those who have taken a placebo, and the numerical measure is the recovery rate from a particular disease.
- The subpopulations are undergraduates with various majors (business, English, history, and so on), and the numerical measure is the starting salary after graduating.

The list could go on and on. Our discussion of the comparison problem begins in this chapter, where we use exploratory methods to investigate whether there appear to be differences across the subpopulations on the numerical variable of interest. In later chapters we will use inferential methods (confidence intervals and hypothesis tests) to see whether the differences we see in samples from the subpopulations can be generalized to the subpopulations as a whole.

FUNDAMENTAL INSIGHT

Breaking Down By Category

There is arguably no more powerful data analysis technique than breaking down a numerical variable by category. The methods in this chapter, especially side-by-side box plots and pivot tables, get you started with this general comparison problem. They allow you to see quickly, with charts and/or summary measures, how two or more categories compare. More sophisticated techniques for comparing across categories will be covered in later chapters.

3.3.1 Stacked and Unstacked Formats

The stacked format is by far the most common. There are one or more long numerical variables and another long variable that specifies which category each observation is in.

We begin by discussing two possible data formats you will see, stacked and unstacked. This concept is crucial for understanding how StatTools deals with comparison problems. Consider salary data on males and females. (There could be other variables in the data set, but we will ignore them for now.) Then the data are stacked if there are two long variables, Gender and Salary, as indicated in Figure 3.6. The idea is that the male salaries are stacked in with the female salaries. This is the format you will see in the vast majority of situations. However, you will occasionally see data in unstacked format, as shown in Figure 3.7. (Note that both tables list exactly the same data. See the file Stacked Unstacked Data.xlsx.) Now there are two short variables, Female Salary and Male Salary. In addition, it is very possible that the two variables have different lengths. This is the case here because there are more females than males.

Figure 3.6 Stacked Data

| | A | B |
|---|---|---|
| 1 | Gender | Salary |
| 2 | Male | 81600 |
| 3 | Female | 61600 |
| 4 | Female | 64300 |
| 5 | Female | 71900 |
| 6 | Male | 76300 |
| 7 | Female | 68200 |
| 8 | Male | 60900 |
| 9 | Female | 78600 |
| 10 | Female | 81700 |
| 11 | Male | 60200 |
| 12 | Female | 69200 |
| 13 | Male | 59000 |
| 14 | Male | 68600 |
| 15 | Male | 51900 |
| 16 | Female | 64100 |
| 17 | Male | 67600 |
| 18 | Female | 81100 |
| 19 | Female | 77000 |
| 20 | Female | 58800 |
| 21 | Female | 87800 |
| 22 | Male | 78900 |
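To make the two layouts concrete, here is a minimal Python sketch that converts between them. It is not part of the Excel/StatTools workflow described in this chapter. It uses the first six rows of the stacked data in Figure 3.6, and the function names `unstack` and `stack` are illustrative choices of our own, not StatTools utilities.

```python
from collections import defaultdict

# First six (Gender, Salary) rows of the stacked data in Figure 3.6.
stacked = [
    ("Male", 81600),
    ("Female", 61600),
    ("Female", 64300),
    ("Female", 71900),
    ("Male", 76300),
    ("Female", 68200),
]


def unstack(rows):
    """Turn (category, value) rows into one list of values per category."""
    columns = defaultdict(list)
    for category, value in rows:
        columns[category].append(value)
    return dict(columns)


def stack(columns):
    """Turn per-category lists back into (category, value) rows."""
    return [
        (category, value)
        for category, values in columns.items()
        for value in values
    ]


unstacked = unstack(stacked)
for category, values in unstacked.items():
    print(category, values)
```

The unstacked columns may have different lengths, and stacking them again recovers the same rows, although not necessarily in the original order.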
Figure 3.7 Unstacked Data

| | A | B |
|---|---|---|
| 1 | Female Salary | Male Salary |
| 2 | 61600 | 81600 |
| 3 | 64300 | 76300 |
| 4 | 71900 | 60900 |
| 5 | 68200 | 60200 |
| 6 | 78600 | 59000 |
| 7 | 81700 | 68600 |
| 8 | 69200 | 51900 |
| 9 | 64100 | 67600 |
| 10 | 81100 | 78900 |
| 11 | 77000 | |
| 12 | 58800 | |
| 13 | 87800 | |

StatTools is capable of dealing with either stacked or unstacked format. (Not all statistical software can make this claim. Some packages require stacked format.) Nevertheless, there are a few times when you might want to convert from stacked to unstacked format or vice versa. StatTools has utilities for doing this. These utilities are found on the Data Utilities (not the Utilities) dropdown list on the StatTools ribbon. They are very simple to use, and we suggest that you try them on the data in Figures 3.6 and 3.7. (If you need help, open the finished version of the Stacked Unstacked Data.xlsx file, which includes instructions for using these data utilities.)

We now return to the baseball data to see which, if any, of the categorical variables makes a difference in player salaries.

EXAMPLE 3.2 BASEBALL SALARIES

The file Baseball Salaries 2009 Extra.xlsx contains the same 2009 baseball data examined in the previous chapter. In addition, several extra categorical variables are included:

- Pitcher (Yes for all pitchers, No for the others)
- League (American or National)
- Division (National West, American East, and so on)
- Yankees (Yes if team is New York Yankees, No otherwise)
- Playoff Team 2008 (Yes for the eight teams that made it to the playoffs, No for the others)
- World Series Team 2008 (Yes for Philadelphia Phillies and Tampa Bay Rays, No for others)

Do pitchers (or any other positions) earn more than others? Does one league pay more than the other, or do any divisions pay more than others? How does the notoriously high Yankees payroll compare to the others? Do the successful teams from 2008 tend to have larger
2009 payrolls?

Objective: To learn methods in StatTools for breaking down baseball salaries by various categorical variables.

Solution

We first look at some numerical summary measures for salary. These are the same summary measures from the previous chapter, but now we want to break them down by position. Fortunately, StatTools makes this easy. (Imagine how you would have to do it without an add-in. It would not be fun.) To get started, designate the range as a StatTools data set in the usual way and then select One Variable Summary from the Summary Statistics dropdown list. The key now is to click on the Format button (see Figure 3.8) and choose Stacked (if it isn't already selected). When you choose Stacked, you get two lists of variables to choose from. In the Cat (categorical) list, choose the variable that you want to categorize by, in this case Position. In the Val (value) list, choose the variable that you want to summarize, in this case Salary. Then select any of the summary measures you would like to see, such as those checked in Figure 3.8.

StatTools often lets you choose the stacked format. This allows you to choose a Cat (categorical) variable and a Val (value) variable for the analysis.

[Figure 3.8: One Variable Summary Dialog Box with Stacked Format. StatTools dialog screenshot showing the Cat and Val variable lists and the check boxes for the summary statistics to report.]
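The same idea, breaking a Val variable down by a Cat variable in stacked data, can be sketched outside of StatTools. The following Python snippet is only an illustration, not the StatTools procedure. Because the baseball data are not reproduced here, it uses the small Gender and Salary data set from Figure 3.6, and the function name `summarize_by_category` is hypothetical. It reports the sample standard deviation; we make no claim here about which convention StatTools uses.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Stacked (Gender, Salary) data from Figure 3.6.
stacked = [
    ("Male", 81600), ("Female", 61600), ("Female", 64300),
    ("Female", 71900), ("Male", 76300), ("Female", 68200),
    ("Male", 60900), ("Female", 78600), ("Female", 81700),
    ("Male", 60200), ("Female", 69200), ("Male", 59000),
    ("Male", 68600), ("Male", 51900), ("Female", 64100),
    ("Male", 67600), ("Female", 81100), ("Female", 77000),
    ("Female", 58800), ("Female", 87800), ("Male", 78900),
]


def summarize_by_category(rows):
    """Return count, mean, median, and sample std dev for each category."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {
        category: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "stdev": stdev(values),  # sample standard deviation
        }
        for category, values in groups.items()
    }


summary = summarize_by_category(stacked)
for category, measures in sorted(summary.items()):
    print(category, measures)
```

Choosing a different category variable only changes the first element of each row, which mirrors choosing a different Cat variable in the dialog box.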
The results appear in Figure 3.9. This table lists each of the requested summary measures for each of the nine positions in the data set. If you want to see salaries broken down by team or any other categorical variable, you can easily run this analysis again and choose a different Cat variable.

¹ For baseball fans: don't be fooled by the low mean for the Infielder position. There are only seven players in this category, evidently the utility infielders who can play several positions and don't command high salaries.

Figure 3.9 Summary Measures of Salary for Various Positions

| One Variable Summary (Data Set 1) | Salary (Catcher) | Salary (Designated Hitter) | Salary (First Baseman) | Salary (Infielder) | Salary (Outfielder) | Salary (Pitcher) | Salary (Second Baseman) | Salary (Shortstop) | Salary (Third Baseman) |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 2179760 | 6633875 | 5706002 | 667143 | 3834524 | 2887334 | 2463334 | 3113541 | 4954790 |
| Std Dev | 2660154 | 5425908 | 5824204 | 590670 | 5042623 | 3927561 | 2478960 | 4058174 | 6333443 |
| Median | 950000 | 4000000 | 3125000 | 420000 | 1500000 | 825000 | 1400000 | 1300000 | 2400000 |
| Minimum | 400000 | 400000 | 400000 | 400000 | 400000 | 400000 | 400000 | 400000 | 400000 |
| Maximum | 13100000 | 13000000 | 20625000 | 2000000 | 23854494 | 18876139 | 11285714 | 21600000 | 33000000 |
| Count | 63 | 8 | 39 | 7 | 150 | 407 | 47 | 53 | 44 |
| 1st Quartile | 415000 | 421000 | 750000 | 400000 | 417500 | 414800 | 416700 | 425000 | 432400 |
| 3rd Quartile | 2800000 | 11500000 | 11600000 | 550000 | 5000000 | 3750000 | 3500000 | 4650000 | 7050000 |

Side-by-side box plots are our favorite way of comparing the distribution of a numerical variable across categories of some categorical variable.

There are a lot of numbers to digest in Figure 3.9, so it is difficult to get a clear picture of differences across positions. It is much more enlightening to see a graphical summary of this information. There are several types of graphs you can use. Our favorite way is to create side-by-side box plots (the same type of chart illustrated in the chapter opener), as we will discuss shortly. Another possibility is to create side-by-side histograms, with one histogram for each category. This is easy with StatTools, using the Stacked format option exactly as we did for summary measures. However, you should not accept the default bins because they will differ across categories and prevent a fair comparison. So make sure you enter your own bins. (See the finished version of the baseball file for an illustration of side-by-side histograms done with default bins and with specified bins.) A third possibility is to use pivot tables and corresponding pivot charts, as will be discussed later in this chapter.

For now, we illustrate side-by-side box plots. These are very easy to obtain. Select Box Whisker Plot from the Summary Graphs dropdown list and fill in the resulting dialog box as shown in Figure 3.10. Again, the key is to select the Stacked format so that you can choose a Cat variable and a Val variable. The results appear in Figure 3.11. There is a separate box plot for each category of the Position variable, and each has exactly the same interpretation that we
discussed in the previous chapter.

[Figure 3.10: Box Plot Dialog Box with Stacked Format. StatTools dialog screenshot with the Stacked format selected, Position chosen as the Cat variable, and Salary as the Val variable.]

[Figure 3.11: Box Plots of Salary by Position. One box plot for each position (Catcher, Designated Hitter, First Baseman, Infielder, Outfielder, Pitcher, Second Baseman, Shortstop, Third Baseman); horizontal axis: salary, from 0 to 35,000,000.]

Now the differences between positions emerge fairly clearly. A few of the conclusions that can be made are the following:

- The salaries for all positions are skewed to the right (mean greater than median, long lines and outliers to the right).
- As a whole, first basemen tend to be the highest paid players, followed by third basemen. The designated hitters also make a lot, but there are only eight such players in the data set.
- As a whole, pitchers and outfielders don't make
as much as first basemen and third basemen, but there are a lot of high-earning outliers at these two positions.
- Except for a few notable exceptions, catchers and second basemen don't get much respect.

Because these side-by-side box plots are so easy to obtain, you can generate a lot of them to provide insights into the salary data. Several interesting examples appear in Figures 3.12 to 3.14. From these box plots we can conclude the following:

- Pitchers make somewhat less than other players, although there are many outliers in each group.
- The Yankees payroll is indeed much larger than the payrolls for the rest of the teams. In fact, it is so large that Alex Rodriguez's $33 million is considered only a mild outlier relative to the rest of the team.
- Aside from the many outliers, the playoff teams from 2008 tend to have larger payrolls than the non-playoff teams. The one question we cannot answer, however, at least not without additional data, is whether these larger payrolls are a cause or an effect of being successful.

You can often create a categorical variable on the fly with an IF formula and then use it for side-by-side box plots. We did this with the Yankees.

[Figure 3.12: Box Plots of Salary by Pitcher. Box plots for Pitcher = Yes and Pitcher = No; horizontal axis: salary, from 0 to 35,000,000.]

[Figure 3.13: Box Plots of Salary by Yankees. Box plots for Yankees = Yes and Yankees = No; horizontal axis: salary, from 0 to 35,000,000.]
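A categorical variable of this kind is easy to build in almost any tool. As a rough sketch, the Python snippet below mimics an Excel formula along the lines of =IF(team="New York Yankees","Yes","No") and then compares the two groups. The player records are invented purely for illustration (they are not taken from the baseball file), and the function name `yankees_flag` is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (team, salary) records, invented for illustration only.
players = [
    ("New York Yankees", 15000000),
    ("New York Yankees", 5400000),
    ("New York Yankees", 900000),
    ("Boston Red Sox", 7500000),
    ("Tampa Bay Rays", 420000),
    ("Pittsburgh Pirates", 1200000),
    ("Boston Red Sox", 450000),
]


def yankees_flag(team):
    """Counterpart of an IF formula: Yes for the Yankees, No otherwise."""
    return "Yes" if team == "New York Yankees" else "No"


# Group salaries by the new Yes/No variable, as a Cat variable would.
salaries_by_flag = defaultdict(list)
for team, salary in players:
    salaries_by_flag[yankees_flag(team)].append(salary)

medians = {flag: median(values) for flag, values in salaries_by_flag.items()}
print(medians)
```

The new variable has only two categories, however many teams appear in the data, which keeps the number of groups to compare small.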
There is one StatTools limitation you should be aware of. The academic version allows only 12 categories for box plots. Therefore, you can't choose Team as the Cat variable because there are 30 teams. However, it is possible to isolate one or more teams in a column and then base the box plots on this column, as we did for the Yankees. As another example, if you were interested in comparing the Yankees, the Red Sox, and all the others, you could create another column with three values: Yankees, Red Sox, and Other.

[Figure 3.14: Box Plots of Salary by Playoff Team 2008. Box plots for Playoff Team 2008 = Yes and Playoff Team 2008 = No; horizontal axis: salary, from 0 to 35,000,000.]

PROBLEMS

Level A

7. Recall that the file Baseball Salaries Extra.xlsx contains data on 818 Major League Baseball (MLB) players during the 2009 season. Use StatTools to find the mean, median, standard deviation, and first and third quartiles of Salary, broken down by each of the following categories. Comment on your findings.
   a. Team
   b. Division
   c. Whether they played for the Yankees
   d. Whether they were in the playoffs

8. The file P02_07.xlsx includes data on 204 employees at the (fictional) company Beta Technologies. Use StatTools to find the mean, median, and standard deviation of Annual Salary, broken down by each of the following categories. Comment on your findings.
   a. Gender
   b. A recoded version of Education, with new values 1 for Education less than 4, 2 for Education equal to 4, and 3 for Education greater than 4
   c. A recoded version of Age, with people aged less than 34 listed as Young, those aged at least 34 and less than 50 listed as Middle-aged, and those aged at least 50 listed as Older

9. The file Golf Stats.xlsx contains data on the 200 top golfers each year from 2003 to 2009. (This data set is used in an example in the next section.) Create a recoded Age variable, with values Twenties, Thirties, and Forties, based on their ages in the 2009 sheet. Then use StatTools to calculate the mean, median, and standard deviation of the following variables, broken down by the recoded Age. Comment on whether it appears that golfers peak in their thirties.
   a. Earnings
   b. Yards/Drive and Driving Accuracy
   c. Greens in Regulation
   d. Putting Average (Golfers want this to be small.)

10. Recall from Chapter 2 that the HyTex Company is a direct marketer of technical products and that the file Catalog Marketing.xlsx contains recent data on 1000 HyTex customers. Use StatTools to find the mean, median, and standard deviation of AmountSpent, broken down by the following variables. Then create side-by-side box plots of AmountSpent, broken down by the same variables. Comment on how the box plots complement the summary measures.
   a. Age
   b. Gender
   c. Close
   d. Region
   e. Year of first purchase (Hint: For this one, use the YEAR function to create a Year column.)
   f. The combination of Married and OwnHome. (For this one, create a code variable, with values from 1 to 4, for the four combinations of Married and OwnHome. Alternatively, create a text variable with values such as Not married, Owns home.)
11. The file P02_35.xlsx contains data from a survey of 500 randomly selected households.
   a. Create a new column, Has Second Income, with values Yes and No, depending on whether the household has a reported second income.
   b. Use StatTools to find the mean, median, and standard deviation of First Income, broken down by the variable you created in part a. Is there any indication that first income tends to be any larger or smaller, or has more or less variation, depending on whether there is a second income?
   c. Repeat part b for each of the Monthly Payment and Debt variables.

12. The file P02_02.xlsx contains data about 211 movies released in 2006 and 2007.
   a. Recode Genre so that all genres except for Action, Adventure, Thriller/Suspense, Drama, and Comedy are listed as Other. (Those in Other are genres with fewer than 16 movies.)
   b. Use StatTools to find the mean, median, and standard deviation of Total US Gross, broken down by the recoded Genre variable. Also, create side-by-side box plots of Total US Gross, again broken down by the recoded Genre variable. Comment on what the results say about the popularity of different genres.

13. The Wall Street Journal CEO Compensation Study analyzed chief executive officer (CEO) pay from many U.S. companies with fiscal year 2008 revenue of at least $5 billion that filed their proxy statements between October 2008 and March 2009. The data are in the file P02_30.xlsx. (Note: This data set contains somewhat different CEO compensation data from the data set used as an example later in this chapter.)
   a. Create a new variable Total 2008, the sum of Salary 2008 and Bonus 2008. (Actually, this is not total compensation because it omits the very lucrative compensation from stock options.) Also, recode Company Type so that the Technology and Telecommunications are collapsed into a Tech/Telecomm category.
   b. Use StatTools to find the mean, median, and standard
deviation of Total 2008, broken down by the recoded Company Type. Also, create side-by-side box plots of Total 2008, again broken down by the recoded Company Type. What do the results tell you about differences in level or variability across company types?

14. The file P02_55.xlsx contains monthly sales (in millions of dollars) of beer, wine, and liquor. The data have not been seasonally adjusted, so there might be seasonal patterns that can be discovered.
   a. Create a new Month Name variable with values Jan, Feb, and so on. (Use Excel's MONTH function and then a lookup table.)
   b. Use StatTools to create side-by-side box plots of Total Sales, broken down by Month Name. Is there any evidence of differences across months for either the level of sales or the variability of sales?

15. The file P03_15.xlsx contains monthly data on the various components of the Consumer Price Index (CPI). The source claims that these data have been seasonally adjusted. The following parts ask you to check this claim.
   a. Create a new Month Name variable with values Jan, Feb, and so on. (Use Excel's MONTH function and then a lookup table.)
   b. Create side-by-side box plots of each component of the CPI (including the All Items variable), broken down by the Month Name variable from part a. What results would you expect for seasonally adjusted data? Are your results in line with this?

16. The file P02_11.xlsx contains data on 148 houses that were recently sold in a (fictional) suburban community. The data set includes the selling price of each house, along with
its appraised value, square footage, number of bedrooms, and number of bathrooms.
a. Create two new variables, Ratio1 and Ratio2. Ratio1 is the ratio of Appraised Value to Selling Price, and Ratio2 is the ratio of Selling Price to Square Feet. Identify any obvious outliers in these two Ratio variables.
b. Use StatTools to find the mean, median, and standard deviation of each Ratio variable, broken down by Bedrooms. Also, create side-by-side box plots of each Ratio variable, again broken down by Bedrooms. Comment on the results.
c. Repeat part b with Bedrooms replaced by Bathrooms.
d. If you repeat parts b and c with any obvious outliers from part a removed, do the conclusions change in any substantial way?

Level B

17. The file P02_32.xlsx contains blood pressures for 1000 people, along with variables that can be related to blood pressure. These other variables have a number of missing values, probably because some people didn't want to report certain information. For each of the Alcohol, Exercise, and Smoke variables, use StatTools to find the mean, median, and standard deviation of Blood Pressure, broken down by whether the data for that variable is missing. For example, there should be one set of statistics for people who reported their alcohol consumption and another for those who didn't report it. Based on your results, does it appear that there is any difference in blood pressure between those who reported and those who didn't?

18. The file P03_18.xlsx contains the times in the Chicago marathon for the top runners each year (the top 10,000 in 2006, the top 20,000 in 2007 and 2008).
a. Merge the data in these three sheets into a single sheet named 2006-2008, and in the new sheet, create a variable Year that lists the year.
b. The Time variable, shown as something like 2:16:12, is really stored as a time, the fraction of day starting from midnight. So 2:16:12, for example, which means 2 hours, 16 minutes, and 12 seconds, is stored as 0.0946, meaning that 2:16:12 AM is really 9.46% of the way from midnight to the next midnight. This isn't
very useful. Do whatever it takes to recode the times into a new Minutes variable with two decimals, so that 2:16:12 becomes 136.20 minutes. (Hint: Look up Time functions in Excel's online help.)
c. Create a new variable Nationality to recode Country as KEN, ETH, USA, or Other, depending on whether the runner is from Kenya/Ethiopia (the usual winners), the USA, or some other country.
d. Use StatTools to find the mean, median, standard deviation, and first and third quartiles of Minutes, broken down by Nationality. Also, create side-by-side box plots of Minutes, again broken down by Nationality. Comment on the results.
e. Repeat part d, replacing Nationality by Gender.

19. The file P02_18.xlsx contains daily values of the S&P Index from 1970 to 2009. It also contains percentage changes in the index from each day to the next.
a. Create a new variable President that lists the U.S. presidents Nixon through Obama on each date. You can look up the presidents and dates online.
b. Use StatTools to find the mean, median, standard deviation, and first and third quartiles of % Change, broken down by President. Also, create side-by-side box plots of % Change, again broken down by President. Comment on the results.

20. The file P02_56.xlsx contains monthly values of indexes that measure the amount of energy necessary to heat or cool buildings due to outside temperatures. (See the explanation in the Source sheet of the file.) These are reported for each state in the U.S. and also for several regions, as listed in the Locations sheet, from 1931 to 2000.
a. For each of the Heating Degree Days and Cooling Degree Days sheets, create a new Season variable with values Winter, Spring, Summer, and Fall. Winter consists of December, January, and February; Spring consists of March, April, and May; Summer consists of June, July, and August; and Fall consists of September, October, and November.
b. Use StatTools to find the mean, median, and standard deviation of Heating Degree Days (HDD), broken down by Season, for the 48 contiguous states location (code 5999). (Ignore the first and last rows for the given location, the ones that contain 9999, the code for missing values.) Also, create side-by-side box plots of HDD, broken down by season. Comment on the results. Do they go in the direction you would expect? Do the same for Cooling Degree Days, which has no missing data.
c. Repeat part b for California (code 0499).
d. Repeat part b for the New England group of states (code 5801).

3.4 RELATIONSHIPS AMONG NUMERICAL VARIABLES

In this section we discuss methods for finding relationships among numerical variables. For example, we might want to examine the relationship between heights and weights of people, or between salary and years of experience of employees. To study such relationships, we introduce two new summary measures, correlation and covariance, and a new type of chart called a scatterplot.

Note that these measures can be applied to any variables that are displayed numerically. However, they are appropriate only for truly numerical variables, not for categorical variables that have been coded numerically. In particular, many people create dummy (0-1) variables for categorical variables such as Gender and then include these dummies in a table of correlations. This is certainly possible, and we do not claim that it is wrong. However, if you want to investigate relationships involving categorical variables, it is better to employ the tools in the previous two sections.

In general, don't use correlations that involve coded categorical variables such as 0-1 dummies. The methods from the previous two sections are more appropriate.

3.4.1 Scatterplots

We first discuss scatterplots, a graphical method for detecting relationships between two numerical variables.[2] Then we will discuss the numerical summary measures, correlation and covariance, in the next subsection. (We do it in this order because correlation and covariance will make more sense once you understand scatterplots.) A scatterplot is a scatter of points, where each point denotes the values of an observation for two selected variables. The two variables are often labeled generically as X and Y, so a scatterplot is sometimes called an X-Y chart. The whole purpose of a scatterplot is to make a relationship (or the lack of it) apparent. Do the points tend to rise upward from left to right? Do they tend to fall downward from left to right? Does the pattern tend to be linear, nonlinear, or no particular shape? Do any points fall outside the general pattern? The answers to these questions provide information about the possible relationship between the two variables. We illustrate the process in the following example.

EXAMPLE 3.3 GOLF STATS ON THE PGA TOUR

For the past decade or so, the Professional Golfers' Association (PGA) has kept statistics on all PGA Tour players, and these stats are published on the Web. We imported yearly data from 2003-2009 into the file Golf Stats.xlsx. (The full 2009 data set wasn't available when we wrote this example, but it is now available in the file.) The file includes an observation for each of the top 200 earners for each year, including age, earnings, events played, rounds played, 36-hole cuts made (only the top scorers on Thursday and Friday get to play on the weekend; the others don't make the cut), top 10s, and wins. It also includes stats about efficiency in the various parts of the game (driving, putting, greens in regulation, and sand saves), as well as good holes (eagles and birdies) and bad holes (bogies). A sample of the data for 2008 appears in Figure 3.15, with the data sorted in decreasing order of earnings and a few variables not shown.[3] What relationships can
be uncovered in these data for any particular year?

Figure 3.15 Golf Stats (worksheet columns A through K, with labels in row 1)

| Player | Age | Events | Rounds | Cuts Made | Top 10s | Wins | Earnings | Yards/Drive | Driving Accuracy | Greens in Regulation |
|---|---|---|---|---|---|---|---|---|---|---|
| Vijay Singh | 45 | 23 | 82 | 18 | 8 | 3 | 6,601,095 | 298.7 | 59.5 | 65.1 |
| Tiger Woods | 32 | 6 | 23 | 6 | 6 | 4 | 5,775,000 | | | |
| Phil Mickelson | 37 | 21 | 79 | 20 | 8 | 2 | 5,188,875 | 296.5 | 55.3 | 65 |
| Sergio Garcia | 28 | 19 | 70 | 18 | 6 | 1 | 4,858,224 | 294.6 | 59.4 | 67.1 |
| Kenny Perry | 47 | 26 | 97 | 24 | 7 | 3 | 4,663,794 | 295.7 | 62 | 67.5 |
| Anthony Kim | 22 | 22 | 81 | 19 | 8 | 2 | 4,656,266 | 301 | 58.3 | 65.8 |
| Camilo Villegas | 26 | 22 | 78 | 19 | 7 | 2 | 4,422,641 | 293.6 | 58.2 | 64.6 |
| Padraig Harrington | 36 | 15 | 51 | 12 | 6 | 2 | 4,313,551 | 297.6 | 59.4 | 59.5 |
| Stewart Cink | 35 | 23 | 85 | 19 | 7 | 1 | 3,979,301 | 297.2 | 55.3 | 64.6 |
| Justin Leonard | 35 | 25 | 96 | 24 | 8 | 1 | 3,943,542 | 282.5 | 67.7 | 65.9 |

Objective: To use scatterplots to search for relationships in the golf data.

Solution

This example is typical in that you are given many numerical variables, and it is up to you to search for possible relationships. A good first step is to ask some interesting questions and then try to answer them with scatterplots. For example, do younger players play more events? Are earnings related to age? Which is related most strongly to earnings: driving, putting, or greens in regulation? Do the answers to these questions remain the same from year to year? This example is all about exploring the data, and we will answer only a few of the questions that could be asked. Fortunately, scatterplots are easy to create, especially with StatTools, so you can do a lot of exploring very quickly.

[2] Some people spell these plots as scatterplots, others as scatter plots. We (and StatTools) prefer the one-word spelling.
[3] You might recall that Tiger Woods missed the rest of 2008 because of knee surgery after winning the U.S. Open in June. This explains his missing values.

Scatterplots are great for initial exploration of the data. If a scatterplot suggests a relationship between two variables, other methods can then be used to examine this relationship in more depth.

StatTools allows you to create many scatterplots in one run; just select multiple X variables and/or multiple Y variables.

It is possible to create a scatterplot with Excel tools only, that is, without StatTools. To do so, highlight any two variables of interest and select a scatter chart of the top left type from the Insert ribbon. At this point, you will probably want to modify the chart by deleting the legend, inserting some titles, and possibly changing some formatting. Also, you might want to swap the roles of the X and Y variables. The point is that you can do it, but the process is a bit tedious, especially if you want to create a lot of scatterplots.

Excel Tip: How do you highlight two long variables such as Age and Earnings? Here are the steps that make it easy.
1. Highlight the Age label in cell B1.
2. With your finger on the Shift key, press the End key and then the down arrow key. This highlights the Age column.
3. With your finger on the Ctrl key, highlight the Earnings label in cell H1.
4. With your finger on the Shift key, press the End key and then the down arrow key. Now both columns are highlighted.

Figure 3.16 StatTools Scatterplot Dialog Box

It is much easier to use StatTools. Begin by designating a StatTools data set called Stats 2008 (to distinguish it from data sets you might want to create for the other years). Then select Scatterplot from the Summary Graphs dropdown list. The resulting dialog box appears in Figure 3.16. You must select at least one X variable and at least one Y variable. However, you are allowed to select multiple X variables and/or
multiple Y variables. Then a scatterplot will be created for each X-Y pair selected. For example, if you want to see how a number of variables are related to Earnings, you can select Earnings as the Y variable and the others as X variables, as shown in the figure. Note that StatTools shows the associated correlation below each scatterplot if you check the Display Correlation Coefficient option. We will discuss correlations shortly.

Several scatterplots appear in Figures 3.17 through 3.20. (In a few of these, we modified the scale on the horizontal axis so that the scatter fills the chart.) The scatterplots in Figure 3.17 indicate the possibly surprising results that age is practically unrelated to the number of events played and earnings. Each scatter is basically a shapeless swarm of points, and a shapeless swarm always indicates no relationship.

Figure 3.17 Scatterplots of Age Versus Events and Earnings

Remember that all StatTools charts are really Excel charts, so you can modify them as you like with the usual Excel chart tools.

The scatterplots in Figure 3.18 confirm what we would expect. Specifically, players who play in more events tend to earn more, although there are a number of exceptions to this pattern. Also, players who make more 36-hole cuts tend to earn more. Note the outlier in both of these scatterplots, Tiger Woods. In spite of playing in only six events (and making the cut in all of them), he earned nearly $6 million.

Excel Tip: Unfortunately, there is no automatic way to enter a label such as Tiger Woods next to a point in a scatterplot. We wish there were, but there isn't, at least not without writing a macro. We had to insert the text boxes manually in Figure 3.18. If you click twice on a point (don't double-click, but slowly click twice), you can select this point. Then if you right-click, you have the option of adding a data label. However, this data label is always the value of the Y variable. In this case, it would be Tiger's earnings, not his name.

Figure 3.18 Scatterplots of Earnings Versus Events and Cuts Made

Golfers will be particularly interested in the scatterplots in Figures 3.19 and 3.20. First, the scatterplots in Figure 3.19 indicate almost no relationships between earnings and the two components of driving, length (yards per drive) and accuracy (percentage of fairways hit). At least in 2008, neither driving length nor driving accuracy seems to have much effect on earnings. In contrast, there is a reasonably strong upward relationship between greens hit in regulation and earnings. We would expect players who hit a lot of greens in regulation to earn more, and this appears to be the case. Finally, there is a definite downward relationship between putting average and earnings. Does this mean that better putters earn less? Absolutely not. The putting stat is the average number of putts per hole, so that a lower value is better. Therefore, we expect the downward relationship indicated in the chart. In fact, the driving and putting scatterplots tend to confirm the old saying in golf: Drive for show, putt for dough.

Figure 3.19 Scatterplots of Earnings Versus Driving Length and Driving Accuracy

Figure 3.20 Scatterplots of Earnings Versus Putting and Greens in Regulation

Excel allows you to superimpose a trend line, linear or curved, on a scatterplot. It is an easy way to quantify the relationship apparent in the scatterplot.

We could obviously ask many more questions about the relationships in this golf data set and then attempt to answer them with scatterplots. For example, are the relationships (or lack of them) in the above scatterplots consistent through the years? Or should Earnings per Round be used instead of Earnings as the Y variable? You now have a powerful tool, scatterplots, for examining relationships, and the tool is easy to implement. We urge you to use it a lot.

Trend Lines in Scatterplots

In Chapters 10 and 11 we will discuss regression, a method for quantifying relationships between variables. We can provide a gentle introduction to regression at this point by discussing the very useful Trendline tool in Excel. Once you have a scatterplot, Excel enables you to superimpose one of several trend lines on the scatterplot.
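For readers who want to see the arithmetic behind a straight trend line outside Excel, the following is a minimal sketch that assumes the line is fitted by ordinary least squares. It uses only the nine players with complete rows in the Figure 3.15 sample, and the decimal placement of those values is reconstructed from the printed table, so treat the data as an assumption. The function name `fit_line` is hypothetical and is not part of Excel or StatTools.

```python
# Least-squares straight line through paired (x, y) observations.
# Standard library only; fit_line is a hypothetical helper name.

# Nine players with complete rows in the Figure 3.15 sample (2008).
# Decimal placement is reconstructed from the printed table (an assumption).
driving_accuracy = [59.5, 55.3, 59.4, 62.0, 58.3, 58.2, 59.4, 55.3, 67.7]  # % of fairways hit
yards_per_drive = [298.7, 296.5, 294.6, 295.7, 301.0, 293.6, 297.6, 297.2, 282.5]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(driving_accuracy, yards_per_drive)
print(f"Yards/Drive = {intercept:.2f} + ({slope:.4f}) * Driving Accuracy")
```

With only nine players, the fitted coefficients will not match a line fitted to the full 2008 data set. The sketch is meant to show the mechanics, and the slope it produces for this small sample is negative.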
Essentially, a trend line is a line or curve that fits the scatter as well as possible. This could indeed be a straight line, or it could be one of several types of curves. (By the way, you can also superimpose a trend line on a time series graph, exactly as described here for scatterplots.)

To illustrate the Trendline option, we created the scatterplot of driving length versus driving accuracy in Figure 3.21. If you are a golfer, you are probably not surprised to see that the longest hitters tend to be less accurate. This scatterplot is definitely downward sloping, and it appears to follow a straight line reasonably well.

Figure 3.21 Scatterplot of Driving Length Versus Driving Accuracy

Therefore, it is reasonable to fit a linear trend line to this scatterplot. To do this, right-click on any point on the chart, select Add Trendline, and fill out the resulting dialog box as shown in Figure 3.22. Note that we have checked the Display Equation on Chart option. The result (after moving the equation to a blank part of the chart) appears in Figure 3.23. The equation you see is a regression equation. It states that driving length (y) is 350.89 minus 0.9829 times driving accuracy (x). For example, a player who hits 60% of fairways would be predicted to average about 350.89 - 0.9829(60), or roughly 291.9 yards per drive. This line is certainly not a perfect fit; there are many points well above the line and others below the line. Still, it quantifies the downward trend. The
tools in this subsection, scatterplots and trend lines superimposed on scatterplots, are among the most valuable tools you will learn in the book. When you are interested in a possible relationship between two numerical variables, these are the tools you should use first.

Figure 3.22 More Trendline Options Dialog Box

Figure 3.23 Scatterplot with Trend Line and Equation Superimposed

3.4.2 Correlation and Covariance

We discussed many numerical summary measures in Chapter 2, all of which involve a single variable. The two measures discussed in this section, correlation and covariance, involve two variables. Specifically, each measures the strength and direction of a linear relationship between two numerical variables. Intuitively, the relationship is strong if the points in a scatterplot cluster tightly around some straight line. If this straight line rises from left to right, the relationship is positive and the measures will be positive numbers. If it falls from left to right, the relationship is negative and the measures will be negative numbers.

To measure the covariance or correlation between two numerical variables X and Y (indeed, to form a scatterplot of X versus Y), X and Y must be paired variables. That is, they must have the same number of observations, and the X and Y values for any observation should be naturally paired. For example, each observation could be the height and weight for a particular person, the time in a store and the amount purchased for a particular customer, and so on.

With this in mind, let $X_i$ and $Y_i$ be the paired values for observation $i$, and let $n$ be the number of observations. Then the covariance between X and Y, denoted by Covar(X, Y), is given by the following formula.[4]

Formula for Covariance

$$\text{Covar}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \tag{3.1}$$

You will probably never have to use Equation (3.1) directly (Excel has a built-in COVAR function that does it for you, and StatTools also calculates covariance automatically), but the formula does indicate what covariance is all about. It is essentially an average of products of deviations from means. If X and Y vary in the same direction, then when X is above its mean, Y will tend to be above its mean, and when X is below its mean, Y will tend to be below its mean. In either case, the product of deviations will be positive (a positive times a positive or a negative times a negative), so the covariance will be positive. The
opposite is true when X and Y vary in opposite directions. Then the covariance will be negative.

CHANGES IN EXCEL 2010
Excel's old COVAR function actually uses denominator n, so it gives the population covariance, not the sample covariance (denominator n - 1) in Equation (3.1). In Excel 2010, both versions are available, named COVARIANCE.P (population) and COVARIANCE.S (sample).

Covariance is too sensitive to the measurement scales of X and Y to make it interpretable, so we rely much more on correlation, which is unaffected by measurement scales.

Covariance has a serious limitation as a descriptive measure because it is very sensitive to the units in which X and Y are measured. For example, the covariance can be inflated by a factor of 1000 simply by measuring X in dollars rather than thousands of dollars. This limits the usefulness of covariance as a descriptive measure, and we will use it very little in the book.[5]

In contrast, the correlation, denoted by Correl(X, Y), remedies this problem. It is a unitless quantity that is unaffected by the measurement scale. For example, the correlation is the same regardless of whether the variables are measured in dollars, thousands of dollars, or millions of dollars. The correlation is defined by Equation (3.2), where Stdev(X) and Stdev(Y) denote the standard deviations of X and Y. Again, you will probably never have to use this formula for calculations (Excel does it for you with the built-in CORREL function, and StatTools also calculates correlations automatically), but it does show how correlation and covariance are related to one another.

[4] Actually, Excel's COVAR function uses n in the denominator, whereas StatTools uses n - 1. Fortunately, this is not an issue with correlation; Excel's CORREL function and StatTools produce exactly the same correlations.
[5] Don't write off covariance too quickly, however. If you plan to take a finance course in investments, you will see plenty of covariances.
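To make the two denominators and the effect of measurement units concrete, here is a minimal sketch in plain Python. The salary and experience figures are invented for illustration, and the function names `covariance` and `correlation` are hypothetical helpers, not Excel or StatTools functions. The correlation is computed as the covariance divided by the product of the two standard deviations, using the same denominator throughout.

```python
import math


def covariance(xs, ys, sample=True):
    """Average product of deviations from the means.

    sample=True divides by n - 1 (sample covariance);
    sample=False divides by n (population covariance).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    stdev_x = math.sqrt(covariance(xs, xs))  # variance is the covariance of x with itself
    stdev_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (stdev_x * stdev_y)


# Hypothetical paired data: salary and years of experience for five employees.
salary_dollars = [40000, 52000, 61000, 75000, 90000]
salary_thousands = [s / 1000 for s in salary_dollars]
years = [1, 3, 5, 8, 12]

cov_dollars = covariance(salary_dollars, years)
cov_thousands = covariance(salary_thousands, years)
cov_population = covariance(salary_dollars, years, sample=False)

print("sample covariance, salary in dollars:  ", cov_dollars)
print("sample covariance, salary in thousands:", cov_thousands)
print("population covariance, salary in dollars:", cov_population)
print("correlation, salary in dollars:  ", correlation(salary_dollars, years))
print("correlation, salary in thousands:", correlation(salary_thousands, years))
```

Changing the salary units rescales the covariance by a factor of 1000 but leaves the correlation unchanged, and the population covariance is the sample covariance multiplied by (n - 1)/n.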
Correlation is useful only for measuring the strength of a linear relationship. Strongly related variables could have correlation close to 0 if the relationship is nonlinear.

A correlation is only a single-number summary of a scatterplot, so it always contains less information than the full scatterplot.

Formula for Correlation

$$\text{Correl}(X, Y) = \frac{\text{Covar}(X, Y)}{\text{Stdev}(X) \times \text{Stdev}(Y)} \tag{3.2}$$

The correlation is not only unaffected by the units of measurement of the two variables, but it is always between -1 and +1. The closer it is to either of these two extremes, the closer the points in a scatterplot are to a straight line, either in the negative or positive direction. On the other hand, if the correlation is close to 0, then the scatterplot is typically a cloud of points with no apparent relationship. However, while it is not common, it is also possible that the points are close to a curve and have a correlation close to 0. This is because correlation is relevant only for measuring linear relationships.

When there are several numerical variables in a data set, it is useful to create a table of covariances and/or correlations. Each value in the table then corresponds to a particular pair of variables. StatTools allows you to do this easily, as illustrated in the following continuation of the golf example. However, we first make three important points about the roles of scatterplots, correlations, and covariances.

- A correlation is a single-number summary of a scatterplot. It never conveys as much information as the full scatterplot; it only summarizes the information in the scatterplot. However, it is often more convenient to report a table of correlations for many variables than to report an unwieldy number of scatterplots.
- We are usually on the lookout for large correlations, those near -1 or +1. But how large is large? There is no generally agreed-upon cutoff, but by looking at a number of scatterplots and their corresponding correlations, you will start to get a sense of what a correlation such as 0.5 or 0.7 really means in terms of the strength of the linear relationship between the variables. (In addition, we will attach a concrete meaning to the square of a correlation when we discuss regression in Chapters 10 and 11.)
- Do not even try to interpret covariances numerically, except possibly to check whether they are positive or negative. For interpretive purposes, concentrate on correlations.

FUNDAMENTAL INSIGHT
Scatterplots Versus Correlations
It is important to remember that a correlation is a single-number measure of the linear relationship between two numerical variables. Although a correlation is a very useful measure, it is hard to imagine exactly what a correlation of 0.3 or 0.8, say, actually means. In contrast, a scatterplot of two numerical variables indicates the relationship between the two variables very clearly. In short, a scatterplot conveys much more information than the corresponding correlation.

EXAMPLE 3.3 GOLF STATS (CONTINUED)

In the previous subsection, we saw how relationships between several of the golf variables can be detected with scatterplots. What further insights do we get by looking at correlations between these variables?

Objective: To use correlations to understand relationships in the golf data.
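As a rough illustration of what a table of correlations holds, the following sketch builds one in plain Python for three of the golf variables, keeping only the entries on or below the diagonal. It uses just the nine players with complete rows in the Figure 3.15 sample, with decimal placement reconstructed from the printed table, so the values are an assumption and will differ from results for the full data set. The helper names `correlation` and `correlation_table` are hypothetical.

```python
import math

# Nine players with complete rows in the Figure 3.15 sample (2008).
# Decimal placement is reconstructed from the printed table (an assumption).
sample = {
    "Yards/Drive": [298.7, 296.5, 294.6, 295.7, 301.0, 293.6, 297.6, 297.2, 282.5],
    "Driving Accuracy": [59.5, 55.3, 59.4, 62.0, 58.3, 58.2, 59.4, 55.3, 67.7],
    "Greens in Regulation": [65.1, 65.0, 67.1, 67.5, 65.8, 64.6, 59.5, 64.6, 65.9],
}


def correlation(xs, ys):
    """Correlation between two paired lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The n - 1 (or n) denominators cancel, so they are omitted here.
    return sxy / math.sqrt(sxx * syy)


def correlation_table(columns):
    """Return {(row_name, col_name): r} for every pair on or below the diagonal."""
    names = list(columns)
    table = {}
    for i, row in enumerate(names):
        for col in names[: i + 1]:
            table[(row, col)] = correlation(columns[row], columns[col])
    return table


table = correlation_table(sample)
for (row, col), r in table.items():
    print(f"{row:>22} vs {col:<22} {r:7.3f}")
```

Because the correlation between X and Y equals the correlation between Y and X, the entries above the diagonal would simply repeat the ones shown.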
Solution

With the many numerical variables in the golf data set, it is indeed unwieldy to create scatterplots for all pairs of variables, but it is easy to create a table of correlations with StatTools.[6] (If you want only one correlation, you might instead use Excel's CORREL function.) As an example, we will create a table of correlations for the golf data in 2008. To do so, select Correlation and Covariance from the Summary Statistics dropdown list, and fill in the resulting dialog box as shown in Figure 3.24. There are several options. First, you can check as many numerical variables as you like. We checked a few but not all. Second, you can ask for a table of correlations and/or a table of covariances. We asked for correlations only. Finally, correlations (and covariances) are symmetric in that the correlation between any two variables X and Y is the same as the correlation between Y and X. Therefore, you can choose any of the three table structure options and receive exactly the same information. We tend to favor the Entries Below the Diagonal Only option.

Figure 3.24 StatTools Correlation and Covariance Dialog Box

You typically scan a table of correlations for the large correlations, either positive or negative. Conditional formatting is useful, especially if the table is large.

The resulting table of correlations appears in Figure 3.25. You can ignore the 1.000 values along the diagonal because a variable is always perfectly correlated with itself. Besides these, we are looking for relatively large values, either positive or negative. When the table is fairly large, conditional formatting is useful. Although it doesn't show up on the printed page, we formatted all correlations between 0.6 and 0.999 as red and all correlations between -1.0 and -0.6 as green. (See the finished version of the golf file for instructions on how to create the conditional formatting.) There are three large positive values, involving events, rounds, and cuts made. None of these should come as a surprise. There is only one large negative correlation, the one between driving length and driving accuracy, and we already saw the corresponding scatterplot in Figure 3.21. So if you want to know what a correlation of approximately -0.6 actually means, look at the scatterplot in this figure. It indicates a definite downward trend, but there is quite a lot of variability around the best-fitting straight line.

[6] Some statistical software packages provide a matrix of scatterplots option. This is essentially like a table of correlations between all pairs of variables, except that each correlation is replaced by a scatterplot. StatTools does not provide this option, at least not yet.

22. The file P03_22.xlsx lists financial data on movies released since 1980 with budgets at least $20 million.
a. Reduce the size of this data set by deleting all movies with a budget of more than $100 million. Also, delete all movies where US Gross and/or Worldwide Gross is listed as Unknown.
b. For the remaining
movies, create a table of correlations between the variables Budget, US Gross, and Worldwide Gross. Comment on the results. Are there any surprises?
c. For the movies remaining after part a, create a scatterplot of Worldwide Gross (Y axis) versus US Gross and another scatterplot of US Gross (Y axis) versus Budget. Briefly explain any patterns you see in these scatterplots. Do they seem to be consistent with the corresponding correlations?

23. The file P02_10.xlsx contains midterm and final exam scores for 96 students in a corporate finance course.
a. Do the students' scores for the two exams tend to go together, so that those who do poorly on the midterm tend to do poorly on the final, and those who do well on the midterm tend to do well on the final? Create a scatterplot, along with a correlation, to answer this question.
b. Superimpose a (linear) trend line on the scatterplot, along with the equation of the line. Based on this equation, what would you expect a student with a 75 on the midterm to score on the final exam?

24. Recall that the file Golf Stats.xlsx contains data on the 200 top golfers each year from 2003 to 2009. The question to be explored in this problem is what drives earnings, and whether this is consistent from year to year.
a. For each year, create a new variable, Earnings/Event. This is potentially a better measure of earnings because some players enter more events than others.
b. Create a separate table of correlations for each year that includes Earnings/Event, Yards/Drive, Driving Accuracy, Greens in Regulation, Putting Average, Sand Save Pct, and Birdies/Round. (StatTools will warn you about missing data, but don't worry about it.) Explain whether these correlations help answer the questions posed above.
c. There is a saying in golf: "Drive for show, putt for dough." Create a separate set of scatterplots for each year of Earnings/Event (Y axis) versus each of Yards/Drive, Driving Accuracy, and Putting Average. Discuss whether these scatterplots tend to support the saying.

25. The file P02_02.xlsx contains data about 211 movies
released in 2006 and 2007. The question to be explored in this problem is whether the total gross for a movie can be predicted from how it does in its first week or two.
a. Create a table of correlations between the five variables 7-day Gross, 14-day Gross, Total US Gross, International Gross, and US DVD Sales. (StatTools will warn you about missing data, but don't worry about it.) Does it appear that the last three variables are related to either of the first two?
b. Explore the basic question further by creating a scatterplot of each of Total US Gross, International Gross, and US DVD Sales (Y axis) versus each of 7-day Gross and 14-day Gross (X axis), six scatterplots in all. Do these support the claim that you can tell how well a movie will do by seeing how it does in its first week or two?

26. The file P02_39.xlsx lists the average high school student scores on the SAT exam by state. There are three components of the SAT: critical reading, math, and writing. These components are listed, along with their sum. The percentage of all potential students who took the SAT is also listed by state. Create correlations and scatterplots to explore the following relationships and comment on the results.
a. The relationship between the combined score and the percentage taking the exam.
b. The relationship between the critical reading and writing components.
c. The relationship between a combined verbal component (the average of critical reading and writing) and the math component.
d. The relationship between each of critical reading, math, and writing with the combined score. Are these bound to be highly correlated because the sum of the three components equals the combined score?

27. The file P02_16.xlsx contains traffic data from 256 weekdays on four variables. Each variable lists the number of arrivals during a specific 5-minute period of the day.
a. What would it mean, in the context of traffic, for the data in the four columns to be positively correlated? Based on your observations of
traffic, would you expect positive correlations?
b. Create a table of correlations and check whether these data behave as you would expect.

28. The file P02_11.xlsx contains data on 148 houses that were recently sold in a fictional suburban community. The data set includes the selling price of each house, along with its appraised value, square footage, number of bedrooms, and number of bathrooms.
a. Create a table of correlations between all of the variables. Comment on the magnitudes of the correlations. Specifically, which of the last three variables (Square Feet, Bedrooms, and Bathrooms) are highly correlated with Selling Price?
b. Create four scatterplots to show how the other four variables are related to Selling Price. In each, Selling Price should be on the Y axis. Are these in line with the correlations in part a?
c. You might think of the difference, Selling Price minus Appraised Value, as the "error" in the appraised value, in the sense that this difference is how much more or less the house sold for than the appraiser expected. Find the correlation between this difference and Selling Price, and find the correlation between the absolute value of this difference and Selling Price. If either of these correlations is reasonably large, what is it telling us?

Level B

29. The file P03_29.xlsx contains monthly prices of four precious metals: gold, silver, platinum, and palladium. The question to be explored here is whether changes in these commodities move together through time.
a. Create time series graphs of the four series. Do the series appear to move together?
b. Create four new
difference variables, one for each metal. Each should list this month's price minus the previous month's price. Then create time series graphs of the differences. Note that there will be missing data for Jan 97 because the Dec 96 prices are not listed. Also, because the source for this data set listed prices for platinum and palladium through Nov 08 only, there will be missing data at the end of these series.
c. Create a table of correlations between the differences created in part b. Based on this table, comment on whether the changes in the prices of these metals tend to move together over time.
d. For all correlations in part c above 0.6, create the corresponding scatterplots of the differences (for example, gold differences versus silver differences). Do these, along with the time series graphs from parts a and b, provide a clearer picture of how these series move together over time? Discuss in some detail.
e. Check with your own formulas (using Excel's CORREL function) that StatTools uses data through Dec 09 for the correlation between gold and silver, but it uses data through Nov 08 for correlations between gold and platinum. That is, check that StatTools uses all of the available data for either correlation.

30. The file P03_30.xlsx contains monthly data on exchange rates of various currencies versus the US dollar. It is of interest to financial analysts and economists to see whether exchange rates move together through time. You could find the correlations between the exchange rates themselves, but it is often more useful with time series data to check for correlations between differences from month to month.
a. Create a column of differences for each currency. For example, the difference corresponding to Jan 06 will be blank for each currency because the Dec 05 value isn't listed, but the difference for euros in Feb 06 will be 0.8375 - 0.8247. Create a table of correlations between all of the original variables. Then, on the same sheet, create a second table of correlations between the difference variables.
On this same sheet, enter two cutoff values, one positive such as 0.6 and one negative such as -0.5, and use conditional formatting to color all correlations (in both tables) above the positive cutoff green and all correlations below the negative cutoff red. Do it so that the 1s on the diagonal are not colored. Based on the second table and your coloring, can you conclude that these currencies tend to move together in the same direction? If not, what can you conclude? Can you explain how two currencies like the Chinese yuan and the British pound can be fairly highly negatively correlated, whereas the correlation between their differences is essentially zero? Would you conclude that these two currencies "move together"? (Hint: There is no easy answer, but scatterplots and time series graphs for these two currencies and their differences are revealing.)

31. The file P02_35.xlsx contains data from a survey of 500 randomly selected (fictional) households.
a. Create a table of correlations between the last five variables (First Income to Debt). On the sheet with these correlations, enter a "cutoff" correlation such as 0.5 in a blank cell. Then use conditional formatting to color green all correlations in the table at least as large as this cutoff, but don't color the 1s on the diagonal. The coloring should change automatically as you change the cutoff. This is always a good idea for highlighting the "large" correlations in any correlations table. When you create the table of correlations, you are warned about the missing values for Second Income. Do some investigation to see how StatTools deals with missing values and correlations. There are two basic possibilities (and both of these are options in some software packages). First, it could delete all rows that have missing values for any variables and then calculate all of the correlations based on the remaining data. Second, when it creates the correlation for any pair of variables, it could (temporarily) delete only the rows that have missing data for these
two variables and then calculate the correlation on what remains for these two variables. Why would you prefer the second option? How does StatTools do it?

32. We have indicated that if you have two categorical variables and you want to check whether they are related, the best method is to create a crosstabs, possibly with the counts expressed as percentages. But suppose both categorical variables have only two categories and these variables are coded as dummy (0-1) variables. Then there is nothing to prevent you from finding the correlation between them with the same Equation (3.2) from this section. However, if we let $C_{ij}$ be the count of observations where the first variable has value $i$ and the second variable has value $j$, there are only four joint counts that can have any bearing on the relationship between the two variables: $C_{00}$, $C_{01}$, $C_{10}$, and $C_{11}$. Let $C_1(1)$ be the count of 1s for the first variable and let $C_2(1)$ be the count of 1s for the second variable. Then it is clear that $C_1(1) = C_{10} + C_{11}$ and $C_2(1) = C_{01} + C_{11}$, so $C_1(1)$ and $C_2(1)$ are determined by the joint counts. It can be shown algebraically that the correlation between the two 0-1 variables is

$$\frac{n\,C_{11} - C_1(1)\,C_2(1)}{\sqrt{C_1(1)\,\bigl(n - C_1(1)\bigr)}\;\sqrt{C_2(1)\,\bigl(n - C_2(1)\bigr)}}$$

where $n$ is the total number of observations. To illustrate this, the file P03_32.xlsx contains two 0-1 variables. (The values were generated fairly randomly.) Create a crosstabs to find the required counts, and use the above formula to calculate the correlation. Then use StatTools (or Excel's CORREL function) to find the correlation in the usual way. Do your two results match?
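The check described in this problem can also be sketched outside Excel. The following Python snippet is only an illustration under stated assumptions: the two 0-1 variables are randomly generated here (they are not the values in P03_32.xlsx), and the helper names phi_from_counts and pearson are hypothetical. It computes the correlation once from the four joint counts and once directly from the raw data, so the two results can be compared.

```python
import math
import random


def phi_from_counts(c00, c01, c10, c11):
    """Correlation between two 0-1 variables, computed from the four joint counts."""
    n = c00 + c01 + c10 + c11      # total number of observations
    c1 = c10 + c11                 # count of 1s for the first variable
    c2 = c01 + c11                 # count of 1s for the second variable
    numerator = n * c11 - c1 * c2
    denominator = math.sqrt(c1 * (n - c1)) * math.sqrt(c2 * (n - c2))
    return numerator / denominator


def pearson(xs, ys):
    """Ordinary correlation computed directly from the raw data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up 0-1 data for illustration only (assumption: 200 observations).
random.seed(1)
x = [random.randint(0, 1) for _ in range(200)]
y = [random.randint(0, 1) for _ in range(200)]

# Build the crosstabs: counts[(i, j)] is the number of rows with x == i and y == j.
counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
for a, b in zip(x, y):
    counts[(a, b)] += 1

r_formula = phi_from_counts(counts[(0, 0)], counts[(0, 1)],
                            counts[(1, 0)], counts[(1, 1)])
r_usual = pearson(x, y)
print(f"from counts: {r_formula:.6f}   from raw data: {r_usual:.6f}")
```

The two printed values should agree up to rounding error, which is the point of the exercise: for 0-1 variables, the crosstabs counts contain everything the correlation uses.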
Nevertheless, we do not necessarily recommend finding correlations between 0-1 variables. A crosstabs is more meaningful and easier to interpret.

3.5 PIVOT TABLES

We now look at one of Excel's most powerful and easy-to-use tools, the pivot table. This tool provides an incredible amount of useful information about a data set. Pivot tables allow you to "slice and dice" data in a variety of ways. That is, they break the data down by categories so that you can see, for example, average sales by gender, by region of country, by time of day, or any combination of these. Sometimes pivot tables are used to display counts, such as the number of customers broken down by gender and region of country. These tables of counts, which are often called crosstabs or contingency tables, have been used by statisticians for years. However, Excel provides more variety and flexibility with its pivot tables than most statistical software packages have traditionally provided with their crosstabs options. In particular, crosstabs typically list only counts, whereas pivot tables can list counts, sums, averages, and other summary measures.[7]

It is easiest to understand pivot tables by means of examples, so we illustrate several possibilities in the following example.

[7] To be fair, many other statistical software packages, such as SPSS and SAS, now emulate Excel pivot tables.

EXAMPLE 3.4 EXAMINING CUSTOMER ORDERS AT ELECMART

The file Elecmart Sales.xlsx (see Figure 3.27) contains data on 400 customer orders during a period of several months for Elecmart, a fictional company. This is a typical data set where pivot tables can be used to gain useful information. There are several categorical variables and several numerical variables. The categorical variables include the day of week, time of day, region of country, type of credit card used, gender of customer, and buy category of the customer (high, medium, or low) based on previous behavior. Even the date variable can be treated as a categorical variable. The numerical variables include the number of items ordered, the total cost of the order, and the price of the highest-priced item purchased. The manager of Elecmart wants to summarize the data so that she can understand the buying patterns of her customers. How can she use pivot tables to gain useful information?

Objective: To use pivot tables to break down the customer order data by a number of categorical variables.

Figure 3.27 Elecmart Data

Date    Day    Time       Region     CardType  Gender  BuyCategory  ItemsOrdered  TotalCost  HighItem
6 Mar   Mon    Morning    West       ElecMart  Female  High         4             136.97     79.97
6 Mar   Mon    Morning    West       Other     Female  Medium       1             25.55      25.55
6 Mar   Mon    Afternoon  West       ElecMart  Female  Medium       5             113.95     90.47
6 Mar   Mon    Afternoon  NorthEast  Other     Female  Low          1             6.82       6.82
6 Mar   Mon    Afternoon  West       ElecMart  Male    Medium       4             147.32     83.21
6 Mar   Mon    Afternoon  NorthEast  Other     Female  Medium       5             142.15     50.90
7 Mar   Tues   Evening    West       Other     Male    Low          1             18.65      18.65
7 Mar   Tues   Evening    South      Other     Male    High         4             178.34     161.93
7 Mar   Tues   Evening    West       Other     Male    Low          2             25.83      15.91
8 Mar   Wed    Morning    MidWest    Other     Female  Low          1             18.13      18.13
8 Mar   Wed    Morning    NorthEast  ElecMart  Female  Medium       2             54.52      54.38
8 Mar   Wed    Afternoon  South      Other     Male    Medium       2             61.93      56.32
9 Mar   Thurs  Morning    NorthEast  ElecMart  Male    High         3             147.68     96.64
9 Mar   Thurs  Afternoon  NorthEast  Other     Male    Low          1             27.24      27.24
10 Mar  Fri    Morning    West       Other     Female  Low          3             46.18      44.27
10 Mar  Fri    Afternoon  West       Other     Male    Low          5             107.44     91.64

Pivot tables are perfect for breaking down data by categories. We call this "slicing and dicing" the data.

Solution

Before we dive into the details, we first preview the results you can obtain. Pivot tables are useful for breaking down numerical variables by categories, or for counting observations in categories and possibly expressing the counts as percentages. So, for example, you might want to see how the average total cost for females differs from the similar average for males. Or you might simply want to see the percentage of the 400 sales made by females. Pivot tables allow you to find such averages and percentages easily.

Actually, you could find such averages or percentages without using pivot tables. For example, you could sort on gender and then find the average of the Female rows and the average of the Male rows. However, this takes time, and more complex breakdowns are even more difficult and time-consuming. They are all easy and quick with pivot tables. Besides that, the resulting tables can be accompanied with corresponding charts that require virtually no extra effort to create. Pivot tables are a manager's dream. Fortunately, with Excel, they are also a manager's reality.[8]

We begin by building a pivot table to find the sum of TotalCost broken down by time of day and region of country. Although we show this in a number of screen shots, just to help beginners get the knack of it, the process takes only a few seconds after you gain some experience with pivot tables.

To start, click on the PivotTable button at the far left on the Insert ribbon (see Figure 3.28). This produces the dialog box in Figure 3.29. The top section allows you to specify the table or range that contains the data. (You can also specify an external data source, but we will not cover this option here.) The bottom section allows you to select the location where you want the results to be placed. If you start with the cursor inside the data set, Excel's guess for the table or range is usually correct, although you can override it if necessary. Make sure the range selected for this example is A1:J401. This selected
range should always include the variable names at the top of each column. Then click on OK. Note that with these settings, the pivot table will be placed in a new worksheet with a generic name such as Sheet1. We recommend that you rename it to something like PivotTable1.

[8] One Excel 2007 book, by Bill Jelen (known as Mr. Excel), claims that although pivot tables have been around for years and represent Excel's arguably most powerful tool, they are used by only about 10% of business people. Fortunately, you will be in that 10%.

Figure 3.28 PivotTable Button on the Insert Ribbon

Figure 3.29 Create PivotTable Dialog Box

This produces a blank pivot table, as shown in Figure 3.30. Also, assuming the cursor is within this blank pivot table, the PivotTable Tools "super tab" is selected. This super tab has two ribbons, Options and Design. The Options ribbon appears in Figure 3.31, and the Design ribbon appears in Figure 3.32. Each of these has a variety of buttons for manipulating pivot tables, some of which we will explore shortly. Finally, the PivotTable Field List window in Figure 3.33 is visible. By default, it is docked at the right of the screen, but you can move it if you like.

Figure 3.30 Blank Pivot Table

Figure 3.31 PivotTable Options Ribbon

Figure 3.32 PivotTable Design Ribbon
Figure 3.33 PivotTable Field List Window

Note that the two pivot table ribbons and the pivot table field list window are visible only when the active cell is inside a pivot table. If you click outside the pivot table, say, in cell D1, all three of these will disappear. Don't worry. You can get them back by clicking anywhere inside the pivot table.

If you have used pivot tables in a previous version of Excel, the blank pivot table in Figure 3.30 will look different. Here are two things to be aware of. First, if you open a file in the old .xls format (Excel 2003 or earlier) and go through the same steps as above, you will get an "old style" pivot table, as shown in Figure 3.34. Second, if you prefer the old style, especially for dragging and dropping, Excel 2007 lets you revert to it. To do so, right-click on the pivot table, select PivotTable Options, click on the Display tab, and check the Classic PivotTable layout option (see Figure 3.35). You can use the new layout or the old one.

The pivot table "look" changed considerably in Excel 2007, but the functionality is virtually the same.

Figure 3.34 Old Style Blank Pivot Table

Figure 3.35 Switching to Classic PivotTable Layout

Excel applies two rules to variables checked at the top of the Field List window:
1. When you check a text variable or a date variable in the field list, it is added to the Row Labels area.
2. When you check a numerical variable in the field list, it is added to the Values area and summarized with the Sum function.

[9] In discussing pivot tables, Microsoft uses the term field rather than variable, so we will do so as well.

This is exactly what happens when you check Time, Region, and TotalCost. However, this is just the beginning. With very little work, you can do a lot more. Some of the possibilities are explained in the remainder of this example.

First, however, we discuss the new "look" of pivot tables in Excel 2007. Notice that the pivot table in Figure 3.36 has both row fields, Time and Region, in column A. This wasn't possible in old style pivot tables, where the two row fields would have been in separate columns. Microsoft decided to offer this new layout because of its clean, streamlined look. In fact, you can now choose from three layouts: Compact, Outline, or Tabular. These are available from the Report Layout dropdown list on the Design ribbon. When you create a pivot table in an .xlsx file, you get the compact layout by default. If you would rather have the tabular or outline layout, it is easy to switch to them. In particular, the tabular layout, shown in Figure 3.37, is closer to what was used in previous versions of Excel. (Outline layout, not shown here, is very similar to tabular layout except for the placement of its subtotals.)

Starting with Excel 2007, there are three different layouts for pivot tables, but the differences are relatively minor. Ultimately, it is a matter of taste.

Figure 3.37 Sum of TotalCost by Time and Region, Tabular Layout

One significant advantage to using tabular (or outline) layout instead of compact layout is that you can see which fields are in the row and column areas. Take another look at the pivot table in Figure 3.36. It is pretty obvious that categories such as Afternoon and Morning have to do with time of day and that categories such as Midwest and South have to do with region of country. However, there are no labels that explicitly name the row fields. In contrast, the tabular layout in Figure 3.37 names them explicitly. Still, you can choose the layout you prefer.

Hiding Categories (Filtering)

The pivot table in Figure 3.36 shows all times of day for all regions, but this is not necessary. You can filter out any of the times or regions you don't want to see. To understand how this works, make sure the Options ribbon is visible. In the Active Field group, you will notice that one of the fields is designated as the active field. The active field corresponds to the location of your cursor. If your cursor is on a Time category such as Evening, then Time is the active field. If your cursor is on a Region category such as NorthEast, then Region is the active field. If your cursor is on any of the numbers, then Sum of TotalCost is the active field.

Once you understand the active field concept, then the way Excel implements filtering makes sense.
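As a cross-check outside Excel, the breakdown behind Figures 3.36 and 3.37 (the sum of TotalCost by Time and Region, with a subtotal for each time of day and a grand total) can be sketched in a few lines of Python. This is only a minimal illustration, not how Excel builds its pivot tables: the orders list holds a handful of made-up rows shaped like the Elecmart data, and the helper name pivot_sum is hypothetical.

```python
from collections import defaultdict

# Made-up orders for illustration only; the real file has 400 rows and ten fields.
orders = [
    {"Time": "Morning", "Region": "West", "TotalCost": 120.00},
    {"Time": "Morning", "Region": "South", "TotalCost": 80.00},
    {"Time": "Afternoon", "Region": "West", "TotalCost": 50.00},
    {"Time": "Afternoon", "Region": "West", "TotalCost": 25.50},
    {"Time": "Evening", "Region": "South", "TotalCost": 40.00},
]


def pivot_sum(rows, row_field, sub_field, value_field):
    """Sum value_field for every (row_field, sub_field) combination found in rows."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_field]][row[sub_field]] += row[value_field]
    return table


table = pivot_sum(orders, "Time", "Region", "TotalCost")

# Print one line per Time/Region pair, a subtotal per Time, and a grand total,
# roughly in the spirit of the tabular layout. Labels are sorted alphabetically.
grand_total = 0.0
for time in sorted(table):
    subtotal = sum(table[time].values())
    for region in sorted(table[time]):
        print(f"{time:<12}{region:<12}{table[time][region]:>10.2f}")
    print(f"{time + ' Total':<24}{subtotal:>10.2f}")
    grand_total += subtotal
print(f"{'Grand Total':<24}{grand_total:>10.2f}")
```

Sorting the labels alphabetically puts Afternoon before Evening and Morning, which is also the order the pivot table uses for its row labels.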
If Time is the active eld and you click on the Row Labels dropdown arrow you see the dialog box in Figure 338 To see data only for Afternoon and Morning for example uncheck the Select All item and then check the Aftemoon and Morning items Similarly if Region is the active eld and you click on the Row Labels dropdown arrow you can check which regions you want to lter on If you are in tabular layout it is more straight forward because each row eld then has its own dropdown list For example the pivot table in Figure 339 is obtained by ltering out the Evening and NorthEast categories Note how the lter symbols replace the arrows in row 3 to indicate that some categories have been l tered out Also note that the updated subtotals for Moming and Aftemoon and the updated grand total for all categories do not include the hidden categories10 Figure 338 Filtering on Time Figure 339 Pivot Table with Hidden Categories 5 elenli irfielrt Tirrie V El sari is ten 2 it sg rt 2 tr A are Smart rtiens glear Filter Frern quotTimequot aIe Filters P Eeilue Filters Ir lill Select All lquotlernnnui39i p Evening 0V lrlerning 5 e C D l E l p rirwe ElFissiUquot l susm or llmli srast K at Eli lltEfn39U Dn Mid West 313316 1 s l South siresir2 V pl West 1ees4 A Artern uan iiotel 1einse2 1 El EiMerning Midwest BELIquot22 pl South sesses I 141 pl west sssass I lJllFu39IilJrnirig rmtal 1334214 A is i usirsiitailEiiitsi 3 1 You have probably noticed that the dialog box in Figure 338 is exactly like the one for Excel tables discussed in the previous chapter This is no accident You already learned how to lter tables so there is nothing new to learn for filtering pivot tables 35 Pivot Tables I 2 I Copyright 2010 Cengage Leaming All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall leaming 
Sorting on Values or Categories

It is easy to sort in a pivot table, either by the numbers in the Values area of the table or by the labels in a Row or Column field. To sort by the numbers in the Values area, right-click on any number and choose the Sort item. If a simple A-Z or Z-A sort isn't enough, you can use the More Sort Options item. For example, this allows you to sort on the column of numbers that contains the selected cell or on the row of numbers that contains this cell.

To sort on the labels of a Row or Column field, you can again right-click on any of the categories, such as Morning, and select Sort. Alternatively, you can click on the dropdown arrow for the field, such as Time in Figure 3.39, and you will get the dialog box in Figure 3.38 that allows both sorting and filtering. However, be aware that sorting on labels is always in alphabetical or reverse alphabetical order. This is not always what you want. For example, suppose you want the natural sort order: Morning, Afternoon, Evening. This isn't the A-Z or Z-A order, but it is still possible to sort manually. The trick is to select the cell of some label such as Morning and place the cursor on the border of the cell so that it becomes a four-sided arrow. Then you can drag the label up or down, or to the left or right. It takes a little practice, but it isn't difficult.

Changing Locations of Fields (Pivoting)

Changing the locations of fields in pivot tables has always been easy, but the new user interface introduced in Excel 2007 makes it even easier. We favor dragging the fields to the various areas, but you can experiment with the various options.

Figure 3.40 Placing Region in the Column Labels Area

Starting with the pivot table in Figure 3.39, you can choose where to place either Time or Region; it does not have to be in the Row area. To place the Region variable in the Column area, for example, drag the Region button from
the Row Labels area of the Field List window to the Column Labels area. The pivot table changes automatically, as shown in Figure 3.40. (We removed the filters on Time and Region.)

Alternatively, you can categorize by a third field such as Day and locate it in a different area. As before, if you check Day in the Field List window, it goes to the Row area by default, but you can then drag it to another area. The pivot table in Figure 3.41 shows the result of placing Day in the Report Filter area. By clicking on the dropdown arrow in row 1, you can then show the pivot table for all days or any particular day. In fact, there is now a Show Multiple Items option you can check. (This option wasn't available before Excel 2007.) We checked this option and then selected Friday and Saturday to obtain the pivot table in Figure 3.41. It reports data only for Fridays and Saturdays. This ability to categorize by multiple fields and rearrange the fields as you like is a big reason why pivot tables are so powerful and useful, and easy to use.

Changing Field Settings

Depending on which field is the active field, you can change various settings in the Field Settings dialog box. You can get to this dialog box in at least two ways. First, there is a Field Settings button on the Options ribbon. Second, you can right-click on any of the pivot
table cells and select the Field Settings item. The field settings are particularly useful for fields in the Values area, as we now explain.

Figure 3.41 Placing Day in the Report Filter Area and Filtering on Day

The key to summarizing the data the way you want it summarized is the Value Field Settings dialog box. Get used to it because you will use it often.

Figure 3.42 Value Field Settings Dialog Box

Figure 3.43 Pivot Table with Average of TotalCost

For now, right-click on any number in the pivot table in Figure 3.41 and select Value Field Settings to obtain the dialog box in Figure 3.42. This allows you to choose which way you want to summarize the TotalCost variable: by Sum, Average, Count, or several others. You can also click on the Number Format button to choose from the usual number formatting options, and you can click on the Show Values As tab to display the data in various ways (more on this later). If you choose Average and format as currency with two decimals, the resulting pivot table appears as in Figure 3.43. Now each number is the
average of TotalCost for all orders in its category combination. For example, the average of TotalCost for all Friday and Saturday morning orders in the South is $107.69, and the average of all Friday and Saturday orders in the South is $109.10.

Pivot Charts

Pivot charts are a great extension of pivot tables. They not only tell the story graphically, but they update automatically when you rearrange the pivot table.

It is easy to accompany pivot tables with pivot charts. These charts are not just typical Excel charts; they adapt automatically to the underlying pivot table. If you make a change to the pivot table, such as pivoting the Row and Column fields, the pivot chart makes the same change automatically. To create a pivot chart, click anywhere inside the pivot table, select the PivotChart button on the Options ribbon (see Figure 3.31), and select a chart type. That's all there is to it. The resulting pivot chart (using the default column chart option) for the pivot table in Figure 3.43 appears in Figure 3.44. If you decide to pivot the Row and Column fields, the pivot chart changes automatically, as shown in Figure 3.45.

Figure 3.44 Pivot Chart Based on Pivot Table

Figure 3.45 Pivot Chart after Pivoting Row and Column Fields

Note that the categories on the horizontal axis are always based on
the Row field, and the categories in the legend are always based on the Column field.

Note that when you activate a pivot chart, the PivotTable Tools super tab changes to PivotChart Tools. This super tab includes four ribbons for manipulating pivot charts: Design, Layout, Format, and Analyze (see Figure 3.46). There is not enough space here to discuss the many options on these ribbons, but they are intuitive and easy to use. As usual, don't be afraid to experiment.

Figure 3.49 Rearranged Pivot Table with Two Values Fields

... Region button. You can experiment with these options, but we
tend to prefer option 2, which leads to the pivot table in Figure 3.49.

In a similar manner, you can experiment with the buttons in the Values area. However, the effect here is less striking. If you drag the Sum of TotalCost button above the Average of TotalCost button in the field list, the effect is simply to switch the ordering of these summaries in the pivot table, as shown in Figure 3.50.

Figure 3.50 Another Rearrangement of the Pivot Table with Two Values Fields

Summarizing by Count

The variable in the Values area, whatever it is, can be summarized by the Count function. This is useful when you want to know, for example, how many of the orders were placed by females in the South. When summarizing by Count, the key is to understand that the actual variable placed in the Values area is irrelevant, so long as you summarize it by the Count function. To illustrate, start with the pivot table in Figure 3.50, where TotalCost is summarized with the Sum function. Now right-click on any number in the pivot table, select Value Field Settings, and select the Count function (see Figure 3.51). The default Custom Name you will see in this dialog box, Count of TotalCost, is misleading because TotalCost has nothing to do with the
counts obtained. Therefore, we like to change this Custom Name label to Count, as shown in the figure. The resulting pivot table, with values formatted as numbers with zero decimals, appears in Figure 3.52. For example, 27 of the 400 orders were placed in the morning in the South, and 115 of the 400 orders were placed in the NorthEast. (Do you now see why the counts have nothing to do with TotalCost?) This type of pivot table, with counts for various categories, is the same as the crosstabs that we discussed in Section 3.2. However, it has now been created much more easily with a pivot table.

Figure 3.51 Field Settings Dialog Box with Count Selected

Figure 3.52 Pivot Table with Counts

Counts can be displayed in a number of ways. You should choose the way that best answers the question you are asking.

When data are summarized by counts, there are a number of ways they can be displayed. The pivot table in Figure 3.52 shows raw counts. Depending on the type of information you want, it might be more useful to display the counts as percentages. Three particular options are typically chosen: as percentages of total, as percentages of row, and as percentages of column. When shown as percentages of total, the percentages in the table sum
to 100%; when shown as percentages of row, the percentages in each row sum to 100%; when shown as percentages of column, the percentages in each column sum to 100%. Each of these options can be useful, depending on the question you are trying to answer. For example, if you want to know whether the daily pattern of orders varies from region to region, showing the counts as percentages of column is useful so that you can compare columns. But if you want to see whether the regional ordering pattern varies by time of day, showing the counts as percentages of row is useful so that you can compare rows.

To display the counts as percentages of some type, display the Value Field Settings dialog box (remember how?), select the Show Values As tab, and select the appropriate option (see Figure 3.53). The resulting pivot table and corresponding pivot chart appear in Figure 3.54. As you can see, the pattern of regional orders varies somewhat by time of day.

Figure 3.53 Value Field Settings Dialog Box with Show Values As Options

Figure 3.54 Pivot Table and Pivot Chart with Counts As Percentages of Rows

Sometimes it is useful to see the raw counts and the percentages. This can be done easily by dragging any variable to the Values area, summarizing it by Count, and displaying it as Normal. Figure 3.55 shows one possibility, where we have changed the custom names of the two Count variables to make them more meaningful. Alternatively, the counts and percentages could be shown in two separate pivot tables.

Figure 3.55 Pivot Table with Percentages of Rows and Raw Counts
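The count-based displays discussed in this subsection can also be sketched in Python, assuming the pandas library. The toy data and variable names below are hypothetical and are not the Elecmart data; the sketch only illustrates raw counts and the three percentage views (of total, of row, of column), expressed as fractions that sum to 1 rather than 100%.

```python
import pandas as pd

# Hypothetical toy orders (one row per order); NOT the Elecmart data.
orders = pd.DataFrame({
    "Region": ["South", "South", "South", "West", "West",
               "MidWest", "MidWest", "MidWest"],
    "Time": ["Morning", "Evening", "Evening", "Morning", "Afternoon",
             "Afternoon", "Afternoon", "Morning"],
})

# Raw counts. No Values variable is supplied at all, which mirrors the
# point that the variable being counted is irrelevant.
counts = pd.crosstab(
    orders["Region"], orders["Time"], margins=True, margins_name="Grand Total"
)

# The three percentage displays, as fractions of 1.
pct_of_total = pd.crosstab(orders["Region"], orders["Time"], normalize="all")
pct_of_row = pd.crosstab(orders["Region"], orders["Time"], normalize="index")
pct_of_column = pd.crosstab(orders["Region"], orders["Time"], normalize="columns")

print(counts)
print((pct_of_row * 100).round(2))
```

Which normalization to use depends on the comparison you want to make: `normalize="index"` makes each row sum to 1 so that rows can be compared, while `normalize="columns"` does the same for columns.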
Grouping

Finally, categories in a Row or Column variable can be grouped. This is especially useful when a Row or Column variable has many distinct values. Because a pivot table creates a row or column for each distinct value, the results can be unwieldy. We present two possibilities.

First, suppose you want to break Sum of TotalCost down by Date. Starting with a blank pivot table, check both Date and TotalCost in the pivot table field list window. This creates a separate row for each distinct date in the data set: 112 separate dates. This is too much detail, so it is useful to group on the Date variable. To do so, right-click on any date in column A and select the Group item. (Group options are also available on the Options ribbon.) Accept the default selections in the Grouping dialog box (see Figure 3.56) to obtain the pivot table in Figure 3.57.

Figure 3.56 Grouping Dialog Box

Figure 3.57 Pivot Table after Grouping by Month

Pivot Table Tip: Suppose you have multiple years of data and you would like a monthly grouping such as January 2007 through December 2009. If you simply select Months as in Figure 3.56, all of the Januaries, for example, will be lumped together. The trick is to select both Months and Years in the dialog box.

As a second possibility for grouping, suppose you want to see how the average of TotalCost varies by the amount of the highest priced item in the order. Place TotalCost in the Values area, summarized by Average, and place HighItem in the Row area. Because HighItem has nearly 400
distinct values, the resulting pivot table is virtually worthless. Again, however, the trick is to group on the Row variable. This time there are no natural groupings as there are for a date variable, so it is up to you to create the groupings. Excel provides a suggestion, as shown in Figure 3.58, but you can override it. For example, changing the bottom entry to 50 leads to the pivot table in Figure 3.59. Some experimentation is typically required to obtain the grouping that presents the results in the most appropriate way.

Figure 3.58 Grouping Dialog Box for a Non-Date Variable

Figure 3.59 Pivot Table after Grouping by 50 on HighItem

By now, we have illustrated the pivot table features that are most commonly used. Be aware, however, that there are many more features available. These include (but are not limited to) the following:

- Showing/hiding subtotals and grand totals (check the Layout options on the Design ribbon)
- Dealing with blank rows, that is, categories with no data (right-click on any number, choose PivotTable Options, and check the options on the Layout & Format tab)
- Displaying the data behind a given number in a pivot table (double-click on the number to get a new worksheet)
- Formatting a pivot
table with various styles (check the style options on the Design ribbon)
- Sorting pivot tables in various ways (check the Sort options on the Options ribbon)
- Moving or renaming pivot tables (check the PivotTable and Action options on the Options ribbon)
- Refreshing pivot tables as the underlying data changes (check the Refresh dropdown list on the Options ribbon)
- Creating pivot table formulas for calculated fields or calculated items (check the Formulas dropdown list on the Options ribbon)
- Basing pivot tables on external databases (not covered here)

Not only are these (and other) features available, but Excel usually provides more than one way to implement them. The suggestions above are just some of the ways they can be implemented. The key to learning pivot table features is to experiment. There are entire books written on pivot tables, but we don't recommend them. You can learn a lot more, and a lot more quickly, by experimenting with data such as the Elecmart data. Don't be afraid to mess up. Pivot tables are very forgiving, and you can always start over.

We complete this section by providing one last quick example to illustrate how pivot tables can answer business questions very quickly.

EXAMPLE 3.5 FROZEN LASAGNA DINNERS

The file Lasagna Triers.xlsx contains data on over 800 potential customers being tracked by a (fictional) company that has been marketing a frozen lasagna dinner. The file contains a number of demographics on these customers, as indicated in Figure 3.60: their age, weight, income, pay type, car
value, credit card debt, gender, whether they live alone, dwelling type, monthly number of trips to the mall, and neighborhood. It also indicates whether they have tried the company's frozen lasagna. The company wants to understand why some potential customers are triers and others are not. Does gender make a difference? Does income make a difference? In general, what distinguishes triers from nontriers? How can the company use pivot tables to explore these questions?

Figure 3.60 Lasagna Trier Data

Person  Age  Weight  Income  PayType   CarValue  CCDebt  Gender  LiveAlone  DwellType  MallTrips  Nbhd   HaveTried
1       48   175     65500   Hourly    2190      3510    Male    No         Home       7          East   No
2       33   202     29100   Hourly    2110      740     Female  No         Condo      4          East   Yes
3       51   188     32200   Salaried  5140      910     Male    No         Condo      1          East   No
4       56   244     19000   Hourly    700       1620    Female  No         Home       3          West   No
5       28   218     81400   Salaried  26620     600     Male    No         Apt        3          West   Yes
6       51   173     73000   Salaried  24520     950     Female  No         Condo      2          East   No
7       44   182     66400   Salaried  10130     3500    Female  Yes        Condo      6          West   Yes
8       29   189     46200   Salaried  10250     2860    Male    No         Condo      5          West   Yes
9       28   200     61100   Salaried  17210     3180    Male    No         Condo      10         West   Yes
10      29   209     9800    Salaried  2090      1270    Female  Yes        Apt        7          East   Yes
11      29   171     46600   Salaried  16350     5520    Male    Yes        Home       11         West   Yes
12      30   243     24500   Salaried  5410      300     Male    No         Home       3          West   Yes
13      62   246     110900  Salaried  8410      730     Male    Yes        Condo      7          West   Yes
14      29   228     37200   Salaried  6420      700     Male    Yes        Apt        3          East   Yes
15      40   230     21800   Hourly    3230      1650    Male    No         Home       4          East   Yes
16      61   185     28900   Hourly    1300      1030    Male    Yes        Apt        2          South  No

Objective: To use pivot tables to explore which demographic variables help to distinguish lasagna triers from nontriers.

Solution

The key is to set up a pivot table that shows counts of triers and nontriers for different categories of any of the potential explanatory variables. For example, one such pivot table shows the percentages of triers and
nontriers for males and females separately. If the percentages are different for males and for females, the company will know that gender has an effect. On the other hand, if the percentages for males and females are about the same, the company will know that gender does not make much of a difference.

Pivot tables with counts in the Values area are a great way of discovering which variables have the largest effect on a Yes/No variable.

Figure 3.61 Pivot Table and Pivot Chart for Examining the Effect of Gender

You should set up the typical pivot table as shown in Figure 3.61. The Row variable is any demographic variable you want to investigate, in this case Gender. The Column variable is HaveTried (Yes or No). The Values variable can be any variable, so long as it is expressed as a count. Finally, it is useful to show these counts as percentage of row. This way you can easily look down column C to see whether the percentage in one category (Female) who have tried the product is any different from the percentage in another category (Male) who have tried the product. Specifically, males are somewhat more likely to try the product than females: 60.92% versus 54.27%. This is also apparent from the associated pivot
chart.

Figure 3.62 Pivot Table and Pivot Chart for Examining the Effect of LiveAlone

Once this generic pivot table and associated pivot chart are set up, you can easily explore other demographic variables by swapping them for Gender. For example, Figure 3.62 indicates that people who live alone are (not surprisingly) much more likely to try this frozen microwave product than people who don't live alone. As another example, Figure 3.63 indicates that people with larger incomes are slightly more likely to try the product. There are two things to note about this income pivot table. First, because there are so many individual income values, grouping is useful. You can experiment with the grouping to get the most meaningful results. Second, you should be a bit skeptical about the last group, which has 100% triers. It is possible that there are only one or two people in this group. (It turns out that there are four.) For this reason, it is a good idea to show two pivot tables of the counts, one showing percentage of row and one showing the raw counts. This second pivot table is shown at the bottom of Figure
3.63.

Figure 3.63 Pivot Table and Pivot Chart for Examining the Effect of Income

The problem posed in this example is a common one in real business situations. One variable indicates whether people are in one group or another (triers or nontriers), and there are a lot of other variables that could potentially explain why some people are in one group and others are in the other group. There are a number of rather sophisticated techniques for attacking this classification problem, most of which are beyond the level of this book. However, you can go a long way toward understanding which variables are important by the simple pivot table method illustrated here.

CHANGES IN EXCEL 2010

Microsoft has made the already user-friendly pivot tables even friendlier in Excel 2010 with the addition of slicers. These are essentially lists of the distinct values of any variable, which you can then filter on. You add a slicer from the PivotTable Tools Options ribbon. For example, in the
With the Elecmart sales data, you can choose Region as a slicer. You then see a list on the screen with a button for each possible value: Midwest, Northeast, South, and West. You can then click any combination of these buttons to filter on the chosen regions. Note that a slicer variable does not have to be part of the pivot table. For example, if you are showing sum of TotalCost, and Region is not part of the pivot table, a Region slicer will still filter sum of TotalCost for the regions selected. On the other hand, if Region is already in the row area, say, you can filter on it through the slicer. In this case, clicking on regions from the slicer is equivalent to filtering on the same regions in the row area. Basically, the slicers have been added as a convenience to users. They make filtering easier and more transparent.

As an example, Figure 3.64 shows a pivot table accompanied by two slicers. The row field is Time, which has been filtered in the usual way (through the dropdown list in the row area) to show only Afternoon and Evening. The two slicers appear next to the pivot table. The Region slicer has been filtered on Midwest and South, and the Gender slicer has been filtered on Male. So, for example, the sum of TotalCost for all sales in the evening by males in the Midwest and South is $3824.03.

Figure 3.64 Pivot Table with Slicers

PROBLEMS

Level A

33. Solve problem 1 with pivot tables and create corresponding pivot charts. Express the counts as percentages of row. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of column.

34. Solve problem 2 with pivot tables and create corresponding pivot charts. Express the counts as percentages of row. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of column.

35. Solve problem 3 with pivot tables and create corresponding pivot charts. Express the counts as percentages of row. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of column.

36. Solve problem 4 with pivot tables and create corresponding pivot charts. Express the counts as percentages of row. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of column.

37. Solve problem 7 with pivot tables and create corresponding pivot charts. However, find only means and standard deviations, not medians or quartiles. (This is one drawback of pivot tables. Medians, quartiles, and percentiles are not in the list of summary measures.)

38. Solve problem 8 with pivot tables and create corresponding pivot charts. However, find only means and standard deviations, not medians. (This is one drawback of pivot tables. Medians are not among their summary measures.)

39. Solve problem 9 with pivot tables and create corresponding pivot charts. However, find only means and standard deviations, not medians. (This is one drawback of pivot tables. Medians are not among their summary measures.)
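The slicer behavior described in this section (filtering a summary on a field that need not appear in the pivot table itself) can be mimicked outside Excel. The following is a minimal Python sketch, not Excel's implementation. The `pivot_sum` helper and the `SALES` records are hypothetical: the field names are borrowed from the Elecmart example, and the values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: field names mirror the Elecmart example, values are made up.
SALES = [
    {"Region": "Midwest", "Gender": "Male", "Time": "Evening", "TotalCost": 100.0},
    {"Region": "South", "Gender": "Male", "Time": "Evening", "TotalCost": 50.0},
    {"Region": "West", "Gender": "Male", "Time": "Evening", "TotalCost": 70.0},
    {"Region": "Midwest", "Gender": "Female", "Time": "Evening", "TotalCost": 30.0},
    {"Region": "South", "Gender": "Male", "Time": "Afternoon", "TotalCost": 20.0},
    {"Region": "Midwest", "Gender": "Male", "Time": "Morning", "TotalCost": 40.0},
]


def pivot_sum(records, row_field, value_field, slicers=None):
    """Sum value_field by row_field, keeping only records that pass every slicer.

    slicers maps a field name to the set of values whose "buttons" are selected.
    A slicer field does not have to be the row field.
    """
    slicers = slicers or {}
    totals = defaultdict(float)
    for rec in records:
        # A record survives only if it matches the selection in every slicer.
        if all(rec[field] in allowed for field, allowed in slicers.items()):
            totals[rec[row_field]] += rec[value_field]
    return dict(totals)


# Row field is Time; Region and Gender act as slicers outside the table.
result = pivot_sum(
    SALES,
    "Time",
    "TotalCost",
    slicers={
        "Time": {"Afternoon", "Evening"},
        "Region": {"Midwest", "South"},
        "Gender": {"Male"},
    },
)
print(result)  # {'Evening': 150.0, 'Afternoon': 20.0}
```

Treating the row filter and the slicers as the same kind of filter reflects the point made in the text: clicking a slicer button for a field already in the row area is equivalent to filtering that field directly.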
40. The file P03_40.xlsx contains monthly data on the number of border crossings from Mexico into four southwestern states.
a. Restructure this data set on a new sheet so that there are three long columns: Month, State, and Crossings. Essentially, you should stack the original columns B through E on top of one another to get the Crossings column, and you should also indicate which state each row corresponds to in the State column. The Month column should have four replicas of the original Month column.
b. Create a pivot table and corresponding pivot table chart based on the restructured data. It should break down the average of Crossings by Year and State. Comment on any patterns you see in the chart.

41. The Wall Street Journal CEO Compensation Study analyzed CEO pay from many U.S. companies with fiscal year 2008 revenue of at least $5 billion that filed their proxy statements between October 2008 and March 2009. The data are in the file P02_30.xlsx. (Note: This data set is a somewhat different CEO compensation data set from the one used as an example in the next section.)
a. Create a pivot table and a corresponding pivot chart that simultaneously shows average of Salary 2008 and average of Bonus 2008, broken down by Company Type. Comment on any striking results in the chart.
b. In the Data sheet, create a new column, Total 2008, which is the sum of Salary 2008 and Bonus 2008. Then create two pivot tables and corresponding pivot charts on a single sheet. The first should show the counts of CEOs broken down by Company Type, and the second should simultaneously show the average of Total 2008, the minimum of Total 2008, and the maximum of Total 2008, all broken down by Company Type. Comment on any striking results in these charts.

42. One pivot table element we didn't explain is a calculated item. This is usually a new category for some categorical variable that is created from existing categories. It is easiest to learn from an example. Open the file Elecmart Sales.xlsx from this section, create a pivot table, and put
Day in the row area. Proceed as follows to create two new categories, Weekday and Weekend.
a. Select any day and select Calculated Item from the Formulas dropdown list on the PivotTable Tools Options ribbon. This will open a dialog box. Enter Weekend in the Name box and enter the formula =Sat+Sun in the formula box. (You can double-click on the items in the Items list to help build this formula.) When you click on OK, you will see Weekend in the pivot table.
b. Do it yourself. Create another calculated item, Weekday, for Mon through Fri.
c. Filter out all of the individual days from the row area, so that only Weekday and Weekend remain, and then find the sum of TotalCost for these two new categories. How can you check whether these sums are what you think they should be? (Notes about calculated items: First, if you have Weekend, Weekday, and some individual days showing in the row area, the sum of TotalCost will double-count these individual days, so be careful about this. Second, be aware that if you create a calculated item from some variable such as Day, you are no longer allowed to drag that variable to the Report Filter area. We are not sure why.)

43. Building on the previous problem, another pivot table element we didn't explain is a calculated field. This is usually a new numerical variable built from numerical variables that can be summarized in the Values area. It acts somewhat like a new column in the spreadsheet data, but there is an important difference. Again, it is easiest to learn from an example. Open the file Elecmart Sales.xlsx and follow the instructions below.
a. Create a new column in the data, CostPerItem, which is TotalCost divided by ItemsOrdered. Then create a pivot table and find the average of CostPerItem, broken down by Region. You should find averages such as $50.41 for the MidWest. Explain exactly how this value was calculated. Would such an average be of much interest to a manager at Elecmart? Why or why not?
b. Select any average in the pivot table and then select Calculated Field from the
Formulas dropdown list on the PivotTable Tools Options ribbon. This will open a dialog box. Enter CFCostPerItem in the name box (we added CF, for calculated field, because we are not allowed to use the CostPerItem name that already exists), enter the formula =TotalCost/ItemsOrdered, and click on OK. You should now see a new column in the pivot table, Sum of CFCostPerItem, with different values than in the Average of CostPerItem column. For example, the new value for the MidWest should be $46.47. Do some investigation to understand how this "sum" was calculated. From a manager's point of view, does it make any sense? (Note on calculated fields: When you summarize a calculated field, it doesn't matter whether you express it as sum, average, max, or any other summary measure. It is calculated in exactly the same way in each case.)

44. The file P02_18.xlsx contains daily values of the S&P Index from 1970 to 2009. It also contains percentage changes in the index from each day to the next. Create a pivot table with average of Change in the Values area and Date in the Row area. You will see every single date, with no real averaging taking place. This problem lets you explore how you can group naturally on a date variable. For each part below, explain the result briefly.
a. Group by Month.
b. Group by Year.
c. Group by Month and Year (select both in the Group dialog box). Can you make it show the year averages from part b?
d. Group by Quarter.
e. Group by Month and Quarter. Can you make it show the averages from part c?
f. Group by Quarter and Year.
g. Group by Month, Quarter,
and Year.

45. (For Excel 2010 users only) Using the Elecmart Sales.xlsx file from this section, experiment with slicers as follows.
a. Create a pivot table that shows the average of TotalCost, broken down by Region in the row area and Time in the column area. Then insert two slicers, one for Region and one for Time. Select the West and NorthEast buttons on the Region slicer and the Morning and Afternoon buttons on the Time slicer. Explain what happens in the pivot table.
b. Create a pivot table that shows the average of TotalCost, broken down by Region in the row area and Time in the column area. Insert a Day slicer and select the Sat and Sun buttons. Explain what averages are now showing in the pivot table. Verify this by deleting the slicer and instead making Day a report filter with Sat and Sun selected.

46. (For Excel 2010 users only) We used the Lasagna Triers.xlsx file in this section to show how pivot tables can help explain which variables are related to the buying behavior of customers. Illustrate how the same information could be obtained with slicers. Specifically, set up the pivot table as in the example, but use a slicer instead of a row variable. Then set it up exactly as in the example, with a row variable, but include a slicer for some other variable. Comment on the type of results you obtain with these two versions. Do slicers appear to provide any advantage in this type of problem?

Level B

47. Solve problem 5 with pivot tables and create corresponding pivot charts. If you first find the quartiles of Salary and AmountSpent (by any method), is it possible to create the desired crosstabs by grouping, without recoding these variables?

48. Solve problem 17 with pivot tables. However, find only means and standard deviations, not medians. (This is one drawback of pivot tables. Medians are not among their summary measures.)

49. The file P03_22.xlsx lists financial data on movies released since 1980 with budgets of at least $20 million.
a. Create three new variables, Ratio1, Ratio2, and Decade. Ratio1 should be US Gross divided by Budget, Ratio2 should be Worldwide Gross divided by Budget, and Decade should list 1980s, 1990s, or 2000s, depending on the year of the release date. If either US Gross or Worldwide Gross is listed as Unknown, the corresponding ratio should be blank. (Hint: For Decade, use the YEAR function to fill in a new Year column. Then use a lookup table to populate the Decade column.)
b. Use a pivot table to find counts of movies by various distributors. Then go back to the data and create one more column, Distributor New, which lists the distributor for distributors with at least 30 movies and lists Other for the rest. (Hint: Use a lookup table to populate Distributor New, but also use an IF to fill in Other where the distributor is missing.)
c. Create a pivot table and corresponding pivot chart that shows average and standard deviation of Ratio1, broken down by Distributor New, with a report filter for Decade. Comment on any striking results.
d. Repeat part c for Ratio2.

50. The file P03_50.xlsx lists NBA salaries for five seasons. (Each NBA season straddles two calendar years.)
a. Merge all of the data into a single new sheet called All Data. In this new sheet, add a new column Season that lists the season, such as 2006-2007.
b. Note that many of the players list a position such as C-F or F-C. Presumably, the first means the player is primarily a center but sometimes plays forward, whereas the second means the opposite. Recode these so that only the primary position remains (C in the first case, F in the second). To complicate matters further,
the source lists positions differently in 2007-2008 than in other years. It lists PG and SG (point guard and shooting guard) instead of just G, and it lists SF and PF (small forward and power forward) instead of just F. Recode the positions for this season to be consistent with the other seasons (so that there are only three positions: G, F, and C).
c. Note that many players have (p) or (t) in their Contract Thru value. The Source sheet explains this. Create two new columns in the All Data sheet, Years Remaining and Option. The Years Remaining column should list the years remaining in the contract. For example, if the season is 2004-2005 and the contract is through 2006-2007, years remaining should be 2. The Option column should list Player if there is a (p), Team if there is a (t), and blank if neither.
d. Use a pivot table to find the average Salary by Season. Change it to show average Salary by Team. Change it to show average Salary by Season and Team. Change it to show average Salary by Primary Position. Change it to show average Salary by Team and Primary Position, with filters for Season, Contract Years, Years Remaining, and Option. Comment on any striking findings.

51. The files P02_29.xlsx contain monthly percentages of on-time arrivals at several of the largest U.S. airports.
a. Explain why the current format of either data set limits the kind of information you can obtain with a pivot table. For example, does it allow you to find the average on-time arrival percentage by year for any selected subset of airports, such as the average for O'Hare, Los Angeles International, and La Guardia?
b. Restructure the data appropriately and then use a pivot table to answer the specific question in part a.

3.6 AN EXTENDED EXAMPLE

Now that you are equipped with a collection of tools for describing data, it is time to apply these tools to some serious data analysis. In this section we examine a very interesting data set that contains real data on CEO compensation in 2008. With a data set as rich as this one, there are always many summary measures that
could be calculated, many tables that could be formed, and many charts that could be created. We illustrate some of the outputs that might be of interest, but you should realize that there are many other analyses you could perform. Given all the attention CEO compensation has received in the recent economic recession, you probably have your own questions you would like to answer. Therefore, we encourage you to take the analysis a few steps beyond what we present here.

EXAMPLE 3.6 CEO COMPENSATION

The file CEO Compensation 2008 Forbes.xlsx contains data on the 500 most highly compensated CEOs in 2008, according to a Forbes Web site. The data was gathered by going to Web sites such as www.forbes.com/lists/2009/12/best-boss-09_John-B-Hess_7YAE.html, one per CEO, and copying the data to Excel. A small subset of this data appears in Figures 3.65 and 3.66, where the rows are sorted in decreasing order of total compensation (in millions of dollars) in column J. The data set includes some personal data about each CEO, the CEO's total compensation for 2008 and its components (columns K through N), the CEO's total 5-year compensation, and the value of company shares owned by the CEO. For those CEOs with tenure of at least six years, it also shows some six-year values, including a performance versus pay ranking (1 is best) in column U. Finally, the last two columns indicate the CEO's total return during tenure and how this compares to the market. More details about the variables in this file can be found in the cell comments in row 1 and at the Web links in column A. The file also includes median compensation values for the various industries, a few of which are shown in Figure 3.67, so that you can compare any CEO to these median values in his/her industry.

Figure 3.65 CEO Compensation, Columns A-L

| Row | CEO (A) | Gender (B) | Company (C) | Ticker (D) | Industry (E) | Founder (F) | Years as company CEO (G) | Years with company (H) | Age (I) | Total 2008 compensation (J) | 2008 Salary (K) | 2008 Bonus (L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Lawrence J Ellison | M | Oracle | ORCL | Software & Services | Yes | 32 | 32 | 64 | 556.98 | 1.00 | 10.78 |
| 3 | Ray R Irani | M | Occidental Petroleum | OXY | Oil & Gas Operations | No | 18 | 26 | 74 | 222.64 | 1.30 | 3.63 |
| 4 | John B Hess | M | Hess | HES | Oil & Gas Operations | No | 14 | 32 | 55 | 154.58 | 1.50 | 3.50 |
| 5 | Michael D Watford | M | Ultra Petroleum | UPL | Oil & Gas Operations | No | 10 | 10 | 55 | 116.93 | 0.60 | 1.75 |
| 6 | Mark G Papa | M | EOG Resources | EOG | Oil & Gas Operations | No | 11 | 28 | 62 | 90.47 | 0.94 | 1.00 |
| 7 | William R Berkley | M | WR Berkley | WRB | Insurance | Yes | 42 | 42 | 63 | 87.48 | 1.00 | 8.50 |
| 8 | Matthew K Rose | M | Burlington Santa Fe | BNI | Transportation | No | 8 | 16 | 50 | 68.62 | 1.18 | 1.68 |
| 9 | Paul J Evanson | M | Allegheny Energy | AYE | Utilities | No | 6 | 17 | 67 | 67.26 | 1.12 | 1.23 |
| 10 | Hugh Grant | M | Monsanto | MON | Chemicals | No | 6 | 28 | 51 | 64.60 | 1.29 | 3.33 |
| 11 | Robert W Lane | M | Deere & Co | DE | Capital Goods | No | 9 | 27 | 59 | 61.30 | 1.44 | 3.59 |

Figure 3.66 CEO Compensation, Columns M-W

| Row | 2008 Other (M) | 2008 Stock gains (N) | 5-year compensation total (O) | Shares owned, $ millions (P) | 6-year average compensation (Q) | 6-year annual total return (R) | 6-year return relative to industry (S) | 6-year return relative to market (T) | Performance vs pay rank (U) | Total return during tenure (V) | Relative to market (W) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.45 | 543.75 | 944.45 | 21987.4 | 164.26 | 9 | 101 | 107 | 103 | 27 | 118 |
| 3 | 33.32 | 184.39 | 743.55 | 394.3 | 128.82 | 28 | 106 | 125 | 106 | 14 | 106 |
| 4 | 36.66 | 112.92 | 234.83 | 2016.8 | 39.68 | 27 | 106 | 125 | 115 | 10 | 104 |
| 5 | 1.10 | 113.48 | 174.17 | 104.1 | 29.47 | 45 | 120 | 142 | 28 | 48 | 153 |
| 6 | 18.86 | 69.67 | 170.69 | 51.7 | 28.92 | 23 | 102 | 120 | 92 | 23 | 123 |
| 7 | 5.42 | 72.56 | 178.29 | 627.4 | 30.79 | 13 | 118 | 111 | 67 | 14 | 104 |
| 8 | 20.70 | 45.06 | 140.73 | 39.8 | 23.68 | 19 | 114 | 117 | 40 | 13 | 117 |
| 9 | 22.28 | 42.63 | 143.54 | 33.1 | | | | | | 18 | 119 |
| 10 | 9.32 | 50.67 | 135.30 | 28.0 | | | | | | 44 | 144 |
| 11 | 15.24 | 41.04 | 142.40 | 11.8 | 24.13 | 13 | 100 | 111 | 109 | 9 | 113 |

Figure 3.67 Industry Medians (lookup table for industry medians in 2008, columns Y-AD)

| Industry | Total compensation | Salary | Bonus | Other | Stock Gains |
|---|---|---|---|---|---|
| Aerospace & Defense | 14.29 | 1.20 | 2.46 | 4.74 | 2.99 |
| Banking | 2.29 | 0.90 | 0.00 | 0.37 | 0.00 |
| Business Services & Supplies | 2.47 | 0.94 | 0.45 | 1.09 | 0.00 |
| Capital Goods | 9.00 | 1.12 | 1.36 | 3.06 | 0.88 |
| Chemicals | 4.86 | 1.00 | 1.55 | 0.93 | 0.00 |
| Conglomerates | 12.21 | 1.16 | 2.69 | 8.34 | 0.00 |
| Construction | 7.66 | 0.99 | 1.20 | 1.91 | 1.45 |
| Consumer Durables | 3.38 | 1.23 | 0.00 | 1.78 | 0.00 |

Objective: To use the tools in this chapter to explore the CEO compensation data.

Solution

We first present a short discussion on CEO compensation.

A Primer on CEO Compensation

If a "normal" person gets a salary of $80,000 in a given year, that's about the end of the story. However, CEO compensation is considerably more complex. Each CEO receives a base salary (column K) and an incentive bonus (column L), the latter decided by negotiation. A CEO can also receive other compensation (column M), including vested restricted stock grants, LTIP (long-term incentive plan) payouts, and perks. However, the big difference between CEOs and the rest of us is the granting of stock options. A stock option allows a CEO to purchase company stock at a fixed stated price during a certain period of time, often 10 years. If the price of the company stock increases during that period, the CEO can then exercise the stock options by buying the stock at the low fixed price and selling it at the high current price, thereby making a windfall. This explains the huge stock gains in column N for several of the CEOs at the top of the list. They
evidently exercised at least some of their options in 2008.

In a sense, these huge stock gains for some CEOs overstate their compensation for 2008. They had been holding these valuable stock options for years, but their gains showed up only in 2008, when they exercised the options. This is essentially an accounting issue. When and how should stock options show up in compensation figures? The data in our file indicate one possibility. In particular, the total compensation in column J is relevant for tax purposes in 2008; this is the amount subject to 2008 taxes. However, it is not the value that appears on company financial statements. For example, if you visit other Forbes Web sites, such as http://people.forbes.com/profile/mark-g-papa/30517 for Mark Papa (number 5 on our list), you will see a much different total compensation, about $23.44 million rather than the $90.47 million we report. The difference can be explained by the way stock options are accounted for on corporate financial statements.

Obviously, accounting for CEO compensation is a complex issue, and much academic research has been devoted to it. However, now that you understand the basic issues, we will begin analyzing the data in the file. With a data set this large and including so many variables, it is difficult to know where to start. Probably the best strategy is to ask some interesting questions and then find the best tools to answer them. Here are some possibilities.

Question Set 1
There are clearly very few female CEOs. How many are there? Do they tend to be in certain industries? Does their tenure tend to be shorter than that of their male counterparts? Do they tend to be younger than their male counterparts?

Answer Set 1
The pivot table in Figure 3.68 answers the first two questions. It shows the counts of males and females across industries. (See the finished version of the CEO file for detailed instructions on getting all of the results in this example.) There are only 14 female CEOs, and five of them are in the Food, Drink & Tobacco industry. Note that if you double-click on the 5 count in this pivot table, you can drill down to the underlying data to see which five companies these female CEOs represent. They are Archer Daniels, Sara Lee, Kraft Foods, Reynolds American, and PepsiCo.

Figure 3.68 Counts of CEOs by Gender and Industry

One way to investigate age and tenure across gender is to use StatTools, with one-variable summary statistics broken down by gender. (Remember to use the Stacked format option.) Selected results appear in Figure 3.69. Although there are only 14 female CEOs, their tenure tends to be much lower than for men (and less spread out), whereas the differences in age are fairly minimal, with females being slightly younger than males on average.

Figure 3.69 CEO Tenure and Age by Gender

| One Variable Summary (Data Set #1) | Years as company CEO (F) | Years as company CEO (M) |
|---|---|---|
| Mean | 3.643 | 7.013 |
| Std. Dev. | 2.783 | 7.208 |
| Median | 3.000 | 5.000 |

| One Variable Summary (Data Set #1) | Age (F) | Age (M) |
|---|---|---|
| Mean | 54.643 | 56.062 |
| Std. Dev. | 4.162 | 6.393 |
| Median | 55.000 | 56.000 |

Question Set 2
How much do these CEOs make, and how is this allocated across the four different components of compensation?

Answer Set 2
Several tools, including numerical summary measures, histograms, box plots, and pivot tables, can be used to answer these questions. We illustrate only the first two of these.

Figure 3.70 Summary Measures of CEO Compensation

| One Variable Summary (Full Data Set) | Total 2008 compensation | 2008 Salary | 2008 Bonus | 2008 Other | 2008 Stock gains |
|---|---|---|---|---|---|
| Mean | 11.43 | 1.0595 | 1.757 | 3.096 | 5.52 |
| Std. Dev. | 29.88 | 0.5310 | 2.285 | 4.726 | 27.87 |
| Median | 5.39 | 1.0000 | 1.300 | 1.440 | 0.00 |
| Minimum | 0.00 | 0.0000 | 0.000 | 0.000 | 0.00 |
| Maximum | 556.98 | 8.1000 | 18.500 | 36.660 | 543.75 |
| 1st Quartile | 2.88 | 0.8400 | 0.400 | 0.350 | 0.00 |
| 3rd Quartile | 10.99 | 1.2000 | 2.250 | 3.730 | 2.80 |
| 1.00% | 0.13 | 0.0000 | 0.000 | 0.000 | 0.00 |
| 2.50% | 0.77 | 0.3200 | 0.000 | 0.000 | 0.00 |
| 5.00% | 1.09 | 0.4500 | 0.000 | 0.010 | 0.00 |
| 10.00% | 1.46 | 0.6000 | 0.000 | 0.050 | 0.00 |
| 20.00% | 2.37 | 0.7800 | 0.000 | 0.200 | 0.00 |
| 80.00% | 12.87 | 1.2800 | 2.530 | 4.790 | 4.22 |
| 90.00% | 24.29 | 1.5000 | 3.700 | 7.650 | 14.18 |
| 95.00% | 36.01 | 1.7500 | 5.000 | 11.690 | 25.04 |
| 97.50% | 51.29 | 2.0000 | 8.280 | 18.800 | 39.52 |
| 99.00% | 90.47 | 2.4300 | 13.950 | 25.240 | 72.56 |

Figure 3.71 Histogram of Total Compensation

Figure 3.72 Histogram of Base Salary

Figure 3.70 shows StatTools numerical summary measures of total compensation and its four components. Note that these
are not broken down by any categories, so the Unstacked format option in StatTools should be used. Evidently, base salary varies through a fairly limited range, with the middle 90% of CEOs between $0.45 million and $1.75 million (see the 5th and 95th percentiles). The bonuses are spread out a bit more, but not much: the median bonus is $1.3 million, and 95% are no more than $5 million. Probably the most striking feature is the extreme skewness in the other and stock gains components. For example, at least a quarter of the CEOs had no stock gains, but at least 10% made more than $14 million in stock gains.

If you create histograms of these five variables, you will see that some of them have almost no shape, being totally dominated by outliers. For example, the histogram of total compensation in Figure 3.71 has only one bar of appreciable size. This bar contains 489 of the 500 CEOs. Even the histogram of base salary in Figure 3.72 is affected by outliers, although it is not nearly as skewed as the histogram of total compensation.

If you want to see more clearly how most of the CEO compensations are distributed, you can create a new StatTools data set with the outliers removed. (Here it is useful to copy the data to a new sheet and then work with the copy.) As an example, we sorted in increasing order of total compensation and then deleted the bottom 50, the ones with the largest compensations. A histogram of the remaining 450 appears in Figure 3.73. There is still skewness, but not nearly as pronounced as before.

Figure 3.73 Histogram of Total Compensation without Top 50 Earners

Question Set 3
How do CEO compensations vary by industry? Is the variability about the same in each industry?

Answer Set 3
The first question can be answered partly by the table in Figure 3.67, which we obtained directly from the Web site. However, it is easy to calculate additional results with one or more pivot tables and associated pivot charts. One such pivot chart appears in Figure 3.74. This summarizes total compensation in two ways, by average and by standard deviation, sorted on the averages. The smallest average compensation is for Consumer Durables, and the largest is for Software & Services. But note that the two categories to the right have by far the largest standard deviations. Evidently, a few very highly paid CEOs in these two industries pulled up not only the averages but also the amount of variability. (The top five CEOs are in these two industries.)

Figure 3.74 Pivot Chart of Total Compensation by Industry

Remember also that you can filter in pivot tables. Figure 3.74 shows all industries, but to see only a few, you can easily filter out the ones you want to hide. For example, Figure 3.75 shows a clearer picture of industries in the financial sector. It took almost no work to create this chart. After filtering out all but the three industries, the pivot chart in Figure 3.74 updated automatically.

Figure 3.75 Pivot Chart of Total Compensation for Financial Industries (Banking, Insurance, Diversified Financials)

Question Set 4
An obvious question deals with the relationship, if any, between the size of the CEO's compensation and the performance of the company. We would hope there is a positive relationship, but is there one?

Answer Set 4
First, you need to be aware that any conclusions drawn from this data set are tentative at best. This is the area that has been researched most by academics. Are CEOs worth the huge compensations they are receiving? There are no easy answers, and we have to be very careful to use the most appropriate data, but we can take a look at the evidence here.

It is probably best to look at long-term results rather than just a single year, so we focus on the data in columns Q through W in Figure 3.66, which is reported only for CEOs who have a six-year tenure and six-year
compensation history. There are 179 such CEOs, so we created another StatTools data set called Six-year Data Set for these 179 CEOs. Column Q contains the six-year average compensation for the CEO. Columns R through T show the annualized stock returns (including dividends) for the company; columns S and T show this as an index relative to the industry and the market (S&P 500), respectively, where a score of 100 is par. Columns V and W are similar to columns R and T, except that they are for the CEO's entire tenure. Finally, column U lists Forbes's ranking of these 179 CEOs based on a performance versus pay score, with 1 being best and 179 being worst. (We are not told exactly how these rankings were made, but you can find more information at www.forbes.com/2009/04/22/compensation-chief-executive-salary-leadership-best-boss-09-ceo-intro.html.)

Probably the best way to explore these data is with correlations and scatterplots. The correlations among the seven variables appear in Figure 3.76.

Figure 3.76 Correlations among Pay and Performance Variables (Correlation Table, Six-year Data Set)

| Variable | (1) | (2) | (3) | (4) | (5) | (6) | (7) |
|---|---|---|---|---|---|---|---|
| (1) 6-year average compensation | 1.000 | | | | | | |
| (2) 6-year annual total return | 0.188 | 1.000 | | | | | |
| (3) 6-year return relative to industry | 0.111 | 0.629 | 1.000 | | | | |
| (4) 6-year return relative to market | 0.188 | 1.000 | 0.630 | 1.000 | | | |
| (5) Performance vs pay rank | 0.114 | 0.668 | 0.667 | 0.668 | 1.000 | | |
| (6) Total return during tenure | 0.272 | 0.638 | 0.423 | 0.637 | 0.589 | 1.000 | |
| (7) Relative to market | 0.084 | 0.113 | 0.075 | 0.113 | 0.093 | 0.197 | 1.000 |

Figure 3.77 Scatterplots of Pay Versus Performance Variables (6-year average compensation vs 6-year annual total return; 6-year average compensation vs total return during tenure; Six-year Data Set)

As always, we are on the lookout for large correlations, but it is also interesting to look at small correlations, for the lack of relationships. In this case, the six-year average compensation, which includes stock gains, has very small correlations with all other variables. There is a hint of a positive relationship in the 0.188 and 0.272 correlations between six-year average compensation and total return and total return during tenure, but these are certainly not large. This is apparent in the two corresponding scatterplots in Figure 3.77, where there is again a hint of a positive relationship between pay and performance, but certainly not a strong one. (We changed the scale of the vertical axis to a maximum of 50 so that the shapes of the scatters are clearer. This hides a few outliers.)

The correlations in Figure 3.76 and the scatterplots in Figure 3.77 are probably the best evidence we have, at least from this data set, that CEO compensation and company performance are at best weakly related. However, some other correlations are also interesting. Specifically, note the perfect correlation of 1.0 between six-year total return and six-year return relative to market. Although it is not clear from the Forbes footnotes, these two
variables evidently express exactly the same information in different ways. But then why is the correlation between the last two variables not also 1.0? Its small value, 0.197, raises the question of whether we misinterpreted the last variable. The lesson here is that correlations in real data sets can sometimes lead to surprises and further questions.

We have certainly not answered all of the questions that could be asked about the CEO data set, and we have not exploited all of the tools that could be used to answer such questions. With a data set as rich as this one, all you can hope to do is use the various data analysis tools in your arsenal to uncover as much interesting information as possible.

3.7 CONCLUSION

Finding relationships among variables is arguably the most important task in data analysis. This chapter has equipped you with some very powerful tools for detecting relationships. As we have discussed, the tools vary depending on whether the variables are categorical or numerical. (Again, refer to the diagram in the Data Analysis Taxonomy.xlsx file.) Tables and charts of counts are useful for relationships among categorical variables. Summary measures broken down by categories and side-by-side box plots (or side-by-side histograms) are useful for finding relationships between a categorical and a numerical variable. Scatterplots and correlations are useful for finding relationships among numerical variables. Finally, pivot tables are useful for all types of variables.

Summary of Key Terms

| Term | Explanation | Excel | Pages | Equation |
|---|---|---|---|---|
| Crosstabs (or contingency table) | Table of counts of joint categories of two categorical variables | COUNTIFS function or pivot table | 88 | |
| Comparison problem | Comparing a numerical variable across two or more subpopulations | | 92 | |
| Stacked or unstacked data formats | Stacked means long columns, one for categories and another for values; unstacked means a separate values column for each category | | 93 | |
| Scatterplot (or X-Y chart) | Chart for detecting a relationship between two numerical variables; one point for each observation | Scatter from Insert ribbon or StatTools | 93 | |
| Trend line | Line or curve fit to scatterplot (or time series graph) | Right-click on chart point, select Add Trendline | 105 | |
| Covariance | Measure of linear relationship between two numerical variables, but affected by units of measurement | COVAR function or StatTools | 108 | 3.1 |
| Correlation | Measure of linear relationship between two numerical variables, always from -1 to +1 | CORREL function or StatTools | 108 | 3.2 |
| Pivot table | Table for breaking down data by category; can show counts, averages, or other summary measures | PivotTable from Insert ribbon | 114 | |
| Pivot chart | Chart corresponding to a pivot table | PivotChart from PivotTable Tools Options ribbon | 124 | |
| Slicers | Graphical elements for filtering in pivot tables | New to Excel 2010 | 133 | |

PROBLEMS

Conceptual Questions

C.1 When you are trying to discover whether there is a relationship between two categorical variables, why is it useful to transform the counts in a crosstabs to percentages of row or column? Once you do this, how can you tell if the variables are related?

C.2 Suppose you have a crosstabs of two Yes/No categorical variables, with the counts shown as percentages of row. What will these percentages look like if there is absolutely no relationship between the variables? Besides this case, list all possible types of relationships that could occur. (There aren't many.)

C.3 If you suspect that a company's advertising expenditures in a given month affect its sales in future months, what
correlations would you look at to confirm your suspicions? How would you find them?

C.4 Suppose you have customer data on whether they have bought your product in a given time period, along with various demographics on the customers. Explain how you could use pivot tables to see which demographics are the primary drivers of their yes/no buying behavior.

C.5 Suppose you have data on student achievement in high school for each of many school districts. In spreadsheet format, the school district is in column A, and various student achievement measures are in columns B, C, and so on. If you find fairly low correlations (magnitudes from 0 to 0.4, say) between the variables in these achievement columns, what exactly does this mean?

C.6 A supermarket transactions data set is likely to have hierarchical columns of data. For example, for the product sold, there might be columns like Product Family, Product Department, Product Category, and maybe even more. (See the file Supermarket Transactions.xlsx as an example.) Another hierarchy is for store location, where there might be columns for Country, State or Province, City, and possibly more. One more hierarchy is time, with the hierarchy Year, Quarter, Month, and so on. How could a supermarket manager use pivot tables to drill down through a hierarchy to examine revenues? For example, you might start at the Drink level, then drill down to Alcoholic Beverages, and then to Beer and Wine. Illustrate with the file mentioned.

C.7 Suppose you have a large data set for some sport. Each row might correspond to a particular
team (as in the file P03_57.xlsx on football outcomes, for example), or it might even correspond to a given play. Each row contains one or more measures of success as well as many pieces of data that could be drivers of success. How might you find the most important drivers of success if the success measure is categorical (such as Win or Lose)? How might you find the most important drivers of success if the success measure is numerical and basically continuous (such as Points Scored in basketball)?

C.8 If two variables are highly correlated, does this imply that changes in one cause changes in the other? If not, give at least one example from the real world that illustrates what else could cause a high correlation.

C.9 Suppose there are two commodities A and B with strongly negatively correlated daily returns, such as a stock and gold. Is it possible to find another commodity with daily returns that are strongly negatively correlated with both A and B?

C.10 In checking whether several time series, such as monthly exchange rates of various currencies, move together, why do most analysts look at correlations between their differences rather than correlations between the original series?

Level A

52. Unfortunately, StatTools doesn't have a stacked option for its correlation procedure, which would allow you to get a separate table of correlations for each category of a categorical variable. The only alternative is to sort on the categorical variable, insert some blank rows between values of different categories, copy the headings to each section, create separate StatTools data sets for each, and then ask for correlations from each. Do this with the movie data in the file P02_02.xlsx. Specifically, separate the data into three data sets based on Genre: one for Comedy, one for Drama, and one for all the rest. For this problem, you can ignore the third group. For each of Comedy and Drama, create a table of correlations between 7-day Gross, 14-day Gross, Total US Gross, International Gross, and US DVD Sales. Comment on whether the correlation structure is much different for these two popular genres.

53. The file P03_53.xlsx lists campaign contributions, by number of contributors and contribution amount, by state (including Washington DC) for the four leading contenders in the 2008 presidential race. Create a scatterplot and corresponding correlation between Dollar Amount (Y axis) and Contributors for each of the four contenders. For each scatterplot, superimpose a linear trend line and show the corresponding equation. Interpret each equation and compare them across candidates. Finally, identify the state for any points that aren't on or very close to the corresponding trend line.

54. The file P03_54.xlsx lists data for 539 movies released in 2009. Obviously, some movies are simply more popular than others, but success in 2009, measured by 2009 gross or 2009 tickets sold, could also be influenced by the release date. To check this, create a new variable, Days Out, which is the number of days the movie was out during 2009. For example, a movie released on 12/15 would have Days Out equal to 17 (which includes the release day). Create two scatterplots and corresponding correlations, one of 2009 Gross (Y axis) versus Days Out and one of 2009 Tickets Sold (Y axis) versus Days Out. Describe the behavior you see. Do you think a movie's success can be predicted very well just by knowing how many days it has been out?

55. The file P03_55.xlsx lists the average salary for each MLB team from 2004 to 2009, along with the number of team wins in each of these years.
a. Create a table of
correlations between the Wins columns. What do these correlations indicate? Are they higher or lower than you expected?
b. Create a table of correlations between the Salary columns. What do these correlations indicate? Are they higher or lower than you expected?
c. For each year, create a scatterplot and the associated correlations between Wins for that year (Y axis) and Salary for that year. Does it appear that teams are buying their way to success?
d. The coloring in the Wins columns indicates the playoff teams. Create a new Yes/No column for each year, indicating whether the team made it to the playoffs that year. Then create a pivot table for each year showing average of Salary for that year, broken down by the Yes/No column for that year. Do these pivot tables indicate that teams are buying their way into the playoffs?

56. The file P03_56.xlsx lists the average salary for each NBA team from 2004 to 2009, along with the number of team wins each of these years. Answer the same questions as in the previous problem for this basketball data.

57. The file P03_57.xlsx lists the average salary for each NFL team from 2004 to 2009, along with the number of team wins each of these years. Answer the same questions as in problem 55 for this football data.

58. The file P03_58.xlsx lists salaries of MLB players in the years 2007 to 2009. Each row corresponds to a particular player. As indicated by blank salaries, some players played in one of these years, some played in two of these years, and the rest played in all three years.
a. Create a new Yes/No variable, All 3 Years, that indicates which players played all three years.
b. Create two pivot tables and corresponding pivot charts. The first should show the count of players by position who played all three years. The second should show the average salary each year, by position, for all players who played all three years. (For each of these, put the All 3 Years variable in the Report Filter area.) Explain briefly what these two pivot tables indicate.
c. Define a StatTools data set on
only the players who played all three years. Using this data set, create a table of correlations of the three salary variables. What do these correlations indicate about player salaries?

59. The file P03_59.xlsx lists the results of about 20,000 runners in the 2008 New York Marathon.
a. For all runners who finished in 3.5 hours or less, create a pivot table and corresponding pivot chart of average of Time by Gender. (To get a fairer comparison in the chart, change it so that the vertical axis starts at zero.) For the same runners, and on the same sheet, create another pivot table and pivot chart of counts by Gender. Comment on the results.
b. For all runners who finished in 3.5 hours or less, create a pivot table and corresponding pivot chart of average of Time by Age. Group by Age so that the teens are in one category, those in their twenties are in another category, and so on. For the same runners, and on the same sheet, create another pivot table and pivot chart of counts of these age groups. Comment on the results.
c. For all runners who finished in 3.5 hours or less, create a single pivot table of average of Time and of counts, broken down by Country. Then filter so that only the 10 countries with the 10 lowest average times appear. Finally, sort on average times so that the fastest countries rise to the top. Guess who the top two are. (Hint: Try the Value Filters for the Country variable.) Comment on the results.

60. The file P02_12.xlsx includes data on the 50 top graduate programs in the U.S., according to a recent U.S. News & World Report survey.
a. Create a table of correlations between all of the numerical variables. Discuss which variables are highly correlated with which others.
b. The Overall score is the score schools agonize about. Create a scatterplot and corresponding correlation of each of the other variables versus Overall, with Overall always on the Y axis. What do you learn from these scatterplots?

61. Recall from an example in the previous chapter that the file Supermarket Transactions.xlsx contains over 14,000
transactions made by supermarket customers over a period of approximately two years. Set up a single pivot table and corresponding pivot chart, with some instructions to a user (like the supermarket manager) in a text box, on how the user can get answers to any typical question about the data. For example, one possibility (of many) could be total revenue by product department and month, for any combination of gender, marital status, and homeowner. (The point is to get you to explain pivot table basics to a nontechnical user.)

62. The file P03_15.xlsx contains monthly data on the various components of the Consumer Price Index.
a. Create differences for each of the variables. You can do this quickly with StatTools, using the Difference item in the Data Utilities dropdown list, or you can create the differences with Excel formulas.
b. Create a time series graph for each CPI component, including the All Items component. Then create a time series graph for each difference variable. Comment on any patterns or trends you see.
c. Create a table of correlations between the differences. Comment on any large correlations (or the lack of them).
d. Create a scatterplot for each difference variable versus the difference for All Items (Y axis). Comment on any patterns or outliers you see.

Level B

63. The file P03_63.xlsx contains financial data on 85 U.S. companies in the Computer and Electronic Product Manufacturing sector (NAICS code 334) with 2009 earnings before taxes of at least $10,000. Each of these companies listed R&D (research and development) expenses on its income statement. Create a table of correlations between all of the variables, and use conditional formatting to color green all correlations involving R&D that are strongly positive or negative. (Use cutoff values of your choice to define "strongly.") Then create scatterplots of R&D (Y axis) versus each of the other most highly correlated variables. Comment on any patterns you see in these scatterplots, including any obvious outliers, and explain why (or if) it makes sense
that these variables are highly correlated with R&D. If there are highly correlated variables with R&D, can you tell which way the causality goes?

64. The file P03_64.xlsx lists monthly data since 1950 on the well-known Dow Jones Industrial Average (DJIA), as well as the less well-known Dow Jones Transportation Average (DJTA) and Dow Jones Utilities Average (DJUA). Each of these is an index based on 20 to 30 leading companies (which change over time).
a. Create monthly differences in three new columns. The Jan-50 values will be blank because there are no Dec-49 values. Then, for example, the Feb-50 difference is the Feb-50 value minus the Jan-50 value. (You can easily calculate these with Excel formulas, but you might want to try the StatTools Difference procedure from its Data Utilities dropdown list.)
b. Create a table of correlations of the three difference columns. Does it appear that the three Dow indexes tend to move together through time?
c. It is possible (and has been claimed) that one of the indexes is a leading indicator of another. For example, a change in the DJUA in September might predict a similar change in the DJIA in the following December. To check for such behavior, create lags of the difference variables. To do so, select Lag from the StatTools Data Utilities dropdown list, select one of the difference variables, and enter the number of lags you want. For this problem, try four lags. Then press OK and accept the StatTools warnings. Do this for each of the three difference variables. You should end up with 12 lag variables.
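The differencing and lagging steps described above can also be sketched outside Excel. The following is a minimal Python illustration, not the StatTools procedure itself: the helper names (`differences`, `lag`, `pearson`) are hypothetical, the index values are made up rather than taken from the problem's file, and the handling of blank cells (here `None`) is an assumption.

```python
from math import sqrt

def differences(series):
    """Period-to-period differences; the first entry is None (no prior value)."""
    return [None] + [b - a for a, b in zip(series, series[1:])]

def lag(series, k):
    """Lag-k version of a series: entry t holds the value from k periods earlier."""
    return [None] * k + series[:len(series) - k]

def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

# Made-up monthly index levels, for illustration only.
index_a = [100, 103, 101, 106, 110, 108, 113]
index_b = [50, 52, 51, 55, 57, 56, 60]

diff_a = differences(index_a)
diff_b = differences(index_b)

# Lags 1 through 4 of the second difference series.
lags_b = {k: lag(diff_b, k) for k in range(1, 5)}

# Correlation of one difference series with each lag of the other.
for k, lagged in lags_b.items():
    print(f"lag {k}: {pearson(diff_a, lagged):.3f}")
```

Because each lag shifts the series down by one more row, higher lags leave fewer complete pairs, so correlations at long lags rest on less data.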
Explain in words what these lag variables contain. For example, what is the Dec-50 lag 3 of the DJIA difference?
d. Create a table of correlations of the three differences and the 12 lags. Use conditional formatting to color green all correlations greater than 0.5 (or any other cutoff you choose). Does it appear that any index is indeed a leading indicator of any other? Explain.

65. The file P03_65.xlsx lists a lot of data for each NBA team for the seasons 2004-2005 to 2008-2009. The variables are divided into groups: (1) Overall success, (2) Offensive, and (3) Defensive. The basic question all basketball fans (and coaches) ponder is what causes success or failure.
a. Explore this question by creating a correlation matrix with the variable Wins (the measure of success) and all of the variables in groups (2) and (3). Based on these correlations, which five variables appear to be the best predictors of success? (Keep in mind that negative correlations can also be important.)
b. Explore this question in a different way, using the Playoff Team column as a measure of success. Here, it makes sense to proceed as in the Lasagna Triers example in Section 3.5, using the variables in groups (2) and (3) as the predictors. However, these predictors are all basically continuous, so grouping would be required for all of them in the pivot table, and grouping is always somewhat arbitrary. Instead, create a copy of the Data sheet. Then for each variable in groups (2) and (3), create a formula that returns 1, 2, 3, or 4, depending on which quarter of that variable
the value falls in: 1 if it is less than or equal to the first quartile, and so on. (This sounds like a lot of work, but a single copyable formula will work for the entire range.) Now use these discrete variables as predictors and proceed as in the Lasagna Triers example. List the five variables that appear to be the best (or at least good) predictors of making the playoffs.

66. The file P03_66.xlsx lists a lot of data for each NFL team for the years 2004 to 2009. The variables are divided into groups: (1) Overall success, (2) Team Offense, (3) Passing Offense, (4) Rushing Offense, (5) Turnovers Against, (6) Punt Returns, (7) Kick Returns, (8) Field Goals, (9) Punts, (10) Team Defense, (11) Passing Defense, (12) Rushing Defense, and (13) Turnovers Caused. The basic question all football fans (and coaches) ponder is what causes success or failure. Answer the same questions as in the previous problem for this football data, but use all of the variables in groups (2) to (13) as possible predictors.

67. The file P02_57.xlsx contains data on mortgage loans in 2008 for each state in the U.S. The file is different from others in this chapter in that each state has its own sheet with the same data in the same format. Each state sheet breaks down all mortgage applications by loan purpose, applicant race, loan type, outcome, and denial reason (for those that were denied). The question is how a single data set for all states can be created for analysis. The Typical Data Set sheet indicates a simple way of doing this, using the powerful but little-known INDIRECT function. This sheet is basically a template for bringing in any pieces of data from the state sheets you would like to examine.
a. Do whatever it takes to populate the Typical Data Set sheet with information in the range B7:D11 and B14:D14 (18 variables in all) of each state sheet. Add appropriate labels in row 3, such as Asian Dollar Amount Applied For.
b. Create a table of correlations between these variables. Color yellow all correlations between a given applicant race, such as those between Asian Mortgage Application,
Asian Dollar Amount Applied For, and Asian Average Income. Comment on the magnitudes of these. Are there any surprises?
c. Create scatterplots of White Dollar Amount Applied For (X axis) versus the similar variable for each of the other five applicant races. Comment on any patterns in these scatterplots, and identify any obvious outliers.

CASE 3.1 CUSTOMER ARRIVALS AT BANK98

Bank98 operates a main location and three branch locations in a medium-size city. All four locations perform similar services, and customers typically do business at the location nearest them. The bank has recently had more congestion (long waiting lines) than it (or its customers) would like. As part of a study to learn the causes of these long lines and to suggest possible solutions, all locations have kept track of customer arrivals during one-hour intervals for the past 10 weeks. All branches are open Monday through Friday from 9 A.M. until 5 P.M. and on Saturday from 9 A.M. until noon. For each location, the file Bank98 Arrivals.xlsx contains the number of customer arrivals during each hour of a 10-week period. The manager of Bank98 has hired you to make some sense of these data. Specifically, your task is to present charts and/or tables that indicate how customer traffic into the bank locations varies by day of week and hour of day. There is also interest in whether any daily or hourly patterns you observe are stable across weeks. Although you don't have full information about the way the bank currently runs its operations (you know only its customer arrival pattern and the fact that it is currently experiencing long lines), you are encouraged to append any suggestions for improving operations, based on your analysis of the data.

CASE 3.2 SAVING, SPENDING, AND SOCIAL CLIMBING

The best-selling book The Millionaire Next Door by Thomas J. Stanley and William D. Danko (Longstreet Press, 1996) presents some very interesting data on the characteristics of millionaires. We tend to believe that people with expensive houses, expensive cars, expensive clothes, country club memberships, and other outward indications of wealth are the millionaires. The authors define wealth, however, in terms of savings and investments, not consumer items. In this sense, they argue that people with a lot of expensive things and even large incomes often have surprisingly little wealth. These people tend to spend much of what they make on consumer items, often trying to keep up with, or impress, their peers. In contrast, the real millionaires, in terms of savings and investments, frequently come from unglamorous professions (particularly teaching), own unpretentious homes and cars, dress in inexpensive clothes, and otherwise lead rather ordinary lives.

Consider the fictional data in the file Social Climbers.xlsx. For several hundred couples, it lists their education level, their annual combined salary, the market value of their home and cars, the amount of savings they have accumulated (in savings accounts, stocks, retirement accounts, and so on), and a self-reported social climber index on a scale of 1 to 10 (with 1 being very unconcerned about social status and material items and 10 being very concerned about these). Prepare a report based on these data, supported by relevant charts and/or tables, that could be used in a book such as The Millionaire Next Door. Your conclusions can either support or contradict those of Stanley and Danko.

CASE 3.3 CHURN IN THE CELLULAR PHONE MARKET

The term churn is very important to managers in the cellular phone business. Churning occurs when a customer stops using one company's service and switches to another company's service. Obviously, managers try to keep churning to a minimum, not only by offering the best possible service, but by trying to identify conditions that lead to churning and taking steps to stop churning before it occurs. For example, if a company learns that customers tend to churn at the end of their two-year contract, they could offer customers an incentive to stay a month or two before the end of their two-year contract. The file Churn.xlsx contains data on over 2000 customers of a particular cellular phone company. Each row contains the activity of a particular customer for a given time period, and the last column indicates whether the customer churned during this time period. Use the tools in this chapter (and possibly the previous chapter) to learn (1) how these variables are distributed, (2) how the variables in columns B-R are related to each other, and (3) how the variables in columns B-R are related to the Churn variable in column S. Write a short report of your findings, including any recommendations you would make to the company to reduce churn.

CHAPTER 4

Probability and Probability Distributions

GAME AT MCDONALD'S

Several years ago, McDonald's ran a campaign in which it gave game cards to its customers. These game cards made it possible for customers to win hamburgers, french fries, soft drinks, and other fast-food items, as well as cash prizes. Each card had 10 covered spots that could be uncovered by rubbing them with a coin. Beneath three of these spots were "zaps." Beneath the other seven spots were names of prizes, two of which were identical. Some cards had variations of this pattern,
but we'll use this type of card for purposes of illustration. For example, one card might have two pictures of a hamburger, one picture of a Coke, one of french fries, one of a milk shake, one of $5, one of $1000, and three zaps. For this card the customer could win a hamburger. To win on any card, the customer had to uncover the two matching spots (which showed the potential prize for that card) before uncovering a zap; any card with a zap uncovered was automatically void. Assuming that the two matches and the three zaps were arranged randomly on the cards, what is the probability of a customer winning?

We will label the two matching spots M1 and M2, and the three zaps Z1, Z2, and Z3. Then the probability of winning is the probability of uncovering M1 and M2 before uncovering Z1, Z2, or Z3. In this case the relevant set of outcomes is the set of all orderings of M1, M2, Z1, Z2, and Z3, shown in the order they are uncovered. As far as the outcome of the game is concerned, the other five spots on the card are irrelevant. Thus, an outcome such as M2, M1, Z3, Z1, Z2 is a winner, whereas M2, Z2, Z1, M1, Z3 is a loser. Actually, the first of these would be declared a winner as soon as M1 was uncovered, and the second would be declared a loser as soon as Z2 was uncovered. However, we show the whole sequence of Ms and Zs so that we can count outcomes correctly. We then find the probability of winning using the argument of equally likely outcomes. Specifically, we divide the number of outcomes that are winners by the total number of outcomes. It can be shown that the number of outcomes that are winners is 12,
whereas the total number of outcomes is 120. Therefore, the probability of a winner is 12/120 = 0.1. (The 120 outcomes are the 5! orderings of the five spots; the 12 winners are the orderings with both matches first, of which there are 2! × 3! = 12.)

This calculation, which showed that on the average 1 out of 10 cards could be winners, was obviously important for McDonald's. Actually, this provides only an upper bound on the fraction of cards where a prize was awarded. The fact is that many customers threw their cards away without playing the game, and even some of the winners neglected to claim their prizes. So, for example, McDonald's knew that if they made 50,000 cards where a milk shake was the winning prize, somewhat less than 5000 milk shakes would be given away. Knowing approximately what their expected losses would be from winning cards, McDonald's was able to design the game (how many cards of each type to print) so that the expected extra revenue (from customers attracted to the game) would cover the expected losses.

4.1 INTRODUCTION

The world is full of uncertainty, and this is certainly true in business. A key aspect of solving real business problems is dealing appropriately with uncertainty. This involves recognizing explicitly that uncertainty exists and using quantitative methods to model uncertainty. If you want to develop realistic models of business problems, you should not simply act as if uncertainty doesn't exist. For example, if you don't know next month's demand, you shouldn't build a model that assumes next month's demand is a sure 1500 units. This is only wishful thinking. You should instead incorporate the uncertainty about demand explicitly into your model. To do this, you need to know how to deal quantitatively with uncertainty. This involves probability and probability distributions. We will introduce these topics in this chapter and then use them in a number of later chapters.

There are many sources of uncertainty. Demands for products are uncertain, times between arrivals to a supermarket are uncertain, stock price returns are uncertain, changes in interest rates are uncertain, and so on. In many situations, the uncertain quantity, whether demand,
time between arrivals, stock price return, or change in interest rate, is a numerical quantity. In the language of probability, such a numerical quantity is called a random variable. More formally, a random variable associates a numerical value with each possible random outcome.

Associated with each random variable is a probability distribution that lists all of the possible values of the random variable and their corresponding probabilities. A probability distribution provides very useful information. It not only indicates the possible values of the random variable, but it also indicates how likely they are. For example, it is useful to know that the possible demands for a product are, say, 100, 200, 300, and 400, but it is even more useful to know that the probabilities of these four values are, say, 0.1, 0.2, 0.4, and 0.3. Now we know, for example, that there is a 70% chance that demand will be at least 300.

It is often useful to summarize the information from a probability distribution with several well-chosen numerical summary measures. These include the mean, variance, and standard deviation and, for distributions of more than one random variable, the covariance and correlation. As their names imply, these summary measures are much like the corresponding summary measures in Chapters 2 and 3. However, they are not identical. The summary measures in this chapter are based on probability distributions, not an observed data set. We will use numerical examples to explain the difference between the two and how they
are related.

The purpose of this chapter is to explain the basic concepts and tools necessary to work with probability distributions and their summary measures. We begin by briefly discussing the basic rules of probability, which we need in this chapter and in several later chapters. We also introduce computer simulation, an extremely useful tool for illustrating important concepts in probability and statistics.

Modeling uncertainty, as we will be doing in the next few chapters and later in Chapters 15 and 16, is sometimes difficult, depending on the complexity of the model, and it is easy to get so caught up in the details that you lose sight of the big picture. For this reason, the flow chart in Figure 4.1 is useful. (A colored version of this chart is available in the file Modeling Uncertainty Flow Chart.xlsx.) Take a close look at the middle row of this chart. It indicates that we begin with inputs, some of which are uncertain quantities; use Excel formulas to incorporate the logic of the model; end with probability distributions of important outputs that we can summarize in various ways; and finally use this information to make decisions. (The abbreviation EMV stands for expected monetary value. It will be discussed extensively in Chapter 6.) The other boxes in the chart deal with implementation issues, particularly with software you can use to perform the analysis. Read this chart carefully and return to it as you proceed through the next few chapters and Chapters 15 and 16.

[Figure 4.1 Flow Chart for Modeling Uncertainty. Box titles recoverable from the chart: Decide which inputs are important; Assess probability distributions of uncertain inputs; Model the problem; Examine important outputs; Make decisions based on this information.]

Before proceeding, we discuss two terms you often hear in the business world: uncertainty and risk. They are sometimes used interchangeably, but they are not really the same. You typically have no control over uncertainty; it is something that simply exists. A good example is the uncertainty in exchange rates. You cannot be sure what the exchange rate between the U.S. dollar and the euro will be a year from now. All you can try to do is measure this uncertainty with a probability distribution.

In contrast, risk depends on your position. Even though you don't know what the exchange rate will be, it makes no difference to you (there is no risk) if you have no European investments, you aren't planning a trip to Europe, and you don't have to buy or sell anything in Europe. You might be interested in the exchange rate, but you have no risk. You have risk only when you stand to gain or lose money depending on the eventual exchange rate. Of course, the form of your risk depends on your position. If you are holding euros in a money market account, you are hoping that euros gain value relative to the dollar. But if you are planning a European vacation, you are hoping that euros lose value relative to the dollar.

Uncertainty and risk are inherent in many of the examples in this book. By learning about probability, you will learn how to measure uncertainty, and you will also learn how to measure the risks involved in various decisions. One important topic you will not learn much about is risk mitigation by various types of hedging. For example, if you know you have to purchase a large quantity of some product from Europe a year from now, you face the risk that the value of the euro could increase dramatically, thus costing you a lot of money. Fortunately, there are ways to hedge this risk, so that if the euro does increase relative to the dollar, your hedge minimizes your losses. Hedging risk is an
extremely important topic, and it is practiced daily in the real world, but it is beyond the scope of this book.

4.2 PROBABILITY ESSENTIALS

We begin with a brief discussion of probability. The concept of probability is one that we all encounter in everyday life. When a weather forecaster states that the chance of rain is 70%, she is making a probability statement. When we hear that the odds of the Los Angeles Lakers winning the NBA Championship are 2 to 1, this is also a probability statement. The concept of probability is quite intuitive. However, the rules of probability are not always as intuitive or easy to master. We examine the most important of these rules in this section.

A probability is a number between 0 and 1 that measures the likelihood that some event will occur. An event with probability 0 cannot occur, whereas an event with probability 1 is certain to occur. An event with probability greater than 0 and less than 1 involves uncertainty, and the closer its probability is to 1, the more likely it is to occur.

As the examples in the preceding paragraph illustrate, probabilities are sometimes expressed as percentages or odds. However, these can easily be converted to probabilities on a 0-to-1 scale. If the chance of rain is 70%, then the probability of rain is 0.7. Similarly, if the odds of the Lakers winning are 2 to 1, then the probability of the Lakers winning is 2/3 (or 0.6667).

There are only a few probability rules you need to know, and we will discuss them in the next few subsections. Surprisingly, these are the only rules you need to know. Probability is not an easy topic, and a more thorough discussion of it would lead to considerable mathematical complexity, well beyond the level of this book. However, it is all based on the few relatively simple rules discussed next.

4.2.1 Rule of Complements

The simplest probability rule involves the complement of an event. If A is any event, then the complement of A, denoted by \(\bar{A}\) (or in some books by \(A^c\)), is the event that A does not occur. For example, if A is the event that the Dow Jones Industrial Average will finish the year at or above the 11,000 mark, then the complement of A is that the Dow will finish the year below 11,000.

If the probability of A is \(P(A)\), then the probability of its complement, \(P(\bar{A})\), is given by Equation 4.1. Equivalently, the probability of an event and the probability of its complement sum to 1. For example, if we believe that the probability of the Dow finishing at or above 11,000 is 0.25, then the probability that it will finish the year below 11,000 is 1 - 0.25 = 0.75.

Rule of Complements
\[ P(\bar{A}) = 1 - P(A) \qquad (4.1) \]

4.2.2 Addition Rule

We say that events are mutually exclusive if at most one of them can occur. That is, if one of them occurs, then none of the others can occur. For example, consider the following three events involving a company's annual revenue for the coming year: (1) revenue is less than $1 million, (2) revenue is at least $1 million but less than $2 million, and (3) revenue is at least $2 million. Clearly, only one of these events can occur. Therefore, they are mutually exclusive. They are also exhaustive events, which means that they exhaust all possibilities: one of these three events must occur.

Let \(A_1\) through \(A_n\) be any n events. Then the addition rule of probability involves the probability that at least one of these events will occur. In general, this probability is quite complex, but it simplifies considerably when the events are mutually exclusive. In this case the probability that at least one of the events will occur is the sum of their individual
probabilities, as shown in Equation 4.2. Of course, when the events are mutually exclusive, "at least one" is equivalent to "exactly one." In addition, if the events \(A_1\) through \(A_n\) are exhaustive, then the probability is 1. In this case we are certain that one of the events will occur.

Addition Rule for Mutually Exclusive Events
\[ P(\text{at least one of } A_1 \text{ through } A_n) = P(A_1) + P(A_2) + \cdots + P(A_n) \qquad (4.2) \]

In a typical application, the events \(A_1\) through \(A_n\) are chosen to partition the set of all possible outcomes into a number of mutually exclusive events. For example, in terms of a company's annual revenue, define \(A_1\) as "revenue is less than $1 million," \(A_2\) as "revenue is at least $1 million but less than $2 million," and \(A_3\) as "revenue is at least $2 million." Then these three events are mutually exclusive and exhaustive. Therefore, their probabilities must sum to 1. Suppose these probabilities are \(P(A_1) = 0.5\), \(P(A_2) = 0.3\), and \(P(A_3) = 0.2\). (Note that these probabilities do sum to 1.) Then the addition rule enables us to calculate other probabilities. For example, the event that revenue is at least $1 million is the event that either \(A_2\) or \(A_3\) occurs. From the addition rule, its probability is

\[ P(\text{revenue is at least \$1 million}) = P(A_2) + P(A_3) = 0.5 \]

Similarly,

\[ P(\text{revenue is less than \$2 million}) = P(A_1) + P(A_2) = 0.8 \]

and

\[ P(\text{revenue is less than \$1 million or at least \$2 million}) = P(A_1) + P(A_3) = 0.7 \]

Again, the addition rule works only for mutually exclusive events. If the events overlap, then the situation is more complex. For example, suppose you are dealt a bridge hand (13 cards from a 52-card deck). Let H, D, C, and S, respectively, be the
events that you get at least 5 hearts, at least 5 diamonds, at least 5 clubs, and at least 5 spades. What is the probability that at least one of these four events occurs? It is not the sum of their individual probabilities because they are not mutually exclusive. For example, you could get 5 hearts and 5 spades. Probabilities such as this one are actually quite difficult to calculate, and we will not pursue them here. Just be aware that the addition rule does not apply unless the events are mutually exclusive.

4.2.3 Conditional Probability and the Multiplication Rule

Probabilities are always assessed relative to the information currently available. As new information becomes available, probabilities often change. For example, if you read that Kobe Bryant pulled a hamstring muscle, your assessment of the probability that the Lakers will win the NBA Championship would obviously change. A formal way to revise probabilities on the basis of new information is to use conditional probabilities.

Let A and B be any events with probabilities \(P(A)\) and \(P(B)\). Typically, the probability \(P(A)\) is assessed without knowledge of whether B occurs. However, if you are told that B has occurred, then the probability of A might change. The new probability of A is called the conditional probability of A given B. It is denoted by \(P(A \mid B)\). Note that there is still uncertainty involving the event to the left of the vertical bar in this notation; you do not know whether it will occur. However, there is no uncertainty involving the event to the right of the vertical bar; you know that it has occurred.

Conditional Probability
\[ P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} \qquad (4.3) \]

The conditional probability formula enables you to calculate \(P(A \mid B)\) as shown in Equation 4.3. The numerator in this formula is the probability that both A and B occur. This probability must be known to find \(P(A \mid B)\). However, in some applications \(P(A \mid B)\) and \(P(B)\) are known. Then you can multiply both sides of the conditional probability formula by \(P(B)\) to obtain the multiplication rule for \(P(A \text{ and } B)\) in Equation 4.4.
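The relationship between the conditional probability formula and the multiplication rule described above can be sketched in a few lines of Python. This is only an illustration: the function names and the input values (a joint probability of 3/10 and P(B) = 2/5) are hypothetical and do not come from the text.

```python
from fractions import Fraction


def conditional_probability(p_a_and_b, p_b):
    """Conditional probability formula: P(A|B) = P(A and B) / P(B)."""
    if p_b <= 0:
        # Conditioning on an event that cannot occur is undefined.
        raise ValueError("P(B) must be positive to condition on B")
    return p_a_and_b / p_b


def joint_probability(p_a_given_b, p_b):
    """Multiplication rule: P(A and B) = P(A|B) * P(B)."""
    return p_a_given_b * p_b


# Hypothetical inputs, assumed for illustration only.
p_a_and_b = Fraction(3, 10)
p_b = Fraction(2, 5)

p_a_given_b = conditional_probability(p_a_and_b, p_b)
print(p_a_given_b)                          # 3/4
print(joint_probability(p_a_given_b, p_b))  # 3/10
```

Exact fractions are used here so that multiplying the conditional probability back by P(B) recovers the joint probability exactly, which makes the equivalence of the two formulas easy to see.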
Multiplication Rule
\[ P(A \text{ and } B) = P(A \mid B)\,P(B) \qquad (4.4) \]

The conditional probability formula and the multiplication rule are both valid; in fact, they are equivalent. The one you use depends on which probabilities you know and which you want to calculate, as illustrated in the following example.

EXAMPLE 4.1 ASSESSING UNCERTAINTY AT THE BENDER COMPANY

The Bender Company supplies contractors with materials for the construction of houses. The company currently has a contract with one of its customers to fill an order by the end of July. However, there is some uncertainty about whether this deadline can be met, due to uncertainty about whether Bender will receive the materials it needs from one of its suppliers by the middle of July. Right now it is July 1. How can the uncertainty in this situation be assessed?

Objective: To apply several of the essential probability rules in determining the probability that Bender will meet its end-of-July deadline, given the information the company has at the beginning of July.

Solution

Let A be the event that Bender meets its end-of-July deadline, and let B be the event that Bender receives the materials from its supplier by the middle of July. The probabilities Bender is best able to assess on July 1 are probably \(P(B)\) and \(P(A \mid B)\). At the beginning of July, Bender might estimate that the chances of getting the materials on time from its supplier are 2 out of 3, that is, \(P(B) = 2/3\). Also, thinking ahead, Bender estimates that if it receives the required materials on time, the chances of
meeting the end-of-July deadline are 3 out of 4. This is a conditional probability statement, namely, that \(P(A \mid B) = 3/4\). Then the multiplication rule implies that

\[ P(A \text{ and } B) = P(A \mid B)\,P(B) = (3/4)(2/3) = 0.5 \]

That is, there is a fifty-fifty chance that Bender will get its materials on time and meet its end-of-July deadline.

This uncertain situation is depicted graphically in the form of a probability tree in Figure 4.2. Note that Bender initially faces (at the leftmost branch of the tree diagram) the uncertainty of whether event B or its complement will occur. Regardless of whether event B takes place, Bender must next confront the uncertainty regarding event A. This uncertainty is reflected in the set of two parallel pairs of branches that model whether event A or its complement will occur next. Hence, there are four mutually exclusive outcomes regarding the two uncertain events, as shown on the right-hand side of Figure 4.2.

Figure 4.2 Probability Tree for Example 4.1

    \(P(B) = 2/3\), then \(P(A \mid B) = 3/4\): \(P(A \text{ and } B) = (3/4)(2/3)\)
    \(P(B) = 2/3\), then \(P(\bar{A} \mid B) = 1/4\): \(P(\bar{A} \text{ and } B) = (1/4)(2/3)\)
    \(P(\bar{B}) = 1/3\), then \(P(A \mid \bar{B}) = 1/5\): \(P(A \text{ and } \bar{B}) = (1/5)(1/3)\)
    \(P(\bar{B}) = 1/3\), then \(P(\bar{A} \mid \bar{B}) = 4/5\): \(P(\bar{A} \text{ and } \bar{B}) = (4/5)(1/3)\)

Initially, we are interested in the first possible outcome, the joint occurrence of events A and B, found at the top of the probability tree diagram. Another way to compute the probability of both events B and A occurring is to multiply the probabilities associated with the branches along the path from the root of the tree (on the left-hand side) to the desired terminal point, or outcome, of the tree (on the right-hand side). In this case, we multiply the probability of B, corresponding to the first branch along the path of interest, by the conditional probability of A given B, associated with the second branch along the path of interest.

There are several other probabilities of interest in this example. First, let \(\bar{B}\) be the complement of B; it is the event that the materials from the supplier do not arrive on time. We know that \(P(\bar{B}) = 1 - P(B) = 1/3\) from the rule of complements. However, we do not yet know the conditional probability \(P(A \mid \bar{B})\), the probability that Bender will meet its end-of-July deadline, given that it does not receive the materials from the supplier on time. In particular, \(P(A \mid \bar{B})\) is not equal to \(1 - P(A \mid B)\). (Can you see why?) Suppose Bender estimates that the chances of meeting the end-of-July deadline are 1 out of 5 if the materials do not arrive on time, that is, \(P(A \mid \bar{B}) = 1/5\). Then a second use of the multiplication rule gives

\[ P(A \text{ and } \bar{B}) = P(A \mid \bar{B})\,P(\bar{B}) = (1/5)(1/3) = 0.0667 \]

In words, there is only 1 chance out of 15 that the materials will not arrive on time and Bender will meet its end-of-July deadline.

Again, you can use the probability tree for Bender in Figure 4.2 to compute the probability of the joint occurrence of events A and \(\bar{B}\). This outcome is the third (from the top of the diagram) terminal point of the tree. To find the desired probability, multiply the probabilities corresponding to the two branches included in this path from the left-hand side of the tree to the right-hand side. This confirms that the probability of interest is the product of the two relevant probabilities, namely 1/5 and 1/3. Simply stated, probability trees can be quite useful in modeling and assessing such uncertain outcomes in real-life situations.

The bottom line for Bender is whether it will meet its end-of-July deadline. After mid-July, this probability is either \(P(A \mid B) = 3/4\) or \(P(A \mid \bar{B}) = 1/5\) because by this time Bender will know whether the materials arrived on time. But on July 1, the relevant probability is \(P(A)\); there is still uncertainty about whether B or \(\bar{B}\) will occur. Fortunately, you can calculate \(P(A)\) from the probabilities you
already know. The logic is that A consists of the two mutually exclusive events (A and B) and (A and \(\bar{B}\)). That is, if A is to occur, it must occur with B or with \(\bar{B}\). Therefore, using the addition rule for mutually exclusive events, we obtain

\[ P(A) = P(A \text{ and } B) + P(A \text{ and } \bar{B}) = 1/2 + 1/15 = 17/30 = 0.5667 \]

The chances are 17 out of 30 that Bender will meet its end-of-July deadline, given the information it has at the beginning of July.

4.2.4 Probabilistic Independence

A concept that is closely tied to conditional probability is probabilistic independence. You just saw how the probability of an event A can depend on whether another event B has occurred. Typically, the probabilities \(P(A)\), \(P(A \mid B)\), and \(P(A \mid \bar{B})\) are all different, as in Example 4.1. However, there are situations where all of these probabilities are equal. In this case we say that the events A and B are independent. This does not mean they are mutually exclusive. Rather, probabilistic independence means that knowledge of one event is of no value when assessing the probability of the other.

The main advantage to knowing that two events are independent is that in that case the multiplication rule simplifies to Equation 4.5. This follows by substituting \(P(A)\) for \(P(A \mid B)\) in the multiplication rule, which is allowed because of independence. In words, the probability that both events occur is the product of their individual probabilities.

Multiplication Rule for Independent Events
\[ P(A \text{ and } B) = P(A)\,P(B) \qquad (4.5) \]

How can you tell whether events are probabilistically independent? Unfortunately, this issue
usually cannot be settled with mathematical arguments; typically, you need empirical data to decide whether independence is reasonable. As a simple example, let A be the event that a family's first child is male, and let B be the event that its second child is male. Are A and B independent? You could argue that they aren't independent if you believe, say, that a boy is more likely to be followed by another boy than by a girl. You could argue that they are independent if you believe the chances of the second child being a boy are the same, regardless of the gender of the first child. (Note that neither argument has anything to do with boys and girls being equally likely.)

In any case, the only way to settle the argument is to observe many families with at least two children. If you observe, say, that 55% of all families with first child male also have the second child male, and only 45% of all families with first child female have the second child male, then you can make a good case that A and B are not independent.

It is probably fair to say that most events in the real world are not truly independent. However, because of the simplified multiplication rule for independent events, many mathematical models assume that events are independent; the math is much easier with this assumption. The question is then whether the results from such a model are believable. All we can say in general is that it depends on how unrealistic the independence assumption really is.

4.2.5 Equally Likely Events

Much of what you know about probability is probably based on situations where outcomes are equally likely. These include flipping coins, throwing dice, drawing balls from urns, and other random mechanisms that are often discussed in introductory probability books. For example, suppose an urn contains 20 red marbles and 10 blue marbles. You plan to randomly select five marbles from the urn, and you are interested, say, in the probability of selecting at least three red marbles. To find this probability, you argue that because of randomness, every possible set of five marbles is equally likely to be chosen. Then you count the number of sets of five marbles that contain at least three red marbles, you count the total number of sets of five marbles that could be selected, and you set the desired probability equal to the ratio of these two counts.

Let us put this method of calculating probabilities into proper perspective. It is true that many probabilities, particularly in games of chance, can be calculated by using an equally likely argument. It is also true that probabilities calculated in this way satisfy all of the rules of probability, including the rules we have already discussed. However, many probabilities, especially those in business situations, cannot be calculated by equally likely arguments, simply because the possible outcomes are not equally likely. For example, just because you are able to identify five possible scenarios for a company's future, there is probably no reason whatsoever to conclude that each scenario has probability 1/5.

The bottom line is that we will have almost no need in this book to discuss complex counting rules for equally likely outcomes. If you dreaded learning about probability in terms of balls and urns, rest assured that you will not have to do so here.

4.2.6 Subjective Versus Objective Probabilities

We now ask a very basic question: Where do the probabilities in a probability distribution come from? A complete answer to this question could lead to a chapter by itself, so we only briefly discuss the issues involved. There
are essentially two distinct ways to assess probabilities: objectively and subjectively. Objective probabilities are those that can be estimated from long-run proportions, whereas subjective probabilities cannot be estimated from long-run proportions. Some examples will make this distinction clearer.

Consider throwing two dice and observing the sum of the two sides that face up. What is the probability that the sum of these two sides is 7? You might argue as follows. Because there are 6 × 6 = 36 ways the two dice can fall, and because exactly 6 of these result in a sum of 7, the probability of a 7 is 6/36 = 1/6. This is the equally likely argument we discussed previously. It reduces probability to counting.

What if the dice are weighted in some way? Then the equally likely argument is no longer valid. You can, however, toss the dice many times and record the proportion of tosses that result in a sum of 7. This proportion is called a relative frequency.

The relative frequency of an event is the proportion of times the event occurs out of the number of times the random experiment is run. A relative frequency can be recorded as a proportion or a percentage.

A famous result called the law of large numbers states that this relative frequency, in the long run, will get closer and closer to the true probability of a 7. This is exactly what we mean by an objective probability. It is a probability that can be estimated as the long-run proportion of times an event occurs in a sequence of many identical experiments.

If you are flipping coins, throwing dice, or spinning roulette wheels, objective probabilities are certainly relevant. You don't need a person's opinion of the probability that a roulette wheel, say, will end up pointing to a red number; you can simply spin it many times and keep track of the proportion of times it points to a red number. However, there are many situations, particularly in business, that cannot be repeated many times, or even more than once, under identical conditions. In these situations
objective probabilities make no sense and equally likely arguments usually make no sense either so you must resort to subjec tive probabilities A subjective probability is one person s assessment of the likelihood that a certain event will occur We assume that the person making the assessment uses all of the information available to make the most rational assessment possible This de nition of subjective probability implies that one person s assessment of a prob ability can differ from another person s assessment of the same probability For example consider the probability that the Indianapolis Colts will win the next Super Bowl If you ask a casual football observer to assess this probability you will get one answer but if you ask a person with a lot of inside information about injuries team cohesiveness and so on you might get a very different answer Because these probabilities are subjective people with different information typically assess probabilities in different ways Subjective probabilities are usually relevant for unique one time situations However most situations are not completely unique you often have some history to guide you That is historical relative frequencies can be factored into subjective probabilities For example suppose a company is about to market a new product This product might be quite different in some ways from any products the company has marketed before but it might also share some features with the company s previous products If the company wants to assess the probability that the new product will be a success it will certainly analyze the unique features of this product and the current state of the market to obtain a subjective assessment However the company will also look at its past successes and failures with reasonably similar products If the proportion of successes with past products was 40 say then this value might be a starting point in the assessment of this product s probability of success I64 Chapter 4 Probability and 
All of the given probabilities in this chapter and later chapters can be placed somewhere on the objective-to-subjective continuum, usually closer to the subjective end. An important implication of this placement is that these probabilities are not cast in stone; they are only educated guesses. Therefore, it is always a good idea to run a sensitivity analysis (especially on a spreadsheet, where this is easy to do) to see how any bottom-line answers depend on the given probabilities. Sensitivity analysis is especially important in Chapter 6, when we study decision making under uncertainty.

PROBLEMS

Note: Student solutions for problems whose numbers appear within a colored box are available for purchase at www.cengagebrain.com.

Level A

1. In a particular suburb, 30% of the households have installed electronic security systems.
a. If a household is chosen at random from this suburb, what is the probability that this household has not installed an electronic security system?
b. If two households are chosen at random from this suburb, what is the probability that neither has installed an electronic security system?

2. Several major automobile producers are competing to have the largest market share for sport utility vehicles (SUVs) in the coming quarter. A professional automobile market analyst assesses that the odds of General Motors not being the market leader are 6 to 1. The odds against Toyota and Ford having the largest market share in the coming quarter are similarly assessed to be 12 to 5 and 8 to 3, respectively.
a. Find the
probability that General Motors will have the largest market share for SUVs in the coming quarter.
b. Find the probability that Toyota will have the largest market share for SUVs in the coming quarter.
c. Find the probability that Ford will have the largest market share for SUVs in the coming quarter.
d. Find the probability that some other automobile manufacturer will have the largest market share for SUVs in the coming quarter.

3. The publisher of a popular financial periodical has decided to undertake a campaign in an effort to attract new subscribers. Market research analysts in this company believe that there is a 1 in 4 chance that the increase in the number of new subscriptions resulting from this campaign will be less than 3000, and there is a 1 in 3 chance that the increase in the number of new subscriptions resulting from this campaign will be between 3000 and 5000. What is the probability that the increase in the number of new subscriptions resulting from this campaign will be less than 3000 or more than 5000?

4. Suppose that 18% of the employees of a given corporation engage in physical exercise activities during the lunch hour. Moreover, assume that 57% of all employees are male and 12% of all employees are males who engage in physical exercise activities during the lunch hour.
a. If you choose an employee at random from this corporation, what is the probability that this person is a female who engages in physical exercise activities during the lunch hour?
b. If you choose an employee at random from this corporation, what is the probability that this person is a female who does not engage in physical exercise activities during the lunch hour?

5. In a study designed to gauge married women's participation in the workplace today, the data provided in the file P04_05.xlsx were obtained from a sample of 750 randomly selected married women. Consider a woman selected at random from this sample in answering the following questions.
a. What is the probability that this randomly selected woman has a
job outside the home?
b. What is the probability that this randomly selected woman has at least one child?
c. What is the probability that this randomly selected woman has a full-time job and no more than one child?
d. What is the probability that this randomly selected woman has a part-time job or at least one child, but not both?

6. Suppose that you draw a single card from a standard deck of 52 playing cards.
a. What is the probability that a diamond or club is drawn?
b. What is the probability that the drawn card is not a 4?
c. Given that a black card has been drawn, what is the probability that it is a spade?
d. Let E1 be the event that a black card is drawn. Let E2 be the event that a spade is drawn. Are E1 and E2 independent events? Why or why not?
e. Let E3 be the event that a heart is drawn. Let E4 be the event that a 3 is drawn. Are E3 and E4 independent events? Why or why not?

Level B

7. In a large accounting firm, the proportion of accountants with MBA degrees and at least five years of professional experience is 75% as large as the proportion of accountants with no MBA degree and less than five years of professional experience. Furthermore, 35% of the accountants in this firm have MBA degrees, and 45% have fewer than five years of professional experience. If one of the firm's accountants is selected at random, what is the probability that this accountant has an MBA degree or at least five years of professional experience, but not both?

8. A local beer producer sells two types of beer, a regular brand and a light brand with 30% fewer calories. The company's marketing department wants to verify that its traditional approach of appealing to local white-collar workers with light beer commercials and appealing to local blue-collar workers with regular beer commercials is indeed a good strategy. A randomly selected group of 400 local workers are questioned about their beer-drinking preferences, and the data in the file P04_08.xlsx are obtained.
a. If a blue-collar worker is chosen at random from this group, what is the probability that he/she prefers light beer (to regular beer or no beer at all)?
b. If a white-collar worker is chosen at random from this group, what is the probability that he/she prefers light beer (to regular beer or no beer at all)?
c. If you restrict your attention to workers who like to drink beer, what is the probability that a randomly selected blue-collar worker prefers to drink light beer?
d. If you restrict your attention to workers who like to drink beer, what is the probability that a randomly selected white-collar worker prefers to drink light beer?
e. Does the company's marketing strategy appear to be appropriate? Explain why or why not.

9. Suppose that two dice are tossed. For each die, it is equally likely that 1, 2, 3, 4, 5, or 6 dots will turn up. Let S be the sum of the two dice.
a. What is the probability that S will be 5 or 7?
b. What is the probability that S will be some number other than 4 or 8?
c. Let E1 be the event that the first die shows a 3. Let E2 be the event that S is 6. Are E1 and E2 independent events?
d. Again, let E1 be the event that the first die shows a 3. Let E3 be the event that S is 7. Are E1 and E3 independent events?
e. Given that S is 7, what is the probability that the first die showed 4 dots?
f. Given that the first die shows a 3, what is the probability that S is an even number?

4.3 DISTRIBUTION OF A SINGLE RANDOM VARIABLE

We now discuss the topic of most interest in this chapter: probability distributions. In this section we examine the probability distribution of a single random variable. In later sections we discuss probability distributions of two or more related random variables.

FUNDAMENTAL INSIGHT
Concept of Probability Distribution
A probability distribution is a way of describing the uncertainty of some numerical outcome. It is not based, at least not directly, on a data set of the type discussed in the previous two chapters. Instead, it is essentially a list of all possible outcomes and their corresponding probabilities.

There are really two types of random variables: discrete and continuous. A discrete random variable has only a finite number of possible values, whereas a continuous random variable has a continuum of possible values.¹ Usually a discrete distribution results from a count, whereas a continuous distribution results from a measurement. For example, the number of children in a family is clearly discrete, whereas the amount of rain this year in San Francisco is clearly continuous.

¹ Actually, a more rigorous discussion allows a discrete random variable to have an infinite number of possible values, such as all positive integers. The only time this occurs in this book is when we discuss the Poisson distribution in Chapter 5.

A discrete probability distribution is a set of possible values and a corresponding set of nonnegative probabilities that sum to 1.

This distinction between counts and measurements is not always clear-cut. For example, what about the demand for refrigerators at a particular store next month? The number of refrigerators demanded is clearly an integer (a count), but it probably has many
possible values, such as all integers from 0 to 100. In some cases like this, we often approximate in one of two ways. First, we might use a discrete distribution with only a few possible values, such as all multiples of 20 from 0 to 100. Second, we might approximate the possible demand as a continuum from 0 to 100. The reason for such approximations is to simplify the mathematics, and they are frequently used.

Mathematically, there is an important difference between discrete and continuous probability distributions. Specifically, a proper treatment of continuous distributions, analogous to the treatment we will provide in this chapter, requires calculus, which we do not presume for this book. Therefore, we discuss only discrete distributions in this chapter. In later chapters we often use continuous distributions, particularly the bell-shaped normal distribution, but we simply state their properties without trying to derive them mathematically.

The essential properties of a discrete random variable and its associated probability distribution are quite simple. We discuss them in general and then analyze a numerical example.

Let X be a random variable. (Usually, capital letters toward the end of the alphabet, such as X, Y, and Z, are used to denote random variables.) To specify the probability distribution of X, we need to specify its possible values and their probabilities. We assume that there are k possible values, denoted $v_1, v_2, \ldots, v_k$. The probability of a typical value $v_i$ is denoted in one of two ways, either $P(X = v_i)$ or $p(v_i)$. The first reminds you that this is a probability involving the random variable X, whereas the second is a simpler shorthand notation. Probability distributions must satisfy two criteria: (1) the probabilities must be nonnegative, and (2) they must sum to 1. In symbols, we must have

$$\sum_{i=1}^{k} p(v_i) = 1, \qquad p(v_i) \ge 0$$

This is basically all there is to it: a list of possible values and a list of associated probabilities that sum to 1. It is also sometimes useful to calculate cumulative probabilities. A
cumulative probability is the probability that the random variable is less than or equal to some particular value. For example, assume that 10, 20, 30, and 40 are the possible values of a random variable X, with corresponding probabilities 0.15, 0.25, 0.35, and 0.25. Then a typical cumulative probability is $P(X \le 30)$. From the addition rule it can be calculated as

$$P(X \le 30) = P(X = 10) + P(X = 20) + P(X = 30) = 0.75$$

The point is that the cumulative probabilities are completely determined by the individual probabilities.

It is often convenient to summarize a probability distribution with two or three well-chosen numbers. The first of these is the mean, often denoted $\mu$. It is also called the expected value of X and denoted $E(X)$ (for expected X). The mean is a weighted sum of the possible values, weighted by their probabilities, as shown in Equation (4.6). In much the same way that an average of a set of numbers indicates central location, the mean indicates the center of the probability distribution. You will see this more clearly when we analyze a numerical example.

Mean of a Probability Distribution, $\mu$

$$\mu = E(X) = \sum_{i=1}^{k} v_i \, p(v_i) \tag{4.6}$$

To measure the variability in a distribution, we calculate its variance or standard deviation. The variance, denoted by $\sigma^2$ or $\mathrm{Var}(X)$, is a weighted sum of the squared deviations of the possible values from the mean, where the weights are again the probabilities. This is shown in Equation (4.7). As in Chapter 2, the variance is expressed in the square of the units of X, such as dollars squared. Therefore, a more natural measure of
variability is the standard deviation, denoted by $\sigma$ or $\mathrm{Stdev}(X)$. It is the square root of the variance, as indicated by Equation (4.8).

Variance of a Probability Distribution, $\sigma^2$

$$\sigma^2 = \mathrm{Var}(X) = \sum_{i=1}^{k} \bigl(v_i - E(X)\bigr)^2 \, p(v_i) \tag{4.7}$$

Standard Deviation of a Probability Distribution, $\sigma$

$$\sigma = \mathrm{Stdev}(X) = \sqrt{\mathrm{Var}(X)} \tag{4.8}$$

Equation (4.7) is useful for understanding variance as a weighted average of squared deviations from the mean. However, the following is an equivalent formula for variance and is somewhat easier to implement in Excel. (It can be derived with straightforward algebra.) In words, you find the weighted average of the squared values, weighted by their probabilities, and then subtract the square of the mean.

Variance (computing formula)

$$\sigma^2 = \sum_{i=1}^{k} v_i^2 \, p(v_i) - \mu^2 \tag{4.9}$$

We now consider a typical example.

EXAMPLE 4.2 MARKET RETURN SCENARIOS FOR THE NATIONAL ECONOMY

An investor is concerned with the market return for the coming year, where the market return is defined as the percentage gain (or loss, if negative) over the year. The investor believes there are five possible scenarios for the national economy in the coming year: rapid expansion, moderate expansion, no growth, moderate contraction, and serious contraction. Furthermore, she has used all of the information available to her to estimate that the market returns for these scenarios are, respectively, 23%, 18%, 15%, 9%, and 3%. That is, the possible returns vary from a high of 23% to a low of 3%. Also, she has assessed that the probabilities of these outcomes are 0.12, 0.40, 0.25, 0.15, and 0.08. Use this information to describe the probability distribution of the market return.

In reality, there is a continuum of possible returns. Her assumption of only five possible returns is clearly an approximation to reality, but such an assumption is often useful.

Objective: To compute the mean, variance, and standard deviation of the probability distribution of the market return for the coming year.

Solution

To make the connection between the general notation and this particular example, let X denote the market return for the coming year. Then each
possible economic scenario leads to a possible value of X. For example, the first possible value is $v_1$ = 23%, and its probability is $p(v_1)$ = 0.12. These values and probabilities appear in columns B and C of Figure 4.3.² (See the file Market Return.xlsx.) Note that the five probabilities sum to 1, as they should. This probability distribution implies, for example, that the probability of a market return at least as large as 18% is 0.12 + 0.40 = 0.52 because it could occur as a result of rapid or moderate expansion of the economy. Similarly, the probability that the market return is 9% or less is 0.15 + 0.08 = 0.23 because this could occur as a result of moderate or serious contraction of the economy.

Figure 4.3 Probability Distribution of Market Returns

Mean, variance, and standard deviation of the market return

| Row | Economic outcome (A) | Probability (B) | Market return (C) | Sq dev from mean (D) |
|---|---|---|---|---|
| 4 | Rapid Expansion | 0.12 | 23% | 0.005929 |
| 5 | Moderate Expansion | 0.40 | 18% | 0.000729 |
| 6 | No Growth | 0.25 | 15% | 0.000009 |
| 7 | Moderate Contraction | 0.15 | 9% | 0.003969 |
| 8 | Serious Contraction | 0.08 | 3% | 0.015129 |

Summary measures of return

| Row | Measure (A) | Value (B) | Quick alternative formula (C) |
|---|---|---|---|
| 11 | Mean | 15.3% | |
| 12 | Variance | 0.002811 | 0.002811 |
| 13 | Stdev | 5.3% | 5.3% |

Range names used

| Range name | Refers to |
|---|---|
| Market_return | Market!C4:C8 |
| Mean | Market!B11 |
| Probability | Market!B4:B8 |
| Sq_dev_from_mean | Market!D4:D8 |
| Stdev | Market!B13 |
| Variance | Market!B12 |

² From here on, we often shade the given inputs in the spreadsheet figures blue so that you can immediately tell which cells contain inputs. This shading comes through clearly in the Excel files. On the printed page, the shading is a light blue.

The summary measures of this probability distribution appear in the range B11:B13. They can be calculated with the following steps. (Note that the formulas make use of the range names listed in the figure.)

PROCEDURE FOR CALCULATING SUMMARY MEASURES

1. Mean return. Calculate the mean return in cell B11 with the formula

`=SUMPRODUCT(Market_return,Probability)`

Excel Tip: Excel's SUMPRODUCT function is a gem, and you should use it whenever possible. It takes (at least) two arguments, which must be ranges of exactly the same size and shape. It sums the products of the values in these ranges. For example, `=SUMPRODUCT(A1:A3,B1:B3)` is equivalent to the formula `=A1*B1+A2*B2+A3*B3`. If the ranges contain only a few cells, there isn't much advantage to using SUMPRODUCT, but when the ranges are large, such as A1:A100 and B1:B100, SUMPRODUCT is the only viable choice.

This formula illustrates the general rule in Equation (4.6): The mean is the sum of products of possible values and probabilities.

2. Squared deviations. To get ready to compute the variance from Equation (4.7), calculate the squared deviations from the mean by entering the formula

`=(C4-Mean)^2`

in cell D4 and copying it down through cell D8.

3. Variance. Calculate the variance of the market return in cell B12 with the formula

`=SUMPRODUCT(Sq_dev_from_mean,Probability)`

This illustrates the general formula for variance in Equation (4.7). The variance is always a sum of products of squared deviations from the mean and probabilities. Alternatively, you can skip the calculation of the squared deviations from the mean and use Equation (4.9) directly. This is done in cell C12 with the formula

`=SUMPRODUCT(Market_return,Market_return,Probability)-Mean^2`

By entering the Market_return range twice in this SUMPRODUCT formula, you get the squares. From now on, we will use this simplified formula for variance and dispense with squared deviations from the mean. But regardless of how it is calculated, you should remember the essence of variance: It is a weighted average of squared deviations from the mean.

As always, range names are not required, but they make the Excel formulas easier to read. You can use them or omit them, as you wish.

4. Standard deviation. Calculate the standard deviation of the market return in cell B13 with the formula

`=SQRT(Variance)`

You can see that the mean return is 15.3% and the standard deviation is 5.3%. What do these measures really mean? First, the mean, or expected, return does not imply that the most likely return is 15.3%, nor is this the value that the investor expects to occur. In fact, the value 15.3% is not even a possible market return (at least not according to the model). You can understand these measures better in terms of long-run averages. Specifically, if you could imagine the coming year being repeated many times, each time using the probability distribution in columns B and C to generate a market return, then the average of these market returns would be close to 15.3%, and their standard deviation, calculated as in Chapter 2, would be close to 5.3%.

Before leaving this section, we want to emphasize a key point, a point that is easy to forget with all the details. The whole point of discussing probability and probability distributions, especially in the context of business problems, is that uncertainty is often a key factor, and you cannot simply ignore it. For instance, you saw in Example 4.2 that the mean return is 15.3%. However, it would be far from realistic to treat the actual return as a sure 15.3%, with no uncertainty. If you did this, you would be ignoring
the uncertainty completely, and it is often the uncertainty that makes business problems interesting and difficult. Therefore, to model such problems in a realistic way, you are forced to deal with probability and probability distributions.

4.3.1 Conditional Mean and Variance

There are many situations where the mean and variance of a random variable depend on some external event. In this case, you can condition on the outcome of the external event to find the overall mean and variance (or standard deviation) of the random variable.

It is best to motivate this with an example. Consider the random variable X, representing the percentage change in the price of stock A from now to a year from now. This change is driven partly by circumstances specific to company A, but it is also driven partly by the economy as a whole. In this case, the outcome of the economy is the external event. Let's assume that the economy in the coming year will be awful, stable, or great with probabilities 0.20, 0.50, and 0.30, respectively. In addition, we make the following assumptions: (1) given that the economy is awful, the mean and standard deviation of X are −20% and 30%; (2) given that the economy is stable, the mean and standard deviation of X are 5% and 20%; and (3) given that the economy is great, the mean and standard deviation of X are 25% and 15%. Each of these latter statements is a statement about X conditional upon the economy. What can you say about the unconditional mean and standard deviation of X? That is, what are the mean and
standard deviation of X before you learn the state of the economy? The answers come from Equations (4.10) and (4.11). In the context of the example, $p_i$ is the probability of economy state i, and $E_i(X)$ and $\mathrm{Var}_i(X)$ are the mean and variance of X, given that economy state i occurs.

Conditional Mean Formula

$$E(X) = \sum_{i=1}^{k} E_i(X) \, p_i \tag{4.10}$$

Conditional Variance Formula

$$\mathrm{Var}(X) = \sum_{i=1}^{k} \Bigl\{ \mathrm{Var}_i(X) + [E_i(X)]^2 \Bigr\} p_i - [E(X)]^2 \tag{4.11}$$

In the example, the mean percentage change in the price of stock A, from Equation (4.10), is

$$E(X) = 0.2(-20\%) + 0.5(5\%) + 0.3(25\%) = 6\%$$

To calculate the standard deviation of X, first use Equation (4.11) to calculate the variance, and then take its square root. The variance is

$$\mathrm{Var}(X) = 0.2\bigl[(30\%)^2 + (-20\%)^2\bigr] + 0.5\bigl[(20\%)^2 + (5\%)^2\bigr] + 0.3\bigl[(15\%)^2 + (25\%)^2\bigr] - (6\%)^2 = 0.06915$$

Taking the square root gives

$$\mathrm{Stdev}(X) = \sqrt{0.06915} = 26.30\%$$

Of course, these calculations can be done easily in Excel. See the file Stock Price and Economy.xlsx for the details.

The point of this example is that it is often easier to assess the uncertainty of some random variable X by conditioning on every possible outcome of some external event like the economy. However, before that outcome is known, the relevant mean and standard deviation of X are those calculated from Equations (4.10) and (4.11). In this particular example, before you know the state of the economy, the relevant mean and standard deviation of the change in the price of stock A are 6% and 26.3%, respectively.

PROBLEMS

Level A

10. A fair coin (i.e., heads and tails are equally likely) is tossed three times. Let X be the number of heads observed in three tosses of this fair coin.
a. Find the probability distribution of X.
b. Find the probability that two or fewer heads are observed in three tosses.
c. Find the probability that at least one head is observed in three tosses.
d. Find the expected value of X.
e. Find the standard deviation of X.

11. Consider a random variable with the following probability distribution: $P(X = 0) = 0.1$, $P(X = 1) = 0.2$, $P(X = 2) = 0.3$, $P(X = 3) = 0.3$, and $P(X = 4) = 0.1$.
a. Find $P(X \le 2)$.
b. Find $P(1 < X \le 3)$.
c. Find $P(X > 0)$.
d. Find $P(X > 3 \mid X > 2)$.
e. Find the expected value of X.
f. Find the standard deviation of X.

12. A study has shown that the probability distribution of X, the number of customers in line (including the one being served, if any) at a checkout counter in a department store, is given by $P(X = 0) = 0.25$, $P(X = 1) = 0.25$, $P(X = 2) = 0.20$, $P(X = 3) = 0.20$, and $P(X \ge 4) = 0.10$. Consider a newly arriving customer to the checkout line.
a. What is the probability that this customer will not have to wait behind anyone?
b. What is the probability that this customer will have to wait behind at least one customer?
c. On average, behind how many other customers will the newly arriving customer have to wait?

13. A construction company has to complete a project no later than three months from now or there will be significant cost overruns. The manager of the construction company believes that there are four possible values for the random variable X, the number of months from now it will take to complete this project: 2, 2.5, 3, and 3.5. The manager currently thinks that the probabilities of these four possibilities are in the ratio 1 to 2 to 4 to 2. That is, X = 2.5 is twice as likely as X = 2, X = 3 is twice as likely as X = 2.5, and X = 3.5 is half as likely as X = 3.
a. Find the probability distribution of X.
b. What is the probability that this project will be completed in less than three months from now?
c. What is the probability that this project will not be completed on time?
d. What is the expected completion time (in months) of this project from now?
e. How much variability (in months) exists around the expected value you found in part d?

14. Three areas of southern California
are prime candidates for forest fires each dry season. You believe (based on historical evidence) that each of these areas, independently of the others, has a 30% chance of having a major forest fire in the next dry season.
a. Find the probability distribution of X, the number of the three regions that have major forest fires in the next dry season.
b. What is the probability that none of the areas will have a major forest fire?
c. What is the probability that all of them will have a major forest fire?
d. What is the expected number of regions with major forest fires?
e. Each major forest fire is expected to cause $20 million in damage and other expenses. What is the expected amount of damage and other expenses in these three regions in the next dry season?

Level B

15. The National Football League playoffs are just about to begin. Because of their great record in the regular season, the Colts get a bye in the first week of the playoffs. In the second week, they will play the winner of the game between the Ravens and the Patriots. A football expert estimates that the Ravens will beat the Patriots with probability 0.45. This same expert estimates that if the Colts play the Ravens, the mean and standard deviation of the point spread (Colts points minus Ravens points) will be 6.5 and 10.5, whereas if the Colts play the Patriots, the mean and standard deviation of the point spread (Colts points minus Patriots points) will be 3.5 and 12.5. Find the mean and standard deviation of the point spread (Colts points minus their opponent's points) in the Colts game.

16. Because of tough economic times, the Indiana legislature is debating a bill that could have significant negative implications for public school funding. There are three possibilities for this bill: (1) it could be passed in essentially its current version; (2) it could be passed but with amendments that make it less harsh on public school funding; or (3) it could be defeated. The probabilities of these three events are estimated to be
0.4, 0.25, and 0.35, respectively. The effects on percentage changes in salaries next year at Indiana University are estimated as follows. If the bill is passed in its current version, the mean and standard deviation of salary percentage change will be 0% and 1%. If the bill is passed with amendments, the mean and standard deviation will be 1.5% and 3.5%. Finally, if the bill is defeated, the mean and standard deviation will be 3.5% and 6%. Find the mean and standard deviation of the percentage change in salaries next year at Indiana University.

17. The house edge in any game of chance is defined as

$$\text{House edge} = \frac{E(\text{player's loss on a bet})}{\text{Size of player's loss on a bet}}$$

For example, if a player wins $10 with probability 0.48 and loses $10 with probability 0.52 on any bet, the house edge is [−10(0.48) + 10(0.52)]/10 = 0.04.
a. Give an interpretation to the house edge that relates to how much money the house is likely to win on average.
b. Which do you think has a larger house edge: roulette or sports gambling? Why?

4.4 AN INTRODUCTION TO SIMULATION

In the previous section we asked you to imagine many repetitions of an event, with each repetition resulting in a different random outcome. Fortunately, you can do more than imagine; you can make it happen with computer simulation. Simulation is an extremely useful tool that can be used to incorporate uncertainty explicitly into spreadsheet models. A simulation model is the same as a regular spreadsheet model except that some cells include random quantities. Each time the spreadsheet recalculates, new values of the random quantities occur, and these
typically lead to different bottom-line results. By forcing the spreadsheet to recalculate many times, a business manager is able to discover the results that are most likely to occur, those that are least likely to occur, and best-case and worst-case results. We will use simulation several places in this book to help explain difficult concepts in probability and statistics. We begin in this section by using simulation to explain the connection between summary measures of probability distributions and the corresponding summary measures from Chapter 2.

We continue to use the market return distribution in Figure 4.3. Because this is your first discussion of computer simulation in Excel, we proceed in some detail. Our goal is to simulate many returns (we arbitrarily choose 400) from this distribution and analyze the resulting returns. We want each simulated return to have probability 0.12 of being 23%, probability 0.40 of being 18%, and so on. Then, using the methods for summarizing data from Chapter 2, we calculate the average and standard deviation of the 400 simulated returns.

The method for simulating many market returns is straightforward once you know how to simulate a single market return. The key to this is Excel's RAND function, which generates a random number between 0 and 1. The RAND function has no arguments, so every time you call it, you must enter RAND().³ (Although there is nothing inside the parentheses next to RAND, the parentheses cannot be omitted.) That is, to generate a random number between 0 and 1 in any cell, enter the formula

`=RAND()`

in that cell. The RAND function can also be used as part of another function. For example, you can simulate the result of a single flip of a fair coin by entering the formula

`=IF(RAND()<0.5,"Heads","Tails")`

Random numbers generated with Excel's RAND function are said to be uniformly distributed between 0 and 1 because all decimal values between 0 and 1 are equally likely. These uniformly distributed random numbers can then be used to generate numbers from
any discrete distribution, such as the market return distribution in Figure 4.3. To see how this is done, note first that there are five possible values in this distribution. Therefore, we divide the interval from 0 to 1 into five parts with lengths equal to the probabilities in the probability distribution. Then we see which of these parts the random number from RAND falls into and generate the associated market return. If the random number is between 0 and 0.12 (of length 0.12), we generate 23% as the market return; if the random number is between 0.12 and 0.52 (of length 0.40), we generate 18% as the market return; and so on. See Figure 4.4.

Figure 4.4 Associating RAND Values with Market Returns

| RAND interval | Interval length | Market return if RAND falls in this interval |
|---|---|---|
| 0 to 0.12 | 0.12 | 0.23 |
| 0.12 to 0.52 | 0.40 | 0.18 |
| 0.52 to 0.77 | 0.25 | 0.15 |
| 0.77 to 0.92 | 0.15 | 0.09 |
| 0.92 to 1 | 0.08 | 0.03 |

This procedure is accomplished most easily in Excel through the use of a lookup table. A lookup table is useful when you want to compare a particular value to a set of values and, depending on where the particular value falls, assign a given answer or value from an associated list of values. In this case we want to compare a generated random number to values between 0 and 1, falling in each of the five intervals shown in Figure 4.4, and then report the corresponding market return. This process is made relatively simple in Excel by applying the VLOOKUP function, as explained in the following steps (see footnote 4). Refer to Figure 4.5 and the Market Return.xlsx file.

Footnote 3. Before Excel 2007, RAND was the only built-in function for generating random numbers. The RANDBETWEEN function appeared in Excel 2007. (Actually, RANDBETWEEN was available in the Analysis ToolPak add-in, but most people weren't aware of it.) It generates uniformly distributed random integers within a given range. For example, RANDBETWEEN(1,6) generates a random integer from 1 to 6, with all values equally likely. This could be used to simulate the roll of a single die, for example.

Footnote 4. This could also be accomplished with nested IF functions, but the resulting formula would be much more complex.

Figure 4.5 Simulation of Market Returns

Range names used: LTable refers to Simulation!D13:E17; Simulated_market_return refers to Simulation!B13:B412.

| Summary statistics from simulation | Value |
|---|---|
| Average return (cell B4) | 15.2% |
| Stdev of returns (cell B5) | 5.2% |

| Exact values from previous sheet, for comparison | Value |
|---|---|
| Average return (cell B8) | 15.3% |
| Stdev of returns (cell B9) | 5.3% |

| Row | Random (column A) | Simulated market return (column B) | Cum Prob (column D) | Return (column E) |
|---|---|---|---|---|
| 13 | 0.937678 | 3% | 0 | 23% |
| 14 | 0.925121 | 3% | 0.12 | 18% |
| 15 | 0.447876 | 18% | 0.52 | 15% |
| 16 | 0.915249 | 9% | 0.77 | 9% |
| 17 | 0.125884 | 18% | 0.92 | 3% |
| 18 | 0.966630 | 3% | | |
| 19 | 0.614663 | 15% | | |
| ... | ... | ... | | |
| 410 | 0.087177 | 23% | | |
| 411 | 0.576865 | 15% | | |
| 412 | 0.564437 | 15% | | |

PROCEDURE FOR GENERATING RANDOM MARKET RETURNS IN EXCEL

1. Lookup table. Copy the possible returns to the range E13:E17. Then enter the cumulative probabilities next to them in the range D13:D17. To do this, enter the value 0 in cell D13. Then enter the formula

=D13+Market!B4

in cell D14 and copy it down through cell D17. (Note that the Market!B4 in this formula refers to cell B4 in the Market sheet, that is, cell B4 in Figure 4.3.) Each value in column D is the current probability plus the previous value. The table in this range, D13:E17, becomes the lookup range. For convenience, we have named this range LTable.
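The lookup-table idea described above can also be sketched outside Excel. The following is a minimal Python sketch, not part of the Market Return.xlsx file. It uses the returns and probabilities quoted in this section, builds the cumulative lower bounds 0, 0.12, 0.52, 0.77, 0.92, and uses the standard-library bisect module to play the role of the lookup. The function name lookup_return, the seed, and the variable names are illustrative assumptions.

```python
import bisect
import random
import statistics

# Market return distribution used in this section (returns in percent).
returns = [23, 18, 15, 9, 3]
probs = [0.12, 0.40, 0.25, 0.15, 0.08]

# Cumulative lower bounds of the five intervals, like the first
# column of the lookup table: 0, 0.12, 0.52, 0.77, 0.92.
cum_probs = [0.0]
for p in probs[:-1]:
    cum_probs.append(cum_probs[-1] + p)


def lookup_return(u):
    """Return the market return whose interval contains u, for 0 <= u < 1."""
    # bisect_right counts the lower bounds that are <= u; the last of
    # those rows is the matching interval.
    return returns[bisect.bisect_right(cum_probs, u) - 1]


rng = random.Random(12345)  # hypothetical seed, only for repeatability
simulated = [lookup_return(rng.random()) for _ in range(400)]

# Exact mean and standard deviation of the probability distribution.
exact_mean = sum(p * r for p, r in zip(probs, returns))
exact_stdev = (sum(p * r**2 for p, r in zip(probs, returns)) - exact_mean**2) ** 0.5

# Summary measures of the 400 simulated values (sample standard deviation).
sim_mean = statistics.mean(simulated)
sim_stdev = statistics.stdev(simulated)

print(f"exact:     mean {exact_mean:.1f}%, stdev {exact_stdev:.1f}%")
print(f"simulated: mean {sim_mean:.1f}%, stdev {sim_stdev:.1f}%")
```

The simulated figures change with the seed, just as the spreadsheet values change on each recalculation, so only the exact line is reproducible.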
2. Random numbers. Enter random numbers in the range A13:A412. An easy way to do this is to highlight the range, then type the formula

=RAND()

and finally press Ctrl+Enter. Note that these random numbers are live. That is, each time you do any calculation in Excel or press the recalculation key (the F9 key), these random numbers change.

Excel Tip: A quick way to enter a formula or value into a range of cells is to highlight the range, type in the formula or value, and press Ctrl+Enter (both keys at once). This is equivalent to entering the formula in the first cell of the range in the usual way and then copying it to the rest of the range.

3. Market returns. Generate the random market returns by referring the random numbers in column A to the lookup table. Specifically, enter the formula

=VLOOKUP(A13,LTable,2)

in cell B13 and copy it down through cell B412. This formula compares the random number in cell A13 to the cumulative probabilities in the first column of the lookup table and sees where it fits, as illustrated in Figure 4.4. Then it returns the corresponding market return in the second column of the lookup table. (It uses the second column because the third argument of the VLOOKUP function is 2.)

Excel Tip: In general, the VLOOKUP function takes three arguments: (1) the value to be compared, (2) a table of lookup values, with the values to be compared against always in the leftmost column, and (3) the column number of the lookup table that contains the answer. It also takes a fourth optional argument, not needed here. You can look it up in online help.

4. Summary statistics. Summarize the 400 market returns by entering the formulas

=AVERAGE(Simulated_market_return)

and

=STDEV(Simulated_market_return)

in cells B4 and B5. For comparison, copy the average and standard deviation from the Market sheet in Figure 4.3 to cells B8 and
B9.

Now let's step back and see what has been accomplished. The following points are relevant.

- Simulations like this are very common, and we will continue to use them to illustrate concepts in probability and statistics.
- The numbers you obtain will be different from the ones in Figure 4.5 because of the nature of simulation. The results depend on the particular random numbers that happen to be generated.
- The way we entered cumulative probabilities and then used a lookup table is generally the best way to generate random numbers from a discrete probability distribution. However, there is an easier way if a simulation add-in is available. We will discuss this in Chapter 15.
- Each generated market return in the Simulated_market_return range is one of the five possible market returns. If you count the number of times each return appears and then divide by 400, the number of simulated values, you will see that the resulting fractions are approximately equal to the original probabilities. For example, the fraction of times the highest return, 23%, appears is about 0.12. This is the essence of what it means to simulate from a given probability distribution.
- The average and standard deviation in cells B4 and B5, calculated from the formulas in Chapter 2, are very close to the mean and standard deviation of the probability distribution in cells B8 and B9. Note, however, that these measures are calculated in entirely different ways. For example, the average in cell B4 is a simple average of 400 numbers, whereas the mean in cell B8 is a weighted sum of the possible market returns, weighted by their probabilities.

This last point allows you to interpret the summary measures of a probability distribution. Specifically, the mean and standard deviation of a probability distribution are approximately what you would obtain if you calculated the average and standard deviation, using the formulas from Chapter 2, of many simulated values from this distribution. In other words, the mean is the long-run average of the simulated values. Similarly, the standard deviation measures their variability.

Margin note: You might ask whether this long-run average interpretation of the mean is relevant if the situation is going to occur only once. For example, the market return in the example is for the coming year, and the coming year will occur only once. So what is the use of a long-run average? In this type of situation, the long-run average interpretation is probably not very relevant, but fortunately, there is another use of the expected value that we exploit in Chapter 6. Specifically, when a decision maker must choose among several actions that have uncertain outcomes, the preferred decision is often the one with the largest expected (monetary) value. This makes the expected value of a probability distribution extremely important in decision-making contexts.

FUNDAMENTAL INSIGHT

Role of Simulation

Spreadsheet simulation is one of the most important tools in an analyst's arsenal. For this reason, it will be discussed in much more depth in later chapters, particularly the last two chapters. Simulation doesn't show you what will occur; instead, it shows you many of the possible scenarios that might occur. By seeing a variety of scenarios, including those that are normal and those that are extreme, you understand the situation much better and can make more informed decisions.

PROBLEMS

Level A

18. A quality inspector picks a sample of 15 items at random from a manufacturing process known to produce 10% defective items. Let X be the number of defective items found in the random sample of 15 items. Assume that the condition of each item is independent of that of each of the other items in the sample. The probability distribution of X is provided in the file P04_18.xlsx.
a. Use simulation to generate 500 values of this random variable X.
b. Calculate the mean and standard deviation of the simulated values. How do they compare to the mean and standard deviation of the given probability distribution?

19. A personnel manager of a large manufacturing plant is investigating the number of reported on-the-job accidents at the facility over the past several years. Let X be the number of such accidents reported during a one-month period. Based on past records, the manager has established the probability distribution for X as shown in the file P04_19.xlsx.
a. Use simulation to generate 1000 values of this random variable X.
b. Is the simulated distribution indicative of the given probability distribution? Explain why or why not.

20. Let X be the number of heads when a fair coin is flipped four times.
a. Find the distribution of X and then use simulation to generate 1000 values of X.
b. Is the simulated distribution indicative of the given probability distribution? Explain why or why not.
c. Calculate the mean and standard deviation of the simulated values. How do they compare to the mean and standard deviation of the given probability distribution?

Level B

21. The probability distribution of X, the number of customers in line (including the one being served, if any) at a checkout
counter in a department store, is given by P(X = 0) = 0.25, P(X = 1) = 0.25, P(X = 2) = 0.20, P(X = 3) = 0.20, and P(X = 4) = 0.10.
a. Use simulation to generate 500 values of this random variable X.
b. Is the simulated distribution indicative of the given probability distribution? Explain why or why not.
c. Calculate the mean and standard deviation of the simulated values. How do they compare to the mean and standard deviation of the given probability distribution?
d. Repeat parts a through c with 5000 simulated values rather than 500. Explain any differences you observe.

22. Betting on a football point spread works as follows. Suppose Michigan is favored by 17.5 points over Indiana. If you bet a unit on Indiana and Indiana loses by 17 or less, you win $10. If Indiana loses by 18 or more points, you lose $11. Find the mean and standard deviation of your winnings on a single bet. Assume that there is a 0.5 probability that you will win your bet and a 0.5 probability that you will lose your bet. Also, simulate 1600 bets to estimate the average loss per bet. (Note: Do not be too disappointed if you are off by up to 50 cents. It takes many, say 10,000, simulated bets to get a really good estimate of the mean loss per bet. This is because there is a lot of variability on each bet.)

4.5 DISTRIBUTION OF TWO RANDOM VARIABLES: SCENARIO APPROACH (see footnote 5)

We now turn to the distribution of two related random variables. In this section we discuss the situation where the two random variables are related in the sense that they both depend on which of several possible scenarios occurs. In the next section we discuss a second way of relating two random variables probabilistically. These two methods differ slightly in the way they assign probabilities to different outcomes. However, for both methods there are two summary measures, covariance and correlation, that measure the relationship between the two random variables. As with the mean, variance, and standard deviation, covariance and correlation are similar to the corresponding measures from Chapter 3, but they are
conceptually different. In Chapter 3, correlation and covariance were calculated from data; here they are calculated from a probability distribution.

We denote the covariance and correlation between two random variables X and Y by Covar(X, Y) and Correl(X, Y). These are defined by Equations (4.12) and (4.13). Here, $p(x_i, y_i)$ in Equation (4.12) is the probability that X and Y equal the values $x_i$ and $y_i$, respectively; it is called a joint probability.

Covariance between X and Y

$$\text{Covar}(X, Y) = \sum_{i=1}^{k} \bigl(x_i - E(X)\bigr)\bigl(y_i - E(Y)\bigr)\, p(x_i, y_i) \tag{4.12}$$

Correlation between X and Y

$$\text{Correl}(X, Y) = \frac{\text{Covar}(X, Y)}{\text{Stdev}(X) \times \text{Stdev}(Y)} \tag{4.13}$$

Footnote 5. The rest of this chapter is optional. Although it is very useful for applied probability models, it is not used in the rest of the book.

As with variance, the following is an equivalent formula for covariance that is easier to implement in Excel because it avoids the need for deviations from the means. This formula says to find a weighted sum of all the products of x's and y's, weighted by their joint probabilities, and then subtract the product of the means.

Covariance between X and Y, computing formula

$$\text{Covar}(X, Y) = \sum_{i=1}^{k} x_i y_i\, p(x_i, y_i) - E(X)E(Y) \tag{4.14}$$

Although covariance and correlation based on a joint probability distribution are calculated differently than for known data, their interpretation is essentially the same as that discussed in Chapter 3. Each indicates the strength of a linear relationship between X and Y. That is, if X and Y tend to vary in the same direction, then both measures are positive. If they vary in opposite
directions, both measures are negative. As before, the magnitude of the covariance is more difficult to interpret because it depends on the units of measurement of X and Y. However, the correlation is always between -1 and +1.

The following example illustrates the scenario approach, as well as covariance and correlation. Simulation is used to explain the relationship between the covariance and correlation as defined here and the corresponding measures from Chapter 3. (Margin note: This simulation is not necessary for the calculation of the covariance and correlation, but it provides some insight into their meanings.)

EXAMPLE 4.3 ANALYZING A PORTFOLIO OF INVESTMENTS IN GM STOCK AND GOLD

An investor plans to invest in General Motors (GM) stock and in gold. He assumes that the returns on these investments over the next year depend on the general state of the economy during the year. To keep things simple, he identifies four possible states of the economy: depression, recession, normal, and boom. Also, given the most up-to-date information he can obtain, he assesses the probabilities of these four states to be 0.05, 0.30, 0.50, and 0.15. For each state of the economy, he estimates the resulting return on GM stock and the return on gold. These appear in the shaded section of Figure 4.6. (See the file GM vs Gold.xlsx.) For example, if there is a depression, he estimates that GM stock will decrease by 20% and the price of gold will increase by 5%. The investor wants to analyze the joint distribution of returns on these two investments. He also wants to analyze the distribution of a portfolio of investments in GM stock and gold.

Margin note: As in the previous example, this assumption of discreteness is clearly an approximation of reality.

Figure 4.6 Distribution of GM and Gold Returns

| Row | Economic outcome (column A) | Probability (column B) | GM return (column C) | Gold return (column D) |
|---|---|---|---|---|
| 4 | Depression | 0.05 | -20% | 5% |
| 5 | Recession | 0.30 | 10% | 20% |
| 6 | Normal | 0.50 | 30% | -12% |
| 7 | Boom | 0.15 | 50% | 9% |

| Row | Measure (column A) | GM (column B) | Gold (column C) |
|---|---|---|---|
| 10 | Means | 24.5% | 1.6% |
| 11 | Variances | 0.0275 | 0.0203 |
| 12 | Stdevs | 16.6% | 14.2% |
| 14 | Covariance | -0.0097 | |
| 15 | Correlation | -0.410 | |

Objective: To obtain the relevant joint distribution and use it to calculate the covariance and correlation between returns on the two given investments, and also to analyze a portfolio containing these two investments.

Solution

To obtain the joint distribution, use the distribution of GM return, defined by columns B and C of the shaded region in Figure 4.6, and the distribution of gold return, defined by columns B and D. The scenario approach applies because a given state of the economy determines both GM and gold returns, so that only four pairs of returns are possible. For example, -20% is a possible GM return and 9% is a possible gold return, but they cannot occur simultaneously. The only possible pairs of returns, according to our assumptions, are -20% and 5%; 10% and 20%; 30% and -12%; and 50% and 9%. These possible pairs have the probabilities shown in column B.

To calculate means, variances, and standard deviations, GM and gold returns can be treated separately. For example, the formula for the mean GM return in cell B10 is

=SUMPRODUCT(C4:C7,B4:B7)

Margin note: We could again use range names, but there would be too many of them. Besides, it is usually easier to copy formulas when range names are not used.

The only new calculations in Figure 4.6 involve the covariance and correlation between GM and gold returns. To obtain these, use the following steps.

PROCEDURE FOR CALCULATING THE COVARIANCE AND CORRELATION

1. Covariance. Calculate the covariance
between GM and gold returns in cell B14 with the formula

=SUMPRODUCT(C4:C7,D4:D7,B4:B7)-B10*C10

Note the use of the SUMPRODUCT function in this formula. It usually takes two range arguments, but it can take more than two, all of which must have exactly the same dimension. This function multiplies corresponding elements from each of the three ranges and sums these products, exactly as prescribed by the summation in Equation (4.14). Then subtract the product of the means.

2. Correlation. Calculate the correlation between GM and gold returns in cell B15 with the formula

=B14/(B12*C12)

as prescribed by Equation (4.13).

The negative covariance indicates that GM and gold returns tend to vary in opposite directions, although it is difficult to judge the strength of the relationship between them by the magnitude of the covariance. The correlation of -0.410, on the other hand, is also negative and indicates a moderately negative relationship. You can't rely too much on this correlation, however, because the relationship between GM and gold returns is not linear. From the values in the range C4:D7, it is apparent that GM does better and better as the economy improves, whereas gold does better, then worse, then better.

A simulation of GM and gold returns sheds some light on the covariance and correlation measures. This simulation is shown in Figure 4.7.

Figure 4.7 Simulation of GM and Gold Returns

| Summary measures from simulation | GM | Gold |
|---|---|---|
| Means (row 5) | 23.6% | 2.5% |
| Stdevs (row 6) | 16.4% | 14.5% |
| Covariance (cell B8) | -0.0106 | |
| Correlation (cell B9) | -0.448 | |

| Exact results from previous sheet, for comparison | GM | Gold |
|---|---|---|
| Means (row 13) | 24.5% | 1.6% |
| Stdevs (row 14) | 16.6% | 14.2% |
| Covariance (cell B16) | -0.0097 | |
| Correlation (cell B17) | -0.410 | |

| Lookup table for generating returns: CumProb | GM return | Gold return |
|---|---|---|
| 0 | -20% | 5% |
| 0.05 | 10% | 20% |
| 0.35 | 30% | -12% |
| 0.85 | 50% | 9% |

| Row | Random (column A) | GM return (column B) | Gold return (column C) |
|---|---|---|---|
| 21 | 0.0100693 | -20% | 5% |
| 22 | 0.9107821 | 50% | 9% |
| 23 | 0.4105589 | 30% | -12% |
| 24 | 0.0385696 | -20% | 5% |
| 25 | 0.9010982 | 50% | 9% |
| 26 | 0.7536752 | 30% | -12% |
| ... | ... | ... | ... |
| 418 | 0.8730589 | 50% | 9% |
| 419 | 0.1297612 | 10% | 20% |
| 420 | 0.7331896 | 30% | -12% |

There are two keys to this simulation. First, we simulate the states of the economy, not (at least not directly) the GM and gold returns. For example, any random number between 0.05 and 0.35 implies a recession. The returns for GM and gold from a recession are then known to be 10% and 20%. You can implement this by entering a RAND function in cell A21 and then entering the formulas

=VLOOKUP(A21,LTable,2)

and

=VLOOKUP(A21,LTable,3)

in cells B21 and C21. Then copy these formulas down through row 420. This way, the same random number, hence the same scenario, is used to generate both returns in a given row, and the effect is that only four pairs of returns are possible. Second, once you have the simulated returns in the range B21:C420, you can calculate the covariance and correlation of these numbers in cells B8 and B9 with the formulas (see footnote 6)

=COVAR(B21:B420,C21:C420)

and

=CORREL(B21:B420,C21:C420)

Footnote 6. These formulas implement the covariance and correlation definitions from Chapter 3, not Equations (4.12) and (4.13) of this chapter, because these formulas are based on the simulated rows of data.
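As a cross-check on the scenario approach, the following minimal Python sketch computes the covariance and correlation from the probability distribution with Equations (4.14) and (4.13), and then repeats the calculation on 400 simulated scenario draws. It is only an illustration under stated assumptions: the helper names (weighted_mean, data_covar, data_correl), the seed, and the choice to divide the data covariance by n are not taken from the GM vs Gold.xlsx file.

```python
import bisect
import random

# Scenario data from Figure 4.6, with returns written as decimals.
probs = [0.05, 0.30, 0.50, 0.15]   # depression, recession, normal, boom
gm = [-0.20, 0.10, 0.30, 0.50]
gold = [0.05, 0.20, -0.12, 0.09]


def weighted_mean(values, weights):
    """Probability-weighted sum of the values."""
    return sum(w * v for v, w in zip(values, weights))


# Exact measures from the probability distribution.
mean_gm = weighted_mean(gm, probs)
mean_gold = weighted_mean(gold, probs)
stdev_gm = (weighted_mean([x**2 for x in gm], probs) - mean_gm**2) ** 0.5
stdev_gold = (weighted_mean([y**2 for y in gold], probs) - mean_gold**2) ** 0.5
covar = sum(p * x * y for p, x, y in zip(probs, gm, gold)) - mean_gm * mean_gold  # Eq. 4.14
correl = covar / (stdev_gm * stdev_gold)                                          # Eq. 4.13


def data_covar(xs, ys):
    """Covariance of paired data, dividing by n (an assumption of this sketch)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def data_correl(xs, ys):
    """Correlation of paired data."""
    return data_covar(xs, ys) / (data_covar(xs, xs) * data_covar(ys, ys)) ** 0.5


# Simulate the state of the economy, then read both returns from that state.
cum_probs = [0.0, 0.05, 0.35, 0.85]
rng = random.Random(2010)  # hypothetical seed, only for repeatability
states = [bisect.bisect_right(cum_probs, rng.random()) - 1 for _ in range(400)]
sim_gm = [gm[s] for s in states]
sim_gold = [gold[s] for s in states]

print(f"exact:     covariance {covar:.4f}, correlation {correl:.3f}")
print(f"simulated: covariance {data_covar(sim_gm, sim_gold):.4f}, "
      f"correlation {data_correl(sim_gm, sim_gold):.3f}")
```

Because one random number selects the state and both returns are read from that state, only the four pairs of returns can ever appear, which is the point of the scenario approach.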
In these two formulas, COVAR and CORREL are the built-in Excel functions discussed in Chapter 3 for calculating the covariance and correlation between pairs of numbers. A comparison of cells B8 and B9 with B16 and B17 shows that there is reasonably good agreement between the covariance and correlation of the probability distribution, from Equations (4.12) and (4.13), and the measures based on the simulated values. This agreement is not perfect, but it typically improves as you simulate more pairs.

The final question in this example involves a portfolio consisting of GM stock and gold. The analysis appears in Figure 4.8. We assume that the investor has $10,000 to invest. He puts some fraction of this in GM stock (see cell B6) and the rest in gold. Of course, these fractions determine the total dollar values invested in row 7. The key to the analysis is the following. Because there are only four possible scenarios, there are only four possible portfolio returns. For example, if there is a recession, the GM and gold returns are 10% and 20%, so the portfolio return (per dollar) is a weighted average of these returns, weighted by the fractions invested:

Portfolio return in recession = 0.6(10%) + 0.4(20%) = 14%

Figure 4.8 Distribution of Portfolio Return

Total to invest: $10,000

| Investments | GM | Gold |
|---|---|---|
| Fraction of total (row 6) | 0.60 | 0.40 |
| Dollar value (row 7) | $6,000 | $4,000 |

| Row | Economic outcome | Return | Total dollars |
|---|---|---|---|
| 11 | Depression | -10.0% | -$1,000 |
| 12 | Recession | 14.0% | $1,400 |
| 13 | Normal | 13.2% | $1,320 |
| 14 | Boom | 33.6% | $3,360 |

| Row | Summary measures of portfolio | Return | Total dollars |
|---|---|---|---|
| 18 | Mean | 15.340% | $1,534.00 |
| 19 | Variance | 0.008495 | 849,484 |
| 20 | Stdev | 9.217% | $921.67 |

Data table for mean and stdev of portfolio return, as a function of GM investment

| Row | GM investment | Mean | Stdev |
|---|---|---|---|
| 24 | (formula row) | $1,534.00 | $921.67 |
| 25 | 0.0 | $160.00 | $1,424.22 |
| 26 | 0.1 | $389.00 | $1,223.28 |
| 27 | 0.2 | $618.00 | $1,048.16 |
| 28 | 0.3 | $847.00 | $913.81 |
| 29 | 0.4 | $1,076.00 | $840.04 |
| 30 | 0.5 | $1,305.00 | $842.90 |
| 31 | 0.6 | $1,534.00 | $921.67 |
| 32 | 0.7 | $1,763.00 | $1,059.57 |
| 33 | 0.8 | $1,992.00 | $1,236.97 |
| 34 | 0.9 | $2,221.00 | $1,439.34 |
| 35 | 1.0 | $2,450.00 | $1,657.56 |

[Charts: Mean Portfolio Return and Stdev of Portfolio Return, each plotted against Fraction in GM from 0.0 to 1.0]

In this way, you can calculate the entire portfolio return distribution, either per dollar or total dollars, and then calculate its summary measures in the usual way. The details, which are similar to other spreadsheet calculations in this chapter, can be found in the GM vs Gold.xlsx file. In particular, the possible returns are listed in the ranges B11:B14 and C11:C14 of Figure 4.8, and the associated probabilities are the same as those used previously in this example. These lead to the summary measures in the range B18:C20. In particular, the investor's expected return is 15.34% and the standard deviation is 9.217%. Based on a $10,000 investment, these translate to an expected total dollar return of $1,534 and a standard deviation of $921.67.

Because an investor can choose the fractions to invest in GM and gold, it is important to see how the expected portfolio return and the standard deviation of portfolio return change as these fractions vary. To do this, make sure that the value in cell B6 is a constant and that formulas are entered in cells C6, B7, and C7. In this way, these last three cells update automatically when the value in cell B6 changes, and the total investment amount remains $10,000. Then form a data table in the range A24:C35 that calculates the mean and standard deviation of the total dollar portfolio return for each of several GM investment proportions in column A. (To do this, enter the formulas =C18 and =C20 in cells B24 and C24, highlight the range A24:C35, select Data Table from the What-If Analysis dropdown list on the Data ribbon, and enter cell B6 as the column input cell. No row input cell is necessary.)

Margin note: Recall that an Excel data table is used for what-if analysis. It allows you to vary an input over some range and see how one or more outputs change. For more details, refer to the Excel Tutorial.xlsx file.

The graphs of the means and standard deviations from this data table appear in Figure 4.8. They show that the expected portfolio return steadily increases as more and more is put into GM (and less is put into gold). However, the standard deviation, often used as a measure of risk, first decreases, then increases. This means there is a trade-off between expected return and risk, as measured by the standard deviation. The investor could obtain a higher expected return by putting more of his money into GM, but past a fraction of approximately 0.4, the risk also increases.

PROBLEMS

Level A

23. The quarterly sales levels (in millions of dollars) of two U.S. retail giants are dependent on the general state of the national economy in the coming months. The file P04_23.xlsx provides the probability distribution for the projected sales volume of each of these two retailers in the next quarter.
a. Find the mean and standard deviation of the quarterly sales volume for each of these two retailers. Compare these two sets of summary measures.
b. Find the covariance and correlation for the given quarterly sales volumes. Interpret your results.

24. The possible annual percentage returns of the stocks of Alpha, Inc. and Beta, Inc. are distributed as shown in the file P04_24.xlsx.
a. What is the expected annual return of Alpha's stock? What is the expected annual return of Beta's stock?
b. What is the standard deviation of the annual return of Alpha's stock? What is the standard deviation of the annual return of Beta's stock?
c. On the basis of your answers to the questions in parts a and b, which of these two stocks would you prefer to buy? Defend your choice.
d. Are the annual returns of these two stocks positively or negatively associated with each other? Answer by calculating the correlation between them. How might the answer to this question influence your decision to purchase shares of one or both of these companies?

25. The annual bonuses awarded to members of the management team and assembly line workers of an automobile manufacturer depend largely on the corporation's sales performance during the preceding year. The file P04_25.xlsx contains the probability distribution of possible bonuses (measured in hundreds of dollars) awarded to white-collar and blue-collar employees at the end of the company's fiscal year.
a. How much do a manager and an assembly line worker expect to receive in their bonus check at the end of a typical year?
b. For which group of employees within this organization does there appear to be more variability in the distribution of possible annual bonuses?
c. How strongly associated are the bonuses awarded to the white-collar and blue-collar employees of this company at the end of the year? Answer by calculating the correlation between them. What are some possible implications of this result for the relations between members of the management team and the assembly line workers in the future?
26. Consumer demand for small, economical automobiles depends somewhat on recent trends in the average price of unleaded gasoline. For example, consider the information given in the file P04_26.xlsx on the distributions of average annual sales of the Honda Civic and the Toyota Prius in relation to the trend of the average price of unleaded fuel over the past two years.
a. Find the annual mean sales levels of the Honda Civic and the Toyota Prius.
b. For which of these two models are sales levels more sensitive to recent changes in the average price of unleaded gasoline?
c. Given the available information, how strongly associated are the annual sales volumes of these two popular compact cars? Answer by calculating the correlation between them. Provide a qualitative explanation of the results.

27. Upon completing their respective homework assignments, marketing majors and accounting majors at a large state university enjoy hanging out at the local tavern in the evenings. The file P04_27.xlsx contains the distribution of number of hours spent by these students at the tavern in a typical week, along with typical cumulative grade point averages (on a 4-point scale) for marketing and accounting students with similar social habits.
a. Compare the means and standard deviations of the grade point averages of the two groups of students. Does one of the two groups consistently perform better academically than the other? Explain.
b. Does academic performance, as measured by cumulative GPA, seem to be associated with the amount of time students typically spend at the local tavern? If so, characterize the observed relationship.
c. Find the covariance and correlation between the typical grade point averages earned by the two subgroups of students. What do these measures of association indicate in this case?

4.6 DISTRIBUTION OF TWO RANDOM VARIABLES: JOINT PROBABILITY APPROACH

The previous section illustrated one possibility, the scenario approach, for specifying the joint distribution
of two random variables. You first identify several possible scenarios, next specify the value of each random variable that will occur under each scenario, and then assess the probability of each scenario. For people who think in terms of scenarios, and this includes many business managers, this is a very appealing approach.

In this section we illustrate an alternative method for specifying the probability distribution of two random variables $X$ and $Y$. You first identify the possible values of $X$ and the possible values of $Y$. Let $x$ and $y$ be any two such values. Then you directly assess the joint probability of the pair $(x, y)$ and denote it by $P(X = x \text{ and } Y = y)$ or, more simply, by $p(x, y)$. This is the probability of the joint event that $X = x$ and $Y = y$ both occur. As always, the joint probabilities must be nonnegative and sum to 1.

A joint probability distribution, specified by all probabilities of the form $p(x, y)$, provides a tremendous amount of information. It indicates not only how $X$ and $Y$ are related, but also how each of $X$ and $Y$ is distributed in its own right. In probability terms, the joint distribution of $X$ and $Y$ determines the marginal distributions of both $X$ and $Y$, where each marginal distribution is the probability distribution of a single random variable. (They are called marginal because they are usually displayed in the margins of a table.) The joint distribution also determines the conditional distributions of $X$ given $Y$ and of $Y$ given $X$. The conditional distribution of $X$ given $Y$, for example, is the distribution of $X$ given that $Y$ is known to equal a certain value. These concepts are best explained by means of an example, as we do next.

EXAMPLE 4.4 UNDERSTANDING THE RELATIONSHIP BETWEEN DEMANDS FOR SUBSTITUTE PRODUCTS

A company sells two products, product 1 and product 2, that tend to be substitutes for one another. That is, if a customer buys product 1, she tends not to buy product 2, and vice versa. The company assesses the joint probability distribution of demand for the two products during the coming month.
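Before working through the numbers, it may help to see the relationships just described in symbols. The following is a sketch in the $p(x, y)$ shorthand introduced above, using the standard definitions of marginal and conditional distributions; the summation form is an addition here, not a formula quoted from the text.

$$P(X = x) = \sum_{y} p(x, y), \qquad P(Y = y) = \sum_{x} p(x, y)$$

$$P(X = x \mid Y = y) = \frac{p(x, y)}{P(Y = y)} \quad \text{whenever } P(Y = y) > 0$$

Each marginal probability is a row or column total of the joint table, and each conditional probability rescales one row or column of that table so that it sums to 1.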
This joint distribution appears in the shaded region of Figure 4.9. (See the Demand sheet of the file Substitute Products.xlsx.) Column B and row 4 of this table show the possible values of demand for the two products. Specifically, the company assumes that demand for product 1 can be from 100 to 400 in increments of 100, and demand for product 2 can be from 50 to 250 in increments of 50. Furthermore, each possible value of demand 1 can occur with each possible value of demand 2, with the joint probability given in the table. For example, the joint probability that demand 1 is 200 and demand 2 is 100 is 0.08. Given this joint probability distribution, describe more fully the probabilistic structure of demands for the two products.

Figure 4.9 Joint Probability Distribution of Demands

Probability distribution of demands for substitute products. In the worksheet, the demands for product 1 are in row 4 (C4:F4), the demands for product 2 are in column B (B5:B9), and the joint probabilities are in C5:F9.

| Demand for product 2 | Demand for product 1: 100 | 200 | 300 | 400 |
|---|---|---|---|---|
| 50 | 0.015 | 0.040 | 0.050 | 0.035 |
| 100 | 0.030 | 0.080 | 0.075 | 0.025 |
| 150 | 0.050 | 0.100 | 0.100 | 0.020 |
| 200 | 0.045 | 0.100 | 0.050 | 0.010 |
| 250 | 0.060 | 0.080 | 0.025 | 0.010 |

Objective: To use the given joint probability distribution of demands to find the conditional distribution of demand for each product, given the demand for the other product, and to calculate the covariance and correlation between demands for these substitutes.

Solution

Let $D_1$ and $D_2$ denote the demands for products 1 and 2. You first find the marginal distributions of $D_1$ and $D_2$. These are the row and column sums of the joint
probabilities in Figure 4.10. An example of the reasoning is as follows. Consider the probability $P(D_1 = 200)$. If demand for product 1 is to be 200, it must be accompanied by some value of $D_2$; that is, exactly one of the joint events ($D_1 = 200$ and $D_2 = 50$) through ($D_1 = 200$ and $D_2 = 250$) must occur. Using the addition rule for probability, find the total probability of these joint events by summing the corresponding joint probabilities. The result is $P(D_1 = 200) = 0.40$, the column sum corresponding to $D_1 = 200$. Similarly, marginal probabilities for $D_2$, such as $P(D_2 = 150) = 0.27$, are the row sums calculated in column G in Figure 4.10. Note that the marginal probabilities, either those in row 10 or those in column G, sum to 1, as they should. These marginal probabilities indicate how the demand for either product behaves in its own right, aside from any considerations of the other product.

The marginal distributions indicate that in-between values of $D_1$ or of $D_2$ are most likely, whereas extreme values in either direction are less likely. However, these marginal distributions tell you nothing about the relationship between $D_1$ and $D_2$. After all, products 1 and 2 are supposedly substitute products. The joint probabilities spell out this relationship, but they are rather difficult to interpret. A better way is to calculate the conditional distributions of $D_1$ given $D_2$, or of $D_2$ given $D_1$. You can do this in rows 12 through 29 of Figure 4.10.

Focus on the conditional distribution of $D_1$ given $D_2$, shown in rows 12 through 19. In each row of this table (rows 15 through 19), you fix the value of $D_2$ at the value in column B and calculate the conditional probabilities of $D_1$ given this fixed value of $D_2$. The conditional probability is the joint probability divided by the marginal probability of $D_2$.
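The marginal and conditional calculations described above can also be sketched outside the spreadsheet. The following Python snippet is a minimal illustration, not the book's method: it hard-codes the joint probabilities of Example 4.4, and every variable and function name in it (`joint`, `marginal_d1`, `expected_value`, and so on) is an assumption introduced here. It also computes the covariance and correlation named in the example's objective, using the shortcut that the covariance equals the expected product of the demands minus the product of their means.

```python
import math

# Joint probabilities from Example 4.4 (rows: demand for product 2,
# columns: demand for product 1). All names here are illustrative only.
d1_values = [100, 200, 300, 400]
d2_values = [50, 100, 150, 200, 250]
joint = [
    [0.015, 0.040, 0.050, 0.035],
    [0.030, 0.080, 0.075, 0.025],
    [0.050, 0.100, 0.100, 0.020],
    [0.045, 0.100, 0.050, 0.010],
    [0.060, 0.080, 0.025, 0.010],
]

# Marginal distributions: row sums give D2, column sums give D1.
marginal_d2 = [sum(row) for row in joint]
marginal_d1 = [sum(col) for col in zip(*joint)]

# Conditional distribution of D1 given D2: divide each row by its row sum.
cond_d1_given_d2 = [[p / total for p in row]
                    for row, total in zip(joint, marginal_d2)]

# Conditional distribution of D2 given D1: divide each column by its column sum.
cond_d2_given_d1 = [[p / total for p, total in zip(row, marginal_d1)]
                    for row in joint]


def expected_value(values, probs):
    """Sum of possible values weighted by their probabilities."""
    return sum(v * p for v, p in zip(values, probs))


def variance(values, probs):
    """Expected squared value minus the square of the expected value."""
    mean = expected_value(values, probs)
    return sum(v * v * p for v, p in zip(values, probs)) - mean ** 2


mean_d1 = expected_value(d1_values, marginal_d1)
mean_d2 = expected_value(d2_values, marginal_d2)
var_d1 = variance(d1_values, marginal_d1)
var_d2 = variance(d2_values, marginal_d2)

# Covariance: expected product of the demands minus the product of the means.
expected_product = sum(
    d1 * d2 * joint[i][j]
    for i, d2 in enumerate(d2_values)
    for j, d1 in enumerate(d1_values)
)
covariance = expected_product - mean_d1 * mean_d2
correlation = covariance / math.sqrt(var_d1 * var_d2)

print(f"P(D1 = 200 | D2 = 150) = {cond_d1_given_d2[2][1]:.2f}")
print(f"Means: {mean_d1:.2f}, {mean_d2:.2f}")
print(f"Covariance: {covariance:.2f}, correlation: {correlation:.3f}")
```

Plain lists keep the sketch free of third-party dependencies; with a larger table, an array library would be the more natural choice. The printed values can be compared with the worksheet results in Figure 4.10 as a check.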
Figure 4.10 Marginal and Conditional Distributions and Summary Measures

Probability distribution of demands for substitute products (joint probabilities in C5:F9, row sums in column G, column sums in row 10):

| Demand for product 2 | Demand for product 1: 100 | 200 | 300 | 400 | Row sum |
|---|---|---|---|---|---|
| 50 | 0.015 | 0.040 | 0.050 | 0.035 | 0.140 |
| 100 | 0.030 | 0.080 | 0.075 | 0.025 | 0.210 |
| 150 | 0.050 | 0.100 | 0.100 | 0.020 | 0.270 |
| 200 | 0.045 | 0.100 | 0.050 | 0.010 | 0.205 |
| 250 | 0.060 | 0.080 | 0.025 | 0.010 | 0.175 |
| Column sum | 0.20 | 0.40 | 0.30 | 0.10 | |

Conditional distribution of demand for product 1, given demand for product 2 (worksheet rows 15 through 19):

| Demand for product 2 | Demand for product 1: 100 | 200 | 300 | 400 | Row sum |
|---|---|---|---|---|---|
| 50 | 0.11 | 0.29 | 0.36 | 0.25 | 1.00 |
| 100 | 0.14 | 0.38 | 0.36 | 0.12 | 1.00 |
| 150 | 0.19 | 0.37 | 0.37 | 0.07 | 1.00 |
| 200 | 0.22 | 0.49 | 0.24 | 0.05 | 1.00 |
| 250 | 0.34 | 0.46 | 0.14 | 0.06 | 1.00 |

Conditional distribution of demand for product 2, given demand for product 1 (worksheet rows 24 through 28, column sums in row 29):

| Demand for product 2 | Demand for product 1: 100 | 200 | 300 | 400 |
|---|---|---|---|---|
| 50 | 0.08 | 0.10 | 0.17 | 0.35 |
| 100 | 0.15 | 0.20 | 0.25 | 0.25 |
| 150 | 0.25 | 0.25 | 0.33 | 0.20 |
| 200 | 0.23 | 0.25 | 0.17 | 0.10 |
| 250 | 0.30 | 0.20 | 0.08 | 0.10 |
| Column sum | 1.00 | 1.00 | 1.00 | 1.00 |

Summary measures (worksheet rows 32 through 34):

| | Product 1 | Product 2 |
|---|---|---|
| Means | 230.00 | 153.25 |
| Variances | 8100.00 | 4176.94 |
| Stdevs | 90.00 | 64.63 |

Products of demands 1 and 2 for covariance calculation (C38:F42):

| Demand for product 2 | Demand for product 1: 100 | 200 | 300 | 400 |
|---|---|---|---|---|
| 50 | 5000 | 10000 | 15000 | 20000 |
| 100 | 10000 | 20000 | 30000 | 40000 |
| 150 | 15000 | 30000 | 45000 | 60000 |
| 200 | 20000 | 40000 | 60000 | 80000 |
| 250 | 25000 | 50000 | 75000 | 100000 |

| | |
|---|---|
| Covariance (B44) | -1647.50 |
| Correlation (B45) | -0.283 |

For example, the conditional probability that $D_1$ equals 200, given that $D_2$ equals 150, is

$$P(D_1 = 200 \mid D_2 = 150) = \frac{P(D_1 = 200 \text{ and } D_2 = 150)}{P(D_2 = 150)} = \frac{0.100}{0.270} \approx 0.37$$
This formula follows from the general conditional probability formula in Section 4.2.3. These conditional probabilities can be calculated all at once by entering the formula `=C5/$G5` in cell C15 and copying it to the range C15:F19. (Make sure you see why only column G, not row 5, is held absolute in this formula.) You can also check that each row of this table is a probability distribution in its own right by summing across rows. The row sums shown in column G are all equal to 1, as they should be.

Similarly, the conditional distribution of $D_2$ given $D_1$ is in rows 21 through 29. Here, each column represents the conditional probability distribution of $D_2$ given the fixed value of $D_1$ in row 23. These probabilities can be calculated by entering the formula `=C5/C$10` in cell C24 and copying it to the range C24:F28. Now the column sums shown in row 29 are 1, indicating that each column of the table represents a probability distribution.

Various summary measures can now be calculated. Some of these are shown in Figure 4.10. The following steps present the details.

PROCEDURE FOR CALCULATING SUMMARY MEASURES

1. Expected values. The expected demands in cells B32 and C32 follow from the marginal distributions. To calculate these, enter the formulas `=SUMPRODUCT(C4:F4,C10:F10)` and `=SUMPRODUCT(B5:B9,G5:G9)` in these two cells. Note that each of these is based on Equation 4.6 for an expected value, that is, a sum of products of possible values and their marginal probabilities.

2. Variances and standard deviations. These measures of variability are also calculated from the marginal distributions by appealing to Equation 4.9. For example, to find the variance of $D_1$, enter the formula `=SUMPRODUCT(C4:F4,C4:F4,C10:F10)-B32^2` in cell B33 and take its square root in cell B34.

3. Covariance and correlation. The formulas for covariance and
correlation are the same as before (see Equations 4.14 and 4.13). However, unlike Example 4.3, a complete table of products of possible demands, in the range C38:F42, is required. Then calculate the covariance in cell B44 with the formula `=SUMPRODUCT(C38:F42,C5:F9)-B32*C32`. Finally, calculate the correlation in cell B45 with the formula `=B44/(B34*C34)`.

Now let's step back and examine the results. If you are interested in the behavior of a single demand only, say $D_1$, then the relevant quantities are the marginal probabilities in row 10 and the mean and standard deviation of $D_1$ in cells B32 and B34. However, you are often more interested in the joint behavior of $D_1$ and $D_2$. The best way to see this behavior is in the conditional probability tables. For example, compare the probability distributions in rows 15 through 19. As the value of $D_2$ increases, the probabilities for $D_1$ tend to shift to the left. That is, as demand for product 2 increases, demand for product 1 tends to decrease.

This is only a tendency. When $D_2$ equals its largest value, there is still some chance that $D_1$ will be large, but this probability is fairly small. This behavior can be seen more clearly from the graph in Figure 4.11. Each line in this graph corresponds to one of the rows 15 through 19. The legend shows the different values of $D_2$. You can see that when $D_2$ is large, $D_1$ tends to be small, although again, this is only a tendency, not a perfect relationship. When economists say that the two products are substitutes for one another, this is the type of
behavior they imply.

Figure 4.11 Conditional Distributions of Demand 1 Given Demand 2 (line chart titled "Distribution of Demand 1 Given Demand 2"; horizontal axis: value of demand 1, from 100 to 400; vertical axis: probability, from 0.00 to 0.60; one line for each value of demand 2: 50, 100, 150, 200, 250)

By symmetry, the conditional distribution of $D_2$ given $D_1$ shows the same type of behavior. This is illustrated in Figure 4.12, where each line represents one of the columns C through F in the range C24:F28, and the legend shows the different values of $D_1$.

Figure 4.12 Conditional Distributions of Demand 2 Given Demand 1 (line chart titled "Distribution of Demand 2 Given Demand 1"; horizontal axis: value of demand 2, from 50 to 250; vertical axis: probability, from 0.00 to 0.40; one line for each value of demand 1: 100, 200, 300, 400)

The information in these graphs is confirmed, to some extent at least, by the covariance and correlation between $D_1$ and $D_2$. In particular, their negative values indicate that demands for the two products tend to move in opposite directions. Also, the rather small magnitude of the correlation, $-0.283$, indicates that the relationship between these demands is far from perfect. When $D_1$ is large, there is still a reasonably good chance that $D_2$ will be large, and when $D_1$ is small, there is still a reasonably good chance that $D_2$ will be small.

4.6.1 How to Assess Joint Probability Distributions

In the scenario approach from Section 4.5, only one probability for each scenario has to be assessed. In the joint probability approach, a whole table of joint probabilities must be assessed. This can be quite difficult, especially when there are many possible values for
each of the random variables. In Example 4.4, the approach requires $4 \times 5 = 20$ joint probabilities that not only sum to 1 but also imply the desired substitute-product behavior.

One approach is to proceed backward from the way illustrated in the example. Instead of specifying the joint probabilities and then deriving the marginal and conditional distributions, you can specify either set of marginal probabilities and either set of conditional probabilities, and then use these to calculate the joint probabilities. The reasoning is based on the multiplication rule for probability in the form given by Equation 4.15.

Assessing the joint probability distribution of two or more random variables is never easy for a manager, but the suggestions here are useful.

Joint Probability Formula

$$P(X = x \text{ and } Y = y) = P(X = x \mid Y = y)\,P(Y = y) \tag{4.15}$$

In words, the joint probability on the left is the conditional probability that $X = x$ given $Y = y$, multiplied by the marginal probability that $Y = y$. The roles of $X$ and $Y$ can be reversed, yielding the alternative formula in Equation 4.16. In general, you would choose the formula that makes the probabilities on the right-hand side easiest to assess.

Alternative Joint Probability Formula

$$P(X = x \text{ and } Y = y) = P(Y = y \mid X = x)\,P(X = x) \tag{4.16}$$

The advantage of this procedure over assessing the joint probabilities directly is that it is probably easier and more intuitive for a business manager. The manager has more control over the relationship between the two random variables, as determined by the conditional probabilities she assesses. Still, it is not easy, especially if these are subjective probabilities. The manager will need to make many assessments of the likelihoods of events based on her knowledge of the business.

PROBLEMS

Level A

28. Let $X$ and $Y$ represent the number of Dell and HP laptop computers, respectively, sold per month from online sites. The file P04_28.xlsx contains the probabilities of various combinations of monthly sales volumes of these competitors.
   a. Find the marginal distributions of $X$ and $Y$. Interpret your findings.
   b. Calculate the expected monthly laptop computer sales volumes for Dell and HP at these sites.
   c. Calculate the standard deviations of the monthly laptop computer sales volumes for Dell and HP at these sites.
   d. Find and interpret the conditional distribution of $X$ given $Y$.
   e. Find and interpret the conditional distribution of $Y$ given $X$.
   f. Find and interpret the correlation between $X$ and $Y$. Are these random variables independent or nearly so?

29. The joint probability distribution of the weekly demand for two brands of diet soda is provided in the file P04_29.xlsx. In particular, let $D_1$ and $D_2$ represent the weekly demand (in hundreds of two-liter bottles) for brand 1 and brand 2, respectively, in a small town in central Indiana.
   a. Find the mean and standard deviation of this community's weekly demand for each brand of diet soda.
   b. What is the probability that the weekly demand for each brand will be at least one standard deviation above its mean?
   c. What is the probability that at least one of the two weekly demands will be at least one standard deviation above its mean?
   d. What is the correlation between the weekly demands for these two brands of diet soda? What does this measure of association tell you about the relationship between these two products?

30. A local pharmacy has two checkout stations available to its customers: a regular checkout station and an express checkout station. Customers with six or fewer items are assumed to join the express line. Let $X$ and $Y$ be the numbers of customers in the regular checkout line and the express checkout line, respectively, at the busiest time of a typical day. Note that these numbers include the customers being served, if any. The joint distribution for $X$ and $Y$ is given in the file P04_30.xlsx.
   a. Find the marginal distributions of $X$ and $Y$. What does each of these distributions tell you?
   b. Find the conditional distribution of $X$ given $Y$. What is the practical benefit of knowing this conditional distribution?
   c. What is the probability that no one is waiting or being served in the regular checkout line?
   d. What is the probability that no one is waiting or being served in the express checkout line?
   e. What is the probability that no more than two customers are waiting in both lines combined?
   f. On average, how many customers would you expect to see in each of these two lines during the busiest time of day at the pharmacy?

31. Suppose that the manufacturer of a particular product assesses the joint distribution of price $P$ per unit and demand $D$ for its product in the coming quarter, as provided in the file P04_31.xlsx.
   a. Find the expected price and the expected demand in the coming quarter.
   b. What is the probability that the price of this product will be above its mean in the coming quarter?
   c. What is the probability that the demand for this product will be below its mean in the coming quarter?
   d. What is the probability that demand for this product will exceed 2500 units during the coming quarter, given that its price is less than \$40?
   e. What is the probability that demand for this product will be fewer than 3500 units during the coming quarter, given that its price is greater than \$30?
   f. Find the correlation between price and demand. Is the result consistent with your expectations? Explain.

Level B

32. The recent weekly trends of two particular stock prices can best be described by the joint probability distribution shown in the file P04_32.xlsx.
   a. What is the probability that the price of stock 1 will not increase in the coming week?
   b. What is the probability that the price of stock 2 will change in the coming week?
   c. What is the probability that the price of stock 1 will not decrease, given that the price of stock 2 remains constant in the coming week?
   d. What is the probability that the price of stock 2 will change, given that the price of stock 1 changes in the coming week?
   e. Why is it impossible to find the correlation between the typical weekly movements of these two stock prices from the information given? Nevertheless, does it appear that they are positively or negatively related? Why? What are the implications of this result for choosing an investment portfolio that may or may not include these two particular stocks?

33. Two service elevators are used in parallel by employees of a three-story hotel building. At any point in time when both elevators are stationary, let $X_1$ and $X_2$ be the floor numbers at which elevators 1 and 2, respectively, are currently located. The joint probability distribution of $X_1$ and $X_2$ is given in the file P04_33.xlsx.
   a. What is the probability that these two elevators are not stationed on the same floor?
   b. What is the probability that elevator 2 is located on the third floor?
   c. What is the probability that elevator 1 is not located on the first floor?
   d. What is the probability that elevator 2 is located on the first floor, given that elevator 1 is not stationed on the first floor?
   e. What is the probability that a hotel employee approaching the first-floor elevators will find at least one available for service?
   f. Repeat part e for a hotel employee approaching each of the second- and third-floor elevators.
   g. How might this hotel's operations manager respond to your findings in the previous questions?

4.7 INDEPENDENT RANDOM VARIABLES

A very important special case of joint distributions is when the random variables are independent. Intuitively, this means that any information about the values of any of the random variables is worthless in terms of predicting any of the others. In particular, if there are only two random
variables $X$ and $Y$, then information about $X$ is worthless in terms of predicting $Y$, and vice versa. Usually, random variables in real applications are not independent; they are usually related in some way, in which case we say they are dependent. However, we often make an assumption of independence in mathematical models to simplify the analysis.

The most intuitive way to express independence of $X$ and $Y$ is to say that their conditional distributions are equal to their marginals. For example, the conditional probability that $X$ equals any value $x$, given that $Y$ equals some value $y$, equals the marginal probability that $X$ equals $x$, and this statement is true for all values of $x$ and $y$. In words, knowledge of the value of $Y$ has no effect on probabilities involving $X$. Similarly, knowledge of the value of $X$ has no effect on probabilities involving $Y$.

An equivalent way of stating the independence property is that for all values $x$ and $y$, the events $X = x$ and $Y = y$ are probabilistically independent in the sense of Section 4.2.4. This leads to the important property that joint probabilities equal the product of the marginals, as shown in Equation 4.17. This follows from Equation 4.15 and also because conditionals equal marginals under independence. Equation 4.17 might not be as intuitive, but it is very useful, as illustrated in the following example.

Joint Probability Formula for Independent Random Variables

$$P(X = x \text{ and } Y = y) = P(X = x)\,P(Y = y) \tag{4.17}$$

EXAMPLE 4.5 ANALYZING THE SALES OF TWO POPULAR PERSONAL DIGITAL ASSISTANTS

A local office supply and
equipment store, Office Station, sells several different brands of personal digital assistants (PDAs). One of the store's managers has studied the daily sales of its two most popular personal digital assistants, the Palm M505 and the Palm Vx, over the past quarter. In particular, she has used historical data to assess the joint probability distribution of the sales of these two products on a typical day. The assessed distribution is shown in Figure 4.13. (See the file PDA Sales.xlsx.) The manager would like to use this distribution to determine whether there is support for the claim that the sales of the Palm Vx are often made at the expense of Palm M505 sales, and vice versa.

Figure 4.13 Joint Probability Distribution of Sales

Assessed probability distribution of sales of two popular PDAs:

| Daily sales of Palm M505 | Daily sales of Palm Vx: 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.01 | 0.03 | 0.06 | 0.09 |
| 1 | 0.02 | 0.06 | 0.12 | 0.09 |
| 2 | 0.03 | 0.12 | 0.06 | 0.09 |
| 3 | 0.04 | 0.09 | 0.06 | 0.03 |

Objective: To use the assessed joint probability distribution to find the conditional distribution of daily sales of each PDA, given the sales of the other PDA, and to determine whether the daily sales of these two products are independent random variables.

Solution

As in the solution of Example 4.4, begin by applying the addition rule for probability to find the marginals for each of the two personal digital assistants, as shown in Figure 4.14.

Figure 4.14 Marginal and Conditional Distributions of Sales

Assessed probability distribution of sales of two popular PDAs (row sums in column G, column sums in row 9):

| Daily sales of Palm M505 | Daily sales of Palm Vx: 0 | 1 | 2 | 3 | Row sum |
|---|---|---|---|---|---|
| 0 | 0.01 | 0.03 | 0.06 | 0.09 | 0.19 |
| 1 | 0.02 | 0.06 | 0.12 | 0.09 | 0.29 |
| 2 | 0.03 | 0.12 | 0.06 | 0.09 | 0.30 |
| 3 | 0.04 | 0.09 | 0.06 | 0.03 | 0.22 |
| Column sum | 0.10 | 0.30 | 0.30 | 0.30 | |

Conditional distribution of sales of Palm Vx, given sales of Palm M505 (C14:F17):

| Daily sales of Palm M505 | Daily sales of Palm Vx: 0 | 1 | 2 | 3 | Row sum |
|---|---|---|---|---|---|
| 0 | 0.05 | 0.16 | 0.32 | 0.47 | 1 |
| 1 | 0.07 | 0.21 | 0.41 | 0.31 | 1 |
| 2 | 0.10 | 0.40 | 0.20 | 0.30 | 1 |
| 3 | 0.18 | 0.41 | 0.27 | 0.14 | 1 |

Conditional distribution of sales of Palm M505, given sales of Palm Vx (C22:F25):

| Daily sales of Palm M505 | Daily sales of Palm Vx: 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.10 | 0.10 | 0.20 | 0.30 |
| 1 | 0.20 | 0.20 | 0.40 | 0.30 |
| 2 | 0.30 | 0.40 | 0.20 | 0.30 |
| 3 | 0.40 | 0.30 | 0.20 | 0.10 |
| Column sum | 1 | 1 | 1 | 1 |

Before finding the conditional distribution of sales for each product, you can check whether these two random variables are independent. Let $M$ and $V$ denote the daily sales for the Palm M505 and Palm Vx, respectively. Equation 4.17 states that $P(M = m \text{ and } V = v) = P(M = m)\,P(V = v)$ for all values of $m$ and $v$ if $M$ and $V$ are independent. However, the marginal probabilities indicate that $P(M = 0)\,P(V = 0) = 0.19(0.10) = 0.019$, whereas $P(M = 0 \text{ and } V = 0) = 0.01$ from the table. Therefore, there is at least one case where the joint probability does not equal the product of the marginals. This inequality rules out the possibility that $M$ and $V$ are independent random variables. If you are not yet convinced of this conclusion, compare the products of other marginal probabilities with corresponding joint probabilities in Figure 4.14. You can verify that Equation 4.17 fails to hold for virtually all of the different combinations of sales levels.

The conditional distributions of $V$ given $M$ and of $M$ given $V$ are shown in the ranges C14:F17 and C22:F25, calculated exactly as in Example 4.4. What can the Office Station manager infer from these conditional probability distributions? Observe in the first table that the likelihood of achieving the highest daily sales level of the Palm Vx decreases as the daily sales level of the Palm M505 increases. This same table reveals that the probability of
experiencing the lowest daily sales level of the Palm Vx increases as the daily sales level of the Palm M505 increases. Furthermore, by closely examining the second table, you can see that the likelihood of achieving the highest daily sales level of the Palm M505 decreases as the daily sales level of the Palm Vx increases. This same table reveals that the probability of experiencing the lowest daily sales level of the Palm M505 increases as the daily sales level of the Palm Vx increases.

Therefore, there is considerable support for the claim that sales of the Palm Vx are often made at the expense of Palm M505 sales, and vice versa. This result makes sense from a business point of view, and it implies that the daily sales of these two products are not independent of one another. In other words, by knowing the sales level of one of these PDAs, the manager has a better understanding of the likelihood of achieving particular sales of the other product.

PROBLEMS

Level A

34. The file P04_34.xlsx shows the conditional distribution of the daily number of accidents at a given intersection during the winter months, $X_2$, given the amount of snowfall (in inches) for the day, $X_1$. The marginal distribution of $X_1$ is provided in the bottom row of the table.
   a. Are $X_1$ and $X_2$ independent random variables? Explain why or why not.
   b. What is the probability of no accidents at this intersection on a winter day with no snowfall?
   c. What is the probability of no accidents at this intersection on a randomly selected winter day?
   d. What is the
probability of at least two accidents at this intersection on a randomly selected winter day on which the snowfall is at least three inches?
   e. What is the probability of less than four inches of snowfall on a randomly selected day?

35. A sporting goods store sells two competing brands of exercise bicycles. Let $X_1$ and $X_2$ be the numbers of the two brands sold on a typical day at this store. Based on available historical data, the conditional probability distribution of $X_1$ given $X_2$ is assessed as shown in the file P04_35.xlsx. The marginal distribution of $X_2$ is given in the bottom row of the table.
   a. Are $X_1$ and $X_2$ independent random variables? Explain why or why not.
   b. What is the probability of observing the sale of exactly one brand 1 bicycle and exactly one brand 2 bicycle on the same day at this store?
   c. What is the probability of observing the sale of at least one brand 1 bicycle on a given day at this store?
   d. What is the probability of observing the sale of no more than two brand 2 bicycles on a given day at this store?
   e. Given that no brand 2 bicycles are sold on a given day, what is the likelihood of observing the sale of at least one brand 1 bicycle at this store?

36. The file P04_28.xlsx contains the probabilities of various combinations of monthly sales volumes of Dell ($X$) and HP ($Y$) laptop computers from online sites. Are the monthly sales of these two competitors independent of each other? Explain your answer.

37. Let $D_1$ and $D_2$ represent the weekly demand (in hundreds of two-liter bottles) for brand 1 diet soda and brand 2 diet soda, respectively, in a small central Indiana town. The joint probability distribution of the weekly demand for these two brands of diet soda is provided in the file P04_29.xlsx. Are $D_1$ and $D_2$ independent random variables? Explain why or why not.

38. The file P04_31.xlsx contains the joint probability distribution of price $P$ per unit and demand $D$ for a particular product in the coming quarter.
   a. Are $P$ and $D$ independent
random variables? Explain your answer.
   b. If $P$ and $D$ are not independent random variables, which joint probabilities result in the same marginal probabilities for $P$ and $D$ as given in the file, but make $P$ and $D$ independent of each other?

Level B

39. You know that in one year you are going to buy a house. In fact, you have already selected the neighborhood, but right now you are finishing your graduate degree and you are engaged to be married this summer, so you are delaying the purchase for a year. The annual interest rate for fixed-rate 30-year mortgages is currently 6.00%, and the price of the type of house you are considering is \$120,000. However, things may change. Using your knowledge of the economy (and a crystal ball), you estimate that the interest rate might increase or decrease by as much as one percentage point. Also, the price of the house might increase by as much as \$10,000 (it certainly won't decrease). You assess the probability distribution of the interest rate change as shown in the file P04_39.xlsx. The probability distribution of the increase in the price of the house is also shown in this file. Finally, you assume that the two random events (change in interest rate, change in house price) are probabilistically independent. This means that the probability of any joint event, such as an interest increase of 0.50% and a price increase of \$5,000, is the product of the individual probabilities.

Variance of a Weighted Sum of Independent Random Variables

$$\text{Var}(Y) = a_1^2\,\text{Var}(X_1) + a_2^2\,\text{Var}(X_2) + \cdots + a_n^2\,\text{Var}(X_n) \tag{4.19}$$

Using summation notation, this becomes

$$\text{Var}(Y) = \sum_{i=1}^{n} a_i^2\,\text{Var}(X_i)$$

If the $X$s are not
independent, the variance of $Y$ is more complex and requires covariances. In particular, for every pair $X_i$ and $X_j$, there is an extra term in Equation (4.19): $2a_ia_j\,\text{Covar}(X_i, X_j)$. The general result is best written in summation notation, as shown in Equation (4.20).

Variance of a Weighted Sum of Dependent Random Variables

$$\text{Var}(Y) = \sum_{i=1}^{n} a_i^2\,\text{Var}(X_i) + \sum_{i<j} 2a_ia_j\,\text{Covar}(X_i, X_j) \tag{4.20}$$

The first summation would be the variance if the Xs were independent. The second summation indicates that a covariance term must be added for all pairs of Xs that have nonzero covariances. Actually, this equation is always valid, regardless of independence, because the covariance terms are all zero if the Xs are independent.

There are a number of special cases of Equation (4.20), which we list next.

Special Cases of Expected Value and Variance

- Sum of independent random variables. Here we assume the Xs are independent and the weights are all 1, that is, $Y = X_1 + X_2 + \cdots + X_n$. Then the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances:

$$E(Y) = E(X_1) + E(X_2) + \cdots + E(X_n)$$

$$\text{Var}(Y) = \text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_n)$$

- Difference between two independent random variables. Here we assume $X_1$ and $X_2$ are independent and the weights are $a_1 = 1$ and $a_2 = -1$, so that $Y$ can be written as $Y = X_1 - X_2$. Then the mean of the difference is the difference between means, but the variance of the difference is the sum of the variances (because $a_2^2 = (-1)^2 = 1$):

$$E(Y) = E(X_1) - E(X_2)$$

$$\text{Var}(Y) = \text{Var}(X_1) + \text{Var}(X_2)$$

- Sum of two dependent random variables. In this case we make no independence assumption and set the weights equal to 1, so that $Y = X_1 + X_2$. Then the mean of the sum is again the sum of the means, but the variance of the sum includes a covariance term:

$$E(Y) = E(X_1) + E(X_2)$$

$$\text{Var}(Y) = \text{Var}(X_1) + \text{Var}(X_2) + 2\,\text{Covar}(X_1, X_2)$$

- Difference between two dependent random variables. This is the same as the second case, except that the Xs are no longer independent. Again, the mean of the difference is the difference between means, but the variance of the difference now includes a covariance term, and because of the negative weight $a_2 = -1$, the sign of this covariance term is negative:

$$E(Y) = E(X_1) - E(X_2)$$

$$\text{Var}(Y) = \text{Var}(X_1) + \text{Var}(X_2) - 2\,\text{Covar}(X_1, X_2)$$

- Linear function of a random variable. Suppose that $Y$ can be written as $Y = a + bX$ for some constants $a$ and $b$. In this special case the random variable $Y$ is called a linear function of the random variable $X$. Then the mean, variance, and standard deviation of $Y$ can be calculated from the similar quantities for $X$ with the following formulas:

$$E(Y) = a + bE(X)$$

$$\text{Var}(Y) = b^2\,\text{Var}(X)$$

$$\text{Stdev}(Y) = |b|\,\text{Stdev}(X)$$

In particular, if $Y$ is a constant multiple of $X$ (that is, if $a = 0$), then the mean and standard deviation of $Y$ are the same multiple of the mean and standard deviation of $X$, provided the multiple is positive. Note that the absolute value of $b$ is used in the formula for $\text{Stdev}(Y)$. This is because $b$ could be negative, whereas a standard deviation cannot be negative.

We now put these concepts to use in an investment example.

EXAMPLE 4.6 DESCRIBING INVESTMENT PORTFOLIO RETURNS

An investor has $100,000 to invest, and she would like to invest it in a portfolio of eight stocks. She has gathered historical data on the returns of these stocks and has used the historical data to estimate means, standard deviations, and correlations for the stock returns. These summary measures appear in rows 12, 13, and 17 through 24 of Figure 4.15. (See the file Portfolio Analysis.xlsx.)

For example, the mean and standard deviation of stock 1 are 10.1% and 12.4%. These probably imply that the historical annual returns of stock 1 averaged 10.1% and the standard deviation of the annual returns was 12.4%, although they might not be based purely on historical data. Also, the correlation between the annual returns on stocks 1 and 2, for example, is 0.32 (see either cell C17 or B18, which necessarily contain the same value). This value, 0.32, probably indicates a moderate positive correlation between the historical annual returns of these stocks.

Margin note: In fact, all of the correlations are positive, which probably indicates that each stock tends to vary in the same direction as some underlying economic indicator. Of course, the diagonal correlations are all 1 because any stock return is perfectly correlated with itself.

Figure 4.15 Input Data for Investment Example

Calculating mean, variance, and stdev for a weighted sum of random variables

Assumptions:
1. Random variables are one-year returns from various stocks.
2. Weights are amounts invested in stocks.
3. Weighted sum is return from portfolio.

Range names used:

| Name | Refers to |
|---|---|
| Covariances | `=Model!$B$28:$I$35` |
| Means | `=Model!$B$12:$I$12` |
| Stdevs | `=Model!$B$13:$I$13` |
| Variance | `=Model!$B$39` |
| Weights | `=Model!$B$9:$I$9` |

Given quantities:

| Row | | Stock1 | Stock2 | Stock3 | Stock4 | Stock5 | Stock6 | Stock7 | Stock8 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Weights | $10,500 | $16,300 | $9,600 | $9,300 | $9,500 | $15,400 | $14,300 | $15,100 | $100,000 |
| 12 | Means | 10.1% | 7.3% | 11.8% | 9.9% | 11.8% | 9.1% | 9.6% | 12.3% | |
| 13 | Stdevs | 12.4% | 11.9% | 13.4% | 14.1% | 15.8% | 15.9% | 11.3% | 17.4% | |

Correlations between stock returns:

| Row | | Stock1 | Stock2 | Stock3 | Stock4 | Stock5 | Stock6 | Stock7 | Stock8 |
|---|---|---|---|---|---|---|---|---|---|
| 17 | Stock1 | 1.000 | 0.320 | 0.370 | 0.610 | 0.800 | 0.610 | 0.550 | 0.560 |
| 18 | Stock2 | 0.320 | 1.000 | 0.410 | 0.780 | 0.430 | 0.800 | 0.950 | 0.480 |
| 19 | Stock3 | 0.370 | 0.410 | 1.000 | 0.330 | 0.860 | 0.380 | 0.340 | 0.700 |
| 20 | Stock4 | 0.610 | 0.780 | 0.330 | 1.000 | 0.680 | 0.500 | 0.500 | 0.670 |
| 21 | Stock5 | 0.800 | 0.430 | 0.860 | 0.680 | 1.000 | 0.580 | 0.420 | 0.540 |
| 22 | Stock6 | 0.610 | 0.800 | 0.380 | 0.500 | 0.580 | 1.000 | 0.920 | 0.340 |
| 23 | Stock7 | 0.550 | 0.950 | 0.340 | 0.500 | 0.420 | 0.920 | 1.000 | 0.650 |
| 24 | Stock8 | 0.560 | 0.480 | 0.700 | 0.670 | 0.540 | 0.340 | 0.650 | 1.000 |

Although these summary measures have probably been obtained from historical data, the investor believes they are relevant for predicting future returns. Now she would like to analyze a portfolio of these stocks, using the investment amounts shown in row 9. What is the mean annual return from this portfolio? What are its variance and standard deviation?

Objective: To determine the mean annual return of the portfolio and to quantify the risk associated with the total dollar return from the given weighted sum of annual stock returns.

Solution

This is a typical weighted sum model. The random variables are the annual returns from the stocks, the weights are the dollar amounts invested in the stocks, and the summary measures of the random variables are given in rows 12, 13, and 17 through 24 of Figure 4.15. Be careful about units, however. Each $X_i$, expressed as a percentage, represents the return on a single dollar invested in stock $i$, whereas $Y$, the weighted sum of the Xs, represents the total dollar return. So a typical value of an $X$ might be 10.5%, whereas a typical value of $Y$ might be $10,500.

We can immediately apply Equation (4.18) to obtain the mean return from the portfolio. This appears in cell B38 of Figure 4.16, using the formula

`=SUMPRODUCT(Weights,Means)`

Figure 4.16 Calculations for Investment Example

Covariances between stock returns (variances of stock returns are on the diagonal):

| Row | | Stock1 | Stock2 | Stock3 | Stock4 | Stock5 | Stock6 | Stock7 | Stock8 |
|---|---|---|---|---|---|---|---|---|---|
| 28 | Stock1 | 0.0154 | 0.0047 | 0.0061 | 0.0107 | 0.0157 | 0.0120 | 0.0077 | 0.0121 |
| 29 | Stock2 | 0.0047 | 0.0142 | 0.0065 | 0.0131 | 0.0081 | 0.0151 | 0.0128 | 0.0099 |
| 30 | Stock3 | 0.0061 | 0.0065 | 0.0180 | 0.0062 | 0.0182 | 0.0081 | 0.0051 | 0.0163 |
| 31 | Stock4 | 0.0107 | 0.0131 | 0.0062 | 0.0199 | 0.0151 | 0.0112 | 0.0080 | 0.0164 |
| 32 | Stock5 | 0.0157 | 0.0081 | 0.0182 | 0.0151 | 0.0250 | 0.0146 | 0.0075 | 0.0148 |
| 33 | Stock6 | 0.0120 | 0.0151 | 0.0081 | 0.0112 | 0.0146 | 0.0253 | 0.0165 | 0.0094 |
| 34 | Stock7 | 0.0077 | 0.0128 | 0.0051 | 0.0080 | 0.0075 | 0.0165 | 0.0128 | 0.0128 |
| 35 | Stock8 | 0.0121 | 0.0099 | 0.0163 | 0.0164 | 0.0148 | 0.0094 | 0.0128 | 0.0303 |

Summary measures of portfolio:

| Row | Measure | Value |
|---|---|---|
| 38 | Mean | $10,056.40 |
| 39 | Variance | 124,992,021 |
| 40 | Stdev | $11,179.98 |

You cannot yet calculate the variance of the portfolio return. The reason is that the input data include standard deviations and correlations, not the variances and covariances required in Equation (4.20) (see footnote 7). But the variances and covariances are related to standard deviations and correlations by

$$\text{Var}(X_i) = \left(\text{Stdev}(X_i)\right)^2 \tag{4.21}$$

and

$$\text{Covar}(X_i, X_j) = \text{Stdev}(X_i) \times \text{Stdev}(X_j) \times \text{Correl}(X_i, X_j) \tag{4.22}$$

You can form a table of variances and covariances in the range B28:I35, using Equations (4.21) and (4.22) in one step, with a careful use of the HLOOKUP (horizontal lookup) function. To do so, highlight the range B28:I35, type the formula

`=HLOOKUP($A28,$B$11:$I$13,3,FALSE)*B17*HLOOKUP(B$27,$B$11:$I$13,3,FALSE)`

and press Ctrl+Enter. Be careful with relative and absolute addresses. Note how the HLOOKUP functions find the appropriate standard deviations from row 13 for use in the covariance formula. Each diagonal element of the covariance range is a variance, and the elements off the diagonal are covariances.

Now you can use Equation (4.20) to calculate the portfolio variance in cell B39. Although Equation (4.20) looks intimidating, it can be implemented fairly easily with Excel's matrix multiplication function, MMULT, and its TRANSPOSE function. To do so, enter the following formula in cell B39 and
then press Ctrl+Shift+Enter (all three keys at once):

`=MMULT(Weights,MMULT(Covariances,TRANSPOSE(Weights)))`

Footnote 7: This was intentional. It is often easier for an investor to assess standard deviations and correlations because they are more intuitive measures.

This formula, called an array formula, is somewhat advanced, but it is a very handy shortcut for implementing Equation (4.20). Section 14.8.3 provides more information about matrix multiplication and the MMULT function in general. If you are interested, you can read that section now. Finally, calculate the standard deviation of the portfolio return in cell B40 as the square root of the variance.

The results in Figure 4.16 indicate that the investor has an expected return of slightly more than $10,000 (or 10%) from this portfolio. However, the standard deviation of approximately $11,200 is sizable. This standard deviation is a measure of the portfolio's risk. Investors always want a large mean return, but they also want low risk. Moreover, they realize that the only way to obtain a higher mean return is usually to accept more risk. You can experiment with the spreadsheet for this example to see how the mean and standard deviation of portfolio return vary with the investment amounts. Just enter new weights in row 9 (keeping the sum equal to $100,000) and see how the values in B38 through B40 change.

PROBLEMS

Level A

40. A typical consumer buys a random number (X) of polo shirts when he shops at a men's clothing store. The distribution of X is given by the following probability distribution: P(X = 0) = 0.30, P(X = 1) = 0.30, P(X = 2) = 0.20, P(X = 3) = 0.10, and P(X = 4) = 0.10.
a. Find the mean and standard deviation of X.
b. Assuming that each shirt costs $35, let Y be the total amount of money (in dollars) spent by a customer when he visits this clothing store. Find the mean and standard deviation of Y.
c. Find the probability that a customer's expenditure will be more than one standard deviation above the mean expenditure level.

41. Based on past experience, the number of customers who arrive at a local gasoline station during the noon hour to purchase fuel is best described by the probability distribution given in the file P04_41.xlsx.
a. Find the mean, variance, and standard deviation of this random variable.
b. Find the probability that the number of arrivals during the noon hour will be within one standard deviation of the mean number of arrivals.
c. Suppose that the typical customer spends $28 on fuel upon stopping at this gasoline station during the noon hour. Find the mean and standard deviation of the total gasoline revenue earned by this gas station during the noon hour.
d. What is the probability that the total gasoline revenue will be less than the mean value found in part c?
e. What is the probability that the total gasoline revenue will be more than two standard deviations above the mean value found in part c?

42. Let X be the number of defective items found by a quality inspector in a random batch of 15 items from a particular manufacturing process. The probability distribution of X is provided in the file P04_18.xlsx. This firm earns $500 profit from the sale of each acceptable item in a given batch. In the event that an item is found to be defective, it must be reworked at a cost of $100 before it can be sold, thus reducing its per-unit profit to $400.
a. Find the mean and standard deviation of the profit earned from the sale of all items in a given batch.
b. What is the probability that the profit earned from the sale of all items in a given batch is within
two standard deviations of the mean profit level? Is this result consistent with the empirical rules from Chapter 2? Explain.

43. The probability distribution for the number of job applications processed at a small employment agency during a typical week is given in the file P04_43.xlsx.
a. Assuming that it takes the agency's administrative assistant two hours to process a submitted job application, on average how many hours in a typical week will the administrative assistant spend processing incoming job applications?
b. Find an interval with the property that the administrative assistant can be approximately 95% sure that the total amount of time he spends each week processing incoming job applications will be in this interval.

44. Consider a financial services salesperson whose annual salary consists of both a fixed portion of $25,000 and a variable portion that is a commission based on her sales performance. In particular, she estimates that her monthly sales commission can be represented by a random variable with mean $5,000 and standard deviation $700.
a. What annual salary can this salesperson expect to earn?
b. Assuming that her sales commissions in different months are independent random variables, what is the standard deviation of her annual salary?
c. Between what two annual salary levels can this salesperson be approximately 95% sure that her true total earnings will fall?

45. A film processing shop charges its customers 18 cents per print, but customers may refuse to accept one or more of the prints for various reasons. Assume that this shop does not
charge its customers for refused prints. The number of prints refused per 24-print roll is a random variable with mean 1.5 and standard deviation 0.5.
a. Find the mean and standard deviation of the amount that customers pay for the development of a typical 24-print roll.
b. Assume that this shop processes 250 24-print rolls of film per week. Assuming the numbers of refused prints on these rolls are independent random variables, find the mean and standard deviation of the weekly film processing revenue of this shop.
c. Find an interval such that the manager of this film shop can be approximately 95% sure that the weekly processing revenue will be contained within the interval.

46. Suppose the monthly demand for Thomson televisions has a mean of 40,000 and a standard deviation of 20,000. Find the mean and standard deviation of the annual demand for Thomson TVs. Assume that demands in different months are probabilistically independent. (Is this assumption realistic?)

47. Suppose there are five stocks available for investment, and each has an annual mean return of 10% and a standard deviation of 4%. Assume the returns on the stocks are independent random variables.
a. If you invest 20% of your money in each stock, find the mean and standard deviation of the annual dollar return on your investments.
b. If you invest 100% in a single stock, determine the mean and standard deviation of the annual return on your investment.
c. How do the answers to parts a and b relate to the phrase "Don't put all your eggs in one basket"?

48. An investor puts $10,000 into each of four stocks, labeled A, B, C, and D. The file P04_48.xlsx contains the means and standard deviations of the annual returns of these four stocks. Assuming that the returns of these four stocks are independent, find the mean and standard deviation of the total amount that this investor earns in one year from these four investments.

Level B

49. Consider again the investment problem described in the previous problem. Now assume that the returns of the four stocks are no longer
independent. Specifically, the correlations between all pairs of stock returns are given in the file P04_49.xlsx.
a. Find the mean and standard deviation of the total amount that this investor earns in one year from these four investments. Compare these results to those you found in the previous problem. Explain the differences in your answers.
b. Suppose that this investor now decides to place $15,000 each in stocks B and D and $5,000 each in stocks A and C. How do the mean and standard deviation of the total amount that this investor earns in one year change from the allocation used in part a? Provide an intuitive explanation for these changes.

50. A supermarket chain operates five stores of varying sizes in Bloomington, Indiana. Profits (represented as a percentage of sales volume) earned by these five stores are 2.75%, 3%, 3.5%, 4.25%, and 5%, respectively. The means and standard deviations of the daily sales volumes at these five stores are given in the file P04_50.xlsx. Assuming that the daily sales volumes are independent, find the mean and standard deviation of the total profit that this supermarket chain earns in one day from the operation of its five stores in Bloomington.

51. A manufacturing company constructs a 1-cm assembly by snapping together four parts that average 0.25 cm in length. The company would like the standard deviation of the length of the assembly to be 0.01 cm. Its engineer, Peter Purdue, believes that the assembly will meet the desired level of variability if each part has standard deviation 0.01/4 = 0.0025 cm. Instead, show Peter that you can do the job by making each part have standard deviation $0.01/\sqrt{4} = 0.005$ cm. This could save the company a lot of money because not as much precision is needed for each part.

52. The weekly demand function for one of a given firm's products can be represented by q = 200 − 5p, where q is the number of units purchased (in hundreds) at price p (in dollars). Assume that the price of the product will be an integer value from $10 to $15 with probabilities 0.10, 0.15, 0.25, 0.30, 0.15, and 0.05.
a. Find the mean and
standard deviation of p.
b. Find the mean and standard deviation of q.
c. Assuming that it costs this firm $10 to manufacture and sell each unit of the product, express the firm's weekly contribution to profit from the sale of this product (measured in dollars) as a function of the quantity purchased.
d. Find the mean and standard deviation of the weekly contribution to the firm's profit from the sale of this product.

53. A retailer purchases a batch of 1000 fluorescent lightbulbs from a wholesaler at a cost of $2 per bulb. The wholesaler agrees to replace each defective bulb with one that is guaranteed to function properly for a charge of $0.20 per bulb. The retailer sells the bulbs at a price of $2.50 per bulb and gives his customers free replacements if they bring defective bulbs back to the store. Let X be the number of defective bulbs in a typical batch, and assume that the mean and standard deviation of X are 50 and 10, respectively.
a. Find the mean and standard deviation of the profit (in dollars) the retailer makes from selling a batch of lightbulbs.
b. Find an interval with the property that the retailer can be approximately 95% sure that his profit will be in this interval.

4.9 CONCLUSION

This chapter has introduced some very important concepts, including the basic rules of probability, random variables, probability distributions, and summary measures of probability distributions. We have also shown how computer simulation can be used to help explain some of these concepts. Many of the concepts presented in this chapter
will be used in later chapters, so it is important to learn them now. In particular, we rely heavily on probability distributions in Chapter 6 when we discuss decision making under uncertainty. There you will learn how the expected value of a probability distribution is the primary criterion for making decisions. We will also continue to use computer simulation in later chapters to help explain difficult statistical concepts.

Summary of Key Terms

| Term | Explanation | Excel | Equation |
|---|---|---|---|
| Random variable | Associates a numerical value with each possible outcome in a situation involving uncertainty | | |
| Probability | A number between 0 and 1 that measures the likelihood that some event will occur | | |
| Rule of complements | The probability of any event and the probability of its complement sum to 1 | Basic formulas | 4.1 |
| Mutually exclusive events | Events where only one of them can occur | | |
| Exhaustive events | Events where at least one of them must occur | | |
| Addition rule for mutually exclusive events | The probability that at least one of a set of mutually exclusive events will occur is the sum of their probabilities | Basic formulas | 4.2 |
| Conditional probability formula | Updates the probability of an event, given the knowledge that another event has occurred | Basic formulas | 4.3 |
| Multiplication rule | Formula for the probability that two events both occur | Basic formulas | 4.4 |
| Probability tree | A graphical representation of how events occur through time, useful for calculating probabilities | | |
| Probabilistic independence | Events where knowledge that one of them has occurred is of no value in assessing the probability that the other will occur | | 4.5 |
| Relative frequency | The proportion of times the event occurs out of the number of times a random experiment is performed | | |
| Cumulative probability | "Less than or equal to" probabilities associated with a random variable | | |
| Mean (or expected value) of a probability distribution | A measure of central tendency: the weighted sum of the possible values, weighted by their probabilities | Basic formulas | 4.6 |
| Variance of a probability distribution | A measure of variability: the weighted sum of the squared deviations of the possible values from the mean, weighted by the probabilities | Basic formulas | 4.7, 4.9 |
| Standard deviation of a probability distribution | A measure of variability: the square root of the variance | Basic formulas | 4.8 |
| Simulation | An extremely useful tool that can be used to incorporate uncertainty explicitly into spreadsheet models | | |
| Uniformly distributed random numbers | Random numbers such that all decimal values between 0 and 1 are equally likely | `=RAND()` | |
| Uniformly distributed random integers | Random integers such that all integers between two given values are equally likely | `=RANDBETWEEN(1,6)` (for example) | |
| Covariance between two random variables | A measure of the relationship between two jointly distributed random variables | Basic formulas | 4.12, 4.14 |
| Correlation between two random variables | A measure of the relationship between two jointly distributed random variables, scaled to be between −1 and 1 | Basic formulas | 4.13 |
| Multiplication rule for random variables | Formula for a joint probability as the product of a marginal probability and a conditional probability | Basic formulas | 4.15, 4.16 |
| Independent random variables | Random variables where information about one of them is of no value in terms of predicting the others | | |
| Multiplication rule for independent random variables | The joint probability is the product of the marginal probabilities | Basic formulas | 4.17 |
| Expected value of a weighted sum of random variables | Useful for finding the expected value of Y, where Y = a1X1 + a2X2 + ... + anXn | Basic formulas | 4.18 |
| Variance of a weighted sum of independent random variables | Useful for finding the variance of Y, where Y = a1X1 + a2X2 + ... + anXn and the Xs are independent | Basic formulas | 4.19 |
| Variance of a weighted sum of dependent random variables | Useful for finding the variance of Y, where Y = a1X1 + a2X2 + ... + anXn and the Xs are not independent | Basic formulas | 4.20 |
| Covariance in terms of standard deviations and correlation | Used to calculate covariance when only information on correlations and standard deviations is given | Basic formulas | 4.22 |

PROBLEMS

Conceptual Questions

C.1. Suppose that you want to find the probability that event A or event B will occur. If these two events are not mutually exclusive, explain how you would proceed.

C.2. "If two events are mutually exclusive, they must not be independent events." Is this statement true or false? Explain your choice.

C.3. Is the number of passengers who show up for a particular commercial airline flight a discrete or a continuous random variable? Is the time between flight arrivals at a major airport a discrete or a continuous random variable? Explain your answers.

C.4. Suppose that officials in the federal government are trying to determine the likelihood of a major smallpox epidemic in the United States
within the next 12 months. Is this an example of an objective probability or a subjective probability? How might the officials assess this probability?

C.5. What is another term for the covariance between a random variable and itself? If this variable is measured in dollars, what are the units of this covariance?

C.6. Consider the statement, "When there are a finite number of outcomes, then all probability is just a matter of counting. Specifically, if n of the outcomes are favorable to some event E, and there are N outcomes total, then the probability of E is n/N." Is this statement always true? Is it always false?

C.7. "If there is uncertainty about some monetary outcome and you are concerned about return and risk, then all you need to see are the mean and standard deviation. The entire distribution provides no extra useful information." Do you agree or disagree? Provide an example to back up your argument.

C.8. Choose at least one uncertain quantity of interest to you. For example, you might choose the highest price of gas between now and the end of the year, the highest point the Dow Jones Industrial Average will reach between now and the end of the year, the number of majors Tiger Woods will win in his career, and so on. Using all of the information and insight you have, assess the probability distribution of this uncertain quantity. Is there one "right answer"?

C.9. Historically, the most popular measure of variability has been the standard deviation, the square root of the weighted sum of squared deviations from the mean, weighted by their probabilities. Suppose analysts had always used an alternative measure of variability, the weighted sum of the absolute deviations from the mean, again weighted by their probabilities. Do you think this would have made a big difference in the theory and practice of probability and statistics?

C.10. Suppose a person flips a coin, but before you can see the result, the person puts her hand over the coin. At this point, does
it make sense to talk about the probability that the result is heads? Is this any different from the probability of heads before the coin was flipped?

C.11. Consider an event that will either occur or not. For example, the event might be that California will experience a major earthquake in the next five years. You let p be the probability that the event will occur. Does it make any sense to have a probability distribution of p? Why or why not? If so, what might this distribution look like? How would you interpret it?

C.12. Suppose a couple is planning to have two children. Let B1 be the event that the first child is a boy, and let B2 be the event that the second child is a boy. You and your friend get into an argument about whether B1 and B2 are independent events. You think they are independent and your friend thinks they aren't. Which of you is correct? How could you settle the argument?

Level A

54. A business manager who needs to make many phone calls has estimated that when she calls a client, the probability that she will reach the client right away is 60%. If she does not reach the client on the first call, the probability that she will reach the client with a subsequent call in the next hour is 20%.
a. Find the probability that the manager reaches her client in two or fewer calls.
b. Find the probability that the manager reaches her client on the second call but not on the first call.
c. Find the probability that the manager is unsuccessful on two consecutive calls.

55. Suppose that a marketing research firm sends questionnaires to two different companies. Based on
historical evidence, the marketing research firm believes that each company, independently of the other, will return the questionnaire with probability 0.40.
a. What is the probability that both questionnaires are returned?
b. What is the probability that neither of the questionnaires is returned?
c. Now suppose that this marketing research firm sends questionnaires to ten different companies. Assuming that each company, independently of the others, returns its completed questionnaire with probability 0.40, how do your answers to parts a and b change?

56. Based on past sales experience, an appliance store stocks five window air conditioner units for the coming week. No orders for additional air conditioners will be made until next week. The weekly consumer demand for this type of appliance has the probability distribution given in the file P04_56.xlsx.
a. Let X be the number of window air conditioner units left at the end of the week (if any), and let Y be the number of special stockout orders required (if any), assuming that a special stockout order is required each time there is a demand and no unit is available in stock. Find the probability distributions of X and Y.
b. Find the expected value of X and the expected value of Y.
c. Assume that this appliance store makes a $60 profit on each air conditioner sold from the weekly available stock, but the store loses $20 for each unit sold on a special stockout order basis. Let Z be the profit that the store earns in the coming week from the sale of window air conditioners. Find the probability distribution of Z.
d. Find the expected value of Z.

57. Simulate 1000 weekly consumer demands for window air conditioner units with the probability distribution given in the file P04_56.xlsx. How does your simulated distribution compare to the given probability distribution? Explain any differences between these two distributions.

58. The probability distribution of the weekly demand for copier paper (in hundreds of reams) used in the duplicating center of a corporation is provided in the file
P04_58.xlsx. a. Find the mean and standard deviation of this distribution. b. Find the probability that weekly copier paper demand is at least one standard deviation above the mean. c. Find the probability that weekly copier paper demand is within one standard deviation of the mean.

59. Consider the probability distribution of the weekly demand for copier paper (in hundreds of reams) used in a corporation's duplicating center, as shown in the file P04_58.xlsx. a. Use simulation to generate 500 values of this random variable. b. Find the mean and standard deviation of the simulated values. c. Use your simulated values to estimate the probability that weekly copier paper demand is within one standard deviation of the mean. Why is this only an estimate, not an exact value?

60. The probability distribution of the weekly demand for copier paper (in hundreds of reams) used in the duplicating center of a corporation is provided in the file P04_58.xlsx. Assuming that it costs the duplicating center $5 to purchase a ream of paper, find the mean and standard deviation of the weekly copier paper cost for this corporation.

61. The instructor of an introductory organizational behavior course believes that there might be a relationship between the number of writing assignments (X) she gives in the course and the final grades (Y) earned by students in this class. She has taught this course with varying numbers of writing assignments for many semesters and has compiled relevant historical data in the file P04_61.xlsx. a. Convert the given frequency table to a table of conditional probabilities of final grades (Y) earned by students in this class, given the number of writing assignments (X) in the course. Comment on the table of conditional probabilities. Generally speaking, what does this table tell you? b. Given that this instructor requires only one writing assignment in the course, what is the expected final grade earned by the typical student? c. How much variability exists around the conditional mean grade you found in part b? Also, what
proportion of all relevant students earn final grades within two standard deviations of this conditional mean? d. Given that this instructor gives more than one writing assignment in the course, what is the expected final grade earned by the typical student? e. How much variability exists around the conditional mean grade you found in part d? What proportion of all relevant students earn final grades within two standard deviations of this conditional mean? f. Find the covariance and correlation between X and Y. What does each of these measures tell you? In particular, is this instructor correct in believing that there is a systematic relationship between the number of writing assignments made and final grades earned in her classes?

62. The file P04_62.xlsx contains the joint probability distribution of recent weekly trends of two particular stock prices, P1 and P2. a. Are P1 and P2 independent random variables? Explain why or why not. b. If P1 and P2 are not independent random variables, which joint probabilities result in the same marginal probabilities for P1 and P2 as given in the file but make P1 and P2 independent?

63. Consider two service elevators used in parallel by employees of a three-story hotel building. At any point in time when both elevators are stationary, let X1 and X2 be the floor numbers at which elevators 1 and 2, respectively, are currently located. The file P04_33.xlsx contains the joint probability distribution of X1 and X2. a. Are X1 and X2 independent random variables? Explain your answer. b. If X1 and X2 are not independent
random variables, which joint probabilities result in the same marginal probabilities for X1 and X2 as given in this file but make X1 and X2 independent?

64. A roulette wheel contains the numbers 0, 00, and 1 to 36. If you bet $1 on a single number coming up, you earn $35 if the number comes up and lose $1 otherwise. Find the mean and standard deviation of your winnings on a single bet. Then find the mean and standard deviation of your net winnings if you make 100 bets. You can assume (realistically) that the results of the 100 spins are independent. Finally, provide an interval such that you are 95% sure your net winnings from 100 bets will be inside this interval.

65. Assume that there are four equally likely states of the economy: boom, low growth, recession, and depression. Also, assume that the percentage annual return you obtain when you invest a dollar in gold or the stock market is shown in the file P04_65.xlsx. a. Find the covariance and correlation between the annual return on the market and the annual return on gold. Interpret your answers. b. Suppose you invest 40% of your available money in the market and 60% of your money in gold. Determine the mean and standard deviation of the annual return on your portfolio. c. Obtain your part b answer by determining the actual return on your portfolio in each state of the economy and then finding the mean and variance directly, without using any formulas involving covariances or correlations. d. Suppose you invested 70% of your money in the market and 30% in gold. Without doing any calculations, determine whether the mean and standard deviation of your portfolio would increase or decrease from your answer in part b. Give an intuitive explanation to support your answers.

66. Suppose there are three states of the economy: boom, moderate growth, and recession. The annual return on Honda and Toyota stock in each state of the economy is shown in the file P04_66.xlsx. a. Calculate the mean and standard deviation of the annual
return on each stock, assuming the probability of each state is 1/3. b. Calculate the mean and standard deviation of the annual return on each stock, assuming the probabilities of the three states are 1/4, 1/4, and 1/2. c. Calculate the covariance and correlation between the annual returns of the two companies' stocks, assuming the probability of each state is 1/3. d. Calculate the covariance and correlation between the annual returns of the two companies' stocks, assuming the probabilities of the three states are 1/4, 1/4, and 1/2. e. You have invested 25% of your money in Honda and 75% in Toyota. Assuming that each state is equally likely, find the mean and variance of your portfolio's return. f. Now check your answer to part e by directly calculating the return on your portfolio for each state and use the formulas for mean and variance of a random variable. For example, in the boom state your portfolio earns 0.25(0.25) + 0.75(0.32).

67. Each year the employees at Zipco receive a $0, $2000, or $4500 salary increase. They also receive a merit rating of 0, 1, 2, or 3, with 3 indicating outstanding performance and 0 indicating poor performance. The joint probability distribution of salary increase and merit rating is listed in the file P04_67.xlsx. For example, 20% of all employees receive a $2000 increase and have a merit rating of 1. Find the correlation between salary increase and merit rating. Then interpret this correlation.

68. The return on a portfolio during a period is defined by (PV_end - PV_beg)/PV_beg, where PV_beg is the portfolio value at the beginning of a period and PV_end is the portfolio value at the end of the period. Suppose there are two stocks in which you can invest, stock 1 and stock 2. During each year there is a 50% chance that each dollar invested in stock 1 will turn into $2 and a 50% chance that each dollar invested in stock 1 will turn into $0.50. During each year there is
also a 50% chance that each dollar invested in stock 2 will turn into $2 and a 50% chance that each dollar invested in stock 2 will turn into $0.50. a. If you invest all your money in stock 1, find the expected value and standard deviation of your one-year return. b. Assume the returns on stocks 1 and 2 are independent random variables. If you put half your money into each stock, find the expected value and standard deviation of your one-year return. c. Can you give an intuitive explanation of why the standard deviation in part b is smaller than the standard deviation in part a? d. Use simulation to check your answers to part b. Use at least 1000 trials.

69. You are involved in a risky business venture where three outcomes are possible: (1) you will lose not only your initial investment ($5000) but an additional $3000; (2) you will just make back your initial investment (for a net gain of $0); or (3) you will make back your initial investment plus an extra $10,000. The probability of (1) is half as large as the probability of (2), and the probability of (3) is one-third as large as the probability of (2). a. Find the individual probabilities of (1), (2), and (3). (They should sum to 1.) b. Find the expected value and standard deviation of your net gain (or loss) from this venture.

70. Suppose X and Y are independent random variables. The possible values of X are -1, 0, and 1; the possible values of Y are 10, 20, and 30. You are given that P(X = 1 and Y = 10) = 0.05, P(X = 0 and Y = 30) = 0.20, P(Y = 10) = 0.20, and P(X = 0) = 0.50. Determine the joint probability distribution of X and Y.

Level B

71. Equation (4.7) for variance indicates exactly what variance is: the weighted average of squared deviations from the mean, weighted by the probabilities. However, the computing formula for variance, Equation (4.9), is more
convenient for spreadsheet calculations. Show algebraically that the two formulas are equivalent.

72. Equation (4.10) for covariance indicates exactly what covariance is: the weighted average of products of deviations from the means, weighted by the joint probabilities. However, the computing formula for covariance, Equation (4.12), is more convenient for spreadsheet calculations. Show algebraically that the two formulas are equivalent.

73. The basic game of craps works as follows. You throw two dice. If the sum of the two faces showing up is 7 or 11, you win and the game is over. If the sum is 2, 3, or 12, you lose and the game is over. If the sum is anything else (4, 5, 6, 8, 9, or 10), that value becomes your point. You then keep throwing the dice until the sum matches your point or equals 7. If your point occurs first, you win and the game is over. If 7 occurs first, you lose and the game is over. What is the probability that you win the game?

74. Imagine that you are trying to predict the price of gasoline (regular unleaded) and the price of natural gas for home heating during the next month. Assume you believe that the price of either will stay the same, go up by 5%, or go down by 5%. Assess the joint probabilities of these possibilities, that is, assess nine probabilities that sum to 1 and are realistic. Do you believe it is easier to assess the marginal probabilities of one and the conditional probabilities of the other, or to assess the joint probabilities directly? (Note: There is no correct answer, but there are unreasonable answers, those that do not reflect reality.)

75. Consider an individual selected at random from a sample of 750 married women (see the data in the file P04_05.xlsx) in answering each of the following questions. a. What is the probability that this woman does not work outside the home, given that she has at least one child? b. What is the probability that this woman has no children, given that she works part time? c. What is the probability that this woman has at least two children, given that she does not
work full time?

76. Suppose that 8% of all managers in a given company are African American, 13% are women, and 17% have earned an MBA degree from a top-10 graduate business school. Let A, B, and C be, respectively, the events that a randomly selected individual from this population is African American, is a woman, and has earned an MBA from a top-10 graduate business school. a. Do you believe that A, B, and C are independent events? Explain why or why not. b. Assuming that A, B, and C are independent events, find the probability that a randomly selected manager from this company is a white male and has earned an MBA degree from a top-10 graduate business school. c. If A, B, and C are not independent events, can you calculate the probability requested in part b from the information given? What further information would you need?

77. Consider again the supermarket chain described in Problem 50. Now assume that the daily sales of the five stores are no longer independent of one another. In particular, the file P04_77.xlsx contains the correlations between all pairs of daily sales volumes. a. Find the mean and standard deviation of the total profit that this supermarket chain earns in one day from the operation of its five stores in Bloomington. Compare these results to those you found in Problem 50. Explain the differences in your answers. b. Find an interval such that the regional sales manager of this supermarket chain can be approximately 95% sure that the total daily profit earned by its stores in Bloomington will be contained within the interval.

78. A manufacturing
plant produces two distinct products, A and B. The cost of producing one unit of A is $18 and that of B is $22. Assume that this plant incurs a weekly setup cost of $24,000 regardless of the number of units of A or B produced. The means and standard deviations of the weekly production levels of A and B are given in the file P04_78.xlsx. a. Assuming that the weekly production levels of A and B are independent, find the mean and standard deviation of this plant's total weekly production cost. Between which two total cost figures can you be about 68% sure that this plant's actual total weekly production cost will fall? b. How do your answers in part a change if you discover that the correlation between the weekly production levels of A and B is actually 0.29? Explain the differences in the two sets of results.

79. The typical standard deviation of the annual return on a stock is 20% and the typical mean return is about 12%. The typical correlation between the annual returns of two stocks is about 0.25. Mutual funds often put an equal percentage of their money in a given number of stocks. By choosing a large number of stocks, they hope to diversify away the risk involved with choosing particular stocks. How many stocks does an investor need to own to diversify away the risk associated with individual stocks? To answer this question, use the above information about typical stocks to determine the mean and standard deviation for the following portfolios:
- Portfolio 1: Half your money in each of 2 stocks
- Portfolio 2: 20% of your money in each of 5 stocks
- Portfolio 3: 10% of your money in each of 10 stocks
- Portfolio 4: 5% of your money in each of 20 stocks
- Portfolio 5: 1% of your money in each of 100 stocks
What do your answers tell you about the number of stocks a mutual fund needs to invest in to diversify adequately?

80. You are ordering milk for Mr. D's supermarket, and you are determined to please. Milk is delivered once a week, at midnight Sunday. The mean and standard deviation of the number of gallons of milk
demanded each day are given in the file P04_80.xlsx. Find the mean and standard deviation of the weekly demand for milk. What assumption must you make to determine the weekly standard deviation? Presently, you are ordering 1000 gallons per week. Is this a sensible order quantity? Assume all milk spoils after one week.

81. Two gamblers play a version of roulette with a wheel as shown in the file P04_81.xlsx. Each gambler places four bets, but their strategies are different, as explained below. For each gambler, use the rules of probability to find the distribution of their net winnings after four bets. Then find the mean and standard deviation of their net winnings. The file gets you started. a. Player 1 always bets on red. On each bet, he either wins or loses what he bets. His first bet is for $10. From then on, he bets $10 following a win, and he doubles his bet after a loss. (This is called a martingale strategy and is used frequently at casinos.) For example, if he spins red, red, not red, and not red, his bets are for $10, $10, $10, and $20, and he has a net loss of $10. Or if he spins not red, not red, not red, and red, then his bets are for $10, $20, $40, and $80, and he has a net gain of $10. b. Player 2 always bets on black and green. On each bet, he places $10 on black and $2 on green. If red occurs, he loses all $12. If black occurs, he wins a net $8 ($10 gain on black, $2 loss on green). If green occurs, he wins a net $50 ($10 loss on black, $60 gain on green).

82. Suppose the New York Yankees and Philadelphia Phillies, two Major League Baseball
teams, are playing a best-of-three series. The first team to win two games is the winner of the series, and the series ends as soon as one team has won two games. The first game is played in New York, the second game is in Philadelphia, and if necessary the third game is in New York. The probability that the Yankees win a game in their home park is 0.55. The probability that the Phillies win a game in their home park is 0.53. You can assume that the outcomes of the games are independent. a. Find the probability that the Yankees win the series. b. Suppose you are a Yankees fan, so you place a bet on each game played where you win $100 if the Yankees win the game and you lose $105 if the Yankees lose the game. Find the distribution of your net winnings. Then find the mean and standard deviation of this distribution. Is this betting strategy favorable to you? c. Repeat part a, but assume that the games are played in Philadelphia, then New York, then Philadelphia. How much does this home field advantage help the Phillies? d. Repeat part a, but now assume that the series is a best-of-five series, where the first team that wins three games wins the series. Assume that games alternate between New York and Philadelphia, with the first game in New York.

83. The application at the beginning of this chapter describes the campaign McDonald's used several years ago, where customers could win various prizes. a. Verify the figures that are given in the description. That is, argue why there are 10 winning outcomes and 120 total outcomes. b. Suppose McDonald's had designed the cards so that each card had two zaps and three pictures of the winning prize (and again five pictures of other irrelevant prizes). The rules are the same as before: To win, the customer must uncover all three pictures of the winning prize before uncovering a zap. Would there be more or fewer winners with this design? Argue by calculating the probability that a card is a winner. c. Going back to the original game (as in part a), suppose McDonald's printed one million cards, each
of which was eventually given to a customer. Assume that the potential winning prizes on these were: 500,000 Cokes worth $0.40 each, 250,000 french fries worth $0.50 each, 150,000 milk shakes worth $0.75 each, 75,000 hamburgers worth $1.50 each, 20,000 cards with $1 cash as the winning prize, 4000 cards with $10 cash as the winning prize, 800 cards with $100 cash as the winning prize, and 200 cards with $1000 cash as the winning prize. Find the expected amount (the dollar equivalent) that McDonald's gave away in winning prizes, assuming everyone played the game and claimed the prize if they won. Also, find the standard deviation of this amount.

84. A manufacturing company is trying to decide whether to sign a contract with the government to deliver an instrument to the government no later than eight weeks from now. Due to various uncertainties, the company isn't sure when it will be able to deliver the instrument. Also, when the instrument is delivered, there is a chance that the government will judge it as being of inferior quality. The company estimates that the probability distribution of the time it takes to deliver the instrument is as given in the file P04_84.xlsx. Independently of this, it estimates that the probability of rejection due to inferior quality is 0.15. If the instrument is delivered at least a week ahead of time and the government judges the quality to be inferior, the company will have time to fix the problem (with certainty) and still meet the deadline. However, if the delivery is late, or if it is exactly on time but of inferior quality, the government won't pay up. The company expects its cost of manufacturing the instrument to be $45,000. This is a sunk cost that will be incurred regardless of timing or the quality of the instrument. The company also estimates that the cost to fix an inferior instrument depends on the number of weeks left to fix it: $7500 if there are three weeks left, $10,000 if there are two weeks left, and $15,000 if there is one week left. The government will pay $70,000 for an instrument of sufficient
quality delivered on time, but it will pay nothing otherwise. Find the distribution of profit or loss to the company. Then find the mean and standard deviation of this distribution. Do you think the company should sign the contract?

85. Have you ever watched the odds at a horse race? You might hear that the odds against a given horse winning are 9 to 1, meaning that the horse has a probability 1/(1 + 9) = 1/10 of winning. However, these odds, after being converted to probabilities, typically add to something greater than one. Why is this? Suppose you place a bet of $10 on this horse. It seems that it is a fair bet if you lose your $10 if the horse loses, but you win $90 if the horse wins. However, argue why this isn't really fair to you, that is, argue why your expected winnings are negative.

CASE 4.1 SIMPSON'S PARADOX

The results we obtain with conditional probabilities can be quite counterintuitive, even paradoxical. This case is similar to one described in an article by Blyth (1972) and is usually referred to as Simpson's paradox. (Two other examples of Simpson's paradox are described in articles by Westbrooke (1998) and Appleton et al. (1996).) Essentially, Simpson's paradox says that even if one treatment has a better effect than another on each of two separate subpopulations, it can have a worse effect on the population as a whole.

Suppose that the population is the set of managers in a large company. We categorize the managers as those with an MBA degree (the B's) and those without an MBA degree (the $\bar{B}$'s). These categories are the two treatment groups. We also categorize the
managers as those who were hired directly out of school by this company (the C's) and those who worked with another company first (the $\bar{C}$'s). These two categories form the two subpopulations. Finally, we use as a measure of effectiveness those managers who have been promoted within the past year (the A's). Assume the following conditional probabilities are given:

$$P(A \mid B \text{ and } C) = 0.10, \quad P(A \mid \bar{B} \text{ and } C) = 0.05 \qquad (4.23)$$

$$P(A \mid B \text{ and } \bar{C}) = 0.35, \quad P(A \mid \bar{B} \text{ and } \bar{C}) = 0.20 \qquad (4.24)$$

$$P(C \mid B) = 0.90, \quad P(C \mid \bar{B}) = 0.30 \qquad (4.25)$$

Each of these can be interpreted as a proportion. For example, the probability $P(A \mid B \text{ and } C)$ implies that 10% of all managers who have an MBA degree and were hired by the company directly out of school were promoted last year. Similar explanations hold for the other probabilities.

Joan Seymour, the head of personnel at this company, is trying to understand these figures. From the probabilities in Equation (4.23), she sees that among the subpopulation of workers hired directly out of school, those with an MBA degree are twice as likely to be promoted as those without an MBA degree. Similarly, from the probabilities in Equation (4.24), she sees that among the subpopulation of workers hired after working with another company, those with an MBA degree are almost twice as likely to be promoted as those without an MBA degree. The information provided by the probabilities in Equation (4.25) is somewhat different. From these, she sees that employees with MBA degrees are three times as likely as those without MBA degrees to have been hired directly out of school.

Joan can hardly believe it when a
whiz-kid analyst uses these probabilities to show, correctly, that

$$P(A \mid B) = 0.125, \quad P(A \mid \bar{B}) = 0.155 \qquad (4.26)$$

In words, those employees without MBA degrees are more likely to be promoted than those with MBA degrees. This appears to go directly against the evidence in Equations (4.23) and (4.24), both of which imply that MBAs have an advantage in being promoted. Can you derive the probabilities in Equation (4.26)? Can you shed any light on this paradox?

CHAPTER 5
Normal, Binomial, Poisson, and Exponential Distributions

CHALLENGING CLAIMS OF THE BELL CURVE

One of the most controversial books in recent years is The Bell Curve (The Free Press, 1994). The authors are the late Richard Herrnstein, a psychologist, and Charles Murray, an economist, both of whom had extensive training in statistics. The book is a scholarly treatment of differences in intelligence, measured by IQ, and their effect on socioeconomic status (SES). The authors argue, by appealing to many past studies and presenting many statistics and graphs, that there are significant differences in IQ among different groups of people, and that these differences are at least partially responsible for differences in SES. Specifically, their basic claims are that (1) there is a quantity, intelligence, that can be measured by an IQ test; (2) the distribution of IQ scores is essentially a symmetric bell-shaped curve; (3) IQ scores are highly correlated with various indicators of success; (4) IQ is determined predominantly by genetic factors and less so by environmental factors; and (5) African Americans score significantly lower, about 15
points lower, on IQ than whites. Although the discussion of this latter point takes up a relatively small part of the book, it has generated by far the most controversy.

Many criticisms of the authors' racial thesis have been based on emotional arguments. However, it can also be criticized on entirely statistical grounds, as Barnett (1995) has done.¹ Barnett never states that the analysis by Herrnstein and Murray is wrong. He merely states that (1) the assumptions behind some of the analysis are at best questionable, and (2) some of the crucial details are not made as explicit as they should have been. As he states, "The issue is not that The Bell Curve is demonstrably wrong, but that it falls so far short of being demonstrably right. The book does not meet the burden of proof we might reasonably expect of it."

For example, Barnett takes issue with the claim that the genetic component of IQ is, in the words of Herrnstein and Murray, "unlikely to be smaller than 40 percent or higher than 80 percent." Barnett asks what it would mean if genetics made up, say, 60% of IQ. His only clue from the book is in an endnote, which implies this definition: If a large population of genetically identical newborns grew up in randomly chosen environments, and their IQs were measured once they reached adulthood, then the variance of these IQs would be 60% less than the variance for the entire population. The key word is variance. As Barnett notes, however, this statement implies that the corresponding drop in standard deviation is only 37%. That is, even if all members of the population were exactly the same genetically, differing environments would create a standard deviation of IQs 63% as large as the standard deviation that exists today. If this is true, it is hard to argue, as Herrnstein and Murray have done, that environment plays a minor role in determining IQ.

Because the effects of different racial environments are so difficult to disentangle from genetic effects, Herrnstein and Murray try at one point to bypass environmental influences on IQ
by matching blacks and whites from similar environments. They report that blacks in the top decile of SES have an average IQ of 104, but that whites within that decile have an IQ one standard deviation higher. Even assuming that they have their facts straight, Barnett criticizes the vagueness of their claim. What standard deviation are they referring to: the standard deviation of the entire population or the standard deviation of only the people in the upper decile of SES? (The latter is certainly much smaller than the former.) Should we assume that the top-decile blacks are in the top decile of the black population or of the overall population? If the latter, then the matched comparison between blacks and whites is flawed because the wealthiest 10% of whites have far more wealth than the wealthiest 10% of blacks. Moreover, even if the reference is to the pooled national population, the matching is imperfect. It is possible that the blacks in this pool could average around the ninth percentile, whereas the whites could average around the fourth percentile, with a significant difference in income between the two groups.

The problem is that Herrnstein and Murray never state these details explicitly. Therefore, we have no way of knowing, without collecting and analyzing all of the data ourselves, whether their results are essentially correct. As Barnett concludes his article, "I believe that The Bell Curve's statements about race would have been better left unsaid even if they were definitely true. And they are surely better left unsaid when, as we have seen, their meaning and accuracy are in doubt."

¹ Arnold Barnett is a professor in operations research at MIT's Sloan School of Management. He specializes in data analysis of health and safety issues.
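Barnett's point about variance versus standard deviation in the chapter opener comes down to a square root. The following is a minimal Python sketch of that arithmetic, not anything taken from the book or from Barnett's article: the function name `sd_fraction_remaining` is a hypothetical helper introduced only for illustration, and the 60% input is the "say, 60%" value used in the discussion.

```python
import math


def sd_fraction_remaining(variance_drop: float) -> float:
    """Fraction of the original standard deviation that remains after the
    variance falls by `variance_drop` (a proportion between 0 and 1).

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= variance_drop <= 1.0:
        raise ValueError("variance_drop must be between 0 and 1")
    # Standard deviation is the square root of variance, so a variance that is
    # (1 - variance_drop) times as large has a standard deviation that is
    # sqrt(1 - variance_drop) times as large.
    return math.sqrt(1.0 - variance_drop)


remaining = sd_fraction_remaining(0.60)          # variance is 60% smaller
print(f"SD remaining: {remaining:.1%}")          # roughly 63% of the original
print(f"SD drop:      {1.0 - remaining:.1%}")    # roughly 37%
```

Because the square root of a proportion below 1 is larger than the proportion itself, a given percentage drop in variance always corresponds to a smaller percentage drop in standard deviation.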
5.1 INTRODUCTION

In the previous chapter we discussed probability distributions in general. In this chapter we investigate several specific distributions that commonly occur in a variety of business applications. The first of these is a continuous distribution called the normal distribution. It is characterized by a symmetric bell-shaped curve and is the cornerstone of statistical theory. The second distribution is a discrete distribution called the binomial distribution. It is relevant when we sample from a population with only two types of members or when we perform a series of independent, identical experiments with only two possible outcomes. The other two distributions we will discuss briefly are the Poisson and exponential distributions. These are often used when we are counting events of some type through time, such as arrivals to a bank. In this case, the Poisson distribution, which is discrete, describes the number of arrivals in any period of time, and the exponential distribution, which is continuous, describes the times between arrivals.

The main goals in this chapter are to present the properties of these distributions, give some examples of when they apply, and show how to perform calculations involving them. Regarding this last objective, analysts have traditionally used special tables to find probabilities or values for the distributions in this chapter. However, these tasks have been simplified with the statistical functions available in Excel. Given the availability of these Excel functions, the traditional tables are no longer necessary.

We cannot overemphasize the importance of these distributions. Almost all of the statistical results discussed in later chapters are based on either the normal distribution or the binomial
distribution. The Poisson and exponential distributions play a less important role in this book, but they are nevertheless extremely important in many management science applications. Therefore, it is important for you to become familiar with these distributions before proceeding.

5.2 THE NORMAL DISTRIBUTION

The single most important distribution in statistics is the normal distribution. It is a continuous distribution and is the basis of the familiar symmetric bell-shaped curve. Any particular normal distribution is specified by its mean and standard deviation. Changing the mean shifts the normal curve to the right or left. Changing the standard deviation makes the curve more or less spread out. Therefore, there are really many normal distributions, not just a single one. We say that the normal distribution is a two-parameter family, where the two parameters are the mean and standard deviation.

5.2.1 Continuous Distributions and Density Functions

We first take a moment to discuss continuous probability distributions in general. In the previous chapter we discussed discrete distributions, characterized by a list of possible values and their probabilities. The same basic idea holds for continuous distributions such as the normal distribution, but the mathematics is more complex. Now, instead of a list of possible values, there is a continuum of possible values, such as all values between 0 and 100 or all values greater than 0. Instead of assigning probabilities to each individual value in the continuum, the total probability of 1 is spread over this continuum. The key to this spreading is called a density function, which acts like a histogram. The higher the value of the density function, the more likely this region of the continuum is.
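The idea that a density spreads a total probability of 1 over a continuum, with probabilities given by areas, can be checked numerically. The following is a minimal Python sketch and is not part of the original text. It assumes a normal density with mean 70 and standard deviation 15 purely as an example, takes the density from the standard library's `statistics.NormalDist`, and uses a simple trapezoid rule in a helper named `area_under` (a hypothetical name chosen for illustration).

```python
from statistics import NormalDist


def area_under(density, a, b, steps=10_000):
    """Approximate the area under `density` between a and b (trapezoid rule)."""
    width = (b - a) / steps
    total = 0.5 * (density(a) + density(b))
    for i in range(1, steps):
        total += density(a + i * width)
    return total * width


# Assumed example density: normal with mean 70 and standard deviation 15.
dist = NormalDist(mu=70, sigma=15)

# The total area over a wide range (mean +/- 8 standard deviations)
# should be essentially 1, the total probability.
total_area = area_under(dist.pdf, 70 - 8 * 15, 70 + 8 * 15)

# The probability of a value between 65 and 75 is the area over that interval.
p_65_75 = area_under(dist.pdf, 65, 75)

# Compare the numerical area with the exact left-tail difference.
exact = dist.cdf(75) - dist.cdf(65)

print(round(total_area, 4))
print(round(p_65_75, 4), round(exact, 4))
```

The trapezoid rule stands in for the integral calculus mentioned in the text; in practice a library function (or an Excel function) returns these areas directly.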
Density Function

A density function, usually denoted by f(x), specifies the probability distribution of a continuous random variable X. The higher f(x) is, the more likely x is. Also, the total area between the graph of f(x) and the horizontal axis, which represents the total probability, is equal to 1. Finally, f(x) is nonnegative for all possible values of X.

As an example, consider the density function shown in Figure 5.1. (This is not a normal density function.) It indicates that all values in the continuum from 25 to 100 are possible, but that the values near 70 are most likely. (This density function might correspond to scores on an exam.) More specifically, because the height of the density at 70 is approximately twice the height of the curve at 84 or 53, a value near 70 is approximately twice as likely as a value near 84 or a value near 53. In this sense, the height of the density function indicates relative likelihoods.

[Figure 5.1 A Skewed Density Function]

For continuous distributions, probabilities are areas under the density function. These probabilities can often be calculated with Excel functions.

[Figure 5.2 Probability as the Area Under the Density]

Probabilities are found from a density function as areas under the curve. For example, the area of the designated region in Figure 5.2 represents the probability of a score between 65 and 75. Also, the area under the entire curve is 1 because the total probability of all possible values is always 1. Unfortunately, this is about as much as we can say without calculus. Integral calculus is required to find areas under curves. Fortunately, statistical tables have been constructed to find such areas for a number of well-known
density functions, including the normal. Even better, Excel functions have been developed to find these areas without the need for bulky tables. We take advantage of these Excel functions in the rest of this chapter.

As in the previous chapter, the mean is a measure of central tendency of the distribution, and the standard deviation (or variance) measures the variability of the distribution. Again, however, calculus is generally required to calculate these quantities. We will simply list their values (which were obtained through calculus) for the normal distribution and any other continuous distributions where we need them. By the way, the mean for the nonnormal density in Figure 5.1 is slightly less than 70 (it is always to the left of the peak for a left-skewed distribution and to the right of the peak for a right-skewed distribution), and the standard deviation is approximately 15.

5.2.2 The Normal Density

The normal distribution is a continuous distribution with possible values ranging over the entire number line, from minus infinity to plus infinity. However, only a relatively small range has much chance of occurring. The normal density function is actually quite complex, in spite of its nice bell-shaped appearance. For the sake of completeness, we list the formula for the normal density function in Equation 5.1.
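As a rough illustration of how the normal density formula behaves, the sketch below codes the standard form of the normal density by hand and compares it with `statistics.NormalDist.pdf` from Python's standard library. The function name `normal_density` and the sample parameter values are assumptions made for illustration; they do not come from the text.

```python
import math
from statistics import NormalDist


def normal_density(x, mu, sigma):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)


# Compare the hand-coded formula with the library value at a few points.
mu, sigma = 100, 10  # illustrative values only
for x in (80, 100, 115):
    print(x, normal_density(x, mu, sigma), NormalDist(mu, sigma).pdf(x))

# A smaller standard deviation gives a taller, more peaked curve at the mean.
peaked = normal_density(0, 0, 0.5)
spread = normal_density(0, 0, 2)
print(peaked > spread)
```

The density value at a point is a height, not a probability; probabilities still come from areas under this curve.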
In Equation 5.1, μ and σ are the mean and standard deviation of the distribution.

Normal Density Function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/(2\sigma^2)} \quad \text{for } -\infty < x < \infty \tag{5.1}$$

The curves in Figure 5.3 illustrate several normal density functions for different values of μ and σ. The mean μ can be any number: negative, positive, or zero. As you can see, the effect of increasing or decreasing the mean μ is to shift the curve to the right or the left. On the other hand, the standard deviation σ must be a positive number. It controls the spread of the normal curve. When σ is small, the curve is more peaked; when σ is large, the curve is more spread out. For shorthand, we use the notation N(μ, σ) to refer to the normal distribution with mean μ and standard deviation σ. For example, N(−2, 1) refers to the normal distribution with mean −2 and standard deviation 1.

[Figure 5.3 Several Normal Density Functions]

FUNDAMENTAL INSIGHT

Why the Normal Distribution?

The normal density in Equation 5.1 is certainly not very intuitive, so why is the normal distribution the basis for so much of statistical theory? One reason is practical. Many histograms based on real data resemble the bell-shaped normal curve to a remarkable extent. Granted, not all histograms are symmetric and bell-shaped, but a surprising number are. Another reason is theoretical. In spite of the complexity of Equation 5.1, the normal distribution has many appealing properties that have enabled researchers to build the
rich statistical theory that finds widespread use in business, the sciences, and other fields.

5.2.3 Standardizing: Z-Values

There are infinitely many normal distributions, one for each pair μ and σ. We single out one of these for special attention, the standard normal distribution. The standard normal distribution has mean 0 and standard deviation 1, so we denote it by N(0, 1). It is also referred to as the Z distribution. Suppose the random variable X is normally distributed with mean μ and standard deviation σ. We define the random variable Z by Equation 5.2. This operation is called standardizing. That is, to standardize a variable, you subtract its mean and then divide the difference by the standard deviation. When X is normally distributed, the standardized variable is N(0, 1).

Standardizing a Normal Random Variable

$$Z = \frac{X - \mu}{\sigma} \tag{5.2}$$

One reason for standardizing is to measure variables with different means and/or standard deviations on a single scale. For example, suppose several sections of a college course are taught by different instructors. Because of differences in teaching methods and grading procedures, the distributions of scores in these sections might differ, possibly by a wide margin. However, if each instructor calculates his or her mean and standard deviation and then calculates a Z-value for each student, the distributions of the Z-values should be approximately the same in each section.

It is easy to interpret a Z-value. It is the number of standard deviations to the right or the left of the mean. If Z is positive, the original value (in this case, the original score) is to the right of the mean; if Z is negative, the original score is to the left of the mean. For example, if the Z-value for some student is 2, then this student's score is two standard deviations above the mean. If the Z-value for another student is −0.5, then this student's score is half a standard deviation below the mean. We illustrate Z-values in the following example.

EXAMPLE 5.1 STANDARDIZING RETURNS FROM MUTUAL FUNDS

The annual returns for 30 mutual funds appear in Figure 5.4. (See the file Standardizing.xlsx.) Find and interpret the Z-values of these returns.

Objective: To use Excel to standardize annual returns of various mutual funds.

Figure 5.4 Mutual Fund Returns and Z-Values

| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | Standardizing mutual fund returns | | | |
| 3 | Summary statistics from returns below | | | |
| 4 | Mean | 0.091 | 0.000 | 0.000 |
| 5 | Stdev | 0.047 | 1.000 | 1.000 |
| 7 | Fund | Annual return | Z value | Z value |
| 8 | 1 | 0.007 | −1.8047 | −1.8047 |
| 9 | 2 | 0.080 | −0.2363 | −0.2363 |
| 10 | 3 | 0.082 | −0.1934 | −0.1934 |
| 11 | 4 | 0.123 | 0.6875 | 0.6875 |
| 12 | 5 | 0.022 | −1.4824 | −1.4824 |
| 13 | 6 | 0.054 | −0.7949 | −0.7949 |
| ... | ... | ... | ... | ... |
| 32 | 25 | 0.088 | −0.0645 | −0.0645 |
| 33 | 26 | 0.077 | −0.3008 | −0.3008 |
| 34 | 27 | 0.125 | 0.7305 | 0.7305 |
| 35 | 28 | 0.094 | 0.0645 | 0.0645 |
| 36 | 29 | 0.078 | −0.2793 | −0.2793 |
| 37 | 30 | 0.066 | −0.5371 | −0.5371 |

Notes in the figure: the Z-values are calculated two different ways, the second with the STANDARDIZE function. Range names used: Mean = Data!$B$4, Stdev = Data!$B$5. Rows 14 through 31 (funds 7 through 24) are not shown.

Solution

The 30 annual returns appear in column B of Figure 5.4. Their mean and standard deviation are calculated in cells B4 and B5 with the AVERAGE and STDEV functions. The corresponding Z-values are calculated in column C by entering the formula

=(B8-Mean)/Stdev

in cell C8 and copying it down column C.

There is an equivalent way to calculate these Z-values in Excel. This is done in column D by using Excel's STANDARDIZE function directly. To use this function, enter the formula

=STANDARDIZE(B8,Mean,Stdev)

in cell D8 and copy it down column D.

The Z-values in Figure 5.4 range from a low of −1.80 to a high of 2.19. Specifically, the return for fund 1 is about 1.80 standard deviations below the mean,
whereas the return for fund 17 is about 2.19 standard deviations above the mean. As you will see shortly, these values are typical: Z-values are usually in the range from −2 to +2, and values beyond −3 or +3 are very uncommon. (Recall the empirical rules for interpreting standard deviation first discussed in Chapter 2.) Also, the Z-values automatically have mean 0 and standard deviation 1, as you can see in cells C4 and C5 by using the AVERAGE and STDEV functions on the Z-values in column C or D.

5.2.4 Normal Tables and Z-Values²

A common use for Z-values and the standard normal distribution is in calculating probabilities and percentiles by the traditional method. This method is based on a table of the standard normal distribution found in many statistics textbooks. Such a table is given in Figure 5.5. The body of the table contains probabilities. The left and top margins contain possible values. Specifically, suppose you want to find the probability that a standard normal random variable is less than 1.35. You locate 1.3 along the left and 0.05 (the second decimal in 1.35) along the top, and then read into the table to find the probability 0.9115. In words, the probability is about 0.91 that a standard normal random variable is less than 1.35.

Figure 5.5 Normal Probabilities

| z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.5000 | 0.5040 | 0.5080 | 0.5120 | 0.5160 | 0.5199 | 0.5239 | 0.5279 | 0.5319 | 0.5359 |
| 0.1 | 0.5398 | 0.5438 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753 |
| 0.2 | 0.5793 | 0.5832 | 0.5871 | 0.5910 | 0.5948 | 0.5987 | 0.6026 | 0.6064 | 0.6103 | 0.6141 |
| 0.3 | 0.6179 | 0.6217 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.6480 | 0.6517 |
| 0.4 | 0.6554 | 0.6591 | 0.6628 | 0.6664 | 0.6700 | 0.6736 | 0.6772 | 0.6808 | 0.6844 | 0.6879 |
| 0.5 | 0.6915 | 0.6950 | 0.6985 | 0.7019 | 0.7054 | 0.7088 | 0.7123 | 0.7157 | 0.7190 | 0.7224 |
| 0.6 | 0.7257 | 0.7291 | 0.7324 | 0.7357 | 0.7389 | 0.7422 | 0.7454 | 0.7486 | 0.7517 | 0.7549 |
| 0.7 | 0.7580 | 0.7611 | 0.7642 | 0.7673 | 0.7704 | 0.7734 | 0.7764 | 0.7794 | 0.7823 | 0.7852 |
| 0.8 | 0.7881 | 0.7910 | 0.7939 | 0.7967 | 0.7995 | 0.8023 | 0.8051 | 0.8078 | 0.8106 | 0.8133 |
| 0.9 | 0.8159 | 0.8186 | 0.8212 | 0.8238 | 0.8264 | 0.8289 | 0.8315 | 0.8340 | 0.8365 | 0.8389 |
| 1.0 | 0.8413 | 0.8438 | 0.8461 | 0.8485 | 0.8508 | 0.8531 | 0.8554 | 0.8577 | 0.8599 | 0.8621 |
| 1.1 | 0.8643 | 0.8665 | 0.8686 | 0.8708 | 0.8729 | 0.8749 | 0.8770 | 0.8790 | 0.8810 | 0.8830 |
| 1.2 | 0.8849 | 0.8869 | 0.8888 | 0.8907 | 0.8925 | 0.8944 | 0.8962 | 0.8980 | 0.8997 | 0.9015 |
| 1.3 | 0.9032 | 0.9049 | 0.9066 | 0.9082 | 0.9099 | 0.9115 | 0.9131 | 0.9147 | 0.9162 | 0.9177 |
| 1.4 | 0.9192 | 0.9207 | 0.9222 | 0.9236 | 0.9251 | 0.9265 | 0.9279 | 0.9292 | 0.9306 | 0.9319 |
| 1.5 | 0.9332 | 0.9345 | 0.9357 | 0.9370 | 0.9382 | 0.9394 | 0.9406 | 0.9418 | 0.9429 | 0.9441 |
| 1.6 | 0.9452 | 0.9463 | 0.9474 | 0.9484 | 0.9495 | 0.9505 | 0.9515 | 0.9525 | 0.9535 | 0.9545 |
| 1.7 | 0.9554 | 0.9564 | 0.9573 | 0.9582 | 0.9591 | 0.9599 | 0.9608 | 0.9616 | 0.9625 | 0.9633 |
| 1.8 | 0.9641 | 0.9649 | 0.9656 | 0.9664 | 0.9671 | 0.9678 | 0.9686 | 0.9693 | 0.9699 | 0.9706 |
| 1.9 | 0.9713 | 0.9719 | 0.9726 | 0.9732 | 0.9738 | 0.9744 | 0.9750 | 0.9756 | 0.9761 | 0.9767 |
| 2.0 | 0.9772 | 0.9778 | 0.9783 | 0.9788 | 0.9793 | 0.9798 | 0.9803 | 0.9808 | 0.9812 | 0.9817 |
| 2.1 | 0.9821 | 0.9826 | 0.9830 | 0.9834 | 0.9838 | 0.9842 | 0.9846 | 0.9850 | 0.9854 | 0.9857 |
| 2.2 | 0.9861 | 0.9864 | 0.9868 | 0.9871 | 0.9875 | 0.9878 | 0.9881 | 0.9884 | 0.9887 | 0.9890 |
| 2.3 | 0.9893 | 0.9896 | 0.9898 | 0.9901 | 0.9904 | 0.9906 | 0.9909 | 0.9911 | 0.9913 | 0.9916 |
| 2.4 | 0.9918 | 0.9920 | 0.9922 | 0.9925 | 0.9927 | 0.9929 | 0.9931 | 0.9932 | 0.9934 | 0.9936 |
| 2.5 | 0.9938 | 0.9940 | 0.9941 | 0.9943 | 0.9945 | 0.9946 | 0.9948 | 0.9949 | 0.9951 | 0.9952 |
| 2.6 | 0.9953 | 0.9955 | 0.9956 | 0.9957 | 0.9959 | 0.9960 | 0.9961 | 0.9962 | 0.9963 | 0.9964 |
| 2.7 | 0.9965 | 0.9966 | 0.9967 | 0.9968 | 0.9969 | 0.9970 | 0.9971 | 0.9972 | 0.9973 | 0.9974 |
| 2.8 | 0.9974 | 0.9975 | 0.9976 | 0.9977 | 0.9977 | 0.9978 | 0.9979 | 0.9979 | 0.9980 | 0.9981 |
| 2.9 | 0.9981 | 0.9982 | 0.9982 | 0.9983 | 0.9984 | 0.9984 | 0.9985 | 0.9985 | 0.9986 | 0.9986 |
| 3.0 | 0.9987 | 0.9987 | 0.9987 | 0.9988 | 0.9988 | 0.9989 | 0.9989 | 0.9989 | 0.9990 | 0.9990 |
| 3.1 | 0.9990 | 0.9991 | 0.9991 | 0.9991 | 0.9992 | 0.9992 | 0.9992 | 0.9992 | 0.9993 | 0.9993 |
| 3.2 | 0.9993 | 0.9993 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9995 | 0.9995 | 0.9995 |
| 3.3 | 0.9995 | 0.9995 | 0.9995 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9997 |
| 3.4 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9998 |

² If you intend to rely on Excel functions for normal calculations, you can skip this subsection.

Alternatively, if you are given a probability, you can use the table to find the value with this much probability to the left of it under the standard normal curve. This is called a percentile calculation. For example, if the probability is 0.75, you can find the 75th percentile by locating the probability in the table closest to 0.75 and then reading to the left and up. With interpolation, the required value is approximately 0.675. In words, the probability of being to the left of 0.675 under the standard normal curve is approximately 0.75.

You can perform the same kind of calculations for any normal distribution if you first standardize. As an example, suppose that X is normally distributed with mean 100 and standard deviation 10. We will find the probability that X is less than 115 and the 85th percentile of this normal distribution. To find the probability that X is less than 115, first standardize the value 115. The corresponding Z-value is

$$Z = \frac{115 - 100}{10} = 1.5$$

Now look up 1.5 in the table (1.5 row, 0.00 column) to obtain the probability 0.9332. For the percentile question, first find the 85th percentile of the standard normal distribution. Interpolating, a value of approximately 1.037 is obtained. Then set this value equal to a standardized value:

$$Z = 1.037 = \frac{X - 100}{10}$$

Finally, solve for X to obtain 110.37. In words, the probability of being to the left of 110.37 in the N(100, 10) distribution is about 0.85.

There are some
obvious drawbacks to using the standard normal table for probability calculations. The first is that there are holes in the table; interpolation is often necessary. A second drawback is that the standard normal table takes different forms in different textbooks. These differences are rather minor, but they can easily cause confusion. Finally, the table requires you to perform calculations. For example, you often need to standardize. More importantly, you often have to use the symmetry of the normal distribution to find probabilities that are not in the table. As an example, to find the probability that Z is less than −1.5, you must go through some mental gymnastics. First, by symmetry this is the same as the probability that Z is greater than 1.5. Then, because only left-tail ("less than") probabilities are tabulated, you must find the probability that Z is less than 1.5 and subtract this probability from 1. The chain of reasoning is

$$P(Z < -1.5) = P(Z > 1.5) = 1 - P(Z < 1.5) = 1 - 0.9332 = 0.0668$$

This is not too difficult, given a bit of practice, but it is easy to make a mistake. Excel functions make the whole procedure much easier and less error-prone.

5.2.5 Normal Calculations in Excel

Two types of calculations are typically made with normal distributions: finding probabilities and finding percentiles. Excel makes each of these fairly simple. The functions used for normal probability calculations are NORMDIST and NORMSDIST. The main difference between these is that the one with the "S" (for standardized) applies only to N(0, 1) calculations, whereas NORMDIST applies to any normal distribution. On the other hand, percentile calculations that take a probability and return a value are often called inverse calculations. Therefore, the Excel functions for these are named NORMINV and NORMSINV. Again, the "S" in the second of these indicates that it applies to the standard normal distribution.

The NORMDIST and NORMSDIST functions return left-tail probabilities, such as the probability that a normally distributed variable is less than 35. The
syntax for these functions is

=NORMDIST(x, μ, σ, 1)

and

=NORMSDIST(x)

Here, x is a number you supply, and μ and σ are the mean and standard deviation of the normal distribution. The last argument in the NORMDIST function, 1, is used to obtain the cumulative normal probability, the type usually required. (This 1 is a nuisance to remember, but it is necessary.) Note that NORMSDIST takes only one argument (because μ and σ are known to be 0 and 1), so it is easier to use when it applies.

The NORMINV and NORMSINV functions return values for user-supplied probabilities. For example, if you supply the probability 0.95, these functions return the 95th percentile. Their syntax is

=NORMINV(p, μ, σ)

and

=NORMSINV(p)

where p is a probability you supply. These are analogous to the NORMDIST and NORMSDIST functions (except there is no fourth argument in the NORMINV function).

CHANGES IN EXCEL 2010

Many of the statistical functions have been revamped in Excel 2010, as we will point out throughout the next few chapters. Microsoft evidently wanted a more consistent naming convention that would make functions better match the ways they are used in statistical inference. All of the old functions, including the normal functions discussed here, are still available for compatibility, but Microsoft is hoping that users will switch to the new functions. The new normal functions are NORM.DIST, NORM.S.DIST, NORM.INV, and NORM.S.INV. These work exactly like the old normal functions except that NORM.S.DIST takes the same last "cumulative" argument as was explained above for NORMDIST. The
new and old functions are both shown in the file for the next example.

FUNDAMENTAL INSIGHT

Probability and Percentile Calculations

There are two basic types of calculations involving probability distributions, normal or otherwise. In a probability calculation, you provide a possible value, and you ask for the probability of being less than or equal to this value. In a percentile calculation, you provide a probability, and you ask for the value that has this probability to the left of it. Excel's statistical functions, especially with the new names in Excel 2010, use DIST in functions that perform probability calculations and INV (for inverse) in functions that perform percentile calculations.

We illustrate these Excel functions in the following example.³

EXAMPLE 5.2 BECOMING FAMILIAR WITH NORMAL CALCULATIONS IN EXCEL

Use Excel to calculate the following probabilities and percentiles for the standard normal distribution: (a) P(Z < −2), (b) P(Z > 1), (c) P(−0.4 < Z < 1.6), (d) the 5th percentile, (e) the 75th percentile, and (f) the 99th percentile. Then for the N(75, 8) distribution, find the following probabilities and percentiles: (a) P(X < 70), (b) P(X > 73), (c) P(75 < X < 85), (d) the 5th percentile, (e) the 60th percentile, and (f) the 97th percentile.

³ Actually, we already illustrated the NORMSDIST function; it was used to create the body of Figure 5.5. In other words, you can use it to build your own normal probability table.

Objective: To calculate probabilities and percentiles for standard normal and general normal distributions in
Excel.

Solution

The solution appears in Figure 5.6. (See the file Normal Calculations.xlsx.) The N(0, 1) calculations are in rows 7 through 14; the N(75, 8) calculations are in rows 23 through 30. For your convenience, the formulas used in column B are spelled out in column D (as labels). Note that the standard normal calculations use the normal functions with the "S" in the middle; the rest use the normal functions without the "S" and require more arguments. The Excel 2010 functions don't appear in this figure, but they are included in the file.

Figure 5.6 Normal Calculations with Excel Functions

| Row | A | B | D (formula used in column B) |
|---|---|---|---|
| 1 | Normal probability calculations | | |
| 3 | Examples with standard normal | | |
| 5 | Probability calculations | | |
| 6 | Range | Probability | Formula |
| 7 | Less than −2 | 0.0228 | =NORMSDIST(-2) |
| 8 | Greater than 1 | 0.1587 | =1-NORMSDIST(1) |
| 9 | Between −0.4 and 1.6 | 0.6006 | =NORMSDIST(1.6)-NORMSDIST(-0.4) |
| 11 | Percentiles | | |
| 12 | 5th | −1.645 | =NORMSINV(0.05) |
| 13 | 75th | 0.674 | =NORMSINV(0.75) |
| 14 | 99th | 2.326 | =NORMSINV(0.99) |
| 16 | Examples with nonstandard normal | | |
| 18 | Mean | 75 | |
| 19 | Stdev | 8 | |
| 21 | Probability calculations | | |
| 22 | Range | Probability | Formula |
| 23 | Less than 70 | 0.2660 | =NORMDIST(70,Mean,Stdev,1) |
| 24 | Greater than 73 | 0.5987 | =1-NORMDIST(73,Mean,Stdev,1) |
| 25 | Between 75 and 85 | 0.3944 | =NORMDIST(85,Mean,Stdev,1)-NORMDIST(75,Mean,Stdev,1) |
| 27 | Percentiles | | |
| 28 | 5th | 61.841 | =NORMINV(0.05,Mean,Stdev) |
| 29 | 60th | 77.027 | =NORMINV(0.6,Mean,Stdev) |
| 30 | 97th | 90.046 | =NORMINV(0.97,Mean,Stdev) |

Range names used: Mean = Normal!$B$18, Stdev = Normal!$B$19.

Note the following for normal probability calculations:

- For "less than" probabilities, use NORMDIST or NORMSDIST directly. (See rows 7 and 23.)
- For "greater than"
probabilities, subtract the NORMDIST or NORMSDIST function from 1. (See rows 8 and 24.)
- For "between" probabilities, subtract the two NORMDIST or NORMSDIST functions. For example, in row 9 the probability of being between −0.4 and 1.6 is the probability of being less than 1.6 minus the probability of being less than −0.4.

The percentile calculations are even more straightforward. In most percentile problems you want to find the value with a certain probability to the left of it. In this case you use the NORMINV or NORMSINV function with the specified probability as the first argument. See rows 12 through 14 and 28 through 30.

There are a couple of variations of percentile calculations. First, suppose you want the value with probability 0.05 to the right of it. This is the same as the value with probability 0.95 to the left of it, so you use NORMINV or NORMSINV with probability argument 0.95. For example, the value with probability 0.4 to the right of it in the N(75, 8) distribution is 77.027. (See cell B29 in Figure 5.6.)

As a second variation, suppose you want to find an interval of the form −x to x, for some positive number x, with (1) probability 0.025 to the left of −x, (2) probability 0.025 to the right of x, and (3) probability 0.95 between −x and x. This is a very common problem in statistical inference. In general, you want a probability (such as 0.95) to be in the middle of the interval, so that half of the remaining probability (0.025) is in each of the tails. (See Figure 5.7.) Then the required x can be found with NORMINV or NORMSINV, using probability argument 0.975, because there must be a total probability of 0.975 to the left of x.

For example, if the relevant distribution is the standard normal, the required value of x is 1.96, found with the function NORMSINV(0.975). Similarly, if you want probability 0.90 in the middle and probability 0.05 in each tail, the required x is 1.645, found with the function NORMSINV(0.95). Remember these two numbers, 1.96 and 1.645. They occur frequently in statistical applications.
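For readers working outside Excel, the same probability and percentile calculations can be sketched with Python's standard library. This is an illustrative translation, not the book's method: `statistics.NormalDist.cdf` plays the role of NORMDIST/NORMSDIST with cumulative argument 1, and `inv_cdf` plays the role of NORMINV/NORMSINV. The variable names are arbitrary choices made for this sketch.

```python
from statistics import NormalDist

Z = NormalDist()                # standard normal, N(0, 1)
X = NormalDist(mu=75, sigma=8)  # N(75, 8) from Example 5.2

# Probability calculations: cdf(x) is the left-tail probability.
p_less = Z.cdf(-2)                    # P(Z < -2)
p_greater = 1 - Z.cdf(1)              # P(Z > 1): subtract from 1
p_between = Z.cdf(1.6) - Z.cdf(-0.4)  # P(-0.4 < Z < 1.6): difference of two

# Percentile (inverse) calculations: inv_cdf(p) returns the value
# with probability p to the left of it.
pct_5 = Z.inv_cdf(0.05)   # 5th percentile of N(0, 1)
x_60 = X.inv_cdf(0.60)    # also the value with probability 0.4 to its right
z_975 = Z.inv_cdf(0.975)  # leaves probability 0.95 between -x and x
z_95 = Z.inv_cdf(0.95)    # leaves probability 0.90 between -x and x

print(f"{p_less:.4f} {p_greater:.4f} {p_between:.4f}")
print(f"{pct_5:.3f} {x_60:.3f} {z_975:.3f} {z_95:.3f}")
```

The printed values can be compared against the corresponding entries in Figure 5.6 and against the two values 1.96 and 1.645 quoted in the text.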
[Figure 5.7 Typical Normal Probabilities: for the standard normal distribution, probability 0.95 lies between −1.96 and 1.96, with probability 0.025 in each tail]

5.2.6 Empirical Rules Revisited

We introduced three empirical rules in Chapter 2 that apply to many data sets. Namely, about 68% of the data fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and almost all fall within three standard deviations of the mean. For these rules to hold with real data, the distribution of the data must be at least approximately symmetric and bell-shaped. Let's look at these rules more closely.

Let X be normally distributed with mean μ and standard deviation σ. To perform a probability calculation on X, we can first standardize X and then perform the calculation on the standardized variable Z. Specifically, we will find the probability that X is within k standard deviations of its mean for k = 1, k = 2, and k = 3. In general, this probability is P(μ − kσ < X < μ + kσ). But by standardizing the values μ − kσ and μ + kσ, we obtain the equivalent probability P(−k < Z < k), where Z has a N(0, 1) distribution. This latter probability can be calculated in Excel with the formula

=NORMSDIST(k)-NORMSDIST(-k)

The normal distribution is the basis for the empirical rules introduced in Chapter 2.

By substituting the values 1, 2, and 3 for k, we find the following probabilities:

$$P(-1 < Z < 1) = 0.6827$$

$$P(-2 < Z < 2) = 0.9545$$

$$P(-3 < Z < 3) = 0.9973$$

As you can see, there is virtually no chance of being beyond three standard deviations from the mean, the chances are
about 19 out of 20 of being within two standard deviations of the mean, and the chances are about 2 out of 3 of being within one standard deviation of the mean. These probabilities are the basis for the empirical rules in Chapter 2. These rules more closely approximate reality as the histograms of observed data become more bell-shaped.

5.3 APPLICATIONS OF THE NORMAL DISTRIBUTION

In this section we apply the normal distribution to a variety of business problems.

EXAMPLE 5.3 PERSONNEL TESTING AT ZTEL

The personnel department of ZTel, a large communications company, is reconsidering its hiring policy. Each applicant for a job must take a standard exam, and the hire or no-hire decision depends at least in part on the result of the exam. The scores of all applicants have been examined closely. They are approximately normally distributed with mean 525 and standard deviation 55.

The current hiring policy occurs in two phases. The first phase separates all applicants into three categories: automatic accepts, automatic rejects, and maybes. The automatic accepts are those whose test scores are 600 or above. The automatic rejects are those whose test scores are 425 or below. All other applicants (the maybes) are passed on to a second phase where their previous job experience, special talents, and other factors are used as hiring criteria. The personnel manager at ZTel wants to calculate the percentage of applicants who are automatic accepts or rejects, given the current standards. She also wants to know how to change the standards to automatically reject 10% of all applicants and automatically accept 15% of all applicants.

Objective: To determine test scores that can be used to accept or reject job applicants at ZTel.

Solution

Let X be the test score of a typical applicant. Then historical data suggest that the distribution of X is N(525, 55). A probability such as P(X ≤ 425) can be interpreted as the probability that
a typical applicant is an automatic reject, or it can be interpreted as the percentage of all applicants who are automatic rejects. Given this observation, the solution to ZTel's problem appears in Figure 5.8. (See the file Personnel Decisions.xlsx.) The probability that a typical applicant is automatically accepted is 0.0863, found in cell B10 with the formula

=1-NORMDIST(B7,Mean,Stdev,1)

Figure 5.8 Calculations for Personnel Example

| Row | A | B | Formula used in column B |
|---|---|---|---|
| 1 | Personnel Decisions | | |
| 3 | Mean of test scores | 525 | |
| 4 | Stdev of test scores | 55 | |
| 6 | Current Policy | | |
| 7 | Automatic accept point | 600 | |
| 8 | Automatic reject point | 425 | |
| 10 | Percent accepted | 8.63% | =1-NORMDIST(B7,Mean,Stdev,1) |
| 11 | Percent rejected | 3.45% | =NORMDIST(B8,Mean,Stdev,1) |
| 13 | New Policy | | |
| 14 | Percent accepted | 15% | |
| 15 | Percent rejected | 10% | |
| 17 | Automatic accept point | 582 | =NORMINV(1-B14,Mean,Stdev) |
| 18 | Automatic reject point | 455 | =NORMINV(B15,Mean,Stdev) |

Range names used: Mean = Model!$B$3, Stdev = Model!$B$4.

Similarly, the probability that a typical applicant is automatically rejected is 0.0345, found in cell B11 with the formula

=NORMDIST(B8,Mean,Stdev,1)

Therefore, ZTel automatically accepts about 8.6% and rejects about 3.5% of all applicants under the current policy.

To find new cutoff values that reject 10% and accept 15% of the applicants, we need the 10th and 85th percentiles of the N(525, 55) distribution. These are 455 and 582 (rounded to the nearest integer), found in cells B18 and B17, respectively, with the formulas

=NORMINV(B15,Mean,Stdev)

and

=NORMINV(1-B14,Mean,Stdev)

To accomplish its objective, ZTel needs to raise the automatic rejection point from 425 to 455 and lower the automatic acceptance point from 600 to 582.

EXAMPLE 5.4 QUALITY
CONTROL AT PAPERSTOCK COMPANY

The PaperStock Company runs a manufacturing facility that produces a paper product. The fiber content of this product is supposed to be 20 pounds per 1000 square feet. (This is typical for the type of paper used in grocery bags, for example.) Because of random variations in the inputs to the process, however, the fiber content of a typical 1000-square-foot roll varies according to a N(μ, σ) distribution. The mean fiber content μ can be controlled, that is, it can be set to any desired level by adjusting an instrument on the machine. The variability in fiber content, as measured by the standard deviation σ, is 0.10 pound when the process is "good," but it sometimes increases to 0.15 pound when the machine goes "bad." A given roll of this product must be rejected if its actual fiber content is less than 19.8 pounds or greater than 20.3 pounds. Calculate the probability that a given roll is rejected, for a setting of μ = 20, when the machine is good and when it is bad.

Objective: To determine the machine settings that result in paper of acceptable quality at PaperStock Company.

Solution

Let X be the fiber content of a typical roll. The distribution of X will be either N(20, 0.10) or N(20, 0.15), depending on the status of the machine. In either case, the probability that the roll must be rejected can be calculated as shown in Figure 5.9. (See the file Paper Machine Settings.xlsx.) The formula for rejection in the good case appears in cell B12:

=NORMDIST(B8,Mean,Stdev_good,1)
+(1-NORMDIST(B9,Mean,Stdev_good,1))

This is the sum of two probabilities: the probability of being to the left of the lower limit and the probability of being to the right of the upper limit. These probabilities of rejection are represented graphically in Figure 5.10. A similar formula for the bad case appears in cell B13, using Stdev_bad in place of Stdev_good.

Figure 5.9 Calculations for Paper Quality Example

| Row | A | B | Formula or range name |
|---|---|---|---|
| 1 | Paper Machine Settings | | |
| 3 | Mean | 20 | Mean: =Model!$B$3 |
| 4 | Stdev in good case | 0.1 | Stdev_good: =Model!$B$4 |
| 5 | Stdev in bad case | 0.15 | Stdev_bad: =Model!$B$5 |
| 7 | Reject region | | |
| 8 | Lower limit | 19.8 | |
| 9 | Upper limit | 20.3 | |
| 11 | Probability of reject | | |
| 12 | in good case | 0.024 | =NORMDIST(B8,Mean,Stdev_good,1)+(1-NORMDIST(B9,Mean,Stdev_good,1)) |
| 13 | in bad case | 0.114 | =NORMDIST(B8,Mean,Stdev_bad,1)+(1-NORMDIST(B9,Mean,Stdev_bad,1)) |

Data table of rejection probability as a function of the mean (rows) and good standard deviation (columns):

| Mean | 0.10 | 0.11 | 0.12 | 0.13 | 0.14 | 0.15 |
|---|---|---|---|---|---|---|
| 19.7 | 0.841 | 0.818 | 0.798 | 0.779 | 0.762 | 0.748 |
| 19.8 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 |
| 19.9 | 0.159 | 0.182 | 0.203 | 0.222 | 0.240 | 0.256 |
| 20.0 | 0.024 | 0.038 | 0.054 | 0.072 | 0.093 | 0.114 |
| 20.1 | 0.024 | 0.038 | 0.054 | 0.072 | 0.093 | 0.114 |
| 20.2 | 0.159 | 0.182 | 0.203 | 0.222 | 0.240 | 0.256 |
| 20.3 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 |
| 20.4 | 0.841 | 0.818 | 0.798 | 0.779 | 0.762 | 0.748 |

[Figure 5.10 Rejection Regions for Paper Quality Example: a normal curve with the acceptance region between 19.8 pounds and 20.3 pounds, a lower rejection region to the left of 19.8 pounds, and an upper rejection region to the right of 20.3 pounds.]

You can see that the probability of a rejected roll in
the good case is 0.024; in the bad case it is 0.114. That is, when the standard deviation increases by 50%, from 0.10 to 0.15, the percentage of rolls rejected more than quadruples, from 2.4% to 11.4%.

It is certainly possible that the true process mean and good standard deviation will not always be equal to the values in cells B3 and B4. Therefore, it is useful to see how sensitive the rejection probability is to these two parameters. You can do this with a two-way data table, as shown in Figure 5.9. The tabulated values show that the probability of rejection varies greatly even for small changes in the key inputs. In particular, a combination of a badly centered mean and a large standard deviation can make the probability of rejection quite large.

(Margin note: To form this data table, enter the formula =B12 in cell B17, highlight the range B17:H25, and create a data table with row input cell B4 and column input cell B3.)

EXAMPLE 5.5 ANALYZING AN INVESTOR'S AFTER-TAX PROFIT

Howard Davis invests $10,000 in a certain stock on January 1. By examining past movements of this stock and consulting with his broker, Howard estimates that the annual return from this stock, X, is normally distributed with mean 10% and standard deviation 4%. Here X (when expressed as a decimal) is the profit Howard receives per dollar invested. It means that on December 31, his $10,000 will have grown to 10,000(1 + X) dollars. Because Howard is in the 33% tax bracket, he will then have to pay the Internal Revenue Service 33% of his profit. Calculate the probability that Howard will have to pay the IRS at least $400. Also, calculate the dollar amount such that Howard's after-tax profit is 90% certain to be less than this amount; that is, calculate the 90th percentile of his after-tax profit.

Objective: To determine the after-tax profit Howard Davis can be 90% certain of earning.

Solution

Howard's before-tax profit is 10,000X dollars, so the amount he pays the IRS is 0.33(10,000X), or 3300X dollars. We want the probability that this is at least $400. Because 3300X > 400 is the same as X > 4/33, the
probability of this outcome can be found as in Figure 5.11. (See the file Tax on Stock Return.xlsx.) It is calculated with the formula

=1-NORMDIST(400/(Amount_invested*Tax_rate),Mean,Stdev,1)

in cell B8. As you can see, Howard has about a 30% chance of paying at least $400 in taxes.

To answer the second question, note that the after-tax profit is 67% of the before-tax profit, or 6700X dollars, and we want its 90th percentile. If this percentile is x, then we know that P(6700X < x) = 0.90, which is the same as P(X < x/6700) = 0.90. In words, we want the 90th percentile of the X distribution to be x/6700. From cell B10 of Figure 5.11, the 90th percentile of X is 15.13%, so the required value of x is 6700(0.1513), or about $1013. Note that the mean after-tax profit is $670 (67% of the mean before-tax profit of 0.10 multiplied by $10,000). Of course, Howard might get lucky and make more than this, but he is 90% certain that his after-tax profit will be no greater than $1013.

Figure 5.11 Calculations for Taxable Returns Example

| Row | A | B | Formula or range name |
|---|---|---|---|
| 1 | Tax on Stock Return | | |
| 3 | Amount invested | $10,000 | Amount_invested: =Model!$B$3 |
| 4 | Mean | 10% | Mean: =Model!$B$4 |
| 5 | Stdev | 4% | Stdev: =Model!$B$5 |
| 6 | Tax rate | 33% | Tax_rate: =Model!$B$6 |
| 8 | Probability he pays at least $400 in taxes | 0.298 | =1-NORMDIST(400/(Amount_invested*Tax_rate),Mean,Stdev,1) |
| 10 | 90th percentile of stock return | 15.13% | =NORMINV(0.9,Mean,Stdev) |
| 11 | 90th percentile of after-tax return | $1013 | =(1-Tax_rate)*Amount_invested*B10 |

It is sometimes tempting to model every continuous random variable with a normal distribution. This can be dangerous for at least
two reasons. First, not all random variables have a symmetric distribution. Some are skewed to the left or the right, and for these the normal distribution can be a poor approximation to reality. The second problem is that many random variables in real applications must be nonnegative, and the normal distribution allows the possibility of negative values. The following example shows how assuming normality can get you into trouble if you aren't careful.

EXAMPLE 5.6 PREDICTING FUTURE DEMAND FOR MICROWAVE OVENS AT HIGHLAND COMPANY

The Highland Company is a retailer that sells microwave ovens. The company wants to model its demand for microwaves over the next 12 years. Using historical data as a guide, it assumes that demand in year 1 is normally distributed with mean 5000 and standard deviation 1500. It assumes that demand in each subsequent year is normally distributed with mean equal to the actual demand from the previous year and standard deviation 1500. For example, if demand in year 1 turns out to be 4500, then the mean demand in year 2 is 4500. This assumption is plausible because it leads to correlated demands. For example, if demand is high one year, it will tend to be high the next year. Investigate the ramifications of this model, and suggest models that might be more realistic.

Objective: To construct and analyze a spreadsheet model for microwave oven demand over the next 12 years using Excel's NORMINV function, and to show how models using the normal distribution can lead to nonsensical outcomes unless
they are modified appropriately.

Solution

The best way to analyze this model is with simulation, much as in Chapter 4. To do this, you must be able to simulate normally distributed random numbers in Excel. You can do this with the NORMINV function. Specifically, to generate a normally distributed number with mean μ and standard deviation σ, use the formula

=NORMINV(RAND(),μ,σ)

(Margin note: To generate a random number from a normal distribution, use NORMINV with three arguments: RAND(), the mean, and the standard deviation.)

Because this formula uses the RAND function, it generates a different random number each time it is used and each time the spreadsheet recalculates.[4]

[4] To see why this formula makes sense, note that the RAND function in the first argument generates a uniformly distributed random value between 0 and 1. Therefore, the effect of the function is to generate a random percentile from the normal distribution.

The spreadsheet in Figure 5.12 shows a simulation of yearly demands over a 12-year period. (See the file Oven Demand Simulation.xlsx.) To simulate the demands in row 15, enter the formula

=NORMINV(RAND(),B6,B7)

in cell B15.

Figure 5.12 One Set of Demands for Model 1 in the Microwave Example

Normal model for multiperiod demand. Assumptions of a tentative model:
Demand in year 1 (normally distributed): Mean 5000 (cell B6), Stdev 1500 (cell B7)
Demand in other years (normally distributed): Mean = actual demand in previous year, Stdev 1500 (cell B11)

Simulated demands (rows 14 and 15):

| Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Demand | 5266 | 7657 | 7420 | 8094 | 9099 | 11674 | 7245 | 7191 | 8420 | 8638 | 9702 | 7275 |

[Chart: Time series of demand, years 1 through 12, vertical axis from 0 to 14000.]

Then enter the formula

=NORMINV(RAND(),B15,$B$11)
in cell C15 and copy it across row 15. (Note how the mean demand in any year is the simulated demand from the previous year.) As the accompanying time series graph of these demands indicates, the model seems to be performing well.

However, the simulated demands in Figure 5.12 are only one set of possible demands. Remember that each time the spreadsheet recalculates, all of the random numbers change.[5] Figure 5.13 shows a different set of random numbers generated by the same formulas. Clearly, the model is not working well in this case: some demands are negative, which makes no sense. The problem is that if the actual demand is low in one year, there is a fairly good chance that the next normally distributed demand will be negative. You can check, by recalculating many times, that the demand sequence is usually all positive, but every now and then a nonsense sequence as in Figure 5.13 appears. We need a new model.

One way to modify the model is to let the standard deviation and mean move together. That is, if the mean is low, then the standard deviation will also be low. This minimizes the chance that the next random demand will be negative. Besides, this type of model is probably more realistic. If demand in one year is low, there could be less variability in next year's demand. Figure 5.14 illustrates one way (but not the only way) to model this changing standard deviation.

Figure 5.13 Another Set of Demands for Model 1 in the Microwave Example

Normal model for multiperiod demand. Assumptions of a tentative model:
Demand in year 1 (normally distributed): Mean 5000, Stdev 1500
Demand in other years (normally distributed): Mean = actual demand in previous year, Stdev 1500
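The recalculation experiment described for Model 1 can also be sketched outside Excel. The Python below is only an illustration under stated assumptions: it uses the standard library's `random.gauss` in place of `=NORMINV(RAND(),mean,stdev)`, and the function names `simulate_model1` and `fraction_with_negative` are hypothetical, not part of the book's workbook. It estimates how often a 12-year sequence contains at least one negative demand.

```python
import random


def simulate_model1(years=12, mean1=5000.0, stdev=1500.0, rng=random):
    """Return one demand path for Model 1.

    Year 1 is N(mean1, stdev); each later year is normal with mean equal
    to the previous year's simulated demand and the same fixed stdev.
    """
    demands = [rng.gauss(mean1, stdev)]
    for _ in range(years - 1):
        demands.append(rng.gauss(demands[-1], stdev))
    return demands


def fraction_with_negative(replications=10_000, seed=1):
    """Estimate the share of 12-year paths with at least one negative demand."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    bad = sum(
        1 for _ in range(replications) if min(simulate_model1(rng=rng)) < 0
    )
    return bad / replications


if __name__ == "__main__":
    share = fraction_with_negative()
    print(f"Share of 12-year paths with a negative demand: {share:.3f}")
```

Each call to `simulate_model1` plays the role of one spreadsheet recalculation, so the reported share is a rough measure of how often a nonsense sequence would appear.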
Simulated demands for Figure 5.13 (rows 14 and 15; signs are not recoverable from this copy):

| Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Demand | 5528 | 3874 | 3268 | 2416 | 2348 | 4181 | 3697 | 3337 | 1064 | 100 | 116 | 988 |

[Chart: Time series of demand, years 1 through 12, vertical axis from -2000 to 6000.]

[5] The usual way to get Excel to recalculate is to press the F9 key. However, this makes all of the data tables in the workbook recalculate, which can take significant time. Because there is a data table in another sheet of the Oven Demand Simulation.xlsx file, we suggest a different way to recalculate. Simply position the cursor on any blank cell and press the Delete key.

Figure 5.14 Generated Demands for Model 2 in Microwave Example

Normal model for multiperiod demand. Assumptions of a "safer" model:
Demand in year 1 (normally distributed): Mean 5000, Stdev 1500
Demand in other years (normally distributed): Mean = actual demand in previous year, Stdev = 1500 times ratio of previous year's actual demand to year 1's mean demand

| Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Demand | 6521 | 6255 | 8239 | 6856 | 9638 | 7045 | 7122 | 4877 | 7212 | 10681 | 5211 | 4211 |

[Chart: Time series of demands, years 1 through 12, vertical axis from 0 to 12000.]

We let the standard deviation of demand in any year (after year 1) be the original standard deviation, 1500, multiplied by the ratio of the expected demand for this year to the expected
demand in year 1. For example, if demand in some year is 500, the expected demand next year is 500, and the standard deviation of next year's demand is reduced to 1500(500/5000) = 150. The only change to the spreadsheet model is in row 15, where cell C15 contains the formula

=NORMINV(RAND(),B15,$B$7*B15/$B$6)

and is copied across row 15. Now the chance of a negative demand is practically negligible because this would require a value more than three standard deviations below the mean.

Unfortunately, the model in Figure 5.14 is still not foolproof. If you recalculate many times, negative demands still appear occasionally. To be even safer, it is possible to truncate the demand distribution at some nonnegative value such as 250, as shown in Figure 5.15. Now a random demand is generated as in the previous model, but if this randomly generated value is below 250, it is replaced by 250. This is done with the formulas

=MAX(NORMINV(RAND(),B8,B9),$D$5)

and

=MAX(NORMINV(RAND(),B17,$B$9*B17/$B$8),$D$5)

in cells B17 and C17, and copying this latter formula across row 17.

Figure 5.15 Generated Demands for a Truncated Model in Microwave Example

Normal model for multiperiod demand. Assumptions of an even "safer" model:
Minimum demand in any year: 250
Demand in year 1 (truncated normal): Mean 5000, Stdev 1500
Demand in other years (truncated normal): Mean = actual demand in previous year, Stdev = 1500 times ratio of previous year's actual demand to year 1's mean demand

Whether this is the way the demand process works for Highland's microwaves is an open question, but at
least it prevents demands from becoming negative, or even falling below 250. Moreover, this type of truncation is a common way of modeling when you want to use a normal distribution but for physical reasons cannot allow the random quantities to become negative.

Simulated demands for Figure 5.15 (rows 16 and 17):

| Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Demand | 4087 | 1956 | 2274 | 2846 | 1947 | 1458 | 1969 | 1887 | 1572 | 2695 | 2483 | 1088 |

[Chart: Time series of demands, years 1 through 12, vertical axis from 0 to 4500.]

Before leaving this example, we challenge your intuition. In the final model in Figure 5.15, the demand in any year (say, year 6) is, aside from the truncation, normally distributed with a mean and standard deviation that depend on the previous year's demand. Does this mean that if you recalculate many times and keep track of the year 6 demand each time, the resulting histogram of these year 6 demands will be normally distributed? Perhaps surprisingly, the answer is a clear "no." Evidence of this appears in Figures 5.16 and 5.17. In Figure 5.16 we use a data table to obtain 400 replications of demand in year 6 (in column B). Then we use StatTools's histogram procedure to create a histogram of these simulated demands in Figure 5.17. It is clearly skewed to the right and nonnormal.

What causes this distribution to be nonnormal? It is not the truncation. Truncation has a relatively minor effect because most of the demands don't need to be truncated. The real reason is that the distribution of year 6 demand is only normal conditional on the demand in year 5. That is, if we fix the demand in year 5 at any level and then replicate year 6 demand many times, the resulting histogram is normally shaped. But the year 5 demand is not fixed. It varies from replication to replication, and this variation causes the skewness in Figure 5.17. Admittedly, the reason for this skewness is not intuitively obvious, but simulation makes it easy to demonstrate.
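The claim about year 6 demand can be checked with a short simulation. The sketch below is a hedged Python illustration of the truncated model, not the book's data table or StatTools procedure: it assumes `random.gauss` as a stand-in for `=NORMINV(RAND(),mean,stdev)`, and the names `year_demand_truncated`, `sample_skewness`, and `replicate_year6` are hypothetical. It replicates year 6 demand many times and reports the mean, median, and a simple skewness measure; a mean above the median together with positive skewness indicates a right-skewed distribution.

```python
import random
import statistics

MIN_DEMAND = 250.0   # truncation point from the example
MEAN1 = 5000.0       # mean demand in year 1
STDEV1 = 1500.0      # standard deviation of demand in year 1


def year_demand_truncated(year, rng):
    """Simulate the truncated model up to `year` and return that year's demand."""
    demand = max(rng.gauss(MEAN1, STDEV1), MIN_DEMAND)
    for _ in range(year - 1):
        # Standard deviation scales with the previous year's demand.
        stdev = STDEV1 * demand / MEAN1
        demand = max(rng.gauss(demand, stdev), MIN_DEMAND)
    return demand


def sample_skewness(values):
    """Average cubed z-score (population form); positive means right-skewed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def replicate_year6(replications=400, seed=1):
    """Return a list of independent replications of year 6 demand."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [year_demand_truncated(6, rng) for _ in range(replications)]


if __name__ == "__main__":
    demands = replicate_year6()
    print(f"mean     = {statistics.fmean(demands):.0f}")
    print(f"median   = {statistics.median(demands):.0f}")
    print(f"skewness = {sample_skewness(demands):.2f}")
```

Fixing the year 5 demand and calling `rng.gauss` repeatedly for year 6 alone would give a symmetric sample, which is the conditional normality the text describes; the skewness appears only when earlier years are allowed to vary as well.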
[Figure 5.16: data table with 400 replications of year 6 demand. Average 4916, Stdev 3956. Replications 1 through 8: 1635, 8229, 3582, 11282, 2845, 3942, 5700, 12273. Replications 396 through 400: 8919, 4587, 10003, 5012, 3944.]

[Figure 5.17 Histogram of Year 6 Demands: frequency on the vertical axis (0 to 80), demand on the horizontal axis.]

PROBLEMS

Note: Student solutions for problems whose numbers appear within a colored box are available for purchase at www.cengagebrain.com.

Level A

1. The grades on the midterm examination given in a large managerial statistics class are normally distributed with mean 75 and standard deviation 9. The instructor of this class wants to assign an A grade to the top 10% of the scores, a B grade to the next 10% of the scores, a C grade to the next 10% of the scores, a D grade to the next 10% of the scores, and an F grade to all scores below the 60th percentile of this distribution. For each possible letter grade, find the lowest acceptable score within the established range. For example, the lowest acceptable score for an A is the score at the 90th percentile of this normal distribution.

2. Suppose it is known that the distribution of purchase amounts by customers entering a popular retail store is approximately normal with mean $25 and standard deviation $8.
a. What is the
probability that a randomly selected customer spends less than $35 at this store?
b. What is the probability that a randomly selected customer spends between $15 and $35 at this store?
c. What is the probability that a randomly selected customer spends more than $10 at this store?
d. Find the dollar amount such that 75% of all customers spend no more than this amount.
e. Find the dollar amount such that 80% of all customers spend at least this amount.
f. Find two dollar amounts, equidistant from the mean, such that 90% of all customer purchases are between these values.

3. A machine used to regulate the amount of a certain chemical dispensed in the production of a particular type of cough syrup can be set so that it discharges an average of μ milliliters (ml) of the chemical in each bottle of cough syrup. The amount of chemical placed into each bottle of cough syrup is known to have a normal distribution with a standard deviation of 0.250 ml. If this machine discharges more than 2 ml of the chemical when preparing a given bottle of this cough syrup, the bottle is considered to be unacceptable by industry standards. Determine the setting for μ so that no more than 1% of the bottles of cough syrup prepared by this machine will be rejected.

4. Assume that the weekly demand for Ford car sales follows a normal distribution with mean 50,000 cars and standard deviation 14,000 cars.
a. There is a 1% chance that Ford will sell more than what number of cars during the next year?
b. What is the probability that
Ford will sell between 2.4 and 2.7 million cars during the next year?

5. An investor has invested in nine different investments. The dollar returns on the different investments are probabilistically independent, and each return follows a normal distribution with mean $50,000 and standard deviation $10,000.
a. There is a 1% chance that the total return on the nine investments is less than what value? (Use the fact that the sum of independent normal random variables is normally distributed, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.)
b. What is the probability that the investor's total return is between $400,000 and $520,000?

6. Scores on an exam appear to follow a normal distribution with μ = 60 and σ = 20. The instructor wishes to give a grade of D to students scoring between the 10th and 30th percentiles on the exam. For what range of scores should a D be given? What percentage of the students will get a D?

7. Suppose that the weight of a typical American male follows a normal distribution with μ = 180 lb and σ = 30 lb. Also, suppose 91.92% of all American males weigh more than I weigh.
a. What fraction of American males weigh more than 225 pounds?
b. How much do I weigh?
c. If I weighed 20 pounds more than I do, what percentile would I be in?

8. Assume that the length of a typical televised baseball game, including all the commercial timeouts, is normally distributed with mean 2.45 hours and standard deviation 0.37 hour. Consider a televised baseball game that begins at 2:00 in the afternoon. The next regularly scheduled broadcast is at 5:00.
a. What is the probability that the game will cut into the next show, that is, go past 5:00?
b. If the game is over before 4:30, another half-hour show can be inserted into the 4:30 to 5:00 slot. What is the probability of this occurring?

9. The amount of a soft drink that goes into a typical 12-ounce can varies from can to can. It is normally distributed with an adjustable mean μ and a fixed standard deviation of 0.05 ounce. (The adjustment is
made to the filling machine.)
a. If regulations require that cans have at least 11.9 ounces, what is the smallest mean μ that can be used so that at least 99.5% of all cans meet the regulation?
b. If the mean setting from part a is used, what is the probability that a typical can has at least 12 ounces?

10. Suppose that the demands for a company's product in weeks 1, 2, and 3 are each normally distributed. The means are 50, 45, and 65. The standard deviations are 10, 5, and 15. Assume that these three demands are probabilistically independent. This means that if you observe one of them, it doesn't help you to predict the others. Then it turns out that total demand for the three weeks is also normally distributed. Its mean is the sum of the individual means, and its variance is the sum of the individual variances. (Its standard deviation, however, is not the sum of the individual standard deviations; square roots don't work that way.)
a. Suppose that the company currently has 180 units in stock, and it will not be receiving any more shipments from its supplier for at least three weeks. What is the probability that stock will run out during this three-week period?
b. How many units should the company currently have in stock so that it can be 98% certain of not running out during this three-week period? Again, assume that it won't receive any more shipments during this period.

Level B

11. Matthew's Bakery prepares peanut butter cookies for sale every morning. It costs the bakery $0.50 to bake each peanut butter cookie, and each cookie is sold for $1.25. At the end of the day, leftover cookies are discounted and sold the following day at $0.40 per cookie. The daily demand (in dozens) for peanut butter cookies at this bakery is known to be normally distributed with mean 200 and standard deviation 60.
The manager of Matthew's Bakery is trying to determine how many dozen peanut butter cookies to make each morning to maximize the product's contribution to bakery profits. Use simulation to find a very good, if not optimal, production plan.

12. The manufacturer of a particular bicycle model has the following costs associated with the management of this product's inventory. In particular, the company currently maintains an inventory of 1000 units of this bicycle model at the beginning of each year. If X units are demanded each year and X is less than 1000, the excess supply, 1000 − X units, must be stored until next year at a cost of $50 per unit. If X is greater than 1000 units, the excess demand, X − 1000 units, must be produced separately at an extra cost of $80 per unit. Assume that the annual demand X for this bicycle model is normally distributed with mean 1000 and standard deviation 75.
a. Find the expected annual cost associated with managing potential shortages or surpluses of this product. (Hint: Use simulation to approximate the answer. An exact solution using probability arguments is beyond the level of this book.)
b. Find two annual total cost levels, equidistant from the expected value found in part a, such that 95% of all costs associated with managing potential shortages or surpluses of this product are between these values. (Continue to use simulation.)
c. Comment on this manufacturer's annual production policy for this bicycle model in light of your findings in part b.

13. Suppose that a particular production process fills detergent in boxes of a given size. Specifically, this process fills the boxes with an amount of detergent (in ounces) that is adequately described by a normal distribution with mean 50 and standard deviation 0.5.
a. Simulate
this production process for the filling of 500 boxes of detergent. Find the mean and standard deviation of your simulated sample weights. How do your sample statistics compare to the theoretical population parameters in this case? How well do the empirical rules apply in describing the variation in the weights in your simulated detergent boxes?
b. A box of detergent is rejected by quality control personnel if it is found to contain less than 49 ounces or more than 51 ounces of detergent. Given these quality standards, what proportion of all boxes are rejected? What steps could the supervisor of this production process take to reduce this proportion to 1%?

14. It is widely known that many drivers on interstate highways in the United States do not observe the posted speed limit. Assume that the actual rates of speed driven by U.S. motorists are normally distributed with mean μ mph and standard deviation 5 mph. Given this information, answer each of the following independent questions. (Hint: Use Goal Seek in parts a and b, and use the Solver add-in with no objective in part c. Solver is usually used to optimize, but it can also be used to solve equations with multiple unknowns.)
a. If 40% of all U.S. drivers are observed traveling at 65 mph or more, what is the mean μ?
b. If 25% of all U.S. drivers are observed traveling at 50 mph or less, what is the mean μ?
c. Suppose now that the mean μ and standard deviation σ of this distribution are both unknown. Furthermore, it is observed that 40% of all U.S. drivers travel at less than 55 mph and 10% of all U.S. drivers travel at more than 70 mph. What must μ and σ be?

15. The lifetime of a certain manufacturer's washing machine is normally distributed with mean 4 years. Only 15% of all these washing machines last at least 5 years. What is the standard deviation of the lifetime of a washing machine made by this manufacturer?

16. You have been told that the distribution of regular unleaded gasoline prices over all gas stations in Indiana is normally distributed with
mean $2.95 and standard deviation $0.075, and you have been asked to find two dollar values such that 95% of all gas stations charge somewhere between these two values. Why is each of the following an acceptable answer: between $2.776 and $3.081, or between $2.803 and $3.097? Can you find any other acceptable answers? Which of the many possible answers would you give if you are asked to obtain the shortest interval?

17. A fast-food restaurant sells hamburgers and chicken sandwiches. On a typical weekday the demand for hamburgers is normally distributed with mean 313 and standard deviation 57; the demand for chicken sandwiches is normally distributed with mean 93 and standard deviation 22.
a. How many hamburgers must the restaurant stock to be 98% sure of not running out on a given day?
b. Answer part a for chicken sandwiches.
c. If the restaurant stocks 400 hamburgers and 150 chicken sandwiches for a given day, what is the probability that it will run out of hamburgers or chicken sandwiches (or both) that day? Assume that the demand for hamburgers and the demand for chicken sandwiches are probabilistically independent.
d. Why is the independence assumption in part c probably not realistic? Using a more realistic assumption, do you think the probability requested in part c would increase or decrease?

18. Referring to the box plots introduced in Chapter 2, the sides of the box are at the first and third quartiles, and the difference between these (the length of the box) is called the interquartile range (IQR). A mild
answers depend on the mean andor standard outlier is an observation that is between 15 and deviation of the distribution 3 IQRs from the box and an extreme outlier is b Check your answers in part a with simulation an observation that is more than 3 IQRs from Simulate a large number of normal random numbers the box you can choose any mean and standard deviation a If the data are normally distributed what and count the number of mild and extreme outliers percentage of values will be mild outliers What with appropriate IF functions Do these match at percentage will be extreme outliers Why don t the least approximately your answers to part a 54 THE BINOMIAL DISTRIBUTION The normal distribution is undoubtedly the most important probability distribution in sta tistics Not far behind however is the binomial distribution The binomial distribution is a discrete distribution that can occur in two situations 1 when sampling from a popula tion with only two types of members males and females for example and 2 when performing a sequence of identical experiments each of which has only two possible O11tCO1TlCS FUNDAMENTAL INSIGHT Imagine any experiment that can be repeated Why the Binomial Distribution many times under identical conditions It is common to refer to each repetition of the experiment as a trial We assume that the outcomes of successive tri Unilte the normal distributionwhich can describe all sorts of random phenomena the binomial distribu tion is relevant for a very common and specific situa tion the number of successes in a xed number of trials where the trials are probabilistically indepen als are probabilistically independent of one another and that each trial has only two possible outcomes We label these two possibilities generically as suc cess and failure In any particular application the outcomes might be DemocratRepublican defec dent and the probability of success remains constant across trials Whenever this situation occurs the binomial distribution is the 
relevant distribution tivenondefective went bankruptremained solvent and so on We label the probability of a success on each trial as p and the probability of a failure as 1 19 We let n be the number of trials Binomial Distribution Consider a situation where there are n independent identical trials where the probability of a success on each trial is p and the probability of a failure is 1 p De ne X to be the random number of successes in the n trials Then X has a binomial distribution with parameters n and 19 For example the binomial distribution with parameters 100 and 03 is the distribution of the number of successes in 100 trials when the probability of success is 03 on each trial A simple example that you can keep in mind throughout this section is the number of heads you would see if you ipped a coin n times Assuming the coin is well balanced the rele vant distribution is binomial with parameters n and p 05 This coin ipping example is often used to illustrate the binomial distribution because of its simplicity but you will see that the binomial distribution also applies to many important business situations To understand how the binomial distribution works consider the coin ipping example with n 3 If X represents the number of heads in three ips of the coin then the possible values of X are 0 1 2 and 3 You can nd the probabilities of these values by considering the eight possible outcomes of the three ips TTT TTH THT 54 The Binomial Distribution 233 Copyright 2010 Cengage Leaming All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall leaming experience Cengage Leaming reserves the right to remove additional content at any time if subsequent rights restrictions require it HTT THH HTH HHT and HHH Because of symmetry the well balanced property of 
the coin, each of these eight possible outcomes must have the same probability, so each must have probability 1/8. Next, note that one of the outcomes has $X = 0$, three outcomes have $X = 1$, three outcomes have $X = 2$, and one outcome has $X = 3$. Therefore, the probability distribution of $X$ is

$$P(X = 0) = 1/8, \quad P(X = 1) = 3/8, \quad P(X = 2) = 3/8, \quad P(X = 3) = 1/8$$

This is a special case of the binomial distribution, with $n = 3$ and $p = 0.5$. In general, where $n$ can be any positive integer and $p$ can be any probability between 0 and 1, there is a rather complex formula for calculating $P(X = k)$ for any integer $k$ from 0 to $n$. Instead of presenting this formula, we will discuss how to calculate binomial probabilities in Excel. You do this with the BINOMDIST function. The general form of this function is

`=BINOMDIST(k, n, p, cum)`

The middle two arguments are the number of trials $n$ and the probability of success $p$ on each trial. The first parameter, k, is an integer number of successes that you specify. The last parameter, cum, is either 0 or 1. It is 1 if you want the probability of less than or equal to k successes, and it is 0 if you want the probability of exactly k successes. We illustrate typical binomial calculations in the following example.

CHANGES IN EXCEL 2010: As with the new normal functions, there are new binomial functions in Excel 2010. The BINOMDIST and CRITBINOM functions in the following example have been replaced by BINOM.DIST and BINOM.INV, but the old functions still work fine. Both versions are indicated in the file for the following example.

EXAMPLE 5.7 BATTERY LIFE EXPERIMENT

Suppose that 100 identical batteries are inserted in identical flashlights. Each flashlight takes a single battery. After eight hours of continuous use, a given battery is still operating with probability 0.6 or has failed with probability 0.4. Let $X$ be the number of successes in these 100 trials, where a success means that the battery is still functioning. Find the probabilities of the following events: (a) exactly 58 successes, (b) no more than 65 successes, (c) less than 70 successes, (d) at least 59
successes, (e) greater than 65 successes, (f) between 55 and 65 successes, inclusive, (g) exactly 40 failures, (h) at least 35 failures, and (i) less than 42 failures. Then find the 95th percentile of the distribution of $X$.

Objective: To use Excel's BINOMDIST and CRITBINOM functions for calculating binomial probabilities and percentiles in the context of flashlight batteries.

Solution: Figure 5.18 shows the solution to all of these problems. (See the file Binomial Calculations.xlsx.) The probabilities requested in parts (a) through (f) all involve the number of successes $X$. The key to these is the wording of phrases such as "no more than," "greater than," and so on. In particular, you have to be careful to distinguish between probabilities such as $P(X < k)$ and $P(X \le k)$. The latter includes the possibility of having $X = k$ and the former does not.

Figure 5.18 Typical Binomial Calculations

| Cell | Input | Value | Range name |
|---|---|---|---|
| B3 | Number of trials | 100 | NTrials (`=BinomCalcs!$B$3`) |
| B4 | Probability of success on each trial | 0.6 | PSuccess (`=BinomCalcs!$B$4`) |

| Row | Event | Probability | Formula |
|---|---|---|---|
| 7 | Exactly 58 successes | 0.0742 | `=BINOMDIST(58,NTrials,PSuccess,0)` |
| 8 | No more than 65 successes | 0.8697 | `=BINOMDIST(65,NTrials,PSuccess,1)` |
| 9 | Less than 70 successes | 0.9752 | `=BINOMDIST(69,NTrials,PSuccess,1)` |
| 10 | At least 59 successes | 0.6225 | `=1-BINOMDIST(58,NTrials,PSuccess,1)` |
| 11 | Greater than 65 successes | 0.1303 | `=1-BINOMDIST(65,NTrials,PSuccess,1)` |
| 12 | Between 55 and 65 successes (inclusive) | 0.7386 | `=BINOMDIST(65,NTrials,PSuccess,1)-BINOMDIST(54,NTrials,PSuccess,1)` |
| 14 | Exactly 40 failures | 0.0812 | `=BINOMDIST(40,NTrials,1-PSuccess,0)` |
| 15 | At least 35 failures | 0.8697 | `=1-BINOMDIST(34,NTrials,1-PSuccess,1)` |
| 16 | Less than 42 failures | 0.6225 | `=BINOMDIST(41,NTrials,1-PSuccess,1)` |

Finding the 95th percentile (trial and error):

| Row | Trial value (column A) | CumProb (column B) | Formula |
|---|---|---|---|
| 20 | 65 | 0.8697 | `=BINOMDIST(A20,NTrials,PSuccess,1)` (copy down) |
| 21 | 66 | 0.9087 | |
| 22 | 67 | 0.9385 | |
| 23 | 68 | 0.9602 | |
| 24 | 69 | 0.9752 | |
| 25 | 70 | 0.9852 | |
| 27 | 68 | 0.95 | Formula in cell A27: `=CRITBINOM(NTrials,PSuccess,B27)` |

With this in mind, the probabilities requested in (a) through (f) become:
a. $P(X = 58)$
b. $P(X \le 65)$
c. $P(X < 70) = P(X \le 69)$
d. $P(X \ge 59) = 1 - P(X < 59) = 1 - P(X \le 58)$
e. $P(X > 65) = 1 - P(X \le 65)$
f. $P(55 \le X \le 65) = P(X \le 65) - P(X < 55) = P(X \le 65) - P(X \le 54)$

Note how we have manipulated each of these so that it includes only terms of the form $P(X = k)$ or $P(X \le k)$ for suitable values of $k$. These are the types of probabilities that can be handled directly by the BINOMDIST function. The answers appear in the range B7:B12, and the corresponding formulas are shown (as labels) in column D. (The Excel 2010 functions do not appear in this figure, but they are included in the file.)

The probabilities requested in (g) through (i) involve failures rather than successes. But because each trial results in either a success or a failure, the number of failures is also binomially distributed, with parameters $n$ and $1 - p = 0.4$. So in rows 14 through 16 the requested probabilities are calculated in exactly the same way, except that 1-PSuccess is substituted for PSuccess in the third argument of the BINOMDIST function.

Finally, to calculate the 95th percentile of the distribution of $X$, you can proceed by trial and error. For each value $k$ from 65 to 70, the probability $P(X \le k)$ is calculated in column B with the BINOMDIST function. Note that there is no value $k$ such that $P(X \le k) = 0.95$ exactly. Specifically, $P(X \le 67)$ is slightly less than 0.95 and $P(X \le 68)$ is slightly greater than 0.95. Therefore, the meaning of the "95th percentile" is somewhat ambiguous. If you want the largest value $k$ such that $P(X \le k) \le 0.95$, then this $k$ is 67. If instead you want the smallest value $k$ such that $P(X \le k) \ge 0.95$, then this value is 68. The latter interpretation is the one usually accepted for binomial percentiles.

In fact, Excel has another built-in function, CRITBINOM, for finding this value of $k$. This function is illustrated in row 27 of Figure 5.18. Now you enter the requested probability, 0.95, in cell B27 and the formula `=CRITBINOM(NTrials,PSuccess,B27)` in cell A27. It returns 68, the smallest value $k$ such that $P(X \le k) \ge 0.95$ for this binomial distribution.

5.4.1 Mean and Standard Deviation of the Binomial Distribution

It can be shown that the mean and standard deviation of a binomial distribution with parameters $n$ and $p$ are given by the following equations.

$$E(X) = np \tag{5.3}$$

$$\text{Stdev}(X) = \sqrt{np(1-p)} \tag{5.4}$$

The formula for the mean is quite intuitive. For example, if you observe 100 trials, each with probability of success 0.6, your best guess for the number of successes is clearly $100(0.6) = 60$. The standard deviation is less obvious but still very useful. It indicates how far the actual number of successes is likely to deviate from the mean. In this case the standard deviation is $\sqrt{100(0.6)(0.4)} = 4.90$.

Fortunately, the empirical rules discussed in Chapter 2 also apply, at least approximately, to the binomial distribution. That is, there is about a 95% chance that the actual number of successes will be within two standard deviations of the mean, and there is almost no chance that the number of successes will be more than three standard deviations from the mean. So for this example, it is very likely that the number of successes will be in the range of approximately 50 to 70, and it is very unlikely that there
will be fewer than 45 or more than 75 successes.

This reasoning is extremely useful. It provides a rough estimate of the number of successes you are likely to observe. Suppose 1000 parts are sampled randomly from an assembly line and, based on historical performance, the percentage of parts with some type of defect is about 5%. Translated into a binomial model, each of the 1000 parts, independently of the others, has some type of defect with probability 0.05. Would it be surprising to see, say, 75 parts with a defect? The mean is $1000(0.05) = 50$ and the standard deviation is $\sqrt{1000(0.05)(0.95)} = 6.89$. Therefore, the number of parts with defects is 95% certain to be within $50 \pm 2(6.89)$, or approximately from 36 to 64. Because 75 is beyond three standard deviations from the mean (the three-standard-deviation limit is about $50 + 3(6.89) \approx 70.7$), it is highly unlikely that there would be 75 or more defective parts.

5.4.2 The Binomial Distribution in the Context of Sampling

We now discuss how the binomial distribution applies to sampling from a population with two types of members. Let's say these two types are men and women, although in applications they might be Democrats and Republicans, users of our product and nonusers, and so on. We assume that the population has $N$ members, of whom $N_M$ are men and $N_W$ are women (where $N_M + N_W = N$). If you sample $n$ of these randomly, you are typically interested in the composition of the sample. You might expect the number of men in the sample to be binomially distributed with parameters $n$ and $p = N_M/N$, the fraction of men in the population. However, this depends on how the sampling is performed.

If sampling is done without replacement, each member of the population can be sampled only once. That is, once a person is sampled, his or her name is struck from the list and cannot be sampled again. If sampling is done with replacement, then it is possible, although maybe not likely, to select a given member of the population more than once. Most real-world sampling is performed without replacement. There is no point in obtaining information from the same person more than once. However, the binomial model applies only to sampling with replacement. Because the composition of the remaining population keeps changing as the sampling progresses, the binomial model provides only an approximation if sampling is done without replacement. If there is no replacement, the value of $p$, the proportion of men in this case, does not stay constant, a requirement of the binomial model. The appropriate distribution for sampling without replacement is called the hypergeometric distribution, a distribution we will not discuss here [footnote 6].

If $n$ is small relative to $N$, however, the binomial distribution is a very good approximation to the hypergeometric distribution and can be used even if sampling is performed without replacement. A rule of thumb is that if $n$ is no greater than 10% of $N$, that is, no more than 10% of the population is sampled, then the binomial model can be used safely. Of course, most national polls sample considerably less than 10% of the population. In fact, they often sample only a few thousand people from the hundreds of millions in the entire population. The bottom line is that in most real-world sampling contexts, the binomial model is perfectly adequate.

5.4.3 The Normal Approximation to the Binomial

If you graph the binomial probabilities, you will see an interesting phenomenon: the graph begins to look symmetric and bell-shaped when $n$ is fairly large and $p$ is not too close to 0 or 1.
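One way to check this bell-shaped behavior numerically is sketched below in Python (the text itself works in Excel, so this is only an illustration). It computes exact binomial probabilities and compares them with a normal density that has the same mean $np$ and standard deviation $\sqrt{np(1-p)}$ as in Equations (5.3) and (5.4). The function names `binom_pmf`, `normal_pdf`, and `max_gap`, and the parameter values tried, are assumptions made for this sketch.

```python
from math import comb, exp, pi, sqrt


def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k) for n trials, success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation, evaluated at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def max_gap(n, p):
    """Largest absolute gap between the binomial probabilities and the
    normal density with mean n*p and standard deviation sqrt(n*p*(1-p))."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return max(abs(binom_pmf(k, n, p) - normal_pdf(k, mean, sd))
               for k in range(n + 1))


if __name__ == "__main__":
    n, p = 30, 0.4  # illustrative values, assumed for this sketch
    # Crude text histogram of the binomial probabilities
    for k in range(n + 1):
        prob = binom_pmf(k, n, p)
        print(f"{k:2d} {prob:.4f} " + "#" * round(prob * 200))
    # Compare a moderate p with a p close to 0
    print("max gap, p = 0.40:", round(max_gap(n, p), 4))
    print("max gap, p = 0.02:", round(max_gap(n, 0.02), 4))
```

In this sketch the gap is small when $p = 0.4$ and much larger when $p = 0.02$, which is consistent with the condition that $p$ not be too close to 0 or 1.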
An example is illustrated in Figure 5.19 with the parameters $n = 30$ and $p = 0.4$. Generally, if $np > 5$ and $n(1-p) > 5$, the binomial distribution can be approximated well by a normal distribution with mean $np$ and standard deviation $\sqrt{np(1-p)}$.

One practical consequence of the normal approximation to the binomial is that the empirical rules can be applied. That is, when the binomial distribution is approximately symmetric and bell-shaped, there is about a 68% chance that the number of successes will be within one standard deviation of the mean. Similarly, there is about a 95% chance that the number of successes will be within two standard deviations of the mean, and the number of successes will almost surely be within three standard deviations of the mean. Here, the mean is $np$ and the standard deviation is $\sqrt{np(1-p)}$.

Figure 5.19 Bell-shaped Binomial Distribution (chart of binomial probabilities, vertical axis from 0 to 0.16, plotted against Value from 0 to 30)

Footnote 6: Excel has a function HYPGEOMDIST for sampling without replacement that works much like the BINOMDIST function. You can look it up under the Statistical category of Excel functions.

FUNDAMENTAL INSIGHT: Relationship Between Normal and Binomial Distributions. If you look at a graph of a binomial distribution when $n$ is fairly large and $p$ is not too close to 0 or 1, you will see that the distribution is bell-shaped. This is no accident. It can be proven mathematically that the normal distribution provides a very good approximation to the binomial under these conditions ($n$ large, $p$ not too close to 0 or 1). One implication is that the empirical rules from Chapter 2 apply very well to binomial distributions, using the mean and standard deviation in Equations (5.3) and (5.4). For example, there is about a 95% chance that the number of successes will be within two standard deviations of the mean.

Technical Tip: Continuity Correction. Because the normal distribution is continuous and the binomial distribution is discrete, the normal approximation to the binomial can be improved slightly with a continuity correction. If you want to approximate a binomial probability such as $P(36 \le X \le 45)$, expand the interval by 0.5 on each end in the normal approximation. That is, approximate with the normal probability $P(35.5 \le X \le 45.5)$. Similarly, approximate binomial $P(X \le 45)$ with normal $P(X \le 45.5)$, or binomial $P(X \ge 36)$ with normal $P(X \ge 35.5)$.

5.5 APPLICATIONS OF THE BINOMIAL DISTRIBUTION

The binomial distribution finds many applications in the business world and elsewhere. We discuss a few typical applications in this section.

EXAMPLE 5.8 IS THIS MUTUAL FUND REALLY A WINNER?

An investment broker at the Michaels & Dodson Company claims that he has found a real winner. He has tracked a mutual fund that has beaten a standard market index in 37 of the past 52 weeks. Could this be due to chance, or has he really found a winner?

Objective: To determine the probability of a mutual fund outperforming a standard market index at least 37 out of 52 weeks.

Solution: The broker is no doubt tracking a lot of mutual funds, and he is probably reporting only the best of these. Therefore, we will check whether the best of many mutual funds could do at least this well purely by chance. To do this, we first specify what we mean by "purely by chance." This means that each week, a given fund has a fifty-fifty chance of beating the market index, independently of
performance in other weeks. In other words, the number of weeks where a given fund outperforms the market index is binomially distributed with $n = 52$ and $p = 0.5$. With this in mind, cell B6 of Figure 5.20 shows the probability that a given fund does at least as well (beats the market index at least 37 out of 52 weeks) as the reported fund. (See the Beating the Market.xlsx file.) Because $P(X \ge 37) = 1 - P(X \le 36)$, the relevant formula is

`=1-BINOMDIST(B3-1,B4,0.5,1)`

Obviously, this probability, 0.00159, is quite small. A single fund isn't likely to beat the market this often purely by chance.

Figure 5.20 Binomial Calculations for Investment Example

| Cell | Item | Value | Formula |
|---|---|---|---|
| B3 | Weeks beating market index | 37 | |
| B4 | Total number of weeks | 52 | |
| B6 | Probability of doing at least this well by chance | 0.00159 | `=1-BINOMDIST(B3-1,B4,0.5,1)` |
| B8 | Number of mutual funds | 400 | |
| B9 | Probability of at least one doing at least this well | 0.471 | `=1-BINOMDIST(0,B8,B6,1)` |

Two-way data table of the probability in B9 as a function of values in B3 and B8 (columns: number of weeks beating the market index; rows: number of mutual funds):

| Number of mutual funds | 36 | 37 | 38 | 39 | 40 |
|---|---|---|---|---|---|
| 200 | 0.542 | 0.273 | 0.113 | 0.040 | 0.013 |
| 300 | 0.690 | 0.380 | 0.164 | 0.060 | 0.019 |
| 400 | 0.790 | 0.471 | 0.213 | 0.079 | 0.025 |
| 500 | 0.858 | 0.549 | 0.258 | 0.097 | 0.031 |
| 600 | 0.904 | 0.616 | 0.301 | 0.116 | 0.038 |

However, the probability that the best of many mutual funds does at least this well is much larger. To calculate this probability, assume that 400 funds are being tracked, and let $Y$ be the number of these that beat the market at least 37 of 52 weeks. Then $Y$ is
also binomially distributed, with parameters $n = 400$ and $p = 0.00159$, the probability calculated previously. To see whether any of the 400 funds beats the market at least 37 of 52 weeks, calculate $P(Y \ge 1) = 1 - P(Y = 0)$ in cell B9 with the formula

`=1-BINOMDIST(0,B8,B6,1)`

(Can you see why the fourth argument could be 0 or 1?) The resulting probability is nearly 0.5; that is, there is nearly a fifty-fifty chance that at least one of 400 funds will do as well as the reported fund. This certainly casts doubt on the broker's claim that he has found a real winner. Perhaps his star fund just got lucky and will perform no better than average in succeeding weeks.

To see how the probability in cell B9 depends on the level of success of the reported fund (the value in cell B3) and the number of mutual funds being tracked (in cell B8), you can create a two-way data table in the range B13:G18. (The formula in cell B13 is `=B9`, the row input cell is B3, and the column input cell is B8.) As you saw, beating the market 37 times out of 52 is no big deal with 400 funds, but beating it 40 times out of 52, even with 600 funds, is something worth reporting. The probability of this happening purely by chance is only 0.038, or less than 1 out of 25.

The next example requires a normal calculation to find a probability $p$, which is then used in a binomial calculation.

EXAMPLE 5.9 ANALYZING DAILY SALES AT A SUPERMARKET

Customers at a supermarket spend varying amounts. Historical data show that the amount spent per customer is normally distributed with mean \$85 and standard deviation \$30. If 500 customers shop in a given day, calculate the mean and standard deviation of the number who spend at least \$100. Then calculate the probability that at least 30% of all customers spend at least \$100.

Objective: To use the normal and binomial distributions to calculate the typical number of customers who spend at least \$100 per day and the probability that at least 30% of all 500 daily customers spend at least \$100.

Solution: Both questions involve the number of customers who spend at least \$100. Because the amounts spent are normally distributed, the probability that a typical customer spends at least \$100 is found with the NORMDIST function. This probability, 0.309, appears in cell B7 of Figure 5.21. (See the file Supermarket Spending.xlsx.) It is calculated with the formula

`=1-NORMDIST(100,B4,B5,1)`

This probability is then used as the parameter $p$ in a binomial model. The mean and standard deviation of the number who spend at least \$100 are calculated in cells B13 and B14 as $np$ and $\sqrt{np(1-p)}$, using $n = 500$, the number of shoppers, and $p = 0.309$. The expected number who spend at least \$100 is slightly greater than 154, and the standard deviation of this number is slightly greater than 10.

Figure 5.21 Calculations for Supermarket Example

| Cell | Item | Value | Formula |
|---|---|---|---|
| B4 | Amount spent per customer (normally distributed): Mean | 85 | |
| B5 | Amount spent per customer: StDev | 30 | |
| B7 | Probability that a customer spends at least 100 | 0.309 | `=1-NORMDIST(100,B4,B5,1)` |
| B10 | Number of customers | 500 | |
| B13 | Number who spend at least 100: Mean | 154.27 | `=B10*B7` |
| B14 | Number who spend at least 100: StDev | 10.33 | `=SQRT(B10*B7*(1-B7))` |
| B16 | Probability at least 30% spend at least 100 | 0.676 | `=1-BINOMDIST(0.3*B10-1,B10,B7,1)` |

To answer the second question, note that 30% of 500 customers is 150 customers. Then the probability that at least 30% of the customers spend at least \$100 is the probability that a binomially distributed random variable, with $n = 500$ and $p = 0.309$, is at least 150. This binomial probability, which turns out to be about 2/3, is calculated in cell B16 with the formula

`=1-BINOMDIST(0.3*B10-1,B10,B7,1)`

Note that the first
argument calculates to 149. This is because the probability of at least 150 customers is one minus the probability of less than or equal to 149 customers.

EXAMPLE 5.10 OVERBOOKING BY AIRLINES

This example presents a simplified version of calculations used by airlines when they overbook flights. They realize that a certain percentage of ticketed passengers will cancel at the last minute. Therefore, to avoid empty seats, they sell more tickets than there are seats, hoping that just about the right number of passengers show up. We assume that the no-show rate is 10%. In binomial terms, we assume that each ticketed passenger, independently of the others, shows up with probability 0.90 and cancels with probability 0.10.

For a flight with 200 seats, the airline wants to see how sensitive various probabilities are to the number of tickets it issues. In particular, it wants to calculate (a) the probability that more than 205 passengers show up, (b) the probability that more than 200 passengers show up, (c) the probability that at least 195 seats are filled, and (d) the probability that at least 190 seats are filled. The first two of these are "bad" events from the airline's perspective; they mean that some customers will be bumped from the flight. The last two events are "good" in the sense that the airline wants most of the seats to be occupied.

Objective: To assess the benefits and drawbacks of airline overbooking.

Solution: To solve the airline's problem, we use the BINOMDIST function and a data table. The solution
appears in Figure 5.22. (See the file Airline Overbooking.xlsx.) For any number of tickets issued in cell B6, the required probabilities are calculated in row 10. For example, the formulas in cells B10 and D10 are

`=1-BINOMDIST(205,NTickets,1-PNoShow,1)`

and

`=1-BINOMDIST(194,NTickets,1-PNoShow,1)`

Note that the condition "more than" requires a slightly different calculation from "at least." The probability of more than 205 is one minus the probability of less than or equal to 205, whereas the probability of at least 195 is one minus the probability of less than or equal to 194. Also, note that a passenger who shows up is called a success. Therefore, the third argument of each BINOMDIST function is one minus the no-show probability.

To see how sensitive these probabilities are to the number of tickets issued, we create a one-way data table at the bottom of the spreadsheet. It is one-way because there is only one input, the number of tickets issued, even though four output probabilities are tabulated. To create the data table, list several possible numbers of tickets issued along the side in column A and create links to the probabilities in row 10 in row 14. That is, enter the formula `=B10` in cell B14 and copy it across row 14. Then form a data table using the range A14:E24, no row input cell, and column input cell B6.

The results are as expected. As the airline issues more tickets, there is a larger chance of having to bump passengers from the flight, but there is also a larger chance of filling most seats. In reality, the airline has to make a trade-off between these two, taking its various costs and revenues into account.

The following is another simplified example of a real problem that occurs every time you watch election returns on TV. This problem is of particular interest in light of the highly unusual events that took place during election night television coverage of the U.S. presidential election in 2000, where the networks declared Al Gore an early winner in at least one state that he eventually lost. The basic
question is how soon the networks can declare one of the candidates the winner, based on early voting returns. Our example is somewhat unrealistic because it ignores the possibility that early tabulations can be biased one way or the other. For example, the earliest reporting precincts might be known to be more heavily in favor of the Democrat than the population in general. Nevertheless, the example indicates why the networks are able to make early conclusions based on such seemingly small amounts of data.

Figure 5.22 Binomial Calculations for Overbooking Example

| Cell | Input | Value | Range name |
|---|---|---|---|
| B3 | Number of seats | 200 | |
| B4 | Probability of no-show | 0.1 | PNoShow (`=Overbooking!$B$4`) |
| B6 | Number of tickets issued | 215 | NTickets (`=Overbooking!$B$6`) |

Required probabilities (row 10) and data table showing sensitivity of the probabilities to the number of tickets issued (rows 14 to 24):

| Number of tickets issued | More than 205 show up | More than 200 show up | At least 195 seats filled | At least 190 seats filled |
|---|---|---|---|---|
| 215 (row 10, repeated in row 14) | 0.001 | 0.050 | 0.421 | 0.820 |
| 206 | 0.000 | 0.000 | 0.012 | 0.171 |
| 209 | 0.000 | 0.001 | 0.064 | 0.384 |
| 212 | 0.000 | 0.009 | 0.201 | 0.628 |
| 215 | 0.001 | 0.050 | 0.421 | 0.820 |
| 218 | 0.013 | 0.166 | 0.659 | 0.931 |
| 221 | 0.064 | 0.370 | 0.839 | 0.978 |
| 224 | 0.194 | 0.607 | 0.939 | 0.995 |
| 227 | 0.406 | 0.802 | 0.981 | 0.999 |
| 230 | 0.639 | 0.920 | 0.995 | 1.000 |
| 233 | 0.822 | 0.974 | 0.999 | 1.000 |

EXAMPLE 5.11 PROJECTING ELECTION WINNERS FROM EARLY RETURNS

We assume that there are $N$ voters in the population, of whom
$N_R$ will vote for the Republican and $N_D$ will vote for the Democrat. The eventual winner will be the Republican if $N_R > N_D$ and will be the Democrat otherwise, but we won't know which until all of the votes are tabulated. (To simplify the example, we assume there are only two candidates and that the election will not end in a tie.) Let's suppose that a small percentage of the votes have been counted and the Republican is currently ahead 540 to 460. On what basis can the networks declare the Republican the winner, especially if there are millions of voters in the population?

Objective: To use a binomial model to determine whether early returns reflect the eventual winner of an election between two candidates.

Solution: Let $n = 1000$ be the total number of votes that have been tabulated. If $X$ is the number of Republican votes so far, we are given that $X = 540$. Now we pose the following question. If the Democrat were going to be the eventual winner, that is, $N_D > N_R$, and we randomly sampled 1000 voters from the population, how likely is it that at least 540 of these voters would be in favor of the Republican? If this is very unlikely, then the only reasonable conclusion is that the Democrat will not be the eventual winner. This is the reasoning the networks might use to declare the Republican the winner so early in the tabulation.

We use a binomial model to see how unlikely the event "at least 540 out of 1000" is, assuming that the Democrat will be the eventual winner. We need a value for $p$, the
probability that a typical vote is for the Republican. This probability should be the proportion of voters in the entire population who favor the Republican. All we know is that this probability is less than 0.5, because we have assumed that the Democrat will eventually win. In Figure 5.23 we show how the probability of at least 540 out of 1000 varies with values of p less than, but close to, 0.5. (See the file Election Returns.xlsx.)

Figure 5.23 Binomial Calculations for Voting Example

Election returns

| Input or output | Value |
|---|---|
| Population proportion for Republican | 0.49 |
| Votes tabulated so far | 1000 |
| Votes for Republican so far | 540 |
| Binomial probability of at least this many votes for Republican | 0.0009, from =1-BINOMDIST(B6-1,B5,B3,1) |

Data table showing sensitivity of this probability to population proportion for Republican

| Population proportion for Republican | Probability |
|---|---|
| 0.490 | 0.0009 |
| 0.492 | 0.0013 |
| 0.494 | 0.0020 |
| 0.496 | 0.0030 |
| 0.498 | 0.0043 |
| 0.499 | 0.0052 |

We enter a trial value of 0.49 for p in cell B3 and then calculate the required probability in cell B9 with the formula

=1-BINOMDIST(B6-1,B5,B3,1)

Then we use this to create the data table at the bottom of the spreadsheet. This data table tabulates the probability of the given lead (at least 540 out of 1000) for various values of p less than 0.5. As shown in the last few rows, even if the eventual outcome were going to be a virtual tie, with the Democrat slightly ahead, there would still be very little chance of the Republican being at least 80 votes ahead so far. But because the Republican is currently ahead by 80 votes, the networks feel safe in declaring the Republican the winner. Admittedly, the probability model they use is more complex than our simple binomial model, but the idea is the same. ■

The final example in this section challenges the two assumptions of the binomial model. So far, we have assumed that the outcomes of successive trials have the same probability p of success and are probabilistically independent. There are many situations where either or both of these assumptions are questionable. For example, consider successive items from a production line, where each item either meets specifications (a success) or doesn't (a failure). If the process deteriorates over time, at least until it receives maintenance, the probability p of success will slowly decrease. But even if p remains constant, defective items could come in bunches (because of momentary inattentiveness on the part of a worker, say), which would invalidate the independence assumption.

If you believe that the binomial assumptions are invalid, then you must specify an alternative model that reflects reality more closely. This is not easy; all kinds of nonbinomial assumptions can be imagined. Furthermore, when you make such assumptions, there are probably no simple formulas to use, such as the BINOMDIST formulas we have been using. Simulation might be the only (simple) alternative, as illustrated in the following example.

EXAMPLE 5.12 STREAK SHOOTING IN BASKETBALL

Do basketball players shoot in streaks? This question has been debated by thousands of basketball fans, and it has been studied statistically by academic researchers. Most fans believe the answer is yes, arguing that players clearly alternate between hot streaks where they can't miss and cold streaks where they can't hit the broad side of a barn. This situation does not fit a binomial model where, say, a "450 shooter" has a 0.450 probability of making each shot and a 0.550 probability of missing, independently of other shots. If the binomial model does not apply, what
model is appropriate, and how could it be used to calculate a typical probability, such as the probability of making at least 13 shots out of 25 attempts?⁷

⁷ There are obviously a lot of extenuating circumstances surrounding any shot: the type of shot (layup versus jump shot), the type of defense, the score, the time left in the game, and so on. For this example we focus on a pure jump shooter who is unaffected by the various circumstances in the game.

Objective To formulate a nonbinomial model of basketball shooting and to use it to find the probability of a "450 shooter" making at least 13 out of 25 shots.

Solution

This example is quite open-ended. There are numerous alternatives to the binomial model that could capture the "streakiness" most fans believe in, and the one we suggest here is by no means the only possibility. We challenge you to develop others.

The model we propose assumes that this shooter makes 45% of his shots in the long run. The probability that he makes his first shot in a game is 0.45. In general, consider his nth shot. If he has made his last k shots, we assume the probability of making shot n is 0.45 + k·d1. On the other hand, if he has missed his last k shots, we assume the probability of making shot n is 0.45 − k·d2. Here d1 and d2 are small values (0.01 and 0.02, for example) that indicate how much the shooter's probability of success increases or decreases depending on his current streak. The model implies that the shooter gets better the more shots he makes and worse the more
he misses.

Figure 5.24 Simulation of Basketball Shooting Model

Basketball shooting simulation

| Input or output | Cell | Value |
|---|---|---|
| Long-run average | B3 | 0.45 |
| Increment d1 after a make | B4 | 0.015 |
| Increment d2 after a miss | B5 | 0.015 |
| Number of shots | B7 | 25 |
| Binomial probability of at least 13 out of 25 | B9 | 0.306 |
| Number of makes (summary of simulation below) | B12 | 14 |
| At least 13 makes (summary of simulation below) | B13 | 1 |
| Fraction of reps with at least 13, from data table below (compare with B9) | F12 | 0.272 |

Simulation of makes and misses using nonbinomial model (rows 17 to 41, many rows hidden)

| Shot | Streak | P(make) | Make |
|---|---|---|---|
| 1 | NA | 0.45 | 0 |
| 2 | -1 | 0.435 | 0 |
| 3 | -2 | 0.42 | 0 |
| 4 | -3 | 0.405 | 1 |
| 5 | 1 | 0.465 | 1 |
| ... | ... | ... | ... |
| 21 | -1 | 0.435 | 0 |
| 22 | -2 | 0.42 | 1 |
| 23 | 1 | 0.465 | 0 |
| 24 | -1 | 0.435 | 0 |
| 25 | -2 | 0.42 | 1 |

Data table to replicate 25 shots many times (rows 17 to 267, many rows hidden)

| Rep | At least 13 |
|---|---|
| (link to B13) | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 20 | 0 |
| 21 | 0 |
| 22 | 1 |
| 23 | 1 |
| 24 | 0 |
| 25 | 1 |
| 26 | 0 |
| ... | ... |
| 248 | 0 |
| 249 | 0 |
| 250 | 1 |

To implement this model, we use simulation, as shown in Figure 5.24 (with many hidden rows). (See the file Basketball Simulation.xlsx.) Actually, we first do a baseline binomial calculation in cell B9, using the parameters n = 25 and p = 0.450. The formula in cell B9 is

=1-BINOMDIST(12,B7,B3,1)

If the player makes each shot with probability 0.45, independently of the other shots, then the probability that he will make over half of his 25 shots is 0.306, about a 30% chance. (Remember that this is a binomial calculation for a situation where the binomial distribution probably does not apply.) The simulation in the range A17:D41 shows the results of 25 random shots according to the nonbinomial model we have
assumed. Column B indicates the length of the current streak, where a negative value indicates a streak of misses and a positive value indicates a streak of makes. Column C indicates the probability of a make on the current shot, and column D contains 1s for makes and 0s for misses. Here are step-by-step instructions for developing this range.

1. First shot. Enter the formulas

=B3

and

=IF(RAND()<C17,1,0)

in cells C17 and D17 to determine the outcome of the first shot.

2. Second shot. Enter the formulas

=IF(D17=0,-1,1)

=IF(B18<0,$B$3+B18*$B$5,$B$3+B18*$B$4)

and

=IF(RAND()<C18,1,0)

in cells B18, C18, and D18. The first of these indicates that by the second shot, the shooter will have a streak of one make or one miss. The second formula is the important one. It indicates how the probability of a make changes depending on the current streak. The third formula simulates a make or a miss, using the probability in cell C18.

3. Length of streak on third shot. Enter the formula

=IF(AND(B18<0,D18=0),B18-1,IF(AND(B18<0,D18=1),1,IF(AND(B18>0,D18=0),-1,B18+1)))

in cell B19 and copy it down column B. This nested IF formula checks for all four combinations of the previous streak (negative or positive, indicated in cell B18) and the most recent shot (make or miss, indicated in cell D18) to see whether the current streak continues by 1 or a new streak starts.

4. Results of remaining shots. The logic for the formulas in columns C and D is the same for the remaining shots as for shot 2, so copy the formulas in cells C18 and D18 down their respective columns.

5. Summary of 25 shots. Enter the formulas

=SUM(D17:D41)

and

=IF(B12>=13,1,0)

in cells B12 and B13 to summarize the results of the 25 simulated shots. In particular, the value in cell B13 is 1 only if at least 13 of the shots are successes.

What about the probability of making at least 13 shots with this nonbinomial model? So far, we have simulated one set of 25 shots and have reported whether at least 13 of the shots are successes. We need to replicate this simulation many times and report the fraction of the replications where at
least 13 of the shots are successes. We do this with a data table in columns F and G.

To create this table, enter the replication numbers 1 through 250 (you could use any number of replications) in column F. Then put a link to B13 in cell G17 by entering the formula =B13 in this cell. Essentially, we are recalculating this value 250 times, each with different random numbers. To do this, highlight the range F17:G267, and create a data table with no row input cell and any blank cell (such as F17) as the column input cell. This causes Excel to recalculate the basic simulation 250 times, each time with different random numbers. (This trick of using a blank column input cell will be discussed in more detail in Chapter 15.) Finally, enter the formula

=AVERAGE(G18:G267)

in cell F12 to calculate the fraction of the replications with at least 13 makes out of 25 shots.

After finishing all of this, note that the spreadsheet is "live" in the sense that if you press the F9 recalculation key, all of the simulated quantities change with new random numbers. In particular, the estimate in cell F12 of the probability of at least 13 makes out of 25 shots changes. It is sometimes less than the binomial probability in cell B9 and sometimes greater. In general, the two probabilities are roughly the same. The bottom line? Even if the world doesn't behave exactly as the binomial model indicates, probabilities of various events can often be approximated fairly well by binomial probabilities, which saves you the trouble of developing and working with more complex models. ■

PROBLEMS

Level A

19. In a typical month, an insurance agent presents life insurance plans to 40 potential customers. Historically, one in four such customers chooses to buy life insurance from this agent. Based on the relevant binomial distribution, answer the following questions:
a. What is the probability that exactly five customers will buy life insurance from this agent in the coming month?
b. What is the probability that no more than 10 customers will buy life insurance from this agent in the coming month?
c. What is the probability that at least 20 customers will buy life insurance from this agent in the coming month?
d. Determine the mean and standard deviation of the number of customers who will buy life insurance from this agent in the coming month.
e. What is the probability that the number of customers who buy life insurance from this agent in the coming month will lie within two standard deviations of the mean?
f. What is the probability that the number of customers who buy life insurance from this agent in the coming month will lie within three standard deviations of the mean?

20. Continuing the previous exercise, use the normal approximation to the binomial to answer each of the questions posed in parts a through f. How well does the normal approximation perform in this case? Explain.

21. Many vehicles used in space travel are constructed with redundant systems to protect flight crews and their valuable equipment. In other words, backup systems are included within many vehicle components so that if one or more systems fail, backup systems will assure the safe operation of the given component and thus the entire vehicle. For example, consider one particular component of the U.S. space shuttle that has n duplicated systems (i.e., one original system and n − 1 backup systems). Each of these systems functions, independently of the others, with probability 0.98. This shuttle component functions successfully provided that at least one of the n systems functions properly.
a. Find the probability that this shuttle component functions successfully if n = 2.
b. Find the probability that this shuttle component functions successfully if n = 4.
c. What is the minimum number n of duplicated systems that must be incorporated into this shuttle component to ensure at least a 0.9999 probability of successful operation?

22. Suppose that a popular hotel for vacationers in Orlando, Florida, has a total of 300 identical rooms. As many major airline companies do, this hotel has adopted an overbooking policy in an effort to maximize the usage of its available lodging capacity. Assume that each potential hotel customer holding a room reservation, independently of other customers, cancels the reservation or simply does not show up at the hotel on a given night with probability 0.15.
a. Find the largest number of room reservations that this hotel can book and still be at least 95% sure that everyone who shows up at the hotel will have a room on a given night.
b. Given that the hotel books the number of reservations found in part a, find the probability that at least 90% of the available rooms will be occupied on a given night.
c. Given that the hotel books the number of reservations found in part a, find the probability that at most 80% of the available rooms will be occupied on a given night.
d. How does your answer to part a change as the required assurance rate increases from 95% to 97%? How does your answer to part a change as the required assurance rate increases from 95% to 99%?
e.
How does your answer to part a change as the cancellation rate varies between 5% and 25% (in increments of 5%)? Assume now that the required assurance rate remains at 95%.

23. A production process manufactures items with weights that are normally distributed with mean 15 pounds and standard deviation 0.1 pound. An item is considered to be defective if its weight is less than 14.8 pounds or greater than 15.2 pounds. Suppose that these items are currently produced in batches of 1000 units.
a. Find the probability that at most 5% of the items in a given batch will be defective.
b. Find the probability that at least 90% of the items in a given batch will be acceptable.
c. How many items would have to be produced in a batch to guarantee that a batch consists of no more than 1% defective items?

24. Past experience indicates that 30% of all individuals entering a certain store decide to make a purchase. Using (a) the binomial distribution and (b) the normal approximation to the binomial, find the probability that 10 or more of the 30 individuals entering the store in a given hour will decide to make a purchase. Compare the results obtained using the two different approaches. Under what conditions will the normal approximation to this binomial probability become even more accurate?

25. Suppose that the number of ounces of soda put into a soft-drink can is normally distributed with μ = 12.05 ounces and σ = 0.03 ounce.
a. Legally, a can must contain at least 12 ounces of soda. What fraction of cans will contain at least 12 ounces of soda?
b. What fraction of cans will contain less than 11.9 ounces of soda?
c. What fraction of cans will contain between 12 and 12.08 ounces of soda?
d. One percent of all cans will weigh more than what value?
e. Ten percent of all cans will weigh less than what value?
f. The soft-drink company controls the mean weight in a can by setting a timer. For what mean should the timer be set so that only 1 in 1000 cans will be underweight?
g. Every day the company produces 10,000 cans. The government
inspects 10 randomly chosen cans each day. If at least two are underweight, the company is fined $10,000. Given that μ = 12.05 ounces and σ = 0.03 ounce, what is the probability that the company will be fined on a given day?

26. Suppose that 53% of all registered voters prefer Barack Obama to John McCain. (You can substitute the names of the current presidential candidates if you like.)
a. In a random sample of 100 voters, what is the probability that the sample will indicate that Obama will win the election (that is, there will be more votes in the sample for Obama)?
b. In a random sample of 100 voters, what is the probability that the sample will indicate that McCain will win the election?
c. In a random sample of 100 voters, what is the probability that the sample will indicate a dead heat (fifty-fifty)?
d. In a random sample of 100 voters, what is the probability that between 40 and 60 (inclusive) voters will prefer Obama?

27. Assume that, on average, 95% of all ticket holders show up for a flight. If a plane seats 200 people, how many tickets should be sold to make the chance of an overbooked flight as close as possible to 5%?

28. Suppose that 55% of all people prefer Coke to Pepsi. We randomly choose 500 people and ask them if they prefer Coke to Pepsi. What is the probability that our survey will (erroneously) indicate that Pepsi is preferred by more people than Coke? Does this probability increase or decrease as we take larger and larger samples? Why?

29. A firm's office contains 150 PCs. The probability that a given PC will not work on a given day is 0.05.
a. On a given day, what is the probability that exactly one computer will not be working?
b. On a given day, what is the probability that at least two computers will not be working?
c. What assumptions do your answers in parts a and b require? Do you think they are reasonable? Explain.

30. Suppose that 4% of all tax returns are audited. In a group of n tax returns, consider the probability that at most two returns are audited. How large must n be before this probability is less than 0.01?

31. Suppose
that the height of a typical American female is normally distributed with μ = 64 inches and σ = 4 inches. We observe the height of 500 American females.
a. What is the probability that fewer than 35 of the 500 women will be less than 58 inches tall?
b. Let X be the number of the 500 women who are less than 58 inches tall. Find the mean and standard deviation of X.

32. Consider a large population of shoppers, each of whom spends a certain amount during his or her current shopping trip; the distribution of these amounts is normally distributed with mean $55 and standard deviation $15. We randomly choose 25 of these shoppers. What is the probability that at least 15 of them spend between $45 and $75?

Level B

33. Many firms utilize sampling plans to control the quality of manufactured items ready for shipment. To illustrate the use of a sampling plan, suppose that a particular company produces and ships electronic computer chips in lots, each lot consisting of 1000 chips. This company's sampling plan specifies that quality control personnel should randomly sample 50 chips from each lot and accept the lot for shipping if the number of defective chips is four or fewer. The lot will be rejected if the number of defective chips is five or more.
a. Find the probability of accepting a lot as a function of the actual fraction of defective chips. In particular, let the actual fraction of defective chips in a given lot equal any of 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18. Then compute the lot acceptance
probability for each of these lot defective fractions.
b. Construct a graph showing the probability of lot acceptance for each of the lot defective fractions, and interpret your graph.
c. Repeat parts a and b under a revised sampling plan that calls for accepting a given lot if the number of defective chips found in the random sample of 50 chips is five or fewer. Summarize any notable differences between the two graphs.

34. Suppose you play a game at a casino where your probability of winning each game is 0.49. On each game, you bet $10, which you either win or lose. Let P(n) be the probability that you are ahead by at least $50 after n games. Graph this probability versus n for n equal to multiples of 50 up to 1000. Discuss the behavior of this function and why it behaves as it does.

35. Comdell Computer receives computer chips from Chipco. Each batch sent by Chipco is inspected as follows: 35 chips are tested, and the batch passes inspection if at most one defective chip is found in the set of 35 tested chips. Past history indicates an average of 1% of all chips produced by Chipco are defective. Comdell has received 10 batches this week. What is the probability that at least nine of the batches will pass inspection?

36. A standardized test consists entirely of multiple-choice questions, each with five possible choices. You want to ensure that a student who randomly guesses on each question will obtain an expected score of zero. How can you accomplish this?

37. In the current tax year, suppose that 5% of the millions of individual tax returns are fraudulent. That is, they contain errors that were purposely made to cheat the government.
a. Although these errors are often well concealed, let's suppose that a thorough IRS audit will uncover them. If a random 250 tax returns are audited, what is the probability that the IRS will uncover at least 15 fraudulent returns?
b. Answer the same question as in part a, but this time assume there is only a 90% chance that a given fraudulent return will be spotted as such if it is audited.
38. Suppose you work for a survey research company. In a typical survey, you mail questionnaires to 150 companies. Of course, some of these companies might decide not to respond. Assume that the nonresponse rate is 45%; that is, each company's probability of not responding, independently of the others, is 0.45.
a. If your company requires at least 90 responses for a valid survey, find the probability that it will get this many. Use a data table to see how your answer varies as a function of the nonresponse rate (for a reasonable range of response rates surrounding 45%).
b. Suppose your company does this survey in two "waves." It mails the 150 questionnaires and waits a certain period for the responses. As before, assume that the nonresponse rate is 45%. However, after this initial period, your company follows up (by telephone, say) on the nonrespondents, asking them to please respond. Suppose that the nonresponse rate on this second wave is 70%; that is, each original nonrespondent now responds with probability 0.3, independently of the others. Your company now wants to find the probability of obtaining at least 110 responses total. It turns out that this is a difficult probability to calculate directly. So instead, approximate it with simulation.

39. Suppose you are sampling from a large population, and you ask the respondents whether they believe men should be allowed to take paid paternity leave from their jobs when they have a new child. Each person you sample is equally likely to be male or female. The population proportion of females who believe males should be granted paid paternity leave is 56%, and the population
proportion of males who favor it is 48%. If you sample 200 people and count the number who believe males should be granted paternity leave, is this number binomially distributed? Explain why or why not. Would your answer change if you knew your sample was going to consist of exactly 100 males and 100 females?

40. A woman claims that she is a fortune-teller. Specifically, she claims that she can predict the direction of the change (up or down) in the Dow Jones Industrial Average for the next 10 days (such as U, U, D, U, D, U, U, D, D, D). (You can assume that she makes all 10 predictions right now, although that does not affect your answer to the question.) Obviously, you are skeptical, thinking that she is just guessing, so you would be surprised if her predictions are accurate. Which would surprise you more: (1) she predicts at least 8 out of 10 correctly, or (2) she predicts at least 6 out of 10 correctly on each of four separate occasions? Answer by assuming that (1) she is really guessing and (2) each day the Dow is equally likely to go up or down.

5.6 THE POISSON AND EXPONENTIAL DISTRIBUTIONS

The final two distributions in this chapter are called the Poisson and exponential distributions. In most statistical applications, including those in the rest of this book, these distributions play a much less important role than the normal and binomial distributions. For this reason, we will not analyze them in as much detail. However, in many applied management science models, the Poisson and exponential distributions are key distributions. For example, much of the study of probabilistic inventory models, queuing models, and reliability models relies heavily on these two distributions.

5.6.1 The Poisson Distribution

The Poisson distribution is a discrete distribution. It usually applies to the number of events occurring within a specified period of time or space. Its possible values are all of the nonnegative integers: 0, 1, 2, and so on;
there is no upper limit. Even though there is an infinite number of possible values, this causes no real problems because the probabilities of all sufficiently large values are essentially 0.

The Poisson distribution is characterized by a single parameter, usually labeled λ (Greek lambda), which must be positive. By adjusting the value of λ, we are able to produce different Poisson distributions, all of which have the same basic shape as in Figure 5.25. That is, they first increase and then decrease. It turns out that λ is easy to interpret. It is both the mean and the variance of the Poisson distribution. Therefore, the standard deviation is √λ.

Typical Examples of the Poisson Distribution

1. A bank manager is studying the arrival pattern to the bank. The events are customer arrivals, the number of arrivals in an hour is Poisson distributed, and λ represents the expected number of arrivals per hour.

2. An engineer is interested in the lifetime of a type of battery. A device that uses this type of battery is operated continuously. When the first battery fails, it is replaced by a second; when the second fails, it is replaced by a third, and so on. The events are battery failures, the number of failures that occur in a month is Poisson distributed, and λ represents the expected number of failures per month.

3. A retailer is interested in the number of customers who order a particular product in a week. Then the events are customer orders for the product, the number of customer orders in a week is Poisson distributed, and λ is the expected number of orders per week.

4. In a quality control setting, the Poisson distribution is often relevant for describing the number of defects in some unit of space. For example, when paint is applied to the body of a new car, any minor blemish is considered a defect. Then the number of defects on the hood, say, might be Poisson distributed. In this case, λ is the expected number of defects per hood.
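The statement that λ is both the mean and the variance can be checked numerically. The following is a minimal Python sketch, not part of the original text. It assumes the standard Poisson probability formula P(X = k) = e^(−λ) λ^k / k!, which this section does not state explicitly, and the helper names poisson_pmf and summarize are hypothetical. The sums stop at 100, which is adequate here because the probabilities of larger values are essentially 0 for the values of λ used.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with parameter lam (assumed formula)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def summarize(lam: float, k_max: int = 100) -> tuple[float, float]:
    """Approximate the mean and variance by summing over the values 0..k_max."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance


# Illustrative values of lambda; 5 and 17 are the ones used in this section.
for lam in (2.5, 5, 17):
    mean, variance = summarize(lam)
    print(
        f"lambda = {lam}: mean = {mean:.4f}, variance = {variance:.4f}, "
        f"standard deviation = {math.sqrt(variance):.4f}"
    )
```

For each λ, the computed mean and variance should both come out very close to λ, and the standard deviation close to √λ.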
The typical examples listed above are representative of the many situations where the Poisson distribution has been applied. The parameter λ is often called a rate: arrivals per hour, failures per month, and so on. If the unit of time is changed, the rate must be modified accordingly. For example, if the number of arrivals to a bank in a single hour is Poisson distributed with rate λ = 30, then the number of arrivals in a half-hour period is Poisson distributed with rate λ = 15.

We can use Excel to calculate Poisson probabilities much as we did with binomial probabilities. The relevant function is the POISSON function. It takes the form

=POISSON(k,λ,cum)

The third argument cum works exactly as in the binomial case. If it is 0, the function returns P(X = k); if it is 1, the function returns P(X ≤ k). As examples, if λ = 5, =POISSON(7,5,0) returns the probability of exactly 7, =POISSON(7,5,1) returns the probability of less than or equal to 7, and =1-POISSON(3,5,1) returns the probability of greater than 3.

CHANGES IN EXCEL 2010

The POISSON function has been replaced in Excel 2010 by POISSON.DIST. Either version can be used, and they work exactly the same way. (Both versions are shown in the file for the following example.) Curiously, there is still no POISSON.INV function.

The following example shows how a manager or consultant could use the Poisson distribution.

EXAMPLE 5.13 MANAGING TV INVENTORY AT KRIEGLAND

Kriegland is a department store that sells various brands of flat-screen TVs. One of the manager's biggest problems is to decide on an appropriate inventory policy for stocking TVs. He wants to have enough in stock so that customers
receive their requests right away, but he does not want to tie up too much money in inventory that sits on the storeroom floor.

Most of the difficulty results from the unpredictability of customer demand. If this demand were constant and known, the manager could decide on an appropriate inventory policy fairly easily. But the demand varies widely from month to month in a random manner. All the manager knows is that the historical average demand per month is approximately 17. Therefore, he decides to call in a consultant. The consultant immediately suggests using a probability model. Specifically, she attempts to find the probability distribution of demand in a typical month. How might she proceed?

Objective To model the probability distribution of monthly demand for flat-screen TVs with a particular Poisson distribution.

Solution

Let X be the demand in a typical month. The consultant knows that there are many possible values of X. For example, if historical records show that monthly demands have always been between 0 and 40, the consultant knows that almost all of the probability should be assigned to the values 0 through 40. However, she does not relish the thought of finding 41 probabilities, P(X = 0) through P(X = 40), that sum to 1 and reflect historical frequencies. Instead, she discovers from the manager that the histogram of demands from previous months is shaped much like the graph in Figure 5.25. That is, it rises to some peak and then falls.

Figure 5.25 Typical Poisson Distribution (chart title: Poisson Distribution, λ = 5)
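For readers working outside Excel, the rise-and-fall shape in Figure 5.25 and the three λ = 5 probabilities discussed with the POISSON function can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the book's method: the function name poisson and its cum flag are hypothetical stand-ins that only imitate the argument order of Excel's POISSON(k, λ, cum), and the probabilities come from the standard formula P(X = k) = e^(−λ) λ^k / k!.

```python
import math


def poisson(k: int, lam: float, cum: bool) -> float:
    """Hypothetical stand-in that imitates Excel's POISSON(k, lambda, cum).

    cum=False returns P(X = k); cum=True returns P(X <= k).
    """
    def pmf(j: int) -> float:
        # Standard Poisson probability of exactly j events (assumed formula).
        return math.exp(-lam) * lam**j / math.factorial(j)

    if cum:
        return sum(pmf(j) for j in range(k + 1))
    return pmf(k)


lam = 5  # the parameter value used for Figure 5.25

print("P(X = 7)  =", round(poisson(7, lam, False), 4))
print("P(X <= 7) =", round(poisson(7, lam, True), 4))
print("P(X > 3)  =", round(1 - poisson(3, lam, True), 4))

# Text-only bar chart of the individual probabilities: the bars lengthen
# up to a peak near lam and then shorten again.
for k in range(16):
    p = poisson(k, lam, False)
    print(f"{k:2d}  {p:.4f}  {'#' * round(p * 200)}")
```

Changing lam to another positive value, such as the historical average demand in this example, shifts the peak but leaves the basic shape unchanged.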
Knowing that a Poisson distribution has this same basic shape, the consultant decides to model the monthly demand with a Poisson distribution. To choose a particular Poisson distribution, all she has to do is choose a value of $\lambda$, the mean demand per month. Because the historical average is approximately 17, she chooses $\lambda = 17$. Now she can test the Poisson model by calculating probabilities of various events and asking the manager whether these probabilities are reasonable approximations to reality. For example, the Poisson probability that monthly demand is less than or equal to 20, $P(X \le 20)$, is 0.805 [using the Excel function POISSON(20,17,1)], and the probability that demand is between 10 and 15 inclusive, $P(10 \le X \le 15)$, is 0.345 [using POISSON(15,17,1)-POISSON(9,17,1)]. Figure 5.26 illustrates various probability calculations and shows the graph of the individual Poisson probabilities. (See the file Poisson Demand Distribution.xlsx.)

If the manager believes that these probabilities and other similar probabilities are reasonable, then the statistical part of the consultant's job is finished. Otherwise, she must try a different Poisson distribution (a different value of $\lambda$) or perhaps a different type of distribution altogether.

5.6.2 The Exponential Distribution

Suppose that a bank manager is studying the pattern of customer arrivals at her branch location. As indicated previously in this section, the number of arrivals in an hour at a facility such as a bank is often well described by a Poisson distribution with parameter $\lambda$, where $\lambda$ represents the expected number of arrivals per hour. An alternative way to view the uncertainty in the arrival process is to consider the times between customer arrivals. The most common probability distribution used to model these times, often called interarrival times, is the exponential distribution. In general, the continuous random variable
X has an exponential distribution with parameter $\lambda$ (with $\lambda > 0$) if the density function of X has the form $f(x) = \lambda e^{-\lambda x}$ for $x > 0$. This density function has the shape shown in Figure 5.27. Because this density function decreases continuously from left to right, its most likely value is $x = 0$. Alternatively, if you collect many observations from an exponential distribution and draw a histogram of the observed values, then you should expect it to resemble the smooth curve shown in Figure 5.27, with the tallest bars to the left. The mean and standard deviation of this distribution are easy to remember. They are both equal to the reciprocal of the parameter $\lambda$. For example, an exponential distribution with parameter $\lambda = 0.1$ has mean and standard deviation both equal to 10.

Figure 5.26: Poisson Calculations for TV Example

Poisson distribution for monthly demand. Mean monthly demand: 17 (cell B3, range name Mean).

Representative probability calculations:

| Event | Probability | Formula |
|---|---|---|
| Less than or equal to 20 | 0.805 | =POISSON(20,Mean,1) |
| Between 10 and 15 (inclusive) | 0.345 | =POISSON(15,Mean,1)-POISSON(9,Mean,1) |

Individual probabilities, each calculated with a formula of the form =POISSON(A11,MeanDem,0). The figure also charts these probabilities under the title "Poisson Distribution with λ = 17".

| Value | Prob | Value | Prob | Value | Prob |
|---|---|---|---|---|---|
| 0 | 0.000 | 14 | 0.080 | 28 | 0.004 |
| 1 | 0.000 | 15 | 0.091 | 29 | 0.002 |
| 2 | 0.000 | 16 | 0.096 | 30 | 0.001 |
| 3 | 0.000 | 17 | 0.096 | 31 | 0.001 |
| 4 | 0.000 | 18 | 0.091 | 32 | 0.000 |
| 5 | 0.000 | 19 | 0.081 | 33 | 0.000 |
| 6 | 0.001 | 20 | 0.069 | 34 | 0.000 |
| 7 | 0.003 | 21 | 0.056 | 35 | 0.000 |
| 8 | 0.007 | 22 | 0.043 | 36 | 0.000 |
| 9 | 0.014 | 23 | 0.032 | 37 | 0.000 |
| 10 | 0.023 | 24 | 0.023 | 38 | 0.000 |
| 11 | 0.036 | 25 | 0.015 | 39 | 0.000 |
| 12 | 0.050 | 26 | 0.010 | 40 | 0.000 |
| 13 | 0.066 | 27 | 0.006 | | |

As with the normal distribution, you usually want probabilities to the left or right of a given value. For any exponential distribution, the probability to the left of a given value $x > 0$ can be calculated with Excel's EXPONDIST function. This function takes the form

=EXPONDIST(x, λ, 1)

[Figure 5.27: Exponential Density Function]

For example, if $x = 0.5$ and $\lambda = 5$ (so that the mean equals $1/5 = 0.2$), the probability of being less than 0.5 can be found with the formula

=EXPONDIST(0.5, 5, 1)

This returns the probability 0.918. Of course, the probability of being greater than 0.5 is then $1 - 0.918 = 0.082$.

CHANGES IN EXCEL 2010: The EXPONDIST function has been replaced in Excel 2010 by EXPON.DIST. Either version can be used, and they work exactly the same way. (As with the Poisson distribution, there is no EXPON.INV function.)

Returning to the bank manager's analysis of customer arrival data, when the times between arrivals are exponentially distributed, you sometimes hear that "arrivals occur according to a Poisson process." This is because there is a close relationship between the exponential distribution, which measures times between events such as arrivals, and the Poisson distribution, which counts the number of events in a certain length of
time. The details of this relationship are beyond the level of this book, so we will not explore the topic further. But if you hear, for example, that customers arrive at a facility according to a Poisson process at the rate of six per hour, then the corresponding times between arrivals are exponentially distributed with mean 1/6 hour.

PROBLEMS

Level A

41. The annual number of industrial accidents occurring in a particular manufacturing plant is known to follow a Poisson distribution with mean 12.
a. What is the probability of observing exactly 12 accidents during the coming year?
b. What is the probability of observing no more than 12 accidents during the coming year?
c. What is the probability of observing at least 15 accidents during the coming year?
d. What is the probability of observing between 10 and 15 accidents (inclusive) during the coming year?
e. Find the smallest integer k such that we can be at least 99% sure that the annual number of accidents occurring will be less than k.

42. Suppose that the number of customers arriving each hour at the only checkout counter in a local pharmacy is approximately Poisson distributed with an expected arrival rate of 20 customers per hour.
a. Find the probability that exactly 10 customers arrive in a given hour.
b. Find the probability that at least five customers arrive in a given hour.
c. Find the probability that no more than 25 customers arrive in a given hour.
d. Find the probability that between 10 and 30 customers (inclusive)
arrive in a given hour.
e. Find the largest integer k such that we can be at least 95% sure that the number of customers arriving in a given hour will be greater than k.
f. Recalling the relationship between the Poisson and exponential distributions, find the probability that the time between two successive customer arrivals is more than four minutes. Find the probability that it is less than two minutes.

43. Suppose the number of baskets scored by the Indiana University basketball team in one minute follows a Poisson distribution with $\lambda = 1.5$. In a 10-minute span of time, what is the probability that Indiana University scores exactly 20 baskets? At most 20 baskets? (Use the fact that if the rate per minute is $\lambda$, then the rate in t minutes is $\lambda t$.)

44. Suppose that the times between arrivals at a bank during the peak period of the day are exponentially distributed with a mean of 45 seconds. If you just observed an arrival, what is the probability that you will need to wait for more than a minute before observing the next arrival? What is the probability you will need to wait at least two minutes?

Level B

45. Consider a Poisson random variable X with parameter $\lambda = 2$.
a. Find the probability that X is within one standard deviation of its mean.
b. Find the probability that X is within two standard deviations of its mean.
c. Find the probability that X is within three standard deviations of its mean.
d. Do the empirical rules we learned previously seem to be applicable in working with the Poisson distribution where $\lambda = 2$? Explain why or why not.
e. Repeat parts a through d for the case of a Poisson random variable where $\lambda = 20$.

46. Based on historical data, the probability that a major league pitcher pitches a no-hitter in a game is about 1/1300.
a. Use the binomial distribution to determine the probability that in 650 games, 0, 1, 2, or 3 no-hitters will be pitched. (Find the separate probabilities of these four events.)
b. Repeat part a using the Poisson approximation to the binomial. This approximation says that if n is large and
p is small, a binomial distribution with parameters n and p is approximately the same as a Poisson distribution with $\lambda = np$.

5.7 FITTING A PROBABILITY DISTRIBUTION TO DATA WITH @RISK⁸

The normal, binomial, Poisson, and exponential distributions are four of the most commonly used distributions in real applications. However, many other discrete and continuous distributions are also used. These include the uniform, triangular, Erlang, lognormal, gamma, Weibull, and others. How do you know which to choose for any particular application? One way to answer this is to check which of several potential distributions fits a given set of data most closely. Essentially, you compare a histogram of the data with the theoretical probability distributions available and see which gives the best fit. The @RISK add-in, part of the Palisade DecisionTools suite, makes this fairly easy, as we illustrate in the following example. (Many other features of @RISK are discussed in depth in Chapters 15 and 16.)

⁸ In a previous edition, we showed how to do this with Palisade's stand-alone program BestFit. Because @RISK incorporates all the functionality of BestFit, and because BestFit is not included in the current version of the Palisade suite, you should now use @RISK.

EXAMPLE 5.14 ASSESSING A DISTRIBUTION OF SUPERMARKET CHECKOUT TIMES

A supermarket has collected checkout times on over 100 customers. (See the file Checkout Times.xlsx.) As shown in Figure 5.28, the times vary from 40 seconds to 279 seconds, with the mean and median close to 160
seconds.

Figure 5.28: Supermarket Checkout Times

| Customer | Time |
|---|---|
| 1 | 131 |
| 2 | 101 |
| 3 | 178 |
| 4 | 246 |
| 5 | 207 |
| 6 | 155 |
| 7 | 95 |
| 8 | 105 |
| 9 | 168 |
| 10 | 92 |
| 11 | 112 |
| 12 | 163 |
| ... | ... |
| 110 | 138 |
| 111 | 279 |
| 112 | 90 |
| 113 | 155 |

| Summary measure | Time |
|---|---|
| Count | 113.000 |
| Mean | 159.239 |
| Median | 155.000 |
| Standard deviation | 52.609 |
| Minimum | 40.000 |
| Maximum | 279.000 |

The supermarket manager would like to check whether these data are normally distributed or whether some other distribution fits them better. How can he tell?

Objective: To use @RISK to determine which probability distribution fits the given data best.

Solution: To open @RISK, click on the Windows Start button, find the Palisade group, and click on @RISK. If Excel is already open, this opens @RISK on top of it. If Excel isn't open, this launches Excel and @RISK. You will know that @RISK is open when you see the @RISK tab and the associated ribbon in Figure 5.29. For now, choose the Distribution Fitting item. From here, you can go in one of two ways. You can test the fit of a given distribution, or you can find the best-fitting distribution from a number of candidates. Both are now illustrated.

Because the supermarket manager wants to know whether the data could come from a normal distribution, check this possibility first. To do so, select Fit Manager from the Distribution Fitting dropdown menu. The first step is to define a data set, as in Figure 5.30.

[Figure 5.29: @RISK Ribbon]
[Figure 5.30: Defining a Data Set. The Data tab of the Fit Distributions to Data dialog box, with Name: Time, Range: A2:A114, Type: Continuous Sample Data, and Filter Type: None.]

[Figure 5.31: Selecting the Distributions to Fit. The Distributions to Fit tab of the same dialog box, with the Normal distribution selected.]

The second step is to click on the Distributions to Fit tab and select the Normal distribution, as shown in Figure 5.31. To see how well a normal distribution fits the data, click on the Fit button. This produces the output shown in Figure 5.32, with a normal curve superimposed on the histogram of the data. A visual examination of this graph is often sufficient to tell whether the fit is any good. This fit appears to be fair but not great.
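A visual check can be backed by a simple numerical one. As a minimal sketch (not the procedure @RISK uses, which the text does not spell out), the Python code below fits a normal distribution by the sample mean and standard deviation and computes a chi-square statistic over bins that are equally likely under the candidate distribution. The function name `chi_square_stat`, the choice of four bins, and the sample values are all assumptions made for illustration. They are not the 113 observations in the data file.

```python
from statistics import NormalDist


def chi_square_stat(data, dist, n_bins=4):
    """Chi-square statistic using bins that are equally likely under `dist`."""
    n = len(data)
    expected = n / n_bins  # expected count per bin if `dist` were correct
    # Interior bin edges are quantiles of the candidate distribution.
    edges = [dist.inv_cdf(i / n_bins) for i in range(1, n_bins)]
    observed = [0] * n_bins
    for x in data:
        # The bin index is the number of edges lying below the observation.
        observed[sum(x > edge for edge in edges)] += 1
    return sum((obs - expected) ** 2 / expected for obs in observed)


# Hypothetical checkout times in seconds (illustrative values only).
times = [131, 101, 178, 246, 207, 155, 95, 105,
         168, 92, 112, 163, 138, 279, 90, 155]

fitted = NormalDist.from_samples(times)  # normal with the sample mean and SD
poor = NormalDist(mu=300, sigma=10)      # deliberately bad candidate

stat_fitted = chi_square_stat(times, fitted)
stat_poor = chi_square_stat(times, poor)
print(f"fitted normal: {stat_fitted:.2f}  poor candidate: {stat_poor:.2f}")
```

A smaller statistic means the observed bin counts sit closer to what the candidate distribution predicts, so the fitted normal scores far lower here than the deliberately mismatched one.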
[Figure 5.32: Normal Fit to the Data. @RISK fit results for Time, showing a histogram of the 113 input values with the fitted normal curve superimposed.]

@RISK provides several numerical measures of the goodness of fit, which you can find by clicking on the dropdown arrow next to Fit Ranking at the top left in the figure. The details are rather technical, but each test value measures goodness of fit in a slightly different way. For each of these measures, the larger the test value is, the worse the fit is. They can then be used to compare fits: the distribution with the lowest test values is the winner.

To see which of several possible distributions fit the data best, go back to the Fit Distributions to Data dialog box and click on the Distributions to Fit tab. (See Figure 5.33.)

[Figure 5.33: Selecting Distributions to Fit. The Distributions to Fit tab, with lower and upper limit options on the left and a checklist of candidate distributions on the right.]
Some reasonable choices about the checkout data have been made on the left. The lowest possible checkout time is 0, but there is no obvious upper limit. When you make such choices, the set of possible distributions that are checked on the right changes. For example, the selected list here contains only distributions with a lower limit of 0. (Note that the normal distribution does not satisfy this condition.) You can then uncheck any distributions you do not want included in the search for the best fit. (For example, you might want to uncheck distributions you have never heard of.) Once you specify these candidate distributions and click on Fit, @RISK performs a numerical algorithm to find the best-fitting distribution from each selected distribution family (the best gamma of all gamma distributions, for example) and displays them in ranked order, from best to worst.

The best fit for these data is the beta general distribution, as shown in Figure 5.34. (The beta general family includes skewed distributions, although this one appears to be symmetric.)

[Figure 5.34: Beta General Fit to the Data. @RISK fit results for Time, showing the histogram of the input data with the fitted beta general curve and the ranked list of fitted distributions.]

You can also click
on any of the runner-up distributions to see how well they fit. For example, the triangular fit is shown in Figure 5.35. Obviously, this fit is not nearly as good as the beta general fit.

[Figure 5.35: Triangular Fit to the Data. @RISK fit results for Time, showing the histogram of the input data with the fitted triangular curve. The Fit Ranking list, from best to worst, is BetaGeneral, Weibull, Gamma, Erlang, Lognorm, Triang, Rayleigh.]

It is not always easy to look at these graphs and judge which fit is best. This is the reason for the goodness-of-fit measures. Comparing Figures 5.34 and 5.35, you can see that the triangular fit is considerably worse than the beta general: its test values (some not shown) are all much larger. By comparison, the test values for the normal fit in Figure 5.32 are quite comparable to those for the beta general. The only downside to the normal distribution in this example is that checkout times cannot possibly be negative, which the normal distribution allows. But the probability of a negative value for this particular normal distribution is so low that the manager might decide to use it anyway.

At this point, you might wonder why we bother fitting a distribution to a set of data in the first place. The usual reason is given in the following scenario. Suppose a manager needs to make a decision, but there is at least one source of uncertainty. If the manager wants to develop a decision model, or perhaps a simulation model, to help solve his problem,
probability distributions of all uncertain outcomes are typically required. The manager could always choose one of the well-known distributions, such as the normal, for all uncertain outcomes, but these might not reflect reality well. Instead, the manager could gather historical data, such as those in the preceding example, find the distribution that fits these data best, and then use this distribution in the decision or simulation model. Of course, as this example has illustrated, it helps to know a few distributions other than the normal: the Weibull and the gamma, for example. Although we will not pursue these in this book, the more distributions you have in your tool kit, the more effectively you can model uncertainty.

PROBLEMS

Level A

47. A production manager is interested in determining the proportion of defective items in a typical shipment of one of the computer components that her company manufactures. The proportion of defective components is recorded for each of 250 randomly selected shipments collected during a one-month period. The data are in the file P0547.xlsx. Use @RISK to determine which probability distribution best fits these data.

48. The manager of a local fast-food restaurant is interested in improving the service provided to customers who use the restaurant's drive-up window. As a first step in this process, the manager asks his assistant to record the time (in minutes) it takes to serve 200 different customers at the final window in the facility's drive-up system. The given 200 customer service times are all observed during the busiest hour of the day for this fast-food operation. The data are in the file P0548.xlsx. Use @RISK to determine which probability distribution best fits these data.

49. The operations manager of a tollbooth located at a major exit
of a state turnpike is trying to estimate the average number of vehicles that arrive at the tollbooth during a one-minute period during the peak of rush-hour traffic. To estimate this average throughput value, he records the number of vehicles that arrive at the tollbooth over a one-minute interval commencing at the same time for each of 250 normal weekdays. The data are in the file P0549.xlsx. Use @RISK to determine which probability distribution best fits these data.

50. A finance professor has just given a midterm examination in her corporate finance course and is interested in learning how her class of 250 students performed on this exam. The data are in the file P0550.xlsx. Use @RISK to determine which probability distribution best fits these data.

5.8 CONCLUSION

We have covered a lot of ground in this chapter, and much of the material, especially that on the normal distribution, will be used in later chapters. The normal distribution is the cornerstone for much of statistical theory. As you will see in later chapters on statistical inference and regression, an assumption of normality is behind most of the procedures we use. Therefore, it is important for you to understand the properties of the normal distribution and how to work with it in Excel. The binomial, Poisson, and exponential distributions, although not used as frequently as the normal distribution in this book, are also extremely important. The examples we have discussed indicate how these distributions can be used in a variety of
business situations.

Although we have attempted to stress concepts in this chapter, we have also described the details necessary to work with these distributions in Excel. Fortunately, these details are not too difficult to master once you understand Excel's built-in functions, especially NORMDIST, NORMINV, and BINOMDIST. Figures 5.6 and 5.18 provide typical examples of these functions. We suggest that you keep a copy of these figures handy.

Summary of Key Terms

| Term | Explanation | Excel⁹ | Page | Equation |
|---|---|---|---|---|
| Density function | Specifies the probability distribution of a continuous random variable | | 211 | |
| Normal distribution | A continuous distribution with possible values ranging over the entire number line; its density function is a symmetric bell-shaped curve | | 213 | 5.1 |
| Standardizing a normal random variable | Transforms any normal distribution with mean $\mu$ and standard deviation $\sigma$ to the standard normal distribution with mean 0 and standard deviation 1 | STANDARDIZE | 214 | 5.2 |
| Normal calculations in Excel | Useful for finding probabilities and percentiles for nonstandard and standard normal distributions | NORMDIST, NORMSDIST, NORMINV, NORMSINV | 217 | |
| Empirical rules for normal distribution | About 68% of the data fall within one standard deviation of the mean, about 95% of the data fall within two standard deviations of the mean, and almost all fall within three standard deviations of the mean | | 221 | |
| Binomial distribution | The distribution of the number of successes in n independent, identical trials, where each trial has probability p of success | BINOMDIST, CRITBINOM | 233 | |
| Mean and standard deviation of a binomial distribution | The mean and standard deviation of a binomial distribution with parameters n and p are $np$ and $\sqrt{np(1-p)}$, respectively | | 236 | 5.3, 5.4 |
| Sampling without replacement | Sampling where no member of the population can be sampled more than once | | 236 | |
| Sampling with replacement | Sampling where any member of the population can be sampled more than once | | 236 | |
| Normal approximation to the binomial distribution | If $np > 5$ and $n(1-p) > 5$, the binomial distribution can be approximated well by a normal distribution with mean $np$ and standard deviation $\sqrt{np(1-p)}$ | | 237 | |
| Poisson distribution | A discrete probability distribution that often describes the number of events occurring within a specified period of time or space; mean and variance both equal the parameter $\lambda$ | POISSON | 250 | |
| Exponential distribution | A continuous probability distribution useful for measuring times between events, such as customer arrivals to a service facility; mean and standard deviation both equal $1/\lambda$ | EXPONDIST | 252 | |
| Relationship between Poisson and exponential distributions | Exponential distribution measures times between events; Poisson distribution counts the number of events in a certain period of time | | 254 | |
| @RISK | An Excel add-in for finding how well a specified distribution fits a set of data, or for finding the distribution that best fits a set of data | Distribution Fitting item on @RISK ribbon | 255 | |

⁹ See the text for the new versions of some of these Excel functions in Excel 2010.

PROBLEMS

Conceptual Questions

C1. For each of the following uncertain quantities, discuss whether it is reasonable to assume that the probability distribution of the quantity is normal. If the answer isn't obvious, discuss how you could discover whether a normal distribution is reasonable.
a. The change in the Dow Jones Industrial Average between now and a year from now
b. The length of time (in months) a battery that is in continuous use lasts
c.
The time between two successive arrivals to a bank
d. The time it takes a bank teller to service a random customer
e. The length (in yards) of a typical drive on a par 5 by Phil Mickelson
f. The amount of snowfall (in inches) in a typical winter in Minneapolis
g. The average height (in inches) of all boys in a randomly selected seventh-grade middle school class
h. Your bonus from finishing a project, where your bonus is $1000 per day under the deadline if the project is completed before the deadline, your bonus is $500 if the project is completed right on the deadline, and your bonus is $0 if the project is completed after the deadline
i. Your gain on a call option on a stock, where you gain nothing if the price of the stock a month from now is less than or equal to $50, and you gain $(P - 50)$ dollars if the price P a month from now is greater than $50

C2. For each of the following uncertain quantities, discuss whether it is reasonable to assume that the probability distribution of the quantity is binomial. If you think it is, what are the parameters n and p? If you think it isn't, explain your reasoning.
a. The number of wins the Boston Red Sox baseball team has next year in its 81 home games
b. The number of free throws Kobe Bryant misses in his next 250 attempts
c. The number of free throws it takes Kobe Bryant to achieve 100 successes
d. The number out of 1000 randomly selected customers in a supermarket who have a bill of at least $150
e. The number of trading days in a typical year
where Microsoft's stock price increases
f. The number of spades you get in a 13-card hand from a well-shuffled 52-card deck
g. The number of adjacent 15-minute segments during a typical Friday where at least 10 customers enter a McDonald's restaurant
h. The number of pages in a 500-page book with at least one misprint on the page

C3. The Poisson distribution is often appropriate in the "binomial" situation of n independent and identical trials, where each trial has probability p of success, but n is very large and p is very small. In this case, the Poisson distribution is relevant for the number of successes, and its parameter (its mean) is np. Discuss some situations where such a Poisson model might be appropriate. How would you measure n and p, or would you measure only their product np? Here is one to get you started: the number of traffic accidents at a particular intersection in a given year.

C4. One disadvantage of a normal distribution is that there is always some probability that a quantity is negative, even when this makes no sense for the uncertain quantity. For example, the time a light bulb lasts cannot be negative. In any particular situation, how would you decide whether you could ignore this disadvantage for all practical purposes?

C5. Explain why probabilities such as $P(X < x)$ and $P(X \le x)$ are equal for a continuous random variable.

C6. State the major similarities and differences between the binomial distribution and the Poisson distribution.

C7. You have a bowl with 100 pieces of paper inside, each with a person's name written on it. It turns out that 50 of the names correspond to males and the other 50 to females. You reach inside and grab five pieces of paper. If X is the random number of male names you choose, is X binomially distributed? Why or why not?

C8. A distribution we didn't discuss is the Bernoulli distribution. It is essentially a binomial distribution with $n = 1$. In other words, it is the number of successes (0 or 1) in a single trial when the probability of success is p. What are the mean and standard
deviation of a Bernoulli distribution Discuss how a binomial random variable can be expressed in terms of n independent Bernoulli random variables each with the same parameter p For real applications the normal distribution has two potential drawbacks 1 it can be negative and 2 it isn t symmetric Choose some continuous random numeric outcomes of interest to you Are either potential drawbacks really drawbacks for your random outcomes If so which is the more serious drawback C10 Many basketball players and fans believe strongly in C11 the hot hand That is they believe that players tend to shoot in streaks either makes or misses If this is the case why does the binomial distribution not apply at least not exactly to the number of makes in a given number of shots Which assumption of the binomial model is violated the independence of successive shots or the constant probability of success on each shot Or can you tell Suppose the demands in successive weeks for your product are normally distributed with mean 100 and standard deviation 20 and suppose your lead time for receiving a placed order is three weeks A quantity of interest to managers is the lead time demand the total demanded over three weeks Why does the formula for the standard deviation of lead time demand include a square root of 3 What assumptions are behind this Level A Suppose the annual return on XYZ stock follows a normal distribution with mean 012 and standard deviation 030 52 3 b What is the probability that XYZ s value will decrease during a year What is the probability that the return on XYZ during a year will be at least 20 What is the probability that the return on XYZ during a year will be between 6 and 9 There is a 5 chance that the retum on XYZ during a year will be greater than what value There is a 1 chance that the retum on XYZ during a year will be less than what value 139 There is a 95 chance that the return on XYZ during a year will be between which two values equidistant from the mean 
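
The text works in Excel, but the same kind of normal-distribution calculation asked for in the XYZ stock problem can be sketched in Python's standard library. This is only an illustrative sketch, not the book's method. It assumes the garbled values are read as a mean of 0.12 and a standard deviation of 0.30; the function name `xyz_summary` and the dictionary keys are made up for this example.

```python
from statistics import NormalDist

# Assumption: the annual return has mean 0.12 and standard deviation 0.30
# (that is, 12% and 30% written as decimals).
MEAN_RETURN = 0.12
SD_RETURN = 0.30


def xyz_summary(mean=MEAN_RETURN, sd=SD_RETURN):
    """Return several normal-distribution quantities for an annual return."""
    dist = NormalDist(mu=mean, sigma=sd)
    return {
        # P(return < 0): the stock's value decreases during the year
        "p_decrease": dist.cdf(0.0),
        # P(return >= 0.20)
        "p_at_least_20pct": 1.0 - dist.cdf(0.20),
        # value exceeded with only 5% probability (95th percentile)
        "top_5pct_cutoff": dist.inv_cdf(0.95),
        # value with only 1% probability below it (1st percentile)
        "bottom_1pct_cutoff": dist.inv_cdf(0.01),
        # interval symmetric about the mean containing 95% probability
        "central_95pct": (dist.inv_cdf(0.025), dist.inv_cdf(0.975)),
    }


if __name__ == "__main__":
    for name, value in xyz_summary().items():
        print(name, value)
```

In Excel terms, `cdf` plays the role of NORMDIST with a cumulative argument and `inv_cdf` plays the role of NORMINV. Probabilities for a range, such as a return between two values, follow by subtracting two `cdf` values.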
52. Assume the annual mean return on ABC stock is around 15% and the annual standard deviation is around 25%. Assume the annual and daily returns on ABC stock are normally distributed.
a. What is the probability that ABC will lose money during a year?
b. There is a 5% chance that ABC will earn a return of at least what value during a year?
c. There is a 10% chance that ABC will earn a return of less than or equal to what value during a year?
d. What is the probability that ABC will earn at least 35% during a year?
e. Assume there are 252 trading days in a year. What is the probability that ABC will lose money on a given day? (Hint: Let Y be the annual return on ABC and X_i be the return on ABC on day i. Then, approximately, Y = X_1 + X_2 + ... + X_252. Use the fact that the sum of independent normal random variables is normally distributed, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.)

53. Suppose Comdell Computer receives its hard drives from Diskco. On average, 4% of all hard disk drives received by Comdell are defective.
a. Comdell has adopted the following policy. It samples 50 hard drives in each shipment and accepts the shipment if all hard drives in the sample are nondefective. What fraction of shipments will Comdell accept?
b. Suppose instead that the shipment is accepted if at most one hard drive in the sample is defective. What fraction of shipments will Comdell accept?
c. What is the probability that a sample of size 50 will contain at least 10 defectives?

54. A family is considering a move from a
midwestern city to a city in California. The distribution of housing costs where the family currently lives is normal, with mean $105,000 and standard deviation $18,200. The distribution of housing costs in the California city is normal, with mean $235,000 and standard deviation $30,400. The family's current house is valued at $110,000.
a. What percentage of houses in the family's current city cost less than theirs?
b. If the family buys a $200,000 house in the new city, what percentage of houses there will cost less than theirs?
c. What price house will the family need to buy to be in the same percentile (of housing costs) in the new city as they are in the current city?

55. The number of traffic fatalities in a typical month in a given state has a normal distribution with mean 125 and standard deviation 31.
a. If a person in the highway department claims that there will be at least m fatalities in the next month with probability 0.95, what value of m makes this claim true?
b. If the claim is that there will be no more than n fatalities in the next month with probability 0.98, what value of n makes this claim true?

56. It can be shown that a sum of independent normally distributed random variables is also normally distributed. Do all functions of normal random variables lead to normal random variables? Consider the following. SuperDrugs is a chain of drugstores with three similar-size stores in a given city. The sales in a given week for any of these stores is normally distributed with mean $15,000 and standard deviation $3000. At the end of each week, the sales figure for the store with the largest sales among the three stores is recorded. Is this maximum value normally distributed? To answer this question, simulate a weekly sales figure at each of the three stores and calculate the maximum. Then replicate this maximum 500 times and create a histogram of the 500 maximum values. Does it appear to be normally shaped? Whatever this distribution looks like, use your simulated values to estimate the mean and standard deviation of the maximum.

57. In the game of baseball, every time a player bats, he is either successful (gets on base) or he fails (doesn't get on base). (This is all you need to know about baseball for this problem.) His on-base percentage, usually expressed as a decimal, is the percentage of times he is successful. Let's consider a player who is theoretically a 0.375 on-base batter. Specifically, assume that each time he bats, he is successful with probability 0.375 and unsuccessful with probability 0.625. Also, assume that he bats 600 times in a season. What can you say about his on-base percentage, (number of successes)/600, for the season? (Hint: Each on-base percentage is equivalent to a number of successes. For example, 0.380 is equivalent to 228 successes because 0.380 × 600 = 228.)
a. What is the probability that his on-base percentage will be less than 0.360?
b. What is the probability that his on-base percentage will be greater than 0.370?
c. What is the probability that his on-base percentage will be less than or equal to 0.400?

58. In the financial world, there are many types of complex instruments called derivatives that derive their value from the value of an underlying asset. Consider the following simple derivative. A stock's current price is $80 per share. You purchase a derivative whose value to you becomes known a month from now. Specifically, let P be the price of the stock in a month. If P is between $75 and $85, the derivative is worth nothing to you. If P is less than $75, the derivative results in a loss of 100(75 − P) dollars to you. (The factor of 100 is because many derivatives involve 100 shares.) If P is greater than $85, the derivative results in a gain of 100(P − 85) dollars to you. Assume that the distribution of the change in the stock price from now to a month from now is normally distributed with mean $1 and standard deviation $8. Let P(big loss) be the probability that you lose at least $1000 (that is, the price falls below $65), and let P(big gain) be the probability that you gain at least $1000 (that is, the price rises above $95). Find these two probabilities. How do they compare to one another?

Level B

59. When you sum 30 or more independent random variables, the sum of the random variables will usually be approximately normally distributed, even if each individual random variable is not normally distributed. Use this fact to estimate the probability that a casino will be behind after 90,000 roulette bets, given that it wins $1 or loses $35 on each bet with probabilities 37/38 and 1/38.

60. The daily demand for six-packs of Coke at Mr. D's supermarket follows a normal distribution with mean 120 and standard deviation 30. Every Monday the Coke delivery driver delivers Coke to Mr. D's. If Mr. D's wants to have only a 1% chance of running out of Coke by the end of the week, how many should Mr. D's order for the week? Assume orders are placed on Sunday at midnight. Also assume that demands on different days are probabilistically independent. Use the fact that the sum of independent normal random variables is normally distributed, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.

61. Many companies use sampling to determine whether a batch should be accepted. An (n, c) sampling plan consists of inspecting n randomly chosen items from a batch and accepting the batch if c or fewer sampled items are defective. Suppose a company uses a (100, 5) sampling plan to determine whether a batch of 10,000 computer chips is acceptable.
a. The producer's risk of a sampling plan is the probability
that an acceptable batch will be rejected by the sampling plan. Suppose the customer considers a batch with 3% defectives acceptable. What is the producer's risk for this sampling plan?
b. The consumer's risk of a sampling plan is the probability that an unacceptable batch will be accepted by the sampling plan. Our customer says that a batch with 9% defectives is unacceptable. What is the consumer's risk for this sampling plan?

62. Suppose that if a presidential election were held today, 53% of all voters would vote for Obama over McCain. (You can substitute the names of the current presidential candidates.) This problem shows that even if there are 100 million voters, a sample of several thousand is enough to determine the outcome, even in a fairly close election.
a. If 1500 voters are sampled randomly, what is the probability that the sample will indicate (correctly) that Obama is preferred to McCain?
b. If 6000 voters are sampled randomly, what is the probability that the sample will indicate (correctly) that Obama is preferred to McCain?

63. A soft-drink factory fills bottles of soda by setting a timer on a filling machine. It has generally been observed that the distribution of the number of ounces the machine puts into a bottle is normal, with standard deviation 0.05 ounce. The company wants 99.9% of all its bottles to have at least 16 ounces of soda. To what value should the mean amount put in each bottle be set? (Of course, the company does not want to fill any more than is necessary.)

64. The time it takes you to swim 100 yards in a race is normally distributed with mean 62 seconds and standard deviation 2 seconds. In your next five races, what is the probability that you will swim under a minute exactly twice?

65. A company assembles a large part by joining two smaller parts together. Assume that the smaller parts are normally distributed with a mean length of 1 inch and a standard deviation of 0.01 inch.
a. What fraction of the larger parts are longer than 2.05 inches? Use the fact that the sum of independent
normal random variables is normally distributed, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.
b. What fraction of the larger parts are between 1.96 inches and 2.02 inches long?

66. (Suggested by Sam Kaufmann, Indiana University MBA, who runs Harrah's Lake Tahoe Casino.) A high roller has come to the casino to play 300 games of craps. For each game of craps played, there is a 0.493 probability that the high roller will win $1 and a 0.507 probability that the high roller will lose $1. After 300 games of craps, what is the probability that the casino will be behind more than $10?

67. (Suggested by Sam Kaufmann, Indiana University MBA, who runs Harrah's Lake Tahoe Casino.) A high roller comes to the casino intending to play 500 hands of blackjack for $1 a hand. On each hand, the high roller will win $1 with probability 0.48 and lose $1 with probability 0.52. After the 500 hands, what is the probability that the casino has lost more than $40?

68. A soft-drink company produces 100,000 12-ounce bottles of soda per year. By adjusting a dial, the company can set the mean number of ounces placed in a bottle. Regardless of the mean, the standard deviation of the number of ounces in a bottle is 0.05 ounce. The soda costs 5 cents per ounce. Any bottle weighing less than 12 ounces will incur a $10 fine for being underweight. Determine a setting for the mean number of ounces per bottle of soda that minimizes the expected cost per year of producing soda. Your answer should be accurate within 0.001 ounce. Does the number of bottles produced per year influence your answer?

69. The weekly demand for TVs at Lowland Appliance is normally distributed with mean 400 and standard deviation 100. Each time an order for TVs is placed, it arrives exactly four weeks later. That is, TV orders have a four-week lead time. Lowland doesn't want to run out of TVs during any more than 1% of all lead times. How low should Lowland let its TV inventory drop before it places an order for more TVs? (Hint: How many
standard deviations above the mean lead time demand must the reorder point be for there to be a 1% chance of a stockout during the lead time? Use the fact that the sum of independent normal random variables is normally distributed, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.)

70. An elevator rail is assumed to meet specifications if its diameter is between 0.98 and 1.01 inches. Each year a company produces 100,000 elevator rails. For a cost of $10/σ² per year, the company can rent a machine that produces elevator rails whose diameters have a standard deviation of σ. (The idea is that the company must pay more for a smaller variance.) Each such machine will produce rails having a mean diameter of one inch. Any rail that does not meet specifications must be reworked at a cost of $12. Assume that the diameter of an elevator rail follows a normal distribution.
a. What standard deviation (within 0.001 inch) minimizes the annual cost of producing elevator rails? You do not need to try standard deviations in excess of 0.02 inch.
b. For your answer in part a, one elevator rail in 1000 will be at least how many inches in diameter?

71. A 50-question true-false examination is given. Each correct answer is worth 10 points. Consider an unprepared student who randomly guesses on each question.
a. If no points are deducted for incorrect answers, what is the probability that the student will score at least 350 points?
b. If 5 points are deducted for each incorrect answer, what is the probability that
the student will score at least 200 points?
c. If 10 points are deducted for each incorrect answer, what is the probability that the student will receive a negative score?

72. The percentage of examinees who took the GMAT (Graduate Management Admission) exam from June 1992 to March 1995 and scored below each total score is given in the file P05_72.xlsx. For example, 96% of all examinees scored 690 or below. The mean GMAT score for this time period was 497 and the standard deviation was 105. Does it appear that GMAT scores can accurately be approximated by a normal distribution? (Source: 1995 GMAT Examinee Interpretation Guide)

73. What caused the crash of TWA Flight 800 in 1996? Physics professors Hailey and Helfand of Columbia University believe there is a reasonable possibility that a meteor hit Flight 800. They reason as follows. On a given day, 3000 meteors of a size large enough to destroy an airplane hit the earth's atmosphere. Approximately 50,000 flights per day, averaging two hours in length, have been flown from 1950 to 1996. This means that at any given point in time, planes in flight cover approximately two-billionths of the world's atmosphere. Determine the probability that at least one plane in the last 47 years has been downed by a meteor. (Hint: Use the Poisson approximation to the binomial. This approximation says that if n is large and p is small, a binomial distribution with parameters n and p is approximately Poisson distributed with λ = np.)

74. In the decade 1982 through 1991, 10 employees working at the Amoco Company chemical research center were stricken with brain tumors. The average employment at the center was 2000 employees. Nationwide, the average incidence of brain tumors in a single year is 20 per 100,000 people. If the incidence of brain tumors at the Amoco chemical research center were the same as the nationwide incidence, what is the probability that at least 10 brain tumors would have been observed among Amoco workers during the decade 1982 through 1991? What do you conclude from your
analysis? (Source: AP wire service report, March 12, 1994)

75. Claims arrive at random times to an insurance company. The daily amount of claims is normally distributed with mean $1570 and standard deviation $450. Total claims on different days each have this distribution, and they are probabilistically independent of one another.
a. Find the probability that the amount of total claims over a period of 100 days is at least $150,000. Use the fact that the sum of independent normally distributed random variables is normally distributed, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.
b. If the company receives premiums totaling $165,000, find the probability that the company will net at least $10,000 for the 100-day period.

76. A popular model for stock prices is the following. If p_0 is the current stock price, then the price k periods from now, p_k (where a period could be a day, week, or any other convenient unit of time, and k is any positive integer), is given by

p_k = p_0 exp[(μ − 0.5σ²)k + σZ√k]

Here, exp is the exponential function (EXP in Excel), μ is the mean percentage growth rate per period of the stock, σ is the standard deviation of the growth rate per period, and Z is a normally distributed random variable with mean 0 and standard deviation 1. Both μ and σ are typically estimated from actual stock price data, and they are typically expressed in decimal form, such as μ = 0.01 for a 1% mean growth rate.
a. Suppose a period is defined as a month, the current price of the stock (as of the end of December 2010) is $75, μ = 0.006, and σ = 0.028. Use simulation to obtain 500 possible stock price changes from the end of December 2010 to the end of December 2013. Each simulated change will be the price at the end of 2013 minus the price at the end of 2010. (Note that you can simulate a given change in one line and then copy it down.) Create a histogram of these changes to see whether the stock price change is at least approximately normally distributed. Also, use the simulated data to estimate the mean price change and the standard deviation of the change.
b. Use simulation to generate the ending stock prices for each month in 2011. (Use k = 1 to get January's price from December's, use k = 1 again to get February's price from January's, and so on.) Then use a data table to replicate the ending December 2011 stock price 500 times. Create a histogram of these 500 values. Do they appear to resemble a normal distribution?

77. Your company is running an audit on the Sleaze Company. Because Sleaze has a bad habit of overcharging its customers, the focus of your audit is on checking whether the billing amounts on its invoices are correct. Assume that each invoice is for too high an amount with probability 0.06 and for too low an amount with probability 0.01 (so that the probability of a correct billing is 0.93). Also, assume that the outcome for any invoice is probabilistically independent of the outcomes for other invoices.
a. If you randomly sample 200 of Sleaze's invoices, what is the probability that you will find at least 15 invoices that overcharge the customer? What is the probability you won't find any that undercharge the customer?
b. Find an integer k such that the probability is at least 0.99 that you will find at least k invoices that overcharge the customer. (Hint: Use trial and error with the BINOMDIST function to find k.)

78. Continuing the previous problem, suppose that when Sleaze overcharges a customer, the distribution of the amount overcharged, expressed as a percentage
of the correct billing amount, is normally distributed with mean 15% and standard deviation 4%.
a. What percentage of overbilled customers are charged at least 10% more than they should pay?
b. What percentage of all customers are charged at least 10% more than they should pay?
c. If your auditing company samples 200 randomly chosen invoices, what is the probability that it will find at least five where the customer was overcharged by at least 10%?

79. Your manufacturing process makes parts such that each part meets specifications with probability 0.98. You need a batch of 250 parts that meet specifications. How many parts must you produce to be at least 99% certain of producing at least 250 parts that meet specifications?

80. Let X be normally distributed with a given mean and standard deviation. Sometimes you want to find two values a and b such that P(a < X < b) is equal to some specific probability such as 0.90 or 0.95. There are many answers to this problem, depending on how much probability you put in each of the two tails. For this question, assume the mean and standard deviation are μ = 100 and σ = 10, and that you want to find a and b such that P(a < X < b) = 0.90.
a. Find a and b so that there is probability 0.05 in each tail.
b. Find a and b so that there is probability 0.025 in the left tail and 0.075 in the right tail.
c. The usual answer to the general problem is the answer from part a, that is, where you put equal probability in the two tails. It turns out that this is the answer that minimizes the length of the interval from a to b. That is, if you solve the following problem: minimize (b − a), subject to P(a < X < b) = 0.90, you will get the same answer as in part a. Verify this by using Excel's Solver add-in.

81. As any credit-granting agency knows, there are always some customers who default on credit charges. Typically, customers are grouped into relatively homogeneous categories, so that customers within any category have approximately the same chance of defaulting on their credit charges. Here we will look at one
particular group of customers. We assume each of these customers has (1) probability 0.07 of defaulting on his or her current credit charges, and (2) total credit charges that are normally distributed with mean $350 and standard deviation $100. We also assume that if a customer defaults, 20% of his or her charges can be recovered. The other 80% are written off as bad debt.
a. What is the probability that a typical customer in this group will default and produce a write-off of more than $250 in bad debt?
b. If there are 500 customers in this group, what are the mean and standard deviation of the number of customers who will meet the description in part a?
c. Again assuming there are 500 customers in this group, what is the probability that at least 25 of them will meet the description in part a?
d. Suppose now that nothing is recovered from a default; the whole amount is written off as bad debt. Show how to simulate the total amount of bad debt from 500 customers in just two cells, one with a binomial calculation, the other with a normal calculation.

82. The Excel functions discussed in this chapter are useful for solving a lot of probability problems, but there are other problems that, even though they are similar to normal or binomial problems, cannot be solved with these functions. In cases like this, simulation can often be used. Here are a couple of such problems for you to simulate. For each example, simulate 500 replications of the experiment.
a. You observe a sequence of parts from a manufacturing line. These parts use a component that is
supplied by one of two suppliers. Each part made with a component from supplier 1 works properly with probability 0.95, and each part made with a component from supplier 2 works properly with probability 0.98. Assuming that 100 of these parts are made, 60 from supplier 1 and 40 from supplier 2, you want the probability that at least 97 of them work properly.
b. Here we look at a more generic example such as coin flipping. There is a sequence of trials where each trial is a success with probability p and a failure with probability 1 − p. A run is a sequence of consecutive successes or failures. For most of us, intuition says that there should not be long runs. Test this by finding the probability that there is at least one run of length at least six in a sequence of 15 trials. (The run could be of 0s or 1s.) You can use any value of p you like, or try different values of p.

83. You have a device that uses a single battery, and you operate this device continuously, never turning it off. Whenever a battery fails, you replace it with a brand new one immediately. Suppose the lifetime of a typical battery has an exponential distribution with mean 205 minutes. Suppose you operate the device continuously for three days, making battery changes when necessary. Find the probability that you will observe at least 25 failures. (Hint: The number of failures is Poisson distributed.)

84. In the previous problem, we ran the experiment for a certain number of days and then asked about the number of failures. In this problem, we take a different point of view. Suppose you operate the device, starting with a new battery, until you have observed 25 battery failures. What is the probability that at least 15 of these 25 batteries lived at least 3.5 hours? (Hint: Each lifetime is exponentially distributed.)

85. In the game of soccer, players are sometimes awarded a penalty kick. The player who kicks places the ball 12 yards from the 24-foot-wide goal and attempts to kick it past the goalie into the net. (The goalie is the only defender.) The question is
where the player should aim. Make the following assumptions. (1) The player's kick is off target from where he aims, left or right, by a normally distributed amount with mean 0 and some standard deviation. (2) The goalie typically guesses left or right and dives in that direction at the moment the player kicks. If the goalie guesses wrong, he won't block the kick, but if he guesses correctly, he will be able to block a kick that would have gone into the net as long as the kick is within a distance d from the middle of the goal. The goalie is equally likely to guess left or right. (3) The player never misses high, but he can miss to the right of the goal (if he aims to the right) or to the left (if he aims to the left). For reasonable values of the standard deviation and d, find the probability that the player makes a goal if he aims at a point t feet inside the goal. (By symmetry, you can assume he aims to the right, although the goalie doesn't know this.) What value of t seems to maximize the probability of making a goal?

5.1 EUROWATCH COMPANY

The EuroWatch Company assembles expensive wristwatches and then sells them to retailers throughout Europe. The watches are assembled at a plant with two assembly lines. These lines are intended to be identical, but line 1 uses somewhat older equipment than line 2 and is typically less reliable. Historical data have shown that each watch coming off line 1, independently of the others, is free of defects with probability 0.98. The similar probability for line 2 is 0.99. Each
line produces 500 watches per hour. The production manager has asked you to answer the following questions.

1. She wants to know how many defect-free watches each line is likely to produce in a given hour. Specifically, find the smallest integer k (for each line separately) such that you can be 99% sure that the line will not produce more than k defective watches in a given hour.

2. EuroWatch currently has an order for 500 watches from an important customer. The company plans to fill this order by packing slightly more than 500 watches, all from line 2, and sending this package off to the customer. Obviously, EuroWatch wants to send as few watches as possible, but it wants to be 99% sure that when the customer opens the package, there are at least 500 defect-free watches. How many watches should be packed?

3. EuroWatch has another order for 1000 watches. Now it plans to fill this order by packing slightly more than one hour's production from each line. This package will contain the same number of watches from each line. As in the previous question, EuroWatch wants to send as few watches as possible, but it again wants to be 99% sure that when the customer opens the package, there are at least 1000 defect-free watches. The question of how many watches to pack is unfortunately quite difficult because the total number of defect-free watches is not binomially distributed. (Why not?) Therefore, the manager asks you to solve the problem with simulation (and some trial and error). (Hint: It turns out that it is much faster to simulate small numbers than large numbers, so simulate the number of watches with defects, not the number without defects.)

4. Finally, EuroWatch has a third order for 100 watches. The customer has agreed to pay $50,000 for the order, that is, $500 per watch. If EuroWatch sends more than 100 watches to the customer, its revenue doesn't increase; it can never exceed $50,000. Its unit cost of producing a watch is $450, regardless of which line it is assembled on. The order will be filled entirely from a single line, and EuroWatch
plans to send slightly more than 100 watches to the customer. If the customer opens the shipment and finds that there are fewer than 100 defect-free watches (which we assume the customer has the ability to do), then he will pay only for the defect-free watches; EuroWatch's revenue will decrease by $500 per watch short of the 100 required. On top of this, EuroWatch will be required to make up the difference at an expedited cost of $1000 per watch. The customer won't pay a dime for these expedited watches. (If expediting is required, EuroWatch will make sure that the expedited watches are defect-free. It doesn't want to lose this customer entirely.) You have been asked to develop a spreadsheet model to find EuroWatch's expected profit for any number of watches it sends to the customer. You should develop it so that it responds correctly, regardless of which assembly line is used to fill the order and what the shipment quantity is. (Hints: Use the BINOMDIST function, with last argument 0, to fill up a column of probabilities for each possible number of defective watches. Next to each of these, calculate EuroWatch's profit. Then use a SUMPRODUCT to obtain the expected profit. Finally, you can assume that EuroWatch will never send more than 110 watches. It turns out that this large a shipment is not even close to optimal.)

5.2 CASHING IN ON THE LOTTERY

Many states supplement their tax revenues with state-sponsored lotteries. Most of them do so with a game called lotto. Although there are various versions of this game, they are all basically as follows.
People purchase tickets that contain r distinct numbers from 1 to m, where r is generally 5 or 6 and m is generally around 50. For example, in Virginia, the state discussed in this case, r = 6 and m = 44. Each ticket costs $1, about 39 cents of which is allocated to the total jackpot.¹ There is eventually a drawing of r = 6 distinct numbers from the m = 44 possible numbers. Any ticket that matches these 6 numbers wins the jackpot.

There are two interesting aspects of this game. First, the current jackpot includes not only the revenue from this round of ticket purchases but also any jackpots carried over from previous drawings because of no winning tickets. Therefore, the jackpot can build from one drawing to the next, and in celebrated cases it has become huge. Second, if there is more than one winning ticket (a distinct possibility), the winners share the jackpot equally. (This is called parimutuel betting.) So, for example, if the current jackpot is $9 million and there are three winning tickets, then each winner receives $3 million.

It can be shown that for Virginia's choice of r and m, there are approximately 7 million possible tickets (7,059,052 to be exact). Therefore, any ticket has about one chance out of 7 million of being a winner. That is, the probability of winning with a single ticket is p = 1/7,059,052, which is not very good odds. If n people purchase tickets, then the number of winners is binomially distributed with parameters n and p. Because n is typically very large and p is small, the number of winners has approximately a Poisson distribution with rate λ = np. (This makes ensuing calculations somewhat easier.) For example, if 1 million tickets are purchased, then the number of winning tickets is approximately Poisson distributed with λ = 1/7.

In 1992, an Australian syndicate purchased a huge number of tickets in the Virginia lottery in an attempt to assure itself of purchasing a winner. It worked. Although the syndicate wasn't able to purchase all 7 million possible tickets (it was about 5 million shy of this), it did purchase a winning
ticket, and there were no other winners. Therefore, the syndicate won a 20-year income stream worth approximately $27 million, with a net present value of approximately $14 million. This made the syndicate a big profit over the cost of the tickets it purchased. Two questions come to mind: (1) Is this hogging of tickets unfair to the rest of the public? (2) Is it a wise strategy on the part of the syndicate, or did it just get lucky?

To answer the first question, consider how the lottery changes for the general public with the addition of the syndicate. To be specific, suppose the syndicate can invest $7 million and obtain all of the possible tickets, making itself a sure winner. Also, suppose n people from the general public purchase tickets, each of which has 1 chance out of 7 million of being a winner. Finally, let R be the jackpot carried over from any previous lotteries. Then the total jackpot on this round will be R + 0.39(7,000,000 + n), because 39 cents from every ticket goes toward the jackpot. The number of winning tickets for the public will be Poisson distributed with λ = n/7,000,000. However, any member of the public who wins will necessarily have to share the jackpot with the syndicate, which is a sure winner. Use this information to calculate the expected amount the public will win. Then do the same calculation when the syndicate does not play. (In this case the jackpot will be smaller, but the public won't have to share any winnings with the syndicate.) For values of n and R that you can select, is the public better off with or without the syndicate? Would you, as a general member of the public, support a move to outlaw syndicates from hogging the tickets?

The second question is whether the syndicate is wise to buy so many tickets. Again, assume that the syndicate can spend $7 million and purchase each possible ticket. (Would this be possible in reality?) Also, assume that n members of the general public purchase tickets and that the carryover from the previous jackpot is R. The syndicate is thus assured of having a winning ticket, but is it assured of covering its costs? Calculate the expected net benefit (in terms of net present value) to the syndicate, using any reasonable values of n and R, to see whether the syndicate can expect to come out ahead.

Actually, the analysis suggested in the previous paragraph is not complete. There are at least two complications to consider. The first is the effect of taxes. Fortunately for the Australian syndicate, it did not have to pay federal or state taxes on its winnings, but a U.S. syndicate wouldn't be so lucky. Second, the payout from a $20 million jackpot, say, is actually made in 20 annual $1 million payments. The Lottery Commission pays the winner $1 million immediately and then purchases 19 "strips" (bonds with the interest not included) maturing at 1-year intervals with face value of $1 million each. Unfortunately, the lottery prize does not offer the liquidity of the Treasury issues that back up the payments. This lack of liquidity could make the lottery less attractive to the syndicate.

Footnote 1: Of the remaining 61 cents, the state takes about 50 cents. The other 11 cents is used to pay off lesser prize winners whose tickets match some, but not all, of the winning 6 numbers. To keep this case relatively simple, however, we ignore these lesser prizes and concentrate only on the jackpot.
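The first question in the case can be explored numerically before building a spreadsheet. The following is a minimal Python sketch, not the case's official solution. It assumes the syndicate holds exactly one copy of every ticket, so that k public winners collect a fraction k/(k + 1) of the jackpot; it uses the exact ticket count rather than the rounded 7 million; and the values of n and R, like all function and variable names, are hypothetical choices made only for illustration.

```python
from math import comb, exp

R_PICKS, M_NUMBERS = 6, 44
TICKETS = comb(M_NUMBERS, R_PICKS)   # number of distinct tickets (about 7 million)
P_WIN = 1 / TICKETS                  # chance that a single ticket wins
JACKPOT_SHARE = 0.39                 # dollars per $1 ticket added to the jackpot


def poisson_probs(lam, max_k):
    """Return [P(K=0), ..., P(K=max_k)] for a Poisson count with rate lam."""
    probs = [exp(-lam)]
    for k in range(1, max_k + 1):
        probs.append(probs[-1] * lam / k)
    return probs


def expected_public_winnings(n, carryover, syndicate_plays, max_k=100):
    """Expected jackpot dollars won by the public when n public tickets are sold.

    Assumption: if the syndicate plays, it buys every ticket exactly once,
    so it always holds one winning ticket and shares the jackpot equally
    with the k public winners.
    """
    lam = n * P_WIN
    tickets_sold = n + (TICKETS if syndicate_plays else 0)
    jackpot = carryover + JACKPOT_SHARE * tickets_sold
    probs = poisson_probs(lam, max_k)
    total = 0.0
    for k in range(1, max_k + 1):
        public_fraction = k / (k + 1) if syndicate_plays else 1.0
        total += probs[k] * jackpot * public_fraction
    return total


# Hypothetical inputs: 3 million public tickets, $10 million carried over.
n, carryover = 3_000_000, 10_000_000
with_syndicate = expected_public_winnings(n, carryover, syndicate_plays=True)
without_syndicate = expected_public_winnings(n, carryover, syndicate_plays=False)
print(f"Public's expected winnings with syndicate:    {with_syndicate:,.0f}")
print(f"Public's expected winnings without syndicate: {without_syndicate:,.0f}")
```

Rerunning the last few lines with other values of n and R shows how the comparison shifts as public participation and the carryover change.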
CHAPTER 6
Decision Making under Uncertainty

DECIDING WHETHER TO DEVELOP NEW DRUGS AT BAYER

The formal decision-making process discussed in this chapter is often used to make difficult decisions in the face of much uncertainty, large monetary values, and long-term consequences. Stonebraker (2002) chronicles one such decision-making process he performed for Bayer Pharmaceuticals in 1999. The development of a new drug is a time-consuming and expensive process that is filled with risks along the way. A pharmaceutical company must first get the proposed drug through preclinical trials, where the drug is tested on animals. Assuming this stage is successful (and only about half are), the company can then file an application with the Food and Drug Administration (FDA) to conduct clinical trials on humans. These clinical trials have three phases. Phase 1 is designed to test the safety of the drug on a small sample of healthy patients. Phase 2 is designed to identify the optimal dose of the new drug on patients with the disease. Phase 3 is a statistically designed study to prove the efficacy and safety of the new drug on a larger sample of patients with the disease. Failure at any one of these phases means that further testing stops and the drug is never brought to market. Of course, this means that all costs up to the failure point are lost. If the drug makes it through the clinical tests (and only about 25% of all drugs do so), the company can then apply to the FDA for permission to manufacture and market its drug in the United States. Assuming that the FDA approves, the company is then free to launch the drug in the marketplace.

The study involved the evaluation of a new drug for busting blood clots called BAY 57-9602, and it commenced at a time just prior to the first decision point: whether to conduct preclinical tests. This was the company's first formal use of decision making for evaluating a new drug, so to convince the company of the worth of such a study, Stonebraker did exactly what a successful management science study should do. He formulated the problem and its objectives; he identified risks, costs, and benefits; he involved key people in the organization to help provide the data needed for the decision analysis; and, because much of the resulting data consisted of educated guesses at best, he performed a thorough sensitivity analysis on the inputs. Although we are not told in the article how everything turned out, the analysis did persuade Bayer management to proceed in January 2000 with preclinical testing of the drug.

The article provides a fascinating look at how such a study should proceed. Because there is so much uncertainty, the key is determining probabilities and probability distributions for the various inputs. First, there are uncertainties in the various phases of testing. Each of these can be modeled with a
probability of success. For example, the chance of making it through preclinical testing was assessed to be about 65% for BAY 57-9602, although management preferred to use the more conservative benchmark of 50% (based on historical data on other drugs) for the decision analysis. Many of the other uncertain quantities, such as the eventual market share, are continuous random variables. Because the decision tree approach discussed in this chapter requires discrete random variables, usually with only a few possible values, Stonebraker used a popular three-point approximation for all continuous quantities. He asked experts to assess the 10th percentile, the 50th percentile, and the 90th percentile, and he assigned probabilities 0.3, 0.4, and 0.3 to these three values. [The validity of such an approximation is discussed in Keefer and Bodily (1983).]

After getting all such estimates of uncertain quantities from the company experts, the author examined the expected net present value (NPV) of all costs and benefits from developing the new drug. To see which of the various uncertain quantities affected the expected NPV most, he varied each such quantity, one at a time, from its 10th percentile to its 90th percentile, leaving the other inputs at their base 50th percentile values. This identified several quantities that the expected NPV was most sensitive to, including the peak product share, the price per treatment in the United States, and the annual growth rate. The expected NPV was not nearly as sensitive to other uncertain inputs, including the product launch date and the production process yield. Therefore, in the final decision analysis, Stonebraker treated the sensitive inputs as uncertain and the less sensitive inputs as certain at their base values. He also calculated the risk profile from developing the drug. This indicates the probability distribution of NPV, taking all sources of uncertainty into account. Although this risk profile was not exactly optimistic (90% chance of losing money using the conservative probabilities of success, 67% chance of losing money with the more optimistic product-specific probabilities of success), this risk profile compared favorably with Bayer's other potential projects. This evaluation, plus the rigor and defensibility of the study, led Bayer management to give the go-ahead on preclinical testing.

6.1 INTRODUCTION

This chapter provides a formal framework for analyzing decision problems that involve uncertainty. Our discussion includes the following:

• criteria for choosing among alternative decisions
• how probabilities are used in the decision-making process
• how early decisions affect decisions made at a later stage
• how a decision maker can quantify the value of information
• how attitudes toward risk can affect the analysis

Throughout, we employ a powerful graphical tool, a decision tree, to guide the analysis. A decision tree enables a decision maker to view all important aspects of the problem at once: the decision alternatives, the uncertain outcomes and their probabilities, the economic consequences, and the chronological order of events. We show how to implement decision trees in Excel by taking advantage of a very powerful and flexible add-in from Palisade called PrecisionTree.

Many examples of decision making under uncertainty exist in the business world, including the following:

• Companies routinely place bids for contracts to complete a certain project within a fixed time frame. Often these are sealed bids, where each company presents a bid for completing the project in a sealed envelope. Then
the envelopes are opened, and the low bidder is awarded the bid amount to complete the project. Any particular company in the bidding competition must deal with the uncertainty of the other companies' bids, as well as possible uncertainty regarding their cost to complete the project if they win the bid. The trade-off is between bidding low to win the bid and bidding high to make a larger profit.

• Whenever a company contemplates introducing a new product into the market, there are a number of uncertainties that affect the decision, probably the most important being the customers' reaction to this product. If the product generates high customer demand, the company will make a large profit. But if demand is low (and, after all, the vast majority of new products do poorly), the company could fail to recoup its development costs. Because the level of customer demand is critical, the company might try to gauge this level by test marketing the product in one region of the country. If this test market is a success, the company can then be more optimistic that a full-scale national marketing of the product will also be successful. But if the test market is a failure, the company can cut its losses by abandoning the product.

• Whenever manufacturing companies make capacity expansion decisions, they face uncertain consequences. First, they must decide whether to build new plants. If they don't expand and demand for their products is higher than expected, they will lose revenue because of insufficient capacity. If they do expand and demand for their products is lower than expected, they will be stuck with expensive underutilized capacity. Of course, in today's global economy, companies also need to decide where to build new plants. This decision involves a whole new set of uncertainties, including exchange rates, labor availability, social stability, competition from local businesses, and others.

• Banks must continually make decisions on whether to grant loans to businesses or individuals. As we all know, many banks
made many very poor decisions, especially on mortgage loans, during the years leading up to the financial crisis in 2008. They fooled themselves into thinking that housing prices would only increase, never decrease. When the bottom fell out of the housing market, banks were stuck with loans that could never be repaid.

• Utility companies must make many decisions that have significant environmental and economic consequences. For these companies it is not necessarily enough to conform to federal or state environmental regulations. Recent court decisions have found companies liable, for huge settlements, when accidents occurred, even though the companies followed all existing regulations. Therefore, when utility companies decide, say, whether to replace equipment or mitigate the effects of environmental pollution, they must take into account the possible environmental consequences (such as injuries to people) as well as economic consequences (such as lawsuits). An aspect of these situations that makes decision analysis particularly difficult is that the potential "disasters" are often extremely unlikely; hence, their probabilities are difficult to assess accurately.

• Sports teams continually make decisions under uncertainty. Sometimes these decisions involve long-run consequences, such as whether to trade for a promising but as yet untested pitcher in baseball. Other times these decisions involve short-run consequences, such as whether to go for a fourth down or kick a field goal late in a close football game. You might be surprised at the level
of quantitative sophistication in professional sports these days. Management and coaches typically do not make important decisions by gut feel. They employ many of the tools in this chapter and in other chapters of this book.

6.2 ELEMENTS OF DECISION ANALYSIS

Although decision making under uncertainty occurs in a wide variety of contexts, all problems have three common elements: (1) the set of decisions (or strategies) available to the decision maker, (2) the set of possible outcomes and the probabilities of these outcomes, and (3) a value model that prescribes monetary values for the various decision-outcome combinations. Once these elements are known, the decision maker can find an optimal decision, depending on the optimality criterion chosen.

Before moving on to realistic business problems, we discuss the basic elements of any decision analysis for a very simple problem. We assume that a decision maker must choose among three decisions, labeled D1, D2, and D3. Each of these decisions has three possible outcomes, labeled O1, O2, and O3.

6.2.1 Payoff Tables

At the time the decision must be made, the decision maker does not know which outcome will occur. However, once the decision is made, the outcome will eventually be revealed, and a corresponding payoff will be received. This payoff might actually be a cost, in which case it is indicated as a negative value. The listing of payoffs for all decision-outcome pairs is called the payoff table.¹ For our simple decision problem, this payoff table appears in Table 6.1. For example, if the decision maker chooses decision D2 and outcome O3 then occurs, a payoff of 30 is received.

A payoff table lists the payoff for each decision-outcome pair. Positive values correspond to rewards (or gains) and negative values correspond to costs (or losses).

Table 6.1 Payoff Table for Simple Decision Problem

                  Outcome
Decision     O1     O2     O3
D1           10     10     10
D2          -10     20     30
D3          -30     30     80

Footnote 1: In situations where all monetary consequences are costs, it is customary to list these costs in a cost table. In this
case, all monetary values are shown as positive costs.

[Margin note: A decision maker gets to decide which row of the payoff table she wants. However, she does not get to choose the column.]

This table shows that the decision maker can play it safe by choosing decision D1. This provides a sure 10 payoff. With decision D2, rewards of 20 or 30 are possible, but a loss of 10 is also possible. Decision D3 is even riskier; the possible loss is greater, and the maximum gain is also greater. Which decision would you choose? Would your choice change if the values in the payoff table were measured in thousands of dollars? The answers to these questions are what this chapter is all about. There must be a criterion for making choices, and this criterion must be evaluated so that the best decision can be identified. As you will see, it is customary to use one particular criterion for decisions involving moderate amounts of money.

Before proceeding, there is one very important point we need to emphasize: the distinction between good decisions and good outcomes. In any decision-making problem where there is uncertainty, the best decision can have less than optimal results; that is, you can be unlucky. Regardless of which decision you choose, you might get an outcome that, in hindsight, makes you wish you had made a different decision. For example, if you make decision D3, hoping for a large reward, you might get outcome O1, in which case you will wish you had chosen decision D1 or D2. Or if you choose decision D2, hoping to limit possible losses, you
might get outcome O3, in which case you will wish you had chosen decision D3. The point is that decision makers must make rational decisions, based on the information they have when the decisions must be made, and then live with the consequences. Second-guessing these decisions, just because of bad luck with the outcomes, is not appropriate.

FUNDAMENTAL INSIGHT
What Is a Good Decision?
In the context of decision making under uncertainty, a good decision is one that is based on the sound decision-making principles discussed in this chapter. Because the decision must usually be made before uncertainty is resolved, a good decision might have unlucky consequences. However, decision makers should not be criticized for unlucky outcomes. They should be criticized only if their analysis at the time the decision has to be made is faulty.

6.2.2 Possible Decision Criteria

What do we mean when we call a decision the "best" decision? We will eventually settle on one particular criterion for making decisions, but we first explore some possibilities. With respect to Table 6.1, one possibility is to choose the decision that maximizes the worst payoff. This criterion, called the maximin criterion, is appropriate for a very conservative (or pessimistic) decision maker. The worst payoffs for the three decisions are the minimums in the three rows: 10, -10, and -30. The maximin decision maker chooses the decision corresponding to the best of these: decision D1 with payoff 10. Such a criterion tends to avoid large losses, but it fails to even consider large rewards. Hence, it is typically too conservative and is seldom used.

The maximin criterion finds the worst payoff in each row of the payoff table and chooses the decision corresponding to the best of these.
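The maximin rule is simple enough to state as a few lines of code. The sketch below is only an illustration in Python (the book itself works in Excel); the dictionary layout and the function name `maximin` are hypothetical, and the negative payoffs are the losses described in the text for D2 and D3.

```python
# Payoff table for the simple decision problem: one row per decision,
# one entry per outcome (O1, O2, O3). Negative values are losses.
payoffs = {
    "D1": [10, 10, 10],
    "D2": [-10, 20, 30],
    "D3": [-30, 30, 80],
}


def maximin(payoff_table):
    """Return (decision, payoff) for the decision whose worst payoff is largest."""
    worst = {decision: min(row) for decision, row in payoff_table.items()}
    best_decision = max(worst, key=worst.get)
    return best_decision, worst[best_decision]


decision, guaranteed = maximin(payoffs)
print(decision, guaranteed)
```

The dictionary of row minimums makes the two steps of the criterion explicit: first the worst case per row, then the best of those worst cases.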
[Margin note: The maximin and maximax criteria make sense in some situations, but they are generally not used in real decision-making problems.]

At the other extreme, the decision maker might choose the decision that maximizes the best payoff. This criterion, called the maximax criterion, is appropriate for a risk taker (or optimist). The best payoffs for the three decisions are the maximums in the three rows: 10, 30, and 80. The maximax decision maker chooses the decision corresponding to the best of these: decision D3 with payoff 80. This criterion looks tempting because it focuses on large gains, but its very serious downside is that it ignores possible losses. Because this type of decision making could eventually bankrupt a company, the maximax criterion is also seldom used.

The maximax criterion finds the best payoff in each row of the payoff table and chooses the decision corresponding to the best of these.

6.2.3 Expected Monetary Value (EMV)

We have introduced the maximin and maximax criteria because (1) they are occasionally used to make decisions, and (2) they illustrate that there are several "reasonable" criteria for making decisions. In fact, there are other possible criteria that we will not discuss (although a couple are explored in the problems). Instead, we now focus on a criterion that is generally regarded as the preferred criterion in most decision problems. It is called the expected monetary value, or EMV, criterion. To motivate the EMV criterion, we first note that the maximin and maximax criteria make no reference to how likely the various outcomes are. However, decision makers typically have at least some idea of these likelihoods, and they ought to use this information in the decision-making process. After all, if outcome O1 in our problem is extremely unlikely, then the pessimist who uses maximin is
being overly conservative. Similarly, if outcome O3 is quite unlikely, then the optimist who uses maximax is taking an unnecessary risk.

The EMV approach assesses probabilities for each outcome of each decision and then calculates the expected payoff from each decision based on these probabilities. This expected payoff, or EMV, is a weighted average of the payoffs in any given row of the payoff table, weighted by the probabilities of the outcomes. You calculate the EMV for each decision, then choose the decision with the largest EMV. (Note that the terms "expected payoff" and "mean payoff" are equivalent. We will use them interchangeably.)

The expected monetary value, or EMV, for any decision is a weighted average of the possible payoffs for this decision, weighted by the probabilities of the outcomes. Using the EMV criterion, you choose the decision with the largest EMV. This is sometimes called "playing the averages."

Where do the probabilities come from? This is a difficult question to answer in general because it depends on each specific situation. In some cases the current decision problem is similar to those a decision maker has faced many times in the past. Then the probabilities can be estimated from the knowledge of previous outcomes. If a certain type of outcome occurred, say, in about 30% of previous situations, an estimate of its current probability might be 0.30.

However, there are many decision problems that have no parallels in the past. In such cases, a decision maker must use whatever information is available, plus some intuition, to assess the probabilities. For example, if the problem involves a new product decision, and one possible outcome is that a competitor will introduce a similar product in the coming year, the decision maker will have to rely on any knowledge of the market and the competitor's situation to assess the probability of this outcome. It is important to note that
this assessment can be very subjective. Two decision makers could easily assess the probability of the same outcome as 0.30 and 0.45, depending on their information and feelings, and neither could be considered "wrong." This is the nature of assessing probabilities subjectively in real business situations. Still, it is important for the decision maker to consult all relevant sources (historical data, expert opinions, government forecasts, and so on) when assessing these probabilities. As you will see, they are crucial to the decision-making process.

With this general framework in mind, let's assume that a decision maker assesses the probabilities of the three outcomes in Table 6.1 as 0.3, 0.5, and 0.2 if decision D2 is made, and as 0.5, 0.2, 0.3 if decision D3 is made.² Then the EMV for each decision is the sum of products of payoffs and probabilities:

EMV for D1: 10 (a sure thing)
EMV for D2: -10(0.3) + 20(0.5) + 30(0.2) = 13
EMV for D3: -30(0.5) + 30(0.2) + 80(0.3) = 15

These calculations lead to the optimal decision: Choose decision D3 because it has the largest EMV.

It is important to understand what the EMV of a decision represents and what it doesn't represent. For example, the EMV of 15 for decision D3 does not mean that you expect to gain 15 from this decision. The payoff table indicates that the result from D3 will be a loss of 30, a gain of 30, or a gain of 80; it will never be a gain of 15. The EMV is only a weighted average of the possible payoffs. As such, it can be interpreted in one of two ways. First, imagine that this situation can occur many times, not just once. If decision D3 is used each time, then on average, you will make a gain of about 15.
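The EMV arithmetic and the "many repetitions" interpretation can both be checked with a short script. This is a hedged Python sketch rather than the book's spreadsheet model: the names `payoffs`, `probabilities`, and `emv` are hypothetical, the payoffs include the losses described in the text, and the simulation (with an arbitrary seed and number of repetitions) is only an illustration of the long-run average.

```python
import random

# Payoffs by decision for outcomes O1, O2, O3 (negative values are losses).
payoffs = {
    "D1": [10, 10, 10],
    "D2": [-10, 20, 30],
    "D3": [-30, 30, 80],
}
# Outcome probabilities depend on the decision; D1 is a sure thing.
probabilities = {
    "D2": [0.3, 0.5, 0.2],
    "D3": [0.5, 0.2, 0.3],
}


def emv(payoff_row, probs):
    """Weighted average of payoffs, weighted by outcome probabilities."""
    return sum(x * p for x, p in zip(payoff_row, probs))


emvs = {"D1": float(payoffs["D1"][0])}  # sure payoff, no probabilities needed
emvs.update({d: emv(payoffs[d], probabilities[d]) for d in probabilities})
best = max(emvs, key=emvs.get)
print(emvs, "-> choose", best)

# "Playing the averages": repeat decision D3 many times and average the payoffs.
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = rng.choices(payoffs["D3"], weights=probabilities["D3"], k=200_000)
long_run_average = sum(draws) / len(draws)
print(f"Average payoff over {len(draws):,} repetitions of D3: {long_run_average:.2f}")
```

No single draw ever equals the EMV; only the average over many repetitions settles near it.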
About 50% of the time you will lose 30, about 20% of the time you will gain 30, and about 30% of the time you will gain 80. These average to 15. For this reason, using the EMV criterion is sometimes referred to as "playing the averages."

But what if the current situation is a "one-shot" deal that will not occur many times in the future? Then the second interpretation applies: EMV is still relevant as a criterion for making decisions under uncertainty. This is actually a point that has been debated in intellectual circles for years: what is the best criterion for making decisions? However, researchers have generally concluded that EMV makes sense, even for one-shot deals, as long as the monetary values are not too large. For situations where the monetary values are extremely large, we will introduce an alternative criterion in the last section of this chapter. Until then, however, we will use EMV.

FUNDAMENTAL INSIGHT
An EMV maximizer, by definition, is indifferent when faced with the choice between entering a gamble that has a certain EMV and receiving a sure dollar amount in the amount of the EMV. For example, consider a gamble where you flip a fair coin and win $0 or $1000 depending on whether you get a head or a tail. If you are an EMV maximizer, you are indifferent between entering this gamble, which has EMV $500, and receiving $500 for sure. Similarly, if the gamble is between losing $1000 and winning $500, based on the flip of the coin, and you are an EMV maximizer, you are indifferent between entering this gamble, which has EMV -$250, and paying a sure $250 to avoid the gamble. (This latter scenario is the basis of insurance.)

This is the gist of decision making under uncertainty. You develop a payoff table, assess probabilities of outcomes, calculate EMVs, and choose the decision with the largest EMV. However, before proceeding to examples, it is useful to introduce a few other concepts: sensitivity analysis, decision trees, and risk profiles.

Footnote 2: In a change from the previous edition of this book, we allow these probabilities to depend on the
decision that is made, which is often the case in real decision problems.

6.2.4 Sensitivity Analysis

Some of the quantities in a decision analysis, particularly the probabilities, are often intelligent guesses at best. Therefore, it is important, especially in real-world business problems, to accompany any decision analysis with a sensitivity analysis. Here we systematically vary inputs to the problem to see how (or if) the outputs (the EMVs and the best decision) change. For our simple decision problem, this is easy to do in a spreadsheet. The spreadsheet model is shown in Figure 6.1. (See the file Simple Decision Problem.xlsx.)

Figure 6.1 Spreadsheet Model of a Simple Decision Problem

      A              B      C      D      E      F
 1    Simple decision problem under uncertainty
 2
 3                          Outcome
 4                          O1     O2     O3     EMV
 5    Decision       D1     10     10     10     10
 6                   D2    -10     20     30     13
 7                   D3    -30     30     80     15
 8
 9    Probabilities
10                   D2     0.3    0.5    0.2
11                   D3     0.5    0.2    0.3

[Margin note: Usually, the most important information from a sensitivity analysis is whether the optimal decision continues to be optimal as one or more inputs change.]

After entering the payoff table and probabilities, calculate the EMVs in column F as a sum of products, using the formula

=SUMPRODUCT(C6:E6,C10:E10)

in cell F6 and copying it down. (A link to the sure 10 for D1 is entered in cell F5.) Then it is easy to change any of the inputs and see whether the optimal decision continues to be D3. For example, you can check that if the probabilities for D3 change only slightly to 0.6, 0.2, and 0.2, the EMV for D3 changes to 4. Now D3 is the worst decision and D2 is the best, so
it appears that the optimal decision is quite sensitive to the assessed probabilities. As another example, if the probabilities remain the same but the last payoff for D2 changes from 30 to 45, then its EMV changes to 16, and D2 becomes the best decision.

Given a simple spreadsheet model, it is easy to make a number of ad hoc changes to inputs, as we have done here, to answer specific sensitivity questions. However, it is often useful to conduct a more systematic sensitivity analysis, as we will do later in the chapter. The important thing to realize at this stage is that a sensitivity analysis is not an afterthought to the overall analysis; it is a key component of the analysis.

### 6.2.5 Decision Trees

The decision problem we have been analyzing is very basic. You make a decision, you then observe an outcome, you receive a payoff, and that is the end of it. Many decision problems are of this basic form, but many are more complex. In these more complex problems, you make a decision, you observe an outcome, you make a second decision, you observe a second outcome, and so on. A graphical tool called a decision tree has been developed to represent decision problems. Decision trees can be used for any decision problems, but they are particularly useful for the more complex types. They clearly show the sequence of events (decisions and outcomes), as well as probabilities and monetary values. The decision tree for the simple problem appears in Figure 6.2. This tree is based on one we drew and calculated by hand. We urge you to try this on your own, at least once. However, later in the chapter we will introduce an Excel add-in that automates the procedure.
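The spreadsheet calculations of Section 6.2.4 can also be sketched outside Excel. The following Python snippet is only an illustration, not part of the book's spreadsheet model; the function names `emv` and `best_decision` and the dictionary layout are our own assumptions. It recomputes the EMVs for the simple decision problem and repeats the two ad hoc sensitivity checks described in that section.

```python
# Payoffs for outcomes O1, O2, O3 under each decision (simple decision problem).
payoffs = {
    "D1": [10, 10, 10],
    "D2": [-10, 20, 30],
    "D3": [-30, 30, 80],
}
# Outcome probabilities depend on the decision. D1 is a sure 10, so the
# distribution used for it here is arbitrary (an assumption for convenience).
probs = {
    "D1": [1.0, 0.0, 0.0],
    "D2": [0.3, 0.5, 0.2],
    "D3": [0.5, 0.2, 0.3],
}


def emv(values, probabilities):
    """Expected monetary value: a sum of products, like SUMPRODUCT."""
    return sum(v * p for v, p in zip(values, probabilities))


def best_decision(payoffs, probs):
    """Return the EMV-maximizing decision and the EMVs of all decisions."""
    emvs = {d: emv(payoffs[d], probs[d]) for d in payoffs}
    return max(emvs, key=emvs.get), emvs


# Base case.
base_best, base_emvs = best_decision(payoffs, probs)

# Check 1: probabilities for D3 change to 0.6, 0.2, 0.2.
probs_1 = {**probs, "D3": [0.6, 0.2, 0.2]}
check1_best, check1_emvs = best_decision(payoffs, probs_1)

# Check 2: last payoff for D2 changes from 30 to 45.
payoffs_2 = {**payoffs, "D2": [-10, 20, 45]}
check2_best, check2_emvs = best_decision(payoffs_2, probs)

for label, best, emvs in [
    ("base case", base_best, base_emvs),
    ("D3 probabilities changed", check1_best, check1_emvs),
    ("D2 last payoff changed", check2_best, check2_emvs),
]:
    rounded = {d: round(v, 2) for d, v in emvs.items()}
    print(f"{label}: best = {best}, EMVs = {rounded}")
```

Each check changes one input and recomputes every EMV from scratch, which mirrors editing a cell and letting the spreadsheet recalculate.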
Figure 6.2 Decision Tree for Simple Decision Problem

To explain this decision tree, we introduce a number of decision tree conventions that have become standard.

Decision Tree Conventions

1. Decision trees are composed of nodes (circles, squares, and triangles) and branches (lines).
2. The nodes represent points in time. A decision node (a square) represents a time when the decision maker makes a decision. A probability node (a circle) represents a time when the result of an uncertain outcome becomes known. An end node (a triangle) indicates that the problem is completed: all decisions have been made, all uncertainty has been resolved, and all payoffs and costs have been incurred. (When people draw decision trees by hand, they often omit the actual triangles, as we have done in Figure 6.2. However, we still refer to the right-hand tips of the branches as the end nodes.)
3. Time proceeds from left to right. This means that any branches leading into a node (from the left) have already occurred. Any branches leading out of a node (to the right) have not yet occurred.
4. Branches leading out of a decision node represent the possible decisions; the decision maker can choose the preferred branch. Branches leading out of probability nodes represent the possible outcomes of uncertain events; the decision maker has no control over which of these will occur.
5. Probabilities are listed on probability branches. These probabilities are conditional on the events that have already been observed (those to the left). Also, the probabilities on branches leading out of any probability node must sum to 1.
6. Monetary values are shown to the right of the end nodes. (As we discuss shortly, some monetary values are also placed under the branches where they occur in time.)
7. EMVs are calculated through a "folding-back" process, discussed next. They are shown above the various nodes. It is then
customary to mark the optimal decision branches in some way. We have marked ours with a small notch.

The decision tree in Figure 6.2 follows these conventions. The decision node comes first (to the left) because the decision maker must make a decision before observing the uncertain outcome. The probability nodes then follow the decision branches, and the probabilities appear above their branches. (Actually, there is no need for a probability node after the D1 branch because its monetary value is a sure 10.) The ultimate payoffs appear next to the end nodes, to the right of the probability branches. The EMVs above the probability nodes are for the various decisions. For example, the EMV for the D2 branch is 13. The maximum of the EMVs, 15, is written above the decision node. Because it corresponds to D3, we put a notch on the D3 branch to indicate that this decision is optimal.

This decision tree is almost a direct translation of the spreadsheet model in Figure 6.1. Indeed, the decision tree is overkill for such a simple problem; the spreadsheet model provides all of the required information. However, decision trees are very useful in business problems. First, they provide a graphical view of the whole problem. This can be useful in its own right for the insights it provides, especially in more complex problems. Second, the decision tree provides a framework for doing all of the EMV calculations. Specifically, it allows you to use the following folding-back procedure to find the EMVs and the optimal decision.

Folding-Back Procedure

Starting from the right of the decision tree and working back to the left:

1. At each probability node, calculate an EMV (a sum of products of monetary values and probabilities).
2. At each decision node, take a maximum of EMVs to identify the optimal decision.

The folding-back process is a systematic way of calculating EMVs in a decision tree and thereby identifying the optimal decision strategy.

This is exactly what we did in Figure 6.2. At each probability node, we calculated EMVs in the usual way (sums of products) and wrote them above the nodes. Then at the decision node, we took the maximum of the three EMVs and wrote it above this node. Although this procedure entails more work for more complex decision trees, the same two steps, taking EMVs at probability nodes and taking maximums at decision nodes, are the only arithmetic operations required. In addition, the PrecisionTree add-in in the next section does the folding-back calculations for you.

### 6.2.6 Risk Profiles

In our small example, each decision leads to three possible monetary payoffs with various probabilities. In more complex problems, the number of outcomes could be larger, maybe considerably larger. It is then useful to represent the probability distribution of the monetary values for any decision graphically. Specifically, we show a "spike" chart, where the spikes are located at the possible monetary values, and the heights of the spikes correspond to the probabilities. In decision-making contexts, this type of chart is called a risk profile. By looking at the risk profile for a particular decision, you can see the risks and rewards involved. By comparing risk profiles for different decisions, you can gain more insight into their relative strengths and weaknesses.

The risk profile for a decision is a "spike" chart that represents the probability distribution of monetary outcomes for this decision.
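The two folding-back steps can be sketched as a short recursion. This is a minimal illustration under our own assumptions: the tuple-based tree representation and the names `fold_back`, `best_branch`, `chance`, and `risk_profile` are hypothetical and do not reflect how the PrecisionTree add-in stores or evaluates trees. The example applies the recursion to the simple decision problem and also reads off the risk profile for D3 as a mapping from monetary value to probability.

```python
# Hypothetical node representation:
#   ("end", value)
#   ("chance", [(probability, child_node), ...])
#   ("decision", [(label, child_node), ...])

def fold_back(node):
    """Return the EMV of a node by working from the end nodes back to it."""
    kind = node[0]
    if kind == "end":
        return node[1]
    if kind == "chance":
        total = sum(p for p, _ in node[1])
        if abs(total - 1.0) > 1e-9:
            raise ValueError("probabilities out of a probability node must sum to 1")
        # Step 1: EMV is a sum of products of probabilities and values.
        return sum(p * fold_back(child) for p, child in node[1])
    if kind == "decision":
        # Step 2: take the maximum of the EMVs of the decision branches.
        return max(fold_back(child) for _, child in node[1])
    raise ValueError(f"unknown node type: {kind}")


def best_branch(decision_node):
    """Label of the EMV-maximizing branch out of a decision node."""
    return max(decision_node[1], key=lambda branch: fold_back(branch[1]))[0]


def chance(*branches):
    """Probability node whose branches lead directly to end nodes."""
    return ("chance", [(p, ("end", v)) for p, v in branches])


def risk_profile(chance_node):
    """Map each monetary value to its probability (end-node children only)."""
    profile = {}
    for p, child in chance_node[1]:
        profile[child[1]] = profile.get(child[1], 0.0) + p
    return profile


# Tree for the simple decision problem; D1 is a sure 10, so it needs no
# probability node.
d3_node = chance((0.5, -30), (0.2, 30), (0.3, 80))
tree = ("decision", [
    ("D1", ("end", 10)),
    ("D2", chance((0.3, -10), (0.5, 20), (0.2, 30))),
    ("D3", d3_node),
])

root_emv = fold_back(tree)
choice = best_branch(tree)
d3_profile = risk_profile(d3_node)

print(f"optimal decision: {choice}, EMV = {root_emv:.1f}")
print(f"risk profile for D3: {d3_profile}")
```

Because `fold_back` calls itself on child nodes, the same two operations cover trees with several decision and probability nodes in sequence; `risk_profile` as written handles only a single probability node followed by end nodes.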
The risk profile for decision D3 appears in Figure 6.3. It shows that a loss of 30 has probability 0.5, a gain of 30 has probability 0.2, and a gain of 80 has probability 0.3. The risk profile for decision D2 is similar, except that its spikes are above the values -10, 20, and 30, and the risk profile for decision D1 is a single spike of height 1 over the value 10. (The finished version of the Simple Decision Problem.xlsx file provides instructions for constructing such a chart with Excel tools.)

Figure 6.3 Risk Profile for Decision D3

A risk profile shows the complete probability distribution of monetary outcomes, but you typically use only its mean, the EMV, for making decisions.

Note that the EMV for any decision is a summary measure of the complete risk profile; it is the mean of the corresponding probability distribution. Therefore, when you use the EMV criterion for making decisions, you are not using all of the information in the risk profiles; you are comparing only their means. Nevertheless, risk profiles can be useful as extra information for making decisions. For example, a manager who sees too much risk in the risk profile of the EMV-maximizing decision might choose to override this decision and instead choose a somewhat less risky alternative.

We now apply all of these concepts to the following example.

EXAMPLE 6.1 Bidding for a Government Contract at SciTools

SciTools Incorporated, a company that specializes in scientific instruments, has been invited to make a bid on a government contract. The contract calls for a specific number of these instruments to be delivered
during the coming year. The bids must be sealed (so that no company knows what the others are bidding), and the low bid wins the contract. SciTools estimates that it will cost $5000 to prepare a bid and $95,000 to supply the instruments if it wins the contract. On the basis of past contracts of this type, SciTools believes that the possible low bids from the competition, if there is any competition, and the associated probabilities are those shown in Table 6.2. In addition, SciTools believes there is a 30% chance that there will be no competing bids. What should SciTools bid to maximize its EMV?

Objective: To develop a decision model that finds the EMV for various bidding strategies and indicates the best bidding strategy.

Table 6.2 Data for Bidding Example

| Low Bid | Probability |
|---|---|
| Less than $115,000 | 0.2 |
| Between $115,000 and $120,000 | 0.4 |
| Between $120,000 and $125,000 | 0.3 |
| Greater than $125,000 | 0.1 |

Where Do the Numbers Come From?

The company has probably done a thorough cost analysis to estimate its cost to prepare a bid and its cost to manufacture the instruments if it wins the contract. (Actually, even if there is uncertainty in the manufacturing cost, the only value required for the decision problem is the mean manufacturing cost.) The company's estimates of whether, or how, the competition will bid are probably based on previous bidding experience and some subjectivity. This is discussed in more detail next.

Solution

Let's examine the three elements of SciTools's problem. First, SciTools has two basic strategies: submit a bid or do not submit a
bid. If SciTools submits a bid, then it must decide how much to bid. Based on the cost to SciTools to prepare the bid and supply the instruments, there is clearly no point in bidding less than $100,000; SciTools wouldn't make a profit even if it won the bid. (Actually, this isn't totally true. Looking ahead to future contracts, SciTools might make a low bid just to "get in the game" and gain experience. However, we won't consider such a possibility here.) Although any bid amount over $100,000 might be considered, the data in Table 6.2 suggest that SciTools might limit its choices to $115,000, $120,000, and $125,000.³

The next element of the problem involves the uncertain outcomes and their probabilities. We have assumed that SciTools knows exactly how much it will cost to prepare a bid and how much it will cost to supply the instruments if it wins the bid. (In reality, these are probably only estimates of the actual costs, and a follow-up study could perform a sensitivity analysis on these quantities.) Therefore, the only source of uncertainty is the behavior of the competitors: will they bid, and if so, how much? From SciTools's standpoint, this is difficult information to obtain. The behavior of the competitors depends on (1) how many competitors are likely to bid and (2) how the competitors assess their costs of supplying the instruments. Nevertheless, we assume that SciTools has been involved in similar bidding contests in the past and can reasonably predict competitor behavior from past competitor behavior. The result of such prediction is the assessed probability distribution in Table 6.2 and the 30% estimate of the probability of no competing bids.

The last element of the problem is the value model that transforms decisions and outcomes into monetary values for SciTools. The value model is straightforward in this example. If SciTools decides not to bid, its monetary value is $0: no gain, no loss. If it makes a bid and is underbid by a competitor, it loses $5000, the cost of preparing the bid. If it bids B dollars and
wins the contract, it makes a profit of B minus $100,000; that is, B dollars for winning the bid, minus $5000 for preparing the bid and $95,000 for supplying the instruments. For example, if it bids $115,000 and the lowest competing bid, if any, is greater than $115,000, then SciTools wins the bid and makes a profit of $15,000.

³ The problem with a bid such as $117,000 is that the data in Table 6.2 make it impossible to calculate the probability of SciTools winning the contract if it bids this amount. Other than this, however, there is nothing that rules out such "in-between" bids.

Developing the Payoff Table

The corresponding payoff table, along with probabilities of outcomes, appears in Table 6.3. At the bottom of the table, the probabilities of the various outcomes are listed. For example, the probability tha