Quantitative Business Analysis Probability and Statistics
Quantitative Business Analysis Probability and Statistics ISDS 361
Cal State Fullerton
This 23 page Class Notes was uploaded by Mr. Jamey Collins on Wednesday September 30, 2015. The Class Notes belongs to ISDS 361 at California State University - Fullerton taught by Nicholas Farnum in Fall.
Date Created: 09/30/15
Numerical Summaries

Two of the most important features of a data set that a histogram shows are:
• Central tendency: the data tend to stack up around some typical or middle value.
• Dispersion: the data points are not all the same; some are larger than others.

[Figure: histogram with the center of the data (location) and the data variability marked]

SECTION 4.1 Measures of Central Location

The Arithmetic Mean
• In statistics, measures of location are often called mean values, or simply means.
• There are several (in fact, infinitely many) means, each using a slightly different method of describing the 'location' or center of a set of data.
• To distinguish between them we use names such as the arithmetic mean, the weighted mean, the geometric mean, and so forth.
• For instance, the familiar average of a set of numbers is also called the arithmetic mean.
• The arithmetic mean of the measurements $x_1, x_2, x_3, \ldots, x_n$ is found by dividing their sum by n. We use the notation $\bar{x}$, which is read "x bar", to denote the result:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

• In statistical applications, data usually arise by drawing samples from an ongoing process or population, so $\bar{x}$ is usually called the sample mean.
• The symbol $\sum_{i=1}^{n} x_i$ is used to denote the sum of the data $x_1 + x_2 + x_3 + \cdots + x_n$. The upper case Greek letter $\Sigma$, which corresponds to the English upper case S, stands for the word "sum", and the expression $\sum_{i=1}^{n} x_i$ is read "the sum of the values $x_i$ as the subscript i runs from i = 1 up through i = n". The $\Sigma$ notation is a great timesaver in statistics because we are forever summing collections of numbers, and it is convenient to have a simple notation like $\sum_{i=1}^{n} x_i$ so we can avoid writing long phrases like the one in quotes in the previous sentence. With this notation the sample mean can be compactly written as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• $\bar{x}$ always coincides with the center of gravity of the data. That is, if you imagine the data as equal-sized weights spread out on a balancing beam (see the following figure), then $\bar{x}$ marks the position at which a fulcrum would be placed to exactly balance
these weights. This center-of-gravity property is handy for quickly estimating $\bar{x}$, with no calculation, directly from a histogram of the data.

[Figure: data points on a balance beam; the sample mean $\bar{x}$ is the balancing point]

Example: The table below shows a company's monthly ending cash balance for several months. Using the center-of-gravity property, it appears that the mean lies somewhere between 120,000 and 160,000, although it is difficult to pinpoint it exactly by eye. To calculate the exact value of the sample mean, we would first find the sum 188588 + 132691 + ... + 138110 = 7,897,724 and then divide by 50, the number of data points used in the sum, to find $\bar{x}$ = 157,954.48.

Monthly Ending Cash Balance (in $)
188588  164149  136332  189911   84703
132691  250420  211840  197800  127134
125246  218811  218484  181532  116383
 27040  168177  181973  205211  191439
123139   97844  132985  199288  213784
200466  158587  194562   62536  165292
118707  172195  141901  208126  151870
 74883  180930   71733  135596  129598
233390  187027  147146  139364  136336
128652  173568  139400  222846  138110

[Figure: Histogram of Monthly Ending Cash Balances; horizontal axis: balance, 0 to 300,000]

THE MEDIAN
• The sample mean is very sensitive to extremely large or small values in the data.
• Example: A sample of 10 people, one of whom was a highly paid sports professional (see figure below).

[Figure: The mean and median of a sample of 10 people's incomes]

• One solution to this problem is to use the median.
• The median of a set of numbers is defined to be their 'middle' value, which is found by first sorting the data from smallest to largest and then counting halfway into this sorted list. If the sample size n happens to be an odd number, then there will be exactly one middle value. However, if n is even, then there will be two middle values, and in this case we define the median to be the average of these two values.

SECTION 4.2 Measures of Variability/Dispersion/Risk

The Range
• The most intuitive measure of dispersion in a set of data is the range, which is simply the distance between the largest and smallest
measurements in the data.
• To find the range of a set of measurements:
1. Sort the data from smallest to largest. With very small sets of data, a visual scan of the data may be all that is necessary.
2. Letting m and M denote the minimum and maximum data values, the range R is simply the difference between them: R = M - m.
Because the minimum m can never exceed the maximum M, the range is never negative, as is true of all measures of dispersion.

Example: In a manufacturing operation, parts are made that are supposed to be 0.5 inches in length. A sample of 5 recently made parts has lengths .49, .50, .47, .53, and .51. For this data the range is R = M - m = .53 - .47 = 0.06 inches.

• The range is most useful when working with small samples, where n is less than 10 or so.
• With large sample sizes, the range encounters two problems that make it unsuitable as a measure of dispersion:
1. As you add additional readings to your data set, the range can increase.
Example: Consider what would happen if you wanted to describe the variability in the cash balance data above by recalculating the range each time a new month's data becomes available. The following table shows what would happen for the first 10 months (reading down the first column of data). The range increases from 55,898 up to 206,351, about a 269% increase. It would be difficult to point to any of these ranges as being 'typical' of the cash balance data.

Successive ranges for the first 10 cash balance values
 k    x_k      Range of items x_1 through x_k
 1   188588
 2   132691     55898
 3   125246     63342
 4    27040    161548
 5   123139    161548
 6   200466    173426
 7   118707    173426
 8    74883    173426
 9   233390    206351
10   128652    206351

2. The range does not describe the dispersion among the remaining items in the data, i.e., the data between the maximum and minimum.
Example: The following graph shows histograms of three distinctly different sets of data having the same range, because the maximum and minimum values are the same in each data set. However, the dispersion of the rest of the data varies quite a bit among
the data sets.

[Figure: A Drawback of the Range: different sets of data having the same range. Three histograms on an axis from 5 to 20, each with range = 20 - 5 = 15]

THE VARIANCE AND STANDARD DEVIATION
• To overcome some of the shortcomings of the range, we need a measure of dispersion that is better able to distinguish between the variation exhibited by different sets of data. One approach is to measure the distance of each individual data point from some central point, such as the sample mean, and then combine all these distances into one overall measure of dispersion.
• With the sample mean $\bar{x}$ as the location of the 'center' of the data, the distances $x_1 - \bar{x},\ x_2 - \bar{x},\ x_3 - \bar{x},\ \ldots,\ x_n - \bar{x}$ of each data point from the center can be averaged.
• However, some of these distances will be negative, since some of the $x_i$'s must fall to the left of $\bar{x}$.
• A simple way of removing the negative signs is to square the distances, i.e., to use $(x_i - \bar{x})^2$ instead of $x_i - \bar{x}$.
• The last step is to combine all these new distances into one overall measure, called the sample variance and denoted by $s^2$. We 'average' these squared distances, dividing by n - 1, not by n (to be explained later):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

To illustrate the calculation of $s^2$, consider the numbers .49, .50, .47, .53, and .51. The first step is to find the sample mean:

$$\bar{x} = \frac{.49 + .50 + .47 + .53 + .51}{5} = .50$$

Next, the distances $x_i - \bar{x}$ are calculated and their squares are summed:

i    x_i    x_i - x̄                (x_i - x̄)^2
1    .49    .49 - .50 = -.01       (-.01)^2 = .0001
2    .50    .50 - .50 =  .00       ( .00)^2 = .0000
3    .47    .47 - .50 = -.03       (-.03)^2 = .0009
4    .53    .53 - .50 =  .03       ( .03)^2 = .0009
5    .51    .51 - .50 =  .01       ( .01)^2 = .0001
                                   Sum      = .0020

The final step is to divide .002 by n - 1 = 5 - 1 = 4, which gives a sample variance of

$$s^2 = \frac{.002}{5 - 1} = \frac{.002}{4} = .0005$$

• There are two reasons for the division by n - 1 in the formula for $s^2$:
1. Using n - 1 in the formula for $s^2$ makes it a better estimate of the variability in the process from which we have obtained the data.
2. If you look carefully, there are effectively only n - 1 terms in the sum $(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$, so n - 1 really is the proper divisor. Because the deviations $x_i - \bar{x}$ always sum to zero, any one of them is determined by the other n - 1.
The concept that the number of terms in a sum of squared quantities can be reduced somewhat will recur throughout this text. Statisticians call this reduced
number of terms the degrees of freedom associated with the sum of squares.

THE SAMPLE STANDARD DEVIATION
• There is a slight problem with $s^2$: it is measured in square units.
1. Example: If the sample variance of 5 people's heights (expressed in inches) is 9, then you would have to report that $s^2$ was 9 square inches.
2. The cash balance data above have a sample variance of 2,268,902,689 square dollars.
• The solution is to take the square root of $s^2$. The resulting quantity is denoted by the letter s, which stands for the sample standard deviation.
• s is just as good a measure of dispersion as $s^2$ is, and s is measured in the same units as the data.
1. Example: The standard deviation of the people's heights above would now be reported as $s = \sqrt{9} = 3$ inches.
2. The standard deviation of the cash balance data is $s = \sqrt{2{,}268{,}902{,}689} = 47{,}633$ dollars.
• What does s tell you about the data?
1. The more dispersed a set of data is around its mean, the larger s will be.
2. s cannot discern the direction of the variability in the data (see figure).

[Figure: Return (in %) on two types of investments, each on an axis from 5 to 20. Mean and standard deviation are the same in each case, but the risks are different]

COMPUTING THE STANDARD DEVIATION
• Be able to use your calculator and Excel to calculate s. This web site shows how to use your calculator for entering data and then computing $\bar{x}$ and s: http://www.geocities.com/calculatorheln
• There are also shortcut formulas for $s^2$ and s (page 109 of the text). These formulas are based on the algebraic identity

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$

Interpreting the sample standard deviation s
1. The sample mean $\bar{x}$ and range R are easy to compute and easy to interpret.
Example: If you examine a sample of invoices and calculate that the mean time it takes to ship orders to your customers is 25 days, then you can immediately infer that the shipment times are generally near 25 days, some of them being shorter than 25 days and some a little longer. If, in addition, you compute the range and find R = 6, you can quickly state that there is no more than 6 days' difference between the shortest
shipment time and the longest for this data.
2. Interpreting the sample standard deviation s is not as immediate. The key thing to remember about interpreting s is that it is not a 'stand-alone' measure like the sample mean and range are. There is no absolute scale that says an s of 100 is large while an s of 10 is small. The standard deviation can only be interpreted in relation to the mean. In a sense, it should come as no surprise that $\bar{x}$ must be involved in any interpretation of s, since the very definition of s involves the sample mean.
3. s estimates the average or typical deviation of a data value from the center $\bar{x}$.

Z scores
• Converting the raw distances from $\bar{x}$ into distances measured in terms of s is called standardization.
• Each ratio $z = \dfrac{x_i - \bar{x}}{s}$ that we calculate when comparing the distance $x_i - \bar{x}$ to the standard deviation is called a z-score.
• z-scores can be used to describe the distance that any point is from the mean, not just numbers belonging to the data set.

Example: The scores on a 100-point vocational placement exam have a mean of 70 and a standard deviation of 10. Two job applicants who took this exam received scores of 60 and 85. What were their z-scores? The applicant with a score of 85 has a z-score of

$$z = \frac{85 - 70}{10} = 1.5$$

That is, this applicant's score was 1.5 standard deviations above the average of all people taking the exam. Similarly, the other applicant has a z-score of

$$z = \frac{60 - 70}{10} = -1.0$$

which is 1.0 standard deviations below the average score on the exam.

• What is the reason for converting numbers into z-scores?
1. The answer goes back to the beginning of this section, where it is noted that there is no absolute scale for comparing various values of s.
2. z-scores, on the other hand, can be compared against fixed scales, one of which is described below (the Empirical Rule). Using such scales, we can make more precise statements about how we expect the process generating the data to behave. In essence, the scales against which we compare z-scores tell us when a particular value of z is "too large" or
"about right" and so on. z-scores that are large in magnitude are associated with unlikely events, while small z-scores are indicative of more likely events. The scales that we use to determine what is "likely" and what is "unlikely" are called probability distributions in upcoming chapters.

THE EMPIRICAL RULE (page 111)
There is one scale for interpreting z-scores that is exceedingly simple to use and that applies in a wide variety of situations. It provides an easily remembered rule for interpreting the sample standard deviation, and once you become proficient in its use, you will have taken a large step towards 'thinking statistically' and understanding the topics in the rest of this course. The Empirical Rule, as it is sometimes called, is a brief table that tells you approximately what proportion of your data falls within certain distances (measured as z-scores) from the mean. The rule is based on the fact that a great many processes and populations generate data whose histograms look remarkably similar to a particular bell-shaped curve, the so-called Normal curve (discussed in Chapter 8 of the text).
• The Empirical Rule is often stated in terms of $\bar{x}$ and s as follows:
1. About 68% of the data fall within 1 standard deviation of the mean.
2. About 95% of the data fall within 2 standard deviations of the mean.
3. Almost all (over 99%) of the data fall within 3 standard deviations of the mean.
• In practice, the Empirical Rule acts as a rough table of probabilities. You can use it to give rough bounds within which the process data are likely to fall, or you can evaluate a particular value to see how likely it is.

Example: Cash balance data
1. For this data, $\bar{x}$ = 157,954.48 and s = 47,633.00.
2. The Empirical Rule gives the following approximate bounds. About 68% of the time, the ending cash balance should be within 1 standard deviation of the mean. Since 1 standard deviation is equivalent to 47,633.00, this translates into values of 157,954.48 - 47,633.00 = 110,321.48 (i.e., one standard deviation below the mean) and 157,954.48 + 47,633.00 = 205,587.48, or
1 standard deviation above the mean. About 95% of the time, the ending cash balance will be within 2 standard deviations of the mean, i.e., between 157,954.48 - 2(47,633.00) = 62,688.48 and 157,954.48 + 2(47,633.00) = 253,220.48. Almost all (over 99%) of the time, the ending cash balance will fall within 3 standard deviations of the mean, i.e., between 157,954.48 - 3(47,633.00) = 15,055.48 and 157,954.48 + 3(47,633.00) = 300,853.48.

These results can be used in numerous ways. Suppose, for instance, that the company is expecting to purchase an expensive piece of equipment costing $50,000 next month. Is it likely that this purchase will cause the ending cash balance to be negative that month, which would require the company to get a short-term bank loan to make up the difference? The answer is that most likely no loan will be needed, because about 95% of the time the cash balance should fall in the 62,688.48 to 253,220.48 range. There is only a relatively small chance (about 5% or less) that the cash balance will drop below 62,688.48, so the equipment purchase should not create any additional interest costs stemming from a bank loan.

Correlation
Correlation coefficients measure the strength of the relationship between two variables. There are several different types of correlation coefficients in use. The one best suited for business applications is Pearson's correlation coefficient, r.

Calculating r

$$r = \frac{\sum xy - \frac{1}{n}\sum x \sum y}{\sqrt{\left[\sum x^2 - \frac{1}{n}\left(\sum x\right)^2\right]\left[\sum y^2 - \frac{1}{n}\left(\sum y\right)^2\right]}}$$

Example: sales data (n = 15)

$$\sum xy - \tfrac{1}{n}\sum x \sum y = 1979.90 - \frac{(546.5)(55.0)}{15} = -23.93333$$
$$\sum x^2 - \tfrac{1}{n}\left(\sum x\right)^2 = 20001.25 - \frac{(546.5)^2}{15} = 90.43333$$
$$\sum y^2 - \tfrac{1}{n}\left(\sum y\right)^2 = 230.36 - \frac{(55.0)^2}{15} = 28.69333$$

So

$$r = \frac{-23.93333}{\sqrt{(90.43333)(28.69333)}} = -.46984$$

Alternate way of calculating r
1. From the formula for r you can quickly show that $r^2 = \dfrac{SSR}{SST}$.
2. If you have a regression printout available, then you can look up the coefficient of determination, SSR/SST.
3. Taking the square root of SSR/SST gives the magnitude of r, but be careful to affix the correct sign, which always matches the sign of $b_1$.
Example: sales data. From the regression printout, $r^2$ = .220748 and the sign of $b_1$ is negative, so $r = -\sqrt{.220748} = -.46984$.

Properties of Pearson's correlation coefficient
1. r is symmetric. That is, the
correlation between x and y is the same as the correlation between y and x.
2. r is dimensionless. That is, r has no units of measure, even though the x and the y data do have units of measure.
3. For ANY set of data, $-1 \le r \le 1$.
4. If r = 1, the data points MUST fall on a straight line with a positive slope.
5. If r = -1, the data points MUST fall on a straight line with a negative slope.
6. If r = 0, the variables are said to be uncorrelated. However, r = 0 does not necessarily mean that there is no relationship between the x and y variables; it only rules out a linear one.
7. To give a verbal interpretation to r, always square it, because then it has the same meaning as the coefficient of determination.
Example: r = .80 is 4 times as strong a relationship as r = .40. Reason:
$r^2 = (.80)^2 = .64$, i.e., 64% of the variation in one variable is explained by variation in the other variable.
$r^2 = (.40)^2 = .16$, i.e., 16% of the variation in one variable is explained by variation in the other variable.
8. The direction of the correlation will always match (agree with) the sign of $b_1$.
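The hand calculations in these notes can be checked with a short Python sketch. It is only an illustration under stated assumptions, not part of the original notes: the helper names (`z_score`, `empirical_rule_bounds`, `pearson_r`) and the two tiny (x, y) data sets are hypothetical, while the part lengths, the exam scores, and the cash balance mean and standard deviation are the values used in the examples. It relies on Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1 as the notes do.

```python
import math
import statistics

# Part lengths (inches) from the range and variance examples.
lengths = [0.49, 0.50, 0.47, 0.53, 0.51]

mean = statistics.mean(lengths)            # arithmetic mean, x-bar
median = statistics.median(lengths)        # middle value of the sorted data
data_range = max(lengths) - min(lengths)   # R = M - m
variance = statistics.variance(lengths)    # sample variance s^2 (divides by n - 1)
std_dev = statistics.stdev(lengths)        # s = sqrt(s^2), same units as the data


def z_score(x, center, spread):
    """Distance of x from the center, measured in standard deviations."""
    return (x - center) / spread


def empirical_rule_bounds(center, spread):
    """Intervals center +/- k*spread for k = 1, 2, 3 (about 68%, 95%, over 99%)."""
    return {k: (center - k * spread, center + k * spread) for k in (1, 2, 3)}


def pearson_r(xs, ys):
    """Pearson's r from the raw sums, mirroring the hand formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Exam example: mean 70, standard deviation 10, scores of 85 and 60.
z_high = z_score(85, 70, 10)
z_low = z_score(60, 70, 10)

# Cash balance example: x-bar = 157,954.48 and s = 47,633.00.
bounds = empirical_rule_bounds(157954.48, 47633.00)

# Hypothetical data lying exactly on a line, to illustrate r = 1 and r = -1.
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]      # positive slope
ys_down = [10, 8, 6, 4, 2]    # negative slope
r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)

print("mean, median, range:", mean, median, data_range)
print("variance, std dev:", variance, std_dev)
print("z-scores:", z_high, z_low)
for k, (low, high) in bounds.items():
    print(f"within {k} standard deviation(s): {low:.2f} to {high:.2f}")
print("r for the two hypothetical lines:", r_up, r_down)
```

Swapping the arguments of `pearson_r` leaves the result unchanged, which is the symmetry property listed above, and rescaling either variable (say, inches to centimeters) leaves it unchanged as well, which is what "dimensionless" means in practice.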