Statistics for Business
Statistics for Business MATH 2283
These 96 pages of class notes were uploaded by Dr. Tyrell McKenzie on Sunday, October 11, 2015. The class notes belong to MATH 2283 at East Carolina University, taught by Staff in Fall.
Date Created: 10/11/15
Introductory Material, Descriptive Statistics, Graphs (MATH 2283)

Population vs. Sample

A population is the entire group we wish to study. A population can be real, such as all adults residing in North Carolina. A population can be theoretical, such as all potential ball bearings made by a certain machine, or all potential readings of a scale when a ten-pound weight is placed upon it. A sample is a subset of the population.

Why Sample?

A census is a study of the entire population. This option may be too costly or time-consuming if the population is large, and it is impossible when dealing with theoretical populations. When censuses are unrealistic, samples are taken. We hope the sample is representative of the population. The method of sampling determines the degree to which the sample likely represents the population.

Characteristics of an Object

A variable is any characteristic of an object. Variables are numerical or categorical descriptions of objects, for example height, weight, and gender. When a specific variable is considered, the population essentially goes from being a collection of objects to being a collection of numbers or words.

Parameter vs. Statistic

A parameter is any number that summarizes a population, for example the population mean $\mu$ for a specific variable ($\mu$ is a Greek letter pronounced "mu"). Parameters are often unknown. A statistic is any number that summarizes a sample, for example the sample mean $\bar{x}$. Statistics are often used to estimate unknown parameters.

Example: A company reviews the salaries of its full-time employees below the executive level at a large plant. 50 out of 3000 employee records are chosen at random, and the average yearly salary of the 50 selected employees is 30,000.

Population: ____
Variable: ____
Sample: ____
Size: ____

Unknown: the average salary of all 3000 employees is 31,000. 31,000 is a ____ as it describes a ____. 30,000 is a ____ as it describes a ____.

Example: The survival times, in days, of 72 guinea pigs after being injected with TB (tubercle bacilli) in a medical experiment are recorded. The
average survival time of the 72 guinea pigs is 162 days.

Population: ____
Variable: ____
Sample: ____
Size: ____

Unknown: an average guinea pig will survive 170 days after injection. 170 days is a ____ as it describes a ____. 162 days is a ____ as it describes a ____.

Descriptive vs. Inferential Statistics

Descriptive statistics is the art of describing important aspects of a set of measurements. Describing data through graphs and numerical summaries is important. Inferential statistics is the science of using a sample of measurements to draw conclusions about a population. More important, however, is using this data to draw educated conclusions about a population.

Observational Study vs. Experiment

An observational study is where the scientist simply observes and does not attempt to influence the response, for example data collected by questionnaire. An experiment is where the scientist deliberately imposes different treatments on different individuals. Experiments provide good evidence of causation.

Sampling Schemes: Simple Random Sample (SRS)

In a simple random sample, on each selection from the population, every unit remaining in the population has the same chance of being chosen next. This is the most basic method of random sampling. An SRS of size n has the property that each group of size n that can be formed with objects in the population is equally likely to be the sample.

For example, suppose that a class consists of 5 people and I wish to take an SRS of n = 3 people. If I cannot pick the same person twice, then there are 10 possible groups of 3 people. Let A, B, C, D, E represent the 5 people in the class. The 10 possible groups are ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE. With an SRS, each possible group must therefore have a 1 in 10 chance of being the sample I pick.

Simple Random Sample (SRS): With or Without Replacement

In an SRS without replacement, after each selection from the population the object is not returned to the population for the remaining draws. This is preferred when sampling from real (not theoretical) populations. Why? Making n draws
without replacement is equivalent to grabbing a handful of n objects from a thoroughly randomized population.

In an SRS with replacement, after each selection from the population the object is returned to the population for the remaining draws. This is de facto the case when sampling from theoretical populations (manufacturing processes and games of chance).

SRS or Not?

Example: A class consists of 20 people, 10 male and 10 female. The teacher wishes to sample 10 students. The teacher will flip a coin. If the coin lands heads, the teacher will take all the male students as the sample. If tails, the teacher will take all the female students as the sample. SRS or not?

In order to be an SRS, it must be like drawing names out of a hat containing the names of all students. There are actually 184,756 different groups of 10 that can be formed from the class. Does each of these groups have the same chance?

Sampling Schemes: Voluntary Response Sample

A voluntary response sample consists of people who choose themselves by responding to a general appeal. Such samples tend to be biased, since people with strong opinions (especially negative ones) are most likely to respond. Examples include "rate our performance" cards and call-in surveys.

Sampling Schemes: Convenience Sample

A convenience sample is a nonrandom sampling method that chooses the members of the population that are easiest to observe. There is great danger that the sample is not representative of the population. Example: using this class to represent a sample of all ECU students.

Sampling Schemes: Systematic Sample

The members of the population are in some order. A systematic sample observes every kth value, starting from a random starting point. Example: selecting every 100th item from a production line. Example: selecting every 50th name from an ECU directory.

Sampling Schemes: Stratified Sample

The population is broken into groups called strata. SRSs are taken from each stratum. The SRSs from each stratum are combined to create a stratified sample. Used
when the scientist wants guaranteed representation from each stratum. Example: A voting population consists of Democrats, Republicans, and Independents. Take SRSs of size 100 from each group and combine them to form the stratified sample of size 300.

Sampling Schemes: Cluster Sample

The population is broken into groups called clusters. Some clusters are randomly chosen. All the members of the chosen clusters are combined to create a cluster sample. This is useful when each cluster is filled with similar individuals. Example: taking 25 consecutive items off a production line every 10 minutes.

Example: I wish to obtain a sample of first graders in Wisconsin. If I randomly select 850 names from a list of all first graders in Wisconsin, then my sample will be a ____ of size n = ____. If I randomly select 2 first graders from each of the 697 elementary schools in Wisconsin, then my sample would be a ____ of size n = ____. If I were to randomly select 4 counties from Wisconsin's 72 counties and then take every first grader in the 4 selected counties to be my sample, then my sample would be a ____ of an undetermined size.

Nonresponse

Nonresponse occurs when an individual chosen to be in the sample cannot be contacted or does not cooperate. Just because you select someone to be in the sample doesn't mean they will cooperate. A study with a large nonresponse rate suffers a greater chance that the cooperating part of the sample is unrepresentative of the population.

Categorical vs. Quantitative Variables

Categorical variables place objects into one of several groups or categories. A categorical variable is ordinal if the categories have a meaningful order; otherwise the variable is said to be nominal, meaning the categories have no meaningful order. For example, eye color is a categorical variable. Do the categories have a meaningful order? NO; therefore eye color is said to be nominal.

Quantitative variables take on numerical values that represent quantities (how many or how much). Quantitative variables can
be continuous if any value over an interval can happen, or they can be discrete if there are gaps between possible values. For example, the number of keys in a person's pocket: discrete or continuous? Discrete: the number of keys can be 0, 1, 2, ...

Spreadsheets

Collected data is often entered into a spreadsheet like Microsoft Excel. For example, suppose I ask the following questions of a group of students: What is your height in inches? Are you male or female? Would you label yourself as liberal, moderate, or conservative? How many children would you like to have?

The collected data would look like this in an Excel spreadsheet.

[Figure: Excel spreadsheet with columns Height, Gender, Political, Children]

For the four variables, determine if each is categorical (nominal vs. ordinal) or quantitative (discrete vs. continuous).

HEIGHT: Quantitative (technically continuous, made discrete by rounding)
GENDER: Categorical (nominal)
POLITICAL: Categorical (ordinal)
CHILDREN: Quantitative (discrete)

Definitions

Frequency (Count): tells how many. For example, the frequency of people wearing sandals is 14.

Relative Frequency (Proportion, Percent): tells how many relative to the total. For example, the relative frequency of people wearing sandals is 0.4, or 40%, or 14 out of 35 (14/35).

Summaries for Categorical Data

Frequency Table: lists all the categories with their associated frequencies. Often the relative frequency of each category is listed also. If several categories have very low counts, they may be combined into a single class labeled "other".

Example: Categorical Data

Suppose that we have data on 122 defective aluminum can lids as follows: Bent, Too Thin, Bent, Not Round, Too Thick, ... (the list is 122 long). We can easily summarize this list of 122
words or phrases by creating a frequency table.

PROBLEM                          FREQUENCY   RELATIVE FREQUENCY
Lids too thick or too thin       46          46/122 = 0.377 or 37.7%
Lids bent                        12          12/122 = 0.098 or 9.8%
Lids not round                   8           8/122 = 0.066 or 6.6%
Lid tabs not properly formed     36          36/122 = 0.295 or 29.5%
Lids not of proper diameter      9           9/122 = 0.074 or 7.4%
Other problems                   11          11/122 = 0.090 or 9.0%
TOTAL                            122         122/122 = 1 or 100%

Graphs for Categorical Data

Pie Chart: each category gets a slice of the entire pie, proportional to the category's relative frequency. Useful for visualizing how much of the whole each category represents.

[Figure: Pie Chart of Defect]

Bar Chart (either frequency or relative frequency): each category is listed along a horizontal line, a rectangle is drawn over each category, and the height of the rectangle is the frequency or relative frequency.

[Figure: Chart of Defect, bar chart of the defective lid data]

Frequency vs. Relative Frequency Bar Charts

The appearance is the same; the only difference is the vertical scale. The first scale is count, or frequency; the second scale is percent, or relative frequency.

[Figure: Chart of Defect, drawn once with a frequency scale and once with a percent scale]

Example: Nominal vs. Ordinal

Could the defective lid data be legitimately graphed with the categories in a different order? YES, the categories are nominal.

[Figure: Chart of Defect with the bars reordered]

Graphs for Quantitative Data: Stemplots

To draw a stemplot:
1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of the column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Example: Stemplots (Degree of Reading Power Data)

The Degree of Reading Power (DRP) test is often used to measure the reading ability of children. Create a stemplot of the following DRP scores of 44 third-grade
students:

40 26 39 12 42 18 25 43 46 57 19 47 19 26 35 34 15 44 40 38 31 46
52 25 35 35 33 29 34 41 49 28 52 47 35 48 22 33 41 51 27 14 54 45

Stem-and-leaf of DRP (n = 44)

STEM | LEAF
1    | 2 4 5 8 9 9
2    | 2 5 5 6 6 7 8 9
3    | 1 3 3 4 4 5 5 5 5 8 9
4    | 0 0 1 1 2 3 4 5 6 6 7 7 8 9
5    | 1 2 2 4 7

The scores are not evenly spread out. It is easily seen that there is a concentration of scores in the 40s.

Stemplots: Advantages and Disadvantages

Advantages: None of the data is lost. Back-to-back stemplots are possible.
Disadvantages: Impractical for large data sets. Spacing is important (see below).

[Figure: back-to-back stem-and-leaf of home runs per season, Babe Ruth vs. Roger Maris]

Graphs for Quantitative Data: Histograms

A histogram is essentially a bar chart drawn after grouping quantitative data into categories.
1. Divide the range of data into classes (intervals) of equal width.
2. Count the number (frequency) of observations in each class, OR determine the percent of all observations (relative frequency) that fall in each class. Consider an endpoint convention.
3. Draw the histogram. Over each class, draw a rectangle whose height is the frequency or relative frequency for the class. There are no gaps between rectangles.

Example: Histograms (Airplane Age Data)

The ages of airliners cause some safety and economic concerns. For the 40 ages of the following commercial aircraft, randomly selected in the US, construct a frequency table and a histogram (the data is in increasing order for convenience).

0.1 0.4 0.7 2.1 3.2 3.6 3.9 4.9 5.3 5.8
6.3 6.6 7.0 7.7 8.1 9.6 10.2 10.5 11.5 11.9
12.4 12.5 13.6 15.2 15.8 16.2 16.8 16.9 17.0 18.7
20.6 21.3 22.4 22.5 22.8 23.1 24.1 25.7 26.3 27.3

Roughly speaking, the data ranges from 0 to 28. I will break this range into 10 subintervals (classes). Since the total length is 28, each subinterval will be 2.8 long. So the first class will be from 0 to 2.8, the second will be from 2.8 to 5.6, and the last class will be from 25.2 to 28. How long would each class be if I used only 8 classes?

Example: Histograms (Airplane Age Data)

We have created ten categories. We can summarize the data with a frequency
table and a pseudo bar chart called a histogram.

CLASS         FREQUENCY   RELATIVE FREQUENCY
0-2.8         4           4/40 = 0.1 or 10%
2.8-5.6       5           5/40 = 0.125 or 12.5%
5.6-8.4       6           6/40 = 0.15 or 15%
8.4-11.2      3           3/40 = 0.075 or 7.5%
11.2-14.0     5           5/40 = 0.125 or 12.5%
14.0-16.8     4           4/40 = 0.1 or 10%
16.8-19.6     3           3/40 = 0.075 or 7.5%
19.6-22.4     3           3/40 = 0.075 or 7.5%
22.4-25.2     4           4/40 = 0.1 or 10%
25.2-28.0     3           3/40 = 0.075 or 7.5%

[Figure: Histogram of AirplaneAge]

Histograms: Advantages and Disadvantages

Too many classes results in a jagged graph. Too few classes results in an oversmoothed graph. Individual data is lost.

Read the following histogram.

[Figure: Histogram of StoppingTime; vertical axis Frequency, horizontal axis StoppingTime]

Examining Distributions

Shape: Does the distribution have one or several peaks, called modes? A distribution with one major peak is called unimodal; a distribution with two major peaks is called bimodal, etc. Is the distribution approximately symmetric, or is it skewed in one direction? A distribution is symmetric if the values smaller and larger than the midpoint are mirror images of each other. It is called skewed to the right if the right tail (larger values) is much longer than the left tail (smaller values).

Outliers: observations that fall outside the overall pattern of a distribution. Always look for outliers and try to explain them.

Timeplots or Runplots

Timeplots, or runplots, are useful when observations are taken at regular intervals over a period of time. They allow us to see how the observations change as time progresses. The steps are as follows:
1. Chronological time is always put on the horizontal scale, and the variable you are measuring goes on the vertical scale.
2. Connecting the points over time helps emphasize any change over time.

Example: Timeplots or Runplots (Marathon Data)

Women were allowed to enter the Boston marathon in 1972. The following table gives the winning times from 1976-1995. Create a timeplot of the winning times.

YEAR   TIME
1976   167
1977   168
1978   165
1979   155
1980   154
1981   147
1982   150
1983   143
1984   149
1985   154
1986   145
1987   146
1988   145
1989   144
1990   145
1991   144
1992   144
1993   145
1994   142

[Figure: Timeplot of Winning Time; horizontal axis Year]

A trend is a long-term rise or fall. Seasonal variation refers to a pattern in a time series that repeats itself at known regular intervals. It usually happens due to the seasons of the year. For example, if you are tracking quarterly sales, there may be an upward spike every fourth quarter (Christmas).

Generically Representing a List of Numbers

The sample size is represented by n. The numbers in the list are represented as $x_1, x_2, \dots, x_n$: $x_1$ represents the first number in the list, $x_2$ represents the second number in the list, and $x_n$ represents the last number in the list. If there are 50 numbers in the list, then n = 50 and the last number is represented as $x_{50}$, or $x_n$ since n = 50.

The numbers in ascending order are represented as $x_{(1)}, x_{(2)}, \dots, x_{(n)}$: $x_{(1)}$ represents the smallest number in the list, $x_{(2)}$ represents the second smallest, and $x_{(n)}$ represents the largest number in the list. For example, $x_{(5)}$ represents the fifth smallest number. I can find this by arranging the list in ascending order; the value in the fifth spot is $x_{(5)}$.

Measures of Center

When dealing with a list of numbers (or even words, if they have an order, i.e. are ordinal), we often like to convey the center of the list. We will discuss four measures of center: the mean, median, midrange, and mode. We will discuss the properties of each, understanding the way each measures the center.

Measures of Center: Mean

The mean, or average, of a list of numbers is found by adding the values together and then dividing by the length of the list. Symbolically,

$$\text{mean} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

If the list of numbers is a population, then we represent the mean by the symbol $\mu$, pronounced "mew". If the list of numbers is a sample,
then we represent the mean by the symbol $\bar{x}$. This distinction is VERY meaningful. Oftentimes $\mu$ is unknown. A sample can always be taken, so calculating $\bar{x}$ is always doable. $\bar{x}$ is used to estimate, or guess at, $\mu$ when it is unknown.

Example: Mean (Bank Times)

The following are a sample of times, in minutes, for telephone calls in the technical support department of the Jefferson Valley Bank. Find the sample mean of the times.

7 1 3 7 6 9 11 8 9 9

Here $x_1 = 7$, $x_2 = 1$, $x_5 = 6$, $x_7 = 11$, and $x_{10} = x_n = 9$. Note that these numbers are a sample, and so our answer will be represented by the symbol $\bar{x}$.

Properties of the Mean

It is NOT a resistant measure, meaning extreme values can have a substantial impact. This is a very unappealing property: beware outliers.

The mean is the center of gravity of the data. Think of a number line as a teeter-totter. If one-pound weights were placed on the number line (itself weightless) at each value in the list, then the teeter-totter would balance at the mean.

Every observation has a direct role in the determination of the mean. If you change any single value, the mean will change.

One can estimate the mean from a histogram by the place where the histogram would balance.

Measures of Center: Median

The median of a list of numbers is the middle value of the ordered list. It is represented as Md.

If the number of numbers in the list, n, is odd, then there is exactly one value in the middle of the ordered list. The position of this middle value is position $\frac{n+1}{2}$. The value in the ordered list in position $\frac{n+1}{2}$ is represented by the symbol $x_{\left(\frac{n+1}{2}\right)}$.

If the number of numbers in the list, n, is even, then there are two values vying for the middle of the ordered list. The positions of these middle values are $\frac{n}{2}$ and $\frac{n+2}{2}$. We typically average these two middle values.

Symbolically,

$$Md = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\[4pt] \dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)}}{2} & n \text{ even} \end{cases}$$

Example: Median (Bank Times)

The following are a sample of times, in minutes, for telephone calls in the technical support department of the JV Bank. Find the sample median of the times.

7 1 3 7 6 9 11 8 9 9

In the ordered list the values are labeled $x_{(1)}, x_{(2)}, \dots, x_{(9)}, x_{(10)}$, with n = 10. Thus two values are tied
for middle position: position $\frac{n}{2}$ = ____ and position $\frac{n+2}{2}$ = ____. Median = ____.

Properties of the Median

It is a resistant measure, meaning extreme values tend to have little impact. This is a very attractive property: the median is not bullied by outliers.

The median is the middle value of the data. The median splits the data in half. If you are told the median of a list is 84, then you immediately know that half the list is less than 84 and half is more than 84 (very attractive).

One can estimate the median on a histogram by the place where half the area is to the left and half the area is to the right.

The Mean vs. the Median

For symmetric distributions, the mean and the median are approximately equal to each other. For skewed right distributions, the mean is bigger than the median. For skewed left distributions, the mean is smaller than the median (the tail pulls the mean).

The median is easily interpretable and thus is very attractive in purely descriptive settings. Although the mean has the unattractive property of being heavily influenced by any outliers, making it worrisome in descriptive settings, it has some very useful statistical properties that make it king in analytical settings (inference).

Measures of Center: Midrange and Mode

The midrange of a list of numbers is the average of the smallest and largest values. It is represented as Mr. Symbolically,

$$Mr = \frac{x_{(1)} + x_{(n)}}{2}$$

It is most definitely NOT a resistant measure (it is the average of the most extreme values).

The mode of a list is the most frequently occurring number or word. It is represented as Mo. It is the only meaningful measure of center for nominal data. It is not useful when the variable under consideration is continuous, since it is likely that each value will only occur once in the list.

Example: Midrange and Mode (Bank Times)

The following are a sample of times for telephone calls in the technical support department of the JV Bank. Find the sample midrange and mode of the times.

7 1 3 7 6 9 11 8 9 9

The smallest value is $x_{(1)}$ = ____. The largest value is $x_{(10)}$ = ____. Midrange = ____.

The value ____ is the most frequently occurring value. Its associated frequency
is ____. Thus the mode equals ____.

Measures of Spread

When dealing with a list of numbers, we often like to convey the spread, or variability, of the list. We will discuss three measures of spread: the standard deviation, range, and interquartile range. We will discuss the properties of each, understanding the way each measures the spread.

Measures of Spread: The Standard Deviation

The standard deviation of a list of numbers is found by averaging the squared deviations and then taking the square root. Symbolically,

$$\sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}}$$

In essence, we are looking at how far each number in the list is away from the mean (the deviation), squaring these deviations, averaging these squared deviations, and then taking the square root. The standard deviation can be thought of as roughly the average distance each value in the list is away from the mean. A squared standard deviation is called the variance.

Measures of Spread: The Population Standard Deviation

If a list of numbers is a population, then we represent the population standard deviation by the Greek letter $\sigma$ (sigma). The population variance is represented as $\sigma^2$. There is an equivalent formula to the one given on the previous page:

$$\sigma = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_N^2 - N\mu^2}{N}}$$

This formula is generally easier to number crunch and will always give the same answer as the formula on the previous page.

Measures of Spread: The Sample Standard Deviation

If a list of numbers is a sample, then we, by definition, do not care directly about this list. We only care about what the sample can tell us about the population. We will look at the spread of the sample to estimate the spread of the population, $\sigma$. It turns out that if we use the previous formula, we will tend to underestimate $\sigma$. So we tweak the formula, only for samples. The sample standard deviation, represented as s, is calculated as

$$s = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2 - n\bar{x}^2}{n - 1}}$$

where the $n - 1$ in the denominator is "the tweak". s is used to estimate $\sigma$. $s^2$ is called the sample variance.

Example: Standard Deviation (Bank Times)

The following are a sample of times for telephone calls in the
technical support department of the JV Bank. Find the sample standard deviation.

7 1 3 7 6 9 11 8 9 9

Recall: The standard deviation of ALL call durations at the bank, represented as $\sigma$, is unknown and would be estimated as s = ____. The sample variance is $s^2$ = ____.

Properties of the Standard Deviation

It is NOT a resistant measure, meaning extreme values tend to have a major impact. This is a very unattractive property.

The standard deviation, as a descriptive measure of spread, is commonly paired with the mean as the measure of center. Just like the mean, the standard deviation is preferred in analytical settings because of its mathematical properties.

The Empirical Rule

Suppose we have a list of numbers. Further suppose we know the mean of this list, $\mu$, and the standard deviation of this list, $\sigma$. Lastly, suppose the list of numbers has a bell-shaped histogram (or not too drastically far from bell-shaped).

THE EMPIRICAL RULE

Approximately 68.26% of the values in the list will be within one standard deviation of the mean, and thus lie within the interval $[\mu - \sigma, \mu + \sigma]$, that is, $\mu \pm \sigma$.

Approximately 95.44% of the values in the list will be within two standard deviations of the mean, and thus lie within the interval $[\mu - 2\sigma, \mu + 2\sigma]$, that is, $\mu \pm 2\sigma$.

Approximately 99.74% of the values in the list will be within three standard deviations of the mean, and thus lie within the interval $[\mu - 3\sigma, \mu + 3\sigma]$, that is, $\mu \pm 3\sigma$.

Utility of the Empirical Rule

With the empirical rule, we can replace a list with a mean, a standard deviation, and a concept. We lose exactness but gain simplicity. The approximation is generally quite good. There are occasions where a list of possibilities is impossible to calculate, yet we know the mean and standard deviation of the list and know that the list is approximately bell-shaped. The numbers in the empirical rule come from the normal, or bell, curve.

Example: The Empirical Rule

40 students in a statistics course take an exam. The average of the forty scores is 70.7, with a standard deviation of SD = 9. The scores are as follows:

50 52 57 59 60 61 62 64 64 65 66 66 66
67 68 68 68 69 70 71 71 72 72 73 73 73 74 74 75 75 76 77 79 81 82 83 83 85 85 92

According to the empirical rule, if the data is approximately normal (or mound-shaped), about 68.26% of the data will fall between 70.7 - 9 = 61.7 and 70.7 + 9 = 79.7. Looking at the actual scores, exactly what percent of the scores were between 61.7 and 79.7?

Example: The Empirical Rule

The mean and standard deviation of trash bag breaking strengths for a process are $\mu = 50.75$ and $\sigma = 1.64$. According to the Empirical Rule (assuming the distribution of breaking strengths is mound-shaped):

The chance a randomly chosen trash bag will break when the stress is in the interval [49.11, 52.39] is approximately ____.

A randomly chosen trash bag will break with approximate 99.74% probability when the stress on it is in the interval ____.

Measures of Location: Percentiles

The pth percentile of a list of numbers is a value where p% of the list is less than or equal to this value and (100 - p)% of the list is more than or equal to this value. The median is the 50th percentile. The 70th percentile is a value where 70% of the list is less than or equal to this value and 30% of the list is greater than or equal to this value.

The pth percentile of a list consisting of n things can be calculated by first calculating $L = \dfrac{np}{100}$ (L is for location). Then:

If L is not a whole number (i.e. has decimal places), round L up to the next whole number. The pth percentile is the value in the ordered list at that position.

If L is a whole number, then the pth percentile is the average of the values in the ordered list at positions L and L + 1.

The Quartiles

The 25th, 50th, and 75th percentiles of a list of numbers are called the quartiles. The first quartile, denoted Q1, is the 25th percentile. The second quartile, denoted Q2, is the 50th percentile (the median). The third quartile, denoted Q3, is the 75th percentile.

They are called the quartiles since they divide the data in quarters: below Q1, between Q1 and Q2, between Q2 and Q3, and larger than Q3. The data between the first and third quartiles (between Q1 and Q3) is called the middle half of
the data.

Example: Percentiles (Paper Imperfections)

The Action Paper Company makes reels of paper from mixtures of wood pulp and recycled paper. A sample of 10 reels is obtained and the number of imperfections is counted. The results are as follows (the data is in no particular order):

8 Z 5 6 E 7 8 8 9 14

Find the quartiles, the 32nd percentile, and the 68th percentile. Note: n = 10.

The first quartile is the 25th percentile: $\frac{np}{100} = \frac{10(25)}{100} = 2.5$, so Q1 is at the 3rd position. Q1 = ____.

The third quartile is the 75th percentile. Since the 25th is 3 positions from the beginning, the 75th is 3 from the end. Q3 = ____.

The second quartile is the median, or 50th percentile: $\frac{np}{100} = \frac{10(50)}{100} = 5$, so Md is the average of the 5th position and the 6th position. Md = ____.

The 32nd percentile: $\frac{np}{100} = \frac{10(32)}{100} = 3.2$, so the 32nd percentile is at the 4th position. The 32nd percentile is ____.

The 68th percentile is ____.

More Measures of Spread: Range and Interquartile Range

The range of a list of numbers is the distance from the smallest number to the largest number. It is calculated as range = largest - smallest. It is definitely not a resistant measure of spread, meaning it is strongly influenced by outliers.

The IQR, or interquartile range, is the distance between the quartiles. It is calculated as IQR = Q3 - Q1. It is the length of the middle half of the data. It is a resistant measure of spread (outliers have weak influence). When wanting to convey the center and spread of a list of numbers, the IQR is commonly paired with the median.

Outliers and the Five-Number Summary

We have previously only imprecisely defined an outlier as a value which looks unlike the other numbers in a list. We want to make the identification of outliers more objective by use of a formula. We will classify an observation as an outlier if it is not in the interval from Q1 - 1.5 IQR to Q3 + 1.5 IQR. If a value is found to be an outlier, it still belongs to the list; it just now has an extra designation.

The five-number summary of a list is a presentation of the smallest value, the first quartile, the median, the third quartile, and the largest value.

Example: Five-Number Summary and Outliers (Wolf Packs)

A random sample of winter wolf packs in regions of Alaska, Minnesota, Michigan, Wisconsin, Canada, and Finland showed pack sizes of (in ascending order already):

2 2 2 3 3 4 4 4 5 7 7 7 7 8 8 10 13 15

Give the five-number summary and determine if there are any outliers. Note: n = 18.

Boxplots

A boxplot is essentially a graph of the five-number summary, with outliers drawn differently. The steps to draw one are:

Draw a number line (either vertically or horizontally). The boxplot will be drawn alongside the number line.
A box spans the quartiles.
A line inside the box indicates where the median is.
Suspected outliers are drawn separately.
Lines extend out from the box to the smallest and largest values in the list that are not already drawn.

Below is a computer-generated boxplot for the wolf pack data.

[Figure: boxplot of Pack Size]

Advantages include: easy to draw; easy to read; can use the same number line to draw many boxplots for multiple lists of numbers.

Disadvantages include: can mask the shape of the distribution (especially modality).

Example: Boxplots (Manager Paperwork Hours)

The following lists the amount of time, in hours, that a sample of office managers spent working on paperwork in one day:

0.0 0.0 1.0 1.5 1.8 1.9 2.0 2.0 2.1 2.3 2.4 2.4 2.9 3.3 3.4 3.7 4.4 4.5 6.0

Create a boxplot for the above data. Note: n = 19.

Related Variables

[Figure: Excel spreadsheet with columns Height, Gender, Political, Children]

After looking at single variables (doing graphs like histograms, and numerical summaries), one might begin to examine relationships between variables. Are Height and Gender related? Yes. Gender and Political Orientation? Maybe. Political Orientation and Children? Maybe. Height and Children? Doubtful.

We will focus on examining the relationship between two quantitative variables.
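The percentile location rule, the five-number summary, and the 1.5 IQR outlier interval described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`percentile`, `five_number_summary`, `find_outliers`) are made up for this sketch, p is assumed to be a whole number strictly between 0 and 100, and the rule implemented is the one given in these notes (statistical software often uses slightly different percentile conventions).

```python
def percentile(values, p):
    """p-th percentile using the notes' rule: L = n*p/100.

    Hypothetical helper; assumes p is a whole number with 0 < p < 100.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Split L = n*p/100 into its whole part and what is left over.
    whole, leftover = divmod(n * p, 100)
    if leftover != 0:
        # L is not a whole number: round up to position whole + 1
        # (positions are 1-based, list indexes are 0-based).
        return ordered[whole]
    # L is a whole number: average the values at positions L and L + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


def five_number_summary(values):
    """Smallest value, Q1, median, Q3, largest value."""
    return (
        min(values),
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        max(values),
    )


def find_outliers(values):
    """Values outside the interval from Q1 - 1.5 IQR to Q3 + 1.5 IQR."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Wolf pack sizes from the example above (n = 18).
wolf_packs = [2, 2, 2, 3, 3, 4, 4, 4, 5, 7, 7, 7, 7, 8, 8, 10, 13, 15]

print(five_number_summary(wolf_packs))  # (2, 3, 6.0, 8, 15)
print(find_outliers(wolf_packs))        # []
```

With Q1 = 3 and Q3 = 8, the IQR is 5 and the outlier interval runs from -4.5 to 15.5, so under this rule none of the pack sizes is flagged.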
Scatterplots
We start with two quantitative variables.
The explanatory variable (also called the independent variable, X variable, or predictor variable) is the variable which will be used to explain or predict another variable.
The response variable (also called the dependent variable or Y variable) is the variable which will be explained or predicted.
A scatterplot is a graph which displays the relationship between two quantitative variables. The explanatory variable is plotted along the horizontal (x) axis, the response variable along the vertical (y) axis. Each individual appears as a point on the scatterplot.

Example: Scatterplots
Example: Beer Sales. To help determine the relationship between temperature and beer sales at Yankee Stadium, a sample of 10 games was selected, noting the temperature and the beer sales at the concession stand.
[Figure: Scatterplot of Beers vs Temp]
When reading a scatterplot, look at:
The form: do the points tend to follow a linear or curved path?
The direction: are the points rising (positive) or falling (negative)?
The strength: the more tightly clustered the points are to the form, the stronger the relationship.
For the beer data: the form is linear, the direction is positive, and the strength is somewhat strong.

The Correlation
A measure of the strength of linear association between two quantitative variables, denoted by $r$.
Given data $(x_1, y_1), \ldots, (x_n, y_n)$, the formula is
$r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$
The correlation will always be between $-1$ and $1$.
The sign indicates the direction of the relationship.
If the points of a scatterplot all fall on the same line, the correlation will be $-1$ or $1$. If the points form a directionless cloud, the correlation will be close to 0.

Example: Correlation
Example: Car Data. The following table summarizes sample data for 5 cars of comparable type.
Mileage:        35  22  32  18  25
Selling Price:  70  85  72  89  76
The X variable: Mileage
The Y variable: Selling Price
[Figure: Scatterplot of Selling Price vs Mileage]
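The correlation for the Car Data can be checked with a short Python sketch. It computes r as the sum of the products of deviations divided by (n − 1) times the two sample standard deviations, where each standard deviation also uses the divisor n − 1. The helper names (`mean`, `sample_sd`, `correlation`) are illustrative and not part of the notes.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Correlation r = sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((n - 1) * s_x * s_y)

mileage = [35, 22, 32, 18, 25]
selling_price = [70, 85, 72, 89, 76]
r = correlation(mileage, selling_price)
print("r =", round(r, 2))
```

The result rounds to about −0.96: a strong negative linear association, so cars with higher mileage in this sample tend to have lower selling prices.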
Regression Line
A regression line is any line that describes how a response variable Y changes as an explanatory variable X changes. It is often used to predict the value of Y corresponding to a known value of X.
Any line relating y to x has an equation of the form $y = a + bx$, where $b$ is the slope and $a$ is the y-intercept. Thus one can identify a specific line by identifying the slope $b$ and the y-intercept $a$.

Which Line to Use?
[Figure: Scatterplot of Selling Price vs Mileage with a candidate line drawn in red]
How close is the red line to the points? We will measure the closeness of the line to the points by adding up all the squared vertical distances.

Least-Squares Regression Line
The least-squares regression line is the line that makes this sum of squared vertical distances as small as possible. Using x to predict y, it has equation
$\hat{y} = a + bx$, where $b = r\,\dfrac{s_y}{s_x}$ and $a = \bar{y} - b\,\bar{x}$.

Comments: Regression Line
Only used to predict the Y variable when X is known. If you want to use Y to predict X, then you must start over, switching which is the X and which is the Y variable.
Extrapolation (predicting for X values outside the range of the data) is dangerous.
The slope $b$ of the regression line has a nice interpretation: it is the amount by which the predicted Y changes when X is increased by one unit.

Example: Regression Line
Example: Car Data. The X variable: Mileage. The Y variable: Selling Price.
$r = -0.96$, $\bar{x} = 26.4$, $s_x = 7.02$, $\bar{y} = 78.4$, $s_y = 8.2644$
[Figure: Scatterplot of Selling Price vs Mileage]

Example: Regression Line
Example: Car Data. $\hat{y} = 108.2 - 1.13x$
To graph, plot any two points on the line and connect them.
If x = 18, then $\hat{y}$ = ____
If x = 35, then $\hat{y}$ = ____
[Figure: Scatterplot of Selling Price vs Mileage]

Example: Regression Line
Example: Car Data. $\hat{y} = 108.2 - 1.13x$
To predict Y, plug the known value of X into the equation. The predicted value can be ballparked from the graphed line.
If X = 28, predict Y = ____
If X = 100, predict Y = ____
[Figure: Scatterplot of Selling Price vs Mileage]

Example: Regression Line
Example: Car Data. $\hat{y} = 108.2 - 1.13x$
To predict Y, plug the known value of X into the equation.
If Y = 8, predict X = ____
Interpret the slope: ____
[Figure: Scatterplot of Selling Price vs Mileage]

Association DOES NOT IMPLY Causation
In a study, one or more lurking variables may be present: variables that have an impact on the response variable, the explanatory variables, or both, but are not included (accounted for) among the variables studied.
Some explanations for an observed association:
[Figure: three diagrams labeled Causation, Common Response, and Confounding. The dotted lines show an association. The arrows show a cause-and-effect link. We observe variables X and Y, but Z is a lurking variable.]

Example: Causation
A study of grade school children ages 6 to 11 found a high positive correlation between reading ability (Y) and shoe size (X). Do children with larger feet tend to read better? What accounts for this relationship?
Over the past 10 years there has been a high positive correlation between the number of South Dakota safety inspection stickers issued and the number of South Dakota traffic accidents. Do safety inspection stickers cause traffic accidents?

Discussion: Bias and Variability
Suppose we are interested in estimating the mean of a population, $\mu$. We will use the sample mean $\bar{x}$ of a sample to be taken in the future to estimate $\mu$. There are many different samples that could result. Each different sample has a sample mean. We could list out all these possible sample means, one per possible sample. Call this the list of possible values.

Discussion: Bias and Variability
When we take a sample and then calculate the sample mean, we are essentially just picking a sample mean at random from the list of all possible sample means. Therefore, think of the sample mean as coming randomly from the list of all possible sample means. To simplify our understanding of this list, we can convey its center and spread by its mean and standard deviation. It turns out that the average of the list of all possible sample means
is the population mean $\mu$. Since the average of the list of possible sample means is $\mu$, we say that the sample mean $\bar{x}$ is an unbiased estimator of $\mu$.

Discussion: Bias and Variability
Unbiasedness indicates that some of the possible values of the statistic are below the parameter and some are above it, but on average the possible values of the statistic are right on target for estimating the parameter.
Unbiasedness is an attractive property, but it does not guarantee that an estimator is good. Unbiasedness coupled with low variability does ensure a good estimator.

Discussion: Bias and Variability
The standard deviation of the list of possible values of the statistic describes the variability of the statistic.
Unbiasedness says the center of the list of possible statistics is the parameter. Low variability says the values in the list are close together.
Combined: knowing that the center of the list of possible statistics is the parameter, and that the values in the list are close together, implies that the whole list is close to the parameter. That makes a good estimator.

Dartboard Illustration
The bullseye represents the population mean $\mu$ (or, generically, any parameter). The darts represent the possible sample means $\bar{x}$ (or, generically, the possible sample statistics).
[Figure: dartboards with panels labeled High Bias, High Variability, High Bias]

Dartboard Illustration
[Figure: dartboard labeled IDEAL]
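The idea of a "list of all possible sample means" can be made concrete with a small Python sketch. The population below is hypothetical, made up purely for illustration, and samples are assumed to be drawn without replacement. The code lists every possible sample of a given size, computes each sample mean, and then reports the center and spread of that list.

```python
from fractions import Fraction
from itertools import combinations
from statistics import pstdev

# Hypothetical population of six values, made up purely for illustration.
population = [2, 4, 6, 8, 10, 12]
population_mean = Fraction(sum(population), len(population))

def possible_sample_means(values, n):
    """Sample mean of every possible sample of size n (drawn without replacement)."""
    return [Fraction(sum(sample), n) for sample in combinations(values, n)]

def center_and_spread(means):
    """Mean and standard deviation of the list of possible sample means."""
    center = sum(means) / len(means)
    spread = pstdev([float(m) for m in means])
    return center, spread

center_2, spread_2 = center_and_spread(possible_sample_means(population, 2))
center_4, spread_4 = center_and_spread(possible_sample_means(population, 4))

print("Population mean:", population_mean)
print("n = 2: center", center_2, "spread", round(spread_2, 3))
print("n = 4: center", center_4, "spread", round(spread_4, 3))
```

For both sample sizes the center of the list equals the population mean, which is what unbiasedness says. The spread is smaller for the larger sample size, so the possible sample means cluster more tightly around the bullseye.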