These 41-page class notes were uploaded by Brody O'Connell on Monday, October 5, 2015. The notes belong to STA392 at Central Michigan University, taught by Kahadawala Cooray in Fall.
Date Created: 10/05/15
Chapter 1 Descriptive Statistics

1.1 The Science of Statistics

Statistics is the science of information. This involves collecting, classifying, organizing, summarizing, analyzing, interpreting, and presenting information to draw conclusions or make decisions. The information is data. Data are facts or propositions used to draw a conclusion or to make a decision. Data consist of information coming from observations, counts, measurements, or responses. The singular of data is datum.

There are two types of data sets, called populations and samples.
1. A population is the complete collection of events, outcomes, responses, measurements, objects, counts, people, individuals, or transactions that we are interested in studying. Note that a hypothetical population is a population that does not actually exist.
2. A sample is a portion or subset taken from a population.

Enumerative study: The interest of this kind of study is focused on a finite, identifiable, unchanging collection of individuals or objects that make up a population. There is a sampling frame (a listing of all individuals or objects of interest).

Analytic study: An analytic study is not enumerative in nature. These studies are often carried out with the objective of improving a future product by taking action on a process of some sort. There is no sampling frame listing the individuals or objects of interest.

1.2 Branches of Statistics

A. Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data. Descriptive statistics uses numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present that information in a convenient form.

B. Inferential statistics is the branch of statistics that involves using a sample to draw conclusions or make decisions about a population, and that measures the reliability of the result. Inferential statistics uses sample data to make estimates, decisions, predictions, or other generalizations about a larger set of data.
A measure of reliability is a statement, usually quantified, about the degree of uncertainty associated with a statistical inference.

Statistical thinking involves applying rational thought to critically assess data and the inferences made from them.

1.3 The Process of a Statistical Investigation

Step 1: Identify the research objective.
- The researcher must determine the questions he or she wants answered. Questions must be detailed. Identify the group to be studied. This group is called the population.
- An individual is a person or object that is a member of the population being studied. The individual is the sampling unit.

Step 2: Collect the information needed to answer the questions. In conducting research, we typically look at a subset of the population, called a sample.

Step 3: Organize and summarize the information. This step in the process is referred to as descriptive statistics.

Step 4: Draw conclusions from the information. This step in the process is referred to as inferential statistics.

1.4 Distinguishing between Variables and Data

Variables are the characteristics or properties of the individuals (sampling units) within the population. Variables can be classified into two groups: qualitative or quantitative.
1. Qualitative (or categorical) variables allow for classification of individuals based on some attribute or characteristic.
2. Quantitative variables provide numerical measures of individuals. Arithmetic operations such as addition and subtraction can be performed on the values of a quantitative variable and provide meaningful results.

Quantitative variables can further be classified into two groups: discrete or continuous.
1. A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. The term "countable" means the values result from counting, such as 0, 1, 2, 3, and so on.
2. A continuous variable is a quantitative variable that has an infinite number of possible values that are not countable.

Qualitative variables can
further be classified into two groups: nominal or ordinal.
1. A nominal variable is a qualitative variable that describes an attribute of an individual.
2. An ordinal variable is a qualitative variable that has all the properties of a nominal variable but also has observations that can be ranked or put in order.

Qualitative data are observations corresponding to a qualitative variable.
Quantitative data are observations corresponding to a quantitative variable.
Discrete data are observations corresponding to a discrete variable.
Continuous data are observations corresponding to a continuous variable.
Nominal data are observations corresponding to a nominal variable.
Ordinal data are observations corresponding to an ordinal variable.

1.5 Four Sources of Data

A. Data from observational studies
1. A census
2. Existing sources
3. Survey sampling

An observational study measures the characteristics of a population by studying individuals in a sample, but does not attempt to manipulate or influence the individuals. Observational studies are sometimes referred to as ex post facto (after the fact) studies because the value of the variable of interest has already been established.

A census is a list of all individuals in a population along with certain characteristics of each individual.

There are many existing data sources collected by researchers and government agencies. For example, you can find available data in published sources such as books, journals, or newspapers.

In survey sampling, the researcher samples a group of people, asks one or more questions, and records the responses. Probably the most familiar survey is the political poll, conducted by any one of a number of organizations (for example, Harris, Gallup, Roper, and CNN) and designed to predict the outcome of a political election.

B. Data from designed experimental studies
4. Designed experiments

A designed experiment applies a treatment to individuals (referred to as experimental units) and attempts to identify the effects of the treatment on
a response variable.

1.6 Two Common Techniques for Selecting Samples
1. Simple random sampling
2. Stratified random sampling

A sample of size n from a population of size N is obtained through simple random sampling if every possible sample of size n is equally likely to occur. The sample is then called a simple random sample.

A stratified sample is one obtained by separating the population into at least two homogeneous, nonoverlapping groups called strata and then obtaining a simple random sample from each stratum.

To select a convenience sample, simply use any members of a population that are readily available. This method is likely to produce biased results.

Graphical Presentation of Data
- Stem-and-leaf plot
- Histogram
- Dot plot
- Box plot
- Bar graph (for discrete or qualitative data)

1.7 Organizing Qualitative (Categorical) Data

A frequency distribution is a table that lists each category of data with its number of occurrences (counts).

The relative frequency is the proportion or percent of observations within a category and is found using the formula

$$\text{Relative frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}}$$

A relative frequency distribution lists each category of data together with its relative frequency.

A bar graph (chart) is a chart with rectangular bars whose lengths are equal (for frequency) or proportional (for relative frequency) to the value of each category. The bars can be oriented horizontally or vertically. Side-by-side bar graphs can be used to compare two or more data sets.

1.8 Organizing Quantitative Data

For discrete quantitative data: The values of a discrete variable are used to construct the categories of the frequency distribution. The associated histogram is actually a bar graph, since the width of the bars has no meaning unless we divide the data set into intervals.

For continuous data: The intervals of numbers of a continuous variable are used to construct the categories of the frequency distribution. The associated histogram is not a bar graph, since the width of the bars represents the length of
the intervals.

A histogram is constructed specifically for quantitative continuous data by drawing a rectangle for each class. The height of each rectangle measures the frequency or relative frequency of the class. The width of each rectangle is quantitative and measures the data values. The width of each rectangle is the same, and the rectangles touch each other. When constructing histograms, use more classes as the number of values in the data set gets larger.

Why do we want to construct a histogram?
- A histogram summarizes the data values of the variable in a graph that can demonstrate the distribution of the variable. It helps us quickly see where the majority of the data values are, whether there are any very unusual data values, whether those unusual values are on the high side or the low end, whether the data values are far apart or close to each other, and so on.

Constructing a relative frequency histogram for continuous variables:
- Choose the appropriate number of classes.
- Calculate the approximate class width by dividing the difference between the largest and smallest values (Range = largest - smallest) by the number of classes.
- Round the approximate class width up to a convenient number.
- Locate the class boundaries.
  - If discrete, assign one or more integers to a class.
  - If continuous, use the method of left inclusion: include the left class boundary point, but not the right boundary point, in the class.
  NOTE: Different software may use different methods. Some may use right inclusion. Some may add an additional decimal place for the class boundary.
- Construct a statistical table containing the classes, their boundaries, and their relative frequencies.
- Construct the histogram like a bar graph.

Why are relative frequency histograms more important than frequency histograms?
- In inferential statistics, we use data from a sample to guide us to conclusions about the larger population from which the sample is drawn. The actual frequencies for a sample do not by themselves give much useful information about the population, but
the relative frequencies for the sample data will usually be similar to the relative frequencies for the population.
- In addition, relative frequency histograms are useful for comparing two populations.

Stem-and-Leaf Plot

This plot presents a graphical display of the data using the actual numerical values of each data point.

Constructing a stem-and-leaf plot:
1. Divide each measurement into two parts: the stem and the leaf.
2. List the stems in a column, with a vertical line to their right.
3. For each measurement, record the leaf portion in the same row as its matching stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem-and-leaf coding so that the reader can recreate the actual measurements if necessary.

Dotplot

A dotplot is drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it is observed.

Interpreting Graphs with a Critical Eye

What to look for as you describe the data:
- Scales: the measurement unit, such as dollars, inches, etc.
- Location: where is the center of the data?
- Shape: the shape of the frequency distribution.
- Outliers: any unusual data values.

Distributions are often described by their shapes:
- symmetric
- skewed to the right (long tail goes right)
- skewed to the left (long tail goes left)
- unimodal (one peak), bimodal (two peaks), and multimodal (many peaks)

1.9 Measures of Central Tendency
- Mean
- Median
- Mode
- Midrange

A parameter is a number that describes a population characteristic, for example the population mean.
A statistic is a number that describes a sample characteristic, for example the sample mean.

The population mean is computed using all the individuals in a population. The population mean $\mu$ is a parameter:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

where $N$ is the total number of individuals in the population.

The sample mean is computed using sample data. The sample mean $\bar{x}$ is a statistic that is an unbiased estimator of the population mean:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

Note: In real-world applications, the population mean $\mu$ is usually not known and is estimated by
using the sample mean.

The median of a variable is the value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use $M$ to represent the median.

Steps in computing the median of a data set:
1. Arrange the data in ascending order.
2. Determine the number of observations, $n$.
3. Determine the observation in the middle of the data set.
   - If $n$ is odd, the median is the data value at position $(n+1)/2$. This is the middle data value.
   - If $n$ is even, the median is the average of the data values at positions $n/2$ and $n/2 + 1$. This is the mean of the two middle data values.

The mode of a variable is the most frequent observation of the variable in the data set. If no data value is repeated, the data set has no mode. If two data values occur with the same greatest frequency, each data value is a mode and the data set is called bimodal.

The midrange is the average of the largest and smallest data values.

- A trimmed mean is a compromise between the sample mean and the sample median. For example, a 10% trimmed mean can be calculated by eliminating the smallest 10% and the largest 10% of the sample and then averaging what is left over.

- The mean is sensitive (not resistant, not robust) to extreme data values.
- The median is not sensitive (resistant, robust) to extreme data values.
- The mode is not sensitive to extreme values.
- The midrange is sensitive to extreme data values.
- The trimmed mean is not sensitive to extreme data values.

1.10 Measures of Dispersion
- Variance
- Standard deviation
- Fourth spread (interquartile range, IQR)
- Range

Measures of dispersion measure the degree to which the data values are spread out. The more spread out the data values are, the larger the variation of the data values.

[Figure: visualizing variability using a histogram]

The sample variance is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

where $n - 1$ is the degrees of freedom.

The sample standard deviation is $s = \sqrt{s^2}$.

The population variance is

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$$

The population
standard deviation is $\sigma = \sqrt{\sigma^2}$.

The simplest measure of dispersion is the range:

Range = largest data value - smallest data value

Points to remember about the variance and standard deviation, and their relationship with the histogram:
- The values of $s$ and $s^2$ are always greater than or equal to zero.
- The larger the value of $s^2$ or $s$, the greater the variability of the data set.
- If $s^2$ or $s$ is equal to zero, all measurements must have the same value.
- The standard deviation $s$ is computed in order to have a measure of variability in the same units as the observations.
- The larger the standard deviation, the more spread out the data and the flatter the histogram.
- The smaller the standard deviation, the more clustered the data are around the mean and the taller the peak of the histogram.

1.11 Definition of Outlier

An observation or measurement that is unusually large or small relative to the other values in a data set is called an outlier. Outliers typically are attributable to one of the following causes:
- The measurement is observed, recorded, or entered into the computer incorrectly.
- The measurement comes from a different population.
- The measurement is correct but represents a rare chance event.

1.12 The Five-Number Summary and Boxplots

The five-number summary:

MINIMUM, $Q_1$, Median, $Q_3$, MAXIMUM

where $Q_1$ is the lower fourth and $Q_3$ is the upper fourth.

Interquartile range (IQR) = fourth spread: $f_s = Q_3 - Q_1$

Fractiles are numbers that partition, or divide, an ordered data set into equal parts.
- Quartiles divide a data set into four equal parts: $Q_1, Q_2, Q_3$.
- Deciles divide a data set into ten equal parts: $D_1, D_2, D_3, \ldots, D_9$.
- Percentiles divide a data set into 100 equal parts: $P_1, P_2, P_3, \ldots, P_{99}$.

Drawing a box plot:
- Find the five-number summary of the data set.
- Construct a horizontal scale that spans the range of the data.
- Determine the fences:
  Lower inner fence = $Q_1 - 1.5\,\text{IQR}$
  Lower outer fence = $Q_1 - 3.0\,\text{IQR}$
  Upper inner fence = $Q_3 + 1.5\,\text{IQR}$
  Upper outer fence = $Q_3 + 3.0\,\text{IQR}$
- Draw vertical lines at $Q_1$, $M$, and $Q_3$. Enclose these vertical lines in a box.
- Draw a whisker (a horizontal line) from $Q_1$ to the smallest data value that is larger than the lower
inner fence.
- Draw a whisker from Q3 to the largest data value that is smaller than the upper inner fence.
- Any data values between the lower outer fence and the lower inner fence, or between the upper inner fence and the upper outer fence, are outliers and are marked with an asterisk.
- Any data values less than the lower outer fence or greater than the upper outer fence are extreme outliers and are marked with a small circle.

[Figure: annotated box plot labeling the smallest data value, lower quartile (lower hinge), median, mean, upper quartile (upper hinge), largest data value inside the upper inner fence, upper inner fence, outliers, upper outer fence, and an extreme outlier]

Distribution Shape Based on the Box Plot
1. If the median is near the center of the box and the horizontal lines are of approximately equal length, the distribution is roughly symmetric.
2. If the median is left of the center of the box and/or the right line is substantially longer than the left line, the distribution is right skewed.
3. If the median is right of the center of the box and/or the left line is substantially longer than the right line, the distribution is left skewed.

[Figure: three frequency histograms illustrating symmetric, skewed right, and skewed left distributions]
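The fence rules from Section 1.12 can be sketched in Python as follows. This is only an illustrative sketch, not part of the original notes: the function names `fourths` and `box_plot_summary` and the sample data are made up for the example. It assumes each fourth is the median of one half of the sorted data (with the overall median included in both halves when n is odd), and it treats a value lying exactly on an inner fence as inside the fences. Textbooks and software packages use slightly different quartile conventions, so results may differ from other tools.

```python
from statistics import median


def fourths(data):
    """Return (lower fourth, upper fourth) of the data.

    Assumption: each fourth is the median of one half of the sorted data,
    and for odd n the overall median is included in both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # size of each half (middle value shared when n is odd)
    return median(xs[:half]), median(xs[n - half:])


def box_plot_summary(data):
    """Compute the five-number summary, IQR, fences, whiskers, and outliers."""
    xs = sorted(data)
    q1, q3 = fourths(xs)
    iqr = q3 - q1  # interquartile range (fourth spread)

    fences = {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }

    inside, mild, extreme = [], [], []
    for x in xs:
        if x < fences["lower_outer"] or x > fences["upper_outer"]:
            extreme.append(x)  # beyond an outer fence: extreme outlier
        elif x < fences["lower_inner"] or x > fences["upper_inner"]:
            mild.append(x)  # between inner and outer fences: outlier
        else:
            inside.append(x)  # within the inner fences (boundary counted as inside)

    return {
        "five_number": (xs[0], q1, median(xs), q3, xs[-1]),
        "iqr": iqr,
        "fences": fences,
        "whiskers": (inside[0], inside[-1]),  # whisker endpoints
        "mild_outliers": mild,
        "extreme_outliers": extreme,
    }


if __name__ == "__main__":
    # Hypothetical data, chosen only to show both kinds of outlier.
    sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30, 60]
    for name, value in box_plot_summary(sample).items():
        print(name, value)
```

For this made-up data set the sketch gives Q1 = 4.5, Q3 = 11, and IQR = 6.5, so the inner fences are -5.25 and 20.75 and the outer fences are -15 and 30.5. The whiskers run from 2 to 12, the value 30 is flagged as an outlier, and 60 as an extreme outlier.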