# Probability and Statistics for Engineers ST 370

This 261 page Class Notes was uploaded by Jordane Kemmer on Thursday October 15, 2015. The Class Notes belongs to ST 370 at North Carolina State University taught by Yichao Wu in Fall. Since its upload, it has received 41 views. For similar materials see /class/223949/st-370-north-carolina-state-university in Statistics at North Carolina State University.

Date Created: 10/15/15

North Carolina State University, STAT 370: Probability and Statistics for Engineers, Yichao Wu

#### Last class

- Statistics: the science of collecting, organizing, and interpreting data.
- Methods of statistics follow a process:
  1. Identify the research objective.
  2. Collect information.
  3. Organize and summarize information.
  4. Draw conclusions.
- Terminology: population, variable of interest, sample, observation, parameter, statistic, enumerative/analytical study, reasons for sampling.

Assignment: read Chapter 1 of Vardeman and Jobe.

#### Example 1: SAT Scores

Parents and teachers have been concerned about the trend of declining SAT scores and have sought ways to halt the decline, at least at the local level. Fifty students (24 males and 26 females), matched according to socioeconomic background, participated in a study to examine the effect of classroom atmosphere (strict or liberal) on student performance as measured by SAT scores at the end of the school year. The students were divided into two groups of 25 each (12 males and 13 females), with Group 1 studying under a strict atmosphere and Group 2 under a very permissive atmosphere.

After nine months, all students were given the same standardized tests: the verbal test and the mathematics test.

| Student | Group | Gender | SATMath | SATVer |
|---|---|---|---|---|
| A | Strict | F | 670 | 700 |
| B | Strict | M | 700 | 680 |
| C | Liberal | F | 750 | 730 |
| D | Liberal | M | 690 | 750 |

This example involves data collection, data analysis, and statistical inference. How?

Questions:

- Does a stricter classroom atmosphere increase the average score?
- Is the group size (50) large enough to make a confident conclusion?
- Why were the students matched according to socioeconomic background?
- Why 12 males and 13 females per group?

#### Fundamental Concepts

- Population: the entire group of individuals that we want information about. Here: students about to take the SAT.
- Sample: a part of the population that we actually examine in order to gather information. Here: the 50 students selected into the study.
- Sample size: the number of observations/individuals in a sample. Here: 50.
- Statistical inference: to make an
inference about a population based on the information contained in a sample. Here: based on the data from the study, infer whether a stricter classroom atmosphere increases SAT scores in general.

#### SAT Score Data

| Student | Group | Gender | SATMath | SATVer |
|---|---|---|---|---|
| A | Strict | F | 670 | 700 |
| B | Strict | M | 700 | 680 |
| C | Liberal | F | 750 | 730 |
| D | Liberal | M | 690 | 750 |

Data contain:

- Individuals: the objects described by the data.
- Variables: any characteristic of an individual. A variable can take different values for different individuals.

#### In-class exercise 1.1

One hundred people volunteer to take an experimental drug for weight loss. The amount of weight lost (in pounds) over the 6 months a person takes the drug is recorded.

- What is the variable of interest?
- What is the sample, and what is the sample size?
- What is a possible population?
- Give an example of a statistic that would be of interest here.

#### Types of Variables

- A categorical (or qualitative) variable places an individual into one of several groups or categories.
- A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

For the SAT score data above (Student, Group, Gender, SATMath, SATVer): which are the categorical or qualitative variables, and which are the quantitative variables?

#### In-class exercise 1.2

Classify as qualitative or quantitative:

- hair color
- salary
- weight of cars
- religious affiliation
- STAT 370 final exam score

#### Discrete vs. continuous

- A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of values (they can be lined up with 0, 1, 2, 3, ...). Counts are a classical example of discrete variables.
- A continuous variable is a quantitative variable that has an infinite number of possible values (it takes values in intervals or a continuum). The lifetime of a light bulb is continuous, though we tend to make it discrete by grouping into number of
days, etc.

#### In-class exercise 1.3

Determine whether the quantitative variable is discrete or continuous:

- Home runs hit by Barry Bonds this season
- Time spent studying for your first exam
- Strength of concrete in pounds per square inch
- Number of typos in a 500-page novel

#### Example 2: Students in STAT 370

Class roll variables: College, Class, Degree, Major.

How many categories are there? How many students are in each category? Equivalently, what is the distribution of each variable?

#### Distributions of Variables

The distribution of a variable indicates what values the variable takes and how often it takes these values.

- For a categorical variable, the distribution is the categories plus the count/percent for each category.
- For a quantitative variable, the distribution is the pattern of variation of its values.

Variable: Class

| Class | Count | Percent |
|---|---|---|
| FR | 80 | 80 |
| SO | 12 | 12 |
| JR | 4 | 4 |
| SR | 4 | 4 |

#### Observational vs. Experimental

Observational study: the investigator's role is basically passive.

- Individuals in a sample are studied, but no attempt is made to manipulate or influence the variables of interest.
- Good for establishing whether two variables are related, or for learning characteristics of a population.
- Carried out when control is unethical or impossible.
- A process, phenomenon, or group of subjects is watched and data are recorded.

Experimental study: the investigator's role is active.

- Variables are manipulated; the study environment is regulated.
- Treatments are applied to experimental units to try to determine the effects of the treatment on the response variable.
- Better for establishing causation.

#### Example: smoking and lung cancer

To determine whether there is a connection between smoking and lung cancer, individuals who have been smoking for some time are examined. The individuals are not controlled in terms of their eating habits, how much they smoked, etc. They are simply interviewed to determine whether they are smokers or non-smokers, and their rate of cancer is monitored. If there is a significant difference between the two groups' cancer rates, the researcher may claim that smoking causes
cancer, although the study has actually determined only that the two are associated. Smokers could have some characteristic (e.g., amount of exercise), the lurking variable, that differs from the non-smoking group and is the cause of the cancer.

#### Example: smoking and lung cancer (cont.)

On the spectrum of studies, the experimental end is preferred over the observational end, but at times an observational study is the best we can do (e.g., we wouldn't want to make people smoke).

To do this as an experimental study, we would need to randomly divide the population into two groups and, e.g., require one group to smoke a pack a day, each day, for the next 20 years. We could then control for other factors that aren't under our control in an observational study, e.g., assign the same diet and exercise regimen.

#### Causality

It is easier and safer to infer causality from an experiment than from an observational study. Real systems are complex. There may be important variables in the background that are changing and are the true reason for instances of favorable system behavior: lurking variables.

#### In-class exercise 1.4

Classify the following study as experimental or observational.

Gallup News Service conducted a survey of 1,012 adults aged 18 years or older, Aug. 29 to Sept. 5, 2000. The respondents were asked, "Has anyone in your household been a victim of a crime in the past 12 months?" 24% of the respondents answered yes. Gallup concluded that about 24% of households had been victimized by crime during the past year.

#### More terminology

- Univariate data: a single characteristic is observed on each unit.
- Multivariate data: more than one characteristic is observed on each unit.
  - Special case 1: bivariate data. Example: gender and weight of a patient.
  - Special case 2: repeated measures data. The same measurement is made multiple times on the same unit. Example: weight of a child every month.
  - Special case: paired data.

#### A special type of data structure

Example: making paper airplanes from construction paper, newspaper, or typing paper; with a paper clip or without a paper clip; with small wingspan, medium wingspan, or large
wingspan.

A complete factorial study: several process variables, and settings of each, are identified as being of interest, and data are collected under each possible combination of settings of the process variables.

#### Factorial study

- Factor: a process variable (usually controllable).
- Levels: the settings of a factor.

#### Fractional factorial study

A complete factorial study is sometimes not feasible (many factors and levels). In a fractional factorial study, data are collected for only some of the combinations that would make up a complete factorial study.

#### Measurement

- A measurement or measuring method is called valid if it usefully or appropriately represents the feature of an object or system that is of engineering importance.
- A measurement is called accurate (or unbiased) if, on average, it produces the true or correct value of the quantity being measured.
- A measurement system is called precise if it produces small variation in repeated measurements of the same subject.

#### In-class example 1.5

Semiconductor wafers from a certain company have a defective rate of about 8%. You have just ordered a batch of 10 wafers and want to predict how many will be defective.

- What is the sample, and what is the sample size?
- What is the population? What type of population is this?
- Is the 8% given in the description a parameter, a statistic, a variable, or something else?
- Would you use inferential statistics or deductive statistics to predict the number of defective wafers among the 10?

#### Sampling

The goal in sampling is to obtain individuals in such a way that accurate information may be obtained about the population. Here we discuss a basic sampling technique that has certain good properties.

A simple random sample of size n from a population is a sample selected in such a manner that every collection of n items in the population is a priori equally likely to compose the sample. This only works for enumerative studies.

#### Simple random sampling (SRS): why?

Random sampling avoids selection bias, e.g.,
suppose I am producing a drug and I want to show that it has good effects: I can select the healthier or younger patients as the group to take my drug.

- We can quantify bias and the general effects of sampling.
- It does not guarantee a good or representative sample every time (we can get all small values or all large values); it only guarantees long-run behavior.

#### SRS: how?

Every group of n distinct units out of the N in the population has an equal chance of being selected.

- Consequence: every unit in the population has an equal chance of being selected to be in the sample.
- Paradigm: drawing names out of a hat.
- In practice we use computer-generated random numbers.

#### SRS: example

Draw a sample of 2 from a population of 7 students with population mean weight 150.

Data (weights): 100, 110, 120, 130, 150, 200, 240.

Note also the variance in weights for the population, another population parameter.

Take all 21 samples of size 2 and compute the average for each:

(1,2) (1,3) (1,4) (1,5) (1,6) (1,7) (2,3) (2,4) (2,5) (2,6) (2,7) (3,4) (3,5) (3,6) (3,7) (4,5) (4,6) (4,7) (5,6) (5,7) (6,7)

#### SRS: example (cont.)

Notice the variation in the sample mean $\bar{x}$. Though the actual average weight is 150, if we take sample (1,2) the average weight is 105, and if we take sample (6,7) the average is 220.

#### Why SRS?

Why select an SRS of 5 students to represent the class?

- Bias in selecting students in the front or the back.
- Bias in selecting students whose names I know.
- Potential bias in a sample of convenience.

#### SRS: why not?

It is not always feasible, e.g., what if the population is conceptual?

Example: take an SRS of people who run at a particular track. How do we randomize the selection?

#### Convenience sample

Often we take a sample of convenience, in which individuals are easily obtained. The most popular convenience sample is one in which the individuals are self-selected, i.e., they decide themselves to participate in the survey. These are also called voluntary response surveys. Examples include phone-in polling and Internet surveys.

This is not a good sampling design, and thus we should be careful in generalizing the
conclusions from them to the entire population.

#### Take-home message

Variable of interest, population, sample, sample size, enumerative study, analytical study, parameter, statistic, inferential statistics, categorical/qualitative variable, quantitative variable, discrete variable, continuous variable, distribution, observational study, experimental study, causality, lurking variable, univariate variable, multivariate variable, repeated measurement, paired data, factor, level, complete factorial study, fractional factorial study, valid, accurate, precise, simple random sample.

Assignment: read Chapter 2 of Vardeman and Jobe.

North Carolina State University, STAT 370: Probability and Statistics for Engineers, Yichao Wu

#### Last class

Histogram:

- Intervals must be nonoverlapping. Typically an observation equal to a boundary value is put in the higher interval.
- Intervals must be contiguous (rectangles touch each other).
- Intervals must be of equal width.
- Often choose nice boundaries.

Frequency table, relative frequency table.

Examining a distribution:

- Shape: symmetric, right skewed, left skewed, bell-shaped
- Center: mode, mean, median
- Spread

Homework 1 is due Friday 1/16 at 5 PM. Homework 2 is due Friday 1/23 at 5 PM.

[Figure: examples of left-skewed and right-skewed distributions]

#### Histogram: how?

1. Choose intervals (bins) that cover the entire range of the data.
2. Count the number of observations per interval.
3. Draw rectangles with heights corresponding to the number in each interval (for a relative frequency histogram, the height is the relative frequency of the interval).

Notes:

- Intervals must be nonoverlapping. Typically an observation equal to a boundary value is put in the higher interval.
- Intervals must be contiguous (rectangles touch each other).
- Intervals must be of equal width.
- Often choose nice boundaries.

#### In-class exercise

The volume of a stock is the number of shares traded on a given day. The following data, given in millions (so that 3.78 represents 3,780,000 shares traded), represent the volume of Altria Group for 35 trading days in 2004:

3.78 6.06 5.32 3.04 10.32 3.38 10.96 8.74 5.75 3.25 3.34 3.33 3.33 4.53 4.35 3.34 63 3.33 23 44 7.97 5.02 6.92
7.57 7.16 6.52 9.70 3.01 8.40 6.23 6.07 4.88 4.43 3.56 5.58

(a) Construct a frequency distribution of the data using bins of width 2.

(b) Construct a relative frequency distribution of the data using bins of width 2.

(c) Construct a frequency histogram and a relative frequency histogram of the data using bins of width 2.

(d) On what percentage of the 35 days were at least 6 million shares traded?

(e) Describe the shape of the distribution.

Choose intervals: 2 to 4, 4 to 6, 6 to 8, 8 to 10, 10 to 12.

#### Measuring center: the mean $\bar{x}$

Mean: the average value.

The sample mean $\bar{x}$: if the n observations in a sample are $x_1, x_2, \ldots, x_n$, then their mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

#### Measuring center: the median

Median: the middle value or center point.

The sample median M is the number such that half of the observations are smaller than it and the other half are larger: the midpoint of a distribution.

Procedure to calculate the median M:

1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If n is even, then M is the mean of the two center observations in the ordered list.

Note: $(n+1)/2$ is the location of the median, not the median itself.

#### Example: fuel economy (miles per gallon) for 2001 two-seater cars

The highway mileages of 18 gasoline-powered two-seater cars:

13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30

Mean = ? Median = ?

The highway mileages of 19 two-seater cars:

13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30 68

Mean = ? Median = ?

#### Example: salary survey of UNC graduates

Survey a certain number of graduates from UNC. A lot of departments are surveyed.

Question: which department produces students who earn the most, on average, ten years after they got their degrees?

Answer: Geography (Michael Jordan).

#### Mean vs. median

Mean:

- easy to calculate
- easy to work with algebraically
- highly affected by outliers; not a resistant measure

Median:

- can be time consuming to calculate
- more resistant to a few extreme observations (sometimes outliers); robust

#### Mode

The most frequent value in
the data.

- Important for categorical data.
- It is possible to have more than one mode.

#### Mean, median, and mode

If a unimodal distribution is exactly symmetric, the mean, the median, and the mode are exactly the same. If the distribution is skewed, the three measures differ.

#### Which one to use?

- They are different by definition.
- The mean and the median are unique and are defined only for quantitative variables.
- The mode is not unique. The mode is also defined for categorical variables.
- The choice depends on the shape of the distribution, the type of data, and the purpose of your study:
  - Skewed: median
  - Categorical: mode
  - Total quantity: mean

#### Outliers

Observations that lie outside the overall pattern of a distribution.

Possible reasons:

- Error in data entry (the most likely reason)
- Equipment failure
- Human error
- Missing value code
- Extraordinary individuals (Jordan's salary)

#### Handling outliers

- Detect them using graphical and numerical methods.
- Check the data to make sure they were entered correctly.
- Reduce the influence of the outlier:
  - delete the observation (BE CAREFUL)
  - use transformations
  - use robust methods

[Figure: Speed of Light Histogram; x-axis: Time, y-axis: Frequency]

#### Numerical summary for distributions

- Center: mean, median, mode
- Spread: five-number summary and boxplot, standard deviation

Choose at least one from each category.

#### Why do we need spread?

Knowing the center of a distribution alone is not a good enough description of the data.

- Two basketball players with the same shooting percentage may be very different in terms of consistency.
- Two companies may have the same average salary but very different distributions.

We need to know the spread, or the variability, of the values.

#### A raw measure: range

Range = maximum - minimum

- Depends on only two values.
- Tends to increase with larger samples.
- Affected by outliers; not robust.

#### Quantiles

Definition (page 78 of Vardeman and Jobe): for a data set consisting of n values that, when ordered, are $x_1 \le x_2 \le \cdots \le x_n$,

1. if $p = \frac{i - 0.5}{n}$ for a positive integer $i \le n$, the p quantile of the data set is $Q(p) = Q\left(\frac{i - 0.5}{n}\right) = x_i$. The ith smallest data point will be called the
$\frac{i - 0.5}{n}$ quantile.
2. for any number p between $\frac{0.5}{n}$ and $\frac{n - 0.5}{n}$ that is not of the form $\frac{i - 0.5}{n}$ for an integer i, the p quantile of the data set will be obtained by linear interpolation between the two values of $Q\left(\frac{i - 0.5}{n}\right)$ with corresponding $\frac{i - 0.5}{n}$ that bracket p.

In both cases, the notation $Q(p)$ will be used to denote the p quantile.

#### Quartiles

The sample quartiles are the values that divide the sorted sample into quarters, just as the median divides it into halves.

The most commonly used quantiles:

- The median M: the 0.50 quantile (50th percentile)
- The first quartile Q1: the 0.25 quantile (25th percentile)
- The third quartile Q3: the 0.75 quantile (75th percentile)

#### Calculation of quartiles

- The first quartile Q1 is the median of the observations that are less than the overall median.
- The third quartile Q3 is the median of the observations that are greater than the overall median.

#### Interquartile range (IQR)

$$\mathrm{IQR} = Q_3 - Q_1$$

- The range of the center half of the data.
- A resistant measure of spread.
- The IQR can be used to identify suspected outliers. Rule of thumb: an observation is called a suspected outlier if it falls more than $1.5 \times \mathrm{IQR}$ above Q3 or below Q1.

#### Example: 2001 two-seater cars

The highway mileages of 18 gasoline-powered two-seater cars:

13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30

- Q1 = median of 13 13 16 19 21 21 23 23 24 = 21
- Q3 = median of 26 26 27 27 27 28 28 30 30 = 27
- IQR = Q3 - Q1 = 27 - 21 = 6
- 1.5 x IQR = 9
- Q3 + 1.5 x IQR = 36
- Q1 - 1.5 x IQR = 12

No outliers.

#### Example: 2001 two-seater cars (cont.)

The highway mileages of 19 two-seater cars:

13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30 68

Any outliers?

#### The five-number summary

To get a quick summary of both center and spread, use the following five-number summary:

Minimum, Q1, M, Q3, Maximum

#### Example: city/highway mileage of 2001 two-seater cars

City mileage: 1 18 19 20 20 21 21 22 22 25 61

Five-number summary:

Highway mileage: 13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30 68

Five-number summary:

#### Boxplots: a visual representation of the five-number summary

A boxplot consists of:

- A central box that spans the quartiles Q1 and Q3.
- A line inside the box that marks the median M.
- Lines that extend from the box out to the
smallest and largest observations.

[Figure: boxplots of highway and city gas mileages for two-seaters and minicompacts]

#### Pros and cons of boxplots

- The location of the median line in the box indicates symmetry or asymmetry.
- Best used for side-by-side comparison of more than one distribution at a glance.
- Less detailed than histograms or stem plots.
- The box focuses attention on the central half of the data.

[Figure: boxplots of income by education level (No HS, Some HS, HS grad, Some college, Bachelor's degree, Higher degree); axis: Income, 0 to 200,000]

#### Modified boxplot

The boxplot as described so far cannot reveal possible outliers. To modify it, the two lines extend out from the central box only to the smallest and largest observations that are not suspected outliers. Observations more than 1.5 x IQR outside the box are plotted as individual points.

Table 1.2: Percent of Hispanics in the adult population, by state (2000)

| State | Percent | State | Percent | State | Percent |
|---|---|---|---|---|---|
| Alabama | 1.5 | Louisiana | 2.4 | Ohio | 1.6 |
| Alaska | 3.6 | Maine | 0.6 | Oklahoma | 4.3 |
| Arizona | 21.3 | Maryland | 4.0 | Oregon | 6.5 |
| Arkansas | 2.8 | Massachusetts | 5.6 | Pennsylvania | 2.6 |
| California | 28.1 | Michigan | 2.7 | Rhode Island | 7.0 |
| Colorado | 14.9 | Minnesota | 2.4 | South Carolina | 2.2 |
| Connecticut | 8.0 | Mississippi | 1.3 | South Dakota | 1.2 |
| Delaware | 4.0 | Missouri | 1.8 | Tennessee | 2.0 |
| Florida | 16.1 | Montana | 1.6 | Texas | 28.6 |
| Georgia | 5.0 | Nebraska | 4.5 | Utah | 8.1 |
| Hawaii | 5.7 | Nevada | 16.7 | Vermont | 0.8 |
| Idaho | 6.4 | New Hampshire | 1.4 | Virginia | 4.2 |
| Illinois | 10.7 | New Jersey | 12.3 | Washington | 6.0 |
| Indiana | 3.1 | New Mexico | 38.7 | West Virginia | 0.6 |
| Iowa | 2.3 | New York | 13.8 | Wisconsin | 2.9 |
| Kansas | 5.8 | North Carolina | 4.3 | Wyoming | 5.5 |
| Kentucky | 1.3 | North Dakota | 1.0 | | |

[Figure: Percent of Hispanic Adults per State; axis: Percent Hispanic]

#### Take-home message

- Examine distributions:
  - Overall pattern
  - Shape: symmetric or skewed? How many modes? Bell-shaped?
  - Outliers
- Graphical tools for quantitative data: stemplot, histograms,
boxplot.

Mean, median, mode.

Read Sections 3.1 and 3.2 of Vardeman and Jobe.

North Carolina State University, STAT 370: Probability and Statistics for Engineers, Yichao Wu

#### Optional textbook

Stephen B. Vardeman and J. Marcus Jobe. ISBN 053436957X.

- Amazon: $163.95
- Addall.com: around 70 bucks
- Packbackers (Hillsborough Street) textbooks

#### Lecture format

- PowerPoint presentation plus blackboard illustration.
- Keep focused in class.
- Class web page: http://www.stat.ncsu.edu/people/wu/courses/st370
- WebAssign@NCSU: http://webassign.ncsu.edu. Make sure you have the correct email address on file.

#### Homework and quizzes

- Ten (plus two optional) homework assignments.
- Assigned and due online (WebAssign).
- No late homework will be accepted.
- I only count your ten best homework scores; the lowest 2 homework scores will be dropped.
- At least five in-class quizzes will be given.
- The semester homework average will count as 20% of the course grade. The quiz average will count as 5% of the course grade.

#### Exam policy

- The final exam is cumulative.
- All exams are required, with no makeup permitted.
- A review session will be held prior to each exam.

#### Course grade

| Component | Points | Notes |
|---|---|---|
| Homework | 100 | Best 10 out of 12 assignments |
| First Midterm | 100 | Friday, February 13, in class |
| Second Midterm | 100 | Wednesday, March 25, in class |
| Project | 75 | |
| Quizzes | 25 | |
| Final Exam | 100 | Monday, April 27, 8:00 AM to 11:00 AM, in class |

#### To know you better and quicker

- Talk with me before or after class:
  - Concerns about this mathematics class
  - Suggestions to improve the lectures
  - Things you like and dislike
- Come to my office, during office hours or not: Patterson 2090.
- Call: 513-7677.
- Email: wu@stat.ncsu.edu. Usually a prompt reply.
- Ask and answer questions in class.

#### What is your background in Statistics?

- Freshman, Sophomore, Junior, Senior
- Biomedical engineering, Chemistry, Physics, Soil Science

#### What I would like to know about you

- Name
- Year
- Major
- Background in Statistics (previous training)
- Opinions about Statistics (like, hate, do not know)
- One problem that you think Statistics can help you to solve
- Any other things you want me to know

Hand in the above info on a
piece of paper after class.

#### My strategy for success

- Stay active and involved in class (answer questions).
- Ask questions during class, especially if you cannot see, read, hear, or understand anything. Do not feel shy or stupid.
- Make effective use of office hours.
- Keep pace with the lectures (review daily).
- Do homework after each lecture to help understand the materials.
- Include in your preparation for exams the solving of many problems, e.g., those in the text with solutions provided.

Any questions?

#### What I would like to know about you

- Name
- Year
- Major
- Background in Statistics
- Opinions about Statistics
- One problem that you think Statistics can help you to solve
- Any other concerns you want me to know

Hand in the above info on a piece of paper.

#### How can Statistics help us?

- ... claims (at least it used to claim) that it contains 1000 chips. Is this true?
- What is the chance that a poker gambler gets a Royal Flush (A, K, Q, J, 10 of the same suit) at Atlantic City?
- Among a group of randomly chosen people, how likely is it for two of them to have the same birthday?
- What is the relationship between income and years of education?
- Design your own experiment, collect data, analyze data, and draw conclusions.

#### Why do engineers need statistics?

Engineers build, design, operate, and/or improve physical systems and products. When one fails, the engineer may need to collect and interpret data to help understand the process.

Statistics is the study of how best to

(a) collect data,
(b) summarize or describe data, and
(c) draw formal inferences and practical conclusions on the basis of data,

all the while recognizing the reality of variation.

#### What is Statistics?

Statistics: the science of collecting, organizing, and interpreting data.

[Figure: a population, a sample of data drawn from it, and inference about the population using statistical tools]

#### Methods of statistics follow a process

1. Identify the research objective: what is the question to be answered, and what is the collection of values or individuals that we want to make statements about (the group of interest, or population)?
2. Collect the information needed to answer the
questions. Gaining access to the entire population may pose problems, and thus we typically look at a subset of the population, called a sample, to observe the variable of interest.

#### Example: want opinion on an issue at NCSU

- Variable of interest: opinion
- One observation: one student
- Sample: students giving their opinion
- Population: who do we want to generalize to? Possibilities are:
  - Everyone at NCSU (census)
  - Engineering students
  - Male students
  - Undergraduates
  - NCSU ST 370 students

#### Example: lifetime of a lightbulb

- Variable of interest: lifetime in hours
- One observation: one lightbulb
- Sample: 30 lightbulbs (30 lifetimes, one sample)
- Population: all lightbulbs that could be manufactured

A population could be conceptual.

#### Enumerative vs. analytical

- In an enumerative study we have a finite population, e.g., the population is our class.
- In an analytical study we have an infinite or conceptual population (one that does not all exist at one time or place).

#### Why do we take samples?

Why do we take samples instead of observing the whole population?

- The population may be too large.
- Time restrictions.
- The population might be conceptual, as in the example above.
- Impractical: the experiment breaks what we are testing.
- Limited resources to collect accurate data.
- The population might be inaccessible.

#### Methods of statistics follow a process (cont.)

3. Organize and summarize the information: give descriptive statistics that describe the data through numerical measurements, tables, charts, and graphs.

#### Collecting data

Collecting data means observing values of one or more variables.

- We want to know about the distribution of the variables, that is, the possible values and the corresponding prevalence of different sets of possible values.
- Sometimes we might settle for summaries of the distribution.
- Summaries of the distribution of the whole population are called parameters.
- Summaries of the distribution of the sample (observed values only) are called statistics.

#### Methods of statistics follow a process (cont.)

4. Draw conclusions from the information: the information collected from the sample is generalized to the population, and their
reliability is measured (i.e., inferential statistics).

Example: a researcher is conducting a study based on the population of Americans aged 18 years or older and obtains a sample of 1,100 Americans in that age group. The results obtained from the sample would be generalized to the population. There is always uncertainty when using samples to draw conclusions regarding a population, because we can't learn everything about a population from a sample. Therefore statisticians will report a level of confidence in their conclusions. This level of confidence is a way of representing the reliability of the results.

If the entire population is studied, then inferential statistics is not necessary, because descriptive statistics would provide all the information that we need regarding the population.

#### Example 1: SAT Scores

Parents and teachers have been concerned about the trend of declining SAT scores and have sought ways to halt the decline, at least at the local level. Fifty students (24 males and 26 females), matched according to socioeconomic background, participated in a study to examine the effect of classroom atmosphere (strict or liberal) on student performance as measured by SAT scores at the end of the school year. The students were divided into two groups of 25 each (12 males and 13 females), with Group 1 studying under a strict atmosphere while Group 2 studied under a very permissive atmosphere.

After nine months, all students were given the same standardized tests: the verbal test and the mathematics test.

| Student | Group | Gender | SATMath | SATVer |
|---|---|---|---|---|
| A | Strict | F | 670 | 700 |
| B | Strict | M | 700 | 680 |
| C | Liberal | F | 750 | 730 |
| D | Liberal | M | 690 | 750 |

This example involves data collection, data analysis, and statistical inference. How?

Questions:

- Does a stricter classroom atmosphere increase the average score?
- Is the group size (50) large enough to make a confident conclusion?
- Why were the students matched according to socioeconomic background?
- Why 12 males and 13 females per group?

#### Fundamental Concepts

- Population: the entire group of individuals that we want information about.
  Here: students about to take the SAT.
- Sample: a part of the population that we actually examine in order to gather information. Here: the 50 students selected into the study.
- Sample size: the number of observations/individuals in a sample. Here: 50.
- Statistical inference: to make an inference about a population based on the information contained in a sample. Here: based on the data from the study, infer whether a stricter classroom atmosphere increases SAT scores in general.

#### SAT Score Data

| Student | Group | Gender | SATMath | SATVer |
|---|---|---|---|---|
| A | Strict | F | 670 | 700 |
| B | Strict | M | 700 | 680 |
| C | Liberal | F | 750 | 730 |
| D | Liberal | M | 690 | 750 |

Data contain:

- Individuals: the objects described by the data.
- Variables: any characteristic of an individual. A variable can take different values for different individuals.

#### In-class exercise 1.1

One hundred people volunteer to take an experimental drug for weight loss. The amount of weight lost (in pounds) over the 6 months a person takes the drug is recorded.

- What is the variable of interest?
- What is the sample, and what is the sample size?
- What is a possible population?
- Give an example of a statistic that would be of interest here.

#### Types of Variables

- A categorical (or qualitative) variable places an individual into one of several groups or categories.
- A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

For the SAT score data above (Student, Group, Gender, SATMath, SATVer): which are the categorical or qualitative variables, and which are the quantitative variables?

#### In-class exercise 1.2

Classify as qualitative or quantitative:

- hair color
- salary
- weight of cars
- religious affiliation
- STAT 370 final exam score

#### Quantitative variable: discrete vs. continuous

A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of values (they can be lined up with 0, 1, 2, 3, ...). Counts are a classical example of
discrete variables.

A continuous variable is a quantitative variable whose possible values cannot be counted: it takes values in an interval or a continuum. The lifetime of a light bulb is continuous, though we tend to make it discrete by grouping it into number of days, etc.

In-class exercise 1.3

Determine whether each quantitative variable is discrete or continuous:
- Home runs hit by Barry Bonds this season
- Time spent studying for your first exam
- Strength of concrete in pounds per square inch
- Number of typos in a 500-page novel

Example 2: Students in STAT 370 (Class Roll)

Variables: College, Class, Degree, Major. How many categories are there? How many students are in each category? Equivalently, what is the distribution of each variable?

Distributions of Variables

The distribution of a variable indicates what values the variable takes and how often it takes these values.
- For a categorical variable, the distribution is the list of categories with the count or percent for each category.
- For a quantitative variable, the distribution is the pattern of variation of its values.

Variable: Class

| Class | Count | Percent |
|---|---|---|
| FR | 80 | 80 |
| SO | 12 | 12 |
| JR | 4 | 4 |
| SR | 4 | 4 |

Observational vs Experimental

Observational study: the investigator's role is basically passive.
- Individuals in a sample are studied, but no attempt is made to manipulate or influence the variables of interest.
- Good for establishing whether two variables are related, or for learning characteristics of a population.
- Carried out when control is unethical or impossible.
- A process, phenomenon, or group of subjects is watched and data are recorded.

Experimental study: the investigator's role is active.
- Variables are manipulated; the study environment is regulated.
- Treatments are applied to experimental units to try to determine the effects of the treatment on the response variable.
- Better for establishing causation.

Example: smoking and lung cancer

To determine whether there is a connection between smoking and lung cancer, individuals who have been smoking for some time are examined. The individuals are not controlled in terms of their eating habits,
how much they smoked, etc. They are simply interviewed to determine whether they are smokers or non-smokers, and their rate of cancer is monitored. If there is a significant difference between the two groups' cancer rates, the researcher may claim that smoking causes cancer, but has actually only determined that the two are associated. Smokers could have some characteristic (e.g., amount of exercise), the lurking variable, that differs from the non-smoking group and that is the cause of the cancer.

Example: smoking and lung cancer (cont.)

To do this as an experimental study, we would need to randomly divide people into two groups and, e.g., require one group to smoke a pack a day, each day, for the next 20 years. We could then control for other factors that aren't under our control in an observational study, e.g., assign the same diet and exercise regimen. On the spectrum of studies, the experimental end is preferred over the observational end, but at times an observational study is the best we can do (e.g., we wouldn't want to make people smoke).

Causality
- It is easier and safer to infer causality from an experiment than from an observational study.
- Real systems are complex. There may be important variables in the background that are changing and are the true reason for instances of favorable system behavior: lurking variables.

Take-home message
- Statistics: the science of collecting, organizing, and interpreting data.
- Methods of statistics follow a process: identify the research objective; collect information; organize and summarize information; draw conclusions.
- Terms: variable of interest, population, sample, sample size, enumerative study, analytical study, parameter, statistics, inferential statistics, categorical (qualitative) variable, quantitative variable, discrete variable, continuous variable, distribution, observational study, experimental study, causality.
- Assignment: read Chapter 1 of Vardeman and Jobe.

Last class: mean; median; quantile (special case: percentile); quartile.

Homework 2 due
Friday, Jan 23 at 5 PM.

Quartiles

The sample quartiles are the values that divide the sorted sample into quarters, just as the median divides it into halves. The most commonly used quantiles:
- The median M: 50th percentile
- The first quartile Q1: 25th percentile
- The third quartile Q3: 75th percentile

Calculations of Quartiles
- The first quartile Q1 is the median of the observations whose position in the sorted list is below that of the overall median.
- The third quartile Q3 is the median of the observations whose position in the sorted list is above that of the overall median.

Interquartile Range (IQR)

IQR = Q3 - Q1
- The range of the center half of the data.
- A resistant measure of spread.
- The IQR can be used to identify suspected outliers. Rule of thumb: an observation is called a suspected outlier if it falls more than 1.5 × IQR above Q3 or below Q1.

Example: 2001 Two-seater Cars

The highway mileages of 18 gasoline-powered two-seater cars:

13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30

- Q1 = median of 13 13 16 19 21 21 23 23 24 = 21
- Q3 = median of 26 26 27 27 27 28 28 30 30 = 27
- IQR = Q3 - Q1 = 27 - 21 = 6, so 1.5 × IQR = 9
- Q3 + 1.5 × IQR = 36 and Q1 - 1.5 × IQR = 12
- No outliers.

Example: 2001 Two-seater Cars

The highway mileages of 19 two-seater cars:

13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30 68

Any outlier?

Can we get Q1, the median, and Q3 from a histogram?

The five-number summary

To get a quick summary of both center and spread, use the following five-number summary:

Minimum, Q1, M, Q3, Maximum

Example: City/Highway Mileage of 2001 Two-seater Cars
- City mileage: 1 1819202021212222 25 61. Five-number summary?
- Highway mileage: 13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30 68. Five-number summary?

Boxplots: a visual representation of the five-number summary

A boxplot consists of:
- A central box that spans the quartiles Q1 and Q3.
- A line inside the box that marks the median M.
- Lines that extend from the box out to the smallest and largest observations.
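The quartile calculation, the 1.5 × IQR rule of thumb, and the five-number summary described above can be checked with a short script. This is only a minimal sketch in Python, not code from the course: the function names (`median`, `quartiles`, `five_number_summary`, `suspected_outliers`) are made up for illustration, and the quartiles follow the median-of-halves convention used in these notes, which can differ slightly from what statistical software reports.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule from the notes."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # positions below the overall median
    upper = data[half + (n % 2):]  # positions above the overall median
    return median(lower), median(data), median(upper)


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max)."""
    q1, m, q3 = quartiles(values)
    return min(values), q1, m, q3, max(values)


def suspected_outliers(values):
    """Observations more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    fence = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - fence or x > q3 + fence]


# Highway mileages from the two-seater examples in the notes.
highway_18 = [13, 13, 16, 19, 21, 21, 23, 23, 24,
              26, 26, 27, 27, 27, 28, 28, 30, 30]
highway_19 = highway_18 + [68]

print(five_number_summary(highway_18))  # (13, 21, 25.0, 27, 30)
print(suspected_outliers(highway_18))   # []
print(five_number_summary(highway_19))  # (13, 21, 26, 28, 68)
print(suspected_outliers(highway_19))   # [68]
```

For the 18 cars this reproduces Q1 = 21, Q3 = 27, and no suspected outliers; adding the 68 mpg car moves Q3 to 28 and flags 68, since it lies above Q3 + 1.5 × IQR = 38.5.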
[Figure: boxplots of highway and city gas mileages for two-seaters and minicompacts]

Pros and cons of Boxplots
- The location of the median line in the box indicates symmetry or asymmetry.
- Best used for side-by-side comparison of more than one distribution at a glance.
- Less detailed than histograms or stemplots.
- The box focuses attention on the central half of the data.

[Figure: boxplots of income (0 to 200,000) by education level: no HS, some HS, HS grad, some college, bachelor's degree, higher degree]

Modified Boxplot

The basic boxplot cannot reveal possible outliers. To modify it:
- The two lines extend out from the central box only to the smallest and largest observations that are not suspected outliers.
- Observations more than 1.5 × IQR outside the box are plotted as individual points.

TABLE 1.2 Percent of Hispanics in the adult population, by state (2000)

| State | Percent | State | Percent | State | Percent |
|---|---|---|---|---|---|
| Alabama | 1.5 | Louisiana | 2.4 | Ohio | 1.6 |
| Alaska | 3.6 | Maine | 0.6 | Oklahoma | 4.3 |
| Arizona | 21.3 | Maryland | 4.0 | Oregon | 6.5 |
| Arkansas | 2.8 | Massachusetts | 5.6 | Pennsylvania | 2.6 |
| California | 28.1 | Michigan | 2.7 | Rhode Island | 7.0 |
| Colorado | 14.9 | Minnesota | 2.4 | South Carolina | 2.2 |
| Connecticut | 8.0 | Mississippi | 1.3 | South Dakota | 1.2 |
| Delaware | 4.0 | Missouri | 1.8 | Tennessee | 2.0 |
| Florida | 16.1 | Montana | 1.6 | Texas | 28.6 |
| Georgia | 5.0 | Nebraska | 4.5 | Utah | 8.1 |
| Hawaii | 5.7 | Nevada | 16.7 | Vermont | 0.8 |
| Idaho | 6.4 | New Hampshire | 1.4 | Virginia | 4.2 |
| Illinois | 10.7 | New Jersey | 12.3 | Washington | 6.0 |
| Indiana | 3.1 | New Mexico | 38.7 | West Virginia | 0.6 |
| Iowa | 2.3 | New York | 13.8 | Wisconsin | 2.9 |
| Kansas | 5.8 | North Carolina | 4.3 | Wyoming | 5.5 |
| Kentucky | 1.3 | North Dakota | 1.0 | | |

[Figure: modified boxplot of the percent of Hispanic adults per state]

Reading boxplots
- See the 5-number summary.
- Measures of center: the median; the interval from Q1 to Q3 describes the middle 50% of the data.
- Measures of spread: range = max - min; IQR.

Reading boxplots (cont.)

Shape: skewness
- A right-skewed data set would have a longer right (top) whisker and the median closer to the left (bottom) of the box.
- A left-skewed data set would have a longer left (bottom) whisker and the median closer to the right (top) of the box.
- Can't
see multiple modes.

In-class exercise

The following are data on the impact strength (in ft-lb) of sheets of insulating material cut in two different ways.

Lengthwise cuts: 1.15 0.84 0.88 0.91 0.86 0.88 0.92 0.87 0.93 0.95

Crosswise cuts: 0.89 0.69 0.46 0.85 0.73 0.67 0.78 0.77 0.80 0.79

1. For each group, compute the five-number summary (min, Q1, Q2, Q3, max).
2. Draw and label side-by-side boxplots for comparing the two cutting methods. Discuss what these show about the two methods.

Sample Variance $s^2$

Deviation from the mean: the difference between an observation and the sample mean, $x_i - \bar{x}$.

Sample variance $s^2$: the average of the squares of the deviations of the observations from their mean, with divisor $n - 1$:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Toy Examples

Data: 210 12. What is the sample variance? How about this: 40V 40V 40V 40V 40?

Sample Standard Deviation $s$

Sample standard deviation $s$: the square root of the sample variance,

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Remarks on the definition of sample SD
- The sum of the deviations of the observations from their mean is always 0.
- Why square the deviations rather than take absolute deviations? The mean is the natural center under squaring, and the SD is the natural measure of spread for normal distributions.

Remarks on sample SD
- Why the sample SD rather than the sample variance? The SD is natural for measuring spread for normal distributions, and the SD is in the original scale.
- Why $n - 1$ rather than $n$? Intuitively speaking, the SD is not defined for $n = 1$. The sum of the deviations is always 0, which means that if we know $n - 1$ of them we know the last one. Only $n - 1$ deviations can change freely: $n - 1$ degrees of freedom.

Properties of sample SD
- $s$ measures the spread about the mean.
- $s$ should be used only when the mean is chosen to measure the center.
- $s = 0$ if and only if there is no spread. Otherwise $s > 0$, and $s$ increases as the spread increases.
- $s$, like the mean, is not resistant. It is even less resistant. Why?

Example: 2001 Two-seater Cars

The highway mileages of the 18 gasoline-powered two-seater cars: 13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30. Mean = 23.4, SD = 5.3.

The highway mileages of the 19 two-seater cars: 13 13 16 19 21 21 23 23 24 26 26
27 27 27 28 28 30 30 68. Mean = 25.8, SD = 11.4.

Three measures of spread
- The range is the spread of all the observations.
- The interquartile range is the spread of roughly the middle 50% of the observations.
- The sample SD measures distance from the sample mean: it can be regarded as a typical distance of the observations from their mean.

The five-number summary vs sample mean and SD
- The five-number summary is preferred for a skewed distribution or a distribution with strong outliers.
- $\bar{x}$ and $s$ are preferred for reasonably symmetric distributions that are free of outliers.
- Always plot your data first. Use boxplots.

Statistics and parameters
- Numerical summarizations of sample data are called sample statistics.
- Numerical summarizations of population and theoretical distributions are called parameters.
- Roman letters are used as symbols for statistics, and Greek letters are used to stand for parameters.

Take Home Message
- Examine distributions. Overall pattern: shape (symmetric or skewed? how many modes? bell-shaped?). Outliers.
- Graphical tools for quantitative data: stemplot, histogram, boxplot.
- Measuring center: mean, median, mode.
- Spread: IQR, range, standard deviation.
- Read Sections 3.1, 3.2, and 3.3 of Vardeman and Jobe.
- Homework 2 due Friday, Jan 23 at 5 PM.

Last class: variable of interest, population, sample, sample size, enumerative study, analytical study, parameter, statistics, inferential statistics, categorical (qualitative) variable, quantitative variable, discrete variable, continuous variable, distribution, observational study, experimental study, causality, lurking variable, univariate variable, multivariate variable, repeated measurement, paired data, factor, level, complete factorial study, fractional factorial study, valid, accurate, precise.

Assignment: read Chapter 2 of Vardeman and Jobe. Homework 1 due Friday 16 at 5 PM.

A special type of data structure

Example: making paper airplanes: construction paper, newspaper, or typing paper; with paper clip or
without paper clip; small wingspan, medium wingspan, or large wingspan.

A complete factorial study: several process variables, and settings of each, are identified as being of interest, and data are collected under each possible combination of settings of the process variables.

Factorial study
- Factor: a process variable (usually controllable).
- Levels: the settings of a factor.

Fractional factorial study
- A complete factorial study is sometimes not feasible (many factors and levels).
- A fractional factorial study: data are collected for only some of the combinations that would make up a complete factorial study.

Measurement
- A measurement (or measuring method) is called valid if it usefully or appropriately represents the feature of an object or system that is of engineering importance.
- A measurement is called accurate (or unbiased) if on average it produces the true or correct value of the quantity being measured.
- A measurement system is called precise if it produces small variation in repeated measurements of the same subject.

In-class example 1.5

Semiconductor wafers from a certain company have a defective rate of about 8%. You have just ordered a batch of 10 wafers and want to predict how many will be defective.
- What is the sample, and what is the sample size?
- What is the population? What type of population is this?
- Is the 8% given in the description a parameter, a statistic, a variable, or something else?
- Would you use inferential statistics or deductive statistics to predict the number of defective wafers among the 10?

Sampling

The goal in sampling is to obtain individuals in such a way that accurate information may be obtained about the population. Here we discuss a basic sampling technique that has certain good properties.

A simple random sample of size n from a population is a sample selected in such a manner that every collection of n items in the population is a priori equally likely to compose the sample. This only works for enumerative studies.

Simple random
sampling (SRS)

SRS: WHY
- Random sampling avoids selection bias. E.g., suppose I am producing a drug and want to show that it has good effects: I could select the healthier or younger patients as the group to take my drug.
- We can quantify bias and the general effects of sampling.
- It does not guarantee a good or representative sample every time (we can get all small values or all large values); it only guarantees long-run behavior.

SRS: HOW
- Every group of n distinct units out of the N in the population has an equal chance of being selected.
- Consequence: every unit in the population has an equal chance of being selected to be in the sample.
- Paradigm: drawing names out of a hat.
- In practice we use computer-generated random numbers.

SRS: Example

Draw a sample of 2 from a population of 7 students with population mean weight 150.

Data (weights): 100 110 120 130 150 200 240

Note that the variance of the weights in the population is another population parameter.

Take the 21 samples of size 2 and compute the average for each. Writing each sample as a pair of student numbers:

(1,2) (1,3) (1,4) (1,5) (1,6) (1,7) (2,3) (2,4) (2,5) (2,6) (2,7) (3,4) (3,5) (3,6) (3,7) (4,5) (4,6) (4,7) (5,6) (5,7) (6,7)

SRS: Example (cont.)

Notice the variation in the sample averages. Though the actual average weight is 150, if we take sample (1,2) the average weight is 105; if we take sample (6,7) the average is 220.

Why SRS

Why select an SRS of 5 students to represent the class?
- Bias in selecting students in the front or the back.
- Bias in selecting students whose names I know.
- Potential bias in a sample of convenience.

SRS: why not
- Not always feasible, e.g., what if the population is conceptual?
- Example: take an SRS of people who run at a particular track. How do we randomize the selection?

Convenience sample

Often we take a sample of convenience: individuals who are easily obtained. The most popular convenience sample is one in which the individuals in the sample are self-selected, i.e., they decide themselves to participate in the survey. These are also called voluntary response surveys. Examples include phone-in polling and Internet surveys. This is not a good sampling
design, and thus we should be careful in generalizing the conclusions from such samples to the entire population.

Numerical summaries of continuous variable

Example: Midterm Scores of STAT 101

The following data set contains the midterm exam scores of STAT 101:

74 76 78 88 83 97 87 53 93 95 82 79 79 78 62 80 77 70 60 60 84 95 85 93 79 84 71 85 100 77 72 95 79 83 97 87 73 84 74 83 85 95 62 50 86 83 86 36

Type of variable?

Stemplot
- Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
- Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
- Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Example: Midterm Scores of STAT 101

A stem-and-leaf display is as follows (stem: remaining digits; leaf: last digit):

| Stem | Leaves |
|---|---|
| 3 | 6 |
| 4 | |
| 5 | 03 |
| 6 | 0022 |
| 7 | 012344677889999 |
| 8 | 02333444555667778 |
| 9 | 355557 |
| 10 | 0 |

A Back-to-back Stemplot

Number of home runs hit each season:
- Babe Ruth (New York Yankees, 1920-1934): 54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
- Mark McGwire (St. Louis Cardinals, 1986-2001): 49 32 33 39 22 42 9 9 39 52 58 70 65 32 29

Splitting stems & rounding
- For a moderate number of observations: split each stem into two, one with leaves 0-4 and the other with leaves 5-9. This increases the number of stems and reduces the number of leaves per stem.
- Rounding: if many stems have no leaves or only one leaf, rounding may help. It is convenient to round (or even truncate) the data so that the final digit after rounding is suitable for a leaf. Do this when the data have many digits.

[Figure: stemplots of spending at a supermarket, without and with split stems]

Example: A study on litter size

Data (170 observations): 68565453 577974343 664653966 675456656 W67577585 554787566 95376878 441357446 4767m3688 397577535 776835885 676677799 577559236 626878788 486875448

[Figure: stem-and-leaf plot for pups]

Limitations of Stemplot
- Awkward for large data sets.
- Splitting stems or rounding is not very helpful.

Histogram: how
- Choose intervals (bins) that cover the entire range of the data.
- Count the number of observations per interval.
- Draw rectangles with heights corresponding to the number in each interval. For a relative frequency histogram, the height is the relative frequency of the interval.

Notes:
- Intervals must be non-overlapping. Typically, an observation equal to a boundary value is put in the higher interval.
- Intervals must be contiguous (rectangles touch each other).
- Intervals must be of equal width.
- Often choose nice boundaries.

[Figure: histogram for the study on litter size]

Example

The manager at Wendy's is interested in studying typical arrival patterns during lunch hour. She records the number of arrivals for 40 randomly selected 15-minute intervals over lunch hour and obtains the following data:

75262664 66752286 661 59629 611237568 44475755

Make a frequency and relative frequency table. Draw a histogram.

Example: Call Center Data

Financial firm call center. Calls handled by AVI within 60 seconds: October 666, December 523.

[Figure: AVI service time data; frequency histograms of calling time (6 to 60 seconds) for October and December]

Notes for Making Histogram
- Choose the number of classes sensibly. Too few classes: a skyscraper graph. Too many: a pancake graph.
- Sturges' rule: choose the number of classes $k$ such that $\log_2 n < k < \log_2 n + 1$, where $n$ is the sample size.
- Intervals must be of equal width.
- Areas of the bars are proportional to the frequency.

Examining Distributions: Overall Pattern
Shape. Center (midpoint). Spread (range). Deviations: outliers, i.e. some values that fall outside the overall pattern.

Shapes of Distributions
- Graphs can help to determine shapes.
- Modes: peaks of a distribution. Unimodal: one peak. Bimodal: two peaks.
- Symmetric or skewed.

Shapes of Distributions
- Symmetric: a histogram in which the right half is a mirror image of the left half.
- Skewed to the right: a histogram in which the right tail is more stretched out than the left (long tail to the right).
- Skewed to the left: a histogram in which the left tail is more stretched out than the right (long tail to the left).
- Bell-shaped: a histogram that looks like a bell.

[Figure: Shakespeare's words; percent of Shakespeare's words vs. number of letters in word (1 to 12)]

[Figure: tuition and fees (in 1000s) vs. number of colleges; a bimodal histogram]

[Figures: Shakespeare's words (right skewed); a left-skewed example; Iowa Test of Basic Skills vocabulary scores (number of seventh graders vs. vocabulary score); the study on litter size]

In-class exercise

The volume of a stock is the number of shares traded on a given day. The following data, given in millions (so that 3.78 represents 3,780,000 shares traded), represent the volume of Altria Group stock traded for a random sample of 35 trading days in 2004:

3.78 6.06 5.32 3.04 10.32 3.38 10.96 8.74 5.75 3.25 5.64 3.38 5.53 4.50 4.35 5.34 6.57 5.00 7.25 4.74 7.97 5.02 6.92 7.57 7.16 6.52 9.70 3.01 8.40 6.23 6.07 4.88 4.43 3.56 5.58

a) Construct a frequency distribution of the data using bin widths of size 2.

b) Construct a relative frequency distribution of the data using
bin widths of size 2.

c) Construct a frequency histogram and a relative frequency histogram of the data using bin widths of size 2.

d) On what percentage of the 35 days were at least 6 million shares traded?

e) Describe the shape of the distribution.

Measuring center: the mean $\bar{x}$

Mean: the average value.

The sample mean $\bar{x}$: if the $n$ observations in a sample are $x_1, x_2, \ldots, x_n$, then their mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Measuring center: the median

Median: the middle value or center point.

The sample median M: the number such that half of the observations are smaller than it and the other half are larger; the midpoint of a distribution.

Procedure to calculate the median M:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If n is even, then M is the mean of the two center observations in the ordered list.

Note: (n + 1)/2 is the location of the median, not the median itself.

Example: Fuel economy (miles per gallon) for 2001 two-seater cars
- The highway mileages of 18 gasoline-powered two-seater cars: 13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30. Mean? Median?
- The highway mileages of 19 two-seater cars: 13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30 68. Mean? Median?

Example: Salary Survey of UNC Graduates

Survey a certain number of graduates from UNC. A lot of departments are surveyed. Question: which department produces students who earn the most on average ten years after they got their degrees? Answer: Geography (Michael Jordan).

Mean vs Median
- Mean: easy to calculate; easy to work with algebraically; highly affected by outliers. Not a resistant measure.
- Median: can be time-consuming to calculate; more resistant to a few extreme observations (sometimes outliers); robust.

Mode
- The most frequent value in the data.
- Important for categorical data.
- It is possible to have more than one mode.

Mean, Median and Mode

If a unimodal distribution is exactly symmetric, the mean, the median, and the mode are exactly the same. If the distribution is
skewed, the three measures differ.

[Figure: relative positions of the mode and the mean in skewed distributions]

Which one to use
- They are different by definition.
- Mean and median are unique and are defined only for quantitative variables.
- The mode is not unique. The mode is also defined for categorical variables.
- The choice depends on the shape of the distribution, the type of data, and the purpose of your study. Skewed: median. Categorical: mode. Total quantity: mean.

Outliers

Observations that lie outside the overall pattern of a distribution. Possible reasons:
- Error in data entry (the most likely reason): equipment failure, human error, missing-value code.
- Extraordinary individuals (Jordan's salary).

Handling Outliers
- Detect them using graphical and numerical methods.
- Check the data to make sure they were entered correctly.
- Reduce the influence of an outlier: delete the observation (BE CAREFUL!), use transformations, or use robust methods.

[Figure: histogram of the speed of light data (frequency vs. time)]

Numerical Summary for Distributions
- Center: mean, median, mode.
- Spread: five-number summary and boxplot, standard deviation.
- Choose at least one from each category.

Take Home Message
- Simple random sampling.
- Examine distributions. Overall pattern: shape (symmetric or skewed? how many modes? bell-shaped?). Outliers.
- Graphical tools for quantitative data: stemplot, histogram, boxplot (next time).
- Mean, median, mode, unimodal, bimodal.
- Read Section 3.1 of Vardeman and Jobe.
- Homework 1 due Friday 16 at 5 PM.

In-class example 1

The purpose of this experiment is to determine the effect of the number of Mentos and the initial volume of Diet Coke on the percent of soda volume lost. A reaction occurs when Mentos are added to a bottle of Diet Coke. Consequently, the Coke erupts out of the top of the bottle, resulting in volume loss. The idea of the experiment is to apply a varying amount of Mentos to a varying initial amount of soda and record the volume lost. In our experiment we include three different sizes of Diet Coke bottles:
20 ounces, 1 liter, and 2 liters. Each experimental unit will be given either four Mentos or eight Mentos. To measure the response variable, the remaining liquid is poured into a measuring utensil and its volume is recorded. This volume is subtracted from the initial volume; the difference represents the volume lost.

Data

| Initial volume | 4 Mentos | 8 Mentos |
|---|---|---|
| 591 mL | 0.565 | 0.57 |
| 591 mL | 0.526 | 0.577 |
| 591 mL | 0.54 | 0.558 |
| 1000 mL | 0.561 | 0.587 |
| 1000 mL | 0.532 | 0.539 |
| 1000 mL | 0.519 | 0.559 |
| 2000 mL | 0.475 | 0.537 |
| 2000 mL | 0.565 | 0.615 |
| 2000 mL | 0.537 | 0.5 |

Questions
- What is the response variable?
- How many factors are there? What are the factors and what are their levels? (Put in "Factor: level 1, level 2, ..." form.)
- How many treatments are there? What are the treatments?
- What are the experimental units?
- Can you think of any lurking variables (variables that are neither factors nor response, but might affect the values of the response)?
- How many replicates are there for each treatment? Why are the responses for the replicates different?

In-class exercise

The amount of flow through a solenoid valve in an automobile's pollution-control system is an important characteristic. An experiment was carried out to study how flow rate depends on three factors: armature length, spring load, and bobbin depth. Two different levels (high and low) of each factor were chosen, and a single observation on
flow rate was made for each combination of levels.

Circle the answer that best describes the response variable in this study:
- a) low and high levels of the factors
- b) armature length, spring load, and bobbin depth
- c) flow rate through the solenoid valve
- d) the automobile's pollution-control system
- e) none of the above

For the same solenoid valve experiment:
- How many treatments are there in this experiment?
- Is this an observational or an experimental study? Explain your answer.
- Circle the best answer describing what type of data the flow rate through the solenoid is: a) qualitative, b) controlled, c) quantitative, d) replicates.

Assume now that the experimenter believes that the type of engine (4-cylinder, 6-cylinder, or 8-cylinder) also has some effect on the flow rate. The experimenter redesigns the study by repeating the entire study for each type of engine.
- What type of variable is engine type? Circle your answer: a) controlled, b) blocking, c) response, d) both a and b are correct, e) none of the above.
- How many treatments are there now?

Another example: Randomized Complete Block Design

A
researcher is carrying out a study of the effectiveness of four different skin creams for the treatment of a certain skin disease. He has eighty subjects and plans to divide them into 4 treatment groups of twenty subjects each. Using a randomized block design, the subjects are assessed and put in blocks of four according to how severe their skin condition is: the four most severe cases are the first block, the next four most severe cases are the second block, and so on to the twentieth block. The four members of each block are then randomly assigned, one to each of the four treatment groups.

Today: We will move on to ANOVA. How to do hypothesis testing on main effects and interactions.

In-class example 2

| Dosage | Simple task | Complex task |
|---|---|---|
| 0 mg | 30 35 31 | 80 85 75 |
| 100 mg | 20 25 30 | 102 92 79 |
| 200 mg | 19 22 22 | 100 95 93 |

Consider a hypothetical experiment on the effects of a stimulant drug on the ability to solve problems. There were three levels of dosage: 0 mg, 100 mg, and 200 mg. A second variable, type of task, has two levels: a simple, well-learned task (naming colors) and a more complex task (finding hidden figures in a complex display). Three replicates were assigned to each of the six treatment groups. The data (times in seconds to complete the task) for each treatment are in the table above.

[Figure: interaction plots of mean completion time by task (simple, complex) and by dosage (0 mg, 100 mg, 200 mg)]

Interaction: adding dosage helps with the simple task but makes things worse with the complex task.

Treatment mean table, with $\bar{y}_{..} + a_i + b_j$ in parentheses

| | Simple task | Complex task | Row mean | Main effect |
|---|---|---|---|---|
| 0 mg | 32 (24.5) | 80 (87.5) | 56 | $a_1 = -1.5$ |
| 100 mg | 25 (26.5) | 91 (89.5) | 58 | $a_2 = 0.5$ |
| 200 mg | 21 (27) | 96 (90) | 58.5 | $a_3 = 1$ |
| Column mean | 26 | 89 | 57.5 | |
| Main effect | $b_1 = -31.5$ | $b_2 = 31.5$ | | |

The numbers in parentheses tell us what to expect if the effects are additive. Note the pattern for the numbers in parentheses. For rows: add 2 from row 1 to row 2, then add 0.5 from row 2 to row 3. For columns: add 63.

Another example of a treatment mean table: no
interaction:

| | Column 1 | Column 2 | Row mean | Main effect $a_i$ |
|---|---|---|---|---|
| Row 1 | 10 | 15 | 12.5 | $a_1 = -2.5$ |
| Row 2 | 15 | 20 | 17.5 | $a_2 = 2.5$ |
| Column mean | 12.5 | 17.5 | $\bar{y}_{..} = 15$ | |
| Main effect $b_j$ | $b_1 = -2.5$ | $b_2 = 2.5$ | | |

If we cover up the row 2, column 2 entry and try to predict it, we would predict 20, based on the increase for row 1. No interaction.

In practice, what would a model with no interaction look like? The numbers might not be this perfect, so there may be an indication of a slight interaction, but that doesn't mean that there really is one. It could be noise; there is variation anyway.

ANOVA

A full factorial experiment: factors A (I levels) and B (J levels), with an equal number K of replications in each treatment group. The response variable is called Y. We index observed values by the level of each factor and by the replicate number: $y_{ijk}$.

Example: Golf ball flight distance on different evenings

[Data table: flight distances for each combination of compression (80, 100) and evening (Evening 1, Evening 2); the individual values are not legible in this copy.]

Use evenings as a factor instead of a blocking variable. We might expect there are compression effects on flight distances. Are there also evening effects? Are compression effects different on different evenings? Namely, is there interaction?

Candidate models:

- Compression effects only: $y_{ijk} = \mu + \alpha_i + \epsilon_{ijk}$
- Compression effects and evening effects: $y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}$
- Compression effects, evening effects, and interaction effects: $y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + \epsilon_{ijk}$

ANalysis Of VAriance (ANOVA)

ANOVA partitions the variability in the values of the response variable into the part due to the factors being manipulated (treatments) and the part due to everything else (experimental error). This partitioning allows us to measure the magnitude of the effect of a certain factor, or of an interaction, on the response relative to the size of the experimental error. If the variation is mainly due to treatments, we would expect the treatment means to be very different relative to the within-group variation.

ANOVA

The TOTAL SUM OF SQUARES (SS Total) is the sum of squared deviations of the observed values from the overall mean, ignoring the factors:

$$SST = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \left( y_{ijk} - \bar{y}_{..} \right)^2$$

This is a measure of total variability among the y values.

Partitioning SST

One useful way is to partition it into:

- Variability due to factor A affecting Y
- Variability due to factor B affecting Y
- Variability due to the interaction of factors A and B
- Experimental error, that is, variability among experimental units that received the same treatment

This partitioning can be called a full factorial model or saturated model:

$$SS_{TOTAL} = SSA + SSB + SSAB + SSE$$

This decomposition holds for a balanced design (same number of replicates for each treatment). It is not true in general for an unbalanced design.

Partitioning SST

The model behind it, with assumptions for the population, is

$$y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + \epsilon_{ijk}$$

and again $SS_{TOTAL} = SSA + SSB + SSAB + SSE$.

Example: Golf ball flight distance on different evenings

Treatment means, with $\bar{y}_{..} + a_i + b_j$ in parentheses, and fitted effects:

| Compression | Evening 1 | Evening 2 | Row mean | Main effect $a_i$ |
|---|---|---|---|---|
| 80 | 188.1 (188.4) | 191.6 (191.3) | 189.85 | $a_1 = 5.4$ |
| 100 | 177.9 (177.6) | 180.2 (180.5) | 179.05 | $a_2 = -5.4$ |
| Column mean | 183.0 | 185.9 | $\bar{y}_{..} = 184.45$ | |
| Main effect $b_j$ | $b_1 = -1.45$ | $b_2 = 1.45$ | | |

SSA = 1166.4, SSB = 84.1, SSAB = 3.6, SSE = 2251.8, SST = 3505.9

ANOVA table

| Source | Sum of squares | df | Mean square | F |
|---|---|---|---|---|
| A | $SSA = KJ \sum_{i=1}^{I} (\bar{y}_{i.} - \bar{y}_{..})^2$ | $I - 1$ | $MSA = SSA/(I-1)$ | $MSA/MSE$ |
| B | $SSB = KI \sum_{j=1}^{J} (\bar{y}_{.j} - \bar{y}_{..})^2$ | $J - 1$ | $MSB = SSB/(J-1)$ | $MSB/MSE$ |
| Interaction | $SSAB = K \sum_{i=1}^{I} \sum_{j=1}^{J} (\bar{y}_{ij} - \bar{y}_{..} - a_i - b_j)^2$ | $(I-1)(J-1)$ | $MSAB = SSAB/((I-1)(J-1))$ | $MSAB/MSE$ |
| Error | $SSE = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (y_{ijk} - \bar{y}_{ij})^2$ | $IJ(K-1)$ | $MSE = SSE/(IJ(K-1))$ | |
| Total | $SST = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (y_{ijk} - \bar{y}_{..})^2$ | $IJK - 1$ | | |

Assignments

Read Sections 8.1.1 and 8.1.2.

Homework 4 is due on Monday, Feb 9, 5pm. Read Sections 8.1.1 and 8.1.2.

Design of experiment

- Design of experiment: factors, treatments, response
- Experimental error
- Variables affecting the response: factors (treatments), controlled variables, lurking variables
- Techniques for handling lurking variables: randomization, blocking
- Reproducibility of the result: replication

Full factorial design

In general we don't have to use ALL possible combinations, or may not be able to. But if ALL the combinations are used, then it is called a Full Factorial Experiment. Example: if Brand A latex, Brand B latex,
and Brand A oil are the only treatments, the experiment is NOT a full factorial experiment. We will focus on full factorial designs in this class.

Assumption

For the analyses that follow, we'll mostly leave behind the issues we've been discussing. Assume that the experiment is well designed in terms of:

- Controlled variables
- Potential lurking variables
- Randomization
- Replication (will be apparent during analysis)

Thus any effects we see are due only to the factors being studied. For the moment we'll deal only with CRDs (completely randomized designs), although a blocking variable can essentially be treated as another factor in this type of analysis.

Why we need more tools

The overall statistics (e.g., mean and standard deviation) don't really tell us about the treatment effect. Why not use side-by-side boxplots? Not enough replications and too many treatments. Use a table of treatment means to start to get a sense of the treatment effect. Qualitatively, if the treatment means are different, there appears to be a treatment effect. The interesting thing about factorial data is to investigate the effects of each factor separately and to investigate whether they act independently of each other.

Factorial data analysis: 2 factors, A (I levels) and B (J levels)

- $\bar{y}_{ij}$: the sample mean response when factor A is at level i and factor B is at level j
- $\bar{y}_{i.} = \frac{1}{J} \sum_{j=1}^{J} \bar{y}_{ij}$: the average sample mean when factor A is at level i
- $\bar{y}_{.j} = \frac{1}{I} \sum_{i=1}^{I} \bar{y}_{ij}$: the average sample mean when factor B is at level j
- $\bar{y}_{..} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \bar{y}_{ij}$: the grand average sample mean

Make a table, then an interaction plot.

Interaction plots

Graphical representation of treatment means:

- Choose one factor for the x-axis (it doesn't matter which one).
- On the y-axis are the treatment means.
- Plot the treatment means.
- Connect the treatment means that have the same level of the OTHER factor.

[Figure: interaction plots for initial volume and Mentos. One panel has initial volume (591, 1000, 2000) on the x-axis with traces for Mentos 4 and Mentos 8; the other has Mentos (4, 8) on the x-axis with traces for each initial volume. The y-axis ticks run from 0.53 to 0.57.]

Interaction plots

What do we look for in an interaction plot? Effects of
each factor, and interaction.

Each factor (qualitative): the distance between the traces represents the effect of the factor NOT on the x-axis. The other factor's effect is represented by the change in height across the trace. This is not so easy to see, so we often make both interaction plots. Which factor seems to have a bigger effect overall? (Remember, this is qualitative.) How do we describe this numerically? Fitted effects.

Fitted effects

Fitted simple effects: the difference between a cell mean $\bar{y}_{ij}$ and the overall mean, $\bar{y}_{ij} - \bar{y}_{..}$.

Fitted main effects: compare the row and column means to the overall average to get the effect of a level on the response. The fitted main effect for factor A at its ith level is $a_i = \bar{y}_{i.} - \bar{y}_{..}$, and for factor B at its jth level it is $b_j = \bar{y}_{.j} - \bar{y}_{..}$. Note that these add to zero for a balanced design (same number of replicates for each treatment).

Interaction

If the size or direction of the effect of one factor is different for different levels of the other factor, then there is interaction. If not, then there is no interaction; thus you can talk about the effect of one factor without mentioning the other factor. Graphically, interaction can be detected through the parallelism, or lack thereof, of the traces. With no interaction,

$$\bar{y}_{ij} = \bar{y}_{..} + a_i + b_j$$

so the effect of changing levels of factor A is exactly the same for each level of factor B; thus the lines are parallel. If the lines are not parallel, then there is interaction. We can compute interactions:

$$ab_{ij} = \bar{y}_{ij} - (\bar{y}_{..} + a_i + b_j)$$

Example

The purpose of this experiment is to determine the effect of the number of Mentos and the initial volume of Diet Coke on the percent of soda volume lost. A reaction occurs when Mentos are added to a bottle of Diet Coke. Consequently, the Coke erupts out of the top of the bottle, resulting in volume loss. The idea of the experiment is to apply a varying amount of Mentos to a varying initial amount of soda and record the volume lost. In our experiment we
include three different sizes of Diet Coke bottles: 20 ounces (591 mL), 1 liter, and 2 liters. Each experimental unit will be given either four Mentos or eight Mentos. To measure the response variable, the remaining liquid is poured into a measuring utensil and the volume is recorded. This volume is subtracted from the initial volume; the difference represents the volume lost.

| Mentos | 591 mL | 1000 mL | 2000 mL |
|---|---|---|---|
| 4 | 0.565, 0.526, 0.54 | 0.561, 0.532, 0.519 | 0.475, 0.565, 0.537 |
| 8 | 0.57, 0.577, 0.558 | 0.587, 0.539, 0.559 | 0.537, 0.615, 0.5 |
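
The fitted main effects and interactions defined in these notes can be computed directly from the Mentos data. The following is a minimal sketch, assuming NumPy is available. The function name `fitted_effects` and the array layout are illustrative choices, not part of the notes. The values are transcribed from the data table with decimal points restored, and the second row is assumed to be the eight-Mentos group because its label is missing in this copy.

```python
import numpy as np

# Proportion of volume lost, transcribed from the Mentos data table.
# Assumption: the second block of nine values is the Mentos 8 group.
# Axis 0: Mentos (4, 8); axis 1: initial volume (591, 1000, 2000 mL);
# axis 2: the three replicates.
y = np.array([
    [[0.565, 0.526, 0.540], [0.561, 0.532, 0.519], [0.475, 0.565, 0.537]],
    [[0.570, 0.577, 0.558], [0.587, 0.539, 0.559], [0.537, 0.615, 0.500]],
])


def fitted_effects(y):
    """Return cell means, grand mean, main effects a_i and b_j, and interactions ab_ij."""
    cell_means = y.mean(axis=2)            # ybar_ij
    grand = cell_means.mean()              # ybar_..
    a = cell_means.mean(axis=1) - grand    # a_i = ybar_i. - ybar_..
    b = cell_means.mean(axis=0) - grand    # b_j = ybar_.j - ybar_..
    # ab_ij = ybar_ij - (ybar_.. + a_i + b_j)
    ab = cell_means - (grand + a[:, None] + b[None, :])
    return cell_means, grand, a, b, ab


cell_means, grand, a, b, ab = fitted_effects(y)
print("treatment means:\n", cell_means.round(4))
print("grand mean:", round(float(grand), 4))
print("a (Mentos 4, 8):", a.round(4))
print("b (591, 1000, 2000 mL):", b.round(4))
print("interactions ab_ij:\n", ab.round(4))
```

Because the design is balanced, the fitted main effects sum to zero and the interactions sum to zero across every row and every column, which is a quick check on the arithmetic. Interactions that are small relative to the main effects correspond to roughly parallel traces in the interaction plot.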

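The sum-of-squares partition from the ANOVA lecture can be checked numerically on the drug dosage data of in-class example 2. This is a sketch under stated assumptions: NumPy is available, the design is balanced, and the function name `two_way_anova` and the dictionary keys are hypothetical names chosen for illustration. It computes only the sums of squares, degrees of freedom, mean squares, and F ratios from the ANOVA table; it does not compute p-values.

```python
import numpy as np

# Times in seconds from in-class example 2.
# Axis 0: dosage (0, 100, 200 mg); axis 1: task (simple, complex);
# axis 2: the three replicates.
y = np.array([
    [[30, 35, 31], [80, 85, 75]],
    [[20, 25, 30], [102, 92, 79]],
    [[19, 22, 22], [100, 95, 93]],
], dtype=float)


def two_way_anova(y):
    """Balanced two-way ANOVA quantities for an array of shape (I, J, K)."""
    I, J, K = y.shape
    cell = y.mean(axis=2)                  # ybar_ij
    grand = y.mean()                       # overall mean
    a = cell.mean(axis=1) - grand          # fitted main effects of A
    b = cell.mean(axis=0) - grand          # fitted main effects of B
    ab = cell - (grand + a[:, None] + b[None, :])  # fitted interactions
    ss = {
        "A": K * J * np.sum(a ** 2),
        "B": K * I * np.sum(b ** 2),
        "AB": K * np.sum(ab ** 2),
        "E": np.sum((y - cell[:, :, None]) ** 2),
        "T": np.sum((y - grand) ** 2),
    }
    df = {"A": I - 1, "B": J - 1, "AB": (I - 1) * (J - 1),
          "E": I * J * (K - 1), "T": I * J * K - 1}
    ms = {key: ss[key] / df[key] for key in ("A", "B", "AB", "E")}
    f = {key: ms[key] / ms["E"] for key in ("A", "B", "AB")}
    return ss, df, ms, f


ss, df, ms, f = two_way_anova(y)
for key in ("A", "B", "AB", "E", "T"):
    print(key, "SS =", round(float(ss[key]), 2), "df =", df[key])
print("F ratios:", {key: round(float(val), 2) for key, val in f.items()})
```

For these data the pieces are SSA = 21, SSB = 17860.5, SSAB = 567, and SSE = 412, which add up to SST = 18860.5, as the balanced-design decomposition requires.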