

by: Clyde McLaughlin II

StatisticalInferenceI STAT3502


This 0 page Class Notes was uploaded by Clyde McLaughlin II on Monday November 2, 2015. The Class Notes belongs to STAT3502 at California State University - East Bay taught by Staff in Fall. Since its upload, it has received 11 views. For similar materials see /class/234384/stat3502-california-state-university-east-bay in Statistics at California State University - East Bay.



Date Created: 11/02/15
Hayward Statistics

Confidence Intervals and Tests of Hypothesis: A Minitab Demonstration

1 The Data

The data for this demonstration are taken from Ott, Introduction to Statistical Methods and Data Analysis, 4th ed., Duxbury, Problem 5.107, page 256. We are told there that the observations are the proportions of patients with a particular kind of insurance coverage at 40 hospitals selected at random from a particular population of hospitals. The main issue here is to make inferences about the proportion of patients with this kind of insurance in the population of hospitals from which our sample of 40 was taken.

Although this demonstration can be read without using Minitab, we recommend that you use your browser to print it out and follow through the steps using Minitab as you read.

2 Putting the Data Into a Minitab Worksheet

1.1 If you are working in the CSU Hayward Statistics Lab: Start Minitab, select the Session window, and in Files open the Minitab worksheet CHOS 10 7 MTW located on the server in I COURSWRKSTAT3 5 02 OTTDAT. Check the worksheet to make sure you have retrieved the correct data. The data are as follows:

    0.67 0.74 0.68 0.63 0.91 0.81 0.79 0.73 0.82 0.93
    0.92 0.59 0.90 0.75 0.76 0.88 0.85 0.90 0.77 0.51
    0.67 0.67 0.92 0.72 0.69 0.73 0.71 0.76 0.84 0.74
    0.54 0.79 0.71 0.75 0.70 0.82 0.93 0.83 0.58 0.84

1.2 If you do not have access to the server: Open Minitab, go to the Data window, and type the 40 observations into C1 of the worksheet, proofreading carefully. Alternatively, once you have opened Minitab, you can switch to your browser and highlight the data, then cut and paste into row 1, col 1 of a blank Minitab worksheet (spaces as delimiters), and finally "stack" all of the data into C1 (menu path MANIP > Stack). This method will not preserve the order of the observations, but their order is unimportant in this situation.

2 Descriptive Statistics

    MTB > desc C1
    MTB > dotp c1

2.1 Numerical Summary

Before doing any formal inference, it is always a good idea to use descriptive methods to understand the data. Here we begin
by finding some numerical descriptive statistics:

         N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
        40   0.7620   0.7550   0.7658   0.1086   0.0172

       MIN      MAX       Q1       Q3
    0.5100   0.9300   0.6925   0.8400

Questions:
• The mean and the median are very nearly equal. What clue does this give you about the possible skewness of the sample?
• The mean is very nearly equal to the trimmed mean (mean of the middle 90% of the observations). What clue does this give you about the possible presence of outliers?
• Make a boxplot of these data (command boxp) and describe what you see. What five numbers from above are needed for making the boxplot?

The values that will prove crucial for formal inference below are the sample size (40), the sample mean (0.7620), and the sample standard deviation (0.1086). The estimated standard error of the mean is also useful. Show how it is computed from the sample size and the sample standard deviation.

2.2 Dotplot

Next, make a dotplot, as shown below.

    [Dotplot of C1]

The data appear to be nearly, but perhaps not exactly, normal. In any case they are not severely skewed and there are no outliers. So with a sample size as large as 40, t procedures for inference are OK.

Questions:
• From the dotplot, try to judge what interval of values of the population mean μ might be believable, considering the inevitable sampling error. Notice that the plot seems to "balance" at about 0.76, the sample mean.
• In particular, which of the following values are believable values of the population mean μ: 0.64, 0.70, 0.73, 0.75, 0.88?

3 Confidence Intervals

3.1 A 90% confidence interval with known population standard deviation

Assume that the population standard deviation is known to be σ = 0.1. In most real-life applications σ is not known. Here is a scenario in which it would be reasonable to assume that the population standard deviation is known to be σ = 0.1: For several past years a state agency has collected data for all of the hospitals in the population. The population standard deviation for each of these past years was computed and has held steady at about σ = 0.1. Furthermore, we know of no
reason that the population dispersion should have changed this year.

A Minitab printout of the required confidence interval is shown below. The command Zint stands for a confidence interval based on the standard normal distribution, which is often represented by the letter Z.

    MTB > Zint 90 .1 C1

    THE ASSUMED SIGMA = 0.100

             N     MEAN    STDEV  SE MEAN    90.0 PERCENT C.I.
    C1      40   0.7620   0.1086   0.0158   (0.7360, 0.7880)

Things to notice:
• The command line or the menu dialog box must say what confidence level is desired. Do not type a percent sign.
• The command line or the menu dialog box must also contain the known value of the population standard deviation.

Notice that the value for SE MEAN in this output differs from the estimated standard error of the mean shown among the descriptive statistics. Here SE MEAN is computed using the known population standard deviation σ.

Questions:
• For a standard normal random variable Z, what cutoff value c gives P(Z < c) = 0.05? Use tables of the standard normal distribution. Alternatively, use the Minitab command invcdf or the Minitab menu path CALC > Probability > Normal. Preferably, use both methods and compare the results.
• What is the formula for the margin of error in this confidence interval? (Hint: It involves the cutoff value of the previous question and SE MEAN.) Give the numerical value resulting from this formula.
• What is the total length of the confidence interval just computed? How is the total length related to the value of the margin of error?
• How is the margin of error in the previous question used to find the endpoints of the 90% confidence interval shown in the printout above?
• Recall that we are computing a 90% confidence interval here. Why was the cutoff value c in the first question of this group calculated using 5% instead of 100% - 90% = 10%?

3.2 A 90% confidence interval for the usual case where the population standard deviation is NOT known

Here the confidence interval is based on the t-distribution with 39 degrees of freedom. The command name tint reflects the use of the t-distribution. Of course, the command line does not include the value of the population standard deviation because it is unknown.

    MTB > tint 90 C1

             N     MEAN    STDEV  SE MEAN    90.0 PERCENT C.I.
    C1      40   0.7620   0.1086   0.0172   (0.7331, 0.7909)

Questions:
• For a random variable T distributed according to the t-distribution with 39 degrees of freedom, what cutoff value c gives P(T < c) = 0.05? Use tables of the t-distribution for an approximate value. Alternatively, for an exact value, use the Minitab command invcdf or the Minitab menu path CALC > Probability > t.
• Why does the value 0.0172 of SE MEAN given just above differ from the value 0.0158 given in the previous printout?
• What is the formula for the margin of error in this confidence interval? (Hint: It involves the cutoff value of T from above and SE MEAN.)
• How is the margin of error in the previous question used to find the endpoints of the 90% t-confidence interval shown in the printout above?

3.3 A 95% confidence interval based on the t-distribution

Because 95% is the default confidence level for Minitab, the command does not require you to state that the confidence level is 95%. But it does no damage to include 95 in the command line (try it both ways). Notice that the increase in confidence level from 90% to 95% increases the margin of error and thus makes the interval longer.

    MTB > tint c1

             N     MEAN    STDEV  SE MEAN    95.0 PERCENT C.I.
    C1      40   0.7620   0.1086   0.0172   (0.7273, 0.7967)

Questions:
• What was the total length of the 90% t-confidence interval computed previously? What is the total length of this 95% confidence interval? Give an intuitive explanation why the 95% CI must be longer.
• What cutoff value for the t-distribution did Minitab use to compute this 95% confidence interval?
• Suppose you are wondering whether μ = 0.74 is a "believable" value for the population mean. Suppose also that the population standard deviation σ is unknown. What does the 90% confidence interval have to say about this issue? Adjusting your concept of "believable" somewhat, what does the 95% confidence interval you
just found have to say?
• Answer the previous question again, but this time you are wondering whether μ = 0.73 is a believable value.

3.4 Which distribution do I use for a confidence interval when estimating the population mean?

The correct choice between the standard normal distribution and the t-distribution is very easily made:
• If the population standard deviation is known, always use the standard normal distribution.
• If the population standard deviation is NOT known, always use the t-distribution.

If the sample size is larger than 30, the cutoff values for 95% confidence intervals are about the same regardless of whether the standard normal or the t-distribution is used. And the same can be said for 90% or 99% confidence intervals. Before statisticians knew about the t-distribution, it was common to use the standard normal distribution to get approximate confidence intervals when σ was unknown. However, the t-distribution always gives more accurate results when σ is unknown.

Notice that there is never any doubt which distribution is correct when you are using Minitab: Zint requires you to input a known value of σ; tint does not.

This same distinction, depending on whether or not σ is known, also holds for tests of hypothesis, which we consider next.

4 Hypothesis Testing

4.1 Tests With Two-Sided Alternatives

Now suppose that comprehensive data from all relevant hospitals last year showed μ = 0.73. Since then there have been many changes in the health insurance industry. Using the random sample of 40, we want to test whether the population mean has changed this year. In this situation:
• The null hypothesis is that μ = 0.73 (no change from last year).
• The alternative hypothesis is that μ is not equal to 0.73 (some change).

Because we have comprehensive data, we could compute last year's population standard deviation σ, but suppose we are concerned that σ may have changed also, and so we are not willing to use last year's σ for our test of hypothesis. Thus we regard σ as unknown.

Notice that the statement
of the null hypothesis involves the population mean, not the sample mean. We already know that the sample mean of our data is 0.7620. Because the sample mean is a good estimate of the population mean, we would not expect the population mean to be "very much" different from 0.7620. The question is whether the sample mean and the hypothetical value 0.73 of the population mean are "significantly different." How big a difference is "significant" (that is to say, more different than we would expect on the basis of sampling error)?

In this situation a one-sample t-test is the appropriate statistical procedure to judge whether the sample mean differs significantly from the hypothetical value 0.73 of the population mean. The results of this test are shown in the Minitab printout below.

    MTB > ttest .73 C1

    TEST OF MU = 0.7300 VS MU N.E. 0.7300

             N     MEAN    STDEV  SE MEAN       T  P VALUE
    C1      40   0.7620   0.1086   0.0172    1.86    0.070

Some notes on Minitab:
• You must specify the hypothetical value of μ in the command or in the menu dialog box. Otherwise Minitab doesn't "know" what value you have in mind. (Minitab's default hypothetical value is 0, which would make no sense at all in this problem.)
• Notice that you do not need to specify a fixed significance level for your test when using Minitab. This is because Minitab prints a P VALUE, from which you can tell whether or not to reject the null hypothesis at any significance level of your choosing. (More on this below.)
• Minitab does not print Greek letters: MU stands for μ.
• In printouts, some versions of Minitab use VS (abbreviation for Latin versus, meaning against) to indicate that a statement of the alternative hypothesis is coming next.
• In printouts, some versions of Minitab use the abbreviations N.E. (not equal), L.T. (less than), and G.T. (greater than) in specifying alternative hypotheses.

Questions:
• Find the formula for the t-statistic in your text. Plug in the values of the sample size, sample mean, and sample standard deviation from this Minitab printout and verify the value
of T given in the printout.
• For a 10% fixed significance level, find the critical values of the t-distribution with 39 degrees of freedom. (Critical values are those that separate the rejection regions in the tails of the distribution from the nonrejection region, or "acceptance" region, in the center.) At the 10% level, do you reject the null hypothesis? At the 5% level?
• Notice that the formula for the test statistic T involves the difference between the sample mean and the hypothetical value of the population mean. But it also involves the sample size and a measure of dispersion; explain intuitively why. Hint: What if you converted the data from proportions to percents (67, 74, and so on)? Then the difference would be 100 times as big, but would the difference be "100 times as convincing"? Returning to proportions, what if you had only 4 observations with a sample mean of 0.762? Would the difference be as convincing as with 40 observations?

4.2 Using P-values

First, here is how to interpret a P-value. Suppose you have a fixed significance level α in mind for your test of hypothesis.
• If the P-value is less than or equal to α, then REJECT the null hypothesis.
• If the P-value is greater than α, then DO NOT REJECT the null hypothesis.

The smaller the P-value, the stronger the evidence against the null hypothesis.

Example: In the above printout the P-value is 0.07. This means that the null hypothesis cannot be rejected at the 5% significance level. However, it would be rejected at the 10% level.

Second, here is how to compute a P-value. Assuming the null hypothesis to be true, the P-value is the probability of observing, by chance alone, a more extreme value of the test statistic than the one actually obtained. (Some books use the terminology "observed significance level" to mean the same thing as P-value. This terminology seems not to be used very much any more.)

Sample computation: In the above printout the test statistic has the value T = 1.86. It is distributed according to the t-distribution with 39 df. Values larger than 1.86 or
smaller than -1.86 are judged to be "more extreme" (farther from the center, 0, of the distribution) than 1.86. Thus the P-value is

    P(T > 1.86) + P(T < -1.86) = 2 P(T > 1.86) = 0.07.

Questions:
• For our data, would the null hypothesis that μ = 0.73 be rejected at the fixed significance level α = 1%?
• Generally speaking, tables of the t-distribution that appear in textbooks are not adequate to find exact P-values because they provide cutoff values of t-distributions corresponding only to a very few probabilities (for example, perhaps .1, .05, .025, .01, .005, .001). Furthermore, tables do not provide information for all degrees of freedom (for example, skipping from 30 df to 40 df to 60 df, etc.). Can you figure out how to use the t-tables in your text to "bracket" the P-value for the present test of hypothesis?
  Typical answer: 39 df is between 30 df and 40 df. In either case (30 df or 40 df), T = 1.86 lies between the one-tailed cutoffs for 5% and 2.5%. Thus the P-value is between 2(5%) = 10% and 2(2.5%) = 5%. These values bracket the true P-value, which is 7%.
• Use Minitab's probability calculation capabilities to find the P-value corresponding to T = 1.86 for a two-sided test, 39 df. Hint: It is probably easiest to use the menu path CALC > Probability > t.

4.3 Connection of two-sided test with confidence interval

The value μ = 0.73 is contained in the 95% CI. Thus 0.73 is a "believable" value of μ with α = 5%. Logically enough, using the same data, the test of hypothesis "accepts" 0.73 as a believable value of μ at the 5% level. Similar connections exist between 90% confidence intervals and tests with α = 10%, and between 99% confidence intervals and tests with α = 1%.

Caution: Below we consider tests with one-sided alternatives. Minitab's confidence intervals for a population mean are fundamentally two-sided, based on the sample mean plus or minus a margin of error. The connection stated above between confidence intervals and tests of hypothesis holds for two-sided alternatives only. Do not try to use it for tests with one-sided alternatives.

4.4 One-sided alternative

Now
suppose we have not seen any data yet. But we know that the insurance carrier in question has conducted a large public relations campaign and has increased coverage while keeping prices about the same. If there has been any change in the choice of this coverage, we are "sure" it must have been an increase. In this situation the null hypothesis is still that μ = 0.73, but here the sensible alternative is that μ > 0.73.

To perform the "one-sided test" (also called a "one-tailed test") you must select this option in menus or use a subcommand (ALTERN 1; for a left-sided test, ALTERN -1).

    MTB > ttest .73 C1;
    SUBC> altern 1.

    TEST OF MU = 0.7300 VS MU G.T. 0.7300

             N     MEAN    STDEV  SE MEAN       T  P VALUE
    C1      40   0.7620   0.1086   0.0172    1.86    0.035

Notice that the P-value here is half what it was for the two-sided test. Thus, at the 5% level, we REJECT the null hypothesis μ = 0.73 against the right-sided alternative. If we are working at a fixed significance level of 5%, the choice between a one-sided and a two-sided alternative makes the difference whether or not to reject the null hypothesis.

In some cases the choice whether to use a one-sided or a two-sided alternative can be controversial; the controversy centers sometimes on the reliability of background information, sometimes on the purpose of the experiment, and sometimes on philosophical issues about hypothesis testing. Our view is that two-sided alternatives should be used unless there is very strong reason to support the use of a one-sided alternative. In any case, there is no controversy about the following two statements:
• The choice whether to use a one- or two-sided alternative must always be made before the data are collected. After one sees the data, it would be too easy to invent a rationale why the data "had" to fall in the direction they did. Electing a one-sided alternative after seeing the data is deceptive and unethical. For example, it may amount to claiming a fixed significance level of only 5% when the real significance level is 10%.
• If a one-sided alternative is
chosen and the data should turn out, embarrassingly, to be in the opposite direction from the selected alternative, then the null hypothesis cannot be rejected, no matter how extreme the result. For example, when one chooses a right-tailed alternative, one is declaring in advance of data collection that values of the test statistic in the left tail will be attributed to faulty procedure or bad luck.

Questions:
• For the present data, the null hypothesis that μ = 0.73, and a left-tailed alternative, what is the P-value? What is your conclusion at the 5% level?
• For the present data, the null hypothesis that μ = 0.73, a right-tailed alternative, and the assumption that σ = 0.1, what is the P-value? What is the conclusion at the 5% level? Hint: The command is Ztest, and the known value of the population standard deviation must be specified on the command line; alternatively, use menus, or do the computation by hand and use tables of the standard normal distribution.

Additional Things to Try

• Is a confidence interval based on the standard normal distribution always shorter than a confidence interval (same confidence level) based on the t-distribution? Hint: Remember that the assumed value of σ is part of the computations. What if the sample standard deviation is quite different from the assumed population standard deviation? For these data, what happens to the z-interval if we assume σ = 0.15?
• Even though our data appear to be nearly normal, some statisticians may prefer to do a "nonparametric" test, which does not depend at all on the assumption that the data are normal. One such test is a Sign test. Here is how to apply it to the present situation. The null hypothesis is that the population median is 0.73. Use the two-sided alternative. Each observation greater than 0.73 is called a PLUS; each observation smaller than 0.73 is called a MINUS. This test is only for continuous data; if, due to rounding, any observation is recorded as 0.73, it is thrown out and the sample size is reduced accordingly. For our data, how many
PLUSes and how many MINUSes are there? If the null hypothesis is true, then PLUS and MINUS each have probability 1/2 (due to the definition of the median). Then the sign test becomes a test of whether the true probability of PLUS is really 1/2, just like judging whether a coin is fair. Use the normal approximation to the binomial: Assuming that the probability of PLUS is 1/2, standardize the observed number of PLUSes. (Remember to use the adjusted sample size when you find the mean and standard deviation of the relevant binomial distribution.) If this Z-score is less than -1.96 or greater than 1.96, then reject the null hypothesis against the two-sided alternative at the 5% level of significance. What is the P-value of this sign test? Again use the normal approximation, and compare your result with the Minitab printout from the procedure obtained via the menu path STAT > Nonparametrics > 1-Sample Sign.
  Answer: 14 MINUSes, 24 PLUSes, 2 eliminated. The P-value is about 14%, not even significant at the 10% level. Because this test looks only at PLUSes and MINUSes, it loses some of the information in the data and thus is less powerful (less able to detect when the null hypothesis is false).

Copyright 1999 by Bruce E. Trumbo. All rights reserved. Intended for use at California State University, Hayward. Other prospective users please contact the author for permission.

Hayward Statistics

Introduction to Minitab Demonstrations

What You Need to Get Started

This introduction is mainly for students who are working on their own and who have no previous experience with Minitab or with the network in the CSUH School of Science Computer Lab. If the instructor of your class has given you an orientation session, or if you have used Minitab in the Science computer lab before, you may be able to skip parts of this introduction. In that case, just remember that useful information is here in case you need help.

Your location, computer equipment, and personal preferences will determine how you use these demonstrations. Even though you may
get some benefit just from browsing through the notes for each part, they are meant to be used interactively on a computer that is running Minitab statistical software. You should try each procedure on the computer as you read about it. That way you will begin to get the feel of hands-on, interactive data exploration.

In order to use these notes as intended you will need:

• Parts 1-6 of these notes from the CSU Hayward Statistics website.
    • You may wish to make a printed copy of these notes using your web browser so that you can write comments on the paper pages as you go along. The notes are protected by copyright, but making printouts for any noncommercial educational purpose is hereby authorized.
    • If you have a system with enough speed and memory to run programs in two windows at once, you may wish to read the notes in one window on your screen and to work with Minitab in another. Maximize the web window to read about a procedure, then maximize the Minitab window to try the procedure for yourself, then go back to the web window again, and so on.

• Access to the Data. The datasets have been prepared in "Minitab worksheet" format so that they can be loaded instantly for use with Minitab.
    • At present, the worksheets are available only on the Cal State Hayward campus from the School of Science network. They are available on server drive I: (public files) in the path I:\COURSWRK\STAT\BTRUMBO\MINDAT. You need the Minitab worksheet MINDEMO.MTW (DOS or Windows Minitab, Release 7 or higher) or MINDEMO.MTP ("portable" format for all versions of Minitab with sufficient capacity).
    • For the benefit of those using machines with limited memory, student versions of Minitab, or Minitab releases limited to 50 columns of data, the data have been broken out into separate worksheets for each part (MINDEMO1.MTW, MINDEMO2.MTW, and so on). Column numbers remain the same as in the combined worksheet, except for Part 6. Portable versions with extensions .MTP are also available. An ASCII text
printout of the data is provided. With some editing, each dataset could be extracted from this file for use in almost any statistical software package.
    • Eventually, we may be able to provide the "portable" version of the worksheet over the internet. If so, these instructions will be changed accordingly.

• Minitab software.
    • Minitab available on campus: Some version of Minitab is available in almost every computer lab on campus. The School of Science Computer Lab uses Release 11 for Windows. You are free to use Minitab in campus computer labs wherever it has been installed, but you may not copy Minitab software from university computers for use off campus. (Lab administrators have made this very difficult to do, the installation you would copy probably won't work on a different computer, and it is seriously illegal.)
    • Purchasing Minitab software: The Pioneer Bookstore on the Cal State Hayward campus has very good prices on Minitab software due to a special arrangement with Minitab, Inc. These are the full versions of Minitab used in businesses and by professional statisticians, often at prices below those you will see elsewhere for stripped-down student versions. We do not recommend the student versions.
    • DOS Releases: These notes were originally prepared using Minitab Release 7 for DOS, but at this elementary level almost everything should work even in older versions of Minitab. Read the Instructions for DOS users.
    • Windows Releases: Windows menu selections are shown for each procedure you are asked to do. Windows releases have the capability to make high-resolution graphics versions of many of the displays we show in these notes. We have purposely shown "character graphics" for the benefit of users who lack the hardware, software, bandwidth, or patience required to download graphics files over the web. (Character graphics are pictures composed entirely of text symbols.) Read the Instructions for Windows users.
    • Macintosh releases: Macintosh menu selections may differ somewhat from those shown here for
Windows. You will need to use the portable version (MINDEMO.MTP) of the worksheet. We have not tested these notes on Macintosh releases of Minitab; you are pretty much on your own.

Part 1: IQ Scores

Setup

In this demonstration you will use the Minitab worksheet MINDEMO.MTW. You need to retrieve that worksheet from disk so that it is ready for use within Minitab. This worksheet contains the data for Parts 1-6 of these notes. (Where memory limitations are a concern, use MINDEMO1.MTW here and retrieve other numbered worksheets for other parts.) In this demonstration you will use the following columns, the contents of which will be explained as we go along:
• c2 'Origl IQ'
• c3 'Final IQ'

The Data

The data for this demonstration are IQ scores of 250 high school students in the San Francisco Bay Area, collected for a master's thesis in Educational Psychology at CSU Hayward.

Exploration of the Data: Dotplots

The dotplot is one of the simplest graphical devices. Each observation is represented by a dot appropriately placed along a horizontal axis. If several observations have the same or nearly the same value, they are stacked vertically.

In Minitab you might make a dotplot in either of two ways. First, you may type the command DOTPlot followed by the column identifier (here c2 or 'Origl IQ'). Minitab does not distinguish between capital and small letters in commands. We capitalize the first four letters here to signify that they are the only ones required. If a command name has more than four letters, you need to type only the first four letters, but you may type the entire command name if you like.

Alternatively, in Windows versions of Minitab, you may select the menu path GRAPH > Character > Dotplot and then c2 'Origl IQ'. In these notes, the menu path for Windows is shown at the beginning of each display, followed by the corresponding command.

    GRAPH > Character > Dotplot
    MTB > dotp 'Origl IQ'

    [Character dotplot of Origl IQ]

From this dotplot of the data we see that most of the IQ scores are between 70 and 130, with a few outside this
interval on both sides. However, the striking thing is the extreme IQ score of almost 200. From what we know about IQ scores, this is probably an error.

Boxplots

The boxplot of a dataset is based on the "five-number summary" of the observations. From smallest to largest, these five numbers are:
• The minimum
• The lower quartile (lower end of box)
• The median (symbol + within box)
• The upper quartile (upper end of box), and
• The maximum.

Notice that the "middle half" of the observations fall within the box of the boxplot.

An outlier is a value that falls relatively far away from the rest of the values in a dataset. A Minitab boxplot signals probable outliers with the symbol O and possible ones with *.

    GRAPH > Character > Boxplot
    MTB > boxp 'Origl IQ'

    [Character boxplot of Origl IQ; axis from 60 to 210 in steps of 30, extreme value marked O]

Note: The menu path GRAPH > Boxplot (without X variable) gives a pixel-graphic boxplot, which runs vertically instead of horizontally but gives the same information as the one shown here.

The boxplot explicitly highlights the extreme value and labels it as a probable outlier. (The symbol * indicates a "possible" outlier; here not an error, just a very bright student.)

Numerical Descriptive Statistics

Minitab makes it easy to compute a number of numerical descriptive statistics for a dataset.

    STAT > Basic > Descriptive
    MTB > desc 'Origl IQ'

                    N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
    Origl IQ      250   100.32   100.00   100.21    16.52     1.04

                  MIN      MAX       Q1       Q3
    Origl IQ    58.00   196.00    90.00   112.00

The crucial information here is the maximum value, MAX = 196. This is the exact numerical value of the outlier seen in the dotplot and the boxplot above.

Notes on other descriptive statistics shown above. (Check your textbook for the definitions.)
• The sample size N = 250. (Minitab uses N here, but most texts use n for sample size and N for population size.)
• The sample MEAN = 100.32. (Most texts use x-bar or y-bar for the sample mean.)
• The sample MEDIAN = 100.00.
• The sample standard deviation STDEV = 16.52.
• TRMEAN stands for the trimmed mean of the sample, computed by ignoring the highest 5% and
lowest 5% of the data and averaging the middle 90%; this quantity is not as sensitive to erratic extreme values as is the mean.
• Q1 and Q3 are the lower and upper quartiles of the sample.
• SEMEAN is the estimated standard error of the mean, equal to the sample standard deviation divided by the square root of the sample size; this quantity is used in statistical inference.

In the actual situation upon which these data are based, the researcher rechecked the original list of IQ scores and found that the value 196 resulted from a data input error; the correct value is 96. The data in c3 'Final IQ' are identical to those in c2, except that this error has been corrected. Now we repeat our work using the corrected data.

    GRAPH > Character > Dotplot
    MTB > dotp c3

    [Character dotplot of Final IQ]

Here is a comparison of the numerical descriptive statistics for the incorrect and corrected IQ data. Note that descriptive statistics can be computed for more than one column at a time.

    STAT > Basic > Descriptive (select both columns)
    MTB > desc 'Origl IQ' 'Final IQ'

                    N      MEAN    MEDIAN    TRMEAN     STDEV    SEMEAN
    Origl IQ      250    100.32    100.00    100.21     16.52      1.04
    Final IQ      250    99.920   100.000   100.076    15.367     0.972

                  MIN       MAX        Q1        Q3
    Origl IQ    58.00    196.00     90.00    112.00
    Final IQ   58.000   150.000    90.000   112.000

The incorrect observation changed the mean by .4 of an IQ point (giving 100.3 compared with a correct mean of 99.9), the trimmed mean by about .1 of an IQ point, and the median not at all.

Comments: Unlike "textbook" examples, real data almost always contain some errors. In beginning to study a dataset, it is well to use a number of graphical and numerical devices to screen the data for unreasonable and inconsistent values. Using a computer with statistical software such as Minitab, we find it easy to take such a critical look at a dataset before we try to draw conclusions from it, even if the sample size is fairly large, as in the present case. Consider for a moment how much work would be required to duplicate the work shown in this demonstration if we had to do it using pencil,
graph paper, and a hand calculator.

Part 2: Sunflower Seedlings

Setup and Data

In this demonstration you will use columns c12, c13, and c14 of the Minitab worksheet MINDEMO.MTW or MINDEMO2.MTW.

A standard botany lab experiment at Cal State Hayward is to follow the growth of sunflower seedlings grown under various conditions. As part of this experiment, 30 sunflower seedlings were grown in soil that contains no nitrogen nutrients. In this demonstration we will look at data on the heights of these plants measured at the end of the second week (c12), third week (c13), and fourth week (c14).

Exploration of the Data

Dotplots

In order to follow the progress of these seedlings grown in nitrogen-deprived soil, we compare dotplots of the heights for each of the three weeks. For ease of interpretation, it is important that these three dotplots be drawn on the same scale. If we use the DOTPlot command, we can accomplish this by using the subcommand SAME. Notice that the command line ends with a semicolon and that the subcommand is on a separate line, which ends with a period. Also notice that this time we use a range of column numbers instead of column names to identify the columns; either method works. With menus, we select the three variables and the option to put them on the same scale.

    GRAPH > Character > Dotplot (same scale)
    MTB > dotp c12-c14;
    SUBC> same.

    [Character dotplots of No N Wk2, No N Wk3, and No N Wk4 on a common scale marked from 8.00 to 28.00]

We may not know how tall the nitrogen-deprived seedlings could be expected to grow in three weeks. However, the value 26 in Week 3 not only looks suspicious in its own right, it is inconsistent with the data for Week 4. Did the largest seedling shrink in size during the fourth week?

We print out the data in these three columns (abridged here to save space). The row with the questionable observation is indicated with an arrow (edited in by hand).

    MANIP > Display Data
    MTB > print c12-c14

     ROW   No N Wk2   No N Wk3   No N Wk4
       1         11         15         18
       2         10         14         17
     . . .
       8          6          8         12
       9         10         12         15
      10          8         13         17
      11         13         26         25   <--
      12         10         13         16
      13         12         17         23
     . . .
      27          9         12         14
      28         10         13         17
      29          7         11         15
      30         10         17         23

When checked, the original lab sheets showed that the 11th value in c13 should have been 20 instead of 26. There are two ways in which to correct this error:

1. Using the command LET c13(11) = 20
2. Going into the worksheet and changing the value directly.

Note that, by either method, only the data in the active version of the worksheet are changed. For a permanent record of the edited dataset you must record it on disk. You could either use the command SAVE, followed by the path and name of the new file enclosed in single quotes, or select FILE > Save As from the Windows menu.

The data are now "cleaned up" and ready for statistical analysis. In this case it is unlikely that the error we corrected would have prevented us from making a useful analysis of the data, but it is not difficult to imagine cases in which one or more transcription errors could change the interpretation of a dataset.

We repeat the dotplots for the repaired data and then look at numerical descriptive statistics.

    GRAPH > Character > Dotplot (same scale)
    MTB > dotp c12-c14;
    SUBC> same.

    [Character dotplots of No N Wk2, No N Wk3, and No N Wk4 on a common scale marked from 7.00 to 24.50]

    STAT > Basic > Descriptive
    MTB > desc c12-c14

                  N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
    No N Wk2     30    9.667   10.000    9.692    1.882    0.344
    No N Wk3     30   13.600   13.500   13.615    2.660    0.486
    No N Wk4     30   17.500   17.000   17.500    3.381    0.617

                  MIN      MAX       Q1       Q3
    No N Wk2    6.000   13.000    8.750   11.000
    No N Wk3    8.000   20.000   12.000   15.250
    No N Wk4   10.000   25.000   15.000   19.250

Notice that, by almost any numerical criterion, the seedlings continued to grow in weeks 2 through 4 even though the soil has no nitrogen. Later, when the nutrients in the seeds themselves had been exhausted, these seedlings did very poorly compared to ones grown in properly fertilized soil.

Comments

If we have access to the original data sheets for an experiment, it is sometimes possible to get rid of outliers by making corrections. But in some instances it will not be possible to explain an unusual observation. What if the person who made
the original measurement, rather than the person who entered the data into the computer, had made the error? In those cases, hard choices must be made about whether to disregard questionable observations. In general, great caution must be used in throwing out data that do not fit expected patterns. There is a story, perhaps true, perhaps not, that the "ozone hole" over the South Pole might have been discovered several years earlier than it was if a computer had not been programmed in such a way that it ignored the "obviously faulty" low ozone readings obtained.

Part 3: Quality Management

Setup and Data

This part continues to use the Minitab worksheet MINDEMO.MTW, specifically columns c22 and c23. Alternatively, you may retrieve MINDEMO3.MTW.

The particular data shown here were collected in the early 1960s as part of the "quality control" program at a factory in Illinois where electromechanical devices were manufactured. However, the basic story is one that has been repeated many times in many settings, and one that has been used by W. Edwards Deming to illustrate principles of quality management. At the request of the company involved, the data were rescaled slightly before they were taken offsite.

In order to function properly in the finished product of which they are a part, metal rods must be at least 1.000 cm in diameter. Of course, they must also not be too much larger than 1 cm, but here we focus on the crucial minimum diameter specification. A lot of 400 such rods was inspected, with the results recorded in column c22, named 'Inspect'.

Exploration of the Data

Histograms

We begin by making a histogram of these data. Histograms made by many statistical packages, and those usually published in printed articles and reports, have the measurement scale along the horizontal axis, with rectangular bars extending vertically. Before the histogram is drawn, the data are sorted into intervals, each of which forms the base of one of the bars.

Versions of Minitab that run under Windows make histograms with a horizontal measurement scale and vertical bars, as described above (select GRAPH > Histogram). In these notes for the web we prefer to use Minitab's character graphics because they do not require downloading graphics files. A character-graphic histogram is plotted "sideways", with the measurement scale running vertically and with rows of asterisks instead of rectangular bars. Note that in our example each asterisk represents up to 2 observations; otherwise the histogram bars would run off the page. In Minitab Release 7 and some later versions, one can also use the GHIStogram command to make a horizontal histogram on the graphics page.

    GRAPH > Character > Histogram
    MTB > hist 'Inspect'

    Histogram of Inspect   N = 400
    Each * represents 2 obs.

    Midpoint   Count
       0.996       3
       0.997       8
       0.998       0
       0.999       0
       1.000      93
       1.001      63
       1.002      72
       1.003      68
       1.004      46
       1.005      27
       1.006      13
       1.007       5
       1.008       2

The peak in the number of observations just at 1.000 cm, and the absence of any observations at all just below (at 0.999 and 0.998), is suspicious. It appears that the inspectors have fudged the results in order to pass rods that are just a bit too small. When questioned, they readily admitted that they had not understood the importance of the 1.000 cm lower limit and that they had indeed recorded slightly undersized rods as being 1.000 cm in diameter in a misguided attempt to avoid throwing them out.

The rods were subsequently reinspected more honestly, with the results recorded in c23, named 'Reinsp'.

    GRAPH > Character > Histogram
    MTB > hist 'Reinsp'

    Histogram of Reinsp   N = 400
    Each * represents 2 obs.

    Midpoint   Count
       0.996       3
       0.997       8
       0.998      22
       0.999      30
       1.000      41
       1.001      63
       1.002      72
       1.003      68
       1.004      46
       1.005      27
       1.006      13
       1.007       5
       1.008       2

Other Descriptive Methods

Notice that the numerical descriptive statistics are very much the same for the dishonest and honest inspection records; it is unlikely that our suspicions would have been aroused just by looking at the numerical summary of the original data in c22 ('Inspect').

    STAT > Basic > Descriptive
    MTB > desc 'Inspect' 'Reinsp'

                 N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
    Inspect    400   1.0021   1.0020   1.0020   0.0020   0.0001
    Reinsp     400   1.0019   1.0020   1.0019   0.0023   0.0001

                 MIN      MAX       Q1       Q3
    Inspect   0.9960   1.0080   1.0000   1.0030
    Reinsp    0.9960   1.0080   1.0000   1.0030

Similarly, the boxplot of the dishonest data shows nothing that would have caused us to suspect their validity. Boxplots are good at highlighting extreme values, but not at showing peculiarities in the central part of the sample distribution. Try making a boxplot of the honest data on your own; it will not be much different from the boxplot of the dishonest data shown below. (It is a quirk of Minitab that one cannot draw two boxplots to the same scale in a single command, as one can do for dotplots.)

    GRAPH > Character > Boxplot
    MTB > boxp 'Inspect'

    [Character boxplot of Inspect; axis marked from 0.9950 to 1.0075]

On the other hand, a dotplot of the original data gives much the same impression as the histogram, showing a peak adjacent to a gap. Actually, the dotplot is often a better bet to detect peculiarities in the "shape" of a sample. An unfortunate grouping of the data for the histogram might have put the peak and the gap into the same bar of the histogram, thus obscuring both.

    GRAPH > Character > Dotplot
    MTB > dotp 'Inspect'

    Each dot represents 5 points
    [Character dotplot of Inspect; axis marked from 0.9950 to 1.0075]

Comments

The moral of this example, and the two in preceding parts, is that it is wise to look at each data set using a variety of graphical and numerical methods. No one method can be guaranteed to show up the anomalies that may be present.

Computer analysis has a clear advantage in such a program of data exploration. Using a statistical computer package such as Minitab, one can quickly and easily use a variety of descriptive techniques to explore a dataset. Such a thorough analysis by hand would be quite tedious and would probably seldom be done in practice.

Part 4: Sodium in Hot Dogs

Setup and Data

This part uses data from columns c32 and c33 of MINDEMO.MTW. You may also use MINDEMO4.MTW, which contains
only these two columns.

In 1986 researchers at Consumers Union analyzed samples of 54 brands of hot dogs for fat and sodium content and reported the results, along with other information, in the June 1986 issue of Consumer Reports. Sodium in hot dogs comes from salt and other preservatives. Guidelines vary, but there is general agreement that the typical American diet is much too high in sodium. Here we consider the sodium content (in mg/oz) for two general types of hot dogs:

- Red Meat: 36 brands containing either pork or beef, and
- Poultry: 17 brands for which the animal content consists entirely of chicken or turkey or a combination.

    MANIP > Display Data
    MTB > print c32 c33

     ROW   RedMeat   Poultry
       1       248       269
       2       239       234
       3       213       248
       4       201       239
       5       241       242
       6       294       271
       7       216       224
       8       215       223
       9       240       264
      10       234       257
      11       193       213
      12       200       257
      13       241       298
      14       251       291
      15       323       294
      16       275       261
      17       211       273
      18       199
      19       199
      20       169
      21       229
      22       253
      23       237
      24       273
      25       248
      26       225
      27       242
      28       229
      29       254
      30       197
      31       203
      32       233
      33       256
      34       253
      35       268
      36       212

We derived the data given in the worksheet from information provided in Consumer Reports on milligrams of sodium per hot dog and on weights of hot dogs in ounces. We eliminated one brand made of veal and having an exceptionally low sodium content.

Structure of the Data

Notice that the numbers of observations in the two columns are not equal (in technical language, an unbalanced experimental design). Furthermore, in contrast to data we will see in Part 5, there is no connection between two observations that happen to be recorded in the same row of the worksheet. The two columns of data are independent of one another.

Exploration and Analysis of the Data

Descriptive Techniques

From the above listing of the data it is difficult to tell whether sodium content is generally higher for one kind of hot dog than for the other. We begin by looking at dotplots for the two columns of data, drawn on the same scale.

    GRAPH > Character > Dotplot (same scale)
    MTB > dotp 'RedMeat' 'Poultry';
    SUBC> same.

    [Character dotplots of RedMeat and Poultry on a common scale marked from 180 to 330]

While the samples with both the highest and the lowest concentrations of sodium are found among the meat hot dogs, there seems to be a clear tendency for poultry hot dogs to have higher concentrations of sodium than meat hot dogs do. The data may be summarized numerically as follows:

    STAT > Basic > Descriptive
    MTB > desc 'RedMeat' 'Poultry'

                 N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
    RedMeat     36   233.72   235.50   232.34    30.98     5.16
    Poultry     17   256.35   257.00   256.47    25.25     6.12

                 MIN      MAX       Q1       Q3
    RedMeat   169.00   323.00   211.25   252.50
    Poultry   213.00   298.00   236.50   272.00

The poultry hot dogs average about 256 mg of sodium per ounce, whereas the meat hot dogs average only about 234. Compare these means with what you see in the dotplots. The sample mean can be viewed as the point at which the dotplot would balance if all dots have the same weight.

Statistical Inference

The question of interest here is whether the difference between the sample means, which we noticed in the dotplots and verified by exact computation, indicates a real difference between the two types of hot dogs, or whether it might just have resulted from sampling variation. If we were to take another sample, would we expect to see a higher mean for poultry hot dogs again?

One inferential procedure commonly used to decide such questions is called a "two-sample t-test". It gives the probability that a difference in sample means at least this large would arise by chance alone if the two types of hot dogs did not really differ. This probability is the P-value given in the printout below.

    STAT > Basic > 2-Sample t (different columns; otherwise retain defaults)
    MTB > twos 'RedMeat' 'Poultry'

    TWOSAMPLE T FOR RedMeat VS Poultry
                N     MEAN    STDEV   SE MEAN
    RedMeat    36    233.7     31.0       5.2
    Poultry    17    256.4     25.2       6.1

    95 PCT CI FOR MU RedMeat - MU Poultry: (-38.9, -6.4)
    TTEST MU RedMeat = MU Poultry (VS NE): T = -2.83  P = 0.0075  DF = 38

The P-value of 0.0075 indicates that there are fewer than 8 chances in 1000 that a difference in sample means of the size we found here would occur by chance. We are led to conclude that poultry hot dogs as a group tend to
have higher concentrations of sodium. Thus the formal statistical procedure confirms what we see by eye from a comparison of the two dotplots. This is as it should be. A mathematical result that contradicts what one sees intuitively in a properly made graphic display should be viewed with great skepticism.

Comments

Because of the very small P-value, and because the data do not show any obvious flaws, we conclude that among hot dogs available in 1986 those made of poultry tended to have higher sodium concentrations than those made of red meat. Our statistical analyses have shown this difference to be real.

It is another matter whether this difference is of practical importance. This is something that statistical procedures cannot judge. Notice that there is enough dispersion in both groups that a customer who wants to eat poultry hot dogs can find a brand with lower sodium content than most brands with red meat. On the other hand, the customer could well select one of several brands of meat dogs with a higher sodium content than most poultry ones; in fact, the specimen showing the very highest sodium content in our dataset was a meat hot dog.

However, the lowest levels of sodium in hot dogs are still so high that a single hot dog may contain more sodium than a person should consume in a day. Perhaps even more important, someone who takes healthful eating as a really serious matter will probably not be shopping for hot dogs in the first place, for reasons in addition to sodium content.

As is often the case with real datasets, there are some questions here as to how the data were selected and how seriously we should take the results of the formal statistical analysis. For students who are beyond the first few weeks of an introductory statistics course, we discuss briefly some technical assumptions that one must make in performing the t-test just shown. We assumed that:

- The data in each group (meat and poultry) are a random sample of all individuals in the group. It is unlikely that the people from
Consumers Union made a careful random sample either of brands or of individual hot dogs. It is also hard to imagine that they had any reason deliberately to seek out poultry hot dogs with especially high sodium content to include in the study.
- The data are normally distributed. This assumption is not as important here as it is in some instances, because the sample sizes are moderately large and there is no evidence of serious skewness or outliers. Procedures, which we shall not describe here, can be used to test whether data are normal; they revealed no difficulty.
- The data in the two groups are independent. There is no reason to doubt this assumption from what we know of the data.

Note: The t-test we used does not assume that the variances of the two groups are equal. A pooled test, which does require this assumption, gives results similar to the ones obtained here. To try it, use the subcommand POOL in command mode or select the pooled test in Windows.

Part 5: Heart Attack Patients

Setup and Data

This part uses columns c42-c44 of the same Minitab worksheet, MINDEMO.MTW, used in the previous parts. Alternatively, retrieve MINDEMO5.MTW, which contains just these three columns.

The study examined here involves 28 heart-attack patients admitted to a large medical center. For each of the 28 patients, blood cholesterol levels were taken on the second and fourth days after the heart attack. These data are recorded in c42 ('2nd Day') and c43 ('4th Day'). The purpose of the study is to see whether cholesterol levels of heart-attack patients tend to change in the days immediately following the event. These data are taken from a dataset provided along with Minitab software.

Structure of the Data

As in the previous part, we have collected two columns of data here. However, the fundamental structure of these data differs from the structure in Part 4: these are "paired data". In order to make the point more clearly, we show the data for the first five patients.

    MANIP > Display Data
    MTB > print c42-c44

     ROW   2nd day   4th day   4th-2nd
       1       270       218       -52
       2       236       234        -2
       3       210       214         4
       4       142       116       -26
       5       280       200       -80
     . . .  (printout abridged to save space)

The first patient had a cholesterol level of 270 on the second day and 218 on the fourth day. For later use we record 218 - 270 = -52 in column c44, indicating that this patient's cholesterol level dropped by 52 points from day 2 to day 4.

The values 270 and 218 are said to be "paired" because they are a pair of measurements of the same type on the same patient. If the observations 270 and 218 had not been paired in this way, it would have made no sense to compute their difference. For paired data the order of presentation is important: the paired structure of the data would be lost if the order of the data in one column were changed without making the corresponding change in order in the other column.

All of the data in column c44 have been derived by computing such differences. For this example the differences have already been computed and recorded in the Minitab worksheet for you. The command we used to make c44 was

    LET c44 = c43 - c42

The same result could have been obtained from Windows menus: CALC > Calculator, store in c44, expression c43 - c42 (either typing the minus sign or clicking the mouse on the minus sign on the "calculator"). Then we named the new column '4th-2nd'. You can test your understanding of this procedure by recomputing the differences and putting them into c45, naming your column of differences 'Diff', and then checking the worksheet to see that your 'Diff' is identical to our '4th-2nd'.

Exploration and Analysis of the Data

Descriptive Techniques

We could use parallel dotplots of the data for the second day and for the fourth day to try to understand whether cholesterol levels tend to change just after a heart attack. Such dotplots do show a slight difference between the patterns of 2nd-day levels and 4th-day levels. However, this is not an effective or proper way to look at paired data. The main difficulty is that we cannot see which dot in the
first plot corresponds to which dot in the second. Because of the pairing, such comparisons are important.

    INEFFECTIVE PROCEDURE:
    GRAPH > Character > Dotplot (same scale)
    MTB > dotp c42 c43;
    SUBC> same.

    [Character dotplots of the 2nd-day and 4th-day levels on a common scale marked from 100.00 to 350.00]

What we really want to show in an effective plot is the difference each patient shows in cholesterol levels between Day 2 and Day 4. Hence, it is best to plot the differences in c44.

    GRAPH > Character > Dotplot
    MTB > dotp c44

    [Character dotplot of 4th-2nd]

This plot shows that even though cholesterol levels increased for some patients (specifically, 9 of the 28) over the two-day span of time, levels decreased for most of them (the other 19). In social science and medical data there are seldom absolutes. All we can conclude from this picture is that decreases in cholesterol seem to happen more often than increases.

Inferential Procedure

The average difference in our sample is a decrease of about 23 units. If our small sample is typical of the population of heart-attack patients, our best guess at the average population decrease is also about 23 units. How much different from 23 units might the population value be? A confidence interval procedure, based on the t distribution and giving rise to a command called TINTerval in Minitab, says that we can have reasonable confidence that the average change in the population is a decrease of between 8 and 38 units (i.e., 23 plus or minus an error factor of 15). Because the confidence interval does not include zero or any positive values, it is very likely that the actual population tendency would be for a decrease in cholesterol levels following a heart attack.

    STAT > Basic > 1-Sample t (confidence interval)
    MTB > tint c44

                 N     MEAN    STDEV   SE MEAN    95.0 PERCENT C.I.
    4th-2nd     28   -23.29    38.28      7.23    (-38.13, -8.44)

If you have studied one-sample t-tests, you should try the command TTESt 0 c44 to test the null hypothesis that the population mean difference is zero. (In Windows, select the same menu path as above, but test the mean with alternative "not equal" instead of
computing a confidence interval.) The very small P-value indicates that the null hypothesis should be rejected. Furthermore, the observed average change is a decrease in cholesterol levels. In plain English, this means that the data show a meaningful decrease in cholesterol levels between Day 2 and Day 4.

Comments

Again we see that a formal inferential procedure can confirm what we see in a properly made graphic display. Notice, however, that a clear understanding of the structure of the dataset is required in order to do a reasonable analysis, graphic or numerical. It was not enough to see that the data from this experiment were recorded in two columns representing experimental variables that need to be somehow "compared". An understanding of the paired structure of the data was crucial to the proper analysis of the data. Contrast the analysis of this data set with the analysis of the hot dog data in Part 4, which also involved a "comparison" of two columns of data.

Part 6: Education and Income

Setup and Data

This part uses the last two columns of the Minitab worksheet MINDEMO.MTW or MINDEMO6.MTW. It consists of data for two demographic variables collected by the U.S. Bureau of the Census in 1970 and summarized for zip codes:

- Yrs Educ: Median years of education for adults 25 years or older
- HH Incom: Median household income

Out of the approximately 32,000 zip codes in the U.S., we have data for a sample of only 500. This particular sample is not random, but results similar to the ones we shall see here would be obtained from using a random sample of residential zip codes. As we work with these data, remember that we are not dealing with characteristics of individual people, but summaries for zip codes, which may contain individuals with a wide variety of incomes and educational backgrounds.

One suspects, or at least hopes, that there is a positive association between education and income. Positive association means that an increase in one variable is associated with an increase in the
other. One common way to measure the degree of association is the coefficient of correlation, often denoted by r. We use Minitab to find the correlation between income and education for the 500 zip codes in the sample.

    STAT > Basic > Correlation
    MTB > corr 'HH Incom' 'Yrs Educ'

    Correlation of HH Incom and Yrs Educ = 0.606

Based on this information, we might be tempted to try to find the equation of a regression line that expresses income as a function of education. Minitab's computation of this line is shown below.

    STAT > Regression > Regression (with one predictor)
    MTB > regr 'HH Incom' 1 'Yrs Educ'

    The regression equation is
    HH Incom = -23833 + 4041 Yrs Educ

    Predictor       Coef     Stdev   t-ratio       p
    Constant      -23833      2882     -8.27   0.000
    Yrs Educ      4040.7     237.5     17.01   0.000

    s = 6523   R-sq = 36.7%   R-sq(adj) = 36.6%

Notes:

- In a simple linear regression, such as we have here, we attempt to express a "dependent" (or "response" or "predicted") variable in terms of one "independent" (or "explanatory" or "predictor") variable. Minitab's REGRession command can also be used for the situation in which there are several independent variables. For simple linear regression, the dependent variable is mentioned first and the single independent variable follows. Here the number "1" between variables tells Minitab that we will use only one independent variable.
- Minitab's "Analysis of Variance" table and a long list of "Unusual observations" are omitted here to save space. If your installation can display only one page of information at a time, press "y" as often as necessary to see all of this information, or "n" at any point to avoid further output.

In the kind of interpretation of the regression line often seen in the popular press, one might say that each year of education is worth a little more than 4000 dollars in increased annual household income. With this interpretation, the negative term in the regression equation would surely cause trouble if we tried to use the equation to predict
incomes in zip codes where the median education is less than six years.

The R-square value of about 37% says that about 37% of the variability in income can be "explained" in terms of the amount of education. This quantity, the square of the correlation, is called the coefficient of determination. For a relationship in which one variable can be perfectly predicted as a linear function of another, the coefficient of determination is 100%. In this demonstration we will not examine the other numbers in the Minitab printout in detail.

In fact, it is an abuse of computer technology to grind out the numbers for correlation and regression for these data. Both of these techniques rely on the assumption that the association between income and education can usefully be viewed as a linear one, and the fact of the matter is that the true nature of the association is much more interesting than that. The very long list of "unusual observations" generated by Minitab (but not reproduced above) is a strong indication that the assumption of a linear relationship is not appropriate here.

A scatter plot of these two variables is shown below. Note that the variable on the vertical axis is mentioned first in the command. The Minitab PLOT command used here makes a rather crude "character graphics" image, which is adequate for many purposes. An asterisk indicates a single data point. Numbers 2-9 indicate the number of points that fall at the same plotting location. The symbol + indicates that more than 9 points fall at a location. Release 7 of Minitab allows more detailed plotting with the command GPLOt instead of PLOT. Windows menus can also generate a different kind of PLOT command, in which the variables are separated by an asterisk and the resulting plot is more elegant than the one shown here. If you have access to one of these versions of Minitab, you should also look at the corresponding higher-resolution plot.

    GRAPH > Character > Scatterplot (or GRAPH > Plot)
    MTB > plot 'HH Incom' 'Yrs Educ'

    [Character scatterplot of HH Incom (vertical axis, marked from 0 to 75000) against Yrs Educ (horizontal axis)]

We see from this plot that the true relationship between income and education is that values fall only in the triangle that lies below the diagonal running from lower left to upper right. There are no zip codes in the sample of 500 having both low education and high income. There may be occasional low-education, high-income individuals lurking in the zip codes, but not enough of them to show up in zip code summary figures. On the other hand, high education is sometimes paired with low income. Perhaps the one zip code with very high median years of education and very low median household income consists predominantly of graduate student housing. Education can be viewed as providing a potential for high income, but not a guarantee.

The problem with correlation and regression methods here is not that the points fail to lie precisely on a line, nor that the coefficient of determination is only 37%. In the social and biological sciences an R-squared value of 37% sometimes indicates a meaningful association between two variables. A perfect fit to a line is not required. What is required is that no other relationship works substantially better.

This dataset shows that exploratory graphical analysis can reveal unexpected structure of practical significance. Information gained from looking at simple graphic displays can be very helpful in deciding what kind of more formal analysis is appropriate.

Notes for Minitab Demonstrations by Bruce E. Trumbo, Department of Statistics, CSU Hayward, Hayward CA 94542. Email: btrumbo@csuhayward.edu. Comments and corrections welcome.
Copyright (c) 1993, 1996 by Bruce E. Trumbo. All rights reserved. Permission for non-commercial educational use is hereby granted.
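
Readers without Minitab may want to cross-check the two-sample t-test of Part 4. The following Python sketch is an illustration only and is not part of the original demonstration. It uses only the standard library, takes the 36 red-meat and 17 poultry values from the Part 4 data listing, and assumes the usual unpooled (Welch) formulas for the t statistic and its approximate degrees of freedom. The function name welch_t is made up for this example, and truncating the degrees of freedom to an integer is an assumption about how the printout reports DF.

```python
from math import sqrt
from statistics import mean, variance

# Sodium content (mg/oz), copied from the Part 4 data listing.
red_meat = [248, 239, 213, 201, 241, 294, 216, 215, 240, 234, 193, 200,
            241, 251, 323, 275, 211, 199, 199, 169, 229, 253, 237, 273,
            248, 225, 242, 229, 254, 197, 203, 233, 256, 253, 268, 212]
poultry = [269, 234, 248, 239, 242, 271, 224, 223, 264, 257, 213, 257,
           298, 291, 294, 261, 273]


def welch_t(x, y):
    """Return (t, df) for the unpooled two-sample t statistic.

    Hypothetical helper for illustration: t is the difference in sample
    means divided by its estimated standard error; df is the
    Welch-Satterthwaite approximation (not yet truncated).
    """
    vx = variance(x) / len(x)   # squared standard error of mean(x)
    vy = variance(y) / len(y)   # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


t_stat, df = welch_t(red_meat, poultry)
print(f"T = {t_stat:.2f}, DF = {int(df)}")
```

Run as written, this should agree with the T and DF values in the Part 4 printout. The P-value is not computed here because the standard library has no t distribution; a statistics package would be needed for that step.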

