APPLIED STATISTICS STATS 0110A
This 13 page Class Notes was uploaded by Isobel Stanton on Friday September 4, 2015. The Class Notes belongs to STATS 0110A at University of California - Los Angeles taught by Staff in Fall. Since its upload, it has received 105 views. For similar materials see /class/177959/stats-0110a-university-of-california-los-angeles in Statistics at University of California - Los Angeles.
Date Created: 09/04/15
Stats 110A Applied Statistics
Instructor: H. Xu
Handout: Binomial Distributions

[Figure: binomial probability distributions, Probability against X, in eight panels: Bin(n=10, p=0.2), Bin(n=20, p=0.2), Bin(n=50, p=0.2), Bin(n=100, p=0.2), Bin(n=10, p=0.7), Bin(n=20, p=0.7), Bin(n=50, p=0.7), Bin(n=100, p=0.7).]

• A variable is the name (label) given to the object being measured, counted, observed, or recorded in any way. E.g., ID, EjectionVolume, SysDiaPressure, etc.
• Data are the actual recorded values. E.g., 120/80 for the arterial pressure.

Types of variables
• Quantitative variables are measurements and counts.
  - Continuous: variables with few repeated values are treated as continuous.
  - Discrete: variables with many repeated values are treated as discrete.
• Qualitative variables (a.k.a. factors or class variables) describe group membership.
  - Categorical: no idea of order.
  - Ordinal: fall in a natural order.

Presentation of data: repeated and grouped data
• List all possible categories the data is classified in.
• Represent the frequency of occurrence of the data in each category.
• Example: number of engineering students enrolled in different majors.
[Table: enrollment counts by major.]

Frequency histograms: protocol
• Determine the RANGE of values [a, b].
• Determine the number of bars (bins) to plot.
• Determine the width of each bar: (b − a) / (number of bins).
• Count the frequency of your data in each bin (subinterval).
• Draw the histogram.
[Table: example data.]

Frequency histograms: protocol applied to the example
• Determine the RANGE of values: [0, 130].
• Determine the number of bars (bins) to plot: 8.
• Width of each bar: 130 / 8 ≈ 16.3.
• Count the frequency of your data in each bin (subinterval).
• Draw the histogram.

• The height of the i-th histogram bar (bin) is $h_i = \frac{f_i}{n \times w_i}$, where $f_i$ is the
frequency of the data in the i-th interval, $w_i$ is the width of the i-th interval, and $n$ is the total number of data points.
• Then the area of the i-th histogram bar is $A_i = f_i / n$.
• And the total area of all histogram bins is 1 (100%).

• The number of clear humps on the frequency histogram plot determines the modality of the histogram.
• Note: the modality of a histogram is specific to the histogram parameters. Changing the width of the bins changes its appearance.
• A histogram is skewed if it is not symmetric: the histogram is heavy to the left or right, or nonidentical on both sides of the mean.

For a standardized histogram:
• The vertical scale is relative frequency / interval width.
• The total area under the histogram is 1.
• The proportion of the data between a and b is the area under the histogram between a and b.
[Figure: standardized histogram; horizontal axis Carbohydrate (mg/day), 0 to 600.]

• Modality: uni- vs. multimodal. Why do we care?
• Symmetry: how skewed is the histogram?
• Center of gravity for the histogram plot: does it make sense?
• If a center of gravity exists, quantify the spread of the frequencies around this point.
• Strange patterns: gaps, atypical frequencies lying away from the center.

• Round numbers for presentation.
• Maintain complete accuracy in numbers to be used in calculations. If you need to round off, this should be the very last operation.

[Figure: dot plot showing special features: cluster, gap, outlier.]
[Figure: dot plot with and without a scale break: (a) unbroken scale, (b) broken scale.]
[Figure: stem-and-leaf plot of growth in GDP: forecast of percent growth in GDP for 1990 for some South-East Asian and Pacific countries.]
[Table: coyote lengths data (cm).]
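As a rough illustration of the standardized histogram described above, the R sketch below computes the bar heights as f_i / (n × w_i) and checks that the bar areas add up to 1. The data vector `x` is simulated and purely hypothetical (it is not the handout's data); only the range [0, 130] and the choice of 8 bins are taken from the protocol example.

```r
# Hypothetical data: 40 simulated values in [0, 130] (not from the handout).
set.seed(1)
x <- runif(40, min = 0, max = 130)

# Protocol example: range [0, 130] and 8 bins, so each bin has width 130 / 8.
breaks <- seq(0, 130, length.out = 9)

# Count the frequency in each bin without drawing the plot.
h <- hist(x, breaks = breaks, plot = FALSE)

f <- h$counts          # f_i: frequency in the i-th bin
n <- length(x)         # n: total number of data points
w <- diff(h$breaks)    # w_i: width of the i-th bin

height <- f / (n * w)  # bar heights on the standardized scale
area   <- height * w   # bar areas, each equal to f_i / n

sum(area)              # total area under the histogram
```

Calling `hist(x, breaks = breaks, freq = FALSE)` draws the histogram on this same scale; the heights it uses are stored in `h$density`.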
Coyotes captured in Nova Scotia, Canada. Data courtesy of Dr Vera Eastwood.

[Table: female coyote lengths: class interval, tally, frequency, and stem-and-leaf plot compared.]
[Figure: (a) histogram and (b) stem-and-leaf plot of the female coyote-lengths data.]

• What advantages does a stem-and-leaf plot have over a histogram? (Stem-and-leaf plots retain information on individual values, are quick to produce by hand, and provide a data-sorting mechanism. But histograms are more attractive and more understandable.)
• The shape of a histogram can be quite drastically altered by choosing different class-interval boundaries. What type of plot does not have this problem? (Density trace.) What other factor affects the shape of a histogram? (Bin size.)
• What was another reason given for plotting data on a variable, apart from interest in how the data on that variable behaves? (It shows features such as clusters, gaps, and outliers, as well as trends.)

[Figure: histograms and density trace of the female coyote-lengths data, including (c) same widths, different boundaries and (d) density trace.]
[Figure: features to look for in histograms and stem-and-leaf plots: (a) unimodal, (b) bimodal, (c) trimodal, (d) symmetric, (e) positively skewed (long upper tail), (f) negatively skewed (long lower tail), (g) symmetric, (h) bimodal with gap, (i) exponential shape, (j) spike in pattern, (k) outliers, (l) truncation plus outlier.]

• The sample mean is denoted by $\bar{x}$:
  $\bar{x} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$

[Figure: mechanical construction representing a dot plot: (a) shows a balanced rod, while (b) and (c) show unbalanced rods.]
Figure: The mean and the median. (a) Data symmetric about P. (b)
Two largest points moved to the right. Grey disks in (b) are the ghosts of the points that were moved.

• The median is the [(n+1)/2]-th ordered observation. If (n+1)/2 is not a whole number, the median is the average of the two observations on either side.
• Beware of inappropriate averaging.

• The q-th quantile (100 × q-th percentile) is a value in the range of our data such that a proportion of at least q of the data lies at or below it and a proportion of at least (1 − q) lies at or above it.
• E.g., X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The 20th percentile (0.2 quantile) is the value 2, since 20% of the data lies at or below it and 80% lies above it. The 70th percentile is the value 7, etc.
• Other values could also have been selected for the 20th and 70th percentiles above. There is no agreement on the exact definitions of quantiles.

Measures of spread, where $\bar{y}$ is the sample mean of the observations $y_1, \dots, y_n$:
• Mean Absolute Deviation: $\mathrm{MAD} = \frac{1}{n-1}\sum_{i=1}^{n} |y_i - \bar{y}|$
• Variance: $\mathrm{Var} = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$
• Standard Deviation: $\mathrm{SD} = \sqrt{\mathrm{Var}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}$
• Example: X = {1, 2, 3, 4}. MAD = 4/3 ≈ 1.33, Var = 5/3 ≈ 1.67, SD = √Var ≈ 1.3.

[Figure: three graphs of the breaking-strength data, including a stem-and-leaf plot of strength (N = 33, leaf unit = 10) and a box plot on a strength scale from 2000 to 2600.]
[Figure: construction of a box plot: extend each whisker by 1.5 × IQR, then pull back until hitting an observation.]
[Table: word lengths, as a frequency table.]

• Mean from a frequency table: Mean = (Sum of value × frequency of occurrence) / n = (Sum of all observations) / n.

University of California, Los Angeles
Department of Statistics
Statistics 110A
Instructor: Nicolas Christou

Data analysis with R: some simple commands

When you are in R, the command line begins with >

To read data from a website:
> site <- "http://www.stat.ucla.edu/~nchristo/bodyfat.txt"
> data <- read.table(file=site, header=T)

Another way to read data from a website is the following:
> data <- read.table("http://www.stat.ucla.edu/~nchristo/bodyfat.txt", header=TRUE)

This file contains data on percentage of body fat determined by underwater
weighing and various body circumference measurements for 251 men. Here is the variable description:

Variable  Description
1   Density determined from underwater weighing
2   Percent body fat from Siri's (1956) equation
3   Age (years)
4   Weight (lbs)
5   Height (inches)
6   Neck circumference (cm)
7   Chest circumference (cm)
8   Abdomen 2 circumference (cm)
9   Hip circumference (cm)
10  Thigh circumference (cm)
11  Knee circumference (cm)
12  Ankle circumference (cm)
13  Biceps (extended) circumference (cm)
14  Forearm circumference (cm)
15  Wrist circumference (cm)

If the data file is on your computer (e.g. on your desktop), first you need to change the working directory by clicking on Misc at the top of your screen, and then read the data as follows:
> data <- read.table("filename.txt", header=T)

Note: the expression <- is an assignment operator. The result of read.table is a data frame (it looks like a matrix).

Useful commands

Extracting one variable from data (e.g. the second variable):
> data[,2]
Another way to extract one variable:
> data$x2
Similarly, if we want to access a particular row in our data (e.g. the first row):
> data[1,]
To list all the data, simply type:
> data
To compute the mean of all the variables in the data set:
> colMeans(data)
To compute the mean of just one variable:
> mean(data$x2)
To compute the mean of variables 2 and 3:
> colMeans(data[,c(2,3)])
To compute the variance of one variable:
> var(data$x2)
To compute summary statistics for all the variables:
> summary(data)
To construct a stem-and-leaf plot, histogram, and boxplot:
> stem(data$x2)
> boxplot(data$x2)
> hist(data$x2)
To plot variable 2 against variable 10:
> plot(data$x2, data$x10)
And you can give names to the axes and to your plot:
> plot(data$x2, data$x10, main="Scatterplot of percent body fat against thigh circumference", xlab="Percent body fat", ylab="Thigh circumference")
To save a plot as a pdf file under the working directory (e.g. your desktop):
> pdf("boxx2.pdf")
> boxplot(data$x2)
> dev.off()

• If you want to read more about a specific command (for example the histogram), at the command line you type the
following:
> ?hist
> ?boxplot

On your computer (Desktop) this is what you get under the name boxx2.pdf:
[Figure: the boxplot saved as boxx2.pdf.]

• Exercise: Construct the same plots with different variables and save them on your desktop.

Another data set

The following data were collected in the area west of the town Stein in the Netherlands, near the river Meuse (Dutch: Maas river; see map below). The actual data set contains many variables, but here we will use the x, y coordinates and the concentration of lead and zinc in ppm at each data point. The motivation for this study was to predict the concentration of heavy metals around the banks of the Maas river in this area. These heavy metals accumulated over the years because of the river pollution. Here is the area of study:
[Figure: map of the study area.]

Exercise: You can access these data at:
> soil <- read.table("http://www.stat.ucla.edu/~nchristo/statisticslS/soil.txt", header=TRUE)

• Construct the stem-and-leaf plot, histogram, and boxplot for each one of the two variables (lead and zinc), and compute the summary statistics. What do you observe?
• Transform the data in order to produce a symmetrical histogram. Here is what you can do:
> loglead <- log(soil$lead)
> logzinc <- log(soil$zinc)
Construct the stem-and-leaf plot, histogram, and boxplot for each one of the new variables (loglead and logzinc), and compute the summary statistics. What do you observe now?

Here is a side-by-side boxplot of the variables lead and zinc. First create a new data frame with only the variables lead and zinc:
> soil1 <- soil[,3:4]
Then
you can construct side-by-side boxplots of lead and zinc using:
> boxplot(soil1)
[Figure: side-by-side boxplots of lead and zinc.]
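The effect of the log transform in the exercise can be sketched without downloading anything. The example below is only an illustration under stated assumptions: `conc` is a hypothetical right-skewed variable built from evenly spaced lognormal quantiles (it is not the soil data), and `skew_gap` is a helper defined here, not a built-in R function. It compares the mean with the median before and after taking logs.

```r
# Hypothetical right-skewed "concentrations": evenly spaced lognormal
# quantiles (illustrative values only, not the soil data from the handout).
conc <- qlnorm(ppoints(150), meanlog = 5, sdlog = 0.7)
logconc <- log(conc)

# Helper defined for this sketch (not a built-in): (mean - median) / SD is
# close to 0 for symmetric data and positive when the upper tail is long.
skew_gap <- function(v) (mean(v) - median(v)) / sd(v)

summary(conc)       # mean sits above the median on the raw scale
summary(logconc)    # mean and median agree on the log scale

skew_gap(conc)
skew_gap(logconc)

# Number of points a boxplot would flag as outliers, before and after.
length(boxplot.stats(conc)$out)
length(boxplot.stats(logconc)$out)
```

With the real data, the same comparison could be made by replacing `conc` with `soil$lead` or `soil$zinc`, assuming the file has been read into `soil` as in the exercise.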