# Intr Stat&Data Analy SI 544



This 26-page set of class notes was uploaded by Yvette Hegmann on Thursday, October 29, 2015. The notes belong to SI 544 at the University of Michigan, taught by Lada Adamic in Fall.

Date Created: 10/29/15

## SI 544: Descriptive Statistics

And getting familiar with R

Mick McQuaid & Lada Adamic
School of Information, University of Michigan
Jan 8th, 2007

### last lecture

In this course:

- descriptive statistics: simply (quantitatively and visually) analyze data
- inferential statistics: try to reach conclusions (make judgements) by modeling relationships within data

### terminology

- population: all items of interest
- sample: portion of the population that you have data for
- parameter: summary measure about the population
- statistic: summary measure about the sample

Describing the distribution of one variable: central tendency, dispersion, skew. Exploring data in R.

[Screenshot: Science News article, "Declining Water Levels In The Great Lakes May Signal Global Warming"]

### datasets in R

    > data()

shows all available data sets.

    > ?LakeHuron

LakeHuron is a built-in data set with data on Lake Huron. The ? in front prompts R to tell you what it knows about a dataset or function.

    > plot(LakeHuron)

will plot a time series (water levels from 1875 to 1972).

[Figure: time series plot of LakeHuron against Time]

Looks like water levels were up again right around 1972. So what happened from 1973 to 2007? Is global warming having an effect? Or was 1973 just a
convenient place to "start" to observe a drop?

### other data set

    > lakedata <- read.table("somedir/lakehuron1918-2006.txt", head=T)

head=T instructs R to treat the first line of text as the column names. Make sure you include the correct path to the file. You can also have R prompt you to browse to the file location:

    > lakedata <- read.table(file=file.choose())

[Screenshot: map of water level gauges around Lake Huron (NOAA gauges and others). Click on individual gauges to access water level data.]

### Accessing columns by index or column name

    > colnames(lakedata)
    [1] "Year"         "Level_meters"
    > lakedata$Year
     [1] 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930
    [14] 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943
    > lakedata[,1]
     [1] 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930
    [14] 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943
    [27] ...

### Adding and manipulating columns

    > lakedata$LevelInFeet <- lakedata$Level_meters * 3.2808
    > lakedata$LevelInFeet
     [1] 579.7502 ...
     [8] 577.9457 577.6176 577.1583 577.7489 579.5533 579.5205 578.1098
    [15] 577.2240 ...
    [22] ...

### Accessing subsets of the data

    > lakedata[10:15,]
       Year Level_meters LevelInFeet
    10 1927       175.92    577.1935
    11 1928       176.10    577.7841
    12 1929       176.65    579.5887
    13 1930       176.64    579.5558
    14 1931       176.21    578.1450
    15 1932       175.94    577.2591
    > lakedata[lakedata$Year > 2002,]
       Year Level_meters LevelInFeet
    86 2003       175.82    576.8654
    87 2004       175.87    577.0295
    88 2005       176.08    577.7185
    89 2006       175.88    577.0623

### summary statistics

The mean, or average, is the sum of the values $X_1, X_2, X_3, \ldots, X_N$ divided by their count $N$:

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N}$$

    > mean(lakedata$LevelInFeet)
    [1] 578.5177

### median

The sample median is the "middle value": half of the data points have equal or lower value, and the other half have equal or higher value. If there is an even number of
sampled data points, then the median is the average of the middle two.

    > median(lakedata$LevelInFeet)
    [1] 578.5715

### variance

The variance measures the amount of dispersion in the sample:

$$\mathrm{var}(X) = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_N - \bar{X})^2}{N - 1}$$

Why $N - 1$ instead of $N$? If we have only a sample of the whole population, we don't know the true mean of the population, only the sample mean. By dividing by $N - 1$ we have a more generous estimate of the variance.

    > var(lakedata$LevelInFeet)
    [1] 1.557329

### standard deviation

The standard deviation is nothing more than the square root of the variance:

$$\sigma_X = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_N - \bar{X})^2}{N - 1}}$$

    > sd(lakedata$LevelInFeet)
    [1] 1.247930

So we can say that the lake water level (wherever the measuring gauge may actually be located) is 578.52 ± 1.25 feet.

### skew

We can also observe that the median is very close to the mean: there is little skew.

[Figure: histogram of lakedata$LevelInFeet, produced by hist(LevelInFeet, 20); x-axis lakedata$LevelInFeet, y-axis Frequency]

[Figure: LevelInFeet plotted against Year, produced by plot(Year, LevelInFeet)]

### attaching a dataset

If you are using a data set, it can be tiresome to always have to type yourdataset$columnname. Using the attach function, you can address a column simply by its name.

    > attach(lakedata)
    > hist(LevelInFeet, 20)
    > plot(Year, LevelInFeet)
    > detach(lakedata)

### where there is skew: income

[Figure: histogram of salaries$salary; y-axis Frequency]

    > salaries <- read.table("umichsalariesall.txt", head=T)
    > hist(salaries$salary, 50)
    > median(salaries$salary)
    [1] 48000
    > mean(salaries$salary)
    [1] 57788.35

- the tails of the distribution are the very large and very small values
- a long-tailed distribution has values that are far from the mean
- a left-skewed distribution has a longer left tail
- a right-skewed distribution has a longer right tail

Are umich salaries left or right skewed?

### handedness

In class you took a handedness survey. What do you think the histogram looks like?

    > hand <- read.table("handedness2005.txt", head=T)
    > hand
      left right specialization second_specialization
    1  ...
    2    3    19            HCI                  <NA>
    3    1    21            HCI                  <NA>
    4    7    15            ARM                    PI
    ...

Now we compute the handedness ratio $(r - l)/(r + l)$ and plot a histogram:

    > hand$handedness <- (hand$right - hand$left) / (hand$right + hand$left)
    > hist(hand$handedness)
    > hist(hand$handedness, breaks=20)

The second command tells the histogram function that we would like 20 bins. The rest awaits in the problem set.

### Things you should feel comfortable with in R

- Entering data
- Bringing up help pages
- Plotting and binning
- Loading data
- Selecting subsets of rows and columns in data
- Attaching and detaching data sets
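The summary statistics from these notes (mean, median, variance with the N - 1 divisor, standard deviation) can be checked by hand against R's built-in functions. The following is a minimal sketch, assuming the built-in LakeHuron series (levels in feet, 1875 to 1972) as a stand-in for the course's lakedata file, which is not reproduced here. The variable names are illustrative only.

```r
# Sketch: hand-computed summary statistics on the built-in LakeHuron series.
# Assumption: LakeHuron stands in for the course's lakedata file.
x <- as.numeric(LakeHuron)   # drop the time-series attributes, keep the values
n <- length(x)

x_bar <- sum(x) / n                    # mean: sum of the values divided by their count
v     <- sum((x - x_bar)^2) / (n - 1)  # sample variance: divide by N - 1, not N
s     <- sqrt(v)                       # standard deviation: square root of the variance

# Median: the middle value, or the average of the middle two when n is even
sorted <- sort(x)
med <- if (n %% 2 == 0) {
  mean(sorted[c(n / 2, n / 2 + 1)])
} else {
  sorted[(n + 1) / 2]
}

# Compare the hand-computed values with R's built-in functions
cat("mean matches:    ", isTRUE(all.equal(x_bar, mean(x))), "\n")
cat("median matches:  ", isTRUE(all.equal(med, median(x))), "\n")
cat("variance matches:", isTRUE(all.equal(v, var(x))), "\n")
cat("sd matches:      ", isTRUE(all.equal(s, sd(x))), "\n")
```

Replacing `n - 1` with `n` in the variance line gives a slightly smaller value than `var(x)`, which is one way to see that R's `var` uses the N - 1 divisor.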

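The handedness ratio and the mean-versus-median check for skew can be tried on a small made-up table. This is only a sketch: the counts below are hypothetical (they are not the class survey data), and `skew_direction` is an illustrative helper, not an R built-in. Comparing mean and median is a rough rule of thumb rather than a formal skewness measure.

```r
# Hypothetical survey counts (NOT the class data): left/right answers per respondent
hand <- data.frame(left  = c(0, 1, 3, 7, 20),
                   right = c(22, 21, 19, 15, 2))

# Handedness ratio (r - l) / (r + l): +1 is fully right-handed, -1 fully left-handed
hand$handedness <- (hand$right - hand$left) / (hand$right + hand$left)

# Illustrative helper: a mean below the median suggests a longer left tail,
# a mean above the median suggests a longer right tail
skew_direction <- function(x) {
  if (mean(x) < median(x)) {
    "left"
  } else if (mean(x) > median(x)) {
    "right"
  } else {
    "none"
  }
}

# Bin the ratios without drawing; breaks = 20 is a suggestion that R may adjust
h <- hist(hand$handedness, breaks = 20, plot = FALSE)

print(hand)
cat("skew of handedness ratio:", skew_direction(hand$handedness), "\n")
cat("respondents binned:", sum(h$counts), "\n")
```

Dropping `plot = FALSE` draws the histogram, as in the `hist(hand$handedness, breaks=20)` call from the notes.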
