Class Note for CMPSCI 691 at UMass(14)
This 24 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 16 views.
Date Created: 02/06/15
Exploratory Data Analysis
10 March 2008
CMPSCI 691DD Research Methods

What did Hubble see?

Hubble's Law

Hubble, E. (1929). A relation between distance and radial velocity among extra-galactic nebulae. Proceedings of the National Academy of Sciences 15(3).

[Figure 1 from Hubble (1929): velocity plotted against distance in parsecs]

v = H0 r

where
- v = recessional velocity
- H0 = Hubble constant
- r = distance (Mpc)

"The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful."
John W. Tukey

Random variables

- The "embarrassingly dogmatic misnomer": they are neither random nor are they variables.
- A random variable is a function that maps from instances to scales: the numeric result of a nondeterministic experiment.
- They can be distinguished from "fixed variables", whose value can be set or predetermined before the experiment.
- They are not the individual values (e.g., 592) but rather the process of assigning values to instances or, colloquially, the set of values so assigned.

Examples

- Recall of an IR system, given a query, corpus, and designated relevant documents
- Size and speed of code produced by a compiler, given source and a target processor
- Number of database rows returned, given an anytime query processor, query, database, and time
- Lines of code written, given an assignment, language, development environment, and programmer

Notes

- The objects of study are usually the systems that enable random variables (e.g., IR systems) rather than the instances that the measures are on (e.g., queries).
- What we define as a random variable for a particular experiment can change as we discover deterministic and causal relationships in a given system.

Representation of data instances

- i.i.d. instances are commonly assumed
  - Independent: knowing something about one instance tells you nothing about another
  - Identically distributed: drawn from the same probability distribution
- Examples
  - Queries in TREC data
  - Programs in SPEC benchmarks
  - Data sets in the UCI repository
- Some alternatives
  - Time series, e.g., users submitting sets of slightly modified queries
  - Relational, e.g., router performance embedded in a network

Populations and samples

- A population is a specified set of instances
  - An actual finite set of instances, e.g., the UCI data sets for machine learning research
  - A generalization of an actual finite set, e.g., the set of all data sets that might be produced by a particular simulator in infinite time
  - A purely hypothetical set which can be described mathematically, e.g., the set of all correct Java programs
- Samples are finite subsets of populations

Examples

Populations                              Actual data samples
All possible IR queries                  The TREC 2005 HARD queries
All possible programs written in Java    The SPECjvm98 benchmarks
All Java programmers active in 2005      Students taking CMPSCI 320 in Fall 2005
The SPECjvm98 benchmarks                 A subset of the benchmarks

Four stages of defining a sample

- The target population, e.g., all computer programs
- The sampling frame, e.g., programs written in Java or C
- The selected sample, e.g., all programs written by CS undergraduate students in 200-level courses at UMass
- The actual sample, e.g., all programs actually turned in

Why is sampling important?

Sampling problems

[Diagram: population and sample]

- Poorly specified population
- Nonindependent draws
- Censoring

Random sampling in CS

- Random sampling isn't easy in CS
  - but it's not easy in most sciences
- The answer isn't to give up, but to consider how to get closer to the ideal
  - Define the ideal population
  - Identify sources of bias in sampling and in subsequent steps of sample definition
  - Remove or mitigate as many sources of bias as possible
  - Modify your confidence in your ability to generalize based on your assessment of the match between your actual sample and your desired population

Types of scales

- Categorical (discrete or nominal): values contain no ordering information, e.g., multiple-access protocols for underwater networking
- Ordinal: values indicate order, but no arithmetic operations are meaningful, e.g., "novice", "experienced", and "expert" as designations of programmers participating in an experiment
- Interval: distances between values are meaningful, but the zero point is not, e.g., degrees Fahrenheit
- Ratio: distances are meaningful and the zero point is meaningful, e.g., kelvins

Data Transformations

- Downgrading type, e.g., interval to ordinal
- Shifting intervals
- Tukey's "ladder of powers": transformed = original^b
  - E.g., b = 2 -> original^2
  - b = 0.5 -> sqrt(original)
  - b = -2 -> 1/original^2
- Combining several variables
  - Normalize measurements, e.g., Simsek & Jensen 2005 normalized to optimal
  - Remove unwanted factors, e.g., remove file read times from total compile times
  - Consider the relation of two variables, e.g., Kirkpatrick & Selman vertex/edge ratio

Exploratory data analysis

"Exploratory data analysis (EDA) employs a variety of techniques to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop parsimonious models, and determine optimal factor settings."

"The EDA approach is precisely that, an approach, not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out."

Source: http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm

Why EDA?

- Data analysis tools are typically used for hypothesis testing and parameter estimation.
- However, much of the quality of scientific work is determined by the quality of the hypotheses and models used by the researcher.
- Can data analysis help suggest hypotheses?

Resources

Books
- Exploratory Data Analysis, Tukey 1977
- Data Analysis and Regression, Mosteller and Tukey 1977
- Interactive Data Analysis, Hoaglin 1977
- The ABC's of EDA, Velleman and Hoaglin 1981

Software
- Data Desk (Data Description)
- Fathom (Key Press)
- XGobi (AT&T Research)

Demo
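The four stages of defining a sample described earlier in these notes (target population, sampling frame, selected sample, actual sample) can be sketched roughly in Python. This is only an illustration under stated assumptions: the program identifiers, the `draw_selected_sample` helper, and the `turned_in` set are all hypothetical and do not come from the lecture.

```python
import random

# Hypothetical sampling frame: identifiers for 20 programs (made-up data).
sampling_frame = [f"program_{i:02d}" for i in range(20)]


def draw_selected_sample(frame, k, seed=0):
    """Draw k items uniformly at random, without replacement.

    Hypothetical helper; the fixed seed only makes the draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(frame, k)


selected_sample = draw_selected_sample(sampling_frame, k=5)

# Assumed for illustration: only some programs are actually turned in,
# so the actual sample can be smaller than the selected sample.
turned_in = {"program_03", "program_07", "program_11", "program_15"}
actual_sample = [p for p in selected_sample if p in turned_in]

print("selected:", selected_sample)
print("actual:  ", actual_sample)
```

The gap between `selected_sample` and `actual_sample` is one place where bias can enter: if the programs that are never turned in differ systematically from the rest, the actual sample no longer matches the desired population.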
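Tukey's ladder of powers, listed under Data Transformations, raises each original value to a power b. A minimal sketch follows, assuming positive measurements; the function name `ladder_transform` and the sample values are hypothetical, and mapping b = 0 to the natural log is the usual convention for the ladder rather than something stated in the notes.

```python
import math


def ladder_transform(values, b):
    """Apply a ladder-of-powers transform: original ** b.

    Hypothetical helper for illustration. b == 0 is mapped to the natural
    log by convention (an assumption; the notes list only b = 2, 0.5, -2).
    Values are assumed to be positive.
    """
    if b == 0:
        return [math.log(v) for v in values]
    return [v ** b for v in values]


# Made-up, right-skewed measurements (e.g. run times in seconds).
run_times = [1.0, 4.0, 9.0, 100.0]

squared = ladder_transform(run_times, 2)       # original^2
roots = ladder_transform(run_times, 0.5)       # sqrt(original)
inv_squares = ladder_transform(run_times, -2)  # 1 / original^2

print(squared)
print(roots)
print(inv_squares)
```

Moving down the ladder (smaller b) pulls in large values more strongly, which is why such transforms are often tried when a variable has a long right tail.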