These 24 pages of class notes were uploaded by Roman McCullough on Friday, October 30, 2015. The notes belong to CMPSCI 691 at the University of Massachusetts, taught by David Jensen in Fall. Since upload, they have received 104 views.



Date Created: 10/30/15
Exploratory Data Analysis
10 March 2008
CMPSCI 691DD Research Methods

Edwin Hubble

What did Hubble see?

Hubble's Law

E. Hubble (1929). A relation between distance and radial velocity among extra-galactic nebulae. Proceedings of the National Academy of Sciences 15(3).

v = H_0 r

Where:
- v: recessional velocity
- H_0: Hubble constant
- r: distance (Mpc)

[Figure 1: velocity plotted against distance in parsecs]

"The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful."
John W. Tukey

Random variables

The "embarrassingly dogmatic misnomer": they are neither random nor are they variables.
- A random variable is a function that maps from instances to scales: the numeric result of a nondeterministic experiment.
- They can be distinguished from "fixed variables", whose value can be set or predetermined before the experiment.
- They are not the individual values (e.g., 592) but rather the process of assigning value to instances or, colloquially, the set of values so assigned.

Examples
- Recall of an IR system, given query, corpus, and designated relevant documents
- Size and speed of code produced by a compiler, given source and a target processor
- Number of database rows returned, given an anytime query processor, query, database, and time
- Lines of code written, given an assignment, language, development environment, and programmer

Notes
- The objects of study are usually the systems that enable random variables (e.g., IR systems) rather than the instances that the measures are on (e.g., queries).
- What we define as a random variable for a particular experiment can change as we discover deterministic and causal relationships in a given system.

Representation of data instances

i.i.d. instances are commonly assumed:
- Independent: knowing something about one instance tells you nothing about another
- Identically distributed: drawn from the same probability distribution

Examples
- Queries in TREC data
- Programs in SPEC benchmarks
- Data sets in UCI repository

Some alternatives
- Time series (e.g., users submitting sets of slightly modified queries)
- Relational (e.g., router performance embedded in a network)

Populations and samples

A population is a specified set of instances:
- An actual finite set of instances (e.g., the UCI data sets for machine learning research)
- A generalization of an actual finite set (e.g., the set of all data sets that might be produced by a particular simulator in infinite time)
- A purely hypothetical set which can be described mathematically (e.g., the set of all correct Java programs)

Samples are finite subsets of populations.

Examples

Populations                              Actual data samples
All possible IR queries                  The TREC 2005 HARD queries
All possible programs written in Java    The SPECjvm98 benchmarks
All Java programmers active in 2005      Students taking CMPSCI 320 in Fall 2005
The SPECjvm98 benchmarks                 A subset of the benchmarks

Four stages of defining a sample
- The target population (e.g., all computer programs)
- The sampling frame (all programs written in Java or C)
- The selected sample (all programs written by CS undergraduate students in 200-level courses at UMass)
- The actual sample (all programs actually turned in)

Why is sampling important?

Sampling problems

[Diagram: Population and Sample]
- Poorly specified population
- Nonindependent draws
- Censoring

Random sampling in CS
- Random sampling isn't easy in CS, but it's not easy in most sciences.
- The answer isn't to give up, but to consider how to get closer to the ideal:
  - Define the ideal population
  - Identify sources of bias in sampling and in subsequent steps of sample definition
  - Remove or mitigate as many sources of bias as possible
  - Modify your confidence in your ability to generalize based on your assessment of the match between your actual sample and your desired population

Types of scales
- Categorical (discrete or nominal): values contain no ordering information (e.g., multiple-access protocols for underwater networking)
- Ordinal: values indicate order, but no arithmetic operations are meaningful (e.g., "novice", "experienced", and "expert" as designations of programmers participating in an experiment)
- Interval: distances between values are meaningful, but the zero point is not meaningful (e.g., degrees Fahrenheit)
- Ratio: distances are meaningful and the zero point is meaningful (e.g., kelvins)

Data transformations
- Downgrading type (e.g., interval to ordinal)
- Shifting intervals
- Tukey's "ladder of powers": transformed = original^b
  - E.g., b = 2 gives original^2; b = 0.5 gives sqrt(original); b = -1 gives 1/original
- Combining several variables
  - Normalize measurements (e.g., Simsek & Jensen 2005 normalized to optimal)
  - Remove unwanted factors (e.g., remove file read times from total compile times)
  - Consider the relation of two variables (e.g., Kirkpatrick & Selman vertex-edge ratio)

Exploratory data analysis

"Exploratory data analysis (EDA) employs a variety of techniques to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop parsimonious models, and determine optimal factor settings."

"The EDA approach is precisely that, an approach, not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out."

Source: http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm

Why EDA?
- Data analysis tools are typically used for hypothesis testing and parameter estimation.
- However, much of the quality of scientific work is determined by the quality of the hypotheses and models used by the researcher.
- Can data analysis help suggest hypotheses?

Resources

Books
- Exploratory Data Analysis (Tukey 1977)
- Data Analysis and Regression (Mosteller and Tukey 1977)
- Interactive Data Analysis (Hoaglin 1977)
- The ABC's of EDA (Velleman and Hoaglin 1981)

Software
- Data Desk (Data Description)
- Fathom (Keypress)
- XGobi (AT&T Research)

Demo
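As a rough illustration of the ladder of powers listed under data transformations, the sketch below applies transformed = original^b for a few exponents. The function name `ladder_of_powers` and the sample values are hypothetical. Treating b = 0 as the logarithm is Tukey's usual convention but is an assumption here, since the notes do not mention it.

```python
import math

def ladder_of_powers(values, b):
    """Return each value raised to the power b (transformed = original ** b).

    Assumption: b == 0 is mapped to the natural log, following the usual
    convention for the ladder; the notes themselves only list powers.
    """
    if b == 0:
        return [math.log(v) for v in values]
    return [v ** b for v in values]

# Hypothetical positive measurements, chosen only for illustration.
original = [1.0, 4.0, 9.0, 16.0]

squared = ladder_of_powers(original, 2)       # b = 2: original squared
roots = ladder_of_powers(original, 0.5)       # b = 0.5: square root
reciprocals = ladder_of_powers(original, -1)  # b = -1: 1 / original
```

Moving down the ladder (smaller b) compresses large values more strongly, which is why these transformations are often tried when a variable is heavily right-skewed.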
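The Hubble's Law relation from the start of the notes, v = H_0 r, can be estimated from data with a least-squares fit through the origin. The sketch below is one possible way to do that, not the method used in the lecture. The function name, the units (Mpc and km/s), and the measurements are all hypothetical; the numbers are made up and are not Hubble's 1929 data.

```python
def fit_through_origin(distances, velocities):
    """Least-squares slope for v = H0 * r with no intercept term.

    Minimizing sum((v - H0 * r) ** 2) gives H0 = sum(r * v) / sum(r * r).
    """
    numerator = sum(r * v for r, v in zip(distances, velocities))
    denominator = sum(r * r for r in distances)
    return numerator / denominator

# Hypothetical, made-up measurements (NOT Hubble's 1929 data).
distances_mpc = [0.5, 1.0, 1.5, 2.0]            # assumed unit: megaparsecs
velocities_km_s = [260.0, 490.0, 760.0, 990.0]  # assumed unit: km/s

h0_estimate = fit_through_origin(distances_mpc, velocities_km_s)

# Residuals show how far each point sits from the fitted line.
residuals = [v - h0_estimate * r
             for r, v in zip(distances_mpc, velocities_km_s)]
```

In the spirit of EDA, plotting `residuals` against distance is a quick check on whether a straight line through the origin is a reasonable description of the data.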

