Introduction to Biostatistics
Introduction to Biostatistics STAT 541
This 98 page Class Notes was uploaded by Mrs. Triston Collier on Thursday September 17, 2015. The Class Notes belongs to STAT 541 at University of Wisconsin - Madison taught by Staff in Fall. Since its upload, it has received 30 views. For similar materials see /class/205085/stat-541-university-of-wisconsin-madison in Statistics at University of Wisconsin - Madison.
Date Created: 09/17/15
Ismor Fischer, 8/16/2008. Stat 541 / 4-31

4.3 Problems

4-1. Formally prove that each of the following is a valid density function. (Note: This is a rigorous mathematical exercise.)

(a) $f_{\text{Bin}}(x) = \binom{n}{x}\,\pi^x (1-\pi)^{n-x}$, $x = 0, 1, 2, \ldots, n$
(b) $f_{\text{Poisson}}(x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$, $x = 0, 1, 2, \ldots$
(c) $f_{\text{Normal}}(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, $-\infty < x < \infty$

4-2. Formally prove each of the following, using the appropriate expected value definitions. (Note: As with the preceding problem, this is a rigorous mathematical exercise.)

(a) If $X \sim \text{Bin}(n, \pi)$, then $\mu = n\pi$ and $\sigma^2 = n\pi(1-\pi)$.
(b) If $X \sim \text{Poisson}(\lambda)$, then $\mu = \lambda$ and $\sigma^2 = \lambda$.
(c) If $X \sim N(\alpha, \beta)$, then $\mu = \alpha$ and $\sigma^2 = \beta^2$.

4-3. For any $p > 0$, sketch the graph of $f(x) = p\,x^{-p-1}$ for $x \ge 1$ (and $f(x) = 0$ for $x < 1$), and formally show that it is a valid density function. Then show the following:

• If $p > 2$, then $f(x)$ has finite mean $\mu$ and finite variance $\sigma^2$.
• If $1 < p \le 2$, then $f(x)$ has finite mean $\mu$ but infinite (i.e., undefined) variance.
• If $0 < p \le 1$, then $f(x)$ has infinite (i.e., undefined) mean, and hence undefined variance.

(Note: As with the preceding problems, this is a rigorous mathematical exercise.)

4-4. Let $X$ = number of Heads in $n = 100$ independent coin tosses, so that $X \sim \text{Bin}(100, \pi)$ for $0 \le \pi \le 1$.

(a) Suppose the coin is fair; calculate the mean $\mu$ and variance $\sigma^2$.
(b) Suppose the coin is two-headed; calculate the mean $\mu$ and variance $\sigma^2$.
(c) Suppose the coin is two-tailed; calculate the mean $\mu$ and variance $\sigma^2$.
(d) Sketch the graph of the variance $\sigma^2 = n\pi(1-\pi)$ for $n = 100$ and $0 \le \pi \le 1$. What general conclusions about the variance of $X$ can you draw from this? In particular, how does this compare with the variance of, say, a normally distributed variable $X$?

4-5. Imagine that a certain disease occurs in a large population in such a way that the probability of a randomly selected individual having the disease remains constant at $\pi = .008$, independent of any other randomly selected individual having the disease. Suppose now that a sample of $n = 500$ individuals is to be randomly selected from this population. Define the discrete random variable $X$ = the number of diseased individuals, capable of assuming any value in the set $\{0, 1, 2, \ldots, 500\}$ for this sample.

(a) Calculate the probability distribution function $f(x) = P(X = x)$, the probability that the number of diseased individuals equals $x$, for $x = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10$. Do these computations two ways: first using the Binomial Distribution and second using the Poisson Distribution, and arrange these values into a probability table. For the sake of comparison, record at least five decimal places.

   x      Binomial    Poisson
   0
   1
   2
   ...
   10
   etc.   etc.        etc.

(b) Using either the Binomial or Poisson Distribution, what is the mean number of diseased individuals to be expected in the sample, and what is its probability? How does this probability compare with the probabilities of other numbers of diseased individuals?
(c) Suppose that after sampling $n = 500$ individuals, you find that $X = 10$ of them actually have this disease. Before performing any formal statistical tests, what assumptions, if any, might you suspect have been violated in this scenario? What is the estimate of the probability $\pi$ of disease, based on the data of this sample?

4-6. The uniform density function given in the notes has median and mean = 3.5, by inspection. Calculate the variance.

4-7. Define the piecewise uniform function

$$f(x) = \begin{cases} \tfrac{1}{8}, & 1 \le x < 3 \\ \tfrac{1}{4}, & 3 \le x \le 6 \end{cases}$$

and 0 elsewhere. Prove that this is a valid density function, sketch the cdf $F(x)$, and find the median, mean, and variance.

4-8. Suppose that the continuous random variable $X$ = age of juniors at the UW Iwanagoeechees campus is symmetrically distributed about its mean, but piecewise linear as illustrated, rather than being a normally distributed bell curve.

[Figure: piecewise linear density $f(x)$ plotted over $18 \le x \le 22$.]

For an individual selected at random from this population, calculate each of the following.

(a) Verify by direct computation that $P(18 \le X \le 22) = 1$, as it should be. [Hint: Recall that the area of a triangle = 1/2 (base × height).]
(b) $P(18 \le X < 18.5)$
(c) $P(18.5 < X \le 19)$
(d) $P(19.5 < X < 20.5)$
(e) What symmetric interval about the mean contains exactly half the population values? Express in terms of years and months.
(f) Determine the equation of the density function $f(x)$ and the cumulative function $F(x)$. Sketch the graph of $F(x)$. (Note: This problem can be done without calculus, but it helps.)

4-9. Suppose that in a certain population of adult males, the variable $X$ = total serum cholesterol level (mg/dL) is found to be normally distributed, with mean $\mu = 220$ and standard deviation $\sigma = 40$. For an individual selected at random, what is the probability that his cholesterol level is

(a) under 190? under 210? under 230? under 250?
(b) over 240? over 270? over 300? over 330?
(c) over 250, given that it is over 240?
(d) between 214 and 276?
(e) between 202 and 238?
(f) What symmetric interval about the mean contains exactly half the population values? Hint: First find the approximate critical value of $z$ that satisfies $P(-z \le Z \le z) = 0.5$, then change back to $X$.

4-10. A zoologist is studying a certain species of lizard, whose sexes appear alike except for size. It is known that in the adult male population, length $M$ is normally distributed with mean $\mu_M = 10$ cm and standard deviation $\sigma_M = 2$ cm, while in the adult female population, length $F$ is normally distributed with mean $\mu_F = 16$ cm and standard deviation $\sigma_F = 4$ cm.

[Figure: the two densities $M \sim N(10, 2)$ and $F \sim N(16, 4)$ plotted against Length (cm), from 5 to 25.]

(a) Suppose that a single adult specimen of length 11 cm is captured at random.
   • Calculate the probability that a randomly selected adult male is as large as, or larger than, this specimen.
   • Calculate the probability that a randomly selected adult female is as small as, or smaller than, this specimen.
   • What can you conclude about the sex of this specimen, based on length alone?
(b) Repeat part (a) for a second captured adult specimen, of length 12 cm.
(c) Repeat part (a) for a third captured adult specimen, of length 13 cm.

4-11. Al spends the majority of a certain evening in his favorite drinking establishment. Eventually, he decides to spend the rest of the night at the house of one of his two friends, each of whom lives ten blocks away
in opposite directions. However, being a bit intoxicated, he engages in a so-called "random walk" of $n = 10$ blocks where, at the start of each block he walks, he first either turns to face due west with probability 0.4, or independently turns to face due east with probability 0.6, before continuing. Using this information, answer the following. Hint: Let the discrete random variable $X$ = number of east turns in $n = 10$ blocks $(0, 1, 2, 3, \ldots, 10)$.

[Figure: the bar on a street running West to East.]

(a) Calculate the probability that he ends up at Bob's house.
(b) Calculate the probability that he ends up at Carl's house.
(c) Calculate the probability that he ends up back where he started.
(d) How far, and in which direction, from where he started is he expected to end up, on average? (Hint: Combine the expected number of east and west turns.) With what probability does this occur?

4-12. A random variable $X$ is normally distributed in two distinct populations, with different means but the same standard deviation. That is, $X_1 \sim N(\mu_1, \sigma)$ and $X_2 \sim N(\mu_2, \sigma)$. How close to each other do the means $\mu_1$ and $\mu_2$ have to be in order for the overlap between the two distributions to be equal to 20%? 50%? 80%?

4-13. Consider the two following modified Cauchy distributions.

(a) "Truncated" Cauchy: $f(x) = \dfrac{2}{\pi}\cdot\dfrac{1}{1+x^2}$ for $-1 \le x \le 1$, and $f(x) = 0$ otherwise. Show that this is a valid density function, and sketch its graph. Find the cdf $F(x)$, and sketch its graph. Find the mean and variance.
(b) "One-sided" Cauchy: $f(x) = \dfrac{2}{\pi}\cdot\dfrac{1}{1+x^2}$ for $x \ge 0$, and $f(x) = 0$ otherwise. Show that this is a valid density function, and sketch its graph. Find the cdf $F(x)$, and sketch its graph. Find the median. Does the mean exist?

4-14. Suppose that the random variable $X$ = time-to-failure (yrs) of a standard model of a medical implant device is known to follow a uniform distribution over ten years, and therefore corresponds to the density function $f_1(x) = 0.1$ for $0 \le x \le 10$, and zero otherwise. A new model of the same implant device is tested, and determined to correspond to a time-to-failure density function $f_2(x) = .009x^2 - .08x + .2$ for $0 \le x \le 10$, and zero otherwise. See
figure.

[Figure: the densities $f_1$ and $f_2$ plotted for $0 \le x \le 10$.]

(a) Verify that $f_1(x)$ and $f_2(x)$ are indeed legitimate density functions.
(b) Determine and graph the corresponding cumulative distribution functions $F_1(x)$ and $F_2(x)$.
(c) Calculate the probability that each model fails within the first five years of operation.
(d) Calculate the median failure time of each model.
(e) How do $F_1(x)$ and $F_2(x)$ compare? In particular, is one model always superior during the entire ten years, or is there a time in $0 < x < 10$ when a switch occurs in which model outperforms the other, and if so, when (and which model) is it? Be as specific as possible.

Ismor Fischer, 8/12/2008. Stat 541 / 8-16

8.5 Problems

8-1. Displayed below are the survival times in months since diagnosis for 10 AIDS patients suffering from concomitant esophageal candidiasis, an infection due to Candida yeast, and cytomegalovirus, a herpes infection that can cause serious illness.

   Patient    $t_i$ (months)
   1          0.5
   2          1.0
   3          1.0
   4          1.0
   5          2.0
   6          5.0
   7          8.0
   8          9.0
   9          10.0
   10         12.0
   * = censored

(a) Construct the Kaplan-Meier product-limit estimator of the survival function $S(t)$, and sketch its graph.
(b) Calculate the estimated 1-month and 2-month survival probabilities, respectively.
(c) Use the formula given in the notes to calculate the estimated 1-month and 2-month hazard rates, respectively.

8-2. For any constants $a > 1$, $b > 0$, graph the hazard function $h(t) = a \ldots$ for $t \ge 0$. Find and graph the corresponding survival function $S(t)$. What happens to each function as $b \to 0$? $b \to \infty$?

8. Survival Analysis

8.1 Definition: Survival Function

Survival Analysis is also known as Time-to-Event Analysis, Time-to-Failure Analysis, or Reliability Analysis (especially in the engineering disciplines), and requires specialized techniques.

Examples:

   Starting point                                   Event
   Cancer surgery, radiotherapy, chemotherapy       Death
   Cancer remission                                 Cancer recurrence
   Coronary artery bypass surgery                   Heart attack or death, whichever comes first
   Topical application of skin rash ointment        Disappearance of symptoms

However, such longitudinal data can be censored, i.e.,
the event may not occur before the end of the study. Patients can be lost to follow-up (e.g., moved away, non-compliant, choose a different treatment, etc.), as shown in the diagram below.

[Figure: follow-up timelines for Patients 1, 2, 3 along a Time axis running from "Study begins" to "Study ends"; each timeline ends in either Death or Censored.]

POPULATION

Define a continuous random variable $T$ = time-to-event, or in this context, survival time (until death). From this, construct the survival function

$$S(t) = P(T > t).$$

The graph of the survival function is the survival curve.

[Figure: survival curve $S(t)$, decreasing from 1 toward 0.]

Properties: For all $t \ge 0$,

• $0 \le S(t) \le 1$
• $S(0) = 1$, and $S(t)$ monotonically decreases to 0 as $t$ gets larger
• $S(t)$ is continuous

Examples: $S(t) = e^{-ct}$ $(c > 0)$, $S(t) = \dfrac{1}{1+t}$

Note that the probability of death occurring in the interval $[a, b]$ is

$$P(a \le T \le b) = P(T > a) - P(T > b) = S(a) - S(b).$$

[Figure: survival curve with the values $S(a)$ and $S(b)$ marked.]

SAMPLE

How can we estimate $S(t)$, using a cohort of $n$ individuals?

• For simplicity, assume no censoring for now.

Life Table Method: Suppose that, at the end of every month (week, year, etc.), we record the current number of deaths $d_t$ so far, or equivalently, the current number of survivors $n - d_t$, over the duration of the study. At these values $t = 1, 2, 3, \ldots$, define

$$\hat{S}(t) = \frac{n - d_t}{n} = 1 - \frac{d_t}{n},$$

and linear in between.

Example: Twelve-month cohort study.

[Table: Patient number against Survival Time (months); Patient 1 died in month 4, etc.]

[Figure: life table estimate $\hat{S}(t)$, with vertical axis from 0.1 to 0.9, plotted against Time (months) from 0 to 12.]

Disadvantage: This method is based on calendar times, not cohort times of death, thereby wasting much information. A more efficient method can be developed that is based on the observed times of death of the patients.

Ismor Fischer, 8/11/2008. Stat 541 / 2-2

2.2 Graphical Displays of Sample Data

• Dotplots, Stem-and-Leaf Diagrams (Stemplots), Histograms, Boxplots, Bar Charts, Pie Charts, Pareto Diagrams, ...

Example: Random variable $X$ = Age (years) of individuals at Memorial Union. Consider the following sorted random sample of $n = 20$ ages:

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

• Dotplot
[Figure: dotplot of the 20 ages above an $X$ axis running from 18 to 59.]

Comment: Uses all of the values. Simple, but crude; does not summarize the data.

• Stemplot

   Stem   Leaves
   1      8 9 9 9
   2      0 1 1 3 4 4 6 7
   3      1 5 5 7 8
   4      2 6
   5      9

Comment: Uses all of the values more effectively. Grouping summarizes the data better.

• Histograms

   Class Interval   Frequency (# occurrences)
   [10, 20)         4
   [20, 30)         8
   [30, 40)         5
   [40, 50)         2
   [50, 60)         1
                    n = 20

[Figure: Frequency Histogram; Frequency (0 to 8) against Ages (10 to 60).]

   Class Interval   Absolute Frequency (# occurrences)   Relative Frequency (Frequency ÷ n)
   [10, 20)         4                                    0.20
   [20, 30)         8                                    0.40
   [30, 40)         5                                    0.25
   [40, 50)         2                                    0.10
   [50, 60)         1                                    0.05
                    n = 20                               1.00

[Figure: Relative Frequency Histogram; Relative Frequency (0.00 to 0.40) against Ages (10 to 60).]

   Class Interval   Absolute Frequency   Relative Frequency   Cumulative Relative Frequency
   [0, 10)          0                    0.00                 0.00
   [10, 20)         4                    0.20                 0.20 = 0.00 + 0.20
   [20, 30)         8                    0.40                 0.60 = 0.20 + 0.40
   [30, 40)         5                    0.25                 0.85 = 0.60 + 0.25
   [40, 50)         2                    0.10                 0.95 = 0.85 + 0.10
   [50, 60)         1                    0.05                 1.00 = 0.95 + 0.05
                    n = 20               1.00

Often, it is of interest to determine the total relative frequency up to a certain value. For example, we see here that 0.60 of the age data are under 30 years, 0.85 are under 40 years, etc. The resulting cumulative distribution, which always increases monotonically from 0 to 1, can be represented by the discontinuous "step function" or "staircase function" in the first graph below. By connecting the midpoints of the steps, we obtain a continuous polygonal graph called the ogive (pronounced "o-jive"), shown in the second graph.

[Figure: two graphs of Cumulative Relative Frequency against Ages (0 to 60); the first is the step function, the second the ogive.]

Problem: Suppose that all ages 30 and older are lumped into a single class interval.

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

   Class Interval   Absolute Frequency (# occurrences)   Relative Frequency (Frequency ÷ n)
   [10, 20)         4                                    0.20
   [20, 30)         8                                    0.40
   [30, 60)         8                                    0.40
                    n = 20                               1.00

Relative Frequency Histogram
[Figure: relative frequency histogram against Ages with bars of height 0.20, 0.40, 0.40 over the classes [10, 20), [20, 30), [30, 60); the last bar is labeled "overreported values".]

If this outlier (59) were larger, the histogram would be even more distorted.

Remedy: Let

   Area of each class rectangle = Relative Frequency,

i.e., Height of rectangle × Class Width = Relative Frequency. Therefore

$$\text{Density} = \frac{\text{Relative Frequency}}{\text{Class Width}}.$$

   Class Interval          Absolute Frequency   Relative Frequency   Density
   [10, 20), width = 10    4                    0.20                 0.02
   [20, 30), width = 10    8                    0.40                 0.04
   [30, 60), width = 30    8                    0.40                 0.01333...
                           n = 20               1.00

[Figure: Density Histogram; Density (0.00 to 0.04) against Ages.]

Ismor Fischer, 4/14/2006. Appendix / A1. Basic Reviews / Logarithms-1

Logarithms

What are they?

In a word, exponents. The "logarithm (base 10)" of a specified positive number is the exponent to which the base 10 needs to be raised, in order to obtain that specified positive number. In effect, it is the reverse (or more correctly, inverse) process of raising 10 to an exponent.

Example: The logarithm (base 10) of 100 is equal to 2, because $10^2 = 100$, or, in shorthand notation, $\log_{10} 100 = 2$.

Likewise,

   $\log_{10} 10000 = 4$, because $10^4 = 10000$
   $\log_{10} 1000 = 3$, because $10^3 = 1000$
   $\log_{10} 100 = 2$, because $10^2 = 100$
   $\log_{10} 10 = 1$, because $10^1 = 10$
   $\log_{10} 1 = 0$, because $10^0 = 1$
   $\log_{10} 0.1 = -1$, because $10^{-1} = 1/10^1 = 0.1$
   $\log_{10} 0.01 = -2$, because $10^{-2} = 1/10^2 = 0.01$
   $\log_{10} 0.001 = -3$, because $10^{-3} = 1/10^3 = 0.001$
   etc.

How do you take the logarithm of a specified number that is between powers of 10?

In the old days, this would be done with the aid of a lookup table or slide rule (for those of us who are old enough to remember slide rules). Today, scientific calculators are equipped with a button labeled "log", "$\log_{10}$", or "INV $10^x$".

Examples: To five decimal places,

   $\log_{10} 3 = 0.47712$, because (check this) $10^{0.47712} = 3$
   $\log_{10} 5 = 0.69897$, because (check this) $10^{0.69897} = 5$
   $\log_{10} 9 = 0.95424$, because (check this) $10^{0.95424} = 9$
   $\log_{10} 15 = 1.17609$, because (check this) $10^{1.17609} = 15$

There are several relations we can observe here that extend to general properties of logarithms.

First, notice that the values for $\log_{10} 3$ and $\log_{10} 5$ add up to the value for $\log_{10} 15$. This is not an
accident; it is a direct consequence of $3 \times 5 = 15$, together with the algebraic law of exponents $10^s \times 10^t = 10^{s+t}$, and the fact that logarithms are exponents by definition. (Exercise: Fill in the details.) In general, we have

Property 1: $\log_{10}(AB) = \log_{10} A + \log_{10} B$

that is, the sum of the logarithms of two numbers is equal to the logarithm of their product. For example, taking $A = 3$ and $B = 5$ yields $\log_{10}(15) = \log_{10} 3 + \log_{10} 5$.

Another relation to notice from these examples is that the value for $\log_{10} 9$ is exactly double the value for $\log_{10} 3$. Again, not a coincidence, but a direct consequence of $3^2 = 9$, together with the algebraic law of exponents $(10^s)^t = 10^{st}$, and the fact that logarithms are exponents by definition. (Exercise: Fill in the details.) In general, we have

Property 2: $\log_{10}(A^B) = B \log_{10} A$

that is, the logarithm of a number raised to a power is equal to the power times the logarithm of the original number. For example, taking $A = 3$ and $B = 2$ yields $\log_{10}(3^2) = 2 \log_{10} 3$.

There are other properties of logarithms, but these are the most important for our purposes. In particular, we can combine these properties in the following way. Suppose that two variables $X$ and $Y$ are related by the general form

$$Y = \alpha X^{\beta}$$

for some constants $\alpha$ and $\beta$. Then, taking $\log_{10}$ of both sides,

$$\log_{10} Y = \log_{10}(\alpha X^{\beta}),$$

or, by Property 1,

$$\log_{10} Y = \log_{10} \alpha + \log_{10} X^{\beta},$$

and by Property 2,

$$\log_{10} Y = \log_{10} \alpha + \beta \log_{10} X.$$

Relabeling,

$$V = \beta_0 + \beta_1 U.$$

In other words, if there exists a power law relation between two variables $X$ and $Y$, then there exists a simple linear relation between their logarithms. For this reason, scatterplots of two such related variables $X$ and $Y$ are often replotted on a log-log scale. More on this later.

Additional comments:

• "$\log_{10}$" is an operation; it must be applied to a number in order to have a value. The disembodied symbol "$\sqrt{\ }$" is meaningless without a number inside; similarly with "$\log_{10}$".

• There is nothing special about using "base 10." In principle, we could use any positive base $b$, provided $b \ne 1$ (which causes a problem). Popular choices are $b = 10$ (resulting in the so-called "common logarithms" above), $b = 2$ (sometimes denoted by "lg"), and $b = e = 2.71828\ldots$ (resulting in "natural
logarithms," denoted by "ln"). This last peculiar choice is sometimes referred to as $e$, and is known as Euler's constant. (Leonhard Euler, pronounced "oiler," was a Swiss mathematician.) This constant $e$ arises in a variety of applications, including the formula for the density function of a normal distribution, described in a previous lecture.

[Figure: portrait of Leonhard Euler (1707-1783).]

• There is a special formula for converting logarithms using any base $b$ back to common logarithms (i.e., base 10), for calculator use. For any positive number $a$, and base $b$ as described above,

$$\log_b a = \frac{\log_{10} a}{\log_{10} b}.$$

• Logarithms are particularly useful in calculating physical processes that grow or decay exponentially. For example, suppose that at time $t = 0$, we have $N = 1$ cell in a culture, and that it continually divides in two in such a way that the entire population doubles its size every hour. At the end of $t = 1$ hour, there are $N = 2$ cells; at time $t = 2$ hours, there are $N = 2^2 = 4$ cells; at time $t = 3$ hours, there are $N = 2^3 = 8$ cells, etc. Clearly, at time $t$, there will be $N = 2^t$ cells in culture (exponential growth).

Question: At what time $t$ will there be 500,000 (half a million) cells in the culture?

Solution: $t = \log_2 500000$, which can be rewritten via the calculator formula as $t = \log_{10} 500000 / \log_{10} 2 \approx 18.93$ hours, or about 18 hours, 56 minutes. (Check: $2^{18.93} = 499456.67$, which represents an error of only about 0.1% from 500000; the discrepancy is due to roundoff error.)

• Other applications where logarithms are used include the radioactive isotope dating of fossils, the acidity or alkalinity of chemical solutions, and the magnitude of earthquakes (a quake of magnitude 6 is ten times more powerful than one of magnitude 5, which in turn is ten times more powerful than one of magnitude 4, etc.).

Supplement: What Is This Number Called $e$, Anyway?

The symbol $e$ stands for "Euler's constant," and is a fundamental mathematical constant (like $\pi$), extremely important for various calculus applications. It is usually defined as

$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n.$$

Exercise: Evaluate this expression for $n = 1, 10, 100, 1000, \ldots, 10^5$.

It can be shown, via rigorous mathematical proof, that the limiting value formally exists, and converges to the value 2.718281828459... Another common expression for $e$ is the infinite series

$$1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots + \frac{1}{n!} + \cdots$$

Exercise:
Add a few terms of this series. How do the convergence rates of the two expressions compare?

The reason for its importance: Of all possible bases $b$, it is this constant $e = 2.71828\ldots$ that has the most natural calculus properties. Specifically, if $f(x) = b^x$, then it can be mathematically proved that its derivative is $f'(x) = b^x (\ln b)$. (Remember that $\ln b = \log_e b$.) For example, the function $f(x) = 10^x$ has as its derivative $f'(x) = 10^x (\ln 10) = 10^x (2.3026)$; see Figure 1. The constant multiple of 2.3026, though necessary, is something of a nuisance. On the other hand, if $b = e$, that is, if $f(x) = e^x$, then $f'(x) = e^x (\ln e) = e^x (1) = e^x$, i.e., itself! See Figure 2. This property makes calculations involving base $e$ much easier.

[Figure 1: graph of $y = 10^x$, with tangent slopes $m_{\tan} = 10^x \ln 10$ marked at several points.]
[Figure 2: graph of $y = e^x$.]

BASIC CALCULUS REFRESHER

Introduction

This is a very condensed and simplified version of basic calculus, which, although not a prerequisite for Stat 541, illustrates applications to probability in the last section. It is absolutely not intended to be a substitute for a one-year freshman course in differential and integral calculus. You are strongly encouraged to do the included Exercises. Key words are in boldface; key formulas and concepts are in boxes.

Exponents: Basic Definitions and Properties

For any real number base $x$, we define powers of $x$: $x^0 = 1$, $x^1 = x$, $x^2 = x \cdot x$, $x^3 = x \cdot x \cdot x$, etc. (The exception is $0^0$, which is considered indeterminate; we will not see this here.)

Examples: $5^0 = 1$, $11^1 = 11$, $8.6^2 = 8.6 \times 8.6 = 73.96$, $10^3 = 10 \times 10 \times 10 = 1000$, $3^4 = 3 \times 3 \times 3 \times 3 = 81$.

Powers are also called exponents.

Also, we can define fractional exponents in terms of roots, such as $x^{1/2} = \sqrt{x}$, the square root of $x$. Similarly, $x^{1/3} = \sqrt[3]{x}$, the cube root of $x$, $x^{2/3} = (\sqrt[3]{x})^2$, etc. In general, we have

$$x^{m/n} = \left(\sqrt[n]{x}\right)^m,$$

i.e., the $n$th root of $x$, raised to the $m$th power.

Examples: $64^{1/2} = \sqrt{64} = 8$, $64^{3/2} = (\sqrt{64})^3 = 8^3 = 512$, $64^{1/3} = \sqrt[3]{64} = 4$, $64^{2/3} = (\sqrt[3]{64})^2 = 4^2 = 16$.

Finally, we can define negative exponents: $x^{-r} = \dfrac{1}{x^r}$. Thus $x^{-1} = \dfrac{1}{x}$, $x^{-2} = \dfrac{1}{x^2}$, $x^{-1/2} = \dfrac{1}{\sqrt{x}}$, etc.

Examples: $10^{-1} = \dfrac{1}{10^1} = 0.1$, $7^{-2} = \dfrac{1}{7^2} = \dfrac{1}{49}$, $36^{-1/2} = \dfrac{1}{\sqrt{36}} = \dfrac{1}{6}$, $9^{-5/2} = \dfrac{1}{(\sqrt{9})^5} = \dfrac{1}{3^5} = \dfrac{1}{243}$.

Properties of Exponents

1. $x^a \cdot x^b = x^{a+b}$
   Examples: $x^3 \cdot x^2 = x^5$, $x^{1/2} \cdot x^{1/3} = x^{5/6}$, $x^3 \cdot x^{-1/2} = x^{5/2}$, etc.

2. $\dfrac{x^a}{x^b} = x^{a-b}$
   Examples: $\dfrac{x^5}{x^3} = x^2$, $\dfrac{x^3}{x^5} = x^{-2}$, $\dfrac{x^3}{x^{1/2}} = x^{5/2}$, etc.

3. $(x^a)^b = x^{ab}$
   Examples:
$(x^3)^2 = x^6$, $(x^{-1/2})^7 = x^{-7/2}$, $(x^{2/3})^{5/7} = x^{10/21}$, etc.

Functions and Their Graphs

If a quantity $y$ depends on some other quantity $x$ in such a way that every value of $x$ corresponds to one and only one value of $y$, then we say that "$y$ is a function of $x$," written $y = f(x)$; $x$ is said to be the independent variable, $y$ is the dependent variable. (Example: "Distance traveled per hour ($y$) is a function of velocity ($x$).") For a given function $y = f(x)$, the set of all ordered pairs of $(x, y)$-values that algebraically satisfy its equation is called the graph of the function, and can be represented geometrically by a collection of points in the XY-plane.

Examples: $y = f(x) = 7$; $y = f(x) = 2x + 3$; $y = f(x) = x^2$; $y = f(x) = x^{-1}$; $y = f(x) = e^x$.

The first three are examples of polynomial functions. (In particular, the first is constant, the second is linear, the third is quadratic.) The last is an exponential function. Let's consider these functions one at a time.

• $y = f(x) = 7$: If $x$ = any value, then $y = 7$. That is, no matter what value of $x$ is chosen, the value of $y$ remains at a constant level of 7. Therefore, all points that satisfy this equation must have the form $(x, 7)$, and thus determine a horizontal line graph, 7 units up.

[Figure: horizontal line $y = 7$ in the XY-plane.]

Exercise: What would the graph of the equation $x = 4$ look like? $y = 0$? $x = 0$?

• $y = f(x) = 2x + 3$: If $x = 0$, then $y = f(0) = 2(0) + 3 = 3$, so the point $(0, 3)$ is on the graph of this function. Likewise, if $x = 1$, then $y = f(1) = 2(1) + 3 = 5$, so the point $(1, 5)$ is also on the graph of this function. However, the point $(1, 6)$ does not satisfy the equation, and so does not lie on the graph. The set of all points $(x, y)$ that do satisfy this linear equation forms the graph of a line in the XY-plane, hence the name.

[Figure: line $y = 2x + 3$ in the XY-plane.]

Notice that the line has the generic equation $y = f(x) = mx + b$, where $b$ is the Y-intercept (in this case, $b = 3$), and $m$ is the slope of the line (in this case, $m = 2$). In general, the slope of any line is defined as the ratio of "height change" $\Delta y$ to "length change" $\Delta x$, that is,

$$m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$

for any two points $(x_1, y_1)$ and $(x_2, y_2)$ that lie on the line. For example, for the two points $(0, 3)$ and $(1, 5)$ on our line, the slope is $m = \dfrac{\Delta y}{\Delta x} = \dfrac{5 - 3}{1 - 0} = 2$, which confirms our earlier observation.

• $y = f(x) = x^2$: This is not the equation of a straight
line (because of the "square"). The set of all points that satisfy this equation forms a curved parabola in the XY-plane.

[Figure: parabola $y = x^2$ in the XY-plane.]

Exercise: How does this graph differ from that of $y = f(x) = x^3$? $x^4$? $x^{1/2}$? $x^{1/3}$?

• $y = f(x) = x^{-1} = \dfrac{1}{x}$: This is a bit more delicate. Let's first restrict our attention to positive $x$-values, i.e., $x > 0$. If $x = 1$, then $y = f(1) = \dfrac{1}{1} = 1$, so the point $(1, 1)$ lies on the graph of this function. Now from here, as $x$ grows larger (e.g., 10, 100, 1000, ...), the values of $y = \dfrac{1}{x}$ (e.g., $\dfrac{1}{10} = 0.1$, $\dfrac{1}{100} = 0.01$, $\dfrac{1}{1000} = 0.001$, ...) become smaller, although they never actually reach 0. Therefore, as we continue to move to the right, the graph approaches the X-axis as a horizontal asymptote, without ever actually touching it. Moreover, from $(1, 1)$, as $x$ gets smaller (e.g., 0.1, 0.01, 0.001, ...), the values of $y = \dfrac{1}{x}$ (e.g., $\dfrac{1}{0.1} = 10$, $\dfrac{1}{0.01} = 100$, $\dfrac{1}{0.001} = 1000$, ...) become larger. Therefore, as we continue to move to the left, the graph shoots upwards, approaching the Y-axis as a vertical asymptote, without ever actually touching it. (If $x = 0$, then $y = \dfrac{1}{0}$ becomes infinite ($\infty$), which is undefined as a real number, so $x = 0$ is not in the domain of this function.) A similar situation exists for negative $x$-values, i.e., $x < 0$. This is the graph of a hyperbola, which has two diagonally symmetric branches. We're generally only concerned with the first-quadrant branch.

[Figure: hyperbola $y = 1/x$ in the XY-plane.]

Exercise: How does this graph differ from that of $y = f(x) = x^{-2} = \dfrac{1}{x^2}$? Why?

NOTE: All the examples on the last page are special cases of power functions, which have the general form $y = x^p$, for any real value of $p$, for $x > 0$. If $p > 0$, then the graph starts at the origin and continues to rise to infinity. (In particular, if $p > 1$, then the graph is concave up, such as the parabola $y = x^2$. If $p = 1$, the graph is the straight line $y = x$. And if $0 < p < 1$, then the graph is concave down, such as the parabola $y = x^{1/2} = \sqrt{x}$.) If $p < 0$, such as $y = x^{-1} = \dfrac{1}{x}$ or $y = x^{-2} = \dfrac{1}{x^2}$, then the Y-axis acts as a vertical asymptote for the graph, and the X-axis is a horizontal asymptote. See figure below.

[Figure: power functions $y = x^p$ for $p > 1$, $p = 1$, $0 < p < 1$, and $p < 0$.]

Exercise: Sketch the graph of the piecewise-defined functions

$$f(x) = \begin{cases} x^2, & \text{if } x \le 1 \\ x^3, & \text{if } x > 1 \end{cases} \qquad \text{and} \qquad g(x) = \begin{cases} x^2, & \text{if } x \le 1 \\ x^3 + 5, & \text{if } x > 1. \end{cases}$$

For $f$: This graph is the parabola $y$
$= x^2$ up to and including the point $(1, 1)$, then picks up with the curve $y = x^3$ after that. Note that this function is therefore continuous at $x = 1$, and hence for all real values of $x$.

For $g$: This graph is the parabola $y = x^2$ up to and including the point $(1, 1)$, but then abruptly changes over to the curve $y = x^3 + 5$ after that, starting at $(1, 6)$. Therefore, this graph has a break, or "jump discontinuity," at $x = 1$. (Think of switching a light from off = 0 to on = 1.) However, since it is continuous before and after that value, $g$ is described as being piecewise continuous.

• $y = f(x) = e^x$, where $e = 2.71828\ldots$ is "Euler's constant": The graph of this function is an exponential growth curve. (If the exponent were $-x$ instead of $x$, the graph would represent an exponential decay curve, and would decrease from left to right.)

[Figure: exponential curve $y = e^x$ in the XY-plane.]

NOTE: The exponential functions have the form $y = e^{ax}$, for any real value of $a$. For $x \ge 0$, the graphs of these functions start at the Y-intercept $(0, 1)$, and either continue to increase to infinity (exponential growth) or decrease to the X-axis horizontal asymptote (exponential decay), depending on whether $a > 0$ (e.g., $a = 1$, $y = e^x$) or $a < 0$ (e.g., $a = -1$, $y = e^{-x}$), respectively. See figure below.

Note: If $y = e^x$, then $x = \log_e y$, or the natural logarithm $\ln y$, by definition. For example, $e^{0.47} = 1.6$ implies that $\ln 1.6 = 0.47$.

[Figure: exponential functions $y = e^{ax}$ for $a > 0$ and $a < 0$.]

Limits and Derivatives

We saw above that as the values of $x$ grow ever larger, the values of $\dfrac{1}{x}$ become ever smaller. We can't actually reach 0 exactly, but we can "sneak up" on it, forcing $\dfrac{1}{x}$ to become as close to 0 as we like, simply by making $x$ large enough. (For instance, we can force $\dfrac{1}{x} < 10^{-500}$ by making $x > 10^{500}$.) In this context, we say that 0 is a limiting value of the $\dfrac{1}{x}$ values, as $x$ gets arbitrarily large. A mathematically concise way to express this is a limit statement:

$$\lim_{x \to \infty} \frac{1}{x} = 0.$$

Many other limits are possible, but we now wish to consider a special kind. To motivate this, consider again the parabola example $y = f(x) = x^2$. The average rate of change between the two points $P(3, 9)$ and $Q(4, 16)$ on the graph can be calculated as the slope of the secant line connecting them, via
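the previous formula: $m_{\sec} = \dfrac{\Delta y}{\Delta x} = \dfrac{16 - 9}{4 - 3} = 7$. Now suppose that we slide to a new point $Q(3.5, 12.25)$ on the graph, closer to $P(3, 9)$. The average rate of change is now $m_{\sec} = \dfrac{\Delta y}{\Delta x} = \dfrac{12.25 - 9}{3.5 - 3} = 6.5$, the slope of the new secant line between $P$ and $Q$. If we now slide to a new point $Q(3.1, 9.61)$ still closer to $P(3, 9)$, then the new slope is $m_{\sec} = \dfrac{\Delta y}{\Delta x} = \dfrac{9.61 - 9}{3.1 - 3} = 6.1$, and so on. As $Q$ approaches $P$, the slopes $m_{\sec}$ of the secant lines appear to get ever closer to 6, the slope $m_{\tan}$ of the tangent line to the curve $y = x^2$ at the point $P(3, 9)$, thus measuring the instantaneous rate of change of this function at this point $P(3, 9)$.

[Figure: parabola $y = x^2$ with secant lines through $P(3, 9)$ approaching the tangent line at $P$.]

We can actually verify this by an explicit computation. From fixed point $P(3, 9)$ to any nearby point $Q(3 + \Delta x, (3 + \Delta x)^2)$ on the graph of $y = x^2$, we have

$$m_{\sec} = \frac{\Delta y}{\Delta x} = \frac{(3 + \Delta x)^2 - 9}{\Delta x} = \frac{9 + 6\,\Delta x + (\Delta x)^2 - 9}{\Delta x} = \frac{6\,\Delta x + (\Delta x)^2}{\Delta x} = \frac{\Delta x\,(6 + \Delta x)}{\Delta x} = 6 + \Delta x.$$

(We can check this formula against the $m_{\sec}$ values that we already computed: if $\Delta x = 1$, then $m_{\sec} = 7$; if $\Delta x = 0.5$, then $m_{\sec} = 6.5$; if $\Delta x = 0.1$, then $m_{\sec} = 6.1$.) As $Q$ approaches $P$, i.e., as $\Delta x$ approaches 0, this quantity $m_{\sec} = 6 + \Delta x$ approaches the quantity $m_{\tan} = 6$ as its limiting value, confirming what we initially suspected.

Suppose now we wish to find the instantaneous rate of change of $y = f(x) = x^2$ at some other point $P$ on the graph, say at $P(4, 16)$ or $P(-5, 25)$ or even $P(0, 0)$. We can use the same calculation as we did above: the average rate of change of $y = x^2$ between any two generic points $P(x, x^2)$ and $Q(x + \Delta x, (x + \Delta x)^2)$ on its graph is given by

$$m_{\sec} = \frac{\Delta y}{\Delta x} = \frac{(x + \Delta x)^2 - x^2}{\Delta x} = \frac{x^2 + 2x\,\Delta x + (\Delta x)^2 - x^2}{\Delta x} = \frac{2x\,\Delta x + (\Delta x)^2}{\Delta x} = \frac{\Delta x\,(2x + \Delta x)}{\Delta x} = 2x + \Delta x.$$

As $Q$ approaches $P$, i.e., as $\Delta x$ approaches 0, this quantity approaches $m_{\tan} = 2x$ in the limit, thereby defining the instantaneous rate of change of the function at the point $P$. (Note that if $x = 3$, these calculations agree with those previously done for $m_{\sec}$ and $m_{\tan}$.) Thus, for example, the instantaneous rate of change of the function $y = f(x) = x^2$ at the point $P(4, 16)$ is $m_{\tan} = 8$, at $P(-5, 25)$ is $m_{\tan} = -10$, and at the origin $P(0, 0)$ is $m_{\tan} = 0$.

[Figure: parabola $y = x^2$ with tangent lines at $P(-5, 25)$, $m_{\tan} = -10$; $P(4, 16)$, $m_{\tan} = 8$; $P(3, 9)$, $m_{\tan} = 6$; $P(-2, 4)$, $m_{\tan} = -4$; $P(0, 0)$, $m_{\tan} = 0$.]

In principle, there is nothing that prevents us from applying these same ideas to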
other functions $y = f(x)$. To find the instantaneous rate of change at an arbitrary point $P$ on its graph, we first calculate the average rate of change between $P(x, f(x))$ and a nearby point $Q(x + \Delta x, f(x + \Delta x))$ on its graph, as measured by

$$m_{\sec} = \frac{\Delta y}{\Delta x} = \frac{f(x + \Delta x) - f(x)}{\Delta x}.$$

As $Q$ approaches $P$, i.e., as $\Delta x$ approaches 0, this quantity approaches the instantaneous rate of change at $P$, defined by

$$m_{\tan} = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}.$$

This object, denoted compactly by $\dfrac{dy}{dx}$, is called the derivative of the function $y = f(x)$, and is also symbolized by several other interchangeable notations: $\dfrac{df(x)}{dx}$, $\dfrac{d}{dx}[f(x)]$, $f'(x)$, etc. The process of calculating the derivative of a function is called differentiation. Thus, the derivative of the function $y = f(x) = x^2$ is the function $\dfrac{dy}{dx} = f'(x) = 2x$. This can also be written more succinctly as $\dfrac{d(x^2)}{dx} = 2x$, $\dfrac{d}{dx}(x^2) = 2x$, or $dy = 2x\,dx$.

Using methods very similar to those above, it is possible to show that, more generally,

Power Rule: if $y = x^p$, then $\dfrac{dy}{dx} = p\,x^{p-1}$.

Examples: If $y = x^3$, then $\dfrac{dy}{dx} = 3x^2$; if $y = x^{1/2}$, then $\dfrac{dy}{dx} = \dfrac{1}{2}x^{-1/2}$; if $y = x^{-1}$, then $\dfrac{dy}{dx} = -x^{-2}$.

Also note that if $y = x = x^1$, then $\dfrac{dy}{dx} = 1x^0 = 1$, as it should. (The line $y = x$ has slope $m = 1$ everywhere.)

It can also be proved that if $y = f(x) = e^x$, then $\dfrac{dy}{dx} = f'(x) = e^x$, i.e., $\dfrac{d}{dx}(e^x) = e^x$, just itself. More generally,

if $y = e^{ax}$, then $\dfrac{dy}{dx} = a\,e^{ax}$.

This latter result is actually a consequence of the Chain Rule, discussed below.

Examples: If $y = e^{3x}$, then $\dfrac{dy}{dx} = 3e^{3x}$; if $y = e^{x/2}$, then $\dfrac{dy}{dx} = \dfrac{1}{2}e^{x/2}$; if $y = e^{-x}$, then $\dfrac{dy}{dx} = -e^{-x}$.

NOTE: In the preceding discussion, $a$ and $p$ are constants; otherwise the rules don't apply.

One more important case is worth noting: suppose $y = f(x) = 7$, whose graph is a horizontal line. The average rate of change between any two points on this graph would be $\dfrac{\Delta y}{\Delta x} = \dfrac{7 - 7}{\Delta x} = 0$, and hence the instantaneous rate of change = 0 as well. In the same way, the derivative of any constant function is zero: if $y = C$, then $\dfrac{dy}{dx} = 0$. Note, however, that a vertical line (having equation $x = C$) has an infinite, or undefined, slope.

Note also that not every function has a derivative everywhere. For example, the functions $y = f(x) = |x|$, $x^{1/3}$, and $x^{-1}$ are not differentiable at $x = 0$, all for different reasons. Although the first two are continuous through the origin $(0, 0)$, the first has a V-shaped graph; a uniquely defined tangent line does not
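exist at the corner. The second graph has a vertical tangent line there, hence the slope is infinite. And as we've seen, the last function is undefined at the origin; $x = 0$ is not even in its domain, so talk of a tangent line there is completely meaningless.

Properties of Derivatives

1. For any constant $c$, and any differentiable function $f(x)$,
   $\frac{d}{dx}[c\,f(x)] = c\,\frac{df}{dx}$, i.e., $[c\,f(x)]' = c\,f'(x)$.
   Example: If $y = 5x^3$, then $\frac{dy}{dx} = 5(3x^2) = 15x^2$.
   Example: If $y = \frac{1}{6}x$, then $\frac{dy}{dx} = \frac{1}{6}(1) = \frac{1}{6}$.
   Example: If $y = 3e^{2x}$, then $\frac{dy}{dx} = 3(2e^{2x}) = 6e^{2x}$.

For any two differentiable functions $f(x)$ and $g(x)$:

2. Sum and Difference Rules:
   $\frac{d}{dx}[f(x) \pm g(x)] = \frac{df}{dx} \pm \frac{dg}{dx}$, i.e., $[f(x) \pm g(x)]' = f'(x) \pm g'(x)$.
   Example: If $y = x^{3/2} + 7x^4 + 10e^{-3x} + 5$, then $\frac{dy}{dx} = \frac{3}{2}x^{1/2} + 28x^3 - 30e^{-3x}$.

3. Product Rule:
   $[f(x)\,g(x)]' = f'(x)\,g(x) + f(x)\,g'(x)$.
   Example: If $y = x^{11}e^{6x}$, then $\frac{dy}{dx} = 11x^{10}e^{6x} + x^{11}(6e^{6x}) = (11 + 6x)\,x^{10}e^{6x}$.

4. Quotient Rule:
   $\left[\frac{f(x)}{g(x)}\right]' = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{[g(x)]^2}$, provided $g(x) \ne 0$.
   Example: If $y = \frac{e^x}{x^4 + 8}$, then $\frac{dy}{dx} = \frac{(x^4 + 8)e^x - e^x(4x^3)}{(x^4 + 8)^2} = \frac{(x^4 - 4x^3 + 8)\,e^x}{(x^4 + 8)^2}$.

5. Chain Rule (NOTE: See below for a more detailed explanation):
   $[f(g(x))]' = f'(g(x)) \times g'(x)$.
   Example: If $y = (x^{2/3} + 2e^{-9x})^6$, then $\frac{dy}{dx} = 6(x^{2/3} + 2e^{-9x})^5\left(\frac{2}{3}x^{-1/3} - 18e^{-9x}\right)$.
   Example: If $y = e^{-x^2/2}$, then $\frac{dy}{dx} = e^{-x^2/2}(-x) = -x\,e^{-x^2/2}$.
   Example: If $y = 2^x = e^{(\ln 2)x}$, then $\frac{dy}{dx} = e^{(\ln 2)x}(\ln 2) = 2^x \ln 2$.

The Chain Rule applies to a "function of a function" $f(g(x))$. The function in the first example can be viewed as composing the "outer" function $f(u) = u^6$ with the "inner" function $u = g(x) = x^{2/3} + 2e^{-9x}$. To find its derivative, take the derivative of the outer function ($6u^5$, by the Power Rule) and multiply by the derivative of the inner function; this gives our answer. The other examples work similarly. An alternate, more suggestive way to write the Chain Rule is

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}.$$

The last two examples illustrate a General Power Rule and General Exponential Rule for differentiation:

If $y = u^p$, then $\frac{dy}{dx} = p\,u^{p-1}\,\frac{du}{dx}$. If $y = e^u$, then $\frac{dy}{dx} = e^u\,\frac{du}{dx}$.

Integrals and Antiderivatives

Suppose we have a general function $y = f(x)$. If there are no breaks or jumps in its graph, then the function is continuous, as shown below. We wish to find the area under the graph of $f$ in an interval $[a, x]$, from some fixed constant lower value $a$ to any variable upper value $x$.

[Figure: graph of $y = f(x)$ with the area over the interval $[a, x]$ shaded.]

We formally define a new function

$F(x)$ = Area under the graph of $f$ in the interval $[a, x]$.

This function $F$ must have a strong connection with $f$ itself. To see what that connection must be, consider a nearby value $x + \Delta x$,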
which then corresponds to $F(x + \Delta x)$ = Area under the graph of $f$ in the interval $[a, x + \Delta x]$, and take the difference of these two areas (highlighted above in light blue):

$F(x + \Delta x) - F(x)$ = Area under the graph of $f$ in the interval $[x, x + \Delta x]$
= Area of the rectangle with height $f(z)$ and width $\Delta x$, where $z$ is some value in the interval $[x, x + \Delta x]$
= $f(z)\,\Delta x$.

Therefore, we have

$$\frac{F(x + \Delta x) - F(x)}{\Delta x} = f(z).$$

Now take the limit of both sides as $\Delta x \to 0$. We see that the left-hand side becomes the derivative of $F(x)$ (recall its basic definition, previously given), and, noting that $z \to x$, we see that the right-hand side becomes $f(x)$. Hence $F'(x) = f(x)$, i.e., $F$ is an antiderivative of $f$. Therefore, we express

$$F(x) = \int_a^x f(t)\,dt,$$

where the right-hand side represents the definite integral of $f$ from $a$ to $x$. In this context, $f$ is called the integrand. More generally, if $F$ is an antiderivative of $f$, then the two functions are related via the indefinite integral

$$\int f(x)\,dx = F(x) + C,$$

where $C$ is an arbitrary constant.

Example 1: $F(x) = \frac{1}{10}x^{10} + C$ (where $C$ is any constant) is the general antiderivative of $f(x) = x^9$, because $F'(x) = \frac{1}{10}(10x^9) + 0 = x^9 = f(x)$. We can write this relation succinctly as $\int x^9\,dx = \frac{1}{10}x^{10} + C$.

Example 2: $F(x) = 8e^{x/8} + C$ (where $C$ is any constant) is the general antiderivative of $f(x) = e^{x/8}$, because $F'(x) = 8\left(e^{x/8}\right)\left(\frac{1}{8}\right) + 0 = e^{x/8} = f(x)$. We can write this relation succinctly as $\int e^{x/8}\,dx = 8e^{x/8} + C$.

NOTE: Integrals possess the analogues of Properties 1 and 2 for derivatives, given earlier. In particular, $\int c\,f(x)\,dx = c \int f(x)\,dx$ for any constant $c$. Also, the integral of a sum (respectively, difference) of two functions is equal to the sum (respectively, difference) of the integrals. (The integral analogue for products corresponds to a technique known as integration by parts.)

From these examples, it is evident that the differentiation rules for power and exponential functions given earlier can be inverted (essentially by taking the integral of both sides) to the General Power Rule and General Exponential Rule for integration:

$$\int u^p\,du = \begin{cases} \dfrac{u^{p+1}}{p+1} + C, & \text{if } p \ne -1 \\[1ex] \ln|u| + C, & \text{if } p = -1 \end{cases}$$

(The proof of the case $p = -1$ is beyond the scope of this review.)

$$\int e^u\,du = e^u + C$$

NOTE: In order to use these formulas correctly, $du$ must be present in the integrand (up to a
constant multiple). To illustrate:

Example 3: $\int (x^5 + 2)^9\,5x^4\,dx = \frac{(x^5 + 2)^{10}}{10} + C$, using $\int u^9\,du = \frac{u^{10}}{10} + C$.

There are two ways to solve this problem. The first is to expand out the algebraic expression in the integrand, and integrate the resulting polynomial (of degree 49) term by term. Yuk. The second way, as illustrated, is to recognize that if we substitute $u = x^5 + 2$, then $du = 5x^4\,dx$, which is precisely the other factor in the integrand. Therefore, in terms of the variable $u$, this is essentially just a power rule integration, carried out above. (To check the answer, take the derivative of the right-hand function, and verify that the original integrand is restored. Don't forget to use the Chain Rule!)

Note that if the constant multiple 5 were absent from the original integrand, we could introduce and compensate for it, via the NOTE above. (This procedure is demonstrated in the next example.) However, if the $x^4$ were absent, or were replaced by any other function, then we would not be able to carry out the integration in the manner shown, since we can only balance constant multiples, not functions.

Example 4: $\int (1 + x^3)^{-1/2}\,x^2\,dx = \frac{1}{3}\int (1 + x^3)^{-1/2}\,3x^2\,dx = \frac{1}{3} \cdot \frac{(1 + x^3)^{1/2}}{1/2} + C = \frac{2}{3}\sqrt{1 + x^3} + C$, using $\int u^{-1/2}\,du = \frac{u^{1/2}}{1/2} + C$.

In this example, letting $u = 1 + x^3$ means $du = 3x^2\,dx$. This is present in the original integrand, except for the constant multiple of 3 (which we can introduce, provided we preserve the balance via multiplication by $\frac{1}{3}$ on the outside of the integral sign), revealing that this is again a power rule integration. And again, if the $x^2$ were missing from the integrand, or were replaced by any other function, then we would not have been able to carry out the integration exactly in the manner shown. Verify that the answer is correct via differentiation.

Example 5: $\int z\,e^{-z^2/2}\,dz = -\int e^{-z^2/2}\,(-z)\,dz = -e^{-z^2/2} + C$, using $\int e^u\,du = e^u + C$.

Likewise in this example, if $u = -z^2/2$, then $du = -z\,dz$. This is present in the integrand, except for the constant multiple $-1$, which we can easily balance, and perform the subsequent exponential integration. Again, if the $z$ were missing from the
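integrand, we would not be able to introduce and balance for it. In fact, it can be shown that without this factor of $z$, this integral is not expressible in terms of elementary functions, which is why the values of its corresponding definite integral are tabulated. See Example 8.

Finally, all these results can be summarized into one elegant statement, the Fundamental Theorem of Calculus for definite integrals:

$$\int_a^b f(x)\,dx = F(b) - F(a).$$

(Advanced techniques of integration, such as integration by parts, trigonometric substitution, etc., will not be reviewed here.)

Example 6: $\int_0^1 x^3(1 - x^4)^2\,dx$. (The area under $y = x^3(1 - x^4)^2$ in the interval $[0, 1]$.)

Method 1: Expand and integrate termwise.

$$\int_0^1 x^3(1 - 2x^4 + x^8)\,dx = \int_0^1 (x^3 - 2x^7 + x^{11})\,dx = \left[\frac{x^4}{4} - \frac{2x^8}{8} + \frac{x^{12}}{12}\right]_0^1 = \frac{1}{4} - \frac{1}{4} + \frac{1}{12} = \frac{1}{12}.$$

Method 2: Use the power function formula (if possible). If $u = 1 - x^4$, then $du = -4x^3\,dx$, and $x^3$ is indeed present in the integrand. Recall that the $x$-limits of integration should also be converted to $u$-limits: when $x = 0$, we get $u = 1 - 0^4 = 1$; when $x = 1$, we get $u = 1 - 1^4 = 0$.

$$\int_{x=0}^{x=1} x^3(1 - x^4)^2\,dx = -\frac{1}{4}\int_{u=1}^{u=0} u^2\,du = \frac{1}{4}\int_0^1 u^2\,du = \frac{1}{4}\left[\frac{u^3}{3}\right]_0^1 = \frac{1}{12}.$$

Probability Density Functions and Cumulative Distribution Functions

As we have seen, a probability can be identified with the area under a graph over an interval. A nonnegative function $f(x)$ whose total area $\int_{-\infty}^{\infty} f(x)\,dx = 1$ is a probability density function for a continuous random variable $X$. Its cumulative distribution function is $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$, whose graph has no jumps and rises from 0 to 1.

[Figure: a density function $f(x)$ and its cumulative distribution function $F(x)$.]

Example 7: Let $X$ have density $f(x) = 0.0004\,x^3$ for $0 \le x \le 10$, and $f(x) = 0$ otherwise. (Figure not drawn to scale.)

By the definition above, this is a valid density function, because it is clearly nonnegative, and, since it is zero outside the interval $[0, 10]$, we have

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^{10} f(x)\,dx = \int_0^{10} 0.0004\,x^3\,dx = 0.0004\left[\frac{x^4}{4}\right]_0^{10} = 0.0001\,(10^4) = 1.$$

Now, for any real $x$, the distribution function $P(X \le x)$ is given by $F(x) = \int_{-\infty}^{x} f(t)\,dt$, by definition. However, for $x < 0$ we have $F(x) = 0$, since there is no area contribution from $f$; likewise, for $x > 10$ we have $F(x) = 1$, since there is no further area contribution from $f$. But for $0 \le x \le 10$,

$$F(x) = \int_0^x 0.0004\,t^3\,dt = 0.0004\left[\frac{t^4}{4}\right]_0^x = 0.0001\,x^4.$$

Repeating the previous calculation, $P(X \le 10) = F(10) = 0.0001\,(10^4) = 1$. As another example, $P(X \le 6) = F(6) = 0.0001\,(6^4) = 0.1296$, so the difference would be $P(6 \le X \le 10) = 0.8704$.

Exercise: Using the information above, sketch the graph of $F(x)$ for all $x$, and verify that it monotonically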
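increases from 0 to 1, as all cumulative distribution functions must. Also calculate the median $m$ of $X$ by solving the equation $F(m) = 0.5$; do you expect its value to be less than, equal to, or greater than the mean of $X$? (Hint: Remember skew.)

Continuing this example, let's now compute the mean of $X$:

$$\mu = E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx = \int_0^{10} 0.0004\,x^4\,dx = 0.0004\left[\frac{x^5}{5}\right]_0^{10} = 8.$$

(See in the figure.) The variance can be set up in two ways:

$\sigma^2 = \int_0^{10} (x - 8)^2\,0.0004\,x^3\,dx$. Exercise: Evaluate.

$\sigma^2 = \int_0^{10} 0.0004\,x^5\,dx - 8^2$. Exercise: Evaluate.

Of course, both methods should lead to the same answer. Which is easier to carry out?

Example 8: The standard normal density function $\varphi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$ is not explicitly integrable (recall Example 5), hence the need for tabulated values of its probability distribution function $\Phi(z) = \int_{-\infty}^{z} \varphi(t)\,dt$. So, for instance, $P(Z \le 1.5) = \Phi(1.5) = \int_{-\infty}^{1.5} \varphi(t)\,dt = 0.9332$. Note that $\int_{-\infty}^{\infty} \varphi(z)\,dz = 1$.

Exercise: Sketch the graph of $\Phi(z)$ for all $z$, using tabulated values. (The inverse of this standard normal distribution function, $\Phi^{-1}$, is called the probit, pronounced "pro-bit", function.)

Example 9: Consider the following piecewise-defined function, and accompanying graph.

$$f(x) = \begin{cases} \dfrac{\sqrt{x}}{42}, & 0 \le x < 9 \\[1ex] \dfrac{x}{126}, & 9 \le x \le 15 \\[1ex] 0, & \text{otherwise} \end{cases}$$

[Figure: graph of $f(x)$ on the interval $[0, 15]$.]

Because the function is zero outside of $[0, 15]$ and piecewise-defined inside, we have

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^{15} f(x)\,dx = \int_0^9 \frac{\sqrt{x}}{42}\,dx + \int_9^{15} \frac{x}{126}\,dx = \frac{1}{42}\left[\frac{2}{3}x^{3/2}\right]_0^9 + \frac{1}{126}\left[\frac{x^2}{2}\right]_9^{15} = \frac{1}{42}(18 - 0) + \frac{1}{126}\left(\frac{225}{2} - \frac{81}{2}\right) = \frac{3}{7} + \frac{4}{7} = 1.$$

Since it is nonnegative and the total area under its graph is 1, this $f(x)$ is a valid density function. As before, we can now calculate probabilities of events associated with it, for example:

$$P(2 < X < 12) = \int_2^{12} f(x)\,dx = \int_2^9 \frac{\sqrt{x}}{42}\,dx + \int_9^{12} \frac{x}{126}\,dx = P(2 < X < 9) + P(9 < X < 12).$$

Exercise: Evaluate.

We can also calculate the mean of $X$:

$$\mu = E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx = \int_0^9 \frac{x^{3/2}}{42}\,dx + \int_9^{15} \frac{x^2}{126}\,dx = \frac{1}{42}\left[\frac{2}{5}x^{5/2}\right]_0^9 + \frac{1}{126}\left[\frac{x^3}{3}\right]_9^{15} = \frac{1}{42}\left(\frac{486}{5} - 0\right) + \frac{1}{126}\left(\frac{3375}{3} - \frac{729}{3}\right) = \frac{81}{35} + 7 \approx 9.314.$$

(See the balance point in the figure.)

Exercise: Calculate the variance of $X$.

Remark: One final word about definite integrals in general. The most common interpretation for them is in terms of areas, as we have done, identifying such areas with probabilities. However, they are routinely used to calculate much more than that: arc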
length, surface area, volume, velocity, distance, work, etc. Essentially, any quantity that can be calculated via a Riemann sum (a process that is examined in detail in a basic calculus course) will result in a definite integral. The moral: Don't lock yourself into thinking that definite integrals always mean areas. It's a good way to get confused.

Summary of Main Points

The instantaneous rate of change of a function $y = f(x)$ at a value of $x$ in its domain is given by its derivative $\frac{dy}{dx} = f'(x)$. This function is mathematically defined in terms of a particular limiting value of average rates of change over ever smaller intervals (when that limit exists), and can be interpreted as the slope of the line tangent to the graph of $y = f(x)$. In particular, if $u$ is a differentiable function of $x$, then via the Chain Rule:

The derivative of the function $y = u^p$ is $\frac{dy}{dx} = p\,u^{p-1}\,\frac{du}{dx}$. (General Power Rule)

The derivative of the function $y = e^u$ is $\frac{dy}{dx} = e^u\,\frac{du}{dx}$. (General Exponential Rule)

A function $f(x)$ has an antiderivative $F(x)$ if its derivative $\frac{dF}{dx} = f(x)$. Equivalently, this can be expressed in terms of an indefinite integral: $\int f(x)\,dx = F(x) + C$. In particular:

$\int u^p\,du = \frac{u^{p+1}}{p+1} + C$, if $p \ne -1$. (General Power Rule)

$\int e^u\,du = e^u + C$. (General Exponential Rule)

The corresponding definite integral $\int_a^b f(x)\,dx = F(b) - F(a)$ can be interpreted as the area under the graph of $y = f(x)$ in the interval $[a, b]$, though other interpretations do exist.

Any nonnegative function $f(x)$, continuous or piecewise-continuous for all real $x$, that satisfies the condition that the total area under its graph $\int_{-\infty}^{\infty} f(x)\,dx = 1$ is the probability density function for some corresponding continuous random variable $X$. Its antiderivative, the cumulative distribution function, is given by $F(x) = \int_{-\infty}^{x} f(t)\,dt$, and corresponds to the area under the graph up to $x$, that is, the probability $P(X \le x)$. The graph of $F(x)$ must therefore rise continuously and monotonically from 0 to 1 as $x$ increases from $-\infty$ to $+\infty$.

Note that this exactly parallels the situation for discrete random variables $X$. Namely, if each $f(x)$ represents the probability mass $P(X = x)$, then, as above,
the probability density histogram has total mass $\sum_x f(x) = 1$. The cumulative distribution function $F(x) = P(X \le x) = \sum_{t \le x} f(t)$ accumulates this mass up to $x$, and therefore increases monotonically from 0 to 1 as $x$ increases. Expected value formulas for $\mu$ and $\sigma^2$ are the same as before, but with summations instead of integrals.

Ismor Fischer, 8/21/2008, Stat 541

3.5 Problems

3-1. In a certain population of males, the following longevity probabilities are determined.
   P(Live to age 60) = 0.90
   P(Live to age 70, given live to age 60) = 0.80
   P(Live to age 80, given live to age 70) = 0.75
From this information, calculate the following probabilities.
   P(Live to age 70)
   P(Live to age 80)
   P(Live to age 80, given live to age 60)

3-2. Patient noncompliance is one of many potential sources of bias in medical studies. Consider a study where patients are asked to take 2 tablets of a certain medication in the morning, and 2 tablets at bedtime. Suppose, however, that patients do not always fully comply and take both tablets at both times; it can also occur that only 1 tablet, or even none, are taken at either of these times.
   (a) Explicitly construct the sample space S of all possible daily outcomes for a randomly selected patient.
   (b) Explicitly list the outcomes in the event that a patient takes at least one tablet at both times, and calculate its probability, assuming that the outcomes are equally likely.
   (c) Construct a probability table and corresponding probability histogram for the random variable X = the daily total number of tablets taken by a random patient.
   (d) Calculate the daily mean number of tablets taken.
   (e) Suppose that the outcomes are not equally likely, but vary as follows:
       [Table of AM and PM tablet probabilities; surviving entries: 0.3, 0.3.]
       Rework parts (b)-(d) using these probabilities. Assume independence between AM and PM.

3-3. A statistician's wife withdraws a certain amount of money X from an ATM every so often, using a method that is unknown to him: she randomly spins a circular wheel that is equally divided among four regions, each containing a specific dollar amount, as shown. Bank statements reveal that over the past n = 80 ATM transactions, \$10 was withdrawn
twelve times, \$20 sixteen times, \$30 twenty-eight times, and \$40 twenty-four times. For this sample, construct a relative frequency table, and calculate the average amount $\bar{x}$ withdrawn per transaction, and the variance $s^2$. Suppose this process continues indefinitely. Construct a probability table, and calculate the expected amount $\mu$ withdrawn per transaction, and the variance $\sigma^2$.

3-4. A youngster finds a broken clock, on which the hour and minute hands can be randomly spun at the same time, independently of one another. Each hand can land in any one of the twelve equal areas below, resulting in elementary outcomes in the form of ordered pairs (hour hand, minute hand), e.g., (7, 11), as shown. Let the simple events A = "hour hand lands on 7" and B = "minute hand lands on 11."
   (a) Calculate each of the following probabilities. Show all work.
       P(A and B)
       P(A or B)
   (b) Let the discrete random variable X = the product of the two numbers spun. List all the elementary outcomes that belong to the event C = "X = 36" and calculate its probability P(C).
   (c) After playing for a little while, some of the numbers fall off, creating new areas, as shown. For example, the configuration below corresponds to the ordered pair (9, 12). Now calculate P(C).

3-5. For any event A, we know that $P(A^c) = 1 - P(A)$. Now suppose B is any other event. Is it true that $P(A^c \mid B) = 1 - P(A \mid B)$? Prove in general, or find a counterexample.

3-6. Referring to the "barking dogs" problem in section 3.2, calculate each of the following.
   P(Angel barks OR Brutus barks)
   P(NEITHER Angel barks NOR Brutus barks), i.e., P(Angel does not bark AND Brutus does not bark)
   P(Angel barks AND Brutus does not bark)
   P(Angel does not bark AND Brutus barks)
   P(Exactly one dog barks)
   P(Brutus barks | Angel barks)
   P(Brutus does not bark | Angel barks)
   P(Angel barks | Brutus does not bark)
Also construct a Venn diagram, and a 2 × 2 probability table, including marginal sums.

3-7. Referring to the "urn model" in section 3.2, are the events A = "First ball is red" and B = "Second
ball is red" independent in this "sampling without replacement" scenario? Does this agree with your intuition? Rework this problem in the "sampling with replacement" scenario.

3-8. After much teaching experience, Professor F has come up with a conjecture about office hours: "There is a 75% probability that a random student arrives to a scheduled office hour within the first fifteen minutes (event A), from among those students who come at all (event B). Furthermore, there is an 80% probability that no students will come to the office hour, given that no students arrive within the first fifteen minutes." Assuming this conjecture is true, answer the following. (Hint: Some algebra may be involved.)
   (a) Calculate P(B), the probability that any students come to the office hour.
   (b) Calculate P(A), the probability that any students arrive within the first fifteen minutes of the office hour.
   (c) Sketch a Venn diagram, and label all probabilities in it.

3-9. Suppose A and B are any two events, with P(A) = a, P(B) = b, and P(A ∩ B) = c.
   (a) Sketch a Venn diagram, and label all four probabilities in it. Also construct the corresponding 2 × 2 probability table, including the row and column marginal sums.
   (b) Imagine now that A and B are independent events. Repeat part (a). In particular, what relationship exists between each value in the probability table and its corresponding row and column marginal probabilities?
   (c) Verify the conclusion in (b) for the example of independent events "Lung cancer" and "Coffee drinker" in section 3.2 of these notes.

3-10. A certain medical syndrome is usually associated with two overlapping sets of symptoms, A and B. Suppose it is known that:
   If B occurs, then A occurs with probability 0.80.
   If A occurs, then B occurs with probability 0.90.
   If A does not occur, then B does not occur with probability 0.85.
Find the probability that A does not occur if B does not occur. (Hint: Use a Venn diagram; some algebra may also be involved.)

3-11. The progression of a certain disease is typically characterized by the onset of up to three distinct symptoms, with the
following properties:
   Each symptom occurs with 60% probability, regardless of the others.
   If a single symptom occurs, there is a 45% probability that the two other symptoms will also occur.
   If any two symptoms occur, there is a 75% probability that the remaining symptom will also occur.
Answer each of the following. (Hint: Use a Venn diagram.)
   (a) What is the probability that all three symptoms will occur?
   (b) What is the probability that at least two symptoms occur?
   (c) What is the probability that exactly two symptoms occur?
   (d) What is the probability that exactly one symptom occurs?
   (e) What is the probability that none of the symptoms occurs?
   (f) Is the event that a symptom occurs statistically independent of the event that any other symptom occurs?

3-12. An amateur game player throws darts at the dartboard shown below, with each target area worth the number of points indicated. However, because of the player's inexperience, all of the darts hit random points that are uniformly distributed on the dartboard.
   (a) Let X = "points obtained per throw." What is the sample space S of this experiment?
   (b) Calculate the probability of each outcome in S. (Hint: The area of a circle is $\pi r^2$.)
   (c) What is the expected value of X, as darts are repeatedly thrown at the dartboard at random?
   (d) What is the standard deviation of X?
   (e) Suppose that, if the total number of points in three independent random throws is exactly 100, the player wins a prize. With what probability does this occur? (Hint: For the random variable T = "total points in three throws," calculate the probability of each ordered triple outcome $(X_1, X_2, X_3)$ in the event "T = 100.")

3-13. The Monty Hall Problem (simplest version). Between 1963 and 1976, a popular game show called "Let's Make A Deal" aired on network television, starring charismatic host Monty Hall, who would engage in "deals" (small games of chance) with randomly chosen studio audience members, usually dressed in outrageous costumes, for cash and prizes. One
of these games consisted of first having a contestant pick one of three closed doors, behind one of which was a big prize (such as a car), and behind the other two were "zonk" prizes (often a goat, or some other farm animal). Once a selection was made, Hall, who knew what was behind each door, would open one of the other doors that contained a zonk. At this point, Hall would then offer the contestant a chance to switch their choice to the other closed door, or stay with their original choice, before finally revealing the contestant's chosen prize.

Question: In order to avoid "getting zonked," should the optimal strategy for the contestant be to switch, stay, or does it not make a difference?

3-14. The following data are taken from a study investigating the use of a technique called radionuclide ventriculography as a diagnostic test for detecting coronary artery disease. Source: Begg, C. B., and McNeil, B. J., "Assessment of Radiologic Tests: Control of Bias and Other Design Considerations," Radiology, Volume 167, May 1988, 565-569.

                      Coronary Artery Disease
                      Present   Absent   Total
    Test  Positive      302       80      382
          Negative      179      372      551
          Total         481      452      933

   (a) Calculate the sensitivity and specificity of radionuclide ventriculography in this study.
   (b) For a population in which the prevalence of coronary artery disease is 0.10, calculate the predictive power of a positive test and the predictive power of a negative test, using radionuclide ventriculography.

3-15. Recall that in a prospective cohort study, exposure (E+ or E−) is given, so that the odds ratio is defined as

$$OR = \frac{\text{odds of disease, given exposure}}{\text{odds of disease, given no exposure}} = \frac{P(D+ \mid E+)\,/\,P(D- \mid E+)}{P(D+ \mid E-)\,/\,P(D- \mid E-)}.$$

Recall that in a retrospective case-control study, disease status (D+ or D−) is given; in this case, the corresponding odds ratio is defined as

$$OR = \frac{\text{odds of exposure, given disease}}{\text{odds of exposure, given no disease}} = \frac{P(E+ \mid D+)\,/\,P(E- \mid D+)}{P(E+ \mid D-)\,/\,P(E- \mid D-)}.$$

   (a) Show algebraically that these two definitions are mathematically equivalent, so that the same cross
product ratio calculation can be used in either a cohort or case-control study, as the following two problems demonstrate. (Recall the definition of conditional probability.)
   (b) Why is relative risk generally not appropriate to compute in a case-control study, unless the disease is rare in the population?

3-16. In a case-control study, investigators first identified women with breast cancer (cases) and those without (controls), in an effort to establish if there was any association with the previous use of oral contraceptives. Source: Hennekens, C. H., Speizer, F. E., Lipnick, R. J., Rosner, B., Bain, C., Belanger, C., Stampfer, M. J., Willett, W., and Peto, R., "A Case-Control Study of Oral Contraceptive Use and Breast Cancer," Journal of the National Cancer Institute, Volume 72, January 1984, 39-42. Calculate the odds ratio for the resulting table of data, and interpret.

                             Breast Cancer
                             Yes      No
    Oral            Yes      273     2641
    Contraceptive   No       716     7260

3-17. In a cohort study, the associations between risk factors for breast cancer were examined among women participating in the National Health and Nutrition Examination Survey. Source: Carter, C. L., Jones, D. Y., Schatzkin, A., and Brinton, L. A., "A Prospective Study of Reproductive, Familial, and Socioeconomic Risk Factors for Breast Cancer Using NHANES I Data," Public Health Reports, Volume 104, January-February 1989, 45-49. At the onset of the study, each participant in a sample of 6165 women was measured for exposure (in this case, if she first gave birth at age 25 or older). They were then followed forward in time, to look for subsequent occurrences of disease versus those who remained disease-free for the duration of the study. Calculate the odds ratio and relative risk for the resulting table of data, and interpret.

                              Breast Cancer
                              Yes      No
    Age at First  25 or over   31     1594
    Birth         under 25     65     4475

3-18. An observational study investigates the connection between aspirin use and three vascular conditions (gastrointestinal bleeding, primary stroke, and cardiovascular disease) using a group
of patients exhibiting these disjoint conditions with the following prior probabilities: P(GI bleeding) = 0.2, P(Stroke) = 0.3, and P(CVD) = 0.5, as well as with the following conditional probabilities: P(Aspirin | GI bleeding) = 0.09, P(Aspirin | Stroke) = 0.04, and P(Aspirin | CVD) = 0.02.
   (a) Calculate the following posterior probabilities: P(GI bleeding | Aspirin), P(Stroke | Aspirin), and P(CVD | Aspirin).
   (b) Interpret: Compare the prior probability of each category with its corresponding posterior probability. What conclusions can you draw? Be as specific as possible.

3-19. On the basis of a retrospective study, it is determined (from hospital records, tumor registries, and death certificates) that the overall five-year survival (event S) of a particular form of cancer in a population has a prior probability of P(S) = 0.4. Furthermore, the conditional probability of having received a certain treatment (event T) among the survivors is given by P(T | S) = 0.8, while the conditional probability of treatment among the non-survivors is only P(T | S^c) = 0.3.

[Figure: timeline from PAST to PRESENT, spanning 5 years.]

   (a) A cancer patient is uncertain about whether or not to undergo this treatment, and consults with her oncologist, who is familiar with this study. Compare the prior probability of overall survival given above with each of the following posterior probabilities, and interpret in context.
       Survival among treated individuals, P(S | T)
       Survival among untreated individuals, P(S | T^c)
   (b) Also calculate the following.
       Odds of survival, given treatment
       Odds of survival, given no treatment
       Odds ratio of survival for this disease

3-20. Consider the binary population variable Y = 1 with probability $\pi$, and Y = 0 with probability $1 - \pi$ (see figure).
   (a) Construct a probability table for this random variable.
   (b) Show that the population mean $\mu = \pi$.
   (c) Show that the population variance $\sigma^2 = \pi(1 - \pi)$.

Ismor Fischer, 8/15/2008

Appendix A2: Geometric Viewpoint / Mean and Variance

Mean and Variance

Many of the concepts we will encounter can be unified in a very elegant geometric way,
which yields additional insight and understanding. If you relate to visual ideas, then you might benefit from reading this.

First, recall some basic facts from elementary vector analysis. For any two column vectors $\mathbf{v} = (v_1, v_2, \ldots, v_n)^T$ and $\mathbf{w} = (w_1, w_2, \ldots, w_n)^T$ in $\mathbb{R}^n$, the standard Euclidean dot product $\mathbf{v} \cdot \mathbf{w}$ is defined as $\mathbf{v}^T\mathbf{w} = \sum_{i=1}^n v_i w_i$, hence is a scalar. Technically, the dot product is a special case of a more general mathematical object known as an inner product, denoted by $\langle \mathbf{v}, \mathbf{w} \rangle$, and these notations are often used interchangeably. The length, or norm, of a vector $\mathbf{v}$ can therefore be characterized as $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle} = \sqrt{\sum_{i=1}^n v_i^2}$, and the included angle $\theta$ between two vectors $\mathbf{v}$ and $\mathbf{w}$ can be calculated via the formula

$$\cos\theta = \frac{\langle \mathbf{v}, \mathbf{w} \rangle}{\|\mathbf{v}\|\,\|\mathbf{w}\|}.$$

From this relation, it is easily seen that two vectors $\mathbf{v}$ and $\mathbf{w}$ are orthogonal (i.e., $\theta = \pi/2$), written $\mathbf{v} \perp \mathbf{w}$, if and only if their dot product is equal to zero, i.e., $\langle \mathbf{v}, \mathbf{w} \rangle = 0$.

Now suppose we have $n$ random sample observations $x_1, x_2, x_3, \ldots, x_n$, with mean $\bar{x}$. As shown below, let $\mathbf{x}$ be the vector consisting of these $n$ data values, and $\bar{\mathbf{x}}$ be the vector composed solely of $\bar{x}$. Note that $\bar{\mathbf{x}}$ is simply a scalar multiple of the vector $\mathbf{1} = (1, 1, 1, \ldots, 1)^T$. Finally, let $\mathbf{x} - \bar{\mathbf{x}}$ be the vector difference; therefore its components are the individual deviations between the observations and the overall mean. (It's useful to think of $\bar{\mathbf{x}}$ as a sample taken from an ideal population that responds exactly the same way to some treatment, hence there is no variation; $\mathbf{x}$ is the sample of actual responses, and $\mathbf{x} - \bar{\mathbf{x}}$ measures the error between them.)

[Figure: the vectors $\mathbf{x} = (x_1, \ldots, x_n)^T$, $\bar{\mathbf{x}} = (\bar{x}, \ldots, \bar{x})^T$, and $\mathbf{x} - \bar{\mathbf{x}} = (x_1 - \bar{x}, \ldots, x_n - \bar{x})^T$, forming a triangle.]

Recall that the sum of the individual deviations is equal to zero, i.e., $\sum_{i=1}^n (x_i - \bar{x}) = 0$, or in vector notation, the dot product $\mathbf{1} \cdot (\mathbf{x} - \bar{\mathbf{x}}) = 0$. Therefore, $\mathbf{1} \perp (\mathbf{x} - \bar{\mathbf{x}})$, and the three vectors above form a right triangle. Let the scalars $a$, $b$, and $c$ represent the lengths of the corresponding vectors, respectively. That is,

$$a = \|\mathbf{x} - \bar{\mathbf{x}}\| = \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad b = \|\bar{\mathbf{x}}\| = \sqrt{\sum_{i=1}^n \bar{x}^2} = \sqrt{n\,\bar{x}^2}, \qquad c = \|\mathbf{x}\| = \sqrt{\sum_{i=1}^n x_i^2}.$$

Therefore, $a^2$, $b^2$, and $c^2$ are all "sums of squares," denoted by

$$SS_{\text{Error}} = a^2 = \sum_{i=1}^n (x_i - \bar{x})^2, \qquad SS_{\text{Treatment}} = b^2 = n\,\bar{x}^2 = \frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2 \ \text{(via algebra)}, \qquad SS_{\text{Total}} = c^2 = \sum_{i=1}^n x_i^2.$$

Now via the Pythagorean Theorem, we have $c^2 = b^2 + a^2$, referred to in this context as a "partitioning of sums of squares":

$$SS_{\text{Total}} = SS_{\text{Treatment}} + SS_{\text{Error}}.$$

Note also that, by definition, the sample variance is

$$s^2 = \frac{SS_{\text{Error}}}{n - 1} = \frac{1}{n - 1}\left(SS_{\text{Total}} - SS_{\text{Treatment}}\right).$$

This formula, because it only requires one subtraction rather than $n$, is computationally more convenient than the original; however, it is less enlightening.

Exercise: Verify that $SS_{\text{Total}} = SS_{\text{Treatment}} + SS_{\text{Error}}$ for the sample data values {3, 8, 17, 20, 32}, and calculate $s^2$ both ways, showing equality. Be especially careful about roundoff error!

STAT 541 DISCUSSION 1
September 12, 2005
TA: Lane Burgette
Office: 1245F MSC, 1300 University Avenue
Email: burgette@stat.wisc.edu
URL: www.stat.wisc.edu/~burgette/541.html (or navigate from stat.wisc.edu)
Office Hours: 9:30-10:30 T, R

1 Summaries of Data

o Quantiles, Quartiles and Median: To find the (p · 100)th percentile, use the following steps:
   - Sort the data into ascending order.
   - Find np, where n is the number of data points and p is a number between 0 and 1 that corresponds to the percentile you wish to find.
   - If np is an integer, take the average of the (np)th and (np + 1)th ordered data points.
   - If np is not an integer, always round up, and choose that ordered observation as the desired quantile. Note that this is stated in a slightly different manner from the lecture notes, but they are equivalent definitions.
   - The first quartile, median and third quartile correspond to p = .25, p = .5 and p = .75, respectively.
o Exercise 1: Find the median and the 10th percentile of the following data set: 4, 2, 6, 8, 5, 5
o Mode: Value (or set of values) that occurs most frequently.
o Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. How do the mean and median differ as measurements of center, particularly concerning outliers?
o Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$
o Range: Difference between the largest and the smallest data values.
o Interquartile range: Difference between the 25th and 75th quantiles.

2 Box Plot

In the box plot, the lower bound of the box is the 25th percentile of the data; the upper end is the
75th percentile. The middle line is the median. The whiskers are two horizontal lines at the most extreme values outside the box that are not more than 1.5 IQR beyond the bounding quartiles.

Exercise 2: Construct a box plot for the following data: 15, 26, 28, 28, 29, 31, 34, 39, 44, 46

3 Probability

o A random experiment is an experiment for which the outcome cannot be predicted with certainty, but all possible outcomes can be identified prior to its performance, and it may be repeated under roughly the same conditions.
o The sample space Ω is the set of all possible outcomes of a random experiment.
o An event is any subset of the sample space. It may be a single outcome or a set of outcomes.
o Laws of Probability:
   1. $0 \le P(A) \le 1$, $P(\Omega) = 1$, $P(\emptyset) = 0$.
   2. Let $A_1, A_2, \ldots, A_k$ be mutually exclusive events. Then $P(A_1 \cup A_2 \cup \cdots \cup A_k) = P(A_1) + P(A_2) + \cdots + P(A_k)$.
   3. If $A_1 \subset A_2 \subset \Omega$, then $P(A_1) \le P(A_2)$.
   4. For any events $A_1$ and $A_2$, $P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$.
o If C, D are events, the conditional probability of C given D is $P(C \mid D) = \frac{P(C \cap D)}{P(D)}$.
o If events C and D are independent, then $P(C \mid D) = P(C)$ and $P(C \cap D) = P(C)\,P(D)$.

STAT 541 DISCUSSION 11

Notes: HW 4 was graded out of 9 points. A stem and leaf plot for the Exam 2 scores is attached.

We haven't covered much in the way of new material since the second exam, so let's do some review problems, since we only have a few weeks until the final.

o One study has reported that the sensitivity of the mammogram as a screening test for detecting breast cancer is 85%, while the specificity, P(test negative | no disease), is 80%. In a population in which the probability that a woman has breast cancer is 0.025, what is the probability that she has cancer given that her mammogram is positive?
o In Wisconsin, the lengths of badgers are approximately normally distributed with mean 63 cm and standard deviation 7 cm. If we select five at random, what is the probability that exactly 3 are longer than 74 cm? What is the probability that
all 5 are less than 70 cm long?
o Let's say that we sample from a Poisson distribution 55 times. That Poisson has mean and variance equal to 5. What is the approximate probability that the sample mean exceeds 6?
o Geneticists use endonucleases to cut strands of DNA at particular places. Let's say that we are using the restriction endonuclease EcoRI, which cleaves the DNA whenever it encounters the sequence GAATTC (read from 5' to 3'). If the bases A, C, G, and T occur randomly with equal frequency, what is the expected number of cleavage sites in a 50,000 base pair length of DNA? What is the probability that we get fewer than 5? (You will need to make some assumptions that aren't ideal, but are probably pretty harmless.)
o 56% of the students in the Wisconsin class admitted as undergrads for the fall of 2004 are female. 12.3% are considered Students of Color. If 8% of the class is female Students of Color and I sample from the class at random, are the events "choose a woman" and "choose a student of color" statistically independent? 12% of the class is from Minnesota. If 8% of the class is women from Minnesota, what percentage is either female or from Minnesota?
o In our class there are 19 grad students, 46 undergrads and 7 other students. (These numbers aren't quite correct, but they are close.) If 38 of these students are from Wisconsin, construct the table to perform a chi-square test for the null hypothesis that the proportion is the same for each of these groups.

Exam 2 stem and leaf plot:

   0 | 8
   1 | 0
   1 | 8
   2 | 0
   2 | 779
   3 | 01333344444
   3 | 5555666778888899999
   4 | 000001111122222222244444

Definitely a touch lower than the first, but I think still quite good for the most part.

Things you'll know (or know better to watch out for!) when you leave in December:
1. What you can and cannot infer from graphs.
2. How to construct (in your head) and interpret confidence intervals.
3. How to conduct tests on population parameters within a population and between/across populations. These parameters include means, variances, odds ratios, and others.
4.
How and when to carry out a linear regression analysis, and how to interpret the results.

mediamatters.org claimed that this was a misleading graph:

"In presenting the results of a CNN/USA Today/Gallup poll, CNN.com used a visually distorted graph that falsely conveyed the impression that Democrats far outnumber Republicans and independents in thinking the Florida state court was right to order Terri Schiavo's feeding tube removed. According to the poll, when asked if they 'agreed with the court's decision to have the feeding tube removed,' 62 percent of Democratic respondents agreed, compared to 54 percent of Republicans and 54 percent of independents. But these results were displayed along a very narrow scale of 10 percentage points and thus appeared to show a large gap between Democrats and Republicans/Independents. Presented in this manner, the graph suggests that the gap between the two groups is overwhelming, rather than only 8 percentage points, within the poll's margin of error of +/- 7 percentage points."

[Figure: CNN/USA Today/Gallup poll, results by party (Democrats, Republicans, Independents). Question: "Based on what you have heard or read about the case, do you agree with the court's decision to have the feeding tube removed?"]

When constructing graphs, do the following (and hope, or maybe insist, that others do the same):
1. Clearly label x and y axes.
2. Use relevant scales.
3. Indicate exactly which subset of the data are being represented, and how. Were continuous measurements changed to binary or integer for plotting purposes? Were other transformations done?
4. BE VERY CAUTIOUS about drawing general conclusions from graphs that do not involve error bars or some way of representing error. More on this later in the semester.
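As a rough illustration of guideline 2, the sketch below compares how tall two bars are drawn relative to one another when the vertical axis starts at zero and when it starts just below the smaller value. The percentages 62 and 54 come from the poll quoted above; the baseline of 53 is an assumption chosen only for illustration (the quoted text says no more than that the scale spanned 10 percentage points), and `bar_height_ratio` is a hypothetical helper, not part of any plotting library.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of two bars when the axis starts at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("baseline must lie below both values")
    return (a - baseline) / (b - baseline)


democrats, republicans = 62.0, 54.0  # percentages quoted from the poll

# Axis starting at zero: the drawn heights reflect the actual percentages.
honest = bar_height_ratio(democrats, republicans)

# Axis starting at 53 (assumed baseline): the same data look very different.
truncated = bar_height_ratio(democrats, republicans, baseline=53.0)

print(f"axis from 0:  taller bar is {honest:.2f} times the shorter one")
print(f"axis from 53: taller bar is {truncated:.2f} times the shorter one")
```

Under these assumptions the same 8-point difference is drawn as a modest gap in the first case and as a bar several times taller in the second, which is the kind of distortion a relevant scale is meant to avoid.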
Responses to questions in the Breast Cancer Consortium Questionnaire fall into these categories. Examples are below.

Nominal Data: Questions 1, 2, 5, 16
Ordinal Data: Questions 4, 9, 17 (disregard "never" and "not sure")
Discrete Data: How many sisters have been diagnosed with breast cancer? (Question 7b, rephrased)
Continuous Data: Question 14 (weight could be measured in arbitrarily small units)
  - can take on any value in some range
  - "discreteness" of measurements only limited by the measuring device

Data can be classified into a few basic types:

Nominal Data: Numeric values that represent classes or categories; the categories are not ordered. Magnitude of the numerical value is not important.

Ordinal Data: Numeric values that represent classes or categories; the categories are ordered. Magnitude of the numerical value is not important.

Ranked Data: Numeric values that represent the order of ranked observations. By assigning ranks, information about the magnitude of the values and their differences is lost; however, ranks still retain much useful information. Convention is to rank from lowest to highest and then assign numeric values to each ranked observation, starting with the lowest. (Pagano and Gauvreau, page 10, have this switched around.)

Discrete Data: Numbers that represent measurable quantities, as opposed to just labels. Magnitude is important. Discrete data take on specified values that differ by fixed amounts; intermediate values are not possible.

Continuous Data: Numbers that represent measurable quantities taking on any value in some range. Again, magnitude is important. The difference between any two values can be arbitrarily small.

Absolute and Relative Frequencies of Mammograms for 5,447,140 mammograms recorded in the Breast Cancer Surveillance Consortium (BCSC) study from 1996-2004 inclusive (http://breastscreening.cancer.gov/statistics):

  Race               Number of Mammograms    RF (%)
  White                  3,785,762            69.5
  Black                    283,251             5.2
  Hispanic                 397,641             7.3
  Asian                    266,910             4.9
  American Indian           59,919             1.1
  Other                    653,657            12.0

For nominal (shown here) and
ordinal data, a frequency distribution consists of the set of classes or categories along with the numerical counts in each. A relative frequency distribution shows the proportion of counts that fall into each class or category. A relative frequency (RF) value for any category is obtained by dividing the number of observations in that category by the total number of observations. This can be reported as a percentage (as shown) by multiplying the resulting fraction by 100.

This table is a bit misleading since we can't tell which populations, if any, are under or over represented. This data does not consider the proportions of each race in the general population.

Number of mammograms taken in 1999, grouped by patient's age:

  Age           Number of Mammograms    RF (%)    CRF (%)
  18-29                 -                 -          -
  30-39              27,000              5.7        6.1
  40-49             141,000             29.6       35.7
  50-59             135,000             28.3       64.0
  60-69              87,000             18.2       82.2
  70-79              65,000             13.6       95.8
  80-89              19,000              4.0       99.8
  90 or over          1,000              0.2      100.0

For discrete or continuous data, we must break down the range of values into a series of distinct non-overlapping intervals. If there are too many intervals, not much of a summary is obtained; if there are too few, information can be lost. Although it is not necessary (and is not done in this BCSC study), intervals are often constructed so that they have equal widths.

The cumulative relative frequency (CRF) for an interval is the proportion of the total number of observations that have a value less than or equal to the upper limit of the interval. This too can be expressed as a percentage. In the table above, we see that 35.7% of mammograms are performed on women at or under the age of 49.

[Figure: bar chart showing number of mammograms (in thousands) in each age group (18-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90+) for years 1996-2004.]

Graphical Summaries

Percentage of Mammograms by Race and Ethnicity

This pie chart shows the racial distribution of 5,447,140 mammograms recorded by the Breast Cancer Surveillance Consortium from 1996-2004 inclusive. Again, we can't tell which populations, if any, are under or over
represented. This data does not consider the proportions of each race in the general population.

[Figure: pie chart with slices for White non-Hispanic, Black non-Hispanic, Hispanic, Asian/Pacific Islander, American Indian/Alaskan native, and Mixed/other/unknown.]

Rat Mean Arterial Pressure

Histograms are used to display a frequency distribution for discrete or continuous data. If relative frequencies (proportions) are displayed, the histogram is often called a probability histogram. In this case, the heights sum to one. Again, note that we must break down the range of values into a series of distinct non-overlapping intervals. If there are too many intervals, not much of a summary is obtained; if there are too few, information can be lost.

Numerical Summaries of Data

Arithmetic Mean: Sum of data values divided by the total number of values.
Median: Value which separates data into two halves; half the data values are greater than the median, half are smaller than the median. The median is less sensitive to outliers than the mean.
Mode: Value, or set of values, that occurs most frequently.

More precise definitions will be given soon.

[Figure: histograms of the rat MAP data (x-axis: mmHg; y-axis: number in range). Annotations: Mean = 118.9, Median = 113.0, Mode = 110 (first pair of panels); Mean = 112.4, Median = 112.0, Mode = 110 (second pair of panels).]

Numerical Summaries of Data

Let $n$ data values be denoted by $x_1, x_2, \ldots, x_n$.

Mean: Sum of data values divided by the total number of values:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Mode: Value, or set of values, that occurs most frequently.

Median: Value which separates data into two halves; half the data values are greater than the median, half are smaller than the median. The median is less sensitive to outliers than the mean.

Quantiles or percentiles: The $p\%$ quantile (percentile) is the smallest value which is greater than or equal to $p$ percent of the data. For example, the 95% percentile is the value that is greater than or equal to 95% of the data and less than
or equal to the remaining 5%.

Quartiles: The 25th and 75th quantiles are called quartiles.
Range: Difference between the largest and the smallest data values.
Interquartile Range: Difference between the 75th and 25th percentiles. Consequently, it contains the middle 50% of the observations.

A bit of detail on Quantiles

Intuitively, the $p\%$ quantile (percentile) is the smallest value $V_p$ such that $p$ percent of the sample points are less than or equal to $V_p$. The median, being the 50% percentile, is a special case of a quantile. Quartiles are also special cases of quantiles. Note that this is not a precise definition. For example, if you have a data set with $n = 20$ values, what would the median be?

More Numerical Summaries of Data

The sample variance of the data set is defined by
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
The rationale for using $n - 1$ in the denominator, as opposed to $n$, will be given in a few weeks.

The coefficient of variation is the standard deviation divided by the mean, $CV = s / \bar{x}$; it is often multiplied by 100 to give a percentage.

Precise Definition of a Quantile

For a data set of size $n$, the $p\%$ quantile is defined by
1. The $(k+1)$th largest sample point (counting up from the smallest value) if $np/100$ is not an integer. Here $k$ is the largest integer less than $np/100$.
2. The average of the $(np/100)$th and $(np/100 + 1)$th largest observations if $np/100$ is an integer.

If the 40th and 60th percentiles lie an equal distance from the midpoint, and the same is true for the 30th and 70th, the 20th and 80th, and all other pairs of percentiles that sum to 100, the data are symmetric. A symmetric distribution has the same shape on each side of the 50% percentile. Shown below are histograms of MAP (left) data and simulated (right) data.

[Figure: histograms of the MAP data (left) and the simulated data (right; x-axis: simdata).]

Summarizing the Distribution

The mean and standard deviation of a data set can be used to summarize characteristics of the entire distribution.

Empirical Rule: if the data are symmetric and unimodal, then approximately 67% of the observations lie within one standard deviation of the mean, and approximately 95% of the observations
lie within two standard deviations of the mean. There is a more precise version of this that we will see later this semester.

Chebychev's Inequality

Chebychev's Inequality will work even when the data are not symmetric or unimodal.

Chebychev's inequality: For any number $k$ that is greater than 1, at least $1 - (1/k)^2$ of the observations lie within $k$ standard deviations of their mean.

Vertical lines are drawn at one standard deviation on either side of the mean for the MAP (left) and simulated (right) data. 69% of the observations fall between the lines for the MAP data, 84% for the simulated data. The empirical rule does not work if the data are not symmetric and unimodal.

[Figure: histograms of the MAP data (x-axis: mmHg) and the simulated data (x-axis: simdata), annotated with the percentage of observations in range (69% and 84%).]

So, for $k = 2$, Chebychev's inequality tells us that at least $1 - (1/2)^2 = 3/4$ of the values lie within 2 standard deviations of the mean.

[Figure: histograms of the MAP data and the simulated data with lines at two standard deviations from the mean; one panel is annotated "97% of obs in range".]

Box Plots

A boxplot is used for discrete or continuous data. The lower bound of the box is the 25th percentile of the data; the upper end is the 75th percentile. The 50th percentile (median) is indicated. Horizontal lines are drawn at the most extreme values outside the box that are not more than 1.5 × Interquartile Range beyond either of the bounding quartiles.

[Figure: Rat Mean Arterial Pressure (MAP) data histogram. Mean = 112.4, Median = 112.0, Mode = 110.]

Generating a Box Plot

To generate a box plot:
• Plot the box. Upper bound is the 75th quantile (75Q), lower bound is the 25th quantile (25Q). Draw a line at the median. Note that 25Q and 75Q are NOT standard notation.
• Calculate the Inter-Quartile Range (75Q - 25Q).
• Draw horizontal lines at the most extreme points closest to, without going outside, 75Q + 1.5 IQR and 25Q - 1.5 IQR.
• Draw in remaining points that fall outside the horizontal lines.

A box plot for the MAP pressure data (shown earlier) is given below.

[Figure: box plot of the MAP data, with the range 1.5 × IQR marked.]

Note that the horizontal lines need not be the same distance from the box. Again, the horizontal
lines are drawn at the most extreme values outside the box that are NOT more than 1.5 × Interquartile Range beyond either of the bounding quartiles.

[Figure: histogram and box plot of the simulated data (x-axis: simdata), annotated with mean, median and mode.]

Simulated data from previous slide:

  -1.892  -1.649  -1.328  -0.918  -0.839  -0.724  -0.442  -0.415
  -0.361  -0.300  -0.282  -0.224  -0.177  -0.178  -0.145  -0.078
  -0.049  -0.012   0.006   0.124   0.159   0.266   0.406   0.467
   0.905   1.072   1.146   1.858   2.478   2.602   8.000

A boxplot function will calculate quartiles. It might interpolate between values, or impose the restriction that the quartiles be one of the data values. For now, let's impose the restriction that the quantiles be one of the data values. Find the median and the 25th and 75th quantiles. Where would the horizontal lines for the box plot be drawn? (Hint: 25Q - 1.5 IQR = -1.738 and 75Q + 1.5 IQR = 1.791.)

[Figure: histograms and boxplots of MAP data (upper) and simulated data (lower).]

From Stat 541 exam: Three data sets were generated. The box plots and histograms of each data set are shown. Match the box plot to the histogram generated by the same data.

[Figure: three box plots and three histograms, labeled (a), (b), (c).]

Ismor Fischer, 8/21/2008
Appendix A2 / Geometric Viewpoint / Least Squares Approximation

Least Squares Approximation

The concepts of linear correlation and least squares regression can be viewed very elegantly from a pure geometric perspective. Again, recall some basic background facts from elementary vector analysis.

For any two column vectors $v = (v_1, v_2, \ldots, v_n)^T$ and $w = (w_1, w_2, \ldots, w_n)^T$ in $\mathbb{R}^n$, the standard Euclidean dot product $v \cdot w$ is defined as $v^T w = \sum_{i=1}^{n} v_i w_i$, hence is a scalar. Technically, the dot product is a special case of a more general mathematical object known as an inner product, denoted by $\langle v, w \rangle$, and these notations are often used interchangeably. The length, or norm, of a vector $v$ can therefore be characterized as $\|v\| = \sqrt{\langle v, v \rangle} = \sqrt{\sum_{i=1}^{n} v_i^2}$, and the included angle $\theta$ between two vectors $v$ and $w$ can be
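calculated via the formula
$$\cos\theta = \frac{\langle v, w \rangle}{\|v\|\,\|w\|}.$$
From this relation, it is easily seen that two vectors $v$ and $w$ are orthogonal (i.e., $\theta = \pi/2$), written $v \perp w$, if and only if their dot product is equal to zero, i.e., $\langle v, w \rangle = 0$.

More generally, the orthogonal projection of the vector $v$ onto the vector $w$ is given by the formula shown in the figure below. (Think of it informally as the "shadow vector" that $v$ casts in the direction of $w$.)

[Figure: the projection of $v$ onto $w$, a scalar multiple of $w$: $\mathrm{proj}_w v = \dfrac{\langle v, w \rangle}{\|w\|^2}\, w$.]

Why are orthogonal projections so important? Suppose we are given any vector $v$ (in a general inner product space), and a plane (or more precisely, a linear subspace) not containing $v$. Of all the vectors $u$ in this plane, we wish to find a vector $\hat{v}$ that comes closest to $v$ in some formal mathematical sense. The Best Approximation Theorem asserts that, under such very general conditions, such a vector does indeed exist, and is uniquely determined by the orthogonal projection of $v$ onto this plane. Moreover, the resulting error $e = v - \hat{v}$ is smallest possible, with $\|e\|^2 = \|v\|^2 - \|\hat{v}\|^2$ via the Pythagorean Theorem.

[Figure: Of all the vectors $u$ in the plane, the one that minimizes the length $\|v - u\|$ is the orthogonal projection $\hat{v}$. Therefore, $\hat{v}$ is the least squares approximation to $v$, yielding the least squares error $\|e\|^2 = \|v\|^2 - \|\hat{v}\|^2$.]

Now suppose we are given $n$ data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, obtained from two variables $X$ and $Y$. Define the following vectors in $n$-dimensional Euclidean space $\mathbb{R}^n$:

  $\mathbf{0} = (0, 0, 0, \ldots, 0)^T$, $\quad \mathbf{1} = (1, 1, 1, \ldots, 1)^T$,
  $x = (x_1, x_2, x_3, \ldots, x_n)^T$, $\quad \bar{x} = (\bar{x}, \bar{x}, \bar{x}, \ldots, \bar{x})^T$, so that $x - \bar{x} = (x_1 - \bar{x}, x_2 - \bar{x}, x_3 - \bar{x}, \ldots, x_n - \bar{x})^T$,
  $y = (y_1, y_2, y_3, \ldots, y_n)^T$, $\quad \bar{y} = (\bar{y}, \bar{y}, \bar{y}, \ldots, \bar{y})^T$, so that $y - \bar{y} = (y_1 - \bar{y}, y_2 - \bar{y}, y_3 - \bar{y}, \ldots, y_n - \bar{y})^T$.

The centered data vectors $x - \bar{x}$ and $y - \bar{y}$ are crucial to our analysis. For observe that, by definition,
$$\|x - \bar{x}\|^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n-1)\, s_x^2, \qquad \|y - \bar{y}\|^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (n-1)\, s_y^2,$$
and
$$\langle x - \bar{x},\, y - \bar{y} \rangle = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = (n-1)\, s_{xy}.$$
Now note that $\langle \mathbf{1}, x - \bar{x} \rangle = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$, therefore $\mathbf{1} \perp (x - \bar{x})$; likewise, $\mathbf{1} \perp (y - \bar{y})$ as well. See the figure below, showing the geometric relationships between the vector $y - \bar{y}$ and the plane spanned by the orthogonal basis vectors $\mathbf{1}$ and $x - \bar{x}$.

[Figure: the vector $y - \bar{y}$ and the plane spanned by $\mathbf{1}$ and $x - \bar{x}$.]

Also, from a previous formula, we see that the general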
angle $\theta$ between these two vectors is given by
$$\cos\theta = \frac{\langle x - \bar{x},\, y - \bar{y} \rangle}{\|x - \bar{x}\|\,\|y - \bar{y}\|} = \frac{(n-1)\, s_{xy}}{\sqrt{(n-1)\, s_x^2}\,\sqrt{(n-1)\, s_y^2}} = \frac{s_{xy}}{s_x s_y} = r \quad \text{(from above)},$$
i.e., the sample linear correlation coefficient. Therefore, this ratio $r$ measures the cosine of the angle $\theta$ between the vectors $x - \bar{x}$ and $y - \bar{y}$, and hence is always between $-1$ and $+1$. But what is its exact connection with the original vectors $x$ and $y$?

IF the vectors $x$ and $y$ are exactly linearly correlated, then by definition it must hold that $y = b_0 \mathbf{1} + b_1 x$ for some constants $b_0$ and $b_1$, and conversely. A little elementary algebra (take the mean of both sides, then subtract the two equations from one another) shows that this is equivalent to the statement $y - \bar{y} = b_1 (x - \bar{x})$, with $b_0 = \bar{y} - b_1 \bar{x}$. That is, the vector $y - \bar{y}$ is a scalar multiple of the vector $x - \bar{x}$, and therefore must lie not only in the plane, but along the line spanned by $x - \bar{x}$ itself. If the scalar multiple $b_1 > 0$, then $y - \bar{y}$ must point in the same direction as $x - \bar{x}$; hence $r = \cos 0 = +1$, and the linear correlation is positive. If $b_1 < 0$, then these two vectors point in opposite directions, hence $r = \cos\pi = -1$, and the linear correlation is negative. However, if these two vectors are orthogonal, then $r = \cos(\pi/2) = 0$, and there is no linear correlation between $x$ and $y$.

More generally, if the original vectors $x$ and $y$ are not exactly linearly correlated (that is, $-1 < r < +1$), then the vector $y - \bar{y}$ does not lie in the plane. The unique vector $\hat{y} - \bar{y}$ that does lie in the plane, which best approximates it in the least squares sense, is its orthogonal projection onto the vector $x - \bar{x}$, computed by the formula given above:
$$\hat{y} - \bar{y} = \frac{\langle y - \bar{y},\, x - \bar{x} \rangle}{\|x - \bar{x}\|^2}\,(x - \bar{x}) = b_1 (x - \bar{x}), \quad \text{with } b_1 = \frac{s_{xy}}{s_x^2},$$
i.e., the Linear Model $\hat{y} = b_0 \mathbf{1} + b_1 x$, with $b_0 = \bar{y} - b_1 \bar{x}$.

Furthermore, via the Pythagorean Theorem,
$$\|y - \bar{y}\|^2 = \|\hat{y} - \bar{y}\|^2 + \|y - \hat{y}\|^2,$$
or, in statistical notation, $SS_{Total} = SS_{Reg} + SS_{Error}$.

Finally, from this, we also see that the ratio
$$\frac{SS_{Reg}}{SS_{Total}} = \frac{\|\hat{y} - \bar{y}\|^2}{\|y - \bar{y}\|^2} = \cos^2\theta,$$
i.e., the coefficient of determination is $r^2$, where $r$ is the correlation coefficient.

Exercise: Derive the previous formulas $s_{x \pm y}^2 = s_x^2 + s_y^2 \pm 2\, s_{xy}$. (Hint: Use the Law of Cosines.)

Remark: In this analysis, we have seen how the familiar formulas
of linear regression follow easily and immediately from orthogonal approximation on vectors. With slightly more generality, interpreting vectors abstractly as functions $f(x)$, it is possible to develop the formulas that are used in Fourier series.

Ismor Fischer, 8/11/2008
Stat 541 / 1-10

1.5 Problems

In this section, we use some of the terminology that was introduced in this chapter, most of which will be formally defined and discussed in later sections of these notes.

1-1. Suppose that n = 100 tosses of a coin result in X = 38 Heads. What can we conclude about the "fairness" of the coin at the α = .05 significance level? At the α = .01 level?

1-2. Suppose that a given coin is known to be "fair" or "unbiased" (i.e., the probability of Heads is 0.5 per toss). In an experiment, the coin is to be given n = 10 independent tosses, resulting in exactly one out of $2^{10}$ possible outcomes. Rank the following five outcomes in order of which has the highest probability of occurrence, to which has the lowest.

  Outcome 1: H H T H T T T H T H
  Outcome 2: H T H T H T H T H T
  Outcome 3: H H H H H T T T T T
  Outcome 4: H T H H H T H T H H
  Outcome 5: H H H H H H H H H H

Suppose now that the bias of the coin is not known. Rank these outcomes in order of which provides the best evidence in support of the hypothesis that the coin is "fair," to which provides the best evidence against it.

Let X = "Number of Heads" in n = 50 random, independent tosses of a fair coin. Then the expected value is E[X] = 25, and the corresponding p-values for this experiment can be obtained by the following probability calculations (for which you are not yet responsible).

  P(X ≤ 24 or X ≥ 26) = 0.8877
  P(X ≤ 23 or X ≥ 27) = 0.6718
  P(X ≤ 22 or X ≥ 28) = 0.4799
  P(X ≤ 21 or X ≥ 29) = 0.3222
  P(X ≤ 20 or X ≥ 30) = 0.2026
  P(X ≤ 19 or X ≥ 31) = 0.1189
  P(X ≤ 18 or X ≥ 32) = 0.0649
  P(X ≤ 17 or X ≥ 33) = 0.0328
  P(X ≤ 16 or X ≥ 34) = 0.0153
  P(X ≤ 15 or X ≥ 35) = 0.0066
  P(X ≤ 14 or X ≥ 36) = 0.0026
  P(X ≤ 13 or X ≥ 37) = 0.0009
  P(X ≤ 12 or X ≥ 38) = 0.0003
  P(X ≤ 11 or X ≥ 39) = 0.0001
  P(X ≤ 10 or X ≥ 40) = 0.0000
  ...
  P(X ≤ 0 or X ≥ 50) = 0.0000

Now suppose that this experiment is conducted twice, and X = 18 Heads are obtained both times. According to this chart, the p-value = 0.0649 each time, which is
above the α = .05 significance level; hence, both times, we conclude that the sample evidence seems to support the hypothesis that the coin is fair. However, the two experiments taken together imply that in this random sequence of n = 100 independent tosses, X = 36 Heads are obtained. According to the chart on page 1-4, the corresponding p-value = 0.0066, which is much less than α = .05, suggesting that the combined sample evidence tends to refute the hypothesis that the coin is fair. Explain this apparent discrepancy.

1.2 The Classical Scientific Method and Statistical Inference

"The whole of science is nothing more than a refinement of everyday thinking." - Albert Einstein

[Figure: flowchart of the classical scientific method. Population of units; Hypothesis about X. One branch, "what actually happens this time, regardless of hypothesis": Random Sample (empirical data, n observations). Other branch, "what ideally must follow, if hypothesis is true": Mathematical Theorem (formal proof): "If Hypothesis about X, then Consequence about X. QED". Analysis: Observed vs. Expected under Hypothesis. Is the difference statistically significant, or just due to random chance variation alone? Decision: Accept or Reject Hypothesis.]

Ismor Fischer, 8/12/2008
Stat 541 / 5-2

5.2 Formal Statement and Examples

Comments:
• $\sigma / \sqrt{n}$ is called the "standard error of the mean," denoted SEM, or more simply, s.e.
• The corresponding Z-score transformation formula is $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$.

Example: Suppose that the ages $X$ of a certain population are normally distributed, with mean $\mu = 27.0$ years and standard deviation $\sigma = 12.0$ years.

The probability that the age of a single randomly selected individual is less than 30 years is
$$P(X < 30) = P\!\left(Z < \frac{30 - 27}{12}\right) = P(Z < 0.25) = 0.5987.$$

Now consider all random samples of size $n = 36$ taken from this population. By the above, their mean ages $\bar{X}$ are also normally distributed, with mean $\mu = 27$ yrs as before, but with standard error $\sigma / \sqrt{n} = 12 / \sqrt{36} = 2$ yrs.

Comment: In this population, the probability that the average age of 36 random people is under 30 years old is much greater than the probability that the age of one random person is under 30 years old.
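The comparison in the comment above can be checked numerically. The following is a minimal sketch that uses only the Python standard library; `normal_cdf` is a helper name introduced here for illustration, and the inputs (mean 27, standard deviation 12, sample size 36) are the values read from the age example.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X < x) for X ~ Normal(mean, sd), computed from the error function."""
    z = (x - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu, sigma, n = 27.0, 12.0, 36   # values taken from the age example
se = sigma / math.sqrt(n)       # standard error of the mean: 12 / 6 = 2

p_single = normal_cdf(30.0, mu, sigma)  # one individual younger than 30
p_mean = normal_cdf(30.0, mu, se)       # mean age of 36 individuals below 30

print(f"P(X < 30)    = {p_single:.4f}")
print(f"P(Xbar < 30) = {p_mean:.4f}")
```

The same helper can be reused for the exercises on being under 24 years old, or between 24 and 30 years old, by changing the cutoffs and subtracting two calls where an interval is needed.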
The probability that the mean age of a single sample of $n = 36$ randomly selected individuals is less than 30 years is
$$P(\bar{X} < 30) = P\!\left(Z < \frac{30 - 27}{2}\right) = P(Z < 1.5) = 0.9332.$$

Exercise: Compare the two probabilities of being under 24 years old.
Exercise: Compare the two probabilities of being between 24 and 30 years old.

• If $X \sim N(\mu, \sigma)$ approximately, then $\bar{X} \sim N(\mu, \sigma / \sqrt{n})$ approximately. (The larger the value of $n$, the better the approximation.) In fact, more is true...

IMPORTANT GENERALIZATION:

Intuitively perhaps, there is less variation between different sample mean values than there is between different population values. This formal result states that, under very general conditions, the sampling variability is usually much smaller than the population variability, as well as gives the precise form of the limiting distribution of the statistic.

• What if the population standard deviation $\sigma$ is unknown? Then it can be replaced by the sample standard deviation $s$, provided $n$ is large. That is, $\bar{X} \sim N(\mu, s / \sqrt{n})$ approximately, if $n \ge 30$ or so, for "most" distributions (... but see example below). Since the value $s / \sqrt{n}$ is a sample-based estimate of the true standard error s.e., it is commonly denoted $\widehat{\text{s.e.}}$

• Because the mean $\mu_{\bar{X}}$ of the sampling distribution is equal to the mean $\mu_X$ of the population distribution, i.e., $E[\bar{X}] = \mu_X$, we say that $\bar{X}$ is an unbiased estimator of $\mu_X$. In other words, the sample mean is an unbiased estimator of the population mean. A biased sample estimator is a statistic $\hat{\theta}$ whose expected value either consistently overestimates or underestimates its intended population parameter $\theta$.

• Many other CLT-like convergence results exist, e.g., Laws of Large Numbers.

Example: Consider an infinite population of paper notes, 50% of which are blank, 30% are ten-dollar bills, and the remaining 20% are twenty-dollar bills.

Experiment 1: Randomly select a single note from the population.
Random variable: $X$ = $ amount obtained

Population Distribution of $X$:

  x       f(x) = P(X = x)
  0            .5
  10           .3
  20           .2

Mean $\mu_X = E[X] = (.5)(0) + (.3)(10) + (.2)(20) =$
$7.00

Variance $\sigma_X^2 = E[(X - \mu_X)^2] = (.5)(-7)^2 + (.3)(3)^2 + (.2)(13)^2 = 61$

Standard deviation $\sigma_X$ = $7.81

Experiment 2: Each of n = 2 people randomly selects a note, and split the winnings.
Random variable: $\bar{X}$ = $ sample mean amount obtained per person

  Notes drawn    x̄      Probability
  (0, 0)          0      .5 × .5 = .25
  (0, 10)         5      .5 × .3 = .15
  (0, 20)        10      .5 × .2 = .10
  (10, 0)         5      .3 × .5 = .15
  (10, 10)       10      .3 × .3 = .09
  (10, 20)       15      .3 × .2 = .06
  (20, 0)        10      .2 × .5 = .10
  (20, 10)       15      .2 × .3 = .06
  (20, 20)       20      .2 × .2 = .04

Sampling Distribution of $\bar{X}$:

  x̄      P(X̄ = x̄)
  0       .25
  5       .30 = .15 + .15
  10      .29 = .10 + .09 + .10
  15      .12 = .06 + .06
  20      .04

Mean $\mu_{\bar{X}} = (.25)(0) + (.30)(5) + (.29)(10) + (.12)(15) + (.04)(20)$ = $7.00 = $\mu_X$

Variance $\sigma_{\bar{X}}^2 = (.25)(-7)^2 + (.30)(-2)^2 + (.29)(3)^2 + (.12)(8)^2 + (.04)(13)^2 = 30.5 = \dfrac{61}{2} = \dfrac{\sigma_X^2}{n}$

Standard deviation $\sigma_{\bar{X}}$ = $5.52 = $\dfrac{\sigma_X}{\sqrt{n}}$

Experiment 3: Each of n = 3 people randomly selects a note, and split the winnings.
Random variable: $\bar{X}$ = $ sample mean amount obtained per person

Sampling Distribution of $\bar{X}$:

  x̄        P(X̄ = x̄)
  0.00      .125
  3.33      .225 = .075 + .075 + .075
  6.67      .285 = .050 + .045 + .050 + .045 + .045 + .050
  10.00     .207 = .030 + .030 + .030 + .027 + .030 + .030 + .030
  13.33     .114 = .020 + .018 + .018 + .020 + .018 + .020
  16.67     .036 = .012 + .012 + .012
  20.00     .008

Mean $\mu_{\bar{X}}$ = (Exercise) = $7.00 = $\mu_X$

Variance $\sigma_{\bar{X}}^2$ = (Exercise) = $20.333 = \dfrac{61}{3} = \dfrac{\sigma_X^2}{n}$

Standard deviation $\sigma_{\bar{X}}$ = $4.51 = $\dfrac{\sigma_X}{\sqrt{n}}$

[Figure: Sampling Distribution, n = 10 (density histogram; x-axis: X-bar).]

The tendency toward a normal distribution becomes stronger as the sample size n gets larger, despite the mild skew in the original population values. This is an empirical consequence of the Central Limit Theorem. For most such distributions, n ≥ 30 or so is sufficient for a reasonable normal approximation to the sampling distribution. Recall also, from the first result in this section, that if the population is normally distributed (with known σ), then so will be the sampling distribution, for any n. BUT BEWARE...

However, if the population distribution of $X$ is highly skewed, then the sampling distribution of $\bar{X}$ can be highly skewed as well (especially if n is not very large), i.e., relying on the CLT can be risky. (Although sometimes using a transformation, such as $\ln(X)$ or $\sqrt{X}$, can restore a bell shape to the values. Later...)

Example: The two graphs on the
bottom of this page are simulated sampling distributions for the highly skewed population shown below. Both are density histograms based on the means of 1000 random samples; the first corresponds to samples of size n = 30, the second to n = 100. Note that skew is still present.

[Figure: Population Distribution (density vs. x, 0 to 5); Sampling Distribution, simulated, n = 30; Sampling Distribution, simulated, n = 100.]

Suppose that $X$ is a random variable that represents height. For the population of 18 to 74 year old women, height is normally distributed with mean $\mu = 68.9$ inches and standard deviation $\sigma = 2.6$ inches. If we randomly select a woman from this population, what's the probability that she is between 60 and 68 inches tall?

Among females in the United States between 18 and 74 years of age, diastolic blood pressure is normally distributed with mean $\mu = 77$ mmHg and standard deviation $\sigma = 11.6$ mmHg.
1. What is the probability that a randomly selected woman has a diastolic blood pressure less than 60 mmHg?
2. What is the probability that a randomly selected woman has a diastolic blood pressure greater than 90 mmHg?
3. What is the probability that, among five women selected at random from the population, at least one will have a pressure outside the range 60 to 90 mmHg?

Suppose serum cholesterol levels $X$ for children in Wisconsin have mean 175 mg/100ml and SD 50 mg/100ml. Suppose we want to know the limits within which 95% of the population lies. From Table 3 in the Appendix, we get that $P(Z > 1.96) = 0.025$, so that $P(-1.96 \le Z \le 1.96) = 0.95$.

What kinds of questions can you now answer? Use BP as an illustration. Suppose we know BP is normally distributed with a specific mean $\mu$ and variance $\sigma^2$.
1. A person walks in and you record the BP. You can tell if this person has "normal" BP or is an outlier.
2. You can tell what proportion of people lie inside or outside a given range.
3. If, say, 20 men come in to the office on a given day and each has his BP taken, you can tell the probability that at least one, at most 2, at least 5, etc. lie outside or inside some given range.

To answer these questions, we've made some key assumptions.
• We have "known" that the populations of interest are Normally distributed.
• We have "known" the population mean $\mu$ and the population standard deviation $\sigma$.

Most of the time, we don't know these things. So, most often, we collect a random, independent sample from a population and estimate population parameters of interest.

Statistical inference: The process of drawing conclusions about an entire population based on the information in a sample is known as statistical inference.

A random sample is a selection of some members of a population such that each member is independently chosen and has a known nonzero probability of being selected.

A simple random sample is a random sample in which each group member has the same probability of being selected.

Note that in here, and most of the time in practice, "random sample" refers to simple random sample.

The reference, target, or study population is the group we want to study. We often think of this population as having some true characteristics or parameters (e.g., a mean $\mu$ and standard deviation $\sigma$); we take a random sample to estimate these characteristics (e.g., a sample mean $\bar{x}$ or sample standard deviation $s$).

Q: I happen to "know" that BP levels for all men in the US follow a normal distribution with mean $\mu$ and standard deviation $\sigma$. You need to guess at $\mu$ and $\sigma$. How would you do this?

[Figure: histograms of 100 sample means for different sample sizes.]

Q: I happen to know that the number of car accidents in Madison each year follows a Poisson distribution. I know the mean (and so I know the variance). You need to guess at the mean. How would you do this?

[Figure: histograms of 100 sample means for different sample sizes.]

Q: I happen to know that the number of successful surgeries out of 10000 follows a Binomial distribution. I
know the mean and variance. You need to guess at the mean and variance. How would you do this?

[Figure: histograms of 100 sample means for different sample sizes.]

CENTRAL LIMIT THEOREM

Let $X_1, X_2, \ldots, X_n$ denote $n$ independent random variables sampled from the same distribution, which has a finite mean $\mu$ and variance $\sigma^2$. If $n$ is large, then
$$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = Z \sim N(0, 1) \quad \text{(approximately)}.$$
In other words, $\bar{X}$ is approximately Normally distributed with mean $\mu$ and variance $\sigma^2 / n$. The approximation gets better as $n$ increases.

The probability distribution of $\bar{X}$ is called the sampling distribution of $\bar{X}$. Understanding properties of the sampling distribution of $\bar{X}$ allows us to make inference about population parameters based on a single sample.

Characteristics that we observed in histograms which approximate the sampling distribution of $\bar{X}$:
1. The mean of the sampling distribution is near the population mean from which the samples were taken (sample size $n$ doesn't matter).
2. The variance of the sampling distribution gets smaller as the size of the sample $n$ increases.
3. For large sample sizes $n$, the sampling distribution looks normal.

NOTE: independent samples from the same distribution are often called independent and identically distributed (iid).

Recall the questions stated earlier on BP:
• Is a given BP typical?
• What proportions of BPs lie in a given range?
• If you see 20 men in a day, what's the probability that at least one will have BP outside a given range?

To answer these questions, we needed to assume that BP is normally distributed with a specific mean $\mu$ and variance $\sigma^2$. Thanks to the CLT, we can now answer similar questions without assuming a Normal distribution.

Q: Consider the distribution of cholesterol levels for US men aged 20-74. Assume the mean is $\mu = 211$ mg/100ml and the standard deviation is $\sigma = 46$ mg/100ml. Select a sample of size $n$ from the population. Is the sample an outlier? Does it have
an unusually high or low sample mean?

To answer this, we could figure out the interval that encloses, say, 95% of the sample means and see if the sample mean from our sample falls within that range. So, for a fixed $n$, we want to find $x_l$ and $x_u$ such that $P(x_l \le \bar{X}_n \le x_u) = 0.95$.

We know $P(-1.96 \le Z \le 1.96) = 0.95$. For $n = 25$:
$$P(-1.96 \le Z \le 1.96) = P\!\left(-1.96 \le \frac{\bar{X} - 211}{46 / \sqrt{25}} \le 1.96\right)$$
$$= P\!\left(-1.96 \cdot \frac{46}{\sqrt{25}} \le \bar{X} - 211 \le 1.96 \cdot \frac{46}{\sqrt{25}}\right)$$
$$= P\big(211 - 1.96(9.2) \le \bar{X} \le 211 + 1.96(9.2)\big)$$
$$= P(192.97 \le \bar{X} \le 229.03).$$

7. Correlation and Regression

7.1 Motivation

POPULATION

Random Variables $X$, $Y$: numerical (Contrast with § 6.3.1.)

How can the association between $X$ and $Y$ (if any exists) be
1. characterized and measured?
2. mathematically modeled via an equation, i.e., $Y = f(X)$?

Recall:
  $\mu_X = \text{Mean}(X) = E[X]$, $\quad \mu_Y = \text{Mean}(Y) = E[Y]$
  $\sigma_X^2 = \text{Var}(X) = E[(X - \mu_X)^2]$, $\quad \sigma_Y^2 = \text{Var}(Y) = E[(Y - \mu_Y)^2]$

Definition: Population Covariance of $X$, $Y$:
  $\sigma_{XY} = \text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$
Equivalently, $\sigma_{XY} = E[XY] - \mu_X \mu_Y$.

Exercise: Algebraically expand the expression $(X - \mu_X)(Y - \mu_Y)$, and use the properties of mathematical expectation given in 3.1. This motivates an alternate formula for $s_{xy}$.

SAMPLE, size $n$

Recall:
  $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$, $\quad \bar{y} = \dfrac{1}{n} \sum_{i=1}^{n} y_i$
  $s_x^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, $\quad s_y^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$

Definition: Sample Covariance of $X$, $Y$:
  $s_{xy} = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

Note: Whereas $s_x^2 \ge 0$ and $s_y^2 \ge 0$, $s_{xy}$ is unrestricted in sign.

For the sake of simplicity, let us assume that the predictor variable $X$ is nonrandom (i.e., deterministic), and that the response variable $Y$ is random. (Although [...])

Example: $X$ = fat (grams), $Y$ = cholesterol level (mg/dL); a scatterplot, along with some accompanying summary statistics.

[Figure: scatterplot of Y = cholesterol level against X = fat grams, for X = 60, 70, 80, 90, 100.]

$$s_{xy} = \frac{1}{5 - 1}\big[(60 - 80)(210 - 240) + (70 - 80)(200 - 240) + (80 - 80)(220 - 240) + (90 - 80)(280 - 240) + (100 - 80)(290 - 240)\big] = 600$$

As the name implies, the variance measures the extent to which a single variable varies about its mean; similarly, the covariance measures the extent to which [...] between two variables $X$ and $Y$. [...] are independent, then a scatterplot would reveal [...]. Ideally, if this is the case, there is no association [...] a mean response value of $Y$ = 98.6 °F, regardless of age $X$ [...] covariance = 0, or nearly so [...] not a good predictor of the
response Y. See figures.

[Figures: scatterplots with X = Head Circumference and X = Age]

However, in the preceding "fat vs. cholesterol" example, there is a clear trend: the points follow a simple line with positive slope, and so a linear description can be used to capture such "first-order" properties of the association between X and Y. The two questions we now ask are:

1. How can we measure the strength of the linear association between X and Y?
Answer: Linear Correlation Coefficient

2. How can we model the linear association between X and Y, essentially via an equation of the form $Y = mX + b$?
Answer: Simple Linear Regression

Before moving on to the next section, some important details are necessary in order to provide a more formal context for this type of problem. In our example, the response variable of interest is cholesterol level Y, which presumably has some overall probability distribution in the study population. The mean cholesterol level of this population can therefore be denoted $\mu_Y$ (or, recalling expectation, $E[Y]$) and estimated by the grand mean $\bar{y} = 240$. Note that no information about X is used.

Now we seek to characterize the relation (if any) between cholesterol level Y and fat intake X in this population, based on a random sample using n = 5 fat intake values, i.e., $x_1 = 60$, $x_2 = 70$, $x_3 = 80$, $x_4 = 90$, $x_5 = 100$. Each of these fixed $x_i$ values can be regarded as representing a different amount of fat grams consumed by a subpopulation of individuals, whose cholesterol levels Y, conditioned on that value of $X = x_i$, are assumed to be normally distributed. The conditional mean cholesterol level of each of these distributions could therefore be denoted $\mu_{Y \mid X = x_i}$ (equivalently, the conditional expectation $E[Y \mid X = x_i]$) for i = 1, 2, 3, 4, 5. (See figure; note that, in addition, we will assume that the variances within groups are all equal to $\sigma^2$, and that they are independent of one another.)

If no relation between X and Y exists, we would expect to see no organized variation in Y as X changes, and all of these conditional means would either be
uniformly scattered around, or exactly equal to, the unconditional mean $\mu_Y$; recall the discussion on the preceding page. But if there is a true relation between X and Y, then it becomes important to characterize and model the resulting (nonzero) variation.

[Figure: conditional distributions of Y at X = 60, 70, 80, 90, 100; horizontal axis: X = fat (grams)]

We can consider n = 5 subpopulations, each of whose cholesterol levels Y are normally distributed, and whose means are conditioned on X = 60, 70, 80, 90, 100 fat grams, respectively.

STAT 541 DISCUSSION 2
September 19, 2005

TA: Lane Burgette
Office: 1245F MSC, 1300 University Avenue
E-mail: burgette@stat.wisc.edu
URL: www.stat.wisc.edu/~burgette/541.html (or navigate from stat.wisc.edu)
Office Hours: 9:30-10:30 T, R

More Probability

• Law of Total Probability: If the $A_i$ are mutually exclusive and exhaustive, then

$$P(B) = \sum_i P(A_i \cap B) = \sum_i P(B \mid A_i)\,P(A_i).$$

• Example 1: What is the probability of rolling a 5, if you take the sum of two six-sided dice?

• Bayes' Rule: If the $A_i$ are mutually exclusive and exhaustive, then

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)},$$

so long as $P(B) > 0$.

• Example 2: The Monty Hall problem. This is probably the most famous Bayes' rule exercise around. There is a game show with three closed doors. Behind one closed door is a prize, but the other two are empty. After the contestant chooses one door, it is kept shut for the time being. The host then opens one of the doors that the contestant did not choose. (After all, one or maybe two of them have nothing behind them.) Should the contestant keep the original door, or should he switch to the other closed door, or does it not matter?

• Example 3: See Figure 1. Let us say that we select a person at random from the US population between the ages of 20 and 74. Given that that person is obese, what is the probability that the person is a black female? What are the odds? What are the chances that the person is a white male? Assume that 12.3% of the population is
black and 69.1% of the population is white. Also, use the fact that 50.9% of the population is female, and assume that is true for all races and age groups. Data taken from www.economist.com and quickfacts.census.gov.

[Figure 1: Obesity data (obesity among US adults)]

Ismor Fischer, 8/16/2008, Stat 541

4. Classical Probability Distributions

4.1 Discrete Models

Recall from §3.1 that for a discrete population random variable X, we have:

Definition: $f(x)$ is a probability distribution function if, for all x,

$$f(x) \ge 0 \quad \text{AND} \quad \sum_{\text{all } x} f(x) = 1.$$

The resulting cumulative distribution function (cdf) is defined, for all x, as

$$F(x) = P(X \le x) = \sum_{\text{all } x_i \le x} f(x_i),$$

and is piecewise constant, increasing from 0 to 1. Therefore, for any two population values a < b, it follows that

$$P(a \le X \le b) = \sum_{x = a}^{b} f(x) = F(b) - F(a^-),$$

where $a^-$ denotes the population value immediately preceding a.

Definition: The mean or expected value of X is given by

$$\mu = E[X] = \sum_{\text{all } x} x\, f(x).$$

Definition: The variance of X is given by either of the two equivalent forms

$$\sigma^2 = E[(X - \mu)^2] = \sum_{\text{all } x} (x - \mu)^2 f(x) \quad \text{or} \quad \sigma^2 = E[X^2] - \mu^2 = \sum_{\text{all } x} x^2 f(x) - \mu^2.$$

[Figure: probability histogram of f(x) with total area 1, and the corresponding step-function cdf F(x)]

Example (from Preventing Chronic Disease: Public Health Research, Practice, and Policy): "Identifying Geographic Disparities in the Early Detection of Breast Cancer Using a Geographic Information System"

Among other things, this study estimated the rate of breast cancer in situ (BCIS), which is diagnosed [...]. Consider a random sample of 1000 breast cancer diagnoses. Let X = number of BCIS cases in the sample.

Questions:
How can we model the probability distribution of X, and under what assumptions?
Probabilities of events, such as P(X = 20), P(X ≤ 20), etc.?
Mean # BCIS cases = ?
Standard deviation of # BCIS cases = ?

Binomial Distribution (Paradigm model = coin tosses)

Binary random variable:
Y = 1, Success (Heads), with P(Success) = $\pi$
Y = 0, Failure (Tails), with P(Failure) = $1 - \pi$

Experiment: n = 5 independent coin tosses

Sample Space S = all $2^5 = 32$ ordered sequences of H and T, from (H H H H H) to (T T T T T).
Events:
X = 0: Exercise
X = 1: Exercise
X = 2: Exercise
X = 3: see above
X = 4: Exercise
X = 5: Exercise

Recall: For x = 0, 1, 2, …, n, the combinatorial symbol $\binom{n}{x}$, read "n-choose-x", is defined as the value $\frac{n!}{x!\,(n - x)!}$, and counts the number of ways of rearranging x objects among n objects. See Appendix > Basic Reviews > Perms & Combos for details.

Note: $\binom{n}{r}$ is computed via the mathematical function "nCr" on most calculators.

Probabilities:

First assume the coin is fair ($\pi = 0.5 \Rightarrow 1 - \pi = 0.5$), i.e., equally likely elementary outcomes H and T on a single trial. In this case, the probability of any event A above can thus be easily calculated via P(A) = #(A) / #(S).

x    P(X = x)
0    1/32 = 0.03125
1    5/32 = 0.15625
2    10/32 = 0.31250
3    10/32 = 0.31250
4    5/32 = 0.15625
5    1/32 = 0.03125

Now consider the case where the coin is biased (e.g., $\pi = 0.7 \Rightarrow 1 - \pi = 0.3$). Calculating P(X = x) for x = 0, 1, 2, 3, 4, 5 means summing P(all its outcomes).

Example: P(X = 3), outcome by outcome, via independence of H, T:

P(H H H T T) = (0.7)(0.7)(0.7)(0.3)(0.3) = $(0.7)^3 (0.3)^2$
P(H H T H T) = (0.7)(0.7)(0.3)(0.7)(0.3) = $(0.7)^3 (0.3)^2$
P(H H T T H) = (0.7)(0.7)(0.3)(0.3)(0.7) = $(0.7)^3 (0.3)^2$
P(H T H H T) = (0.7)(0.3)(0.7)(0.7)(0.3) = $(0.7)^3 (0.3)^2$
P(H T H T H) = (0.7)(0.3)(0.7)(0.3)(0.7) = $(0.7)^3 (0.3)^2$
P(H T T H H) = (0.7)(0.3)(0.3)(0.7)(0.7) = $(0.7)^3 (0.3)^2$
P(T H H H T) = (0.3)(0.7)(0.7)(0.7)(0.3) = $(0.7)^3 (0.3)^2$
P(T H H T H) = (0.3)(0.7)(0.7)(0.3)(0.7) = $(0.7)^3 (0.3)^2$
P(T H T H H) = (0.3)(0.7)(0.3)(0.7)(0.7) = $(0.7)^3 (0.3)^2$
P(T T H H H) = (0.3)(0.3)(0.7)(0.7)(0.7) = $(0.7)^3 (0.3)^2$

Via disjoint outcomes: $P(X = 3) = 10\,(0.7)^3 (0.3)^2 = \binom{5}{3} (0.7)^3 (0.3)^2$.

Hence, we similarly have:

x    $P(X = x) = \binom{5}{x} (0.7)^x (0.3)^{5 - x}$
0    $\binom{5}{0} (0.7)^0 (0.3)^5$ = 0.00243
1    $\binom{5}{1} (0.7)^1 (0.3)^4$ = 0.02835
2    $\binom{5}{2} (0.7)^2 (0.3)^3$ = 0.13230
3    $\binom{5}{3} (0.7)^3 (0.3)^2$ = 0.30870
4    $\binom{5}{4} (0.7)^4 (0.3)^1$ = 0.36015
5    $\binom{5}{5} (0.7)^5 (0.3)^0$ = 0.16807

[Figure: probability histogram of X for x = 0, 1, 2, 3, 4, 5]

Example: Suppose that a certain medical procedure is known to have a 70% successful recovery rate (assuming independence). In a random sample of n = 5 patients, the probability that three or fewer patients will recover is:

Method 1: P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 0.00243 + 0.02835 + 0.13230 + 0.30870 = 0.47178

Method 2: P(X ≤ 3) = 1 − [P(X = 4) + P(X = 5)] = 1 − [0.36015 + 0.16807] = 1 − 0.52822 = 0.47178

Example: The mean number of patients expected to recover is

$\mu = E[X]$ = 0 (0.00243) + 1 (0.02835) + 2 (0.13230) + 3 (0.30870) + 4 (0.36015) + 5 (0.16807) = 3.5

This makes perfect sense for n = 5 patients with a $\pi = 0.7$ recovery probability, i.e., their product. In the
probability histogram above, the "balance point" (fulcrum) indicates the mean value of 3.5.

General formulation:

Example: Suppose that a certain spontaneous medical condition affects 1% (i.e., $\pi = 0.01$) of the population. Let X = number of affected individuals in a random sample of n = 300. Then X ~ Bin(300, 0.01), i.e., the probability of obtaining any specified number x = 0, 1, 2, …, 300 of affected individuals is

$$P(X = x) = \binom{300}{x} (0.01)^x (0.99)^{300 - x}.$$

The mean number of affected individuals is $\mu = n\pi = (300)(0.01) = 3$ expected cases, with a standard deviation of $\sigma = \sqrt{(300)(0.01)(0.99)} = 1.723$ cases.

Probability Table for Binomial Dist.

x      $f(x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
0      $\binom{n}{0} \pi^0 (1 - \pi)^n$
1      $\binom{n}{1} \pi^1 (1 - \pi)^{n - 1}$
2      $\binom{n}{2} \pi^2 (1 - \pi)^{n - 2}$
etc.   etc.
n      $\binom{n}{n} \pi^n (1 - \pi)^0$
Total  1

Exercise: In order to be a valid distribution, the sum of these probabilities must equal 1. Prove it.

Hint: First recall the Binomial Theorem: How do you expand the algebraic expression $(a + b)^n$ for any n = 0, 1, 2, 3, …? Then replace a with $\pi$, and b with $1 - \pi$. Voilà!

Comments:

➢ The assumption of independence of the trials is absolutely critical! If it is not satisfied, i.e., if the success probability of one trial influences that of another, then the Binomial Distribution model can fail miserably. (Example: X = number of children in a particular school infected with the flu.) The investigator must decide whether or not independence is appropriate, which is often problematic. If it is violated, then the correlation structure between the trials may have to be considered in the model.

➢ As in the preceding example, if the sample size n is very large, then the computation of $\binom{n}{x}$ for x = 0, 1, 2, …, n can be intensive and impractical. An approximation to the Binomial Distribution exists, when n is large and $\pi$ is small, via the Poisson Distribution (coming up).

➢ Note that the standard deviation $\sigma = \sqrt{n\pi(1 - \pi)}$ depends on the value of $\pi$. (Later…)

How can we estimate the parameter $\pi$, using a sample-based statistic $\hat{\pi}$?

POPULATION

Binary random variable:
Y = 1, Success, with probability $\pi$
Y = 0, Failure, with
probability $1 - \pi$

Experiment: n independent trials

SAMPLE

0/1, 0/1, 0/1, …, 0/1, recorded as $(y_1, y_2, y_3, \ldots, y_n)$

Let $X = y_1 + y_2 + y_3 + \cdots + y_n$ = # Successes in n trials ~ Bin(n, $\pi$)
($n - X$ = # Failures in n trials).

Therefore, dividing by n, $\hat{\pi} = X/n$ = proportion of Successes in n trials ($= p$, as well), and hence $1 - \hat{\pi} = q = 1 - p$ = proportion of Failures in n trials.

Example: If, in a sample of n = 50 randomly selected individuals, X = 36 are female, then the statistic $\hat{\pi} = X/n = 36/50 = 0.72$ is an estimate of the true probability $\pi$ that a randomly selected individual from the population is female. The probability of selecting a male is therefore estimated by $1 - \hat{\pi} = 0.28$.

Poisson Distribution (Models rare events)

Discrete Random Variable X = # occurrences of a (rare) event E, in a given interval of time or space, of size T (X = 0, 1, 2, 3, …).

[Figure: occurrences of E marked along an interval from 0 to T]

Assume:
1. All the occurrences of E are independent in the interval.
2. The mean number $\mu$ of expected occurrences of E in the interval is proportional to T, i.e., $\mu = \alpha T$. This constant of proportionality $\alpha$ is called the rate of the resulting Poisson process.

Then…

The Poisson Distribution: The probability of obtaining any specified number x = 0, 1, 2, … of occurrences of event E is given by

$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!},$$

where $\lambda = \alpha T$ and e = 2.71828… (Euler's number). We say that X has a Poisson Distribution, denoted X ~ Poisson($\lambda$). Furthermore, the mean is $\mu = \alpha T$, and the variance is $\sigma^2 = \alpha T$ also.

Examples: # bee-sting fatalities per year, # spontaneous cancer remissions per year, # accidental needle-stick HIV cases per year, hemocytometer cell counts

Example (see above): Again suppose that a certain spontaneous medical condition E affects 1% (i.e., $\alpha = 0.01$) of the population. Let X = number of affected individuals in a random sample of T = 300. As before, the mean number of expected occurrences of E in the sample is $\lambda = \alpha T = (0.01)(300) = 3$. Hence X ~ Poisson(3), and the probability that any number x = 0, 1, 2, … of individuals are affected is given by

$$P(X = x) = \frac{e^{-3}\, 3^x}{x!},$$

which is a much easier formula to work with than the previous one. This fact is sometimes referred to
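as the Poisson approximation to the Binomial Distribution, when T (respectively, n) is large, and $\alpha$ (respectively, $\pi$) is small. Note that in this example, the variance is also $\sigma^2 = 3$, so that the standard deviation is $\sigma = \sqrt{3} = 1.732$, very close to the exact Binomial value.

       Binomial                                                Poisson
x      $P(X = x) = \binom{300}{x} (0.01)^x (0.99)^{300 - x}$     $P(X = x) = e^{-3}\, 3^x / x!$
0      0.04904                                                 0.04979
1      0.14861                                                 0.14936
2      0.22441                                                 0.22404
3      0.22517                                                 0.22404
4      0.16888                                                 0.16803
5      0.10099                                                 0.10082
6      0.05015                                                 0.05041
7      0.02128                                                 0.02160
8      0.00787                                                 0.00810
9      0.00258                                                 0.00270
10     0.00076                                                 0.00081
etc.   → 0                                                     → 0

[Figures: probability histograms of the Binomial and Poisson distributions above, each with total area 1]

Why is the Poisson Distribution a good approximation to the Binomial Distribution, for large n and small $\pi$?

Rule of Thumb: n ≥ 20 and $\pi$ ≤ 0.05; excellent if n ≥ 100 and $\pi$ ≤ 0.1.

Let $f_{\text{Bin}}(x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$ and $f_{\text{Poisson}}(x) = \frac{e^{-\lambda} \lambda^x}{x!}$, where $\lambda = n\pi$. We wish to show formally that, for fixed $\lambda$ and x = 0, 1, 2, …, we have

$$\lim_{n \to \infty,\ \pi \to 0} f_{\text{Bin}}(x) = f_{\text{Poisson}}(x).$$

Proof: By elementary algebra, it follows that

$$f_{\text{Bin}}(x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$$
$$= \frac{n!}{x!\,(n - x)!}\, \pi^x (1 - \pi)^n (1 - \pi)^{-x}$$
$$= \frac{n(n - 1)(n - 2) \cdots (n - x + 1)}{x!}\, \pi^x \left(1 - \frac{\lambda}{n}\right)^n (1 - \pi)^{-x}$$
$$= \frac{n(n - 1)(n - 2) \cdots (n - x + 1)}{n^x} \cdot \frac{(n\pi)^x}{x!} \left(1 - \frac{\lambda}{n}\right)^n (1 - \pi)^{-x}$$
$$= 1 \left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{x - 1}{n}\right) \frac{\lambda^x}{x!} \left(1 - \frac{\lambda}{n}\right)^n (1 - \pi)^{-x}$$
$$\to 1 \cdot 1 \cdots 1 \cdot \frac{\lambda^x}{x!}\, e^{-\lambda} \cdot 1 = f_{\text{Poisson}}(x) \quad \text{as } n \to \infty,\ \pi \to 0. \quad \text{QED}$$

(Siméon Poisson, 1781-1840)

Classical Discrete Probability Distributions

Binomial (probability of finding x "successes" and n − x "failures" in n independent trials)
X = # successes (each with probability $\pi$) in n independent Bernoulli trials, n = 1, 2, 3, …

$$f(x) = P(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}, \quad x = 0, 1, 2, \ldots, n$$

Negative Binomial (probability of needing x independent trials to find k successes)
X = # independent Bernoulli trials for k successes (each with probability $\pi$), k = 1, 2, 3, …

$$f(x) = P(X = x) = \binom{x - 1}{k - 1} \pi^k (1 - \pi)^{x - k}, \quad x = k, k + 1, k + 2, \ldots$$

Hypergeometric (modification of Binomial to sampling without replacement from "small" finite populations, relative to n)
X = # successes in n random trials taken from a population of size N containing d successes

$$f(x) = P(X = x) = \frac{\binom{d}{x} \binom{N - d}{n - x}}{\binom{N}{n}}, \quad x = 0, 1, 2, \ldots, d$$

Multinomial (generalization of Binomial to k categories, rather than just two)
For i = 1, 2, 3, …, k, $X_i$ = # outcomes in category i (each with probability $\pi_i$), in n independent Bernoulli trials, n = 1, 2, 3, …, with $\pi_1 + \pi_2 + \pi_3 + \cdots + \pi_k = 1$

$$f(x_1, x_2, \ldots, x_k) = P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_k^{x_k},$$
$$x_i = 0, 1, 2, \ldots, n \quad \text{with} \quad x_1 + x_2 + \cdots + x_k = n$$

Poisson (limiting case of Binomial, with n → ∞ and $\pi$ → 0, such that $n\pi = \lambda$ fixed)
X = # occurrences of an event having mean number of occurrences $\lambda > 0$

$$f(x) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$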
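The Binomial and Poisson formulas in this summary can be compared numerically. The following is a minimal sketch, not part of the original notes, that evaluates both probability functions for the n = 300, $\pi$ = 0.01 example used earlier in this chapter. The function names `binomial_pmf` and `poisson_pmf` are illustrative choices, and only the Python standard library is assumed.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi): C(n, x) * pi^x * (1 - pi)^(n - x)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam): e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam**x / factorial(x)


# Example values taken from the notes: n = 300 individuals, pi = 0.01.
n, pi = 300, 0.01
lam = n * pi  # Poisson parameter lambda = n * pi = 3

# Print the two probability functions side by side for x = 0, ..., 10.
print(" x  Binomial  Poisson")
for x in range(11):
    print(f"{x:2d}  {binomial_pmf(x, n, pi):.5f}   {poisson_pmf(x, lam):.5f}")
```

The printed columns can be checked against the Binomial and Poisson table given earlier. Changing `n` and `pi` while holding `n * pi` fixed gives a rough feel for how quickly the approximation improves.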