Laboratory Data Analysis (EPP 245)
These 152 pages of class notes were uploaded by Virgie Eichmann DDS on Tuesday, September 8, 2015, for EPP 245 (Med Epidemiology & Prev Med) at the University of California, Davis, taught by Staff in Fall.
Some Principles for the Design and Analysis of Experiments Using Gene Expression Arrays and Other High-Throughput Assay Methods
EPP 245/298 Statistical Analysis of Laboratory Data, October 11, 2005

The Omics Revolution
The advent of gene expression microarrays, proteomics by mass spectrometry, and metabolomics by mass spectrometry and NMR spectroscopy presents enormous opportunities for fundamental biological research and for applications in medicine, agriculture, and environmental science. They also present many challenges in the design and analysis of laboratory experiments, population studies, and clinical trials. We present some lessons learned from our experience with these studies.

Omics Data
- Genome: complement of all genes, or of all components of genetic material in the cell (mostly static)
- Transcriptome: complement of all mRNA transcripts produced by a cell (dynamic)
- Proteome: complement of all proteins in a cell, whether directly translated or produced by post-translational modification (dynamic)
- Metabolome: complement of all metabolites other than proteins and mRNA, e.g. lipids, saccharides, etc. (dynamic)

[Figure: the central dogma of molecular biology (DNA replication; transcription/RNA synthesis in the nucleus; translation/protein synthesis in the cytoplasm), annotated with the assay technologies at each level: genome (PCR, SAGE), transcriptome (GeneChips, microarrays), proteome (2D PAGE, LC-MS, protein chips), metabolome (LC-MS, NMR)]

The Principles of Experimental Design Have Not Changed
- A design that is not adequate to measure a change in one indicator across populations is probably not adequate to measure the change in 20,000 indicators.
- Usually, biological variability within or between organisms is much larger than the technical variability of measurements.
- Thus most replications should be across organisms, not repeats of the same sample.
- The measurement of differences between types of cancer, between varieties of wheat, or between animal populations will often require many samples.

We Need Internal Controls
- We learned long ago that clinical studies need internal controls to be believable. Comparisons with past history are too frequently deceptive to be useful.
- Genomics data are an obvious exception, because the genetic structure of, for example, humans varies only a little between individuals and mostly does not vary at all over time in a given individual.
- Gene expression data, proteomics data, and metabolomics data are more like clinical data than genomics data: they vary over time and over conditions, some of which are hard to measure.
- Databases of expression, proteomics, etc. will mostly be useful as archives of studies; direct comparisons across studies will need to be interpreted cautiously.
- What we hope will be reproducible is differences between groups, not absolute measurements.

Detecting Statistically Significant Effects
- Mostly, we do not yet have quantitative knowledge of what changes in gene expression, protein content, etc. are biologically significant. Until we do have such knowledge, we should detect all changes that we are sure have occurred, without regard to size. Twofold may be a large or a small change; a 10% change may be important.
- If we measure 10,000 things at once and test each one for significance, we may have too many false positives to be useful.
- A 5% statistical test will generate an average of 500 false positives in 10,000. If we have 1,000 significant genes in tests for differential expression, then about half will likely be false discoveries.
- One way to control this is the Bonferroni method for family-wise error rates, in which each gene is tested at a significance level of 5%/10,000 = 0.000005, or one in 200,000. This guarantees that no genes will be identified in 19 of 20 studies where there are no real differences. It may lack sensitivity.
- With a sample of 5 in each of two groups, the smallest difference that is significant at the 5% level is about 1.7 standard deviations. With the Bonferroni adjustment on 10,000 variables, the detectable change is over four times as large (7.5 standard deviations).
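The multiple-testing arithmetic quoted above (an average of 500 false positives at the 5% level, and a per-gene Bonferroni level of one in 200,000) can be checked in a few lines. This is an illustrative sketch, not part of the original notes, and the function names are mine.

```python
# Sketch of the multiple-testing arithmetic from the notes.

def expected_false_positives(n_tests, alpha):
    """Average number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance level that controls the family-wise error rate."""
    return family_alpha / n_tests

# 10,000 genes tested at 5% each: about 500 false positives on average.
print(expected_false_positives(10_000, 0.05))
# Bonferroni per-gene level: 0.05 / 10,000 = 0.000005, i.e. one in 200,000.
print(bonferroni_threshold(10_000))
```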
False Discovery Rate
- There is a series of False Discovery Rate (FDR) methods that provide good protection but are more sensitive than the Bonferroni method.
- If there are 10,000 genes and 500 are identified by a 5% FDR method, then approximately 95% of these 500 will be really different and no more than about 5% of them will be false discoveries. This means that only about 25 of the 500 will be false leads.

Experimental Design
- Often, investigating multiple factors in the same experiment is better. We can use a full factorial design (all possible combinations) or a fractional factorial. Fractional factorial designs can investigate as many as 7 factors in 8 experiments, each one with the full precision of a comparison of 4 vs. 4.
- Consider a study of the response of mice to a toxic insult. We can examine 2 ages of mice, 2 sexes, and treatment vs. control, for a total of eight conditions. With 2 mice per condition, we are well placed to investigate even complex relationships among the three factors.
- Two-color arrays generate more complexity in the design, with possible dye bias, and with the most accurate comparisons being between the two samples on the same slide.

The Analysis of Variance
- The standard method of analyzing designs with categorical variables is the analysis of variance (ANOVA).
- The basic principle is to compare the variability of the group means with an estimate of how big that variability could be at random, and to conclude the difference is real if the ratio is large enough.
- Consider an example with four groups and two measurements per group:

Example Data
Group  Sample 1  Sample 2  Mean
A      2         4         3
B      8         10        9
C      14        16        15
D      20        22        21

- The variability among the four group means is 120 (the Mean Square for groups). This has three degrees of freedom.
- The variability within groups is 2 (the Mean Square Error, or MSE). This has four degrees of freedom.
- The significance of the ratio is assessed with the F distribution. The more degrees of freedom in the MSE, the more sensitive the test is.
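The mean squares and F ratio for the example data can be reproduced in plain Python. This is a sketch for illustration; the slides state the results without showing the computation.

```python
# Re-computation of the four-group ANOVA example (groups A-D, two
# measurements each).

groups = {"A": [2, 4], "B": [8, 10], "C": [14, 16], "D": [20, 22]}

n_per_group = 2
k = len(groups)                               # number of groups
all_values = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_values) / len(all_values)

# Mean Square for groups: variability among the four group means.
group_means = {g: sum(ys) / len(ys) for g, ys in groups.items()}
ss_groups = n_per_group * sum((m - grand_mean) ** 2 for m in group_means.values())
ms_groups = ss_groups / (k - 1)               # 3 degrees of freedom

# Mean Square Error: variability within groups.
ss_error = sum((y - group_means[g]) ** 2 for g, ys in groups.items() for y in ys)
df_error = len(all_values) - k                # 4 degrees of freedom
mse = ss_error / df_error

f_ratio = ms_groups / mse
print(ms_groups, mse, f_ratio)                # 120.0 2.0 60.0
```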
- The observed F ratio of 120/2 = 60 is highly significant. If there were no real difference, the F ratio would be near 1.

Measurement Scales
- Standard statistical methods are additive: we compare differences of means.
- Often, with gene expression data and other kinds of assay data, we prefer ratios to means.
- This is equivalent to taking logarithms and using differences: log(x/y) = log(x) - log(y).
- In general, we often take logs of data and then use regression, ANOVA, and other standard additive statistical methods. High-throughput assay data require some alteration in this approach.

Variation in Microarray and Other Omics Data
Some well-known properties of measurement error in gene expression microarrays include the following:
- For high gene expression, the standard deviation of the response is approximately proportional to the mean response, so that the CV is approximately constant.
- For low levels of expression, the CV is much higher.
- Expression is commonly analyzed on the log scale, so that for high levels the SD is approximately constant, but for low levels of expression it rises.
- Comparisons of expression are usually expressed as n-fold, corresponding to the ratio of responses, of which the logarithm would be well behaved, but only if both genes are highly expressed.
- These phenomena occur in many measurement technologies but are more important in high-throughput assays like microarrays.
- What is the fold increase when a gene goes from zero expression in the control case to positive expression in the treatment case?
- Which is biologically more important: an increase in expression from 0 to 100, or an increase from 100 to 200?

Variance Model for Gene Expression and Other Omics Data
- At high levels, the standard deviation of replicates is proportional to the mean. If the mean is mu, this would be SD(y) = b*mu, i.e. Var(y) = b^2 mu^2.
- But this cannot hold for unexpressed genes, or in general for assays where the true concentration is 0.
- So a reasonable model for the variance of microarray data is Var(y) = a^2 + b^2 mu^2 (Rocke and Durbin 2001).
- Often the observed intensity (peak area, etc.) needs to be corrected for background or baseline by subtraction of the average signal alpha corresponding to genes unexpressed (compounds not present in the sample). This may be a single number, a single number per slide, or a more complex expression. It can be estimated from negative controls or by more complex methods.
- So if y is the signal and z = y - alpha is the background-corrected signal, our mean-variance model is E(z) = mu, Var(z) = a^2 + b^2 mu^2. It can be shown that Var(ln z) is approximately b^2 + a^2/mu^2, which is nearly constant for large mu but grows without bound as mu approaches 0.

An Example
We illustrate this with one slide from an experiment on the response of male Swiss Webster mice to a toxic substance. The treated animal received 0.15 mg/kg i.p. of Naphthoflavone, while the control mouse had an injection of the carrier (corn oil). Genes were replicated, usually eight times per slide.

[Figures: replicate standard deviation (scale) plotted against mean intensity (location) for the raw data, over the full range and at low expression]

Data Transformation
- Logarithms stabilize the variance for high levels but increase the variance for low levels.
- Log expression ratios have constant variance only if both genes are expressed well above background.
- Heterogeneity of variance is an important barrier to reliable statistical inference.
- Such heterogeneity is common in biological data, including gene expression data.
- Data transformations are a well-known way of dealing with this problem.
- We present a new transformation family that is expressly designed for biological data and which appears to work very well on gene expression data.
- The logarithm is designed to stabilize data when the standard deviation increases proportionally to the mean.
- When the data cover a wide range, down to zero or near zero, this transformation performs poorly on low-level data. This does not mean that these data are bad or highly variable or unreliable. It only means that we are using the wrong transformation, or measurement scale.
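A minimal numerical sketch of the two-component variance model Var(y) = a^2 + b^2 mu^2. The parameter values are assumptions chosen to match the worked example that appears later in the notes (additive SD a = 10, high-level CV b = 15%).

```python
# Two-component variance model: SD(y) = sqrt(a^2 + (b*mu)^2).
import math

a = 10.0    # additive (background) standard deviation, dominant near zero
b = 0.15    # high-level coefficient of variation (15%)

def sd(mu):
    return math.sqrt(a**2 + (b * mu) ** 2)

# Near zero expression the SD is essentially the constant a:
print(sd(0))                  # 10.0
# At high expression the CV levels off at about b:
for mu in (100, 1000, 10_000):
    print(mu, sd(mu) / mu)    # CV falls toward 0.15 as mu grows
```

This reproduces the two regimes described above: constant SD at low signal, constant CV at high signal.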
The generalized logarithm reproduces the logarithm at high levels but behaves better at low levels. One way to express it is

f(z) = ln(z + sqrt(z^2 + a^2/b^2)),

where z is the background-corrected intensity (Durbin, Hardin, Hawkins, and Rocke 2002; Hawkins 2002; Huber, von Heydebreck, Sültmann, Poustka, and Vingron 2002; Munson 2001). Properties:
- f(z) behaves like ln(z) for large z (up to an additive constant).
- f(z) is approximately linear for z near 0.
- f(z) is monotonic: it does not change the order of size of the data.

[Figures: log and glog transformations compared, over the full range and at low levels; replicate SD (scale) vs. mean intensity (location) for log(y - alpha-hat), with 142 missing values since the log is undefined for negative background-corrected values, and for the new transformation]

Estimation
This transformation has one parameter, lambda = a^2/b^2, that must be estimated, as well as the background:
f(y) = ln((y - alpha) + sqrt((y - alpha)^2 + lambda))
- We can background-correct beforehand, or estimate the background and the transformation parameter in the same step.
- We can estimate lambda = a^2/b^2 by estimating the low-level variance a^2 and the high-level squared CV b^2 and taking the ratio.
- We can estimate the parameters in the context of a model, using standard statistical estimation procedures like maximum likelihood.
- We can estimate the transformation each time, or use values estimated with a given technology in a given lab for further experiments.

This helps solve the puzzle of comparing a change from 0 to 40 with a change from 1000 to 1600. Suppose that the standard deviation at 0 is 10 and the high-level CV is 15%. Then:
- A change from 0 to 40 is four standard deviations: 40 - 0 = 4 x 10.
- A change from 1000 to 1600 is also four standard deviations: 1600/1000 = 1.60, an increase of 4 x 15%.
- So is a change from 10,000 to 16,000: 16,000/10,000 = 1.60, an increase of 4 x 15%.
- The biological significance of any of these is unknown; different transcripts can be active at vastly different levels.
- But the glog transformation makes an equal change equally statistically significant.

Normalization and Transformation of Arrays
Given a set of replicate chips from the same biological sample, we can simultaneously determine the transformation parameter and the normalization. The statistical model used is

h_{lambda,alpha}(intensity) = gene + chip + error,

and we can estimate the transformation, the gene effects, and the normalization together.

[Figures: interquartile range vs. median (and vs. rank of median) for untransformed data, log-transformed data, and transformed data with no flask effects]

Two-Color Arrays
We wish to model two-color arrays so that, after transformation, the model is linear. We have a red and a green reading from each spot. Our transformation model is as follows:

h_{lambda,alpha}(red intensity) = red + gene + chip + spot + red error
h_{lambda,alpha}(green intensity) = green + gene + chip + spot + green error
h_{lambda,alpha}(red intensity) - h_{lambda,alpha}(green intensity) = (red - green) + (red gene - green gene) + errors

- h_{lambda,alpha}(red) - h_{lambda,alpha}(green) looks like the log ratio at high levels but makes sense also at low levels. We call this a generalized log ratio.
- We can solve iteratively for the transformation parameters and the gene and slide effects, for example by maximum likelihood.
- Dye-swap designs remove one source of bias.
- Exercise flexibility in the choice of the two types of samples (e.g. loop designs).

[Figure 4: replicate mean vs. replicate standard deviation of differences of transformed observations, each panel with a lowess smooth, for three transformations: the maximum likelihood transformation, the minimizing transformation, and the log transformation]
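The generalized logarithm described above can be sketched directly. The parameter values a = 10 and b = 0.15 are taken from the worked example; everything else is illustrative.

```python
# The glog transform f(z) = ln(z + sqrt(z^2 + lambda)), lambda = a^2/b^2.
import math

a, b = 10.0, 0.15
lam = (a / b) ** 2            # lambda = a^2 / b^2

def glog(z):
    return math.log(z + math.sqrt(z * z + lam))

# Defined at zero and for negative background-corrected values,
# where the plain log would fail:
print(glog(0.0), glog(-50.0))
# For large z it behaves like ln(z) up to an additive constant,
# so glog differences approach log-ratios:
print(glog(100_000) - glog(10_000))   # close to ln(10) = 2.3026
```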
Determining Differentially Expressed Genes
Consider an experiment on four types of cell lines (A, B, C, and D) with two samples per type, each of the eight measured with an Affymetrix U95A human gene array. We have a measured intensity for each gene, for each sample (array), in each group. The measured expression is derived from the mean of the glog-transformed PM probes.

Steps in the Analysis
- Background-correct each array so that 0 expression corresponds to 0 signal.
- Transform the data to constant variance using a suitably chosen glog or alternative transformation (started log, hybrid log).
- Normalize the chips additively.
- The transformation should remove the systematic dependence of the gene-specific variance on the mean expression, but the gene-specific variance may still differ from a global average. Estimate the gene-specific variance using all the information available.
- Test each gene for differential expression against the estimate of the gene-specific variance. Obtain a p-value for each gene.
- Adjust p-values for multiplicity using, for example, the False Discovery Rate method.
- Provide a list of differentially expressed genes.
- Investigate identified genes statistically and by biological follow-up experiments.

Structure of Example Data
Gene ID | Group 1 (arrays 1, 2) | Group 2 (arrays 3, 4) | Group 3 (arrays 5, 6) | Group 4 (arrays 7, 8)
1       | y111, y112            | y123, y124            | y135, y136            | y147, y148
2       | y211, y212            | y223, y224            | y235, y236            | y247, y248
3       | y311, y312            | y323, y324            | y335, y336            | y347, y348
4       | y411, y412            | y423, y424            | y435, y436            | y447, y448
5       | y511, y512            | y523, y524            | y535, y536            | y547, y548

(Here y_gij is the observation for gene g in group i on array j.)

[Figures: difference vs. sum (and vs. rank of sum) for the raw data, and difference vs. rank of sum after the glog transformation]

The model we use is

h_{lambda,alpha}(intensity) = gene + chip + gene-by-group + error.

For a given gene, this model reduces to z = group + error, where z is the transformed, chip-normalized data for the given gene (Kerr, Martin, and Churchill 2001; Kerr 2003).
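The FDR adjustment named in the steps above is commonly implemented with the Benjamini-Hochberg step-up procedure. The procedure is standard, but it is not spelled out in the notes, so this is a hedged sketch with made-up p-values.

```python
# Benjamini-Hochberg step-up procedure: one standard FDR method.

def bh_reject(pvalues, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest k (1-based) with p_(k) <= (k/m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical p-values for nine genes:
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212]
print(bh_reject(pvals, q=0.05))   # only the two smallest survive
```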
- We estimate all the parameters by normal maximum likelihood, including the transformation and possibly the background correction.
- Some care must be taken in the computations to avoid computer memory problems.
- We can test for differential expression for a given gene by analyzing the transformed, normalized data in a standard one-way ANOVA.
- We can use as the denominator the gene-specific 4-df MSE from that ANOVA. This is valid but not powerful.
- We can use the overall 50,493-df MSE as the denominator. This is powerful but risky.

[Figures: histograms of raw p-values from the gene-specific tests and from the global tests]

- The F statistics should be large if a significant effect exists and near 1 if no significant effect exists.
- If very small F statistics occur, it means something is wrong.
- As an alternative, we can use a model that says that the variation in different genes is similar but not identical. The model that assumes the variation to be identical is not tenable in this data set (Wright and Simon 2003; Churchill 2003; Rocke 2003; Smyth 2004).
- Note that we have removed any trend in the variance with the mean; what is left is apparently random.
- The posterior best-estimate MSE is a weighted average of the gene-specific MSE (weight 4/86) and the global estimate (the remaining weight), and has 86 degrees of freedom. The weights depend on the data set.

[Figure: histogram of posterior p-values]

5% Significant Genes by Several Methods
MSE Source    | TWER | FWER | FDR
Gene-Specific | 2114 | 1    | 18
Global        | 2478 | 571  | 1516
Posterior     | 2350 | 29   | 508

Metabolomics by NMR Spectroscopy
- Proton NMR spectroscopy produces a spectrum in which peaks correspond to parts of molecules.
- This can be used for single compounds to determine the structure.
- Compounds often have specific signatures, so this can be used for compound identification, particularly by 2D NMR.
- For metabolomics work, one can use patterns in the spectra for discrimination/classification and to identify the regions of the spectrum which carry the discriminating information.
- Spectra need to be baseline-corrected, and peaks need to be aligned.
- The peaks are of widely varying magnitudes, and some of the data are negative.
- The glog is a plausible transformation to help in the analysis of these data.

[Figure: an example NMR spectrum, intensity vs. frequency shift]

Variance Behavior of NMR Spectra
- We show an example spectrum of 65,536 points.
- We divide this into 8,192 bins of 8 points each and compute the mean and standard deviation within each bin.
- A model for the spectrum is y_i = b_i + mu_i e^{eta_i} + epsilon_i, where b_i is the baseline (not presumed to be flat), mu_i is the true signal, and epsilon_i and eta_i are measurement errors, not necessarily independent across nearby points.

[Figures: bin standard deviation vs. bin mean for bins of size 8, over the full range and for low means; SD vs. mean of logs for high means; SD vs. mean of glogs]

Baseline Estimation
- The baseline in NMR is arbitrary and needs to be removed before analysis, just as in mass spec.
- The baseline is less well behaved than for mass spec.

[Figures: raw baseline-corrected spectra; one glog transform of the whole spectrum; raw spectra in a limited ppm range; raw locally baseline-corrected spectra; transformed locally baseline-corrected spectra; a spectrum with baseline distortion]

Conclusion
- Gene expression microarrays and other omics data present many interesting challenges in the design and analysis of experiments.
- Statistical lessons from years of experience with laboratory, clinical, and field data apply, with some modification, to expression data, proteomics data, and metabolomics data.
- A properly chosen transformation can stabilize the variance and improve the statistical properties of analyses.
- Slide normalization and the analysis of two-color arrays are made easier by this transformation.
- Other statistical calculations that assume constant variance, such as the analysis of variance, are also improved.
- After removal of the systematic dependence of the variance on the mean, the remaining sporadic variation in the variance can be accounted for by a simple method.
- These methods can be applied to other types of data, such as proteomics by mass spec and metabolomics by NMR spectroscopy. The variables measured are a large number of peak heights or areas, or a large number of binned spectroscopic values.
- "If your experiment needs statistics, you ought to have done a better experiment." (Ernest, Lord Rutherford)
- Lord Rutherford to the contrary notwithstanding, if you need statistics, you may indeed be doing the right experiment.
- Papers are available at www.cipic.ucdavis.edu/~dmrocke or by mail and e-mail.

Experimental Design
EPP 245/298 Statistical Analysis of Laboratory Data, October 6, 2005

Basic Principles of Experimental Investigation
- Sequential experimentation
- Comparison
- Manipulation
- Randomization
- Blocking
- Simultaneous variation of factors
- Main effects and interactions
- Sources of variability
- Issues with two-color arrays

Sequential Experimentation
- No single experiment is definitive.
- Each experimental result suggests other experiments.
- Scientific investigation is iterative.
- "No experiment can do everything; every experiment should do something." (George Box)
[Diagram: the iterative cycle: plan experiment, perform experiment, analyze data from experiment, plan the next experiment]

Comparison
- Usually, absolute data are meaningless; only comparative data are meaningful.
- The level of mRNA in a sample of liver cells is not meaningful.
- The comparison of the mRNA levels in samples from normal and diseased liver cells is meaningful.

Internal vs. External Comparison
- Comparison of an experimental result with historical results is likely to mislead.
- Many factors other than the intended treatment can influence results.
- It is best to include controls or other comparisons in each experiment.

Manipulation
- Different experimental conditions need to be imposed by the experimenters, not just observed, if at all possible.
- The rate of complications in cardiac artery bypass graft surgery may depend on many factors which are not controlled and may be hard to measure.

[Figure: number of resident storks vs. population of Oldenburg, in thousands]

Randomization
- Randomization limits the differences between groups that are due to irrelevant factors.
- Such differences will still exist, but they can be quantified by analyzing the randomization.
- This is a method of controlling for unknown confounding factors.
- Suppose that 50% of a patient population is female. A sample of 100 patients will not generally have exactly 50 females; numbers of females between 40 and 60 would not be surprising.
- In two groups of 100, the disparity between the numbers of females in the two groups can be as big as 20 simply by chance.
- This also holds for factors we don't know about.
- Randomization does not exactly balance against any specific factor; to do that, one should employ blocking. Instead, it provides a way of quantifying possible imbalance, even of unknown factors.
- Randomization even provides an automatic method of analysis that depends only on the design and the randomization technique.

The Farmer from Whidbey Island
- A farmer visited the University of Washington with a whalebone water douser.
- 10 Dixie cups, 5 with water and 5 empty, covered with plywood.
- If he gets all 10 right, is chance a reasonable explanation?
- The probability of getting all 10 right by chance is 1/C(10,5) = 1/252, about 0.004.
- The randomness is produced by the process of randomly choosing which 5 of the 10 cups are to contain water. There are no other assumptions.
- If the randomization had instead been to flip a coin for each of the 10 cups, then the probability of getting all 10 right by chance would be different: there are 2^10 = 1024 ways for the randomization to come out, only one of which is right, so the chance is 1/1024, about 0.001.
- The method of randomization matters.

Randomization Inference
- 20 tomato plants are divided into 10 groups of 2, placed next to each other in the greenhouse.
- In each group of 2, one plant is chosen to receive fertilizer A and one to receive fertilizer B.
- The yield of each plant is measured.

Pair:  1    2    3    4    5    6    7    8    9    10
A:     132  82   109  143  107  66   95   108  88   133
B:     140  88   112  142  118  64   98   113  93   136
diff:  8    6    3    -1   11   -2   3    5    5    3

- The average difference is 4.1. Could this have happened by chance? Is it statistically significant?
- If A and B do not differ in their effects (the null hypothesis is true), then each plant's yield would have been the same whether A or B was applied. The difference would have been the negative of what it was if the coin flip had come out the other way.
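The randomization distribution for this example can be enumerated exactly: flip the signs of the ten observed pair differences in all 2^10 equally likely ways and see where the observed mean falls. This sketch uses the differences as read from the table above.

```python
# Exact randomization distribution for the tomato-pair example.
from itertools import product

diffs = [8, 6, 3, -1, 11, -2, 3, 5, 5, 3]    # B - A for the ten pairs
observed = sum(diffs) / len(diffs)           # 4.1

greater = equal = 0
for signs in product((1, -1), repeat=len(diffs)):
    mean = sum(s * abs(d) for s, d in zip(signs, diffs)) / len(diffs)
    if mean > observed:
        greater += 1
    elif mean == observed:
        equal += 1

print(greater, equal)                 # 3 outcomes beat 4.1; 4 (incl. observed) tie it
p_mid = (greater + equal / 2) / 1024  # ties counted at half weight
print(round(p_mid, 4))                # 0.0049, the randomization p-value
```

No distributional assumptions enter: the 1024 outcomes come entirely from the act of randomization.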
- In pair 1, the yields were 132 and 140. The difference was 8, but it could have been -8.
- With 10 coin flips, there are 2^10 = 1024 possible outcomes of + or - on the differences. These outcomes are possible results of our act of randomization and carry no assumptions.
- Of the 1024 possible outcomes, all equally likely under the null hypothesis, only 3 had greater values of the average difference, and only four (including the one observed) had the same value of the average difference.
- The likelihood of this happening by chance is (3 + 4/2)/1024, about 0.005. This does not depend on any assumptions other than that the randomization was correctly done.
- By comparison: d-bar = 4.1 and s_d = 3.872, so t = 4.1/(3.872/sqrt(10)) = 4.1/1.224 = 3.35, giving p = 0.0043 by the t-test versus p = 0.0049 by the true randomization distribution. Simulated randomization distributions give values in the same range.
- The t-test can be thought of as an approximation to the randomization distribution.

Randomization in Practice
- Whenever there is a choice, it should be made using a formal randomization procedure, such as Excel's RAND() function.
- This protects against unexpected sources of variability, such as day, time of day, operator, reagent, etc.
- For each pair, generate a random number: put =RAND() in the first cell and copy it down the column.
- Highlight the entire column, then Edit/Copy followed by Edit/Paste Special/Values. This fixes the random numbers so they do not recompute each time.
- A formula such as =IF(B2<0.5,"A","B"), copied down the treatment column, converts each random number into the treatment for the first plant of the pair; the other plant of the pair gets the other treatment. For the random numbers generated in the example:

Pair | Random number | First plant gets
1    | 0.871413      | B
2    | 0.786036      | B
3    | 0.889785      | B
4    | 0.081120      | A
5    | 0.297614      | A
6    | 0.540483      | B
7    | 0.824491      | B
8    | 0.624133      | B
9    | 0.913187      | B
10   | 0.001599      | A

- To randomize run order, insert a column of random numbers, then sort on that column.
- More complex randomizations require more care, but this is quite important and worth the trouble.
- Randomization can be done in Excel, R, or anything that can generate random numbers.

Blocking
- If some factor may interfere with the experimental results by introducing unwanted variability, one can block on that factor.
- In agricultural field trials, soil and other location effects can be important, so plots of land are subdivided to test the different treatments. This is the origin of the idea.
- If we are comparing treatments, the more alike the units to which we apply the treatments, the more sensitive the comparison.
- Within blocks, treatments should be randomized.
- Paired comparisons are a simple example of randomized blocks, as in the tomato plant example.

Simultaneous Variation of Factors
- The simplistic idea of science is to hold everything constant except for one experimental factor, and then vary that one thing.
- This misses interactions and can be statistically inefficient.
- Multifactor designs are often preferable.
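The spreadsheet randomization recipe described earlier can be reproduced in a few lines of Python. This is an illustrative sketch; the seed and variable names are mine.

```python
# Randomized treatment assignment within pairs, plus a randomized run
# order, mirroring the Excel procedure in the notes.
import random

random.seed(1)   # fix the seed so the assignment is reproducible

n_pairs = 10
assignments = []
for pair in range(1, n_pairs + 1):
    # The =IF(rand < 0.5, "A", "B") step: a coin flip per pair.
    first = "A" if random.random() < 0.5 else "B"
    second = "B" if first == "A" else "A"
    assignments.append((pair, first, second))

# Randomize run order by sorting on a fresh column of random numbers.
run_order = sorted(range(1, n_pairs + 1), key=lambda _: random.random())

for row in assignments:
    print(row)
print("run order:", run_order)
```

Each pair gets exactly one A and one B, and the run order is a random permutation of the pairs, whatever the seed.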
Interactions
- Sometimes (often!) the effect of one variable depends on the level of another one.
- This cannot be detected by one-factor-at-a-time experiments.
- These interactions are often scientifically the most important effects.

An illustration:
- Experiment 1: I compare the room before and after I drop a liter of gasoline on the desk. Result: we all leave because of the odor.
- Experiment 2: I compare the room before and after I drop a lighted match on the desk. Result: no effect, other than a small scorch mark.
- Experiment 3: I compare all four combinations of gasoline (yes/no) and match (yes/no). Result: we are all killed. A large interaction effect!

Statistical Efficiency
- Suppose I compare the expression of a gene in cell cultures of either keratinocytes or fibroblasts, confluent or non-confluent, with or without a possibly stimulating hormone, with 2 cultures in each condition, requiring 16 cultures.
- I can compare the cell types as an average of 8 cultures vs. 8 cultures, and I can do the same with the other two factors.
- This is more efficient than 3 separate experiments with the same controls, which would use 48 cultures.
- We can also see whether the cell types react differently to hormone application (an interaction).

Fractional Factorial Designs
- When it is not known which of many factors may be important, fractional factorial designs can be helpful.
- With 7 factors, each at 2 levels, a full factorial would require 2^7 = 128 experiments. It can be done in 8 experiments instead.

[Table: an eight-run fractional factorial design for seven two-level factors F1-F7, each run assigning each factor its low (L) or high (H) level]

Main Effects and Interactions
- Factors: Cell Type (C), State (S), Hormone (H). The response is the expression of a gene.
- The main effect C of cell type is the difference in average gene expression level between the cell types.
- For the interaction between cell type and state, compute the difference in average gene expression between cell types separately for confluent and non-confluent cultures. The difference of these differences is the interaction.
- The three-way interaction CSH is the difference between the two-way interactions with and without the hormone stimulant.

Sources of Variability in Laboratory Analysis
- Intentional sources of variability are treatments and blocks.
- There are many other sources of variability:
- biological variability, between organisms or within an organism;
- technical variability of procedures like RNA extraction, labeling, hybridization, chips, etc.

Replication
- Almost always, biological variability is larger than technical variability, so most replicates should be biologically different, not just replicate analyses of the same samples (technical replicates).
- However, this can depend on the cost of the experiment versus the cost of the sample.
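The difference-of-differences arithmetic described above can be made concrete with a small sketch. The four cell means here are invented purely for illustration (arbitrary expression units).

```python
# Main effect and interaction for a 2x2 layout: cell type x hormone.

# expr[cell_type][hormone_condition], made-up means in arbitrary units.
expr = {
    "keratinocyte": {"no_hormone": 50, "hormone": 54},
    "fibroblast":   {"no_hormone": 60, "hormone": 78},
}

# Main effect of cell type: difference of the averages over hormone levels.
mean_k = sum(expr["keratinocyte"].values()) / 2
mean_f = sum(expr["fibroblast"].values()) / 2
main_cell = mean_f - mean_k

# Interaction: the hormone effect computed separately in each cell type,
# then the difference of those differences.
effect_in_k = expr["keratinocyte"]["hormone"] - expr["keratinocyte"]["no_hormone"]
effect_in_f = expr["fibroblast"]["hormone"] - expr["fibroblast"]["no_hormone"]
interaction = effect_in_f - effect_in_k

print(main_cell, effect_in_k, effect_in_f, interaction)   # 17.0 4 18 14
```

A nonzero interaction here means the hormone effect differs between the cell types, exactly the kind of effect a one-factor-at-a-time experiment cannot see.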
- 2D gels are so variable that replication is required.

Quality Control
- It is usually a good idea to identify factors that contribute to unwanted variability.
- A study can be done in a given lab that examines the effects of day, time of day, operator, reagents, etc.
- This is almost always useful when starting with a new technology or in a new lab.

Possible QC Design
- Possible factors: day, time of day, operator, reagent batch.
- At two levels each, this is 16 experiments, to be done over two days, with 4 each in morning and afternoon, with two operators and two reagent batches.
- The analysis determines the contribution to overall variability from each factor.

References
Box, Hunter, and Hunter, Statistics for Experimenters, John Wiley.