Section VII: New Technologies

Serum Proteomic Profiling and Analysis

Richard Pelikan(1), Michael T. Lotze(2), Jim Lyons-Weiler, David Malehorn, Milos Hauskrecht(1)
(1) Department of Computer Science, University of Pittsburgh; (2) Translational Research, Molecular Medicine Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA

"They killed him in Sarajevo, Mr Svejk. They shot him with a revolver as he was riding with that archduchess of his in an automobile."
Jaroslav Hasek, Fateful Adventures of the Good Soldier Svejk During the World War (Osudy dobrého vojáka Švejka za světové války)

INTRODUCTION

The ability to examine serum peptides and proteins using mass spectrometry (MS) has recently become of broad interest as a novel biomarker and surrogate disease-discovery tool. Just as Svejk's world was transformed by the introduction of modern technology, now, almost a century later, the introduction of modern mass spectrometry strategies for assessing data-dense putative markers associated with inflammatory or immune endpoints has indeed changed the world of biological investigation. These strategies have become more widely applied, particularly in the setting of cancer diagnostics. Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) proteomic profiling is one of these increasingly popular tools, using 'shooting' of protein mixtures with focused lasers in the search for known and surrogate biomarkers for disease diagnosis and prognosis. Current diagnostic tests, such as repetitive biopsies done before or after therapy, impose greater costs and risk of injury than gathering the patient's proteomic profile. The greatest potential promise of proteomic profiling lies in the possibility of early detection of a malignant or chronic inflammatory condition with simple tests of the serum or urine. In addition, the hope is that different disease types and response-to-therapy phenotypes might be reproducibly distinguishable using rapid and relatively inexpensive assays.

SELDI-TOF MS rapidly provides a protein expression profile from a variety of biological and clinical samples. The potential efficacy of this system for serum protein profiling of cancer in human breast (Paweletz et al., 2001), colon (Watkins et al., 2001), head and neck (Wadsworth et al., 2004), lung (Zhukov et al., 2003), ovarian (Petricoin et al., 2002), prostate cancer (Adam et al., 2002; Petricoin and Ornstein, 2002) and hepatoma (Steel et al., 2003; Zeindl-Eberhart et al., 2004) has recently been demonstrated. These studies describe diagnostic features of these profiles, and classification algorithms based on these features provide at least 80 per cent, and in some cases >90 per cent, classification accuracy between cancer cases and controls. Papers reporting high sensitivity and specificity in class prediction presented promising initial positive results. Efforts to relate the proteomic profiles to changes in known protein factors, including cytokines and chemokines, as well as cellular elements and their signaling capacity, are now being explored, but it is premature to include them in this chapter.

The goal of this chapter is to overview the proteome profiling technology and its potential benefits for identification and clinical assessment of progression of pathological conditions. The initial discussion of the SELDI-TOF technology is followed by a discussion of technical limitations that affect the interpretive analysis of the profiles. Next, we focus on the description of some statistical methods used to decipher the profiles and their usage in diagnosing or predicting the condition of patients. We illustrate the potential of these methods on the task of differentiating between individuals with and without cancer. Finally, the review concludes with some insight on the direction of using proteomic data analysis towards the benefit of the medical community.

SELDI-TOF MASS SPECTROMETRY

The ProteinChip Biology System, developed by Ciphergen Biosystems, Inc., uses SELDI-TOF MS to ionize proteins specifically retained on a chromatographic surface, which are then detected by time-of-flight mass spectrometry. The system can be used for the mass analysis of compounds such as proteins, peptides and nucleic acids within a range of 0-200 kD. The procedure begins with the reaction of a biological sample (e.g. bodily fluid, cell lysate or a fraction thereof) with the chromatographic surface, or 'spot', of a ProteinChip, which possesses a defined affinity characteristic (anionic, cationic, hydrophobic or metal-binding) or is biologically derivatized (e.g. antibody-coupled). The ProteinChips, comprised of 8 or 16 of these spots, retain only those analytes that match the surface's physical affinity characteristics; non-binding species are washed away using appropriate conditions. The spots are then overlaid with an energy-absorbing matrix compound, which co-crystallizes around the retained analyte molecules. The spots are shot multiple times by a pulsed nitrogen laser. The laser desorption process results in ionization of matrix molecules and protonation of intact analyte molecules. The ions produced are differentially accelerated in an electrical field and then detected after passing through a field-free, evacuated drift tube. The time of flight across the tube is converted to provide information on the molecule's mass-to-charge ratio (m/z), since heavier molecules, by Newtonian physics, will take longer to travel the same distance, having acquired less initial velocity from the uniform acceleration force. The detected ions are then represented as a spectrum with peaks of varying intensities, and molecular weight assignments are made relative to known calibrant species. Figure 57.1 displays a summary of the SELDI-TOF MS process.
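The time-to-mass conversion described above follows from simple Newtonian mechanics: an ion of mass m and charge z acquires kinetic energy z*e*V in the accelerating field, so its drift velocity fixes m/z from the measured flight time. A minimal sketch (the tube length and accelerating voltage are illustrative placeholders, not instrument specifications):

```python
import math

E_CHARGE = 1.602176634e-19    # elementary charge (C)
DALTON = 1.66053906660e-27    # 1 Da in kg

def flight_time(mz_da, tube_length=1.0, accel_voltage=20000.0):
    """Idealized time of flight (s) for an ion of the given m/z (Da):
    (1/2)*(m/z)*v^2 = e*V, v = L/t, hence t = L*sqrt((m/z)/(2*e*V))."""
    return tube_length * math.sqrt(mz_da * DALTON / (2.0 * E_CHARGE * accel_voltage))

def mz_from_flight_time(t, tube_length=1.0, accel_voltage=20000.0):
    """Invert the relation above: m/z = 2*e*V*(t/L)^2, converted to Daltons."""
    mz_kg = 2.0 * E_CHARGE * accel_voltage * (t / tube_length) ** 2
    return mz_kg / DALTON
```

Heavier species arrive later, since flight time grows with the square root of m/z; this is why the detector's arrival-time record can be relabeled as a mass spectrum.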
Early studies and first applications (Paweletz et al., 2001; Petricoin and Ornstein, 2002; Petricoin et al., 2002) assembled SELDI-TOF MS proteomic profiles of patients with various types of cancer. The primary goal of such studies was to determine whether it is possible to detect peptide markers of the presence of disease by analyzing the profiles, contrasting those with cancer and those without. For example, the April 2002 study by Petricoin and coworkers (Petricoin et al., 2002) compared profiles of 200 patients in order to determine a discriminating pattern between patients with ovarian cancer and those with a variety of non-cancer conditions. A sample profile from this set is shown in Figure 57.2. It consists of intensities measured over 15,154 mass/charge (m/z) values.

Figure 57.1: Diagram of the Ciphergen SELDI-TOF MS system. Samples are reacted with ProteinChip spots, coated with matrix and pulsed with a laser. The ionized species created during this process are differentially accelerated in an electric field and float through the vacuum tube, where their arrival times and quantities (intensity) are measured by a detector. Heavier ions demonstrate a longer time of flight, which is a unique indicator for each molecular species. The plot of ion intensity versus specific time of flight, or corresponding mass-to-charge value, constitutes the mass spectrum for a given sample.

Figure 57.2: A sample SELDI-TOF MS profile. The x-axis plots mass-to-charge ratio. The y-axis plots the relative intensity (flux) of analyte species with the identified mass-to-charge ratio.

LIMITATIONS OF THE SELDI-TOF PROFILES

The profiles obtained by the SELDI-TOF MS system manifest a number of attributes which can complicate analysis. Figure 57.3 compares two unprocessed profiles from the same reference serum.

Figure 57.3: Two SELDI-TOF profiles obtained for the same pooled reference serum. The differences in the baseline (left panel), intensity measurements (right top panel) and mass inaccuracies (right bottom panel) are apparent.

Differences are readily visible and illustrate stochastic variations that obscure what would ideally be identical profiles. Many causes may be responsible for the differences: variation in the condition of the SELDI-TOF MS instrumentation over time, fluctuation in the intensity of the laser, and even surface irregularities on the protein chip spots or in the matrix crystallization.

All stochastic variations in profiles show up as differences in intensity readings. However, these differences are the result of two intertwined problems: mass inaccuracy and intensity measurement errors. Mass inaccuracy refers to the misalignment of readings for different m/z values. The mass inaccuracy for Ciphergen's SELDI-TOF MS system is reported to be approximately 0.51 per cent for externally calibrated experiments. The intensity measurement error may arise from imperfect performance of the ion detector in registering the abundance of ions at a given time point (detector saturation). Both types of errors are illustrated in the right panel of Figure 57.3. In addition, the left panel of Figure 57.3 illustrates baseline variation: a systematic intensity measurement error for which the measurements of the profile differ from 0. Note that the baseline shifts of the two samples differ, despite the fact that the same serum is being analyzed.

The mass inaccuracy and intensity measurement errors can lead to significant fluctuation in profile readings. In addition, if we analyze samples from multiple individuals, a natural biological variation in sera is observed. This can show up as differences in intensity values, or as the presence or absence of peaks in the profiles. The peaks are believed to indicate the presence of peptides or their fragments.
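These intertwined error types can be mimicked with a toy model: a synthetic single-peak "profile" distorted by an additive baseline shift, a multiplicative gain error and a small index shift standing in for mass inaccuracy. All numbers here are invented for illustration; real spectra are far richer.

```python
import math

def true_profile(n=200, center=100, width=8.0, height=50.0):
    """A toy 'true' profile: one Gaussian peak on a flat background."""
    return [height * math.exp(-((i - center) ** 2) / (2.0 * width ** 2))
            for i in range(n)]

def noisy_replicate(profile, baseline=0.0, gain=1.0, mass_shift=0):
    """Apply the error types discussed above to an ideal profile:
    - an additive baseline shift (systematic intensity offset),
    - a multiplicative gain error (detector/laser sensitivity),
    - a small m/z misalignment, modeled here as an index shift."""
    n = len(profile)
    return [gain * profile[min(max(i - mass_shift, 0), n - 1)] + baseline
            for i in range(n)]

ideal = true_profile()
rep1 = noisy_replicate(ideal, baseline=5.0, gain=1.0, mass_shift=0)
rep2 = noisy_replicate(ideal, baseline=12.0, gain=1.4, mass_shift=3)
# Two "replicates" of the same serum now disagree in baseline level,
# intensity scale and apparent peak position, as in Figure 57.3.
```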
These problems lead to serious challenges in interpretive analysis. Many of the observed variations could be caused by changes in carrier proteins, protein catabolism, or inherent cyclical shifts in relative abundance due to production and release from various tissues. We have considered some of these issues, particularly during nominally chaotic states, as represented in patients with various neoplasms, reflecting on the notion that an individual analyte may be increased or decreased at any time interval when compared to normal sera, something we have termed the ABA problem (Patel and Lyons-Weiler, 2004).

DATA PREPROCESSING

Despite the problems presented above, there are possible steps one can take before analyzing the data. The cleansing and modifying, or preprocessing, of the data is intended to eliminate noise or redundant components in the signal other than true biological variation in the serum. The preprocessing steps include smoothing, profile alignment, rescaling (normalization) and baseline correction. However, any preprocessing step comes at a risk of losing useful biological information or introducing additional errors. Therefore, if any preprocessing is performed, conclusions drawn from an ensuing analysis should be carefully validated.

Smoothing serves to eliminate a high-frequency noise component in the signal. Figure 57.4 illustrates the effect of smoothing on a SELDI-TOF profile. High-frequency variation is eliminated by local averaging of the signal. Whether smoothing removes useful signal or useless noise is not yet clear.

The mass inaccuracy problem (see Figure 57.3, lower right panel) can be resolved through profile alignment methods. A number of strategies for performing profile alignment exist. One option is to define a reference profile in terms of a set of established biomarkers that are easily identifiable in every profile.
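A minimal realization of such reference-based alignment is an integer-shift search that maximizes overlap with the reference profile. This is a crude illustrative stand-in, not the chapter's method; real aligners warp the m/z axis locally rather than by a single global shift.

```python
def align_to_reference(profile, reference, max_shift=10):
    """Try shifts in [-max_shift, max_shift] and keep the one whose
    overlap with the reference has the largest dot product (a crude
    stand-in for correlation-based alignment)."""
    def shifted(p, s):
        n = len(p)
        return [p[i - s] if 0 <= i - s < n else 0.0 for i in range(n)]
    best = max(range(-max_shift, max_shift + 1),
               key=lambda s: sum(a * b for a, b in zip(shifted(profile, s), reference)))
    return shifted(profile, best), best

# Hypothetical spike profiles: the reference peak sits at index 50,
# the misaligned measurement at index 53.
reference = [0.0] * 100
reference[50] = 1.0
profile = [0.0] * 100
profile[53] = 1.0
aligned, shift = align_to_reference(profile, reference)
```

Because a single global shift is found, this sketch cannot capture the local, position-dependent m/z errors the text warns about.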
Another approach is to include indicator peptides in the serum, in order purposely to populate the profile with peaks expected at certain m/z values. The intensity readings between these peaks could then be stretched or shrunk along the x-axis appropriately. Unfortunately, due to the locality of m/z errors, this approach requires the addition of several thousand peptides to the serum in order to recapture properly the information lost through mass inaccuracy. Using alignment algorithms directly on the profiles runs the risk of eliminating important biological variability in the data. Clearly, this challenge deserves additional attention: solving the alignment problem entails the possibility of eliminating sample-to-sample variation, which appears to be autocorrelated in terms of temporal (run-to-run) and spatial (spot-to-spot) processing.

A particular profile may suffer from an overall weakness in signal. Variation in the sensitivity of the ion detector, or in the amount of retained molecules on the chip surface, may result in profiles which seem to be measured on a different scale (see Figure 57.3, upper right panel). Rescaling (normalizing) these profiles allows them to be compared on the same scale. This type of adjustment differs from correcting baseline shift, which is an additive error, as opposed to scaling, which is a multiplicative error.

Correcting for baseline shifts (see Figure 57.3, left panel) involves subtracting a constant intensity value from the profile. The problem is that the baseline shift may vary over the m/z range. Figure 57.5 illustrates the process of baseline correction on the reference profile. The method removed the additive component in the signal and brought the baseline to 0. A challenge presented by baseline correction is that the noise from the intensity measurement process appears to be strongly correlated with the magnitude of the measurement. This suggests that any signal rescaling must be performed before the baseline correction.
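The preprocessing steps above can be sketched with simple stand-ins: a moving-average smoother, a multiplicative rescaling and a running-minimum baseline estimate. These are illustrative choices only; production pipelines use more careful estimators.

```python
def moving_average(signal, window=5):
    """Smooth by local averaging: each point becomes the mean of its
    neighborhood, removing high-frequency noise (and possibly some
    useful high-frequency signal, per the caveat above)."""
    half = window // 2
    n = len(signal)
    out = []
    for i in range(n):
        chunk = signal[max(0, i - half):min(n, i + half + 1)]
        out.append(sum(chunk) / len(chunk))
    return out

def rescale(signal, target=1.0):
    """Multiplicative normalization: bring profiles onto a common
    overall scale (here, unit total intensity)."""
    total = sum(signal)
    return [x * target / total for x in signal]

def subtract_baseline(signal, window=21):
    """Crude baseline correction: estimate the baseline as a running
    local minimum and subtract it, bringing the baseline to zero."""
    half = window // 2
    n = len(signal)
    return [x - min(signal[max(0, i - half):min(n, i + half + 1)])
            for i, x in enumerate(signal)]

# Following the ordering suggested in the text: rescale first, then
# correct the baseline (invented toy signal).
raw = [10.0, 10.5, 11.0, 30.0, 11.0, 10.5, 10.0, 10.0]
processed = subtract_baseline(rescale(moving_average(raw, window=3)))
```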
Figure 57.4: An example of smoothing. The original profile (left panel) demonstrates a high-frequency component. By averaging the signal locally, the high-frequency component is removed. Note that this results in a loss of information if this component carries useful content.

Figure 57.5: An example of baseline correction. Left panel: a profile with a baseline drift. Right panel: the corrected profile. The additive component in the signal is removed and the baseline is shifted to the zero intensity level.

BUILDING A PREDICTIVE MODEL

One of the objectives of SELDI-TOF MS data analysis is to build a predictive model that is able to determine the target condition (case or control, cancer or non-cancer) for a given patient's profile. The predictive model is built from a set of SELDI-TOF profiles (samples) assembled during the study. Each sample in the dataset is associated with a class label determining the target patient condition (case or control, cancer or non-cancer) we would like automatically to recognize. Our objective is to exploit the information in the data and to construct (learn) a classifier model that is able to predict accurately the label of any new profile. Such a model can then be used to predict the labels of new profile samples (diagnosis). The ultimate goal is to build (learn) the best possible classifier, i.e. a model that achieves the highest possible accuracy on future, yet-to-be-seen proteomic profile samples.

Many types of classification models exist. These include classic statistical models, such as logistic regression (Kleinbaum, 1994) and linear and quadratic discriminant analysis (Duda et al., 2000; Hastie et al., 2001), or modern statistical approaches, such as support vector machines (Vapnik, 1995; Burges, 1998; Scholkopf and Smola, 2002). In general, the model defines a decision boundary, a surface in the high-dimensional space of profile measurements that attempts to separate case and control profiles in the best possible way. The left panel of Figure 57.6 illustrates a linear surface (a hyperplane) that lets us separate 193 of the 200 samples from the ovarian study, using the intensity information in three profile positions (0.0735, 0.0786 and 0.4153 x 10^4 m/z). However, note that, in general, the perfect separation of two classes of profiles via a linear surface using just three positions may not be possible. This is illustrated in the right panel of Figure 57.6, where a linear surface allowing the perfect separation of case and control profiles does not exist. This scenario leads to sample misclassification: the assignment of incorrect profile labels to some samples, relative to the decision boundary.

Figure 57.6: Left panel: an example of a perfect separation of cases (x) versus controls (o) using a hyperplane in a 2-dimensional projection of the 3-dimensional space defined by m/z positions 0.0735, 0.0786 and 0.4153 x 10^4 in the ovarian cancer study. Note that all control samples are above the hyperplane, while all cases are below. Right panel: a perfect separation of the two groups does not exist; some case and control samples appear on the opposite side of the hyperplane.

EVALUATION OF A MODEL

To evaluate the quality of a classification model, one must determine the ability of the model to generate accurate predictions on future (unseen) profile data. Obviously, since such data are unavailable, we can simulate this scenario by splitting the available data obtained during the study into a 'training' set and a 'testing' set. The training set is used to learn (construct) the classifier. The learning process adjusts the predictive model (classifier) so that the examples in the training set are classified with high accuracy. The ability of the model to predict the case and control samples in the future is evaluated on the test set. Figure 57.7 illustrates the basic evaluation setup.

Figure 57.7: The basic evaluation setup. The dataset of samples (case and control profiles) is divided into a training and a testing set. The training set is used to learn (fit) the predictive model; the testing set serves as a surrogate sample of future, unseen examples.

The quality of a binary (case versus control) classification model may be determined by many different metrics. For the purposes of this review, the classification models are evaluated using statistics computed from the confusion matrix: a two-by-two grid that represents the types and percentages of correct and incorrect classifications. A sample confusion matrix is shown in Figure 57.8. The four areas are labeled as follows:

- True positives (TP): the percentage of profiles which the classifier predicts correctly as belonging to the group of cases (cancerous).
- False positives (FP): the percentage of profiles which the classifier predicts as belonging to cases, but which truly belong to controls (healthy).
- False negatives (FN): the percentage of profiles which the classifier predicts as belonging to controls, but which truly belong to cases.
- True negatives (TN): the percentage of profiles which the classifier predicts correctly as belonging to the group of controls.

Note that the four numbers should always sum to 1. An ideal classifier will have a value of 0 for both false negatives and false positives.

Figure 57.8: Diagram of a confusion matrix. Rows correspond to the predicted condition and columns to the actual condition; in this example, TP = 0.3, FP = 0.1, FN = 0.2 and TN = 0.4.

The following useful measures can be derived from the confusion matrix:

- Error (misclassification rate): E = FP + FN
- Sensitivity: SN = TP / (TP + FN)
- Specificity: SP = TN / (TN + FP)
- Positive predictive value: PPV = TP / (TP + FP)
- Negative predictive value: NPV = TN / (TN + FN)

A confusion matrix, and thus also the above measures, can be generated for both the training and the testing set. However, the evaluation on the testing set matters more: it is an indicator of how well the classification model generalizes to unseen data.
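The measures above follow directly from the four cells of the confusion matrix. A small sketch, using the example values shown in Figure 57.8:

```python
def confusion_stats(tp, fp, fn, tn):
    """Derive the evaluation statistics described above from the four
    cells of a confusion matrix (counts or proportions)."""
    total = tp + fp + fn + tn
    return {
        "error": (fp + fn) / total,        # misclassification rate
        "sensitivity": tp / (tp + fn),     # true-positive rate
        "specificity": tn / (tn + fp),     # true-negative rate
        "ppv": tp / (tp + fp),             # positive predictive value
        "npv": tn / (tn + fn),             # negative predictive value
    }

# Example values from Figure 57.8: TP=0.3, FP=0.1, FN=0.2, TN=0.4.
stats = confusion_stats(0.3, 0.1, 0.2, 0.4)
```

With proportions that sum to 1, the computed error equals FP + FN, matching the definition in the text.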
Sensitivity and specificity measures on the testing set are very useful if we consider adopting the classification model for the purpose of clinical screening. Sensitivity reports the percentage of samples with a condition that were correctly classified as having the condition. Specificity, on the other hand, reports the percentage of samples without a condition that were correctly classified as not having the condition. Very high values of these statistics indicate a good screening test. Moreover, one type of error can often be reduced at the expense of the other. This is very important if we care more about one type of error: for example, we may want to achieve a low number of patients incorrectly classified as suffering from the disease, and care a bit less about missing a patient with the disease during screening.

In the simplest evaluation framework, the statistics recorded in the confusion matrix are based on a single train/test set split. To eliminate a potential bias due to a lucky or an unlucky train/test split, the average of these statistics over multiple random splits is typically reported.

STRATEGIES FOR LEARNING A CLASSIFICATION MODEL

Classification models come in different guises. They may use a different decision boundary type or optimize slightly different learning criteria. Deeper analyses of many existing
the maximum ability to separate the groups of classes as illustrated in Figure 576 Learning and predicting using a classification model is often hampered by the fact that the data are presented with a small number of samples but have a naturally high dimensionality In this case there are only 200 pro files to study yet the input vector of each profile con tains 15154 mz values or 39features39 When the number of features vastly outweighs the number of samples the parameters of the classifier model are estimated with high variance This makes it difficult to achieve a model that gen eralizes well to new examples To alleviate this problem we demonstrate the use of several types of feature selection approaches to convert the highdimensional profile data into a lowdimensional dataset consisting of only a small number of mz intensities A hyperplane dependent on a smaller number of features is a more simplistic boundary and will generalize better to future unseen examples The primary goal of feature selection is to obtain a smaller set of features with high discriminative ability from a large set of inputs To assess an ability of a profile position to discriminate between the case and the control a univari ate statistical analysis can be performed Several methods exist for evaluating the dispersion between cases and con trols including ttest Fisher score or AUC score Fisher 1936 Hanley and McNeil 1982 Hastie et al 2001 etc Figure 579 displays Fisher scoresl for each position in the profile top panel Below the mean case profile middle we where pf 15 the mean a a The Fisher score is computed as Fisherh value forthe ith feature in the positive or negative profiles and 0715 the standard deviation x104 good features panel and mean control profile bottom panel are shown Higher values of the Fisher score indicate that the mz posi tion is likely represented differently in all case profiles versus all control profiles These features can be more commonly referred to as 
differentially expressed features Features with the highest Fisher score are very likely good feature candidates These features may be memorized and used as biomarkers in screening additional profiles Thus selecting features is a short matter of computing the Fisher score of each feature each mz value and then eliminating all but the top kfeatures where k is the desired quantity of left over39 featu res A univariate analysis of features ignores possible dependencies between the features For example the value of a feature may be highly correlated with values of other features Because of this many of the features may be redundant and their inclusion offers very little addi tional information towards building a predictive model Figure 5710 illustrates some of these problems It shows 30 out of the top 100 positions that would be selected through Fisher score We see that these 30 features accu mulate Within two areas related to two peak complexes These groups of features are highly correlated and the amount of new discriminative information the duplicates add is relatively small One way to alleviate this problem is by enforcing the maximumallowed correlation MAC threshold on the features selected via univariate analysis Hauskrecht et al unpublished observations This helps to ensure that the feature selection method increases its coverage of the signal and does not miss important infor mation that might otherwise be ignored Focusing on individual positions in the profile may narrow the field of View This is especially concerning if the discriminative signal is weak and noisy One way to alleviate this problem is to look at aggregate features or features that combine the information from many profile positions The intuition is that the useful signal can be 4 Lot2757 qxd 7 Figure 5710 Thirty out of the top 100 positions selected by the Fisher score on the mean cancer profile The positions defined by 7 the markers accumulate on only two peak complexes and many features are 
highly correlated thus carrying minimum additional 211004 2211 Page 656 656 Serum Proteomic Profiling and Analysis 7 6 c 5 c 4 3 c 2 1 a 3200 3300 3400 3500 3600 3700 3800 amplified over many related positions and thus it is less sensitive to random fluctuationsnoise An example of such a transformation is the principal component analysis PCA Jolliffe 1986 lntuitively the PCA identifies ortho gonal sets of correlated features and constructs aggre gate features or components which are uncorrelated yet are responsible for most of the variance in the original data Retaining the variance is important it is often helpful to explore parts of the data where the features spread out across a large space which gives more room to find a deci sion boundary between classes These methods often con struct useful features due to their inherent ability to utilize the entire signal rather than a limited number of positions Another popular solution that attempts to alleviate the problem of a noisy signal assumes that all relevant informa tion is carried by peaks Adam et al 2002 The subsequent discriminative analysis is restricted to identified peaks only andor their additional characteristics Numerous versions of peak selection strategies exist Adam et al 2002 However the utility and advantage of these strategies over other feature selection methods for the purpose of dis criminant analysis has not been demonstrated RESULTS OF INTERPRETIVE ANALYSIS The goal of this section is to illustrate the potential of the SELDlTOF profiling technology on the analysis of the ovarian cancer data from April 2002 Petricoin et al 2002 The dataset2 consists of 200 profiles samples 100 sam ples represent cancer 100 samples are controls We rely on concepts and approaches discussed in previous sections and show that it is indeed possible to build predictive models that can classify with high accuracy test samples taken from the SELDlTOF dataset The predictive model used in all our experiments is the 
support vector machine see above The model builds a linear decision boundary that separates cancer and control samples provided in the training set In all experiments the model is learned using a limited number offeatures 5 25 We try different feature selection approaches The process of finding a good set of features is a highly exploratory process and often remains the bottleneck of the discovery 2n r rrr r r discriminative information Table 571 Predictive statistics for the linear SVM model on the ovarian cancer dataset The features are selected according to the Fisher score criterion The maximum allowed correlation MAC threshold is 08 Test errors range in between 4 and 29 per cent Sensitivities and specificities are between 949 and 976 per cent No of features Testing error Sensitivity Specificity 5 00352 09764 0953 i0 00402 09698 09497 i5 00406 09584 09604 20 00332 0964i 09695 25 00299 09666 09736 Table 571 illustrates the quality ofthe predictive model learned using the top 5 25 mz positions according to the Fisher score criteria To remove highly correlated features we used the maximum allowed correlation MAC thresh old of 08 The table shows the test errors E sensitivities SN and specificities SP for a different number of features All statistics listed are averages over 40 different splits of the data into training and testing sets This assures that the results are not biased due to a single lucky and unlucky traintest split The split proportions are 7030 ie 70 per cent of samples are assigned to the training set and 30 per cent to the test set Using the univariate statistical analysis approach as dis cussed above the results are quite impressive The best result occurs when the classifier is allowed to use the top 25 features selected by Fisher score Under this condition the classification model achieves 966 per cent sensitivity and 9736 per cent specificity On average 299 per cent of the samples seen during the testing phase were misclassified A different 
number of features used can show a tradeoff in the improvement of sensitivity or speci ficity Note that sensitivity is highest when using only five features yet specificity is highest when using 25 Instead of narrowly examining a small number of indi vidual positions we can examine the effectiveness of aggregate features Table 572 illustrates the perfor mance statistics of our classification model using features constructed using PCA The results are included over a range of 5 to 25 principal component features which amplify patterns found in the profile signal Again the resulting statistics are averages over the same 4 Lot2757 qxd 211004 2211 Page 657 Table 572 Predictive statistics for the linear SVM model on the ovarian cancer dataset The features are constructed using princii pal component analysis PCA Test errors range between 499 and 89 per cent Sensitivities and specificities range between 853 and 94 4 per cent 4 Richard Pelikan et al 657 Table 573 Performance statistics of the linear SVM classifier after using peak detection A MAC threshold of 08 was enforced before selecting the top 5725 features using the Fisher score Testing error ranges from 446 to 98 per cent while sensitivity and specificity range from 85 to 934 per cent No offeatu res Testing error Sensitivity Specificity No of features Testing error Sensitivity Specificity 5 04992 08533 07477 5 04468 09452 08508 40 04444 09464 08645 40 04049 09469 0873 45 04078 08998 08846 45 04033 09209 08722 20 00926 09038 0944 20 00984 0934 08689 25 00898 09087 09448 25 04042 0934 08634 4O traintest splits as used in the univariate analysis to allow for a more direct comparison of behavior The predictive performance of our classifier falls when used in conjunction with principal component features The reason may be due to too many independencies between positions in the profile Such a condition causes a problem for PCA which attempts to find signalwide rela tionships between multiple positions If the signalwide 
relationships that do exist are weak, then the benefits from using these features will be minimal. In addition, including many of these features to compensate for their weakness complicates the process of discovering biomarkers. As a final example of feature selection, we illustrate the behavior of the aforementioned peak selection strategy. We refer to a peak as the local maximum over a region in the profile. The signal is smoothed before detecting the maxima on the mean case and control profiles. Peak positions from both mean profiles are then used as the primary set of features. Table 57.3 displays performance statistics of the above model using peak selection prior to selecting the top Fisher score features. Unfortunately, the performance was below what was achieved without the peak selection strategy (see Table 57.1). Testing error is relatively high considering the results presented earlier in Table 57.1. The likely reason is that informative features are found not only on peaks but in valleys as well. Another reason is that the particular behavior of the peak detection algorithm is not optimal. As mentioned before, there exist many methods for performing peak selection, and different criteria for selecting peaks will undoubtedly yield differing results. The results presented above show that it is possible to learn predictive models that achieve a very low classification error on SELDI-TOF samples. To support the significance of these results, in particular the fact that the sample profiles carry useful discriminative signals, one may want to perform additional statistical validation. The goal of one such test, the random class-permutation test, is to verify that the discriminative signal captured by the classifier model is unlikely to be the result of a random case versus control labeling. Figure 57.11 shows the result of the random permutation test for the classifiers analyzed in Table 57.1. The figure plots the estimate of the mean test error one would obtain by learning the
classifier on 5-25 features for randomly assigned class labels, and estimates of the 95 per cent and 99 per cent test error bounds. The estimates are obtained using 400 random class-label permutations of the original ovarian dataset. The results illustrate a large gap between the classification errors achieved on the data and the classification errors under the null (random class-label) hypothesis. This shows that our achieved error results are not a coincidence.

DISCUSSION

This review deals solely with clinical proteomics, but the analysis techniques reported are typically those that could be used in applications to other domains, including immunologic factors determined in the serum (cytokines, chemokines and antibodies) or microarray data/transcription profiling of the peripheral blood. The profiles generated by SELDI-TOF MS are a rich source of information which floats to the surface after careful analysis. Although there are many ways to analyze and evaluate proteomic profile data, a simple framework such as the one presented above serves as a foothold for future data analysis work. Using proper feature selection techniques, proteomic profiling can be a valuable discovery tool for locating protein expression patterns in separate case and control populations. As seen above, by comparing expected generalization results on an unseen testing set, one can evaluate the performance of many feature selection strategies. The resulting classification models each contribute knowledge about the profiles, whether there is success or failure with subsequent test sets. In the case explored above, the peak selection strategy used was not effective due to important information being expressed in the 'valleys' of the profile. When these were taken into account, the predictive ability of the classification model was higher. When features were found to cluster among one another due to correlation related to the proximity of m/z values, decorrelation became an important step in the process. Ultimately, the
classification model was able to obtain a testing error of 3 per cent when using the top 25 m/z intensities ranked by Fisher score and having intra-correlation coefficients less than 0.8.

Figure 57.11 Permutation-based validation. The top solid line indicates the mean achieved classification error (MACE) for the model under the null hypothesis (the class labels in the data are assigned to profiles randomly). The upper dashed line indicates the estimate of the upper 95th percentile of the test error statistic under the null hypothesis; likewise, the lower dashed line indicates the 99th percentile. The bottom solid line indicates the achieved classification error (ACE) using the original data labeling. Only the values for the points marked are computed. (Axes: test error, 0-0.7, versus feature space size, 0-25.)

The m/z values selected through the favored classifier play the most important role in clinical diagnostics. Databases of peptide masses are easily queried with a list of m/z values, which return possible protein sources for those peptide molecules. In this way it is possible to verify the statistically located biomarkers with the support of the biological knowledge surrounding these indicators. Even if identifying the protein is not feasible, simply making a note of the m/z values is often enough. New patients can have their MS spectra generated and these m/z values checked in order to determine the efficacy a treatment will have, the progress a disease has made, or the composition of antibodies during an immune response. Mass spectrometry proteomic profiling is certainly a viable technique for the discovery and detection of biomarkers. Advances in sample preparation and instrumentation are improving the usage of this technology. With continued application and development, the information derived from mass spectrometry studies will contribute substantially to what is currently known
about the role, variation and relative abundance of multiple proteins in the setting of normality and disease. The study of cell biology is for a large part dependent on solutions to many structural problems. Accurate structural characterization of protein peptides at femtomole levels is especially important to immunology and the study of disease. Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) is a technology which can assist biologists in studying the 'structures' of macromolecules in complex and varying situations; the ultimate goal is to utilize this technology fully for determining the function and abundance of peptides as they relate to health and disease. One important potential application of this technology is the identification of peptides/proteins that are associated with immune responses to diseased cells. Peptide or lipid antigens recognized by T cells in the context of MHC molecules serve as a means to identify relative changes, or new proteins or pathogens residing within a cell, serving as a basis for T cell recognition and effector function, including the ability to lyse the cell presenting the nominal antigen. These antigens are presented as protein fragments which can be readily identified by the mass spectrometry technology (Castelli et al., 1995).

REFERENCES

Adam BL, Qu Y, Davis JW et al. (2002). Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res 62, 3609-3614.
Burges C (1998). A tutorial on support vector machines for pattern recognition. Data Mining Knowl Discov 2, 121-167.
Castelli C, Storkus WJ et al. (1995). Mass spectrometric identification of a naturally-processed melanoma peptide recognized by CD8+ cytotoxic T lymphocytes. J Exp Med 181, 363-368.
Duda RO, Hart PE and Stork DG (2000). Pattern classification, 2nd edn. John Wiley and Sons.
Fisher RA (1936). The use of multiple measurements in
taxonomic problems. Ann Eugenics 7, 179-188.
Hanley J and McNeil B (1982). The meaning and use of the area under a receiver operating characteristic curve. Diagnost Radiol 143, 29-36.
Hastie T, Tibshirani R and Friedman J (2001). The elements of statistical learning. New York: Springer-Verlag.
Jolliffe IT (1986). Principal component analysis. New York: Springer-Verlag.
Kleinbaum DG (1994). Logistic regression: a self-learning text. New York: Springer-Verlag.
Patel S and Lyons-Weiler J (2004). caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer. Appl Bioinformat, in press.
Paweletz C, Trock B, Pennanen M et al. (2001). Proteomic patterns of nipple aspirate fluids obtained by SELDI-TOF. Dis Markers 17, 301-307.
Petricoin EF, Ardekani AM and Hitt BA (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572-577.
Petricoin EF and Ornstein DK (2002). Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst 94, 1576-1578.
Scholkopf B and Smola A (2002). Learning with kernels. MIT Press.
Steel LF, Shumpert D and Trotter M (2003). A strategy for the comparative analysis of serum proteomes for the discovery of biomarkers for hepatocellular carcinoma. Proteomics 3, 601-609.
Vapnik VN (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wadsworth JT, Somers K and Stack B (2004). Identification of patients with head and neck cancer using serum protein profiles. Arch Otolaryngol Head Neck Surg 130, 98-104.
Watkins B, Szaro R, Shannon B et al. (2001). Detection of early stage cancer by serum protein analysis. Am Lab 32-36.
Zeindl-Eberhart E, Haraida S and Liebmann S (2004). Detection and identification of tumor-associated protein variants in human hepatocellular carcinomas. Hepatology 39, 540-549.
Zhukov TA, Johanson RA, Cantor AB, Clark RA and Tockman MS (2003). Discovery of distinct protein profiles specific for lung tumors and pre-malignant lung lesions by SELDI mass spectrometry. Lung Cancer 40, 267-279.
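The feature-selection-and-evaluation loop described in this chapter (Fisher-score ranking with a maximum-allowed-correlation filter, averaged over random 70/30 train/test splits) can be sketched roughly as follows. This is a minimal illustration on synthetic data: the dataset, the number of informative positions, and the nearest-centroid classifier (standing in for the linear SVM) are all assumptions made for the sketch, not the authors' actual pipeline.

```python
import numpy as np

def fisher_score(X, y):
    # Fisher score per feature: (mu1 - mu0)^2 / (var1 + var0).
    X0, X1 = X[y == 0], X[y == 1]
    return (X1.mean(0) - X0.mean(0)) ** 2 / (X1.var(0) + X0.var(0) + 1e-12)

def select_features(X, y, k, mac=0.8):
    # Walk features in decreasing Fisher-score order, skipping any feature
    # whose correlation with an already chosen one exceeds the MAC threshold.
    order = np.argsort(fisher_score(X, y))[::-1]
    chosen = []
    for j in order:
        if all(abs(np.corrcoef(X[:, j], X[:, i])[0, 1]) <= mac for i in chosen):
            chosen.append(j)
        if len(chosen) == k:
            break
    return chosen

rng = np.random.default_rng(0)
n, d = 200, 500                          # samples x m/z positions (synthetic)
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))
X[y == 1, :10] += 1.5                    # 10 informative positions for cases

errs = []
for _ in range(40):                      # 40 random 70/30 train/test splits
    perm = rng.permutation(n)
    tr, te = perm[:140], perm[140:]
    feats = select_features(X[tr], y[tr], k=5)
    c0 = X[tr][y[tr] == 0][:, feats].mean(0)   # control centroid
    c1 = X[tr][y[tr] == 1][:, feats].mean(0)   # case centroid
    Z = X[te][:, feats]
    pred = (np.linalg.norm(Z - c1, axis=1) < np.linalg.norm(Z - c0, axis=1)).astype(int)
    errs.append(np.mean(pred != y[te]))

test_error = float(np.mean(errs))        # average misclassification error
```

With clearly informative features, the averaged test error stays small, mirroring the way the chapter reports E/SN/SP as averages over 40 splits rather than from a single lucky split.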
CS 2001 Intro to Research in CS: Machine Learning in Bioinformatics. Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square.

Who am I? Milos Hauskrecht, an assistant professor in the CS department. Secondary affiliations: ISP, CBMI and UPCI. Research: AI - planning, optimization and learning in the presence of uncertainty. Applications:
- Biomedical informatics
- Modeling and control of large stochastic network systems
- Decision-making for patient management
- Anomaly detection in clinical databases

Bioinformatics. Application of CS and informatics to biological and clinical sciences. "Bioinformatics is the field of science in which biology, computer science and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights, as well as to create a global perspective from which unifying principles in biology can be discerned." Machine learning: an area that studies methods and the design of computer programs that let us learn from past experience.

Machine Learning in Bioinformatics. Why do we need machine learning in bioinformatics?
- Large volumes of data generated in medicine: patient records with clinical data; special studies with new high-throughput data sources (DNA microarray data, proteomic profiles, multiplexed Luminex arrays)
Use the data to:
- Learn a model that lets us screen for the presence of a disease in new patients, or predicts a good therapy choice
- Identify disease biomarkers

This talk. Overview of data sources:
1. DNA microarray data
2. SELDI-TOF-MS proteomic data (profiles)
3. Luminex arrays
Basics about the methods of analysis:
- Classification
- Evaluation

Measuring RNA and protein levels. DNA -> RNA -> Protein. Transcription produces mRNA (messenger), rRNA (ribosomal) and tRNA (transfer); translation takes place at the ribosome. Differences in normal
and disease cells show up at the protein level.

DNA microarrays. Small solid supports onto which sequences or subsequences from thousands of different genes are attached. The supports are usually glass microscope slides, silicon chips or nylon membranes. The DNA is printed, spotted or actually synthesized directly onto the support. The spots can be DNA, cDNA or oligonucleotides.

Extracting the information from microarrays. Hybridization (probing): a technique that uses fluorescently labeled nucleic acid molecules as "mobile probes" to identify complementary molecules - sequences that are able to base pair with one another - on the microarray chip. A single-stranded DNA fragment is made up of four different nucleotides, adenine (A), thymine (T), guanine (G) and cytosine (C), that are linked end to end. Pairs (complements): adenine (A) with thymine (T); guanine (G) with cytosine (C). Two complementary sequences always find each other and lock together, or hybridize: the immobilized target DNA on the DNA microarray hybridizes with the mobile probe (DNA, cDNA or mRNA).

Extracting the information from microarrays. Assume two types of cells:
- Normal cells
- Cancer cells
Extracted mRNA for each type is labeled with a different fluorescent dye (Cy3, Cy5), combined in equal amounts and hybridized on the microarray. Scan the array: the color tells whether one cell type (e.g. the diseased tumor sample) is over-expressed as compared to the other.

Sources of variation. The signal we obtain from a microarray is not perfect: if we take two microarrays for the same samples we see a lot of variation. Why?
- Technical artifacts: uneven spotting
- Variation in RNA purity
- Different labeling efficiencies of fluorescently labeled nucleotides
- Variations in expression level measurements (scan)
Preprocessing:
- Attempts to remove unwanted sources of variation while preserving the useful information
- Eliminates systematic biases

Preprocessing. Cy3 and Cy5 are incorporated into DNA with different efficiencies:
- Genes that are expressed at comparable levels come with a different than 1:1 dye mixing
- This creates a background signal
Example solution methods:
- Global normalization: assure that the expression level ratio for the array is 1:1 on average
- Housekeeping genes: genes that are known to be stable across different conditions (e.g. beta-actin); correct the signal using the mean of expressions at the housekeeping genes

Preprocessing. Example: correct the slope so that it is 1:1 on average.

Preprocessing. Example: loess normalization (locally weighted polynomial regression). Traditional analysis relies on log-ratios M = log(R/G) and averages A = (1/2) log(RG) as the primary data; M makes gene expression measurements relative.

Mass spectrometry proteomic profiling. Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS): a technology recently developed by Ciphergen Biosystems. A mass profile of a sample (serum, urine, cell lysates) in the range of 0-200K Daltons. A profile is believed to provide a rich source of information about a patient. Potentially useful for:
- Detection of a disease: many promising results on various types of cancer
- Discovery of disease biomarkers: identify the species (proteins/peptides) responsible for the differences

SELDI-TOF MS. Laser, lens, mirror, vacuum tube, ion detector, spectral data. Ions with the smallest mass fly the fastest and are detected first; the spectrum reflects the mass of the proteins, peptides and nucleic acids in the sample (serum, cell lysates, urine).

Importance of SELDI-TOF MS data. SELDI profiles
are believed to be useful for:
- Detection of a disease: promising results on various types of cancer
- Discovery of disease biomarkers: identify the species (proteins/peptides) responsible for the differences

Challenges in MS data analysis. Many sources of variation:
- Sample collection, sample storage and sample processing
- Instrument conditions; natural variation in protein expression levels among individuals
Example: profiles of two patients with pancreatic cancer.

Challenges in data analysis. Three types of systematic instrument errors:
- Baseline shift
- Intensity measurement error
- Mass inaccuracy
Example: two profiles for the same reference serum.

Baseline shift. A systematic error where intensity measurements differ from 0. Example: two profiles of the same pooled reference (QA/QC) serum.

Intensity measurement errors. Intensity measurements fluctuate randomly. Example: two profiles of the same pooled reference (QA/QC) serum; higher intensity values come with a higher variance.

Mass inaccuracy. Misalignment of readings for different m/z values. Example: two profiles of the same pooled reference (QA/QC) serum show a clear phase shift.

Profile preprocessing. The aim is to remove the noise and the systematic biases in the signal while preserving the useful information. Typical preprocessing steps:
- Smoothing: eliminate the noise in the signal
- Calibration/rescaling: eliminate differences in intensity among profiles
- Profile transformations (variance reduction): reduce the multiplicative noise
- Baseline correction: reduce the systematic baseline error
- Profile alignment: correct for mass inaccuracy

Preprocessing: variance stabilization. The aim is to eliminate the dependency of the noise on the intensity of the signal.
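The preprocessing steps just listed can be sketched in a few lines. This is a minimal illustration on a synthetic profile: the window sizes, the moving-average smoother, and the rolling-minimum baseline estimate are assumptions made for the sketch, not the actual algorithms used by any particular tool.

```python
import numpy as np

def stabilize(profile):
    # Cube-root transform: tames the intensity-dependent (multiplicative) noise.
    return np.cbrt(profile)

def smooth(profile, window=11):
    # Moving-average smoothing to suppress high-frequency noise.
    kernel = np.ones(window) / window
    return np.convolve(profile, kernel, mode="same")

def correct_baseline(profile, window=101):
    # Estimate the baseline as a rolling minimum and subtract it,
    # so off-peak readings sit near zero.
    half = window // 2
    padded = np.pad(profile, half, mode="edge")
    baseline = np.array([padded[i:i + window].min() for i in range(len(profile))])
    return profile - baseline

# Synthetic SELDI-like profile: drifting baseline, one peak, multiplicative noise.
rng = np.random.default_rng(1)
mz = np.arange(5000)
raw = 50 + 0.01 * mz
raw = raw + 200 * np.exp(-((mz - 2500) ** 2) / 200)
raw = raw * rng.normal(1.0, 0.05, size=mz.size)

clean = correct_baseline(smooth(stabilize(raw)))
```

After the pipeline the baseline sits at zero while the peak location is preserved, which is the point of separating variance stabilization, smoothing, and baseline correction into distinct steps.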
Example transforms used in variance stabilization:
- Log transform
- Square-root transform
- Cube-root transform (appears to be the best)

Preprocessing: smoothing. The aim is to eliminate high-frequency noise in the signal. Threat: a loss of information carried in the high-frequency part of the signal.

Preprocessing: baseline correction. The aim is to eliminate a systematic intensity bias by which the profile readings differ from 0.

Disease detection. How do we use the high-throughput information about gene expression levels or protein/peptide expressions?
- Detection of a disease: many promising results on various types of cancer
- Discovery of disease biomarkers: identify the species (proteins/peptides) responsible for the differences
Here: the problem of building a classification model that is capable of determining, with a high accuracy, the presence or absence of the disease in future patients.

Disease detection. Objective: build a classification model that is capable of determining, with a high accuracy, the presence or absence of the disease in future patients. Typical steps:
1. Data preprocessing
2. Feature selection
3. Building of a classification model
4. Evaluation
5. Validation

Analysis of expression profiles. Remove (preprocess) the systematic instrument biases; identify features relevant for building a classification model; learn a classification model; determine the quality (accuracy) of the classification model; check the reproducibility of the detection results on independent datasets.

Detection of a disease. Objective: find a classification model that assigns, with a high accuracy, the correct disease/no-disease labels to profiles generated for individual patients. Profile -> Classifier (model) -> Class label (case or control). The approach: use past data to
learn a classification model. A variety of machine learning approaches exist:
- Logistic regression
- Linear and quadratic discriminant analysis
- Classification and regression trees (CART)

Learning a classification model. Example: a dataset with 2 features (e.g. expression levels of two genes). We learn the decision rule (a decision boundary between case and control) that achieves the best separation between case and control samples.

Learning a classification model. Logistic regression and linear SVM methods give a linear decision boundary.

Non-linear boundaries. A linear separator in the input space versus a non-linear separator in the input space.

Decision trees. An alternative approach to what we have seen so far:
- Partition the input space into regions
- Classify independently in every region

Decision trees. The partitioning idea is used in the decision tree model:
- Split the space recursively according to the inputs in x
- Regress or classify at the bottom of the tree
Example: binary classification (0/1), binary attributes x1, x2, x3.

Evaluation framework. We want our classifier to generalize well to future examples. Problem: we do not know future examples. Solution: evaluate the classifier on a test set that is withheld from the learning stage.

Evaluation framework. Split the data into a training set and a test set: learn on the training set, evaluate on the test set.

Evaluation metrics. Confusion matrix: records the percentages of examples in the testing set that fall into each (predicted, actual) group; for example, predicted case/actual case 0.3, predicted case/actual control 0.1, predicted control/actual case 0.2, predicted control/actual control 0.4. Derived statistics:
- Misclassification error: E = (FP + FN) / (TP + TN + FP + FN)
- Sensitivity: SN = TP / (TP + FN)
- Specificity: SP = TN / (TN + FP)

Evaluation. To limit the effect of one lucky or unlucky train/test split, PDAP supports averaging through:
- bootstrap
- random sub-sampling
- k-fold cross-validation
Example:
random sub-sampling.

Validation of the predictive signal via PACE. Additional assurance on the existing model: is the signal real? Does it differ significantly from a random process? Permutation test: use a random re-labeling of the profiles (classes) and evaluate the classifiers learned on such data. Idea: reject the null hypothesis that the achieved classification error (ACE) on the true data is obtained purely by chance.

Validation. Keep an independent test (validation) set aside:
- Taken out from the original data at the beginning (the researcher holds the "decoder ring")
- New data from the follow-up study
- New data from a different site

Dimensionality problem. One important issue before we can learn a classifier: the dimension of the profile for proteomic data is very large (~60,000 m/z measurements), so a large number of model parameters must be learned (estimated), while the total number of cases in the study is relatively small (typically < 100). This leads to a large variance of parameter estimates: overfit.

ML dimensionality reduction. Goal: identify a small subset of features that achieves a good discriminatory performance. Solution: apply ML dimensionality reduction methods and their combinations. Feature selection:
- Filter methods
- Wrapper methods
- Regularized classifiers
Feature aggregation:
- Clustering
- Heuristics

Feature filtering. Solution: identify a small set of features that are good individual discriminators. Univariate scores in PDAP:
- Fisher score
- t-test score
- Wilcoxon rank-sum score
- etc.
Univariate scores can be ordered relative to each other and are often related to some statistic (p-value). Features are picked using a variety of criteria:
- Best k features
- p-value or false discovery rate (FDR) thresholds

Differentially
expressed features. Is a specific profile position (e.g. the intensity at 5890 Da) a good individual discriminator between case and control? How do we measure a separation?

Differentially expressed features. Example: measures based on the t-test, used in statistics for hypothesis testing. H0 (null hypothesis): the mean of one group (cases) is not different from the mean of the other group (controls). H1 (alternative hypothesis): the means differ. The t-test checks whether we can reject the null hypothesis. It relies on the t-statistic, a normalized distance measurement between the two populations,

t = (mean_1 - mean_2) / sqrt(s_1^2/n_1 + s_2^2/n_2),

which follows a Student distribution; from it we can compute a p-value (the significance for rejecting the null).

Differentially expressed features. Solution: pick only features that are differentially expressed, i.e. good individual discriminators for case versus control. Scores based on univariate statistical analysis:
- Fisher score and its variants
- AUC (area under the ROC curve)
- t-test score (Long and Baldi)
- Wilcoxon rank-sum score
- etc.
Additional criteria may control significance levels and the false discovery rate (Benjamini and Hochberg).

Differentially expressed features. A score based on the Wilcoxon rank-sum test (pancreatic cancer data): the slides plot the score along the m/z axis (roughly 1000-11,000 Da); positions with high scores are candidate discriminators.
Multivariate feature selection. Univariate methods: each feature (profile position) is considered individually; this may not be sufficient to capture the differences in disease/control variability. Multivariate methods: a biomarker panel, i.e. more than one feature, is needed to define the differences between case and control; subsets of features need to be considered.

Multivariate feature selection. Example of a multivariate method: wrapper methods use a classification model that is tried with different subsets of features. Idea:
1. Learn a classifier on a subset of samples in the training set
2. Obtain a measure of accuracy on the remaining training samples (internal test set)
3. Repeat 1 and 2 multiple times using cross-validation schemes
4. Pick the sets of features leading to the best average cross-validation measure

Wrapper methods. The problem with wrapper methods: combinatorial optimization. To find the best combination of k features they need to evaluate C(n, k) configurations, and recall that n is huge. Solutions: greedy wrappers (pick features one by one, greedily); apply heuristics to make the search more efficient.

Conclusions. High-throughput data sources measure the expression of genes, proteins and their fragments:
- Potentially useful information for disease detection
- Biomarker (biomarker panel) discovery
Problems: many sources of variability:
- Steps need to be taken before the profiles can be analyzed
- Data collection, storage and sample processing: confounding and
reproducibility is an issue.
Multivariate analysis:
- More than one feature (gene/protein expression level) is used to detect the disease
Encouraging: good results on many diseases.

Memory Hierarchy Design Issues in Many-Core Processors. Sangyeun Cho, Dept. of Computer Science, University of Pittsburgh.

EE Times news clipping: "Intel ups ante with teraflop programmable processor." The design consists of 80 tiles laid out in an 8-by-10 block array; each tile includes a small core, or compute element, with a simple instruction set (not Intel-compatible), plus a network that links all the cores to each other and gives them access to memory.

Multicores are here: IBM Power5, SUN UltraSPARC IV, SUN UltraSPARC T1.

Tomorrow's processors: tiled organizations with distributed L2 slices.

Technology/application trends; potential problems/constraints. Discussions are based on ITRS 2001/2003/2005, Intel's Platform 2015 whitepapers, S. Borkar's MICRO 2004 keynote presentation, and other references.

Moore's Law. 2300 transistors in the Intel 4004 (1971); 276M transistors in Power5 (2003); 1.7B transistors and 24MB L3 cache in Intel Montecito (2005). 2016 forecast by ITRS 2005: over 3B transistors at 22nm technologies; a 40GHz local clock is infeasible. Single-core OoO/VLIW scalability is limited; multicore is the result of natural evolution: building a processor with MANY transistors.

Power trend. [Chart: power grows across process generations; unconstrained power would far exceed the maximum allowed envelope.]
- Max allowed power: faster clock frequency and increased leakage power push against it
- Power density (W/cm2): related to temperature; becoming more critical; performance and reliability issues

Bandwidth trend. Today's processor bandwidth is 2~20GB/s. Limited by:
- I/O pins
- Bandwidth per pin
Bandwidth drivers:
- Number of processors
- Faster clock frequency
Two fronts: off-chip bandwidth and on-chip interconnect bandwidth.
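The transistor counts quoted on the Moore's-law slide imply a doubling period that is easy to check; a quick sketch using only the slide's numbers:

```python
import math

# Transistor counts quoted on the slide.
counts = {1971: 2_300, 2003: 276_000_000, 2005: 1_700_000_000}

def doubling_period(y0, y1):
    # Years per doubling between two (year, count) data points.
    doublings = math.log2(counts[y1] / counts[y0])
    return (y1 - y0) / doublings

p = doubling_period(1971, 2005)   # 4004 -> Montecito
```

The result lands close to the classic "doubling every couple of years" formulation, which is why the slide treats multicore as the natural way to spend the transistor budget.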
On-chip wire speed. Scaling leads to faster devices (transistors); scaling, however, leads to slower global wires (increased RC delay). Possible implications:
- Simpler processor cores
- On-chip switched network
- Non-uniform memory access latency

Yield and reliability issues. Errors due to variations (e.g. VTH variation) are runtime dependent; reserving larger margins means lower yield. Traditional test methods are not enough: burn-in/IDDQ is less effective. Time-dependent device degradation: SNM degradation (~3%/yr) in SRAM due to NBTI; electromigration; TDDB; soft error FIT: ~8x degradation per bit per generation.

Application requirements. Applications' performance demand is growing: RMS applications (Intel's term: Recognition, Mining, Synthesis); more multimedia applications (games, animations). Pradip Dubey (Intel): "The era of tera is coming quickly. This will be an age when people need teraflops of computing power, terabits of communication bandwidth and terabytes of data storage to handle the information all around them."

Issues summary. We must keep scaling performance (Moore's law).
- Power consumption: every component design must reconsider power consumption; thermal management is a must but not sufficient; design/software methods for low temperature are further needed
- Off-chip/on-chip bandwidth requirement: high-speed/low-power I/O; larger on-chip memory (e.g. L2); package-level memory integration may become more interesting
- Wire delay dominance: smaller cores; non-uniform memory latency (i.e. hierarchy at the same level)
- Yield/reliability: microarchitectural provisions for yield/reliability improvement are a must; dynamic self-test/diagnosis/reconfiguration/adaptation

Memory hierarchy design considerations. Reduce traffic and power: off-chip/on-chip traffic is ~20% of total power consumption; off-chip traffic is primarily determined by on-chip capacity; on-chip traffic is determined by data location. Are there redundant accesses? Improve flexibility: data placement in L2; cache line/set/way isolation; help from the OS is needed. It doesn't assume
non-uniform memory latency in uniprocessors (is a multicore a uniprocessor?).

Remaining topics:
- An L1 cache traffic reduction technique
- L1 cache performance sensitivity to faults
- A flexible L2 cache management approach

Macro Data Load: An Efficient Mechanism for Enhancing Loaded Value Reuse. L. Jin and S. Cho, ACM Int'l Symp. on Low Power Electronics and Design (ISLPED), Oct. 2006.

Motivation. The L1 cache is essential for performance, traffic reduction and power; all high-performance processors have both an icache and a dcache. Energy consumption ~ N_mem x E_cache + N_miss x E_miss; usually N_miss << N_mem and E_cache << E_miss. Conventional approaches:
- Reduce N_miss: victim cache, highly set-associative cache
- Reduce E_cache: filter cache, cache sub-banking
- Reduce E_miss
Can we reduce N_mem?

L1 traffic reduction ideas.
- Store-to-load forwarding: usually needed for correctness in an OoO engine; implemented in the LSQ; design the pipeline in such a way that the cache is not accessed if the desired value is in the LSQ.
- Load-to-load forwarding (loaded value reuse): a loaded value may be necessary again soon; use a separate structure or the LSQ.
- Silent stores: stores that write the same value again; identify, track and eliminate silent stores (Lepak and Lipasti, ASPLOS 2002).

Store-to-load forwarding. [Pipeline diagram: load/store units, dual-ported data cache, load-store queue (LSQ), data shift and mask logic.] Basic idea: stores are kept in the load-store queue (LSQ) until they are committed; a load dependent on a previous store may find the value in the LSQ. Often a load accesses the LSQ and the cache together for higher performance; one can redesign the pipeline so that the LSQ is looked up before the cache is accessed. How to deal with the performance impact?

Load-to-load forwarding. [Similar diagram, with a data alignment logic stage.] Basic idea: loaded values are kept in the load-store
Queue LSQ A load targeting a value previously loaded may find the value in LSQ 0 Related work Nicolaescu et al ISLPED 2003 MCAST Macro Data Load Id byte 0x100 Id byte 0x100 macro 0x100 lno overlapping I macro data loaded no reuse loaded value reused Id byte 0x102 Id byte 0x102 8 Id word 0x100 Id word 0x100 macro 0x100 I different type I macro data loaded no reuse loaded value reused Id byte 0x102 Id byte 0x102 b o G oa I Maximize loaded value reuse 0 Id ea Bring full data 64 bits regardless of load size Keep it in LSQ Use partial matching and data alignment 0 Essentially we want to exploit spatial locality present in cache line CAST 10 Macro Data Load cont d Effective Addresses Dgt LoadStore U 39t b rquot S DualPorted I 93 O a gt Data Cache Result Buses Stores 3D Load Store 039 Queue LSQ D V c Reuse Data Cache Load Data Data Alignment Logic a 0 Architectural changes Relocated data alignment logic Sequential LSQcache access 0 Net impact LSQ becomes a small fully associative cache with FIFO replacement MCAST Macro Data Load cont d source size 11 LSQ entry source Size load Size source addrO 39 load addrO source addr1 39 load addr1 00 byte source ader 01 half load addr2 Entry 10 word source addr3 hit 11 dword or macro data load addr3 O O V P SL Address Size 7 source addr30 load addr30 source addr31 I load addr31 Load Hit V address P a b 0 Architectural changes Relocated data alignment logic Sequential LSQcache access 0 Net impact LSQ becomes a small fully associative cache with FIFO replacement CAST Idealized Limit Study Memory Value Reuse Table 64 Cache stores 0 loads Processor Core MVRT Memory Value Reuse Table N entries parameter Tracks storetoload 52L loadtoload L2L and macro data load ML 0 Simple idealized processor model No branch misprediction singleissue pipeline Overall Result 100 90 80 quot Lc In 70 E H 33 quotIn quotI quot39I quot39I LIE L LL33 60 Nb 50 0 30 20 E lll IIIIII 10 quotHE 1 l l i I w t H l l l 11quot i E l 1 0 6 to to to f 0 lt9 97 63 o 00 l o a 
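The loaded-value-reuse idea can be made concrete with a few lines of code. The sketch below is a simplified, hypothetical model of the MVRT limit study, not the paper's implementation: the class name, hit rule, and default size are illustrative. Every load miss brings in the aligned 64-bit "macro" word; a later load of any size whose bytes fall entirely inside a resident macro word counts as reuse.

```python
from collections import OrderedDict

class MacroReuseTable:
    """FIFO table of aligned 8-byte 'macro' words, modeling macro data load reuse."""
    def __init__(self, entries=256):
        self.entries = entries
        self.table = OrderedDict()   # macro-word base address -> present
        self.hits = self.accesses = 0

    def load(self, addr, size):
        """Record a load of `size` bytes at `addr`; return True if served from the table."""
        self.accesses += 1
        base = addr & ~0x7           # align down to the containing 64-bit word
        # A hit requires every byte of the load to fall inside one resident macro word
        if base in self.table and (addr + size) <= base + 8:
            self.hits += 1
            return True
        # Miss: fetch the full 64-bit word regardless of load size (macro data load)
        self.table[base] = True
        if len(self.table) > self.entries:
            self.table.popitem(last=False)   # FIFO replacement, like the LSQ
        return False

mvrt = MacroReuseTable(entries=256)
mvrt.load(0x100, 1)          # ld byte 0x100: miss, macro word 0x100 brought in
hit = mvrt.load(0x102, 1)    # ld byte 0x102: served from the macro word -> True
```

One simplification worth noting: a load that straddles two macro words (e.g., 4 bytes at 0x106) misses in this model, whereas a fuller design could check both neighboring entries.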
- Assuming a 256-entry buffer (the maximum size in our study), up to over 70% of accesses are redundant
- Most programs have significant reuse opportunities
- In certain cases (e.g., wupwise) the reuse distance is short and the data footprint is small
- ML consistently boosts loaded value reuse: 40-60% in CINT and MiBench

Load Size Mix
[Chart: load size distribution per benchmark]
- CINT2k: many word (32-bit) accesses
- CFP2k: relatively frequent long-word (64-bit) accesses
- MiBench: more frequent half-word (16-bit) and byte (8-bit) accesses

Per-Type Reuse
[Chart: reuse rate by access type]
- 8- and 16-bit macro data reuse is high

On Different Machine Width
- So far we considered running a 64-bit binary on a 64-bit machine
- Consider two further cases: running a 32-bit binary on a 32-bit machine, and running a 32-bit binary on a 64-bit machine

Sensitivity to Buffer Size
[Charts: reuse with MVRT sizes of 16, 32, and 64 entries]

Performance Study Setup
- Processor: 4-issue OoO with a 64-entry queue
- Caches: L1 32kB, 2-way, 64B lines, 2-cycle access; L2 2MB, 4-way, 128B lines, 10-cycle access
- Memory: 120-cycle latency
- Studied models
  - Traffic-optimized pipeline: access the LSQ before the cache
  - Performance-optimized pipeline: access the cache and the LSQ simultaneously

Traffic Reduction
[Chart: load and store traffic under S2L, L2L, and ML per benchmark]
- Significant reduction in load traffic
- Lost opportunities: branch mis-speculation, LSQ drain, OoO execution (simultaneous LSQ accesses)

Energy Reduction
[Chart: energy under S2L, L2L, and ML]
- LSQ energy should be considered; previous works didn't
- Up to 35% (MiBench) energy reduction

Performance Impact
[Chart: performance under S2L, L2L, and ML]
- Performance impact should be minimized: increased execution time means unwanted energy consumption
- ML's performance impact is minimal

Summary
- L1 cache traffic reduction techniques evaluated
- ML (macro data load) outperforms the previously proposed S2L and L2L schemes
- Smart traffic reduction techniques at all levels of the memory hierarchy will become more and more important

Assessing the Impact of Defects in Cache Memory
H. Lee, S. Cho, and B. R. Childers, submitted to 2nd Workshop on Architectural Reliability (WAR), Sep. 2006

Motivation
- Extreme technologies suffer
  - Process variation: V_TH, leakage
  - Lifetime reliability: EM, NBTI, TDDB
  - Deteriorated testing capability: IDDQ, burn-in
- Yield may stagger without large design and manufacturing margins
  - Burden on keeping Moore's law; profitability threatened
- Resilient designs will become critically important
  - Faults are masked; graceful performance degradation

Our Approach
- Our focus is on the microarchitecture and the system
- Examine simple "delete" schemes (e.g., cache lines, sets, ways): predictable performance degradation, low cost
- Explore "remap" schemes, still simple: share existing resources on demand; slight performance degradation, small cost
- Devise system-level management schemes: deleting and remapping decisions made intelligently, with little/no hardware addition; synergistic architectural and system collaboration

CAFE
- CAFE: Cache Fault Evaluation tool set
- Components: cache profiler (program, input/config data), systematic and random fault map generators, analytical performance estimator, performance simulator
[Figure: evaluation flow; a fault map generator feeds the analytical performance estimator and the performance simulator; example fault maps mark faulty lines per set and way with an X]
[Charts: performance degradation under various fault maps]
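The fault-map side of a flow like CAFE can be approximated with a short script. The sketch below is a guess at the idea under stated assumptions, not the actual tool: the function names, the per-line fault model, and the 5% fault rate are all made up for illustration. It generates a random fault map for a set-associative cache and reports what a simple line-delete scheme loses.

```python
import random

def make_fault_map(num_sets, ways, fault_rate, seed=0):
    """Random fault map: fault_map[s][w] is True when that cache line is faulty."""
    rng = random.Random(seed)
    return [[rng.random() < fault_rate for _ in range(ways)]
            for _ in range(num_sets)]

def delete_scheme_stats(fault_map):
    """Line-delete scheme: every faulty line is simply disabled."""
    num_sets, ways = len(fault_map), len(fault_map[0])
    faulty = sum(sum(row) for row in fault_map)
    dead_sets = sum(1 for row in fault_map if all(row))  # sets with no usable way
    return {"capacity_loss": faulty / (num_sets * ways),
            "dead_set_fraction": dead_sets / num_sets}

# 128 sets x 4 ways of 64B lines (a 32kB L1), with an assumed 5% per-line fault rate
stats = delete_scheme_stats(make_fault_map(128, 4, 0.05))
```

Capacity loss grows linearly with the fault rate, while fully dead sets (which miss on every access) stay rare at low rates; this is one way to see why simple deletion degrades performance predictably and why remapping only needs to rescue the few heavily damaged sets.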
Summary
- We built CAFE, a tool for studying the cache reliability/performance trade-off
- We characterized the performance impact of simple delete schemes
- We developed more sophisticated, still simple, cache remap schemes; results will become available soon
- Future directions: L2 cache protection techniques, interconnection network switch protection, directory (cache coherence mechanism) protection schemes

A Flexible L2 Cache Management Approach for Future Multicore Processors
S. Cho and L. Jin, IEEE/ACM Int'l Symp. Microarchitecture (MICRO), Dec. 2006
L. Jin, H. Lee, and S. Cho, ACM Workshop on Memory Systems Performance & Correctness (MSPC), Oct. 2006

L2 Cache Basics
- Physically indexed, physically tagged: the physical address determines the location within the cache
- In the range of 256kB-2MB in a single core; 64B-256B cache line size; 4-8-way set-associative
- OS approaches to conflict miss reduction in the L2 cache: page coloring, bin hopping

CMP L2 Cache Management
- Tile-based multicore
- Conventional L2 cache management strategies
  - Treat each cache slice as private to a core: "private" design
  - Treat all cache slices as shared by all cores: "shared" design (Non-Uniform Cache Architectures)
  - Hybrid private/shared design: "hybrid" design

Private Caching Scheme
- Assume L1 is private; on an L1 cache miss:
  - Always access the local L2 slice; if hit, be happy
  - If miss, go to the directory and see if the wanted data is on-chip; perform a coherence action and get the data (if the data is missing, go to main memory and update the directory)
- Pro: low average hit latency ("data attraction")
- Con: low on-chip hit rate, since each processor core has limited caching space
- Con: complex coherence protocol (replication, unknown data location)

Shared Caching Scheme
- Assume L1 is private; on an L1 cache miss:
  - Determine which cache slice to access; usually the slice id is derived directly from the address (e.g., cache line address mod number of slices)
  - Access the data (if the data is missing, go to main memory)
- Pro: low on-chip miss rate, thanks to fine distribution of data onto the available cache slices
- Pro: simple and efficient coherence protocol, since the data location is deterministic
- Con: average hit latency suffers, since the wanted data may sit in a cache slice far off
- Current-generation processors use a shared cache model: IBM POWER, Sun Microsystems Niagara, Intel Core Duo

Other Variants
- Clusters of private caches: each private cache cluster comprises multiple shared caches (Huh et al., ICS 2005)
- Victim replication, based on a shared design: victims from L1 are copied to the local L2 slice (Zhang and Asanovic, ISCA 2005)
- Cooperative caching, based on a private design: limit the degree of sharing, evict globally unused blocks, exploit cache-to-cache transfers (Chang and Sohi, ISCA 2006)

Proximity vs. Miss Rate
- Different schemes make different trade-offs between proximity and on-chip miss rate
- Private cache: favors proximity
- Shared cache: favors miss rate
- Hybrid/variant schemes: attempt to achieve both good proximity and a low miss rate

What About Scalability/Flexibility?
- Future processors may include 100s of cores and 100s of cache slices
- Private cache: scaling will not help a single program; it won't get more capacity
- Shared cache: more caches increase overall capacity, but average latency increases
- How can we manage so many cache slices? Performance? Reliability?

Our Goal
- Provide a flexible scheme
  - Good proximity (like a private cache)
  - Good on-chip hit rate (like a shared cache)
  - Cache slice isolation: unnecessary slices can be turned off; unreliable slices should be pulled out
- Architectural support: mapping information lookup and maintenance; a lightweight performance monitor
- System implementation: production-quality OS, full-system simulation environment

Revisiting a Shared Design
- The L1 cache is private; coherence information is distributed in the L2 cache tags; each L2 tag entry keeps a bit map of the nodes keeping the data
- The L2 cache slices form a logical single cache
- Data-to-cache-slice mapping at cache line granularity: a very balanced way

Changing Granularity
[Figure: memory blocks mapped to slices at page granularity]
- Benefit of mapping at page granularity: the OS can map a program's virtual pages to decide data location
- Mapping is created at page allocation time
- The mapping information size is more manageable
- Data access behavior (e.g., sequential access) is preserved

Carrying Mapping Information
[Figure: three mechanisms driven by the missed physical address]
- (a) Simple bit selection: fixed bits of the address select the node number
- (b) Region-based method: (Addr_min, Addr_max, NODE) registers select the node number
- (c) Page table/TLB method: each TLB entry carries (VPN, PPN, NODE)

Program/Data Proximity
[Figure: three placements; case 1: avg distance 3, case 2: avg distance 2.5, case 3: avg distance 2]
- Program and data locations determine the minimum latency needed to bridge them
- Page spreading: borrow cache space from nearby neighbors

Cache Pressure
- We don't want to overload a single slice
- Cache pressure = (number of active pages x page size) / cache slice size
- A performance monitor to estimate cache pressure is necessary

Virtual Multicore
[Figure: 16 tiles partitioned into virtual multicores G0-G3, e.g., G2 = {T8, T9, T12, T13} and G3 = {T10, T11, T14, T15}]
- Cluster processors and cache slices; processors in a VM share their cache slices
- Coherence and data transfer traffic are contained within a VM

Single Program Performance
[Charts: relative performance of the mapping schemes per benchmark]
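The "simple bit selection" mapping and the proximity metric from the slides above can be combined in a few lines. This is a hypothetical sketch: the 4x4 mesh, 4kB pages, and Manhattan hop count are assumptions chosen for illustration, not details from the papers.

```python
PAGE_SHIFT = 12          # 4kB pages (assumed)
GRID = 4                 # 4x4 tile grid, i.e., 16 cache slices

def slice_of(paddr, num_slices=GRID * GRID):
    """Simple bit selection: low bits of the physical page number pick the slice."""
    return (paddr >> PAGE_SHIFT) % num_slices

def distance(tile_a, tile_b, grid=GRID):
    """Manhattan hop count between two tiles on the mesh."""
    ax, ay = tile_a % grid, tile_a // grid
    bx, by = tile_b % grid, tile_b // grid
    return abs(ax - bx) + abs(ay - by)

def avg_distance(core_tile, page_addrs):
    """Average hops from a core to the slices holding its pages (the proximity metric)."""
    return sum(distance(core_tile, slice_of(p)) for p in page_addrs) / len(page_addrs)

# 16 consecutive pages spread round-robin over all 16 slices
pages = [0x10000 + i * 0x1000 for i in range(16)]
```

With pages spread over all 16 slices, a corner core (tile 0) averages 3 hops and an inner core (tile 5) averages 2, numbers that happen to line up with the figure's case-1 and case-3 averages; restricting page spreading to nearby slices is what pushes the average lower still.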
Under Different Traffic Load
[Charts: performance under different traffic loads]

Parallel Workloads
[Charts: performance on parallel workloads]

Deleting Cache Slices
[Chart: relative L2 cache latency as slices are deleted]
- Cache slices may need to be deleted for reliability (e.g., faults) or power reasons
- A conventional design suffers; our approach can simply avoid using certain cache slices

Summary
- L2 cache management becomes important in future multicore processors: many cores, many cache slices
- Current hardware-oriented techniques are less effective in a many-core situation
- We developed an OS/microarchitecture framework for flexible L2 cache management
  - Capitalizes on a simple shared cache organization
  - Page-level data-to-cache-slice mapping: use page allocation for node allocation
  - Adjustable proximity vs. on-chip cache miss rate trade-off
  - Cache slice isolation is trivial

Research Questions
- OS page mapping and scheduling algorithms: for best performance, for lowest power
- Incorporating page coloring techniques: allocation is now 2-dimensional (node allocation and color assignment)
- A very lightweight cache performance monitor
- Cache data replication and migration techniques: read-only data replication is relatively simple; what about writeable data?

Homework
- Write a quicksort program and run it
- Start from 100 integers in the order 100, 99, ..., 2, 1; run quicksort to sort them into 1, 2, ..., 99, 100
- Your job is to estimate the number of memory accesses your program makes
- Refer to the description on the CS 2001 course website
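For the homework, one possible approach (a sketch, not the required solution) is to wrap the array in a small counting class so that every element read and write is tallied; note this counts only array accesses, not stack or local-variable traffic.

```python
class CountingArray:
    """List wrapper that counts element reads and writes (array memory accesses)."""
    def __init__(self, data):
        self.data = list(data)
        self.reads = self.writes = 0

    def __getitem__(self, i):
        self.reads += 1
        return self.data[i]

    def __setitem__(self, i, v):
        self.writes += 1
        self.data[i] = v

    def __len__(self):
        return len(self.data)

def quicksort(a, lo=0, hi=None):
    """In-place quicksort using the last element as the pivot (Lomuto partition)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    quicksort(a, lo, i - 1)
    quicksort(a, i + 1, hi)

arr = CountingArray(range(100, 0, -1))   # 100, 99, ..., 2, 1
quicksort(arr)
print(arr.data[:3], arr.data[-1])        # [1, 2, 3] 100
print("array accesses:", arr.reads + arr.writes)
```

Since the input is reverse-sorted and the pivot is the last element, this run hits quicksort's worst case, so the access count is noticeably higher than for random input; comparing the two is a useful sanity check on your estimate.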