Statistical Bioinformatics STAT 5570
Utah State University
This 38-page set of class notes was uploaded by Anita Hettinger on Wednesday, October 28, 2015. The notes belong to STAT 5570 at Utah State University, taught by John Stevens in Fall.
Ensemble Methods: Bagging & Boosting
Xi Zhang, Bioinformatics, Spring 2009, Utah State University

Outline
- Introduction
- Bagging
- Boosting
- Application
- Discussion

Introduction
Steps in using gene expression data to predict classes:
- Make some sort of observation of the gene expression in different states, i.e., healthy versus diseased.
- Find a way of differentiating the two conditions in terms of gene expression. Which way?

Machine learning
- A technique that helps us differentiate the two conditions.
- A way of combining computation, statistical inference, and algorithms to help with prediction.
Supervised learning
- A subdiscipline of machine learning in which some prior knowledge about the phenomena under investigation is available to guide the learning process.

Ensemble methods
- Algorithms that construct a set of classifiers and then classify new data points by taking a vote of their predictions.
- Aim at improving the predictive performance of a given statistical learning or model-fitting technique.
- The general principle of ensemble methods is to construct a linear combination of some model-fitting method, instead of using a single fit of the method.

Ensemble methods covered: Random Forests (RF), Bagging, Boosting.
Why ensembles?
- Better classification performance than individual classifiers.
- More resilience to noise.
- Reduced variance (bagging; discussed in more detail later).

Random Forests (RF)
- The algorithm was developed by Leo Breiman and Adele Cutler.
- A tree-based model; a machine-learning ensemble classifier.
- Each tree is grown on an independent bootstrap sample from the training data.
- The best split is found among m randomly selected variables.
- The process is repeated to create a forest of trees.
- The prediction is found by averaging the predictions from all of the trees.
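The ensemble idea above (grow many classifiers on bootstrap samples, then let them vote) can be sketched in a few lines. The course's own code is in R; the following toy illustration is in Python, using one-split decision "stumps" as the weak classifiers. All names here (stump_fit, bagged_ensemble) are ours, purely for illustration.

```python
import random

def stump_fit(xs, ys):
    """Fit a 1-D decision stump: pick the threshold/sign with fewest training errors."""
    best = (float("inf"), None, None)          # (errors, threshold, sign)
    for t in sorted(set(xs)):
        for sign in (1, -1):
            errs = sum((sign if x > t else -sign) != y for x, y in zip(xs, ys))
            if errs < best[0]:
                best = (errs, t, sign)
    _, t, sign = best
    return lambda x: sign if x > t else -sign

def bagged_ensemble(xs, ys, n_trees=25, seed=0):
    """Grow each stump on an independent bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    m = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(m) for _ in range(m)]   # m points drawn with replacement
        stumps.append(stump_fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: 1 if sum(s(x) for s in stumps) > 0 else -1

# Toy data: class +1 when the "expression level" is high
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
predict = bagged_ensemble(xs, ys)
print([predict(x) for x in [0, 10]])   # -> [-1, 1]
```

Because each stump sees a different bootstrap sample, individual stumps vary, but the vote is stable, which is exactly the variance-reduction argument made above.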
Random Forests: Advantages
- For many data sets, it produces a highly accurate classifier.
- It handles a very large number of input variables.
- It estimates the importance of variables in determining classification.
- Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection, and data views.
- Learning is fast.

Random Forests: Disadvantages
- Highly unstable: since it is a tree-based model, a small change in the training set yields large variations in the classification.
- May overfit (see "Machine Learning Benchmarks and Random Forest Regression", reference 4).

Bagging
- Bagging is bootstrap aggregating (Leo Breiman, 1994).
- Derived from the bootstrap (Efron, 1993).
- A machine-learning ensemble meta-algorithm that improves classification and regression models, in terms of stability and classification accuracy, by combining the classifications of models trained on randomly generated training sets.

What is a bootstrap sample?
- Consider a data set D with m data points.
- A bootstrap sample Di can be created from D by choosing m points from D randomly with replacement.
- On average, about 63.2% of the points of D will appear in Di (so about 37% will not).

Bagging: The Algorithm
- Draw M bootstrap samples L*m (m = 1, ..., M) from the original sample.
- Fit a weak learner (usually some sort of tree-based model) to each bootstrap sample and construct the classifiers ĝm = g(L*m).
- Combine the classifiers using weights αm = 1/M, yielding the bagging ensemble.

Bagging: Properties
- Much like random forests, except the classifiers are fit using the total number of variables instead of a subset.
- Creates classifiers using training sets that are bootstrapped (drawn with replacement).
- Averages the results for each case.
- Each classifier is generated with a different training set, obtained from the original data set by using
resampling techniques: some of the original observations will not appear in a given bootstrap sample, while other observations may appear multiple times. The final prediction is made by taking an average of the training-set predictions.

Bagging: Advantages
- Keeps the characteristics of tree-based models: can deal with categorical or continuous variables, is robust to outliers, and can handle missing values.
- Typically used with tree-based models like RF or CART, but can be used with any model.
- Improves the estimate if the learning algorithm is unstable.
- Reduces variance and helps to avoid overfitting.
- Increases accuracy.

Bagging: Disadvantages
- Not useful for improving linear models (the method averages several predictors).
- Computationally slower than a single tree.

Boosting
- A family of methods (details later).
- Based on the question posed by Kearns: can a set of weak learners create a single strong learner?

Algorithms of Boosting
- AdaBoost: very popular; perhaps the most significant historically.
- LogitBoost.
- Gradient Boosting Machine (gbm), which is what this project focuses on.

The Algorithm of gbm
Initialize f̂(x) to be a constant:
  f̂(x) = argmin_ρ Σ_{i=1}^N Ψ(y_i, ρ)
For t in 1, ..., T do:
1. Compute the negative gradient as the working response:
   z_i = − ∂Ψ(y_i, f(x_i)) / ∂f(x_i), evaluated at f(x_i) = f̂(x_i)
2. Fit a regression model, g(x), predicting z_i from the covariates x_i.
3. Choose a gradient descent step size as
   ρ = argmin_ρ Σ_{i=1}^N Ψ(y_i, f̂(x_i) + ρ g(x_i))
4. Update the estimate of f(x) as
   f̂(x) ← f̂(x) + ρ g(x)
(Source: http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf)

Boosting: Properties
- Initially, give every observation equal weight.
- The training set chosen at a given time depends on the performance of earlier classifiers.
- Examples that are incorrectly predicted by previous classifiers are chosen more often, or weighted more heavily.
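For squared-error loss Ψ(y, f) = (y − f)²/2, the negative gradient in step 1 is simply the residual y_i − f̂(x_i), so the gbm loop reduces to repeatedly fitting a small regression model to residuals and taking a damped (shrunken) step. A minimal sketch in Python (the project itself uses the R gbm package; the helper names fit_mean_split and gbm_sketch are ours):

```python
def fit_mean_split(xs, residuals):
    """One-split regression 'tree': threshold with left/right mean predictions."""
    best = None
    for t in sorted(set(xs))[:-1]:                 # exclude max so both sides are non-empty
        left  = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbm_sketch(xs, ys, n_trees=200, shrinkage=0.05):
    """Gradient boosting for squared loss: the negative gradient is the residual
    (step 1); fit a weak regression model to it (step 2); take a damped step (3-4)."""
    f0 = sum(ys) / len(ys)                          # constant fit minimizing the loss
    learners, preds = [], [f0] * len(xs)
    for _ in range(n_trees):
        resid = [y - p for y, p in zip(ys, preds)]  # working response z_i
        g = fit_mean_split(xs, resid)
        learners.append(g)
        preds = [p + shrinkage * g(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + shrinkage * sum(g(x) for g in learners)

# Step-shaped toy data: mean 1 for x <= 4, mean 3 for x > 4
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
f = gbm_sketch(xs, ys)
print(round(f(2), 1), round(f(7), 1))   # -> 1.0 3.0
```

With shrinkage 0.05, each iteration removes only 5% of the remaining residual, which is the slow, regularized descent the pseudocode above describes.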
Boosting: Properties (continued)
- Depends more on the data set than on the type of classifier algorithm.
- As the iterations increase, better prediction accuracy is achieved.
- The error rate of the weak learner continuously increases as the misclassified observations are more heavily weighted, but the ensemble error rate continues to decrease.
[Figure: error rates over boosting iterations — the individual weak learner's error rises while the ensemble error falls.]
- The final classification is made by some sort of voting scheme. Unlike bagging, where each classifier is weighted equally, with boosting the vote is dependent upon the classifier's accuracy.
- Just as the individual observations are weighted, the individual classifiers are also weighted, and a weighted majority vote is used to select the final classification.

Boosting: Advantages
- Keeps the characteristics of tree-based models: can deal with categorical or continuous variables, is robust to outliers, and can handle missing values.
- Increase in accuracy (sometimes even better than bagging) and stability.
- Decrease in the variance of error rates.

Boosting: Disadvantages
- Overfitting the training data set leads to a deterioration on some data sets (AdaBoost).
- When there is nontrivial classification noise in the learning sets, the advantages are lost (J. R. Quinlan).
- Failure happens more often with relatively small data sets (Breiman).

Application
- Using the GSE5245 data set: 16 arrays (observations), 45,101 genes (variables).
- Want to predict the type of exposure based on gene expression levels.
- Combining the short and long exposure types: comparison of non-exposure (i.e., "none") versus exposure (i.e., "short" or "long").

Prepare the data in R:

library(affy)
library(randomForest)
library(MLInterfaces)
library(MASS)
library(exactRankTests)
library(Biobase)
library(genefilter)   # for filterfun/pOverA/cv/genefilter below
abatch <- ReadAffy(celfile.path = "E:/GSE5245")   # load data
rma.eset <- rma(abatch)                           # use rma to do preprocessing
cl <- c(rep(0, 5), rep(1, 11))                    # 5 "none" and 11 "exposure" arrays
emat <- 2^exprs(rma.eset)                         # back to the intensity scale
# non-specific gene filtering
ffun <- filterfun(pOverA(0.20, 100), cv(0.7, 10))
t.fil <- genefilter(emat, ffun)
small.eset <- emat[t.fil, ]

# prepare the response in R
rma.eset$type <- c(rep("none", 5), rep("exposure", 11))
pData(phenoData(rma.eset))
type <- as.factor(rma.eset$type)
table(pData(phenoData(rma.eset))$type)
response <- rma.eset$type
expressions <- t(apply(exprs(rma.eset), 1, rank))
I <- ncol(exprs(rma.eset))
I.indx <- 1:I
b.sample <- sample(I.indx, I, replace = TRUE)

varselection <- function(indx, expressions, response, p = 100) {
  y <- switch(class(response),
    "factor"  = model.matrix(~ response - 1)[indx, , drop = FALSE],
    "Surv"    = matrix(cscores(response[indx]), ncol = 1),
    "numeric" = matrix(rank(response[indx]), ncol = 1))
  x <- expressions[, indx, drop = FALSE]
  n <- nrow(y)
  linstat <- x %*% y
  Ey  <- matrix(colMeans(y), nrow = 1)
  Vy  <- matrix(rowMeans((t(y) - as.vector(Ey))^2), nrow = 1)
  rSx  <- matrix(rowSums(x),   ncol = 1)
  rSx2 <- matrix(rowSums(x^2), ncol = 1)
  E <- rSx %*% Ey
  V <- n / (n - 1) * kronecker(Vy, rSx2)
  V <- V - 1 / (n - 1) * kronecker(Vy, rSx^2)
  stats <- abs(linstat - E) / sqrt(V)
  stats <- do.call("pmax", as.data.frame(stats))
  return(which(stats > sort(stats)[length(stats) - p]))
}
selected <- varselection(I.indx, expressions, response)

The function varselection takes an index vector of observations between 1 and I and returns a vector of length p indicating which genes have been selected.

Random Forests results
Confusion matrix:
           exposure  none  class.error
exposure         11     0            0
none              0     5            0
Misclassification error of 0.

Random Forests code:
set.seed(1234)
rf <- MLearn(type ~ ., rma.eset[selected, ], randomForestI,
             xvalSpec("NOTEST"), sampsize = I, mtry = 3, importance = TRUE)
confuMat(rf, "train")
RObject(rf)

Bagging results
Confusion matrix:
           exposure  none  class.error
exposure         11     0            0
none              0     5            0
Misclassification
error of 0.

Bagging code:
set.seed(1234)
rfBagg <- MLearn(type ~ ., rma.eset[selected, ], randomForestI,
                 xvalSpec("NOTEST"), sampsize = I,
                 mtry = length(selected), importance = TRUE)
confuMat(rfBagg, "train")
RObject(rfBagg)

Boosting results
Confusion matrix:
           exposure  none
exposure          4     0
none              0     2
Misclassification error of 33%.

Boosting code:
set.seed(50)
gbm <- gbmB(rma.eset[selected, ], "type", b.sample,
            n.minobsinnode = 3, n.trees = 1000, shrinkage = 0.05)
confuMat(gbm)

Friedman test results
friedman.test(as.matrix(performance))
# Friedman chi-squared = 114.1855, df = 3, p-value < 2.2e-16
With a p-value of 2.2e-16, reject the null hypothesis that all of the models were equal, and conclude that there are global differences among the models.

Using a boxplot to compare the four methods (RF, Bagging, Boosting, Guess) over 100 bootstrap samples:
[Figure: boxplots of misclassification errors by method — RF, Bagg, GBoost, Guess.]

Simulation code:
set.seed(1234)
B <- 100
performance <- as.data.frame(matrix(0, nrow = B, ncol = 4))
colnames(performance) <- c("RF", "Bagg", "GBoost", "Guess")
for (b in 1:B) {
  b.sample <- sample(I.indx, I, replace = TRUE)
  selected <- varselection(b.sample, expressions, response)
  # Random Forests
  rf <- MLearn(type ~ ., rma.eset[selected, ], randomForestI,
               xvalSpec("NOTEST"), sampsize = I, mtry = 3, importance = TRUE)
  predicted3 <- factor(rf@trainPredictions, levels = levels(response))
  performance[b, 1] <- mean(response[-b.sample] != predicted3[-b.sample])
  # Bagging
  rfBagg <- MLearn(type ~ ., rma.eset[selected, ], randomForestI,
                   xvalSpec("NOTEST"), sampsize = I,
                   mtry = length(selected), importance = TRUE)
  predictedBagg <- factor(rfBagg@trainPredictions, levels = levels(response))
  performance[b, 2] <- mean(response[-b.sample] != predictedBagg[-b.sample])
  # Boosting
  gbm <- gbmB(rma.eset[selected, ], "type", b.sample,
              n.minobsinnode = 3, n.trees = 1000, shrinkage = 0.05)
  predictedgbm <- factor(gbm@predLabels, levels = levels(response))
  performance[b, 3] <- mean(response[-b.sample] != predictedgbm)
  # Guess the majority class of the bootstrap sample
  performance[b, 4] <- mean(response[-b.sample] !=
                            levels(response)[which.max(tabulate(response[b.sample]))])
}
friedman.test(as.matrix(performance))
# Friedman chi-squared = 114.1855, df = 3,
# p-value < 2.2e-16
boxplot(performance, xlab = "Methods", ylab = "Misclassification Errors", col = "#377EBB")

Variable importance using gbm:
opar <- par(no.readonly = TRUE)
par(las = 1, mar = c(6, 6, 6, 6))
plot(getVarImp(gbm), FALSE)
nip <- summary(gbm@RObject, cBars = 20)
par(opar)
[Figure: bar chart of relative influence for the top probe sets from the gbm fit.]

Discussion: Bagging versus Boosting
- Bagging: individual models are built separately. Boosting: each new model is influenced by the performance of previous models (iterative).
- Bagging: takes bootstrap samples of the data; on average only about 63.2% of a sample consists of unique observations, while the rest are duplicates. Boosting: uses all of the data.
- Bagging: resamples using the uniform distribution. Boosting: modifies the data by reweighting the observations according to classification accuracy.
- Bagging: each observation is weighted equally. Boosting: each observation is initially weighted equally and then reweighted according to how accurately it was classified.
- Bagging: each classifier is weighted equally. Boosting: each classifier is weighted according to how accurately it classifies.

Discussion
- Using bagging and boosting will result in a smaller misclassification rate.
- Random Forests is a special case of bagging, and it works very well.
- For general use, Random Forests is a great off-the-shelf classification method.

References
1. Chapters 16 & 17 of Bioinformatics and Computational Biology Solutions Using R and Bioconductor (course text).
2. Bühlmann, Peter. "Bagging, Boosting and Ensemble Methods." <http://mars.wiwi.hu-berlin.de/ebooks/html/csa/node227.html>
3. Alexandre, Luís. "Boosting and Bagging." 04 May 2004. Accessed 8 Apr 2008. <http://gnomo.fe.up.pt/nnig/papers/boobag.pdf>
4. Segal, Mark R. "Machine Learning Benchmarks and Random Forest Regression." University of California, 2004.
5. Reynolds, Stacey. "Bagging and Boosting." Statistical Computing class presentation, Fall 2008.
6. Kusiak, Andrew. "Data Mining: Bagging and Boosting." The University of Iowa, 1 Apr 2008.
<http://www.icaen.uiowa.edu/comp/Public/Bagging.pdf>
7. Ridgeway, Greg. "Generalized Boosted Models: A guide to the gbm package." 3 Aug 2007. <http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf>