Statistical Bioinformatics

by: Anita Hettinger
STAT 5570, Statistics, Utah State University
Instructor: John Stevens

This 38-page set of class notes for STAT 5570 (Statistical Bioinformatics) at Utah State University, taught by John Stevens in Fall, was uploaded by Anita Hettinger on Wednesday, October 28, 2015.

Ensemble Methods: Bagging & Boosting
Xi Zhang
Bioinformatics, Spring 2009
Utah State University

Outline
- Introduction
- Bagging
- Boosting
- Application
- Discussion

Introduction
Steps in using gene expression data to predict classes:
- Make some sort of observation of the gene expression in different states (i.e., healthy versus diseased).
- Find a way of differentiating the two conditions in terms of gene expression.
- Which way?

Introduction: Machine Learning
- A technique that helps us differentiate the two conditions.
- A way of combining computation, statistical inference, and algorithms to help with prediction.
Supervised learning
- A subdiscipline of machine learning in which some prior knowledge about the phenomenon under investigation is available to guide the learning process.

Introduction: Ensemble Methods
- Ensemble methods are algorithms that construct a set of classifiers and then classify new data points by taking a vote of their predictions.
- They aim at improving the predictive performance of a given statistical learning or model-fitting technique.
- The general principle of ensemble methods is to construct a linear combination of some model-fitting method, instead of using a single fit of the method.

Ensemble methods covered here: Random Forests (RF), Bagging, Boosting.
Why use them?
- Better classification performance than individual classifiers.
- More resilience to noise.
- Reduced variance (bagging; discussed in more detail later).

Introduction: Random Forests (RF)
- The algorithm was developed by Leo Breiman and Adele Cutler.
- A tree-based model and a machine-learning ensemble classifier.
- Each tree is grown on an independent bootstrap sample from the training data.
- The best split is found among m randomly selected variables.
- The process is repeated to create a forest of trees.
- The prediction is found by averaging the predictions from all of the trees.

Random Forests: Advantages
- For many data sets, it produces a highly accurate classifier.
- It handles a very large number of input variables.
- It estimates the importance of variables in determining classification.
- Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection, and data views.
- Learning is fast.

Random Forests: Disadvantages
- Highly unstable: since it is a tree-based model, a small change in the training set can yield large variations in the classification.
- May overfit (mentioned in "Machine Learning Benchmarks and Random Forest Regression", reference 4).

Bagging
- Bagging is bootstrap aggregating (Leo Breiman, 1994).
- Derived from the bootstrap (Efron, 1993).
- A machine-learning ensemble meta-algorithm that improves classification and regression models in terms of stability and classification accuracy by combining classifiers fit to randomly generated training sets.

Bagging: What is a bootstrap sample?
- Consider a data set D with m data points.
- A bootstrap sample D_i can be created from D by choosing m points from D randomly with replacement.
- On average, about 63.2% of the points in D will appear in D_i; the remaining ~37% are left out.

Bagging: The Algorithm
1. Draw M bootstrap samples L_m, m = 1, ..., M, from the original sample.
2. Fit a weak learner (usually some sort of tree-based model) to each bootstrap sample, g_m(L_m), and construct the classifiers f(g_m(L_m)).
3. Combine the classifiers using weights α_m, m = 1, ..., M, yielding the bagging ensemble (for bagging, each classifier receives equal weight).
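Here is a minimal R sketch of the three steps above, assuming a data frame of predictors x, a factor response y, and rpart trees standing in for the unspecified tree-based weak learner; bag_fit, bag_predict, x, y, and M are illustrative placeholder names, not the slides' code.

  # Illustrative bagging sketch (assumes x is a data.frame of predictors, y a factor)
  library(rpart)

  bag_fit <- function(x, y, M = 100) {
    n <- nrow(x)
    lapply(seq_len(M), function(m) {
      idx <- sample(n, n, replace = TRUE)        # step 1: draw bootstrap sample L_m
      # step 2: fit a classification tree (the weak learner) to the bootstrap sample
      rpart(y ~ ., data = data.frame(y = y[idx], x[idx, , drop = FALSE]))
    })
  }

  bag_predict <- function(fits, newx) {
    # step 3: every tree votes with equal weight; the majority class wins
    votes <- sapply(fits, function(f)
      as.character(predict(f, newdata = newx, type = "class")))
    apply(votes, 1, function(v) names(which.max(table(v))))
  }

A random forest differs from this sketch only in that each tree additionally restricts every split to a random subset of m variables rather than considering all of them.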
Bagging: Properties
- It is much like random forests, except that each classifier is fit using the total number of variables instead of a random subset.
- Classifiers are created using training sets that are bootstrapped (drawn with replacement).
- Results are averaged for each case.

Bagging: Properties (continued)
- Each classifier is generated with a different training set, obtained from the original data set by resampling.
- Some of the original observations will not appear in a given bootstrap sample, while other observations may appear multiple times.
- The final prediction is made by taking an average of the training-set predictions.

Bagging: Advantages
- Keeps the characteristics of tree-based models: can deal with categorical or continuous variables, is robust to outliers, and can handle missing values.
- Typically used with tree-based models like RF or CART, but can be used with any model.
- Improves the estimate if the learning algorithm is unstable.
- Reduces variance and helps to avoid overfitting.
- Increases accuracy.

Bagging: Disadvantages
- Not useful for improving linear models (the method averages several predictors).
- Computationally slower than a single tree.

Boosting
- A family of methods (discussed in more detail below).
- Based on the question posed by Kearns: can a set of weak learners create a single strong learner?

Boosting: Algorithms
- AdaBoost (very popular; perhaps the most significant historically).
- LogitBoost.
- Gradient Boosting Machine (gbm), which is what this project focuses on.

Boosting: The gbm Algorithm
Initialize f̂(x) to be a constant, f̂(x) = argmin_ρ Σ_{i=1}^N Ψ(y_i, ρ). Then for t in 1, ..., T:
1. Compute the negative gradient as the working response:
   z_i = -∂Ψ(y_i, f(x_i)) / ∂f(x_i), evaluated at f(x_i) = f̂(x_i).
2. Fit a regression model g(x) predicting z_i from the covariates x_i.
3. Choose a gradient-descent step size as
   ρ = argmin_ρ Σ_{i=1}^N Ψ(y_i, f̂(x_i) + ρ g(x_i)).
4. Update the estimate of f(x) as f̂(x) ← f̂(x) + ρ g(x).
(From the gbm package vignette, http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf; reference 7.)
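To make the four gbm steps above concrete, here is a small R sketch of gradient boosting for the special case of squared-error loss Ψ(y, f) = (y - f)^2 / 2, where the negative gradient z_i is simply the residual y_i - f̂(x_i). It is not the slides' code: rpart stands in for the regression weak learner g(x), the line search in step 3 is simplified to a fixed shrinkage step, and boost_sketch, x (a data frame of predictors), y (a numeric response), n_trees, and shrinkage are placeholder names.

  # Illustrative gradient-boosting sketch for squared-error loss
  library(rpart)

  boost_sketch <- function(x, y, n_trees = 200, shrinkage = 0.05) {
    f <- rep(mean(y), length(y))                    # initialize f-hat(x) to a constant
    fits <- vector("list", n_trees)
    for (t in seq_len(n_trees)) {
      z <- y - f                                    # step 1: negative gradient (working response)
      d <- data.frame(z = z, x)
      g <- rpart(z ~ ., data = d, maxdepth = 2)     # step 2: regression weak learner g(x)
      fits[[t]] <- g
      # steps 3-4: fixed shrinkage step in place of the line search, then update f-hat
      f <- f + shrinkage * predict(g, newdata = d)
    }
    list(init = mean(y), fits = fits, shrinkage = shrinkage)
  }

The n_trees and shrinkage arguments play the same roles as the n.trees and shrinkage arguments of the gbm call used in the Application section.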
Boosting: Properties
- Initially, every observation is given equal weight.
- The training set chosen at a given iteration depends on the performance of the earlier classifiers.
- Examples that were incorrectly predicted by previous classifiers are chosen more often, or weighted more heavily.

Boosting: Properties (continued)
- Performance depends more on the data set than on the type of classifier algorithm.
- As the iterations increase, better prediction accuracy is achieved.

Boosting: Properties (continued)
- The error rate of the weak learner continuously increases as the misclassified observations are more heavily weighted, but the ensemble error rate continues to decrease.
[Figure: weak-learner error rate versus ensemble error rate across boosting iterations.]

Boosting: Properties (continued)
- The final classification is made by some sort of voting scheme.
- Unlike bagging, where each classifier is weighted equally, with boosting the vote depends on the classifier's accuracy.
- Just as the individual observations are weighted, the individual classifiers are also weighted, and a weighted majority vote is used to select the final classification.

Boosting: Advantages
- Keeps the characteristics of tree-based models: can deal with categorical or continuous variables, is robust to outliers, and can handle missing values.
- Increases accuracy (sometimes even more than bagging) and stability.
- Decreases variance in error rates.

Boosting: Disadvantages
- Overfitting the training data set leads to deterioration on some data sets (AdaBoost).
- When there is nontrivial classification noise in the learning sets, the advantages are lost (J. R. Quinlan).
- Failure is more likely with relatively small data sets (Breiman).

Application
- Uses the GSE5245 data set: 16 arrays (observations) and 45,101 genes (variables).
- Goal: predict the type of exposure based on gene expression levels.
- Short and long exposure types are combined, so the comparison is non-exposure (i.e., "none") versus exposure (i.e., "short" or "long").

Application: Preparing the data in R

  library(affy)
  library(randomForest)
  library(MLInterfaces)
  library(MASS)
  library(exactRankTests)
  library(Biobase)
  library(genefilter)   # for filterfun / pOverA / cv below

  # load the CEL files and preprocess with RMA
  abatch <- ReadAffy(celfile.path = "EGSE5245")
  rma.eset <- rma(abatch)
  exprs <- exprs(abatch)
  cl <- c(rep(0, 5), rep(1, 11))

  # non-specific gene filtering on the intensity scale
  emat <- 2^exprs(rma.eset)
  ffun <- filterfun(pOverA(0.20, 100), cv(0.7, 10))
  t.fil <- genefilter(emat, ffun)
  small.eset <- emat[t.fil, ]

  # class labels: 5 "none" arrays and 11 "exposure" arrays (short and long combined)
  rma.eset$type <- c(rep("none", 5), rep("exposure", 11))
  pData(phenoData(rma.eset))
  rma.eset$type <- as.factor(rma.eset$type)
  table(pData(phenoData(rma.eset))$type)
  response <- rma.eset$type

  # rank-transform each gene across arrays; genes in rows, arrays in columns
  expressions <- t(apply(exprs(rma.eset), 1, rank))
  I <- ncol(exprs(rma.eset))
  I.indx <- 1:I
  b.sample <- sample(I.indx, I, replace = TRUE)   # one bootstrap sample of arrays

Application: Gene selection

  varselection <- function(indx, expressions, response, p = 100) {
    y <- switch(class(response),
                "factor"  = model.matrix(~ response - 1)[indx, , drop = FALSE],
                "Surv"    = matrix(cscores(response[indx]), ncol = 1),
                "numeric" = matrix(rank(response[indx]), ncol = 1))
    x <- expressions[, indx, drop = FALSE]
    n <- nrow(y)
    linstat <- x %*% y                              # linear statistic for every gene
    Ey <- matrix(colMeans(y), nrow = 1)
    Vy <- matrix(rowMeans(t(y)) - as.vector(Ey)^2, nrow = 1)
    rSx <- matrix(rowSums(x), ncol = 1)
    rSx2 <- matrix(rowSums(x^2), ncol = 1)
    E <- rSx %*% Ey                                 # expectation of the linear statistic
    V <- n / (n - 1) * kronecker(Vy, rSx2)          # variance of the linear statistic
    V <- V - 1 / (n - 1) * kronecker(Vy, rSx^2)
    stats <- abs(linstat - E) / sqrt(V)
    stats <- do.call("pmax", as.data.frame(stats))
    return(which(stats > sort(stats)[length(stats) - p]))
  }

  selected <- varselection(I.indx, expressions, response)

The function varselection takes an index vector of observations between 1 and I and returns a vector of length p indicating which genes have been selected.

Application: Random Forests results
Confusion matrix:
             exposure  none  class.error
  exposure         11     0            0
  none              0     5            0
Misclassification error of 0%.

  set.seed(1234)
  rf <- MLearn(type ~ ., rma.eset[selected, ], randomForestI, xvalSpec("NOTEST"),
               sampsize = I, mtry = 3, importance = TRUE)
  confuMat(rf, "train")
  RObject(rf)

Application: Bagging results
Confusion matrix:
             exposure  none  class.error
  exposure         11     0            0
  none              0     5            0
Misclassification error of 0%.

  set.seed(1234)
  rfBagg <- MLearn(type ~ ., rma.eset[selected, ], randomForestI, xvalSpec("NOTEST"),
                   sampsize = I, mtry = length(selected), importance = TRUE)
  confuMat(rfBagg, "train")
  RObject(rfBagg)

Application: Boosting results
Confusion matrix:
             exposure  none
  exposure          4     0
  none              0     2
Misclassification error of 33%.

  set.seed(50)
  gbm <- gbmB(rma.eset[selected, ], "type", b.sample,
              n.minobsinnode = 3, n.trees = 1000, shrinkage = 0.05)
  confuMat(gbm)
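For reference, the misclassification errors reported above are simply the fraction of off-diagonal entries in the confusion matrix of true versus predicted classes. A tiny illustrative R example with made-up labels (not taken from the analysis):

  # Toy example: confusion matrix and misclassification error
  truth <- factor(c("exposure", "exposure", "exposure", "exposure", "none", "none"))
  pred  <- factor(c("exposure", "none",     "exposure", "exposure", "none", "exposure"))
  table(truth, pred)    # confusion matrix: rows = true class, columns = predicted class
  mean(truth != pred)   # misclassification error: 2 of 6 labels wrong = 0.33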
Application: Friedman test results

  friedman.test(as.matrix(performance))
  # Friedman chi-squared = 114.1855, df = 3, p-value < 2.2e-16

With a p-value below 2.2e-16, we reject the null hypothesis that all of the models perform equally and conclude that there are global differences in the models.

Application: Comparing the four methods
Boxplot comparing the misclassification errors of the four methods (RF, Bagging, GBoost, and Guess) over 100 bootstrap samples.
[Figure: boxplot of misclassification errors by method (RF, Bagg, GBoost, Guess).]

  set.seed(1234)
  B <- 100
  performance <- as.data.frame(matrix(0, nrow = B, ncol = 4))
  colnames(performance) <- c("RF", "Bagg", "GBoost", "Guess")

  for (b in 1:B) {
    b.sample <- sample(I.indx, I, replace = TRUE)
    selected <- varselection(b.sample, expressions, response)

    # random forest
    rf <- MLearn(type ~ ., rma.eset[selected, ], randomForestI, xvalSpec("NOTEST"),
                 sampsize = I, mtry = 3, importance = TRUE)
    predictedRF <- factor(rf@trainPredictions, levels = levels(response))
    performance[b, 1] <- mean(response[-b.sample] != predictedRF[-b.sample])

    # bagging (random forest with mtry = all selected genes)
    rfBagg <- MLearn(type ~ ., rma.eset[selected, ], randomForestI, xvalSpec("NOTEST"),
                     sampsize = I, mtry = length(selected), importance = TRUE)
    predictedBagg <- factor(rfBagg@trainPredictions, levels = levels(response))
    performance[b, 2] <- mean(response[-b.sample] != predictedBagg[-b.sample])

    # boosting (gbm)
    gbm <- gbmB(rma.eset[selected, ], "type", b.sample,
                n.minobsinnode = 3, n.trees = 1000, shrinkage = 0.05)
    predictedGbm <- factor(gbm@predLabels, levels = levels(response))
    performance[b, 3] <- mean(response[-b.sample] != predictedGbm)

    # naive guess: always predict the majority class of the bootstrap sample
    performance[b, 4] <- mean(response[-b.sample] !=
                              levels(response)[which.max(tabulate(response[b.sample]))])
  }

  friedman.test(as.matrix(performance))
  # Friedman chi-squared = 114.1855, df = 3, p-value < 2.2e-16
  boxplot(performance, xlab = "Methods", ylab = "Misclassification Errors", col = "#377EB8")

Application: Variable importance from gbm

  opar <- par(no.readonly = TRUE)
  par(las = 1, mar = c(6, 6, 6, 6))
  plot(getVarImp(gbm, FALSE))
  imp <- summary(gbm@RObject, cBars = 20)
  par(opar)

[Figure: relative influence of the top genes from the gbm fit.]

Discussion: Bagging versus Boosting
- Bagging: individual models are built separately. Boosting: each new model is influenced by the performance of the previous model (iterative).
- Bagging: takes bootstrap samples of the data; on average only about 63.2% of a bootstrap sample consists of unique observations, the rest are duplicates. Boosting: uses all of the data.
- Bagging: resamples using the uniform distribution. Boosting: modifies the data by reweighting the observations according to classification accuracy.
- Bagging: each observation is weighted equally. Boosting: each observation is initially weighted equally and then reweighted according to how accurately it was classified.
- Bagging: each classifier is weighted equally. Boosting: each classifier is weighted according to how accurately it classifies.

Discussion
- Using bagging or boosting will result in a smaller misclassification rate.
- Random Forests is a special case of bagging, and it works very well.
- For everyday use, Random Forests is a great classification method.

References
1. Chapters 16 and 17 of Bioinformatics and Computational Biology Solutions Using R and Bioconductor (course text).
2. Bühlmann, Peter. "Bagging, Boosting and Ensemble Methods." <http://mars.wiwi.hu-berlin.de/ebooks/html/csa/node227.html>
3. Alexandre, Luis. "Boosting and Bagging." 4 May 2004. Accessed 8 Apr 2008. <http://gnomo.fe.up.pt/nnig/papers/boobag.pdf>
4. Segal, Mark R. Machine Learning Benchmarks and Random Forest Regression. University of California, 2004.
5. Reynolds, Stacey. "Bagging and Boosting." Statistical Computing class presentation, Fall 2008.
6. Kusiak, Andrew. "Data Mining: Bagging and Boosting." The University of Iowa. Accessed 1 Apr 2008. <http://www.icaen.uiowa.edu/comp/Public/Bagging.pdf>
7. Ridgeway, Greg. Generalized Boosted Models: A Guide to the gbm Package. 3 Aug 2007. <http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf>

