New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here


by: Darrion Bednar

Bioinformatics CISC636

Darrion Bednar
GPA 3.73


Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Class Notes
25 ?




Popular in Course

Popular in Computer Information Technology

This 93 page Class Notes was uploaded by Darrion Bednar on Saturday September 19, 2015. The Class Notes belongs to CISC636 at University of Delaware taught by LiLiao in Fall. Since its upload, it has received 44 views. For similar materials see /class/207173/cisc636-university-of-delaware in Computer Information Technology at University of Delaware.

Similar to CISC636 at UD

Popular in Computer Information Technology


Reviews for Bioinformatics


Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/19/15
CISC 636 Intro to Bioinformatics Spring 2008 DNA Microarray 2d gel MSMS yeast 2hybrid CISC636 SOS Lecl9 Liao Gene expression How many copies of a gene its product is present in the cell For experimental reasons gene expressions are measured by numbers of mRNAs not directly by proteins See Proteomics Various cell types are due to different genes expressed The difference between diseased eg cancerous and nondiseased Diseased cells are often resulted from the abnormal levels of expression of key genes CISC667 SO7 Lec22 Liao u uauc u I v cloulq m aim II quotI 9 0 A Comparative Hybridization I Experiment L 6 397 u 1 v a a 1 num wmnm c15c557 507 Leczz Liao Microarray Oligonucleotide Affymetrix array 0 Oligo 25 bases long 0 High density lcm2 contain 100k oligos cDNA array 0 cDNA RTPCR much longer gt 1000 bases 0 Varied density of cDNA on each spot hybridization depends on length 0 Less possibility for false positives Image processing Background subtraction Normalization CISC667 807 Lec22 Liao m huge Pmmm HHIW Y Oulll Contral CI mm mm W 0 lulu Ynnlcliplun Munw mm x a Mm a m y u p m n 4 Lys n u v in NW be Ewmm Laws El Nyhndla w mumy sung Nammuuun arm tllUlIrwysw anmay 5m CISC667 so7 LecZZ Liao f39llIVH IU 39IVNHHUI 9m mus3w 397I 119mva zm meoA 39oooz msnv Journal of Bacteriology AUGUST 2000 VOLUME IBZ NUMBER I6 Published wace Monthly CISC667 SO7 Le022 Liao Applications Inferring transcription regulatory networks Understanding correlation between genotype and phenotype predicting genotype ltgt phenotype Phenotypes drugtherapy response drugdrug interactions for expression drug mechanism interacting pathways of metabolism CISC667 SO7 Lec22 Liao What is proteomics Like genomics is the study of all genes in a genome proteomics is the study of all proteins of a cell at a given t1me Three aspects Biological process Why is this being done e g movement of cell Molecular function What kind of molecule is this e g ATPase Cellular component Where is this located e g ribosome Why is it dif cult Moving target Celltocell variations Cell behavior changes With time Lack of high throughput technology Protein chips Protein sequences do not have hybridization that DNA sequences have CISC667 SO7 Lec22 Liao 8 Proleln probes Nuclerc acxd probes Drug probes Serum probes CeH lysales lemg sens Enzymes 0 O O O O O O 0 O C O 0 O O Protein expressmn eve Protein binding propertres Pmtem promrng Pamway bunde Dragnosuc Drug drscuvery PostrlranglaucHBI mum cauon mirrmrrmr 39 39 39 39 39 antinpn DNA RNA aplamers carbohydrates or small molecmes with high amnity and speci city are 39 39 muuimling rmihrtome mim r purs39urreu ur gwequumul Iha nmnr h Mama Muteill u 39 39 39 39 m 31 microarrays 39 39 39 39 4 Mlh the pl upel furrnnnzl 39 L meresl CISC667 so7 Le e22 L i39ao 2D gel electrophoresis Isoelectric points rst dimension Molecular weights second dimension Both p1 and MW are functions of amino acid sequence of a protein Some proteins do not resolve well by 2D gels Issues Detection of spots image processing Quanti cation of each spot Identi cation of each spot Mass Spectrometry CISC667 SO7 Lec22 Liao 10 e ZOOkDa 30kDa 39 39 quot 8kDa URE 618 0 Twodimensional 2D gel electrophoresis Each cotumh a42 thh a gradattoh of gray shadthg represents the tsoetectrtc focusthg get wtth a pH gradteht a A mtxture pt proteths btue drop ts apptted to the tsoetectrtc focusthg get and b exposed to an etectrtcat 1 m o m a 8 u m g m c u g lt m 2 s m i a j a n 3 w w E E m g m a 8 g m g phorests SDPAGE Proteths mtgrate thto the stab get accordthg to thetr motecutar wetghts Yeast cetts were grown m rtutt tuedta and subjected to 2D get attatysts strtg dupttcate tsuetecr trtc Tocusthg gets targe e and smatt f proteths were aha ed oh separate gets The same spots appear at the bottom of e and the top of t Motecutar wetghts are resotved oh the axts and pts oh the Xraxts Pahets e and t are from the Swtss 2D database at ExPASy CISC667 s07 Lec22 Liao 11 lanization Sepalalinn Acllvaliun M355 Analysis MS MS Lum zan on Separation Activation Mass Ana vsis MS MS Colllslon c Argnn Gus Ionization Senaraliun Anlvatlon Mass Analysis MS MS d Mass Analyzer It loll Detector DNZaUmn Separater Achalion MESS Analysis CISC667 s07 Leczz Liao S P A F D S l M A E T L K 10 iprotonated mass 14106i Mass bions yians Mass 80 811 s PAFDSIMAETLK 13230 135 2 SP AFDSIMAETLK 1225 4 11565 2563 sP FDQIMAET K i1654 60 4035 AF IMAE LK 10082 12268 518 5 SPAFD i AE 0331 8055 8925 005 0 SPAFDS IMAETLK 00 0 40 718 s SPAFDSI MAETLK 092 a 0500 s FDSIM AETLK 501 7 02 1 SPA SiMA ETLK 490 0 20 10502 SP DSIMAE TLK 361 5 11113 SPAFDSIMAET LK 2004 12044 SPAFDSIMAETL K 1472 250 500 750 1000 1250 ml FIG U RE 620 0 Protein identi cation through peptide lragment formation and sepa ration When a group or identicai proteins is broken into its peptide pieces more than one pair or b and y peptides wiii be formed a One protein sequence and its caicuiated mass on top with the b peptidesmasses gray and the y peptidesmasses biue beiow b An experir mentaiiy determined masscharge spectrum from the peptide in a Notice that some peaks are higher than others which means that some by peptide pieces were more abundant than others The spectrum is used to determine the peptides amino acid sequence and protein identity CISC667 so7 Leczz Liao 13 Cell population 1 Cell population 2 Combine Run 2 D gel Excise protein Spots digest peptides Analyze by MS Identify proteins II II II gtProteinA II II ll I I I gtProteinC u I I lmenstv MasSr charge FIG un E s 24 Quantifying di erences in proxeomes Two Cell pools are grown in the presence blue cell popula tion 2 or absence gray cell population 1 of heavy nitrogen N The proteins are extracted pooled and subjected to 2D gel analysis spots excised and proteins identified and quantified by MSMS The relative areas under I e pairs of neavy and Hg peptide peaks indicate relative abundance of each protein pair CISC667 so7 Leczz Liao 14 a Cell population 1 Cell population 2 Combine Purify target proteins P30 ol From cell population 1 Target protein 100 From cell population 2 Digest Extract peptides Analyze by MS 3 e we a E Massi charge x Unphosphorylated peptide x a Xp Phosphorylated peptide x E m E A B c X D quot5 Xp 2 a CC FIGURE 615 Quantifying relative levels of phospho rylatlcvn Bar graphs A B C and D lndicate no difference for proteins ArD in the two populations Compared to population ii unpnospnorylated protein X was reduced in population 2 Pnosphorylated protein X Xp was increased in population 2 CISC667 07 Lec22 Liao 15 Unclear classification Ionic homeostasis Cellular organization Unclassified proteins Metabolism Cell rescue cell defense Energy and cell aging Signal transduction V Ce growth division Cellular biugenesis DNA Symhesis Intracellular transport T ansport facilitation Protein destinatian Protein synthesis Transcription FIG UR E 69 Bar code analysis of biological processes Dlstrlbution of functional classes of essentlal lnner cllcle and nonessential outer clrcle genes uslng clitena from the Munlcn lnformatlon Center for Protein Sequences MlPS CISC667 so7 Lec22 Liao 16 1 Lam 2 la Lsm Dcp1 Lana VLRQBQC Lsmli 615 t Proteomics circuit showing the interactions 01 RNA splicing pro 5 The proteins are Indicated b their rhteractrohs wrth hhes The dots oh the hhes hetp ou to he that were detected by trad39rtrohat Y2H array 5 eehs and gray from htarature and array scree h used as bait W the screens SmaH gray nodes hr39ghhghted here ack from nuttrpte h39rghrthroughp hs Th rrows y from the protet eract39rohs hot 9 mack porht awa Thdrcate other DFOIeT rpTOIQTH rht CISC667 07 Lec22 Liao Membrane Nucleus l742715l Cytoplasm 155mm FIG URE 617 Protein interaclions grouped by cellular compartments Numbers rn parentheses rnd39roate the number or rnteraot39rons tn the clrcurt dr agram among proteins of thrs compartmentthe total number of proter39ns rn t e 5 me compartment L39rnes connecan come partments rndr cate the number of protern lnteractrons by the thickness of each lrne numbers or connectrons are near each lrne For example there are 7 rnteractrons between tre 48 membrane proter ns and the 72 plasma membrane proterns rn Benno Frgure t CISC667 07 Lec22 Liao 18 id Bv ltn Ti Y i 13 I V polymerase 7 39e reporter enzyme activ39 e eg blue colonie crscsm 07 Leczz ano Yeast two hybrid System 2 I Gal4 protein the two domains of the protein do not need to be transcribed in a single protein I Just as long as they come to interact Two other protein domains interact Actrvating am domain interacts interacts Wlth Wlth promotor polymerase Measure reporter enzyme actiyity as eg blue colonies DavidB Cullinge KV39I Yeast two hybrid System 3 This is achieved using gene fusions m a Bindjng domain as a Activating domain as a translational fusion translational fusion With With the gene the gene encoding a third encoding another protein in a second protein in one 39 DavidB Cullinge KV39I Mun x nun l l E E J w 55 393 DH line Screened V Screened against l iiiiil Screened Screened a gai nst against 22 CISC667 SO7 LecZZ Liao CISC 636 Intro to Bioinformatics Spring 2008 Hidden Markov Models 11 The model likelihood Forward algorithm backward algorithm Posterior decoding CISC636 SOS LeclO Liao The probability that sequence X is emitted by a state path 71 is PX 75 Hil to L eni a mi nil i l 2 3 4 5 67 8 9 X TGCGCGTAC HC PX 71 0338 X 070 X 0112 X 030 X 0368 X 065 X 0274 X 065 X 0368 X 065 X 0274 X 035 X 0338 X 070 X 0372 X 070 X 0198 Then the probability to observe sequence X in the model is PX Zn PX at which is also called the likelihood of the model CISC636 SOS LeelO Liao How to calculate the probability to observe sequence X in the model PX Zn PX 7 Let fki be the probability contributed by all paths from the beginning up to and include position i With the state at position i being k The the following recurrence is true fki 2339 fji391 ajkl ekXi Graphically X1 X2 X3 Again a silent state 0 is introduced for better presentation CISC636 SOS LeclO Liao Forward algorithm Initialization f00 1 fk0 0 for k gt 0 Recursion fki ekxi ijjil ajk Termination PX ZkfkL ako Time complexity ON2L Where N is the number of states and L is the sequence length CISC636 SOS LeclO Liao Let bki be the probability contributed by all paths that pass state k at position i bki PXi12 XL 750 k Backward algorithm Initialization bkL ako for all k Recursion i Ll l bki Zj akj CkXi1 bjil Termination PX 2k 210k CkX1bkl Time complexity ON2L Where N is the number of states and L is the sequence length CISC636 SOS LeclO Liao Posterior decoding P1ti k x PX xi k PX fkibki PX Algorithm fori l to L do argmax k P7ti k X Notes 1 Posterior decoding may be useful when there are multiple almost most probable paths or when a function is de ned on the states 2 The state path identi ed by posterior decoding may not be most probable overall or may not even be a Viable path CISC636 SOS LeclO Liao CISC 636 Intro to Bioinformatics Spring 2008 Pairwise sequence alignment SmithWaterman local alignment CISC636 SOS Lec6 Liao Local pairwise optimal alignment Why need local alignment vs global mosaic structure functioning domains of proteins which may be caused by inframe exchange of Whole exons or alternative splicing e g are these three sequences similar or not 31 s2 I S3 CISC636 SOS Lec6 Liao Local alignment 0 Naive algorithm there are n2 m2 pairs of substrings to align each pair as a global alignment problem will take Onm the optimal local alignment will therefore take On3 m3 0 SmithWaterman algorithm dynamic programming recurrence relationship Fij max 0 Fi1 SXi Fi1j d Fi j 1 d Notes 1 For this to work the random match model must have a negative score Why 2 The time complexity of SmithWaterman is n m CISC636 SOS Lec6 Liao Example Align HEAGAWGHEE and PAWHEAE Use BLOSUM 50 for substitution matrix and d8 for gap penalty H E A G A W G H E E 0 0 0 0 0 0 0 0 0 0 0 P 0 0 0 0 0 0 0 0 0 A 0 0 0 5 0 0 0 0 0 W 0 0 0 0 quot 0 0 0 H 0 10 2 0 6 E 0 2 16 8 A 0 0 8 21 E 0 0 0 13 AWGHE AW HE CISC636 SOS Lec6 Liao Gap penalties Linear 7g g d Where g is the gap length and d is the penalty for a gap of one base Affine 7g d g1e Where d is gapopen penalty and e typically smaller than 1 is gapextension penalty Such a distinction is mainly to simulate the observation in alignments gaps tend to be in a stretch Note gap penalty is a sort of gray area due to less knowledge about gap distribution CISC636 SOS Lec6 Liao General algorithm to handle Af ne gap penalty To align two sequences Xln and ylm i if X at z39 aligns with y at j a score SXi yj is added if either Xi 0r yj is a gap a score of 7g is subtracted penalty ii The best score up to ij will be Fij maX Filjl SXi yj Fkj yik k O Fi k yGk k O jl This algorithm is On3 for nm CISC636 SOS Lec6 Liao Example Align GAT and A using the following scoring scheme identity 4 transition 2 transversion 4 gap penalty op 9 ex l G A T 0 9 10 11 A 9 2 5 GAT GAT A A CISC636 SOS Lec6 Liao Gotoh algorithm Af ne gap 7g d gle F0J 70 F070 71 Fij max Filjl sxi yj Pi j gap in sequence y Qi j gap in sequence x P0j 00 so as to always take F0j Pij max Filj d open a gap Pilj e extend a gap Qi0 00 so as to always take Fi0 Qij max Fi jl d 0pen a gap Qijl e extend a gap This algorithm is On2 CISC636 s08 Lec6 Liao CISC 636 Intro to Bioinformatics Spring 2008 Whole genome sequencing Mapping amp Assembly CISC636 SOS Lec4 Liao As of May 2005 225 completed microbial genomes source TIGR CMR A Bacteria 16 Mb 1600 genes Science 269 496 B 1997 Eukaryote 13 Mb 6K genes Nature 387 1 Science 282 1945 D 2001 Agrobacteria 567 Mb 5419 genes Science 2942317 EF 2001 Human 3 Gb 35K genes Science 291 1304 Nature 409 860 G 2005 Chimpanzee CISC636 s08 Lec4 Liao Sequencing Technologies and Strategies Sanger method gelbased sequencing TopDown or BACtoBAC Whole Genome Shotgun Sequencing by Hybridization Pyrosequencing sequencing by synthesis Science 1998 Polony sequencing George Church et a1 Science August 2005 The 454 sequencing Nature July 2005 SBS wwwsolexacorn CISC636 SOS Lec4 Liao Physical Mapping by using SequenceTaggedSites STSs are unique markers Exercise What is the difference between STS and EST expressed sequence tags What is the chance a fragment of 20 bps to be unique in a genome of 3 billion bps Visit to learn more about EST as an alternative to whole genome sequencing llllllllll ll lililllllll lllll lllllll BAC5 ltI1il1 ID 163 Mb Fl GUR E 13 Relationships of chromosomes to genome sequencing markers The X chromosome rs about 163 Mb rh ehgth Ih thrs diagram there are 16 overtapprhg BAG stones that spah the entire ehgth Ih reatrty 1408 BACs were needed to span the X Chromosome Arrows top mark STSs scattered throughout the chromosome and on overlapprng BACs Courtesy of Discovering genomics proteomics amp bioinformatics by Campbell amp Heyer CISC636 SOS Lec4 Liao Genome is nebulized m 2 kb fragmcnts Tenns YAC l Sifxuns are inserted into Cosmid I Mapping I pUC18 Tiling path Read 1 Inscrts m sequenced Gap Conti g 39shmg m 17 337323quot ctg403 CISC636 s08 Lec4 Liao Why is Assembly Dif cult The most natural notion of assembly is to order the fragments so as to fonn the shortest string containing all of them ABRAC W ACADA ABRAC RACAD ADABR ADA DABRA ADABR RACAD DAB However the problem of nding the shortest common superstring of a set of strings is NPcomplete CISC636 soa LecA Liao Assembly is complicated by repeats CISC636 08 Lec4 L120 Informatics tasks Shotgun coverage LanderWaterman Basecalling Phred Assembly Phrap wwwphrapcom Visualization Consed contig editor for phred and phrap Postassembly analyses Sun Kim Li Liao J eanFrancois Tomb quotA Probabilistic Approach to Sequence Assembly Validation Alvaro Gonzalez Li Liao BMC Bioinformatics 9102 2008 CISC636 SOS Lec4 Liao STScontent mapping a Actual ordering that we want to infer l 2 3 4 I b Hybridization data ST Ss DNA Clonel Clone2 Clone3 c Permutation of columns to have CIOHSTS 2413 1 1010 2 1101 3 1001 What we do not know either the relative location of STSs in the genome or the relative location of clones in the genome CISC636 SOS Lec4 Liao consecutive ones in rows 1234 l 0 0 2 01 1 1 3 0 0 Linear time algorithm by Booth amp Lueker 1976 Sequence coverage LanderWaterman 1988 0 Length of genome G 0 Length of fragment L o of fragments N 0 Coverage a NLG Fragments are taken randomly from the original full length genome Q What is the probability that a base is not covered by any fragment Assumption fragments are independently taken from the genome in other words the lefthand end of any fragment is uniformly distributed in OG Then the probability for the LHE of a fragment to fall Within an interval X xL is LG Since there are N fragments in total on average any point in the genome is going to be covered by NLG fragments CISC636 SOS Lec4 Liao Poisson distribution The rate for an event A to occur is r What is the rate to see a lefthand end of a fragment Probability to see an event in time interval t t dt is PAr r dt ht probability no event in Ot This is called exponential distribution By independence of different time intervals ht dt ht l r dt 6h8trht0 gt ht eXprt Probability to have n events in 0 t Pnr eXprt rtn n This is called Poisson distribution CISC636 s08 Lec4 Liao What is the mean proportion of the genome covered by one or more fragments Randomly pick a point the probability that to its left Within L Where there are at least on fragment is l eXpNLG Example to have the genome 99 covered the coverage NLG shall be 46 and 999 covered ifNLG is 69 Is it enough to have 999 covered Human genome has 3 X 10 9 bps A 69 X coverage will leave 3000000 bps uncovered which cause physical gaps in sequencing the human genome Then what is the number of possible gaps CISC636 SOS Lec4 Liao contigs Number of 3580 3888 25693 8066 1586 1880 SBVJ CISC636 SOB Lec4 Line 7 7 What is the mean of contigs N eXpNLG For G 100000 bps and L 500 NLG110 15 20 30 40 50 60 70 Mean of contigs 736 669 541 299 147 67 30 13 CISC636 SOS Lec4 Liao Assembly programs Phrap Cap TIGR assembler Celera assembler CAP3 ARACHNE EULER AMASS A genome sequence assembly primer is available at httpWWWcbcbumdeduresearchassemblyprimershtml CISC636 SOS Lec4 Liao Sequencing by Hybridization Hybridize target to array containing a spot for each possible k mer TGT TGG TGA ACTGAC ACTGAC ACTGAC ACTGAC TGT TGG TGA ACTGAC ACTGAC CTG CTT ACTGAC CTA CTT CTG ACTGAC ACTGAC ACTGAC CTA GAC GAA GAT ACTGAC 3AA GAT GAO ACTGAC ACTGAC ACTGAC CISC636 08 LecA Liao The spectrum of a sequence multi set of all its litlong substrings k mers Goal reconstruct the sequence from its spectrum ACT CTG TGA gt ACTGAC GAC Pevzner 89 reconstruction is polynomial CISC636 SOS L604 Liao Sequence reconstruction and Eulerian Path Problem Pavzner 89 A 11 flllkil the sequence 15 I the 1 mer Haw bark2 The de Bruijn graph of A GA V E where IV A1i1umn1 IE 8i i ln 61 AlH41 GCT TGC GCC ACTGCTGCC ACT cIsc636 508Lec4 ano Different sequences can have the same spectrum ACTCTATAC ACTAC TACTA cIsc636 508 Le4ano Shorty Assembly from Short Paired Reads 1 Clean input readpairs to correct basesequencing errors using read frequency analysis and consensus read correc tion N i Construct the de Bruijn subgraph on left reads so as to group associated the right readsi L i Construct the de Bruijn graph on the right reads of each left group 4 i Select contigs of suf cient size to pass through a shotgun assembler 5i Postassembly contig extension Chen amp Skiena 2005 httpwwwcssunysbeduskiena crscsas 08 Lec4 Lrao CISC 636 Intro to Bioinformatics Spring 2008 Systems biology Gene expressions pro ling and clustering CISC636 808b LecZO Liao expression level A N w 0 gt5 N N w w e lgt m m m m b I an expression level N w a a N N w m a a m u m as s N expression level 3 expression 4eve A N m o N w A Typical expression pro les Cmsrw Mean mus NovelEST can Me My ratio was mwm Famov Mavkcv NME mm mam am am CISC667 07 Lec23 Liao Russ Altman partitions partition 1 partition 2 partition 3 partition 4 partition 5 partition 6 Hierarchical clustering clusters X6 X5 X4 X3 X2 X1 X1X2X3X4xsxs X1XZX3X4X5x6 X1X2X3X4X5X5 X1XZX3X4Xsxs disjoint X11X21x3IX4IXSIX6 X1IX21X31X41X51X5 conjomt Hierarchical clustering W um mlvl mm mm muunmmm a r l39I mummnmmm mm wu ww um a gumei ilwnzztmx mi rm mm It on n 5m x w w mm a rcuIvHLr mun nun1M mun1n IA uvwxnu H mm r u zrmmicmm m Mmm Am crsc557 07 Lec23 Liao Rmmunm Rmmunm c15c557 07 Lec23 Liao Effects of various metrics for measuring distance A Euclidean normal 2 normal 3 cancerA cancerB cancerC 000 002 004 006 008 0l10 012 distances B Pearson normal 1 normal 2 normal 3 cancer A cancer B cancer C I 00 01 02 03 04 05 06 distances Different clustering schemes A h 39 Slngle 11nkage asei ase4 3393 3525 L5 level 5 a dislaznces 3 A easel 20 2 case 2 22 m Complete llnkage 39 54 case 3 55 9 case 4 70 c a I istanaces 4 5 5 case 5 74 39 m2 39 centro1d 39 We 2 3 distances Iterative Distancebased Clustering K rneans Basic idea Given a predetermined constant k the number of clusters iteratively reccmpute centers means of k clusters starting from randomly chosen k instances as centers 1 K instances are chosen at random as cluster centers 2 Instances are assigned to their closest cluster center generating 19 clus ter 3 While there is change in cluster centers 4 Compute the centroid mean of all instances in each cluster 5 Instances are assigned to their closest cluster center generating k cluster 6 end CUunesyufSulem crscom 07 Lec23 Line 9 A Correct Clustering Example O mum r mmmmmm g 7 c1scee7 s07 Lec23 Liao Cumesy ufSunKm An Incorrect Clustering Example O mum The initial ChOiCB of Cluster centers node 1 and nodal leads to an incor rect Clustering Obviously a diff nt choice of Cluster centers node 1 and node 3 result in a correct Clustering CwunzsyafSunme c15c557 07 Lec23 Line 11 Discussion 1 The iterative procedure for k means may end up with a local minimum depending on the initial choice for cluster centers to A simple heuristic is to run the kmean clustering several times with different starting points on How do we know the number of clusters in advance Many different C can he tried ggt K mean clustering as most clustering techniques assumes that in stances can he placed in Euclidian space 01 Speeding up the K mean algorithm is important See the paper in SIGKDD Exploration July 2000 by Farnstorm Lewis and Elkan httpwwwcseucsdedu lkan CumesyufSunKim crsc557 so7 Lec23 Line 12 Fuzzy kmeans clustering Fuzzy membership Each data point X has some probability to belong to a cluster W centered at u PWX The probabilities of cluster membership for each point are normalized Zi1tokPW1Xj1f0rl19man 1 Cluster cost J 2 Z 1 1 tok 2 j 1 ton PW1Xjb IIXJ39 uil2 2 CISC667 SO7 Lec23 Liao l3 Condition for minimum cost aJ 8ui O ui E j z 1 PltWiIxJgtb xpZ J 1 PltWiIxJgtb gt 3 Update posterior probability as PW1Xj ldij 1b1Zr1tok1drj MM 4 Where dij HXJ uillz CISC667 SO7 LecZ3 Liao 14 Fuzzy kmeans Clustering algorithm initialize u1 uk normalize PWiXJ by eql do reeompute ui for i l to k by eq3 reeompute PWiXJ by eq4 until small Change in ui and PWiXJ return u1 uk CISC667 SO7 Le023 Liao 15 Classical kmeans is a special case when membership is de ned as PWiXJ 1 if xj ui lt xj ui for all i i 0 otherwise CISC667 SO7 Le023 Liao l6 Support vector machine SVM Application of SVM classification TASET 16063 genes 218 human tumor samples IIIIIIIIIIIIII BRPR LU CO LY BL ML i UT LE RE PA 0V ME CNS J MULTICLASS PREDICTION OVA melanbm OV39A clasfsjfier 1 cl a ssifjgr 7 39 classifier 14 ALLOTHER h er lane yp p O 39 0 I O 1 I 0 I I O I I O O breast I confidence high con dence prediction 2 V O O 0 test sample 0 BREAST Ir BR PR LU CO LY BL ML UT LE RE PA OV ME CNS CISC 636 Intro to Bioinformatics Spring 2008 Lecture 1 Course Overview Li Liao Computer and Information Sciences University of Delaware Administrative stuff 0 Syllabus and tentative schedule check frequently for update 0 Office hours 300PM5OOPM Wednesdays Appointments 9 Collect info name email dept language 0 Introduce textbook and other resources a URLs PDFPS files or hardcopy handout is A reading list 0 Workload as 4 homework assignments handson to learn the nuts and bolts Language issue Perl is strongly recommended A tutorial is provided a Midterm and final exams 399 Late policy 15 off per class up to two class mtgs CISC 636 808 L801 Liao Bioinformatics Books Markketa Zvelebil and Jeremy Baum Understanding Bioinformatics Garland Science 2008 0 Dan E Krane amp Michael L Raymer Fundamental Concepts of Bioinformatics Benjamin Cummings 2002 R Durbin S Eddy A Krogh and G Mitchison Biological Seguence Analysis Probabilistic Models of Proteins and Nucleic Acids Cambridge University Press 1998 Jo o Meidanis amp Jo o Carlos Setubal Introduction to Computational Molecular Bioloqv PWS Publishing Company Boston 1996 Peter Clote and Rolf Backofen Computational Molecular Biology An Introduction Willey 2000 Dan Gusfield Alqorithms on Strinci Trees and Sequences Cambridge University Press 1997 P Baldi and S Brunak Bioinformatics The Machine Learning Approach The MIT press 1998 o DW Mount Bioinformaics Sequence and Genome Analysis CSHLP mm 80839 Lecm Llao Molecular Biology Books Free materials 0 Kimball39s biology 9 Lawrence Hunter Molecular biology for computer scientists o DOE s Molecular Genetics Primer Books 9 Instant Notes series Biochemistry Molecular Biology and Genetics 6 Molecular Biology of The Cell by Alberts et al CISC 636 808 L801 Liao Bioinformatics use and develop computing methods to solve biological problems The field is characterized by an explosion of data difficulty in interpreting the data large number of open problems until recently relative lack of sophistication of computational techniques compared with say signal processing graphics etc CISC 636 808 L601 Liao Why is this course good for you According to a report in recent ACM Technews CS enrollment has dropped for good or bad 9A factor for this drop is quotthe growing prominence of biotechnology and other fields quot Bioinformatics is a computational wing of biotechnology CISC 636 808 Lec l Liao 3 httpquotwwwccgatechedulnewsldldawnloadpdf Microsoft Internet Explorer File Edit 6010 Favorites Help eBack v Jl Equot 7 Search Favorites Q Media 3 17339 s er Address hHplilii n rr nal arh 39 39 39 quot 39 39 39 nrlF vl Ga Links v Heaveampy GE Search 37539 1 Select v 42 l 3 Q 124 v 9 73 v Zsign v rml I 2 Top 10 Most Popular Magazine and Computing Surveys Articles Downloaded In March 2005 l in A Survey of PeertoPeer Content Distribution Technologies Stephanos And mursellisTheotokis Diomidis Spinellis ACM Computing Surveys Dec 2004 2 76 Bioinforrrmics An lntmduction for Computer Scientissquot iacqum Cohen ACM Computing Surveysjine 1004 3 2 Dam Clusteringquot AKJain MN Murry PJ FlynnACM Computing Surveys Sept I999 4 412 The University39s Next Challengesquot Peter 1 Deming Commnications ofthe MM May I996 5 3 Face Recognition N Zhao R Chellappa PJ Phillips A Rosenfeld ACM Computing SurveysDec 2003 6 I62 Why M Slog Bonnie A Nardi Diana 1 Schiano Michelle Gumbrecht Luke Swarm Communications of the ACM Dec 2004 w 7 4 AspectOriented Programming Tzilla Elr ad Robert E FilrnanAef Baden Communications ofthe ACM Oct 200 E E 9 ll Emerging Busines Models for Mobile Brokerage Servicesquot Clayton A Looney Leonard MJessupJoseph S g Valacid39i Communications ofzhe ACMLjuna 2004 t qt 9 9 iquot The Suite of the Ari in Automating Usability Evaluation of User Interfaces Melody C ivory Marti A Hearst ACM Computing Surveys Dec 200 I 7 0 E l0 l7 USTechnology Policy In the lnfonmmion Agequot Hal Berghel Communications afthe ACNljurle I996 8 quotus Imdms no lvclude amnimds 1 r me Febmmjr and Mnrdi usms as dime mm mm m reflect esubscrlhei1 misusing man 455125 Coming Next Month in V E ii ilzutmfz l o ulHl v rimM Adobencr New Sites news Nature 418 293 2001 doi101038l3506610 Singapore invests in bioinformatics DAVID CYRANOSKI TOKYO Plans for a new institute in Singapore could address one of the most acute skills shortages in science by producing up to 100 ire The planned Bioinformatics Institute is part of Singapore39s USl billion a year effort to turn the island into a powerhouse of biomedical research Within ve years the institute should be delivering 100 masters degrees in bioinformatics This is more than any other institution in the world says Limsoon Wong director of the Kent Ridge Digital Bio lforrnatics Laboratories and one of the planners behind the institute According to Wong training at the institute will go well beyond the curating of data quotWe will be training people how to make predictions 39om the data concerning interaction between proteins and how to use these data to drive experimentsquot he says The research and teaching institute will be housed temporarily at rst before moving to the planned 39biopolis39 science park near the National University of Singapore when it opens in two years time The government has yet to announce its funding level but the park is expected to start with a grant ofaround 13100 million US60 million The institute is likely to absorb the existing bioinforrnatics centre at the National University and to help service the nation39s Qunaquot expanding genomics programme Its own research programme will follow from the interests of the staff who will be recruited internationally Gunaremarn Rajagopal a theoretical physicist at Cambridge University has been named as the instimte39s deputy director and starts wor to build a world class research organization with strong encouragement commitment and active support of the government of Singapore Nature Macmillan Publishers Ltcl 2001 Registered No quot859198 England naturejobs careers I39Jarurejobs 413 25 October 2001 8 2001 doi10103885101T88 Science 8 Technology Networks in Scandinavia December 12th Nature supplement Playing catch up ROBERTTWENDL Robert Tiiendl is a freelance writer based in Tokyo Japan39s government is belatedly realizing that it neeils to increase funding for training in bioinfonnatics says Robelt Tl ient39 field could hinder the country39s efforts Well trained bioinformatics specialists in Japan are not just rare they are virtually non existent This is partly because of a lack of formal education in the subject and the problem is systemic With little formal recognition ofbioinformatics as a eld graduate departments have until recently allocated only a limited number of students to existing bioinformatics teachers The government recognized the need for more bioinformaticians as it scaled up the country39s genomics e o s This year Japan39s Ministry of Education Science Sports and Culture started to upgrade bioinformatics education at national universities by creating additional staE positions and mding both undergraduate courses and graduate level informatics training Kyoto University39s new bioinformatics centre is a product of this new policy The university is Japan39s leading academic centre for bioinformatics research but until a few months ago all its bioinforrnatics activities were concentrated in just one laboratory Minoru Kanehisa39s lab at the Institute for Chemical Research Tokyo University also plans to increase its bioinformatics education and training activities And the private Keio University has set up a whole new campus focusing on systems biology and the dynamic modelling of biological systems such as human blood cells Part of the ministry39s promotion and coordination fund the programme will provide between U851 million and US2 million in additional funding over several years for undergraduate and graduate education in bioinformatics systems biology protein functional analysis and software development Roses Biolog joining hioinft tower HKLNIVE DEHNLII NFUKMHIIUN LLHgt3PIEU QUHQLKIUt Nnrun jnbs Br39ntc39chrmlngy March 2001 Volume 19 Number 3 p1285 23G MAR 2001 University bioinformatics programs on the rise Randy J Zauhar FIGU RES r TABLES Randy J Zauhar is associate professor of biochemistry and director ofthe graduate program in bioinformatics Department of Chemistry and Biochemistry University ofthe Sciences in Philadelphia 500 S 43rd Street Philadelphia PA 191044495emai1 rzauha usi eclu Fueled by strong demand from students and industry39s neecl for trained bioinformaticists universities are increasing their offerings in this fastgrowing field Information is found in the arrangement of things with respect to each other Whetherwe obsenre the arrangement of inllt on the page to form words orthe order of nucleotide bases to form a gene sequence we recognize that information is the key to understanding the world around us Bloinformatlcs studies how information is stored reproduced and used by living systems it is not an overstatement to say that bioinformatlcs is what biology is evolving to become in the 2lst century The subject matter of bioinformatics has existed as long as there has been molecular biology but it has emerged as a distinct discipline only in the last two decades This was m39nmnlorl hit a veritable ovmlncinn nF inFnrmafinn frinnororl hit careers and recruitment Mm 389 420 422 1997 Liann lm Publishers Ltd Running to catch up in Europe HELEN GAVAG HAN Helen Gavaghan is a science and technologyuuriter based in Hebden Bridge Yorkshire UK Across Europe the story is the same Demand for those skilled in bioinformatics exceeds supply Like biochemistry and biophysics befo the barriers between traditional academic elds and demanding exibility and a new way of from its adherents Computational biology has meant different things to diEerent people Not too 10 ago says Hans Prydz of the University of 051039s Biott handling m data or analysing Doppler echograrns Now renamed bioinformatics it means looking for patterns in DNA and RNA prI modelling proteins and mining massive databases that continue to grow When the DNA database run by the European Bioinformatics In contained TOOJJCIU nucleotides now there are more than a billion Driven by the scienti c and commercial importance ofbioinformatics in genomics and drug discovery and development governments un responding with varying degrees ofvigour and success to the skills shortage and are seeking ways to cross the boundaries between disci physics mathematics computer science statistics protein chemistry genetics and molecular biology At European level the EBI based near Cambridge United Kingdom is funded to the sum of about DM 9 million 5 million by meml Israel via their contributions to the European Molecular Biology Laboratory ENIBL in Heidelberg Germany Contributions from the ph industries roughly double the institute39s income The EBI an offshoot of EIVIBL develops tools for bioinformatics seeks innovative ways training courses for academics and industrialists Initiatives with industry include the Industry Af liates Initiative which helps small and me and apply new techniques the BioTitan Project running nodes to enable faster access to databases and the Biostandards project funds European Union for promoting and developing standards National initiatives also exist particularly in the United Kingdom and Germany Says Andrew Lyall responsible for bioinformatics at Gla in pretty good shape quot There are two government nanced initiatives in the United Kingdom both of which received a second lease of lit One of these schemes supported by the Biotechnology and Biological Sciences Research Council BBSRC coordinates the UK bioinf the scheme has concentrated on developing software that would enable biologists without information technology IT skills to use some their trade that are found on the World Wide Web At a meeting earlier this month the steering committee of the scheme decided to cha Brass who runs a masters39 degree course in bioinformatics at the University ofManchester and is a member ofthe committee says quotWt careers and recruitment Mm 404 686 as 2000 e samurai Publishers Ltd Training United States gives priority to skills shortage POTTER WICWVARE Bioinfonnat39irs monies together a wide range of scientific disciplines hut ith a global shortage of skilled researchers train WASHINGTON Industry is draining bioinformatics talent Born universities faster than it can be replenished This is good news for the pee news for the institutions that are scrambling to provide it says Francis Ouellette at the University of British Columbia s Center for Molec Ouellette and Christoph Sensen at Canadian Bioinformatics Resource in Halifax Nova Scotia run a fourpart survey series one week I genomics proteomics and tools development which introduces people to the eld Ouellette worries that the series is only a temporary Sensen stresses the dif culties academic groups have in nding and retaining talent quotIn two years of looking I haven39t found a person will environment PhDs either go to a company or to a nice warm place in the United States where they also get more money But there is an academia because that39s where much of the real science is donequot Chris Lee of the Bioinforniatics Institute at the University of California Los Angeles concurs Industry has the data he says But it lacks lllSEI ViCE university as well as the freedom to quotsit around talking about problems with people from different backgroundsquot The gap between supply and demand in bioinforrnatics is receiving o icial recognition in the United States The US National Institutes of bioinformatics mainly through two institutes the National Human Genome Research Institute and the National Library of Medicine HOW centres outside the NIH must also arise The NIH approves the concept of developing such quotcentres of excellencequot but has been slow tr in astructure The National Institute of General Medical Sciences has also committed itselfto funding training slots and a fourth branch of the NIH the Resources which is not an institute has put itselfbehind shared bio computational resources at more than a dozen centres nationwide 39 Argonne and Oak Ridge laboratories are also huge mders of bioinformatics work as is to a somewhat smaller extent the Department On the private side the Howard Hughes Medical Institute HERE has declared that it will appoint investigators in computational biologj that until now has avoided funding research in what it viewed as engineering disciplines Now however it is becoming clear that biocom IIHIVJI s biomedical mission but is one of its most critical elements Other support is also issuing om the Alfred P Sloan Foundation which has recently called for proposals to fund academic units that CI39E in biology Traditionally these degrees have not carried the same weight in biology as in engineering or business where they are terminal 39 nan careers and recruitment Mum JIJD 963 2001 Man Publishers Ltd Who makes the best bioinformaticians PAUL SMAGLlK Paul Smaglik is editor of Naturejobs Bioinformatics careers can be divided into two paths developing software and using it The eld catalysed by the rapid accumulation of genomic d attention as a salvation for jobs in biology But that sentiment may not provide an accurate assessment of job opportunities at least for career prosp For example InforMax one of the largest bioinformatics companies in the United States generally doesn39t hire biologists tumed prograrmners say chairman and chief executive o Eicer of the company based in North Bethesda Maryland IrforMax has about 95 programmers almost all ofvvhom come Born a maths physics or computer science background Titomirov says it is quotmuch people with those skills about biology than to teach biologists how to code well However as the company turns to developing software to handle fl and protein data it may draw on more biologists to help design new software modules Nature Macmillan Publishers Ltd 2001 Registered No 85998 England It is much easier to teach people with those skills about biology than to teach biologists how to code well Industry is moving in IBM 9 BlueGene The fosTes r compuTer39 with 1 million CPU 9 Blueprim worldwide collecTs all The pr39oTein informaTion 6 Bioinfor39maTics segmem will be 40 billion in 2004 up from 22 billion in 2000 GlaxoSmiThKline Celer39o Merck As rr39oZeneco CISC 636 808 LeC i Liao Computing and IT skills 9 Algorithm design and model building 9 Working with unix systemNVeb server 9 Programming in PERL Java etc o RDBMS SQL Oracle PLISQL CISC 636 808 LeC l Liao People International Society for Computational Biology wwwiscborg 1000 members Severe shortage for qualified bi informatians 8180 635 808 Lec i Liao Conferences ISMB Intelligent Systems for Molecular Biology started in 1992 RECOMB International Conference on Computational Molecular Biology started in 1997 PSB Pacific Symposium on Biocomputing started 1996 TIGR Computational genomic started in 1997 CISC 636 808 Lec l Liao Journals Bioinformatics BMC Bioinformatics Journal of Computational Biology Genomics genome Research Heic Acids Research CISC 636 808 Lec1 Ljao


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Bentley McCaw University of Florida

"I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

Anthony Lee UC Santa Barbara

"I bought an awesome study guide, which helped me get an A in my Math 34B class this quarter!"

Steve Martinelli UC Los Angeles

"There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

Parker Thompson 500 Startups

"It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.