BIOINFORMATICS NETWORK ANALYSIS
BIOINFORMATICS NETWORK ANALYSIS COMP 572
Popular in Course
Popular in ComputerScienence
This 288 page Class Notes was uploaded by Cleora Stiedemann on Monday October 19, 2015. The Class Notes belongs to COMP 572 at Rice University taught by Luay Nakhleh in Fall. Since its upload, it has received 14 views. For similar materials see /class/224955/comp-572-rice-university in ComputerScienence at Rice University.
Reviews for BIOINFORMATICS NETWORK ANALYSIS
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 10/19/15
Biological Sequence Analysis A Brief Ovewiew Or COMP57I in one class meeting COMP572 Spring 2008 Outline Evolution Sequence alignment Phylogenetics Speciesgene evolution Sunday February 10 2008 Sequence Alignment uuuuuuuuuuuuuuuuuuuu 08 Life through Evolution 0 All living organisms are related to each other through evolution 0 This means any pair of organisms no matter how different have a common ancestor sometime in the past from which they evolved 0 Evolution involves 0 inheritance passing of characteristics from parent to offspring 0 variation differentiation between parent and offspring 0 selection favoring some organisms over others Sunday February 10 2008 Sequence Variations Due to Mutations 0 Mutations and selection over millions of years can result in considerable divergence between presentday sequences derived from the same ancestral sequence 0 The base pair composition of the sequences can change due to point mutation substitutions and the sequence lengths can vary due to insertionsdeletions Sunday February 10 2008 DNA Sequence Evolution uuuuuuuuuuuuuuuuuuuu 08 DNA Sequence Evolution ACCTG UZgt mmncmsnm mltoHo gth0m gthm gtnm DNA Sequence Evolution ACCTG Deletion A I Substitution ACGTT ACTG DNA Sequence Evolution Substitution ACCTG Deetion ACGTT ACTG ACT ACGTT ACTG ACTGG DNA Sequence Evolution Substitution ACCTG Deetion I Insertion ACGTT ACTG Deletion ACT ACGTT ACTG ACTGG Principles of Sequence Alignment l Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity Mismatches gap indels insertionsdeletions Sunday February 10 2008 7 Principles of Sequence Alignment 2 Alignment can reveal homology between sequences 2 l Similarity is descriptive term that tells about the degree of match between the two sequences 22 Homology tells that the two sequences evolved from a common ancestral sequence 23 Sequence similarity does not not always imply a common function 24 Conserved function does not always imply similarity at the sequence level 25 Convergent evolution sequences are highly similar but are not homologous Sunday February 10 2008 Principles of Sequence Alignment 3 It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences 3The probability of a match by chance is much higher in DNA sequences than in protein sequences 32The genetic code is redundant identical amino acids can be encoded by different codons 33The complex 3D structure of a protein and hence its function is determined by the amino acid sequence Hence conserving function leads to fewer changes in the amino acids than in the nucleotide sequence Sunday February 10 2008 9 Scoring Alignments 0 Given two sequences the number of possible alignments is exponential 0 Finding the correct alignment involves defining a scoring scheme and nding an alignment with optimal score Sunday February 10 2008 10 Different Scoring Schemes Lead to Different Alignments A Bovine PI3Kinase piiue CAMPdependent protein kinase Bovine PIezxinese p1lDa CAMPdependent protein kinase Bovine Pyeaxinase p1103 CAMPdependent protein kinase Bovine BI sxinase p1ioa eAMBeoepenoent protein kinase LNWENPDIMSELLFQNNEIIFKNGDDLRQDMLTLQIIRIMENINQNQGLDLRMLPYGCLSIGDCVGLIEVVRNSHTIMQIGCKGELKGAL WENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHMETGNHYAMKILDKQKVVKLKQIEHTLNEKRILQAVNFPFLVKLEFSFKDNSNLV Very high gap penalty results in gaps only at beginning and end and l0 sequence Identlt LIVSKGAQECTKTREFERFuEMCYKAYLAIRQHANLFINLFSMMLGSGMPELQSFDDIAVIRKTLALDKTEQEALEVFMKQMNDAHHGG F39Il QFNSHTLHQHLKDKNKGEIYDAAIDLFTRSCAGYCVATFII FHIDFGHFL TGDF MVMEYVPGGEMFSHLRRGRFSEPHARFYAAQIVLTFEYLHSLDLIVRDLKPENLLDQGGYIQVTDFEFAKRVKGRTHXLCGTPEYLAP rvu vex KDLLRNLLQVDLTKRFGNLKNGVNDIKNHKWF w39LTKMDHIFHTlKQHALN ATTv AnAi nkvnnrrA i 3 Bovine FI 3Kinase p1103 CAMPdependent protein kinese Bovine presKinase o1ine CAMPdependent protein kinase Bovine PyesKinase p1103 eAMB oependent protein kinase Bovine FI3Kinase D1703 enmpeoependent protein kinase LNUENPDIMSELLFQNNEI1FKNEDDLRQDMLTLQIIRIMENINQNGGLDLRMLFYGCL51GDCVGLIEVVRNSHTIMGIGCKGGLKGAL 7WENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHMETENHVAMKILDKGKVVKLKRIEHTLNEKRILEAVNFPFLVKLEFSFKDN Very low gap penalty results in many lFNSHTLHQWLKDKNKGEIYDAAIDLrIn Hui FHIDFGHFL LeiT o i SNLYMVMEVVPGGEMFSHLRR IGRFSEFHARFVAAGIVLTFEVLHSLBLIVRDLKPENLLI39DGQGVIQVTDFGFAKRVKGRTHXLCGT more gaps and 3984 sequence ldentlt NAFI QBFL estKsAaECTKTREFERreaencWYKAVLA1RaHANLF1NLFSMMLGSGMPELnsFBMAHRKTLALBKTEQEALEVFMK PEYLAPFTYI I le n in ESHF KDLLPNIInVnITKn FGNLKN QMNDAHHGGWT v IKQHAL GVNDIKNHKMFATTDWIAIVQRKVEAPFIPKFKGPGDTSNFDDVEEEEIRVXINEKCGKEFSEF SundayFebruary102008 Scoring Schemes 0 Several substitution matrices and gap penalty models have been developed c c s s T T P P A A G G N r N V D D E E 7 Q x 7 Q H 3 1 2 2 2 2 l 1 U 0 8 H R 3 l l 2 l 2 D 2 0 1 D 5 R 6 K 3 0 l 39l 2 J 1 1 1 1 Z 395 K 2 5 M M l L L V V F F Y Y w w 11 AGNDEQH N H BLOSUM62 PAM l 20 Sunday February 10 2008 12 Pairwise vs Multiple Alignment PaIrWIse Multiple F398117P13K7like P13K68D PK3BHUMAN Pl lG Ul P13K68DHUMAN pSLbinding domain P11GHUMAN i p85 inding domain P13K92E PigHUMAN P11DHUMAN KEY helical kinase ras binding ABD C2 domain domain domain PX C2 domain Sunday February 10 2008 13 Local vs Global Alignment Local Global 10 20 3O 40 50 PI3kinase HQLGNLRLEECRI MSSAKRPLWLNHENPDIMSELLFQNNEIIFKNGDDLRQDMLT CAMP PK GNAAAAKKGXEQESVKEFLAKAKEDFLKKWENPAGNTAHLDQFERIKTLGTGSFGRVML IO 20 30 40 50 so PI3 kinase LQIIRIME NIHQNQGLDLRMLPYGCLSIGDCVGLIEVVRNSHTIMQ IQCKGGLKGAL CAMP PK VKHMETGNHYAMKILDKQKVVK LKQIEHTLNEKRILQAVNFPFLVKLEF so 90 150 160 PI3kinase QFNSHTLHQWLKDKNKGEIYDAAIDLFTRSCAGYCVATFILGIGDRHNSNIMVKD D CAMP PK SFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEPHARFYAAQIVLTFEYLHSLDLIYRlDLK 140 160 PBkinase DRHNSNIMVKDDGQLFHI cAMPPK DLKPENLLIDQQGYIQVTDFG m 200 PI3kmase GQLFHIHFLDHKKKKFGYKRERVP FVLTQDFLIVISKGAQECTKTREFE CAMP PK PENILLIDQQGYIQVTFAKRVKGRTWXLCGTPEYLAPEIILSKGYNKAVDWWALG 170 180 190 PI3 kinase RF QEMC YKAYLAIRQHANLFINLFSMMLGSGMPELQSFDDIAYIRKTLALDKTEQEA CAMP PK VLIYEMAAGYPPFFADQPIGIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKR 240 280 280 290 300 310 PI3kinase LEYFMKQMNDAHHGGHTTKMDWI FHTIKQHALN cAMP PK FGNLKNGVNDIKNHKHFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVXIN 340 Sunday February 10 2008 14 Hardness of the Problems Pairwise alignment can be solved ef ciently polynomial time for both the local Smith Waterman algorithm and global Needleman Wunsch algorithm cases Multiple alignment is NPhard and various heuristics and tools are available CLUSTALW is one of the most popular tools Efficient heuristics also enable quick searches in sequence databases BLAST and FASTA are the two most popular Sunday February 10 2008 Phylogenetics uuuuuuuuuuuuuuuuuuuu 08 Phylogenetic Tree 1 raccoon 563 on Sunday February10 2008 17 The Phylogeny Problem 0 Given a set of taxa reconstruct their phylogeny evolutionary history 0 Information sought 0 the topology shape of the tree 0 branch lengths 0 Model of evolution substitution matrix base frequencies Molecular Evolution and Its Consequences 0 Most related sequences have many positions sites that have mutated several times the need for distance correction 0 The rate of accepted mutation is usually not the same for all types of base substitution transitions transversions 0 Different codon positions have different mutation rates synonymousnonsynonymous substitutions Methods for Phylogenetic Tree Reconstruction 0 Distance based methods 0 UPGMA FitchMargoliash Neighbor Joining 0 Sequence or character based methods 0 Maximum parsimony MP maximum likelihood ML Bayesian inference Bl 1 guy P 3 r 3 B dij E 5 13 13W 36 6 888 W 1 62 V 6 AD E 0 4 3 D dry 2 88 on 6 1 0 Example of tree reconstruction using UPGMA Sunday February 10 2008 Assessing the Quality of a Reconstructed Tree 0 Simulations the truth is known 0 statistical methods that assign con dence values to branches bootstrap SpeciesGene Evolution uuuuuuuuuuuuuuuuuuuu 08 Species vs Gene Tree A human B human 1 h39 k 1 chicken C 39C en Xenopus 1 xenopus Catostomus 1 Catostomus human 2 I chicken 2 Drosophla human 3 Artemla chicken 3 Hydra Drosophila Artemla Artemia Hydra Sunday February 10 2008 24 What Causes SpeciesGene Tree Differences Gene duplicationloss Horizontal gene transfer Recombination Hybrid speciation lineage sorting In practice the reconstructed trees may simply have errors in them Gene DuplicationAn Example A time I AOL AB I I Aoc AB I I AOL AB speciesA B time ene gene duplication event 1 duplication A both genes initially identical event 1 a a genes undergo functional divergence l to functions on and B Speclatlon I I 0 B speciation gene duplication I I event 2 Ba BB gene duplication event 2 A both genes initially identical I I I Ba Ba BB Bot genes undergo functional AOL Ba By AB BB l divergence to functions 01 and y I I I Bot By BB species B Sunday February 10 2008 Gene DuplicationLosszAn Example tlme A B 0 Con Do BIS CB DE AU D0 313 CB Homology Revisited The term was introduced by Richard Owen in I843 to designate the same organ in different animals under every variety of form and function The distinction between orthology and paralogy two forms of homology was introduced byWalter Fitch in I970 Homology General De nition 0 Homology designates a relationship of common descent between entities 0 Two genes are either homologous or not 0 It doesn t make sense to say two genes are 43 homologous Sunday February 10 2008 29 Orthology vs Paralogy 0 Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes 0 Two genes are paralogs if they are related by duplication Orthology vs Paralogy W Dp2 F lt7 A1 AB1 B1 B2 C1 C2 C3 Source FitchTrends in Genetics 65227232000 Sunday February 102000 31 The Terms Have Been Further Re ned Homologs Genes Shari Oxthoiogs 39 39 39 39 k 39 39 L39 I gene in L 39 ofnhe Compared genomes t t s i differential lineagespeome gene loss Xenologs P omoiogons genes acquired m XGD by one 0139 both ofrhe compared Coroi39thologs 1 1 Twoor collectively w or gt 39 L lineageduetoaquot gt quot I r s t I respecu39ve specintion event G Parnlogs enes related by duplication r L L r L J J 39 r in t speciation event no absolute meaning Ourpamlogs 39 K 39 viven allopuralogs I A 39 39 nnsn nlnr meaning Pseudopamlogs quot L L 39 39 L39 J v r 39 rri a inherian and HCT Source Koonin Annu Rev Genet 39 309338 2005 unue r mum i 2008 Species Tree Reconstruction 0 Only orthologous genes should be used to reconstruct species phylogenetic trees Acknowledgments Materials in this lecture are mostly based on the boollt Understanding Bioinformatics by M Zvelebil and J Baum l st Edition Garland Science 2007 Please see Koonin Annu Rev Genet 39 309338 2005 for an excellent review of homology and its subclasses Robustness of Biological Networks COMP572 Spring 2008 The Robustness Principle Wednesday February 13 2008 2 The Robustness Principle 0 The computations performed by a biological circuit depend on the biochemical parameters of its components such as the concentration of the proteins that make up the circuit 0 These parameters often vary signi cantly from cell to cell due to stochastic effects even if the cells are genetically identical 0 How can biological systems function despite these variations Wednesday February 13 2008 The Robustness Principle 0 Biological circuits have robust designs such that their essential function is nearly independent of biochemical parameters that tend to vary from cell to cell 0 Robustness to parameter variations is never absolute it is a relative measure 0 We talk of the robustness of a property with respect to certain parameters Wednesday February 13 2008 0 Robustness is often misunderstood to mean staying unchanged regardless of stimuli or mutations so that the structure and components of the system and therefore the mode of operation is unaffected O In fact robustness aows changes in the structure and components of the system owing to perturbations but speci c functions are maintained Wednesday February 13 2008 5 Robust adaptation return to a periodic attractor Transition to a new a rector Stochastic process th Robust adaptation i return to a point attractor Figure 1 Robust reactions of the system to stay or to change The state of a system can be shown as a point in the state space In this case the state space is simpli ed into two dimensions Perturbations forcefully move the point representing the system s state The state of the system might return to its original attractor by adapting to perturbations often using a negative feedback loo D M 39 r 1quot 39 attractions in the state space within which the state of the system moves back to that attractor ifthe bound is exceeded the system might move into an unstable region or move to other attractors Positive feedback can either move the system39s state away from the current attractor or push the system towardsanewstate quot J b quot quot 39 a quot 39 facilitate transition between t 39 quot quot t b Often L I 39 L attractors a hen in A phage fate decision but maintenance of a new state has to be robust against minor perturbations nmnl Wednesday February 13 2008 The Robustness Principle 0 Kacser and Burns I973 experimentally demonstrated the robustness of metabolic fluxes with respect to variations of enzyme levels in yeast 0 We now review a study that demonstrated the design principle of robustness using the protein signaling network that controls bacterial chemotaxis Alon et al Nature I999 Wednesday February 13 2008 A Case Study Bacterial Chemotaxis Wednesday February 13 2008 8 Bacterial Chemotaxis 0 Bacteria have the ability to sense and move along gradients of speci c chemicals a process called bacterial chemotaxis 0 Chemicals that attract bacteria are called attractants and those that drive bacteria away are called repellents 0 E coli can sense a variety of attractants such as sugars and the amino acids serine and aspartate and repellents such as metal ions and the amino acid leucine Wednesday February 13 2008 9 Bacterial Chemotaxis O Bacteria can detect concentration gradients as small as a change of one molecule per cell volume per micron and function in background concentrations spanning over ve orders of magnitudeAll this is done while being buffeted by Brownian noise such that if the cell tries to swim straight for IO sec its orientation is randomized by 90 on average 0 How does E coli manage to move up gradients of attractants despite these physical challenges Wednesday February 13 2008 10 Bacterial Chemotaxis 0 The answer was given by Howard Berg in the early l970s E coli uses temporal gradients to guide its motion 0 It uses a biasedrandomwalk strategy to sample space and convert spatial gradients to temporal ones 0 ln liquid environments E coli swims in a pattern that resembles a random walkThe motion is composed of runs in which the cell keeps a rather constant direction and tumbles in which the bacterium stops and randomly changes direction 0 The runs last about sec on average and the tumbles about O sec Wednesday February 13 2008 11 Bacterial Chemotaxis 0 To sense gradients E coli compares the current attractant concentration to the concentration in the pastWhen E coli moves up a gradient of attractant it detects a net positive change in attractant concentrationAs a result it reduces the probability of a tumble and tends to go up the gradient 0 If it detects that the concentration of a repellent increases with time the cell increases its tumbling frequency and thus tends to change direction and avoid swimming towards repellents 0 Thus chemotaxis senses the temporal derivative of the concentration of attractants and repellents Wednesday February 13 2008 12 Response and Exact Adaptation 6 Attractant 15 added 0 C 3 1 E quot5 E 05 7 E adaptation i3 0 i i i O 10 15 20 25 Time min Bacterial chemotaxis shows exact adaptationzthe tumbling frequency in the presence of attractant returns to the same level as before attractant was added In other words the steadystate tumbling frequency is independent of attractant levels The Chemotaxis Signal Transduction Network in put attractants repellents l o E E Output tumbling frequency information about the chemical environment is transduced into the cells by chemoreceptors such as the aspanate receptor Tar which span the membrane The chemoreceptors form complexes inside the cells with the kinase CheA A and CheW W CheA phosphorylates itself and then transfers phosphoryl P groups to CheY Y a di usible messenger protein The phosphorylated form of CheY interacts with the agellar motors to induce tumbles The rate of CheY dephosphorylation is greatly enhanced by CheZ Z Binding of attractahts to the receptors decreases the rate of CheY phosphorylation and tumbling is reduced Adaptation is provided by chan es in the level of methylation of the chemoreceptors methylation increases the rate of CheY phosphorylation A pair of enzymes CheR R and CheB B add and remove methyl m groups To adapt to an attractant methylation of the receptors must rise to overcome the suppression of receptor activity caused by the attractant binding CheA enhances the demethylating activity of CheB by phosphorylating CheB on its aminorterminal domainmiz Findings Adaptation precision i O I 9 539 Adaptation time min CheFt told expression Steadyrstate tumbling frequency tumbles per 5 Finnrn 9 r Oh hr min h R ofiPT Winn nin li V NKIDOL at cells and cells stimulated with 1 mM Leaspartate A precision of moonesponds to exact adaptation Cells lacking CneR RP4968 with the control vector pHSGE S triangle did not respond to attractants but showed a persistent response of about 06 tumbles per 3 to repellent 50 mM Leieuoine b Average steadyrstate quot 39 rinhr naiet anna Frame adaptation 39 p quot M 39 left scaieii Solid circle wildtype strain RP437 pHSGE Si Triangle tumbling frequency of RP4968 pHSGS7SV Lines are guides to the eye Relative CheR l t mea d u immuu u t 39Wiidrtype39 a de ned as the induction level where the adaptation time was equal to that of RP437 pHSGS75 immunobiots also showed tnattne level of other chemotaxis proteins CheB and CneY did not vary measurably with CheR expression Errors in relative CneR level are estimated to be under 30 Mean and standard deviation of triplicate experiments are shown Wednesday February 13 2008 Findings Robust property igt Adaptation precision ii I 539 Adaptation time min CheFt told expression Steadyrstate tumbling frequency tumbles per 5 Finnrn 9 r tin hr min h R oriprr inmmi nin li V NKIDOL ai ceiis and cells stimulated with 1 mM Leaspartate A precision of moorresoonds to exact adaptation Cells lacking CneR RP496E with the control vector pHSGE S triangle did not respond to attractants but showed a persistent response of about 06 tumbles per 3 to repellent 50 mM Leieucine b Average Steadyrstaie quot 39 rinht naiei anna Frame adaptation 39 p quot M 39 len scaieii Solid circle wildtype strain RP437 pHSGE Si Triangle tumbling frequency of RP4968 pHSGS7SV Lines are guides to the eye Relative CheR l t mea d u immou ul t 39Wiidrtype39 a de ned as the induction level where the adaptation time was equal to that of RP437 pHSGS75 immunobiots also showed mattne level of other ohernotaxis proteins CheB and CneY did not vary measurably with CheR expression Errors in relative CneR level are estimated to be under 30 Mean and standard deviation of triplicate experiments are shown Wednesday February 13 2008 Findings Robust property a 5 139 9 1 Q L T E r t A Finnquot 9 t g f 7 i m hr min r n L I U E 05 ofiprr inrinmi nin immuneta 390 n lt 1 2 3 4 5 6 50 cells and cells stimulated with 1 mM Leaspartate A precision of t0corresoonds to b exact adaptation Cells lacking CneR RP496E with the control vector pHSGE S Adaptation time min 3 1 2 3 4 5 CheR told expression 08 06 04 02 we 0 6 50 Steadyrstate tumbling frequency tumbles per 5 triangle did not respond to attractants but showed a persistent response of about 06 tumbles per 3 to repellent 50 rnM Leieucine b Avetage Steadyrstaie quot 39 rinht pallet anna mane adaptation 39 quot 39 t M 39 heft scale Solid circle wildtype strain RP437 pHSGE Sl Triangle tumbling frequency of RP4968 pHSGS7SV Lines are guides to the eye Relative CheR mea ed u immuu m t 39Wiidr a de ned as the induction level where the adaptation time was equal to that of RP437 pHSGS75 immunobiots also showed thattne level of other chemotaxis proteins CheB and CheY did not vary measurably with CheR expression Errors in relative CheR level are estimated to be under 30 Mean and standard deviation of triplicate experiments are shown Finetuned property Wednesday February is 2008 Robustness Mechanisms Wednesday February 13 2008 16 Robustness Mechanisms O Mechanisms that ensure the robustness of a system 0 system control 0 failsafe mechanisms 0 modularity O decouphng Wednesday February 13 2008 System Control 0 System control consists of negative and positive feedback to attain a robust dynamic response observed in a wide range of regulatory networks 0 Negative feedback is the principal mode of control that enables robust responseadaptation to perturbations 0 Positive feedback contributes to robustness by amplifying the stimuli often producing bistability so that activation level of a downstream pathway can be clearly distinguished from nonstimulated states and so that these states can be maintained Wednesday February 13 2008 18 Failsafe Mechanisms 0 Robustness can be enhanced if there are multiple means to achieve a speci c function because failure of one of them can be rescued by others 0 A failsafe mechanism is usually attained by having multiple heterogeneous components and modules with overlapping functions 0 Although redundancy through duplicated genes is frequently observed there is no reported case of duplicated circuits similar networks are the result of convergent evolution Wednesday February 13 2008 Modularity 0 Modularity is an effective mechanism for containing perturbations and damage locally to minimize the effects on the whole system 0 Modules are widely observed in various organisms functioning as possible biological design principles 0 There are physical functional spatial and temporal modules Wednesday February 13 2008 20 Decoupling Decoupling isolates lowlevel variation from highlevel functionalities For example Hsp90 fixes proteins that are misfolded and decouples genetic variations from the phenotype using the same mechanism therefore providing a genetic buffer against mutations These genetic buffers provide robustness to cope with mutation while maintaining a degree of genetic diversity Feedback controls sometimes compensate for changes in rate constants of interactions within the network and changes in the initial state of the network Wednesday February 13 2008 21 Decoupling 0 Another example of decoupling might take place between information encoding and conversion of stimulus dosage into pulses of protein activations 0 When the DNA is damaged p53MDM2feedback loops generate oscillatory behaviorthis behavior was recently found to be a potential converter of graded stimuli degree of DNA damage to digital pulses with a peak of p53 activation so that it is only the number of pulses that matter after the conversion not the subtle changes in concentration levels Wednesday February 13 2008 22 Perturbations 7 Flight control instructions Sgggggband Flight path Feedback Flight Flight CONTOI Figure 2 l Explaining robustness the aeroplane example The concept of robustness computer is best described using c 4 39 llll n i la pa iiger number l number 2 number 3 aei39o lanes have an automatic flight control system AFOS that maintains a ight path direction altitude and velocity of flight against perturbations in atmospheric conditions This can be accomplished by a feedback control in which deviations from the de ned flight path i i W i iie ieo AFC that quot h h maintenance of the flight path by controlling the aeroplanes flightrcontrol surfaces rudder elevator flaps aileron etc and the propulsion system engines AFCS is genera composed of three modules With the same functions thereby creating redundancy although each d 39g d 39 to a old a r mrnnn mode failure Three computers are mace that are modular so that failure in one module does not affect the functions of other parts of the system This type of mechanism is implemented using digital technologies that decouple low level voltages from digital signal ONOFF of pulses thereby preventing noise from influencing its functions Although this is a simplified explanation ofthe actual system the concept applies to details of the basic system as much as it does to the l c a c d i h39 fh 39 39l t39 h i oi Aunt my robustness is the basic organizational principle of evolving dynamic systems be it through evolution competition a market niche or socie y Wednesday February 13 2008 23 The Origin of Robustness Wednesday February 13 2008 24 O Robustness is ubiquitous 0 So what leads to robustness 0 One theory is that robustness is an inherent property of evolving complex dynamic systems various mechanisms incurring robustness of organisms actually facilitate evolution and evolution favors robust traits Wednesday February 13 2008 25 system contro gt g utiui ru ces m g gt Hmsponsss and t 2g 39 ducts 8 m 8 15 g Hgwconaerve I g a core processes g m E quot 5 gt E g a S Conserved versatue nm acss weak Wage gurem H mmh h m m HmnaAH n n Th n m rm n H mman H h m m N mnn um mm m we nH m m r m n nv y m pans ofthe system Robustness Tradeoffs Wednesday February 13 2008 27 O The Wright brothers airplane was not robust against atmospheric perturbations unlike modern airplanes 0 However modern airplanes are extremely fragile against unusual or rare perturbations such as total power failure 0 Although these types of tradeoffs might seem simple they are very important to understand and appreciate because diseases are often manifestations of fragility Wednesday February 13 2008 28 Robustness and Disease 0 Robust systems are most vulnerable when the system s fragility is exposed 0 Diabetes mellitus can be thought of as an exposed fragility of the system that has acquired robustness against nearstarvation a high energydemand lifestyle and high risk of infection but it is unusually perturbed by overnutrition and a low energy demand lifestyle Wednesday February 13 2008 Robustness and Disease 0 The system is relatively tolerant of the removal of some components or cells because of available alternative mechanisms and the robustness of the architecture 0 However the system is vulnerable when components behave inappropriately but are not being removed from the system Wednesday February 13 2008 30 Robustness and Disease 0 The epidemic state might exhibit robustness against natural and therapeutic countermeasures if intrinsic mechanisms for robustness in our bodies are coopted 0 Cancer and HIV infection are maintained and even promoted through the intrinsic robustness mechanism of host systems 0 The dif culty of treating tumors is attributed to acquired robustness partly owing to the cooption of intrinsic mechanisms for robustness Wednesday February 13 2008 31 Towards aTheory of Biological Robustness Wednesday February 13 2008 32 Robustness is a system s characteristics that maintain one or more of its functions under external and internal perturbations Under this de nition robustness R of the system s with regard to function a against a set of perturbations P can be mathematically described as 313 fp wltpgt02ltpgtdp Mp is the probability for perturbation p to take place Wednesday February 13 2008 33 Dp is the evaluation function which determines if the system still maintains function under a perturbation and to what degree 0 pEACP D8 W fanMO p e P A A is a set of perturbations where the system failed to maintain its function For exampleATP production drops 20 under a certain perturbation compared with ATP production under unperturbed statethen Dp is 08 Wednesday February 13 2008 34 gt B I Dpb0 ipf0gt08 p15 p25 p35 p45 a I DP st 0 08 2 fpf0gt06 I Dp ya 0 06 2 fpf0gt04 I Dpe o 04 2 fpf0gt00 I Dpo Degree of perturbation Degree of perturbation m 13a m nun ma a 91 92 93 91 92 93 94 95 96 Features that are perturbed Features that are perturbed Figure 3 Robustness Perturbations are imposed on each feature and at different degree if applicable The figure illustrate coarse grain View of perturbation space where there are six features to be perturbed each of which is perturbed at six different degree Colors on box for each perturbation indicate how system responded to each perturbation Fled box indicate that system fail to maintain its function Dif erent blue colors show the level of degradation of the function Although the area the function is maintained is same in A and B A is considered more robust as the function is better preserved than B Wednesday February 13 2008 35 A system SI is said to be more robust than a system SZ with regard to a function a against a certain set of perturbations Y when Rily gt R3 Considering an entire perturbation space P robustness fragility tradeoff should hold thus difference of robustness between two systems shall be Mi wltpgtltD 1ltpgt D 2ltpgtgtdp R530 R530 0 P Wednesday February 13 2008 36 Measures of Degeneracy and Redundancy in Biological Networks Wednesday February 13 2008 37 O Degeneracy the ability of elements that are structurally different to perform the same function 0 It is a prominent property of many biological systems ranging from genes to neural networks to evolution itself 0 Redundancy the same function is performed by identical elements 0 Structurally different elements may produce different outputs in different contexts Wednesday February 13 2008 38 The Rest of this Lecture 0 Functional measures of the degeneracy and redundancy of a system with respect to a set of outputs using information theoretic concepts Wednesday February 13 2008 Background Let X be a discrete random variable with possible values XX2Xn The entropy ofX is de ned as HX ZP11Og2Pi i1 A fair die is rolled and X is the number it shows 6 1 1 1 HX z2292 10g2 192 6 5101b 5 10g2 5 2585 A biased die is rolled Pl I and X is the number it shows 6 HlX 10g2pli 110g2 1 l 010g2 0 0 Wednesday February 13 2008 Background 0 The joint entropy of two random variables is de ned as HX Y Z 2pxy10g2py way where pmy is the probability of a pair of outcomes 0 A die is rolledX the number shown is gt 3Y the number shown is even PX gt 3Y even PX S 373 Odd PX gt 3 Y odd PX g 3 Y even 2 1 2 1 Chli Chllo Wednesday February 13 2008 41 Background The mutual information of two discrete random variables X andY is MIX Y HX HY HX Y In the previous example MIX Y HX HY HX Y 1 1 12516 2 07484 If instead we have X the number shown is even and Y the number shown is odd MIXY HXHY HXY 2 11 1 1 Another example X a die shows an even number Y a coin shows head MIXYHXHY HXY11 20 Wednesday Februar y 13 2008 42 Example of a System FIG 1 Schematic diagram illuslrating bases for the proposed measure of degeneracy A We consider a system X composed of individual elements gray circles 11 8 that are interconnected arrows among each other with a connection matrix CONX and matrix CONXO MI of the System 0 Consider ajth subset jk of the system X composed of k units and the output sheet 0 Then MIXf 0 HXf HO HXf O Wednesday February 13 2008 Degeneracy of the System 0 We are interested in measuring the causal effects on the output of changes in the state of the subsets of the system 0 To evaluate these causal effects we consider the change in the MI between each subset Xjk and 0 when the subset is perturbed B ti if l 5 HAsubsellsltzltlbdingr bl39llwsystemxis i perturbed by injecting Int 39s at ihe top a xed amount uf variance 0 tuncormlatcd noise into each of its constilucnt units This pcrturln o e 0 nun activate numct cnnncctiunz within the system thick nrrnwsi and pnnluces ch e variance nf a minim til unit within A39 l 39 O The C 39 infm matiun under perturbation v T 1 I A t u 0 e O l 7 39 r 1 17 1 MIPXikO atlhsctsot sizes I SA squot ol the system A 0 Wednesday February 13 2008 Degeneracy of the System The Ml value obtained by the perturbation is MIPX O For simplicity we assume that the MI value is 0 before the perturbation We define the degeneracy DNXO of X with respect to O as 7 DNXO kZltMIPxlt0gt e knMIPXO 22 According to Eq 2a DNXO is high when the mutual information between the whole system kn and the output is high and at the same time the average mutual information between small subsets of the system small values of k and the output is higher than would be expected from a linear increase over increasing subset size Degeneracy of the System 0 The degeneracy can also be de ned as DNXO lZkltMIPXfX e X390gt 2b where 7 MIPXf X 7 Xf 0 MIPX O MIPX 7 Xf 0 7 MIPX 0 According to Eq 2b DNXO is high when on average the mutual information shared between any bipartition of the system and the output is high Redundancy of the System 0 The redundancy RXO ofX with respect to O is the difference between the summed mutual information upon perturbation between n subsets of size kl and O and the mutual information upon perturbation between the entire system kn and O RXO fMIPX JO e MIPXO 3 According to this de nition redundancy is high if the sum of the mutual information between each element and the output is much larger than the mutual information between the entire system and the outputThis means that each of the elements of the system contributes similar information with respect to the output Redundancy is zero if all elements of the system contribute to the output independently and the mutual information between the entire system and O is equal to the sum of the mutual information between each element of the system and O Degeneracy and Redundancy 0 The degeneracy can be expressed as the average redundancy values with respect to O for increasing subset sizes DNXO nknmocm e ltRX Ogt 4 According to Eq4 DNXO is high when the redundancy of the system ie for kn with respect to the output is high and at the same time the average redundancy for small subsets small values of k is lower than would be expected from a linear increase over increasing subset size Graphical Representation of the Measures A B c ltMrxk0gt ltMIF Xl X X Ogt 1 n FIG 2 Graphical representation of different expressions for de generacy A Degeneracy expressed in terms of the average mutual information between subsets ofX and 0 under perturbation see Eq 2a B Degeneracy expressed in terms of the average mutual information shared between bipartitions ofX and 0 see Eq 2b C Degeneracy expressed in terms of the average redundancy see Eq 4 Illustration A Connectivity Correlation ltMIquotX Ogt ltMIPX X X Ogt ltRX0gt IT 5 mi iDv cl 5 0 0 2 4 6 5 Z 4 6 B 2 4 6 8 Sunset Size Subsel Size Subsel Size l W l 5 r 39 lllL 04 a 0 Z 4 6 B Z 4 6 8 2 4 6 E Subset Size Suhsel Size Subsel Size 65 1 9 15 I5 ll H l l l Wu 8 2 l 2 4 E a i A a Subsex Size Suhsel Size Subsel Size 39 39 39 39 n iili Ill arcncatiy39 gt r Barium icdundnnl case A Graphs oi connccllons among nnits of llit system and among syslcm units and thc mil connection strength not all connections shown B Cuirelalinn matriccs displaying corrotations in sic ti m r n FIG 4 39 39 quot I 39k39 IInil I ll a heel nffntn uni s quotquot me a systcm that lacks all intrinsic conncotivity Top independcnl case a system that is cliaiacleiizcd by ion mo L I dnics or we strongly pul Arron widlli indlcnlcs x intrinsic to o as well as X an o ma information hctwcon snhscts of x and 0 over an snhsct size tscc Eq 2a D Distribution of thc nycrn C k gc mntnni information shatcti hctwcen hipartitrons an mid 0 sec Eq 2b E Distrihntion of he arctagc iednndzincy oycr aii subsel sizes soc Eq 4 Degenmacy is mdicatcd in GE as a Slladctl men compare Fig 2 ExpregtSIOIIS oi the otdinalcs ol giaplis t39n CrE cm at the 10p of each column Wednesday February 13 2008 Observations 0 In a system constituted of completely independent elements each element accounts for a different portion of the entropy of the output units and thus can be said to have an independent function 0 In a fully redundant system each element shares the same portion of the output entropyAll elements thus can be said to perform the same function 0 In a degenerate system a large number of elements of the system jointly contribute to portions of the entropy of the output unitsThe system is thus functionally redundant and faulttolerant with respect to many output functions Wednesday February 13 2008 52 Degeneracy amp Redundancy 0 To be degenerate a system must have a certain degree of functional redundancy 0 However a completely redundant system will not be degenerate because the functional differences between distinct elements and thus their ability to contribute independently to a set of outputs will be lost Wednesday February 13 2008 53 Acknowledgments 0 Materials in this lecture are mostly based on 0 Biological Robustness by H Kitano 0 Towards aTheory of Biological Robustness by H Kitano 0 Measures of Degeneracy and Redundancy in Biological Networks by GTononi O Sporns and GM Edelman Wednesday February 13 2008 Elucidating Properties of Networks Based on their Stoichiometric Matrices COMP572 Spring 2008 Nhnday ApH lt4 2mg Biological Components Have a Finite TurnoverTime Most metabolites turn over within a minute in a cell mRNA molecules typically have 2hour halflives in human cells The renewal rate of skin is on the order of 5 days to a couple of weeks Therefore most of the cells that are contained in an individual today were not there a few years ago However we consider the individual to be the same Monday April 14 2008 Biological Components Have a Finite TurnoverTime 0 Components come and go 0 The interconnections between cells and cellular components define the essence of a living process Components vs Systems 0 In systems biology it is not so much the components themselves and their state that matters but it is the state of the whole system that counts Monday April 14 2008 Links and Functional States of a System Links between molecular components are basically given by chemical reactions or associations between chemical components These links are therefore characterized and constrained by basic chemical rules Multiple links between components form a network and the network can have functional states Functional states of networks are constrained by various factors that are physiochemical environmental and biological in nature Monday April 14 2008 Links and Functional States of a System 0 The number of possible functional states of networks typically grows much faster than the number of components in the network 0 The number of candidate functional states of a biological network far exceeds the number of biologically useful states to an organism 0 Cells must select useful functional states by elaborate regulatory mechanisms Monday April 14 2008 6 So Far 0 Understanding general properties of networks degree distributions modularity motifs conservation etc Monday April 14 2008 7 This Lecture 0 This lecture and last one on Petri nets are aimed at understanding how the cell uses networks regulatory and metabolic to change its behavior Monday April 14 2008 8 Elucidating Metabolic Pathways O Metabolism is broadly defined as the complex physical and chemical processes involved in the maintenance of life It is comprised of a vast repertoire of enzymatic reactions and transport processes used to convert thousands of organic compounds into the various molecules necessary to support cellular life Metabolic objectives are achieved through a sophisticated control scheme that efficiently distributes and processes metabolic resources throughout the cell s metabolic network Monday April 14 2008 Elucidating Metabolic Pathways O The obvious functional unit in metabolic networks is the actual enzyme or gene product executing a particular chemical reaction or facilitating a transport process The cell controls its metabolic pathways in a switchboardlike fashion directing the distribution and processing of metabolites throughout its extensive map of pathways To understand the regulatory logic implemented by the cell to control the network it is imperative to elucidate the cell s metabolic pathways Monday April 14 2008 Elucidating Metabolic Pathways O In this lecture we will cover theoretical techniques based on convex analysis that have been used to identify metabolic pathways and analyze their properties 0 The techniques have also been applied to analysis of regulatory networks signal transduction networks Monday April 14 2008 11 Stoichiometry O The set of chemical reactions that comprise a network can be represented as a set of chemical equations 0 Embedded in these chemical equations is information about reaction stoichiometry the quantitative relationships of the reaction s reactants and products 0 Stoichiometry is invariant between organisms for the same reactions and does not change with pressuretemperature or other conditions 0 All this stoichiometric information can be represented in a matrix form the stoichiometric matrix denoted by S Monday April 14 2008 12 The Stoichiometric Matrix 0 Mathematically the stoichiometric matrix S is a linear transformation of the flux lt vector V VI V2Vn to a vector of derivatives of the concentration vector x X X2Xm as dX S dt V The dynamic mass balances equation gtkFlux the production or consumption of mass per unit area per unit time Monday April 14 2008 13 The Stoichiometric MaCtDSix ylt5 9 Five meia iruitizi ABCDE V1 Four internal i39 aulik L TW 391 0 0 0 0 0 V2 dAdt two ofwhich are 2 3911 11 c g 2 V3 y t 39 39 39 reveiZTeriaTrf glyi g SIX 0 0 0 1 1 0 V4 D 739t o o o o o 1 V5 dEdt V6 S v dxdt anday An lt4 2mg The Stoichiometric Matrix v v 33969 d 01 16 Sikvk 1C 7 W 0m 7 711 Fluxes that form C Fluxes that degrade C The Fundamental Subspaces of a Matrix 0 Each matrixA de nes four fundamental subspaces O The column space the set of all possible linear combinations of the columns ofA O The row space the set of all possible linear combinations of the rows ofA O The null space the set of all vectors v for which Av0 O The left null space the null space ofAT Monday April 14 2008 16 The Column and Left Null Spaces of the Stoichiometric Matrix Writing the dynamic mass balance equation as E Slvl82v2 398n 0n where s are the reaction vectors that form the columns of S it is clear that dxdt is in the column space of S The reaction vectors are structural features of the network and are fixed The fluxes V are scalar quantities and represent the flux through reaction i The fluxes are variables The vectors in the left null space are orthogonal to the column space these vectors represent a mass conservation Monday April 14 2008 17 The Row and Null Spaces of the Stoichiometric Matrix 0 The flux vector can be decomposed into a dynamic component and a steady state component V den V88 0 The steady state component satis es SVSS O and vss is thus in the null space of S O The dynamic component of the flux vector den is orthogonal to the null space and consequently it is in the row space of S Monday April 14 2008 18 The Null Space of S Monday Apri 08 The right null space of S is de ned by SVSS 0 Thus all the steadystate flux distributions V55 are found in the null space The null space is spanned by a set of basis vectors that form the columns of matrix R that satisfies SRO A set of linear basis vectors is not unique but once the set is chosenthe weights w for a particular Vss are unique Monday April 14 2008 20 Example 4 V 0000 1000 anday An lt4 2mg I oooo The set of linear equations can be solved using V4 and V6 as free variables to give V4 v4v6 W1f 11 W2I 2 r and r2 form a basis Monday April 14 2008 v1 v4 1 0 v2 v4 V6 1 1 v vv 3 4 a 1 1 V V v4 1 v6 0 w1r1w2r2 4 4 0 1 v5 v6 0 1 V6 V6 r1 r2 For any numerical values of v4 and v6 a flux vector will be computed that lies in the null space Monday Aer 14 2008 Example Any steadystate flux distribution is a unique linear combination of the two basis vectors For example 2 1 o 1 1 1 v 2 w1r1w2r2 2 1 1 J 2r11r2 1 0 1 1 o 1 Monday Aprii14 2008 24 Example 0 1 w1r1w2r2 2 1 393 2r11r2 1 1 This set of basis vectors although mathematically valid is chemically unsatisfactoryThe reason is that the second basis vector r2 represents fluxes through irreversible elementary reactions V2 and V3 in the reverse direction and it thus represents a chemically unrealistic event The problem with the acceptability of this basis stems from the fact that the flux through an elementary reaction can only be positive ievi20A negative coef cient in the corresponding row in the basis vector that multiplies the flux is thus undesirable Monday April 14 2008 25 Example We can combine the basis vectors to eliminate all negative elements in themThis combination is achieved by transforming the set of basis vectors b 10 11 1 10 11 1 1o r1 r2 1 0 01 1 1 P1sP2 01 01 01 01 In this new basis pr whereas the p2rr2 P1 13 N Monday April 14 2008 Linear vs Convex Bases O The introduction of nonnegative basis vectors leads to convex analysis 0 Convex analysis is based on equalities in this case Sv0 and inequalities in this case OSViSVimax 0 It leads to the definition of a set of nonnegative generating vectors Monday April 14 2008 27 b c 593999 Dg ivgrk 4 9 Creation of Reactions stoichiometrlc 0 g Pathway O matrvx 0 1 to analysis 1 O 1 a g lt OO O 0 0 5 m 4 2 T o m x O x L 4 m r 7 39 M can characterize r l39m 4 h r x y a b These A 39 II the 39 39 quot39 A I Wilhihe matrix I m m L m 28 Monday Aer 14 2008 From Reactions To Pathways To Networks and Back to Pathways a 4gt b Evmution 0T networkbased c r 7 7 7 7 7 7 7 7 7 7 7 7 7 7 pathways 1 gt gtA gt 3 System boundary Pathways are concepts but networks are reality Monday ApH lt4 2mg 29 Extreme Pathways Biochemically meaningful steadystate flux solutions can be represented by a nonnegative linear combination of convex basis vectors as Vss Zaipi where 0 S 0 S aimax The vectors pi are a unique convex generating set but on may not be unique EXtreme Pathways for a given VSS These vectors correspond to the edges of a cone They also correspond to pathways when represented on a flux map and are called extreme pathways since they lie at the edges of the bounded null space in its conical representation Monday Apil lt4 2mg an ys em boundary v v v v v vm o a 7 v avg4 n v5 L12 439 02 03 04 o5 as o2 03 V 1 22 0 awo o n 0 10 47110 0 u V5 8 o o141o o o a V3 0 0 IO n o 4 a v4 0 1 1 0 a 0 0 0 71 Dyp 633 Ds 004140o o n no P3 3 t V3 7gt97 i 9 a 9 4 anday An lt4 2mg Extreme Pathways 0 Every point within the cone C can be written as a nonnegative linear combination of the extreme pathways as k C Viv Z wipi w 2 0 Vi 1 Monday Apii lt4 2mg Putting It All Together Convex Analysis of Metabolic Networks 0 A cellular metabolic reaction network is a collection of enzymatic reactions and transport processes that serVe gt I to replenish and drain the relative amounts of certain l metabolites A system boundary can be System boundary drawn around all these types of physically occurring reactions which constitute internal fluxes operating inside the network Monday Apll mi zuua as Putting It All Together Convex Analysis of Metabolic Networks 0 The system is closed to the passage of certain metabolites while others are allowed to enter andor exit they system based on external sources andor sinks which are operating on the network as a whole 0 The existence of an external sourcesink on a metabolite necessitates the introduction of an exchange flux which serves to allow a metabolite to enter or exit the theoretical system boundary Monday April 14 2008 34 Putting It All Together Convex Analysis of Metabolic Networks 0 All internal fluxes are denoted by vi for iElnwhere m is the number of internal fluxes 0 All exchange fluxes are denoted by bi for iElnEwhere nE is the total number of exchange fluxes O Thermodynamic information can be used to determine if a chemical reaction can proceed in the forward and reverse directions or it is irreversible thus physically constraining the direction of the reaction Monday April 14 2008 35 Putting It All Together Convex Analysis of Metabolic Networks 0 All internal reactions that are considered to be capable of operating in a reversible fashion are considered for mathematical purposes only as two fluxes occurring in opposite directionstherefore constraining all internal fluxes to be nonnegative 0 There can only be one exchange flux per metabolite whose activity represents the net production and consumption of the metabolite by the system 0 Thus he can never exceed the number of metabolites in the system Monday April 14 2008 36 Putting It All Together Convex Analysis of Metabolic Networks 0 The activity of these exchange fluxes is considered to be positive if the metabolite is exiting the system and negative if the metabolite is entering the system or being consumed by the system 0 For all metabolites in which a source or sink may be present the exchange flux can operate in a bidirectional manner and is therefore unconstrained Monday April 14 2008 37 G b 1 4 7 5 X 6 9 System boundary 4 anday An lt4 2mg 33 A simple yet informative analysis of a metabolic system may involve studying the systems structural characteristics or invariant properties those depending neither on the state of the environment nor on the internal state of the system but only on its structure The stoichiometry of a biochemical reaction network is the primary invariant property that describes the architecture and topology of the network Stoichiometry refers to the molar ratios in which substrates are converted into products in a chemical reaction egglucokinase converts one mole of glucose and one mole ofATP into one mole of glucose6phosphate and one mole of ADP Sglucoseglucokinase I These ratios remain constant under changing reaction conditions which may serve to alter the kinetic parameters and rate of reaction as a function of time Monday April 14 2008 39 Dynamic Mass Balance S Stoichiometric matrix vvlvzvn flux vector xxlxzxm vector of derivatives of the concentration vector dX S dt V Monday April 14 2008 40 Steadystate Analysis 0 The desired pathway structure should be an invariant property of the network along with stoichiometry O This can be achieved by imposing a steadystate condition S39v0 Monday April 14 2008 41 Constraints 0 All internal fluxes must be nonnegative ViZO Vi Monday April 14 2008 42 Constraints For external fluxes we have OSbSB If only a source input exists only 0 is set to negative in nity and B is set to zero If only a sink output exists on the metabolite O is set to zero and B is set to positive infinity If both a source and a sink are present on the metabolite then the exchange flux is bidirectional with 0 set to negative infinity and B set to positive infinity leaving the exchange flux unconstrained Monday April 14 2008 43 Example 1 ltTQ gtL41gt System boundary Mas minute consumn s 1 4 1 0 U o o 1 7 7 1 71 o a 0 1 71 0 0 l 0 0 0 l annal ux consuainls 0 U a K 20 6 I D U U o a 0 L o in 1 U o b o 0 0 7 U 9 bx b anday An lt4 2mg m r1 m In p I m I m Flux dislnbmiun 1 I U 0 a 0 U h v42010174211 7 1 0 7 39 L Subsci pathwnyx39 an o o o 1 o 0 1 13 12 m p gt 0 o 1 0 o I 0 a P 0 n 0 I 1 D 5 Null space dsmensmns m 5m 0 1 u o x 0 pb MSW nrrmel 0 0 0 0 0 5 Unique decomposmun m v 1 1 l i o 0 0 b v39szluOUO o 0 x 1 4 0 n 7 m a l u a 1 O 0 a v4plt 112 HU39P anday An lt4 2mg Topological Analysis of massbalanced Regulatory Networks Using Extreme Pathways Monday Apri 08 0 Various modeling approaches have been successfully used to investigate particular features of smallscale signaling networks However largescale analyses of signaling networks have been lacking due in part to l a paucity of values for kinetic parameters 2 concerns regarding the accuracy of existing values for kinetic data 3 strong computational demands of kinetic analyses and 4 limited scalability from small signaling modules using kinetic models Monday April 14 2008 47 O Obtaining the extreme pathways of a massbalanced signaling networks allows for analyses focused and based solely on the structure topology or connectivity of a signaling network Monday April 14 2008 48 Categories of Signal Transduction Events Singlelnput SingleeOutput Multipleelnput SingleeOurput classical case signal concatenation A B Singleelnput MultipleeOnrpm Multipleelnput MultipleeOmput signal pleiotropy complex signaling event C D Classification of signal transduction input output relationships The classical case of a transduced signal relates a single input to a single output A Some outputs require the concatenation of multiple inputs B Other signaling interactions occur in which the transduction of a single input generates multiple outputs a type of signaling pleiotropy C Complex signaling events arise as multiple inputs trigger interacting signaling cascades that result in multiple outputs D Monday Apm 14 2008 49 W iduxz H mm 5212 Prototypic Signaling Network ndLilB O QM O huenmllosyslem O lupulmxyswm O empmnynm um Monday Apru 14 2008 mm pr 6 MA MC 1 s o O quotW O a 1mm i D mmk E 1mm WWW gt 1 famZWW L r gt g sn39pwzm lmwwzws w 9 GP 6 i v 1 y 1 J 1 regulmmml anday An lt4 2mg Reaction Listing Mme SITEpWW 7 SVVpVVZVV fornuWZWJ fox mWW2 formzww W3 3 lcquuuon Chem LikrvLR LIVmks rADPiipiLLm gtTp5 L342 gtL R3 SiATPisl rADPislpiLlRLm 7 3 7734 TLpiwl gtT3W3p L442 gtLR 57R 2R2 maximus PSpL3R3m LIVmus Anri LVLR Viv 7w rszrnrwz pVp prwzrwz WVLerLprwrs w1 p 7 why wz WLpp up 7 My 7 W34 WN NZVV341pp 2 W47 7 w3p aww pm System Inputs and Outputs Cumpn um I Inp utf Gr nu p Output L Input Sigm tling IltptttS L3 Input L3 Input L4 Input L5 Input R Input Cnmpnncnts nt Signaling nctwnrk R3 Input R3 Input W Input WZ Input W3 Input ATP Input Interaction with energy nu I ah u unit A DP Output Wp Output Signaling outputs Wlp Output W 3p Outp ut INTJ H lpp Output I W 3 IN lppp O u tp ut 3WW3ppp Output LRJII Output Dugrndtttion UI itlt wtit C rcccptt pr ligund complex LERLLin Output L3R3in Output Monday April 14 2008 52 O Signaling reactions like those just described are subject to mass balance and thermodynamic constraints and consequently can be analyzed using networkbased pathways including extreme currents elementary modes and extreme pathways 0 We focus here on extreme pathways and their use in characterizing topological properties of the signaling network Monday April 14 2008 53 0 There are a total of 2 l extreme pathways 0 These extreme pathways can be studies for l feasibility of inputoutput relationships 2 quantitative analysis of crosstalk 3 pathway redundancy 4 participation of reactions in the extreme pathways 5 correlated reaction sets Monday April 14 2008 54 Feasibility of InputOutput Relationships An assessment of the feasibility of inputoutput relationships can be performed with extreme pathway analysis because all possible routes through a network can be described by nonnegative combinations of the extreme pathways A feasible inputoutput relationship signi es that with the available signaling inputs there exists a valid combination of the extreme pathways that describes the given signaling output Analysis is represented as an inputoutput feasibility matrix Monday April 14 2008 55 Outputs anday An lt4 2mg Crosstalk Analysis 0 Extreme pathway analysis can be used to quantitatively analyze the interconnection of multiple inputs and multiple outputs of signaling pathways which has been called crosstalk O Herein crosstalk is the nonnegative linear combination of extreme pathways of a signaling network 0 The pairwise combination of extreme pathways is the simplest form of crosstalk Monday April 14 2008 57 Crosstalk Analysis 0 As such crosstalk can be classi ed into nine categories based on extreme pathways mpm imqu Wm WW inpmx mam m o A mm D r 1 a m GgtEd dzl mpmsdnjmmoutpms mpms mum anpm mm mpm culprit 0 6 3 o o v J v E a l E I39D39l fl l l mpmx mm mm output My WW 7 o o o o by J dsiomr Inputs redundam mums mm overlay mm mm mu Mum C F D o These classi cations are topological and thus do not account for changes in the activity level of a reaction Nhnday Apil lt4 2mg Monday Apil 14 2mg Inputs O Outputs 0 Total 8 15 589 16 620 o 00 230 07 237 00 134 10 144 Total 15 953 33 Crosstalk analysis of the prototypic signaling network following the classification scheme on the previous slide With 211 extreme pathways there are a total of 22155 21122112 pairwise comparisons Pathway Redundancy 0 Two extreme pathways with identical inputs andor outputs represent two systemically independent routes by which a network can be utilized to reach the same objective input aulpmx 6 ledundaur mpms dlsjmm Burpms A J 8 quotmm Lupus ourpm merlap H quotquot quot WW v upmr WW input mme 0 3 0quot 39 j msiommvuumdundamnmvus ml unve ap xedundnmoukpms fully Yedundnm C F7 13 Nhnday Apil lt4 zuua 0 There were I35 distinct inputoutput states of the 2 l extreme pathways This result suggests that on average the prototypical signaling network can convert an identical set of inputs to an identical set of outputs using two systemically independent routes Monday April 14 2008 61 Completely redundant extreme pathways Pathways 88 and 108 have identical inputs and outputs and yet use different internal reactions Monday Aprii 14 2008 O The number of redundant output states with different inputs was also calculated 0 There were l7 distinct output states in the set of extreme pathways for the network Monday April 14 2008 63 The number oi extreme pathwa with equiialent output states Distinct ttutput states Number ni pathmys Wp WLp W3p w2w3pp WW2W3ppp 2ww3ppp 0 l 0 0 l 0 50 0 O l l 0 0 26 l 1 0 0 0 1 22 l l l 0 0 0 22 l 4 0 U U l 10 0 Z l 0 l 0 10 l 2 l 0 0 0 10 l l 0 l 0 0 10 l 0 0 2 U 0 10 0 0 I 0 0 1 10 l 0 0 l l U 10 0 2 l 0 0 l 4 0 l 0 2 0 l 4 0 l l 0 0 0 4 l 1 0 0 0 0 4 l 0 0 l 0 0 4 l 0 l U U 0 l Because extreme pathways are systemically independent the combinatorial effect of the multiple pathways that produce Wp W2p and W3p cannot explain the redundancy in the output ofWW2W3ppp Rather the redundancy is a result of emergent uses of the network to produce the particular transcription factor Monday Apm 14 2008 an Reaction Participation The number of extreme pathways that a particular reaction participates in can be computed efficiently Disrupting or regulating the activity of highly connected reactions would influence a large number of extreme pathways or functional network states 0 The percentage of a total of 2 l of extreme pathways that use each individual reaction in the prototypic signaling network was computed Monday April 14 2008 65 100 90 Percentage of Pathways U1 2 Reaction Participation 10 20 30 Reaction Number mm wmm van swuw 3 m quotmimimwm izPL w L wmm L z H m u Nhnday Apii lt4 2mg m gt m E m n o a c n c a u a 0 Reaction Participation 20 30 Reaction Number Nhnday Apii lt4 2mg 100 90 Percentage of Pathways U1 2 Reaction Participation 10 20 30 Reaction Number mm wmm van swuw 3 m quotmimimwm izPL w L wmm L z H m u Nhnday Apii lt4 2mg 100 Reaction Participation I Exchange Reactions D internai Reactions i Percentage of Pathways imamlamaquot quotmmuimwm a a m use nan mm may 3 s 10 20 30 40 Reaction Number w w L w392v av L mwu 2 my m g y xmtdbulmn Nhnday Apii lt4 2mg 100 90 Percentage of Pathways U1 2 Reaction Participation 10 20 30 Reaction Number mm wmm van swuw 3 m quotmimimwm izPL w L wmm L z H m u Nhnday Apii lt4 2mg Percentage of Pathways Reaction Participation 10 20 30 40 Reaction Number imamlamaquot annimwm mm m ak mm W L wmun L VLWEJ K 5 L mu39iption Facio Nhnday Apii lt4 2mg Correlated Reaction Sets From the set of extreme pathways for a network correlated reaction sets can be calculated Correlated reaction sets are a collection of reactions that are either always present or always absent in all of the extreme pathways Effectively these sets of reactions function together in a given network although the reactions themselves may not be adjacent in a network map Monday April 14 2008 69 Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table RCACHOH wt RcAClmn names hm LR 2 bmdLZRl z 3 SlpTZ v2 4 bmdLERJ L3 R3 L3R3m s 3123 3pT3 s bmdLAR 7 bindLS s x i39onnwzw3 VZJNKJU 9 lomuZWWZWL VVVZVVBVppp m l39omiZWW3 2 VV3ppp m I R2 LZR Lm l3 ATP ADP Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table Rmcuon Sci Reaction names bmdLR L 2 bindDRI 1 3 1772 mm 4 bindL3R3 L3 R3 L3R3m L3R3p53 53m 5 bindMR L4 7 hm 8 lbn39nVVLVVl W2W3pp 9 immzwwzwa VVVQVV3mpp m i39omuzwwz 2VV3ppp m I R2 LZR Lm Ix ATP ADP Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table RCACHOH wt RcAClmn names hm LR 2 bmdLZRl z 3 SlpTZ v2 4 bmdLERJ L3 R3 L3R3m s 3123 3pT3 s bmdLAR 7 bindLS s x i39onnwzw3 VZJNKJU 9 lomuZWWZWL VVVZVVBVppp m l39omiZWW3 2 VV3ppp m I R2 LZR Lm l3 ATP ADP Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table Rmcuon Sci lnpm r Rcacuon names gt rmVV 2W3 raw34 immzwwzwa VVVQVV3mpp i39omuzwwz 2VV3ppp 111 R2 LZRL m ATP ADP 3913a39I 77lli il output Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table RCACHOH wt RcAClmn names hm LR 2 bmdLZRl z 3 SlpTZ v2 4 bmdLERJ L3 R3 L3R3m s 3123 3pT3 s bmdLAR 7 bindLS s x i39onnwzw3 VZJNKJU 9 lomuZWWZWL VVVZVVBVppp m l39omiZWW3 2 VV3ppp m I R2 LZR Lm l3 ATP ADP Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table RCAC IOH sci Reaction names bindLR L bindum D SZpTZ TZpV v bindL3R3 L3 R3 L3R3m L3R3p53 53st 4 WLPp VVVZVVLPpF vw3ppp R 111 R2 LZRL m ATP ADP Input and faction only receptors arr2 no specific to the particular ligand Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table RCACHOH wt RcAClmn names hm LR 2 bmdLZRl z 3 SlpTZ v2 4 bmdLERJ L3 R3 L3R3m s 3123 3pT3 s bmdLAR 7 bindLS s x i39onnwzw3 VZJNKJU 9 lomuZWWZWL VVVZVVBVppp m l39omiZWW3 2 VV3ppp m I R2 LZR Lm l3 ATP ADP Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table Nhnday Apil lt4 2am The tom Rmcuon Sci Reaction names 1772 TZpVVZ bmdL3R3 L3 R3 L3R3m L3R3p53 53m dL 2 LS lbnnWZWl WLWLpp rummvwnw wwzxwppp m lbnn VWl 2ww3ppp II R m I RZLZRLm ATP ADP I ms of the ll com Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table RCACHOH wt RcAClmn names hm LR 2 bmdLZRl z 3 SlpTZ v2 4 bmdLERJ L3 R3 L3R3m s 3123 3pT3 s bmdLAR 7 bindLS s x i39onnwzw3 VZJNKJU 9 lomuZWWZWL VVVZVVBVppp m l39omiZWW3 2 VV3ppp m I R2 LZR Lm l3 ATP ADP Nhnday Apil lt4 2mg Correlated Reaction Sets 0 The correlated reactions for the prototypic signaling network were computed and are summarized in the following table RCAC IOH sci Reacuon names sipT szwz L3 R L L3R3m 11 stTJ L d bmdLSRZ LS amdm tws 353 7 35 w g L lbnnW 2W3 WLWLpp I omizwwzwa VVVZVVLPpF 10 lmeZVVW3 2ww3ppp l R m I R2 LZRL m 1 ATP ADP ionobvious correla rewri on Nhnday Apil mi 2mg The End for Spring 08 Hope to SeeYou in COMP 57 Fall 08 REVIEWS PROTEIN CHIPS Similar to DNA microarrays this evolving technology detection ofactivity binding to lipids and so on yDepartment afPhysiEs University afNatre Dame Natre Dame Indiana 46556 USA Depurtment afPu thalagy Narthwestem University Chimga Ulinais 6061 1 eemuib ulbndedu zn0008narthwestem edu doi101038m g1272 NETWORK BIOLOGY UNDERSTANDING THE CELL S FUNCTIONAL ORGANIZATION AlbertrLciszl Bamb s Zoltain N Oltmz39f Reductiom39sm which has dominated biological research for over a century has provided a wealth o knowled about individual cellular components and their nce tions Despite its enormous success it is increasingly clear that a discrete biological function can only rarely be attributed to an individual molecule Instead most biological characteristics arise from complex interace tions between the cell s numerous constituents such as proteins DNA RNA and small molecules Therefore a key challenge for biology in the twentye rst century is to understand the structure and the dynamics of the com plex intercellular web of interactions that contribute to the structure and mction of a living cell The development of highethroughput dataecollection techniques as epitomized by the widespread use of microarrays allows for the simultaneous interrogation of the status of a cell s components at any given time In turn newtechnology platforms such as PROTEINCHIPS or semieautomate YEASTTonYERiD SCREENS help to deter mine how and when these molecules interact with each other Various types of interaction webs or networks including proteiniprotein interaction metabolic sige nalling and transcriptioneregulatory networks emerge from the sum ofthese interactions None ofthese net works are independent instead they form a network of networks that is responsible for the behaviour of the cell A major challenge of contemporarybiology is to embark on an integrated theoretical and experimental A key aim of postgenomic biomedical research is to systematically catalogue all molecules and their interactions within a living cell There is a clear need to understand how these molecules and the interactions between them determine the function of this enormously complex machinery both in isolation and when surrounded by other cells Rapid advances in network biology indicate that oellular networks are governed by universal laws and offer a new conceptual framework that could potentially revolutionize our view of biology and disease pathologies in the twentyefirst century programme to map out understand and model in quane i i the various networks that control the behaviour of the cell Help along the way is provided by the rapidly develop ing theory of complex networks that in the past few years has made advances towards uncovering the orgae nizingprinciples that govern the formation and evolution of various complex technological and social networks This research is already making an impact on cell biology It has led to the realization that the architectural features of molecular interaction networks within a cell are shared to a large degree by other complex systems such as the Internet computer chips and society This unexpected universality indicates that similar laws may govern most complex networks in nature which allows the expertise from large and wellemapped nonebiological systems to be used to characterize the intricate interwoven relationships that govern cellular mctions In this review we show that the quantifiable tools of network theory offer unforeseen possibilities to under stand the cell s internal organization and evolution fundamentally altering our view of cell biology The emerging results are forcing the realization that not withstanding the importance of individual molecules cellular function is a contextual attribute of strict and quantifiable patterns of interactions between the myriad of cellular constituents Although uncovering the generic organizing principles of cellular networks NATURE REVIEWS l G E N ETI cs VOLUME 5 l FEBRUARY 2004 l 101 REVIEWS ST TWOVHYBRID SCREEN A genetic approach for the identi cation ofpotential transaiptionalractivation the proteins ieconstitutes N r y I A the DNArbinding domain Box 1 i Network measures Networkbiology offers a quantifiable description of the networks a Undirected network that characterize various biological system s Here we define the m n t L 39 aiiuw r A characterize different complex networks Degree Themnt quot 4quot4 ii i I 1 I eL nnn fix ih Ir whiLh quot nodes For example in the undirected network shown in part a of the figure node Ahas degree k 5 In networks in which each link has a selected direction see figure part b there is an incoming degreekm L39 L J L ofquot I tnznnde J 39 J km L39 L J L uflinksthat start from it For example node Ain part b of the figure has km 4 an m 1Ah undirected networkwith Nnodes andL links is characterizedby an average degree ltkgt 2LN where ltgt denotes the average b Directed network D egree distribution The degree distribution P k gives the probability that a selected node has exactly klinks Pk is obtained by counting the number o fhodes Nk with k 12 links anddividingbythe total number of nodes N The degree distribution allows us to distinguish between different classes of networks For example apeaked degree distribution as seen in a random network BOX 2 indicates that the system has a characteristic degree andthat there are no highly connectednodes which are also known as hubs By contrast a k JD J39 39 39 indicatestlma 39 39 together numerous small nodes BOX 2 C Scalenfreenetworks and the degree exponent Most biological networks are scalenfreewhich means that their degree distribution approximates apower law P k k V where y is the degree exponent and indicates proportional to The value of y determines many properties of the system The smaller the value of y the more important the role of the hubs is in the network Whereas for ygt3 the hubs are not relevant for 2gt ygt3 there is ahierarchy of hubs with the most connected hub being in contact with a small fraction of allnodes and for y 2 h b d I 39 hub being in contact with alarge fraction of allnodes In general the unusualproperties of l f quotJ f i 3when L J39 r 39 of the Pk distribution which is defined as 02 lt k2gt 7 lt kgt1 increases with the number of nodes that is o diverges resultingin a series of unexpected f ahw L 39 L J L 39 accidental node failures For i I I p i i i m 4 K r mn tnnn Hal L r one Shortest path andmean path length to travel T i imicein Jwitiithepathl north quot vv ii 3 r 1 between two node L 39 39 two nod th hnrt t I h ithe path with the L links between the selected J h p 39 role I J39 J th di Mn AB from node Ato node B is often different from the distance BA from B toA For examplein part b of the figure EA 1 whereas AB 3 Often there is no direct path between two nodesAs shown in partb of the figure although there is a path from C to Athere is no path from A to C The mean path length lt f gt represents the average over the shortest L pairs of J m f iiavi ability Clustering coef cient In many networks if node A is connectedto B and B is connected to C then it is highly prob able thatA also has a direct linkto C T 39 L quot39 J 39 39 f 39 q 2nIkk71where nIisthe number of links connecting the kI neighbours of node Ito each other In other words q gives the number of triangles see BOX3 that go through node I whereas 712 is the total number of triangles that could pass through node I should all of node I s neighbours be connected to each other For example only one pair of node A s five neighbours in part a of the figure are linkedtogether B and Cwhich gives nA 1 and CA 220 By contrastnone of node F s neighbours linkto each other giving CF 0 TL 1 39 f 39 lt C 39 39 39 tendency of nodes to form clusters or groupsAn importantmeasure of 139 L39 L 39 J f39 J L 39 L 39 nf allnodes with klinks For many real networks k kquotwhich is anindication of anetwork s hierarchical character 7v53 see BOX 2 D J g kgt m ran nathlength lt J 1 39 m 39 Cgt depend on the number of J J quot L J L in the network By contrastthe P k and k functions are independent of the network s size 39 L39 L L to be used to classify various networks L I i and they 39 I 102 i FEBRUARY 2004 VOLUME 5 wwwnaturecomreviewsgenetics is fundamental to our understanding of the cell as a sys7 tem it also needs to develop relevance for the experimen7 tal biologist helping to elucidate the role of individual molecules in various cellular processes Therefore we I L I 39 quot 39 detajlsandL quot origins that contribute to the formation of cellular net works and the impact of the netwnrk m r imentally observable mction and behavioural features Our goal is to help understand the largeescale characteris tics of cellular networks complementing recent excdlent reviews on the flmction of small genetic circuits for example see REFS 26 We also look to the future and the uncharted territories for which I I L might bear further fruits Basic network nomenclature The behaviour of most complex systems from the cell to the Internet emerges from the orchestrated activity of REVIEWS many components that interact with each other through pairwise interactions At a highly abstract level the com ponents can be reduced to a series of nodes that are con nected to each other by links with each link representing the interactions between two components The nodes and links together form a network or in more formal 39 a e a graph ROXI Establishing the identity of various cellular networks is not trivial Physical interactions between molecules such as proteiniprotein proteininucleiceacid and proteinimetabolite interactions can easily be conceptue alized using the node link nomenclature Nevertheless I functional 39 t 39 ca be con sidered within this representation For example small molecule substrates can be envisioned as the nodes of a metabolic network and the links as the enzymeecatale ysed reactions that transform one metabolite into another FIG 1H a c P p w Orthophosphale d e 00 f 10 l Emmerlmg er a E Q 10 Aicr o Q g 3 E o g E 2 E 104 a o 104 I mm Hmm mm mw mw Hmm 106 mm mm mm Hmm 100 10 102 103 100 10 102 103 100 10 102 103 k k Experimental ux v of GLC uptake rate gure 139 i i i i i i i A N h h h N r V l lg rdependantenzymesws illustrated a lmhemost abstract reactions thatmterconve r v 3 i A i i i i i i products din Plltot r 39 4 b3 5 elThescaling ottheclustering coe icientC m quot I an average over43 organisms 553 fl law thh high uxes carry most otthe rrietabolicactiwtygl This plot l8 based on datathat mm my 3105 lt on all I line w 3 3 TR UMP uridme monophosphate UTR uridinetriphosphate I GLO c UDin quot If NATURE REVIEWS l G E N ETI cs VOLUME 5 lFEBRUARY 2004 l 103 REVIEWS a 6 a Mat 7 n 41ml w M quot4 1 quotu u lquot a FlgureZl J mapotp 39 m a l The largest In u V yeast 39 Illustrates et 15113 e Macm t t Illan Ma 3 1h gazmes Ltd er 3 1 cluster whlch contalns 78 of all proteins l8 shown The colour of a node Indlcates r Hm IMI M lml from REF 18 l rottm III F Depending on the nature ofthe interactions net works can be directed or undirected In directed networks the interaction between any two nodes has a welledefined direction which represents for example the direction of material ow from a substrate to a product in a metabolic reaction or the direction of information ow from a transcription factor to the gene that it regulates In undirected networks the links do not have an assigned direction For example in protein interaction networks FIG 2 alink represents a mutu binding relationship ifprotein A binds to protein B then protein B also binds to proteinA Architectural features of cellular networks From random to scale free networks Probably the most important discovery of network theory was the realizae tion that despite the remarkable diversity of networks in nature their architecture is governed by a few simple principles that are common to most networks of major scientific and technological interest For decades graph theory 7 the field ofmathematics that deals with the mathematical foundations of networks 7 modelled complex networks either as regular objects such as a square or a diamond lattice or as completely random network This approach was rooted in the in uential work of two mathematicians Paul Erdos and Alfred R nyi who in 1960 initiated the study of the mathematical properties of random networks Their much 39 random network L a xed number of nodes are connected randomly to each other Box 2 The most remarkable property of the model is its democratic or uniform character characterizing the degree or connectivity k 30x1 of the individual nodes Because in the model the links are placed randomly among the nodes it is expected that some nodes collect only a few links whereas others collect many more In a random network the nodes degrees follow a Poisson distribution which indicates that most nodes have roughly the same number of links approximately equal to the networks average degree ltkgt where ltgt denotes the average nodes that have significantly more or less links than ltkgt are absent or very rare Box 2 Despite its elegance a series of recent ndings indi cate that the random network model cannot explain the topological properties of real networ s The deviations from the random model have several key signatures the most striking being the finding that in contrast to the Poisson degree distribution for many social and technological networks the number of nodes with a given degree follows a power law That is the probability that a chosen node has exactly k links follows Pk k Ywhere y is the degree exponent with its value for most networks being between 2 and 3 REF 15 Networks that are characterized by a powerelaw degree distribution are highly nonfuniform most of the nodes have only a few links A few nodes with a very large number of links which are often called hubs hold these nodes together Networks with a power degree distribution are called scaleefree S a name that is rooted in statistical physics literature It indicates the absence 0 a typical node in the network one that could be used to characterize the rest ofthe nodes This is in strong contrast to random networks for which the degree of all nodes is in the vicinity of the average degreewhich could be considered typical However scaleefree networks could easily be called scaleerich as well as their main feature is the coexistence of nodes of widely different degrees scales from nodes with one or two links to major hubs Cellular networks are scule free An important develop 7 ment in our understanding of the cellular network architecture was the nding that most networks within the cell approximate a scaleefree topology The rst evi dence came from the analysis of metabolism in which the nodes are metabolites and the links represent enzymeecatalysed biochemical reactions FIG 1 As many of the reactions are irreversible metabolic networks are directed So for each metabolite an in and an out degree Box 1 can be assigned that denotes the number of reactions that produce or consume it respectively The analysis of the metabolic networks of 43 different organisms from all three domains of life eukaryotes bacteria and archaea indicates that the cellular metaboe lism has a scaleefree topology in which most metabolic substrates participate in only one or two reactions but a few such as pyruvate or coenzyme A participate in dozens and mction as metabolic hubs W 104 l FEBRUARY 2004 lVOLUME 5 wwwnaturecomrevlawsgenetics REVIEWS Box 2 Network models Network quot 39 for haping uui quot complex help to explain the origin of observed network i 1 39 ks There are had 2 direct impact an n biologicaI networ Random networks The ErdosiR nyi ER model of a random network see figure part A starts with N nodes and connects each pair of nodes with probability p which creates a graph with approxim ately pNN712 randomly placed links see figure part Aa The node degrees follow a Poisson distribution see figurepart Ab which indicates that most nodes have approximately the same number of links close to the average degree lt kgt The tail high kregionoft d g d39 t quot t39 Pk d I 39 which39 quot 39 nodes L 39 39f39 dcviatc from L g extrem ely rare The clustering coefficient is independent of anode s degree so Ck appears as ahorizontal line if plotted as a function of k see figure part Ac The mean path length is proportional to the logarithm of the network size l log N which indicates that it is characterized by the smalleworld property Scaleifree networks Scaleefree networks see figure part B are characterized by apowerelaw degree distribution the probability that anode has klinks follows Pk k V where y is the degree exponent The probability that a node is highly connected is statistically more significant than in a random graph the network s properties often being determined by a relatively small number of highly connectednodes that are known as hubs see figure part Ba blue nodes In the BarabasiiAlbert model of a scalerfree network at each time point anode with M links is added to the network which connects to an already existing node I with probability III kIEJkJ where k1 is the degree ofnode I FIG 3 and I is the index denoting the sum over networknodes The networkthat is generatedby this growth r l d g d39 t quot t39 that 39 39 by the degree exponent y 3 Such distributions are seen as a straight line on a logilog plot see figure part Bb The network that is created by the Baraba siiAlbert model does not have an inherent modularity so C k is independent of k see figure part Bc Scaleefree networks with degree exponents 2ltylt 3 a range that is observed in most biological and nonebiological networks are ultraesma1134 35with the average path length following f log log N which is significantly shorter than log Nthat characterizes random sm allrworld networks Hierarchical networks To account for the coexistence ofmodularity local clustering and scalerfree topology in many real systems it has to be assum ed that clusters combine in an iterative manner generating ahierarchical networkm see figure part C The starting point of this construction is a small cluster of four densely linked nodes see the four central nodes in figure part Ca Next three replicas of this module are generated and the three external nodes of the replicated clusters connected to the central node of the old cluster which produces a large 167node module Three replicas of this 167node module are then generated andthe 16 peripheral nodes connected to the central node of the old m odule which produces a new module of 64 nodes The hierarchical networkm odel seamlessly integrates a scaleefree topology with an inherent modular structure by generating a networkthat has a powerelaw A Random network B Scalefree network 0 Hierarchical network Aa Ba Ca degree distribution with degree Ab Eb Cb exponenty1n4lfn3226 307 see figurepart Cb and alarge 1 lc 7 systemrsize independent average 0 17 lg clustering coefficient lt Cgt 0 Q Q g lw 7 The most important signature of E E O Ol 39 Q 1057 hierarchical m odularity is the 0 ml 1ere scaling of the clustering 0 00017 1077 coefficient which follows 1 Ck kquot astraight line of slope k 1 10 100 1000 10 100 1000 10000 71 on alogilog plot see figure k k part Cc Ahierarchical Ac Bo cc architecture implies that sparsely connectednodes are part of highly clustered areaswith g comm unication between the g g 3 different highly clustered 9 neighbourhoods being m aintained by a few hubs see figurepart Ca k k mg k NATURE REVIEWS GENETICS VOLUME 5 FEBRUARY 2004 l 1 05 REVIEWS As for direct physical interactions several recent publications indicate that proteiniprotein interac7 tions in diverse eukaryotic species also have the fea7 tures of a scaleifree network B This is apparent in FIG 2 which shows the protein interaction map of the eastquot 1 quot A quot quot tumti twoihybrid screens 2 Whereas most proteins partici7 pate in only a few interactions a few participate in dozens i atypical feature of scaleifree networ Further examples of scaleifree organization include genetic regulatory networks in which the nodes are individual genes and the links are derived from the expression correlations that are based on microarray data15gt25 or protein domain networks that are con structed onthe basis of t 39 4 interaction 27 However not all networks within the cell are scaleifree For example the transcription regulatory networks of S cerevisiae and Escherichia coli offer an interesting example of mixed scaleifree and exponential character istics Indeed the distribution that captures how many different genes a transcription factor interacts with follows a power law which is a signature of a scaleifree network This indicates that most transcription factors regulate only a few genes but a few general transcription factors interact with many genes Howevert e income ing degree distribution which tells us how many differ ent transcription factors interact with a given gene is best approximated by an exponential which indicates that most genes are regulated by one to three transcrip7 tion factors 29gt So the key message is the recognition that cellular networks have a disproportionate number of highly connected nodes Although the mathematical de nition of a scaleifree network requires us to establish that the degree distribution follows a power law which is dif cult in networks with too few nodes the presence of hubs seems to be a general feature of all cellular net works from regulatorywebs to the p53 module These hubs mdamentally determine the network s behaviour see below Smallworld effectund ussortutivi 134 A common feature ofall complex networks is that any two nodes can be connected with a path of a few links only This small7 world effect which was originally observed in a social study has been subsequently shown in several systems from neural networks to the World Wide Web Although the smalliworld effect is a property of random networks scaleifree networks are ultra smallms 7 their path length is much shorter than predicted by the smalliworld effect BOX 2 Within the cell this ultra smalliworld effect was first documented for metaboi lism where paths of only three to four reactions can link most pairs of metabolites 57 This short path length indicate Llldl local I L 39 in quot trations could reach the whole network very quickly Interestingly the evolutionarily reduced metabolic net work of a parasitic bacterium has the same mean path length as the highly developed network of alarge multii cellular organisml 5 which indicates that there are evolui tionary mechanisms that have maintained the average ath length during evolution FIGUREZ illustrates the disassortative nature of cellui lar networks It indicates for example that in protein interaction networks highly connected nodes hubs avoid linking directly to each other and instead connect to proteins with only a few interactions In contrast to L nature of social networks in which well Web Internet networks A t ough the small and ultraismalliworld property of complex networks is mathematically well understood34gt35 the origin of disasi sortativity in cellular networks remains unexplained Ewlmiwm 39 39 f 39 The ubi 7 uity of scaleifree networks and hubs in technologic biological and social systems requires an explanation It has emerged that two fundamental processes have a key role in the development of real networks First most networks are the result of a growth process dur7 ing which new nodes join the system over an extended time period This is the case for the World Wide Web which has grown from 1 to more than 37billion web pages over a 107year period Second nodes prefer to connect to nodes that already have many links a process that is known as preferential attachment For example on the World Wide Web we are more familiar with the highly connected web pages and therefore are more likely to link to them Growth and preferential attachment are jointly responsible for the emergence of the scaleifree property in complex networks FIG3a Indeed if a node has manylinks new nodes will tend to connect to it with a higher probability This node will therefore gain new links at ahigher rate than its less connected peers and will turn into a hub rowth and preferential attachment have a com mon origin in protein networks that is probably rooted in gene duplication39 Duplicated genes pro duce identical proteins that interact with the same protein partners FIG 3 Therefore each protein that is in contact with a duplicated protein gains an extra link Highly connected proteins have a natural advan7 tage it is not that they are more or less likely to be duplicated but they are more likely to have a link to a duplicated protein than their weakly connected cous 7 ins and therefore they are more likely to gain new links if a randomly selected protein is duplicated This bias represents a subtle version of preferential attachment The most important feature of this explanation is that it traces the origin of the scaleifree topologyback to a welliknown biological mechanism 7 ene duplii cation Although the role of gene duplication has been shown only for protein interaction networks it proba7 bly explains with appropriate adjustments the emere gence of the scaleifree features in the regulatory and metabolic networks as well It should be noted that although the models showbeyond doubt that gene duplication can lead to a scaleifree topology there is no direct proofthat this mechanism is the only one or the one that generates the observed power laws in cellular networks However as gene duplication is a 106 I FEBRUARY 2004 IVOLUME 5 wwwnaturecomreviewsgenetics REVIEWS major engineer of the genomic landscape it is likely to be a key mechanism for generating the scaleefree topology Two further results offer direct evidence that net work growth is responsible for the observed topological features The scaleefree model BOX 2 predicts that the the metabolic hubs indicates that the remnants of the RNA world such as coenzyme A NAD and GTP are among the most connected substrates of the metabolic network as are elements of some of the most ancient metabolic pathways such as glycolysis and the tricare boxylic acid cycle In the context of the protein interace tion networks crossegenome comparisons have found 1 j r 39 h a Y 0 e I more links to other proteins than their younger coune terparts S This offers direct empirical evidence for preferential attachment Motifs modules and hierarchical networks Cellular functions are likely to be carried out in a highly modular manner In general modularity refers to a group of physically or functionallylinked molecules nodes that work together to achieve a relatively dis tinct function isiw Modules are seen in many systems for example circles of friends in social networks or web sites that are devoted to similar topics on the World Wide Web Similarly in many complex engineered sys7 tems from a modern aircraft to a computer chip a highly modular structure is a fundamental design attribute Biology is full of examples of modularity Relatively invariant proteiniprotein and protein7 A complexes physical modules are at the core of many basic biologe ical functions from nucleiceacid synthesis to protein degradation Similarly temporally coregulated groups of molecules are known to govern various stages of the cell cycle or to convey extracdlular signals inbactere ial chemotaxis or the yeast pheromone r r way In fact most molecules in a cell are either part of an intracellular complex with modular activity such as the ribosome or they participate in an extended func7 tional module as atemporally regulated element of a relatively distinct process for example signal ampli cae tion in a sig 39ngpathwayquot To address the modularity ofnetworks tools and measures need to be developed that will allow us not only to establish if a network is modular but also to explicitly identify the modules and their relationships in a given network High clustering in cellular networks In a network repre sentation a module or cluster appears as a highly interconnected group of nodes Each module can be reduced to a set of triangles BOX 1 a high density of trie angles is re ected by the clustering coef cient C REF 33 the signature of a network s potential modularity BOX 1In h mndiilaritv L 39 cient of the real and the randomized network are com parable The average clustering coefficient ltCgt of Proteins Before duplication After duplication Proteins Figure 3 l The origin of the scalefree topology and hubs in biological networks The origih of the scalerfree topology ih complex hetWorllts cah be reduced to two basic mechanisms growth ahd preferential attachment Growth meahs that the hetWorllt emerges through the subsequeht additioh of heW hodes such as the heW red hode that is added to the hetWorllt that is shoWh ih part a Preferehtial attachmeht meahs that heW hodes prefer to lihllt to more cohhected hodes For example the probabilitythat the red hode Will cohhect to hode t is tWice as large as cohhectihg to hode 2 as the degree of hode 1 034 is tWice the degree of hode 2 K52 atta m ht g t mechahism h the more likely it is that heW hodes Will lihllt to it which allows the highly cohhected hodes to acquire heW lihllts faster thah their less cohhected peers lh proteih ihteractioh hetWorks scalerfree topology seems to have its origih ih ehe duplicatioh Part b shows a small proteih ihteractioh hetWorllt blue ahd the gehes that ehcode the proteihs greeh Wheh cells diwde occasiohally ohe or several gehes are copied tWice n i A A circles This ihduces growth ih the proteih ihteractioh hetWorllt ecause how we have ah extra gehe that ehcodes a heW i Thii Miiiiia the old ohe so they both ihteract With the same proteihs Ultimately the protelhs that ihteracted With the origihal duplicated proteih Will each gaih a heW ihteractioh to the heW proteih Therefore proteihs With a large humber of ihteractiohs tehd to gaih lihllts more ofteh as it is more lilltelythat they ihteract With the proteih that has beeh duplicated This is a mechahism that geherates preferehtial attachmeht ih cellular hetWorllts lhdeed ih the examplethat is shoWh ih part b it does hot matter which gehe is duplicated the most cohhected lh cohtrast the square Which has ohly ohe lihllt gaihs a heW lihllt ohly if the hub is duplicated NATURE REVIEWS l G E N ETI cs VOLUME 5 FEBRUARY 2004 l 107 REVIEWS most real networks is significantly larger than that of a random network of equivalent size and degree distribe ution The metabolic network offers striking evi7 dence for this lt Cgt is independent of the network size in contrast to a moduleefree scaleefree network for which lt Cgt decreases The cellular networks that have been studied so far including protein interaction and protein domain networks have a ig lt C gt which indicates that high clustering is a generic feature of biological networks Motifs are elementary units ofcellular networks The high clustering indicates that networks are locally sprinkled with various subgraphs ofhighly inter linked groups of nodes which is a condition for the emergence of isolated flmctional modules Subgraphs capture specific patterns of interconnections that characterize a given network at the local level B0x3 However not all subgraphs are equally significant in real networks as indicated by a series of recent obsere vationsm To understand this consider the highly regular square lattice an inspection of its subgraphs would find very many squares and no triangles Box 3 It could correctly be concluded that the prevalence ofsquares and the absence oftriangles tell us some thing fundamental about the architecture of a square lattice In a complex network with an apparently random wiring diagram it is difficult to find such obvious signatures of order all subgraphs from triane gles to squares or pentagons are probablypresent However some subgraphs which are known as motifs are overrepresented when compared to a randomized version of the same network3 gt5 For examp e triangle motifs which are referred to as feed forward loops B0x3 in directed networks emerge in both transcriptioneregulatory and neural networks whereas fourenode feedback loops represent charac7 teristic motifs in electric circuits but not in biological systems Each real network is characterized by its own set of distinct motifs the identification of which provides information about the typical local intercone nection patterns in the network The high degree of evolutionary conservation of motif constituents within the yeast protein interaction network and the convergent evolution that is seen in the transcription regulatory network of diverse species towards the same motiftypes55gt57 further indicate that motifs are indeed of direct biological relevance As the molecular components of a specific motif often interact with nodes that are outside the motif how clusters For example in the E coli transcription regulatory network most motifs overlap generating distinct homologous motif clusters Box 3 in which the specific motifs are no longer clearly separableAs motifs are present in all of the real networks that have been examined so far it is likely that the aggregation ofmotifs into motif clusters is a general property of most real networks Hierarchy organization of topological modules As the number of distinct subgraphs grows exponentially with the number of nodes that are inthe subgraph the study of larger motifs is combinatorially unfeasibleAn alter native approach involves identifying groups of highly interconnected nodes or modules directly from the graphs topology and correlating these topological entie ties with their potential flmctional role Module identi 7 cation is complicated by the fact that at face value the scaleefree property and modularity seem to be contra dictory Modules by definition imply that there are groups of nodes that are relatively isolated from the rest of the system However in a scale free network hubs are in contact with a high fraction of nodes which makes the existence of relatively isolated modules unlikely Clustering and hu s naturally coexist however which indicates that topological modules are not independent but combine to form a hierarchical networllt 7gt53 An example of such a hierarchical network is shown in Box 2 this network is simultaneously scaleefree and has a high clustering coefficient that is independent of system size The network is made of man small h39 integrated 47node modules that are assembled into lar er 167node modules each ofwhich combines in a t e dependence of the clustering coefficient on the degree of a node which follows Ck k l REF 58 This indicates that nodes with only a few links have ahigh C and belong to highly interconnected small modules By contrast the highly connected hubs have a low C with their role being to link different and otherwise not communicating modules It should be noted that the random and scaleefree models that are shown in Box 2 do not have a hierarchical topology because Ck is independent of k in their case This is not surprising as their construction does not contain elements that would favour the emergence of modules Identifying topological and functional modules Signatures of hierarchical modularity are present in all cellular networks that have been investigated so far r ing from metabolic to proteinip rotein interaction20 22 and regulatory networks But can the modules that are present in a cellular network be determined in an auto mated and objective fashion This would require a unique breakdown of the cellular network into a set of biologically relevant functional modules The good news is that if there are clearly separated modules in the sys7 tem most clustering methods can identify them Indeed several methods have recentlybeen introduced to iden7 ti modules in various networks using either the net 5 or combining the topology with integrated flmctional genomics datass m It must be kept in mind however that different methods predict different boundaries between modules that are not sharply separated This ambiguity is not only a lime itation of the present clustering methods but it is a consequence of the network s hierarchical modularity The hierarchical modularity indicates that modules do not have a characteristic size the network is as likely 103 l FEBRUARY 2004 lVOLUME 5 wwwnaturecomreviewsgenetics REVIEWS to be partitioned into a set of clusters of 10720 compo modules Does this mean that it is inherently impossible nents metabolites genes as into fewer but larger to identifythe modules in a biological network From a modules At present there are no objective mathematie mathematical perspective it does indeed indicate that cal criteria for deciding that one partition is better than looking for a set of unique modules is an illedefined another Indeed in most of the present clustering algoe problem An easy solution however is to avoid seeking a rithms some internal parameter controls the typical breakdown into an absolute set of modules but rather to size of the uncovered modules and changing the visualize the hierarchical relationship between modules parameter results in a different set of larger or smaller of different sizessis The identi cation of the groups of Box 8 i motifs and motif clusters Whereas the scalerfree andhierarchical features of complex L 39 L 39 39 mimiu ue 11111 determine 1 1 1 1 starts from the bottom andlooks for highly representative P 1 L i I 1 Subgraphs A connected subgraph represents a subset of nodes that are 1 l A 1 1 1 1 1 examplein part a of the figure four nodes that form alittle square yellow represent a subgraph of a square lattice Networks with amore intricate wiring diagram can have various different subgraphs For examplein part A of the 5 3 gure in BOX 1nodes A B and C form atriangle subgraph 4 whereasA B F and G form a square subgraph Examples of re nre ent in undirected 1 1 1 1 1 L is shown in art c It shouldbe notedthat the number of ii tinct 39 39 quot 39 39 39 39 number of nodes Motifs Not all subgraphs occur with equal frequency lndeedthe square lattice see figure part a contains many squares but no triangles In a complex networkwith an apparentl d random wiring diagram all subgraphs ifrom triangles to squares andpentagons and so on i are present However some subgraphs which are known as motifs are over represented as comparedto a randomizedversion of the same networkw s For examplethe directed triangle motif that is known as the feedrforward loop see figure top of part c I em erges in both transcriptioneregulatory and neural networks whereas fourrnode feedbackloops see figure middle of part c represent characteristic m otifs in electric circuits but not in biological system 530 To identify the motifs l H 139 that characterize a given network all subgraphs of nnodes in 1 1 1 1 N w 1 1 1 1 111 1 L Inn1 11 AL I 1 1 if2 33if eii lfiiih ii fd 535531 W1 I I A F3 X I 1 1 more frequentlyin the real network as compared to 6 O 9 randomized one are designated to be the motifs if Motif clusters r l T1 1 1 1 I o the Escherichia culi transcriptioneregulatorynetwork are a l shown simultaneously As the gure shows 208 of the 209 390 biefan motifs form two extendedmotif clusters R Dobrin 2t almanuscript in preparation and only one motif remains isolated bottom left corner Such clustering of motifs into motif clusters seems to be a general property of all real networks In part d of the gure the motifs that share links uL 1 u L in L1 1 1 1 D D The different colours and shapes of the nodes illustrate their I l functional classi cation NATURE REVIEWS l GENETICS VOLUME 5 FEBRUARY 2004 i 1 09 REVIEWS molecules of various sizes that together carry out a spe7 ci c cellular function is a key issue in network biology and one that is likely to witness much progress in the near future Network robustness A key feature of many complex systems is their robust ness which refers to the system s ability to respond to changes in the external conditions or internal organiza7 tion while maintaining relatively normal behaviour To understand the cell s functional organization insights into the interplay between the network structure and robustness as well as their joint evolutionary origins are needed Topological robustness Intuition tdls us that disabling a substantial number of nodes will result in an inevitable flmctional disintegration of a network This is certainly true for a random network if a critical fraction of nodes is removed aphase transition is observed breaking the network into tiny nonecommunicating islands of nodes Complex systems from the cell to the Internet can be amazingly resilient against component failure withstanding even the incap acitation of many of their individual components and many c anges in extern conditions We have recently learnt that topology has an important role in generating this topological robust ness7 Scaleefree networks do not have a critical thresh7 old for disintegration 7 they are amazingly robust against accidental failures even if80 of randomly selected nodes fail the remaining 20 still form a come pact cluster with a path connecting any two nodes This is because random failure affects mainly the numerous small degree nodes the absence of which doesnt dis the other hand induces a soecalled attack vulnerability i the removal of a few key hubs splinters the system into small isolated node clusters This doubleeedged feature of scaleefree networks indicates that there is a strong relationship between the hub status of a molecule for example its number of links and its role in maintaining the viability and or growth of a cell Deletion analyses indicate that in S cerevisiae only 10 ofthe proteins with less than 5 links are essential but this fraction increases to over 60 for proteins with more than 15 interactions which indicates that the protein s degree of connected ness has an important role in determining its deletion phenotype Furthermore only 187 of S ceree visiae genes 144 in E coli are lethal when deleted individuallyn and the simultaneous deletion of man E coli genes is without substantial phenotypic effect75gt75 These results are in line with the expectation that manylightly connected nodes in a scaleefree net work do not have a major effect on the network s integrity The importance of hubs is further corroboe rated by their evolutionary conservation highly inter acting S cerevisiae proteins have a smaller evolutionary distance to their orthologues in F l 139 39 elegans77 and are more likely to have orthologues in 78 higher organisms Functional and dynamical robustness A complete understanding of network robustness requires that the functional and dynamic changes that are caused bypere turbations are explored In a cellular network each node has a slightly different biological mction and therefore the effect of a perturbation cannot depend on the node s degree only This is well illustrated by the nding that compose o 39 ormly essential or noneessential mole miles This indicates that the flmctional role disp ens ability of the whole complex determines the deletion phenotype of the individual proteins The functional and dynamical robustness of cellular networks is supported by recent results that indicate that several relatively welledelineated extended modules normal flmction despite significant changes in a specie fied set of internal or external parameters which leaves its tumbling frequency relatively unchanged even under f 39 d d 39 quot in the r ligand concentrationsm The development of the cor rect segment polaritypattern in Drosophila melanogaster embryos is also robust to marked changes in the initial conditions reaction parameters or to the absence of certain gene pro uctsmj However similar to topologie 39 functional L also selective whereas some important parameters remain unchanged under perturbations others vary widely For example the adaptation time or steadyestate behaviour in chemotaxis show strong variations in response to changes in protein concentrations Altho our understanding of network robustness is far from complete a few important themes have emerged First it is increasingly accepted that adaptae tion and robustness are inherent network properties and not a result ofthe fine tuning of a component s characteristicsm Second robustness is inevitably accompanied by vulnerabilities although many cellular networks are well adapted to compensate for the most common perturbations they collapse when well selected network components are disrupted Third the ability of a module to evolve also has a key role in devele oping or limiting robustness Indeed evolutionarily frozen modules that are responsible for key cellular flmctions such as nucleiceacid synthesis might be less able to withstand uncommon errors such as the inactie vation of two molecules within the same functional module For example orotate phosphoribosyltranse ferase pyrE echallenged E coli cdls cannot tolerate fur n presumably considerably quite intertwined with the communication between modules probably limit ing the effects of local perturbations in cellular networks Beyond topology character De pite then A p A A have important intrinsic limitations For example the activity of the various metabolic reactions or regulatory ing the links L 1 10 l FEBRUARY 2004 l VOLUME 5 wwwnaturecomreviewsgenetics REVIEWS micro A A class ofsmall nonrcoding RNAs that aie important for development an homeostasis with possible roles in several umah disease pathologies interactions differs widely some are highly active under most growth conditions others switch on only under rare environmental circumstances Therefore an ultimate description of cellular networks requires that both the intensity that is strength and the temporal aspects of the interactions are considered35735Although so far we know little about the temporal aspects of the various cellular interactions recent results have shed light on how the strength of the interactions is orga nized in metabolic and geneticeregulatory networks In metabolic networks the ux of a given meta bolic reaction which represents the amount of sub strate that is being converted to aproduct within a unit of time offers the best measure of interaction stre Metabolic uxibalance approaches FBAW which allow the ux for each reaction to be calculated have recently significantly improved our ability to make quantifiab e predictions on the relative importance of various reactions giving rise to experimentally testable hypothesesmo A striking feature of the ux distribu tion of E coli is its overall reaction with Future directions Despite the significant advances in the past few years molecular network biology is only in its infancy Future progress is expected in many directions ranging from the development of new theoretical methods to characterize the network topology to insights into the dynamics of motif clusters and biological function Most importantly to move significantly beyond our present level of knowledge we need to enhance our data ii Li isw 1 st 4 iI of highly sensitive tools for identifying and quantifying the concentrations uxes and interactions of various types of molecules at high resolution both in space and time101 In the absence of such comprehensive data sets whole arrays of functionally important cellular net works remain completely unexplored ranging from sige nalling networks to the role of microRNAS in network topology and dynamics Similarly most work at present focuses on the totality of interactions or snapshots of activity in a few selected ux that spans several orders ofmagnitude coexist under the same conditions This is captured by the ux distribution for E coli which follows a power law This indicates that most reactions have quite small uxes coexisting with a few reactions with extremely high A similar pattern is observed when the strength of the various genetic regulatory interactions that are provided by microarray datasets are investigated Capturing the degree to which each pair of genes is coexpressed that is assigning each pair a correlation coefficient or examining the local similarities inpere turbed transcriptome profiles of S cerevisiaeindicates that the nctional organization of genetic regulatory networks might also e highly unevengm hat is although most of them only have weak correlations a few pairs show quite a signi cant correlation coefficient These highly correlated pairs probably correspond to direct regulatory and protein interactions This hypothe esis is supported by the finding that the correlations are higher along the links of the protein interaction net work or betweenproteins that occur inthe same come plex as compared to pairs of proteins that are not known to interact directly79gt95 97 Taken together these results indicate that the bio chemical activity in both the metabolic and genetic networks is dominated by several hot links that repre sent high activity interactions that are embedded into a web ofless active interactions This attribute does not seem to be aunique feature of biological systems origin of this seemingly universal property of the links is probably rooted again in the network topology Indeed it seems that the metabolic uxes and the weights oflinks in some nonebiological systems are uniquely determined by the scaleefree nature ofthe network topologymg At present a more general prin ciple that could explain the coexpression distribution data equally well is lacking in an abstract space However a cell s internal state or position in the cell cycle for example is a key determinant of actual interactions102 that requires data collection in distinct functional and temporal states Equally importantly all these interactions take place in the context of the cell s physical existence So its unique intracellular milieu threeedimensional shape anatomical architecture compartmentalization and the state of its cytoskeleton are likely to further restrict the potential interactions in cellular networks Finally most studies have so far focused on different subsets of the complex cellular networks Integrated studies that allowus to look at all metabolic regulatory spatial and so on interactions could 0 er er insights into how the network of networks contributes to the cell s observe i i i i r r i utilization pathway1 03 Extending them to the whole cele lular network of an organism is the ultimate aim of net work and systems biology c onclu si ons It is impossible to ignore the apparent universality we have witnessed by delving into the totality of pairwise interactions among the various molecules of a cell Instead of chance and randomness we have found a high degree of internal order that governs the cell s mole ecular organization Along the way a new language has been created which allows the cell s molecular makeup to be discussed as a network of interacting constituents and to spot and quantify the interplay between behave function The cell canL I I L A from the bottom up moving from molecules to motifs and modules or from the top to the bottom starting from the network s scaleefree and hierarchical nature and moving to the organismespeci c modules and mole ecules5 In either case it must be acknowledged that structure topology network usage robustness and flmction are deeply interlinked forcing us to complee ment the local moleculeebased research with integrated approaches that address the properties ofthe cell as a whole innr NATURE REVIEWS l G E N ETI cs VOLUME 5 FEBRUARY 2004 l 1 1 1 REVIEWS It is now clearly understood that most cellular funca tions are carried out by groups of molecules within functional modules These modules are not isolated from each other they interact and frequently overlap for 104 within a network with an inherent which L range is constrained bythe underlying topologym This organizational framework is shaped during evolution at e accumulation of local changes that the small highly integrated modules slowly impacts the larger less integrated modules which indicates that evolution and natural selection reuse existing modules to example see REF quot1 1 39 many levels 1 m uiei its complexit This developing framework will significantly alter our understanding ofbiology and eventually will in L39 LIA affect 1 L L 1 1 piuuauuny auu have important implications for the practice of media cine The breathtaking advances of modern molecue lar reductionist biology are starting to pay clinical dividends from the diagnosis of selected leukaemias on a molecular level to their molecularly targeted treatment with for example receptor tyrosine kinase inhibitors Network biology offers the possibility of simultaneous advances in the coming deca es The widespread use ofmicroarrays to refine pathology diagnosis is already evident for example see REF 105 What is lacking is a welladeveloped framework in which such clinical data can be used to identify mode ules that are I altered in a given disease state Once such a framework is developed the tare geted pharmaceutical modification such as rewiring of diseased modules will surely follow 1 HartwellL H HopfieldJ J LeiblerS 3MurrayA W 25 i t P p r 4 kirn i krani Di R R rin r r m molecular to modular cell biology Nature 402 interactions in Saccharomyces cerevisiae Nature 403 li39iii39iltenorder percolation and giant iiuctuations in a protein C47eC52 1999 525527 2000 interaction ne ork Phys Rev E Sgt Nonin Soft 39 39 39 ne ltoT MatterPhys 66055101 2002 modular organization ofbioiogicai functions the yeast protein interactome Proc NatAcad Sci USA 93 45 agner A How e global structure otprotein interaction 2 Hasty J McMilen D 3 Collins J J Engineered gene 455974574 2001 networks evolves Proc R Soc Lond B 270 4577455 circuits Nature 420 22472509002 25 Eeatherstone D E 3Broadie ilt Wrestling wittn pleiotropy 005 5 kitano H 45 Eisenberg E Levanon E v Preferential attachmentin the 205210 2002 expression network Bpessays 24 2577274 2002 protein network evolution Phys Rev Lett 91 158701 2005 4 iltoonin dt Yl 3iltarev G P 2 Na nl H 47 Ra sz E 3 ab slAnL Hierarchical Organization in protein universe an genome evolution Nature 420 constructed from gene expression data Phys Rev Lett 39 complex networks Phys Rev E Stat Nonfn SoftMatter 218225 2002 258702 2002 Phys 67 0251122005 5 oltvai Z N 3 Barabasl A 7L Utes complexity pyramid 27 Wuchty S Scaleriree behaviorin protein domain networks 48 Aiberts B The cell as acollection of protein machines Scrence 293 755754 2002 Moi Brot Evol 13 1594e1 702 2001 preparing the next generation otmolecular biologists Get 92 5 Wall M E Hlavacekvv S 3Sava eau M A Design of 28 Apic G GoughJ 3Teichmann S A Aninsight into 2917294 1998 gene circuits lessons from bacteria Nature Rev Genet 5 domain combinations Bpinformatcs 17 S85S89 2001 49 Simon l etat Serial regulation of transcriptional regulators in 54742 2004 29 Shenr rr til R Manna i n ii t n in ai ii i o 001 7 Bray D Molecular networks the topndowi39i mew Scrence ts in the transcriptional regulation network of 50 Tyson J J Cslkasanagy A 3Novak B The dynamics of 301 185471865 2005 Eschemhia cot Nature Genet 31 54858 2002 cell cyde regulation Bpessays 24 109571109 2002 8 AlonU Biological networks the tinkerer asanengineer 50 MiloR ShenrOrnS lth it ka htan N i n U 51 M riarn H H hanir i i Scrence 301 185518572005 Network motifs simple building blocks of complex regulatory network operating in time and space Scrence g in n R Raraha i e n n rk n 293 L8 301 1874718779008 networks Rev Mod Rhys 74 47797 2002 References 29 and 30 introduce the concept ofmotifs 52 B la Ram P T 3lyengar R MAP kinase 1 n r n ts N M nri i F E 39 39 39 39 39 phosphatase asalocus Of exibllltyll iamltogel inactivated 31 n i r in R ian n l rn i iininnrh n protein kinase signaling network Scence297 10181025 University Press Oxford 2005 network Nature 403 5077510 2000 02 11 Bornholdt S 3Schuster H G HandbookofGraphsand 52 Milgram S The small world problem Psychol Today 2 50 55 Ravasz E SomeraA L Mongru D Aoltvalz N 3 Networks from he Genome to henternet WileynVCi i 195 Barabasi ArL Hierarchical Organization otmodularityin Berlin Germany 2005 55 Watts D J 3Stro atz S H Collective dynamics of metabolicnetworks Scrence 297 1551715552002 12 Stroga H Exploring complexnetworks Nature41o smallrworld networks Nature393 445442 1998 This paper introduced the conceptofhierarchicai 258275 2001 54 Chung F 3Lu L n rm rir quot 39 39 15 Bollobas B Random Graphs Academic Press London with given expected degrees Proc NatAcad Sci USA 99 54 lizkowiz S Milo R kashtan N Ziv G 3Aion U 1985 1587945882 2002 Subgraphs in random networks Phys Rev E Stat Month 14 ErdosP 3RenyiA 55 r h n R ntin SoftMatterF hys 63025127 2005 Pool Math inst Hung Acad Sci 5 17751 1950 Phys Rev Lett 90 058701 2005 55 Wuchty S oltvai Z N 3Barabasi A 7L Evolutionary 15 Barab slAnL ih rt R 55 Ma l n nn n k networks Scrence 236 5097512 1999 of protein networks Scrence 296 915915 2002 interaction network Nature Genet 35 175179 2005 This paper introduced the concept ofscaIefree 39s ape reports th tin protein interaction networks 55 Conant G C 3Wagner A Convergentevolution ofgene networks and proposed a mechanism for their the highly connected nodes tend to link to less circuits Nature Genet 34 25472459005 emergence connected proteins which is the socaiied 57 HinmanVFN en T amer 15 Jeong H Tombor B Albert R oltvaiZ N 3 disassortative proper Developmental gene regulatorynetwork architecture across Barab leAnL The lar enscale organization ofmetabolic 57 astornSatorras R VazquezA 3VespignariiA Dyl lamlcal 50 mil ion ye rs o inod m e ion Proc a networks Nature 407 5517554 2 00 and correlation properties otthelntemet Phys Rev Let i USA1oo 155551 551 005 17 W ner ll esmall world inside large metabolic 37 258701 2001 58 Dorogovtsev S N A V 3Mend sJ E F networks Proc R Soc Lond B263 180518102001 5 Newman M E J Assortativemixing in networks s udotractals er web Phys v E Sgt Nonin Soft neier n ma ii 39 h R9 ti 012002 atterPhys 65 055122 002 Iargesca rg ization of metabolic networks 59 thetsky A 3 Gomez S M Birth otscaleriree molecular 59 Scinuster S PteiiterT Moldenhauer Eiltochl 3 showmg its scalefree nature networks and the number otdistinct DNA and protein Dandekar T Exploring the pathway structure otmetabolism 18 Jeong H Mason S P Bar shAnL 3oltvaiz domains pergenome Broiriformatrc517 988995 decomposition intosubnetworks and application to Lethality and centrality in protein networks Nature 411 2001 My plasma neumoriiae Bromformatrcs 13 5517551 41742 2001 40 Qlari J Luscombe N M 3 Gerstein M Protein family and 2002 19 agner A a id 50 Snel B Bork P 3 Huynen M A The identification of rapidly and contains rew redundant duplicate genes evolutionany model J Mol Bot 313 575581 2001 functional modules from the genomic associaion or genes Moi Bpl Evot 13 12851292 2001 4 Bhan A Galas D J 3Dewey T G Aduplication growth oc NatAcad Sci USA 99 58955895 2002 20 Giot L etal Aprotein interaction map otDrosophia model ofgene expression networks Bromformatrcs 13 51 Girvan M 3 Newman M E J Community structure in soda meanogaster Scrence 302 1727717552005 1485 20 and biological networks Proc NatA REVIEWS l liRiHavlll liS 3 plex l39 ain DS n oq ic analyses 34 Splrll hV 3MirnyL A Protein complexes and functlol39lal 81 Alon Li surette lvl GBarkalN ampL ihl r p hii tn 95 H liii chir h e M xvidai M modules in molecular networks Proc NatAcad Sci USA in bacterlal chemotaxis Nature 397 138171 1999 transcriptome and interactome mapping data trom 100 1212012128 20 3 References 80 and 31 represents the rst acchar ycescerevlSae Nature Genet 29 4827485 55 lhm l l 39 39 2001 transcriptional network Nature Genet 31 370377 2002 robustness ofa cellular subnetwork focusing on the 97 Jansen ere n aurm D 3 Gerstell h M Relating whole 55 Baden G D amp Hoguel 39 39 uu genome expression datawith protell39laprotell39l interactions interactlon data obtained trom ditterent sources Nature 82 von Dassovw e Meln E Munro E M ampOdell e M nome Res 12 3743 2002 Biotechnol2099179979002 98 k l kahnn n kirn n eiuctuationdriven 37 Stuartd M Segah E lltoller D 3km S K Agel39ler module Nature406i18amp1922000 ynamics otthe internet topology Phys Rev Lett 33 88 lb rt F2 othm r H G 108 Oi 2002 geneticmodules Scence 302 240255 2003 interactions predicts the expressm pattern oftne segment 99 Braurlstell39h L A Buldyrey S v Cohe 38 Tomovw S 3 Melyes H w i Them 8390 Stanley H E Optimal paths in disordered com protein interactio tworks and gene expression Nucleic 2231480008 netw r s 0 3 Ac os Res 3102805289 2003 84 Km ell M Memafm EVONabl l y F VOCl NatAC SCl 100 Merlezes M A ampBarab slAaL Fluctuations in network 39 Jansen R etal A Bayesian networks approach tor USA 95 8429842709984 dynamics Phys Rey Lett in the press 39 39 8 quot M 101E itMBlti n d inniae Science 302 440453 2003 l 70 Barrdoseplm z etal Computational discovery otgene Readlngl 1975 11801183 2002 y nn 91 85 F m 102 Zeitiingerd etal Pro ramrspecl cdlstrlbu o ota 1337e1342 2003 Pomandl LOW 1997 transcription tactordependenton artner transcription 71 Albert R Jeol39lg H ampBarab slAaL Errorand attack 87 sml l gl Walssoni Br 0 The underlwng pathway tactora d lltsignaling Ce113 395e4042003i tderance of complex networks Namrelioe 378382 2000 owe 0 b omm ca ream quotworks P Oc39 NatMad 103 ldeker T et al lntegrated genomic and proteom This paper addresses the topological robustness Scquot USA 95 4198 1998 ofasystematlcally perturbed metabolic network Science and vulnerability of complex networks 88 segrel D lV Wpl D 8 Chum G M MEWS 0 292 920934 2001 72 winzelenE A etal Eunctionalcharacteiization otthe WowMad Sm USA 99 15112451179002 104 Danial N N 89 PM W WM p U 388m B Wm mltochondrlal complex thatintegrates gtycolysis and Science 235 901903 1999 apoptosis Nature 424 952953 2003 i predictions or Escherichia col metabolic capabilities are 73 Glaeven ei etal Functlol39lal protiling or the 105 AllzadelmA tal Distincttypes otdittuse large Brcell Sacchayomyces cereyisiae genome Nature 413 w temw m emer mema dab Name Bmecm 19 lymphomaidentit ied by gene expression prot39iling Nature 3873939 2002 90 go ogdwards J S ampPalsson B o Escherichia 4 3l5OH 20 O 74 s V 103 Emmerlll39lg M etal Metabolictlux responses to pyruvate r 420 185189 2002 kinase knockoutin Escherichia coli Jthero1Mi 75 Bgct eWl 35 557MB 2003 After their theoretical work on llux balance analysis 152454 20 using aTrT rlargeted CreloxP exo sion system Nature Bbtecnnoi 20 101810232002 mambo uxvalues We thank two anonymous reyiewers tor their comments and 73 Kollsl lychel lkq v eta Engineering areduced 91 aas E Kov cs BHV Csem OM Z N amp M Vidal tor sharing unpublished work This research was sup Escherichia coli genome Genome Res 12 340347 Baab s ASL ni 2002 E col Nature in the press of Energy to A 7L B and z N 0 and the National Science 77 Frasen H B leslmA E i Stell39lrnetz L M i Scharfa c 3 92 dewuente A Bram p ampMendes p mm he FoundationaoArLB Feldmarh M network Science 296 750752 2002 data pends Genet 13 395398 2002 Competll ig lnterests statement 78 Krylow D M W0 Yl Rogozll39hl B ampKool lll l E v 93 Kumetsov y A mm G D Women R F Genera The authors dedare that theyhaye no competing l39larlClal interests G ssi t l equen l i statistics otstochasticproce sotgene expression in dlspel lsablllty expression level and interactiyity are eucamnc cells Genet 161 13211332 2002 correlatedi eukaryotic evolution Genome R9813 git Farms 1 ieong H Vicsek T Barabagi AeL amp 22297228 Oltvah z N The topology or the transcription regulatory 79 ezqu OltVahZ N ampBarab slArL Biointormatlcs network in theyeastSacc aromyces ceieyisiae Physica FURTHER INFORMATION anatysls or experim entally determined protein complexes in 313 301e312 2003 AlberlaLa szl Barab sl s laboratow http Awww nd edualb the ye st Saccharomyces cereysiae Genome Res 13 95 erigoriey A relationship between gene expression and Zolm n N Oltvalls laboratory 24502454 2003 d 80 Barkal N 3LeiblenS Robustness in simple blochemlcal bacteriophageT t l networks Nature 337 910917 1997 NuceicAcos Res 29 35103519 2001 Access to this interactive links box isrree onli e NATURE REVIEWS l G E N ETI cs VOLUME 5 FEBRUARY 2004 l 1 1 3 ONLINE outtheAuthors AlberteLaszlo Barabasi is the Emil T Hofman Professor of physics at the University of Notre Dame USA His research group intro duced the concept of scaleefree networks and studied their relee Vance to biological and communication systems He obtained his MSc degree in physics in 1991 from Eotvos Lorand University Budapest Hungary and his PhD in 1994 from Boston University USA After a year as a postdoctoral fellow at IBM Thomas J Watson Research Center USA he joined the University of Notre Dame in 1995 He is a fellow of the American Physical Society and the author of the general audience book Linked The New Science of Networks Zoltan Nagy Oltvai is an assistant professor of pathology at Northwestern University s Feinberg School of Medicine USA His clinical interest is molecular pathology and he is the director of diagnostic molecular pathology at the medical school and Northwestern Memorial Hospital His research group s interest is the theoretical and experimental study of intracellular molecular interaction networks He received his MD degree from Semmelweiss Medical University Budapest Hungary and did his clinical pathology molecular biology research residency at Washington University Barnes Hospital in St Louis USA At a ance 39 he emergence of new highethroughput dataecollection techniques increasingly allows us to simultaneously interrogate the status of a cell s components and to determine how and when these molecules interact with each other 39 Various types of molecular interaction webs including proteiniproe tein interaction m tabnlir 39 quot Ira n cr 391 tione net works emerge from the sum of these interactions that together are principal determinants of the systemescale behaviour 0 39 A major c allenge of contemporary biology is to embark on an inte7 grated theoretical and experimental programme to map out under stand and model in quantifiable terms the topological and dynamical properties of the various networks that control the behaviour of the cell 39 Here we review the present knowledge of the designprinciples for the structure and systemescale mction of cellular networks and the evoe 1utionary mechanisms that might have shaped their development 39 A key insight is that the architectural features of molecular interaction networks within a cell are shared to alarge degree by other complex systems such as the Internet computer chips or society This unexe pected universality suggests that similar laws govern the development and ftmction of most complex networks in nature 39 Providing that suf cient formalism will be developed this new concepe tual framework could potentially I I of molecular cell biology view and practice Links ArL Barabasi s laboratory httpwwwndeduhalb ZNOltvai s A quot Selfeorganized netoworks see wwwndedunetworks ml I Conserved patterns of protein interaction in multiple species Roded Sharan Silpa Suthram Ryan M Kelley Tania Kuhn Scott McCuine Peter Uetz Taylor Sittler ker ll Richard M Karp and Trey lde in i u l947 Center street Berkeley 0494704 1Department of e iiipu i p i ii Bioengineering University of California at san Diego 9500 Gilrn an Drive La iolla CA 92093 and lristltute of Genetics Research Center Karlsrun rnany Postfach 3640 D77602l Karlsruhe Ger Contributed by Richard M Karp December 22 2004 To elucidate cellular machinery on a global scale we performed a multiple compar39so 39 interaction netwo k nagaster and Saccharomyces cerevisiae This comparison inte A 71 network regions that were conserved across all three species and many exciusiy to the me zoa We used t is conservation and round statistically significant support for 4545 previously unde interactions We tested so interaction predictions for yeast by ybrid analysis confirming approxrmateiy half of these Sig oft epr 39 ct39 39 u not have been identified from sequence simiiari aio strating that network comparison p essentia information beyond what is gleaned from the genom corn paratiye analysis l multiple alignrn ent l protein network l east twohybrid Awmplex networks of interading genes proteirs and small L A Mm Methods We developed a general framework for comparison and analysis multiple protein networks Full details are provided in Suppmtmg Text Figs 5711 and Tables 376 which are published 39 39 39 on the PNAS web site Brie y this process integrates interactions with sequen information to generate a network alignment raph Each node in the graph consists of a group ofsequencesimilar proteins one from each 9 protein interactions between the corresponding protein groups F39 1 A search over the network alignment is performed to C E 011 nse clusters of interactions which in oomplexes The search 39s grided by reliability estimates for each protein interaction computed based on a method by Bader o oi re which are combined into a probabilistic model forscoring subnetworks Under the model a log likelihood ratio score 39s used clusterversus its likelihood given that each species39 interactionmap inwholesgenome approaches are now enabling us to racterize t ese networks systematically by using pr ocedures sun as th twoshybrid assay 1 and protein coimmunoprecipitation 2 to 39 39 39 39 anlate 39 have generated large interaction networks for bacteria 3 yeast 477 and recently fruit fly 8 and nematode worm The large amount of protein interaction data now available A h n and function 39 roles to interartinn rim 39 39 39 39 39 false positives 11 and ultimately organizing largesscale interao that r in a real subnetwork each interaction should be present independently with higr probability and n in a random subnet work the probability of an interaction between any two proteins The search algorithm exhaustively identi es wanting subs network seeds and expands them in a greedy fashion The sig1ifr icance ofthe iientified subnetworks isevaluated by comparirg their 39 39 in whirh earh nf tion data into models of cellular sigialirg and regulatory machine Results ery 39 t in biology an h b A i 1 tionary US930 wmpar39som provides a valuable framework form a threesway alignment of the proteinrprotein intern rhubdlt for addressing these challenges However althtmgh comparing DNA and protein sequences have be 39 39 39 an paw methods for n a mainstay of at other levels of biological infonnation including protein interao trons 12714 metabolic networks 1H7 or gene expressiondata be I evrsed a method called PATHBLAST 13 for comparing the protein interaction networks of two species Just as PATHBLAST is based on ef cient alignment of two protein networks 39 entify conserved network re 39ons Here we extend this ap proach to present a computational framework for aligrment and compa 39 n of more than two protein networks We apply this to per action networks of Cuem is eleguns Dmmphllu melanin gamer and Succhummyces cereyrsme These species span the A and along with mouse comprise the major model organisms used to study cellular physiology development and disease Protein in raction data were obtained from the Interacting Proteins 23 February 2004 download tained 14319 interactions among 4389 proteins in yeast 3926 39 39 39 39 an 720 interacs tions among 7038 proteins in fly Protein sequences obtaine from the Succhdmmyces Genome Database 24 WormBase able protein networks for worm fly and yeast and show that although any single network contairs falsespositive interactiors embedded beneath this noise are a repertoire ofprotein interaction I II fh 39 hhr inn n ntin tvresenteddress stnooi of Computer science reiAyiy University reiAyiy 59978 isreei To whom correspondence may be addressed Esmall kerpsitsi berkeleyedu or treyez cioengdtsdedd 19744979 l PNAS l February81m5 l vol i0z l no 6 wwwpnas orgcgidoii0 l073pria 0409szzi0z INA Prepruressing Network alignment itiniis bxprossioii ccnelariori clusKLmlg snelr groups iiipluii nlpmiriris one per species wimin In liesl matches BLAST Evirllt In Suhnetwnrk Search Collsewed paths Conserved clusters Filtering pvaliiseam 580 uvcrlap Flg1 it man proteins network nelghb i iiiu iw ii N iiiieratt or rm to or at a signitiranre level ot P lt 0 01 are retained A final tiltering step removes path and luster With gt80 n overlap 25 and FlyBase 26 were combined with the protein inter action data to generate a network alignment of 9011 protein 39 39 39 A 4 39 39 ne hiee networks A search over the network alignment identi ed 183 protein our These covered a total of 649 proteins among yeast worm and fly in Fig 2 The identified conserved clusters and paths along with Fig 3 shows a global map of all clusters and paths conserved among the yeast worm and fly protein networls The map shows iiiieeiiiiay quot 39 H i ri e network ali nmens yeastworm yeastfly and wormfly This process identified 220 significant conserved clusters for yeastworm 835 for yeastfly and 132 for wormfly Several 39 L ares own in Fig 9 Global overviews oft e pairwise conserved clusters similar to Fig 3 are provided in Fi 678 alysis of the prokeins shared among the different pairwise and threeeway network comparisons led to two general findings First r x x compan39son were considerably greater than for the other compar isons because of the large amouns of interaction data for these species relative to worm see Table 6 and Fig 11 Second the todefine 71 LN L L L L L L L arising from the other analyses For example only 29 of the L L L L L LL L L L proteins in the wormfly clus eis were assigne o conserved quot 3quot f L L L A L clusters in the threeeway analysis 135 of 462 This observation is consistentwith the closer taxonomic relationship of worm and fly uppernght L 39 39 39 oaiis ior NA W R the c elegum proteinepmtein interaction screen roughly onee 1quot quot 39 39 39 39 2an 21mm r A n we processes for instance linking kinase signaling red to protein catabolism green lower right or to regulation of transcription yellow upper middle To validate our resuls we compared these conserved clusters to known complexes in east as annotaed by the Munich Information Center for Protein Sequences MJPS 27 We only considered t 39 39 39 39 pwln in i i i u I r experimens Overall the network alignment contained 486 ans hierarchy We defined a cluster to be pure if it contained three or annotation Ninetyefour percent of the conserved clusters were pure indicating the high specificity of our approach compared to of our method to data from yeast only ie applying the same methodology to search for highescoring cluskeis within the yeast network on We further checked whether the conserved clusters were biased by spurious interactions resulting from Stickyquot prokeins that lead to positive twoehybrid tess without interaction of 39 proteins with gt50 network neighbors only 10 were included in conserved cluskeis These 10 proteins were involved in 60 intracluster inter actions 85 of which were supporte 39 39 39 39 experimens This finding indicakes that the clusters were notbiased because of artifacs of the yeast Woehybrid assays Sharan et a i e twoethirds had no clear yeast ortholog 9 39 39 39 39 many proteins of the same known function suggest that their 39 39 39 L hm funr nn V we predicted protein functions whenever the set of prokeins in a A h t or path combined over all species was signifie cantly enriched for a particular Gene ontology G0 23 3111107 tation P lt 001 and at least halfof the annotaed proteins in the clusteror path had that annotation When these criteria were met enriched GO annotation see Scppcmiig Text T is process resulted in 4669 predictions of previously under scribed GO Biological rocess tations spanning 1442 distinct proteins in yeast worm and fly and 3221 predictions of GO Molecular Function annotations spanning 1120 proteins We 857 in which one hides part of the data uses the rest of the data for prediction and tess the prediction success by using e he out datasee Scppcmiig Text As shown in Table 1 depending on the speci 5se four predictions of GO rocesses agreed wit the known annotations see also Tables 4 This lysis outperformed a sequenwebased method of annotating proteins based on the known functions of their best sequence mamhes for r r r xi s a Tm L L L in Table 7 which 39s published as supporting information on the PNAS web site PNAS l rebiuaryx2oos l vol 102 i no a l 1975 EVOLUTION HEIR a IIIInIuIllllnrlnlnwurl C nuwmnmu rrguln lm mu mu gt d IIM nmnnnnm 9 RNA nwluhulhm i Prumln uldlnu a W k I husphnrmmrulhwlhm e Prmein raining I hunk3r lnnnpnn u b thplmrnamcmlmlhm 323 m 39rrumcrxpxiunux rl39gnlminll g Ilulk39 mum Fig 2 m H ybue hexagom protem are e e the same row of he ahgnment Automated ayout of network Text u evidence r alignment to predict proteinrpmtein physical inmractions We 39 39 mannaan predicmd an inmraction between a pair of prams based on x L 39 L 39 197s vvvvanasorggwdowWO1073pn350409522102 sharan eta GO Ccllulllr Praccsses ii pmlifgmutm 0 gquot smcmholls n e I 399 s iiiascsigna inpi I dwclupmvm c ax intracellulnr trnnspori gmwlh nucleci oplnsii I Iimneoaiasis quotm protein folding RNA t be I am 93W mi I c a c ro n I rbzulallonollr llsbnpllbn m 10 lWP snared proteins 9 Connecting pallis gtlw lllllllllllll r Flg 3 Modular structure or nriri m ni rt int 7i aiiiiiiiiuii iuii Ml rlnn h r shared i a i Fig 2 path The accuracy of these predictions was evaluated by using Sefold cross validation as described in Suppmtmg Text 1n cross validation strategyi achieved 7778401 specificity and 2350 sen iti itv quot F 39 39 l39 39 made see Tables2 and 5 These resuln were highly significant for virtually all false positive predictions specificity gt99 while L L L I WMW 15quot a T3 82 t combined strategies we were able to predict 176 previously under scribed interact39 ns for yeast 1139 for worm and 1294 for fly with Table 1 Crossvalidation results for protein cellular process prediction o No of Success Species correct predictions rate quota Yeast l l 4 l93 53 Worrn s7 95 so Fly l l 5 mm 63 For each species the nurnber or correct predictions the total nurnber or nr nini n sharan et al high confidence Thus although protein interactions have been used previously to predict interactions among the orthologous L M L L l 39 impro es the specificity of prediction The complete list of predicted protein interactions is provided in Tim L web site Table 2 Crossvalidation results for protein interaction prediction Sensitiyity Specificity quota quota Species Pyalue Strategy east so 77 i leezs i Worrn 43 32 Wei 3 1 Fly 23 34 s 3ees Yeast 9 99 1296 i ii Worrn lo loo oeezt i ii Fly o 4 loo o s i ii For each species the specificity and sensitiyity or the predictions in Mold crossyalidation the significance or the results and the prediction strategy PNAS l FebmaryK2005 l vol loz l no a l 1977 EVOLUTION a Pun tnwrnton rs Ascon PmJnlunwm vousuon vuuioc Vo zstc At 7 vantssw YMRHA cAn vimsmvvwvnto AZ vrsinszwvomaaw c a w osew vomww A mouse same m mom vx mac M VHRDJDE moow m vaaAsswmiAzsw As veuaowv sic in mom voR At vetosvc c o no wth n s Vme Vultzaw m a YiLOS iC Youza Au vzwAsw music as vsmsswmnosaw A9 mum vEan as vAumc mamrc m a mom VPmalc n7 YGRMWVL mw m 7 vameow were on vnowc mttsw m vmmec vowtr m mam v am at a mains mksaaw mu s vanavc VGRZSDC a v ViLDSiD vmnsw mt vemswvotoaaw a s matte mme m vrsaAsswvnnsoac m vERlzavaUmc E1 s vomewmnsuw as vumwkum 1 7 vasuvastmw an a mum wow 5 a Vain vzmssw nA vousawvmszw so vsntss v zssc an vntotsw Nina is 1 Wmth VMRJMW as mum to an vuwzaw vomrlw on v mmsw Risav 7 maauwvomsm an A vuiazswmaztoc a vouszw sum or a vain mmnc as F aszw mums ct A VDNVWVLHMJW In vntmw made 02 wwmscmmw En e Von 2wv 27w s vsm39s7wvnnwoc m VBLuazw vanaew a a CT 39nme tUJgt is mnwwwilb WIHE 0 t mew cs lwv n3 vrsrewvcsmc c1 vstaAswvcmAAw F4 Vommwvbtnzaw c voRmscvomsvc rs name a Fig 4 VeAAtAtatAon A A hybAAdtestAng c A A uA A than on the agar PrtAnivs p p haltr nnA l Pr t An oteam pan p a u ua uA quotaquotintheht quot7 in a oiome D4 E2 and E5 3 1 l A more tonservatwetally A Au lgot To further evaluate the utility of protein interaction prediction the 1139 specificity of the resulting predictions can be maintained L l 39 l 39 A h see below interactions that were predicted for yeast by using the combined 39 39 4a l 39 Best oust Hits May Not lmply Functional Conservation Frequently L the 39 quot 39 39 quot pAerins bee s 17 A t transcriptionally activated if the two tested proteins bait and prey can physically interact see Suppmtmg Text and Fig 41 Five of the an Ay any prey Fig 4c of the remaining 60 putative interactions 31 tested positive more conservatively 19 of 48 see Fig 4 yielding an o 39 40752 verall success rate in the range of Discu n comparison to Existing Methods We have developed pairwise tween species even though they are not each other39s best match For instance the conserved network region in Fig 5 t at t e worm protein ean la the same functional role asyeast Pabl and fly CG33070 BLAST Ervalue N 10 based on the conserved interactions with Ascl F08G122Rack1 yeastwormfly RnalSUnce75 yeast worm and T01D12prh wormfly However G33070 is only the fifth best mar match in fly overall the best being CG3151 atE Value N 10397 Overall of the 679 protein triples within a threerway conserve t r only 177 contained at least L aligned at the same position a a t clus e bacteria Hellwbmter pylcn The multiple network alignment W Wquot dpaths quot W SAOWS L Clem h A heme 9quot 3 a P 56quot in some cases the best matches are not present within conserved l A b 39 39 39 39 39 L 39n networks of of the current approach over the previous approaches are A it is a unified method to detect both paths and clusters which generr refined probabilistic model for protein interaction data and in it in ludes an automatic sysem for laying out and visualizing the resulting conserved subnetworks A related method that uses crossspecies data for predicting ppr one or more species However it is unlikely that true inmractions ci 39 isons provide observations suggest that protein network com ar essential information about function oonservation multiple proteins in a cluster and across multiple spe es These P Functional Links Within Conserved Networks Conserved network protein interactions is the interolog a oach 12 18 a pair of M x 39 39 39 39 39 39 39 39 Aha may An A 39 39 39 Remuse of the matches In another speues were reported to Interact In compare appreciable error rates inherent in measuremeno of proteinr m pm d scheme can associate promins that are no r best se 11 h quence mam T is advantage previously unrelated processes would typically be ignored as a false positive However an observation that two or three netr ty in detecting conserved function by all Ao 39 39 A t t39 ene 1 Because conservation 39s evaluated in the context of a protein 1973 i wvvanasorgegAdmioionpnasuaoaszzioz especially when the interaction is embedded in a densely Cone nected conserved network region For example Fig 2n links Sharan eta protein degradation to the process of po1yA RNA elongation Althougi these two processes are not connected in this r gion of the yeast network several protein interactions link them in the networks of worm and y eg ProsZSVRacklrMsi or ProsZSV Racklr These indings are corsisten with previously documented association of proteasomes with mRNArbinding pr39 4 L39 quot aseen controversial 30 31 A related functional link between the oteasom 39 th 39 wasdetected inourearlier n 13 teraction Third predicting interactions by using a multiple 39 a h co network alignment appro c appli a set of 72 reported interactions in yeast predicting 71 previously undescribed interactions in worm Seven of the h k A assay 10 whereas 19 of the previously reported yeast inter actions 26 retested positive Corsidering only the 39 predicted based on the 19 network comparison of ye nd the bacteria H pqu interactions that were amt er examp1e jg 9 shows a wormfly wnserved lnteractlonslnyeastslx of these tested positive upper boun lng cluster forwhich 40 ofthe proteins have no significant yeast the prediction accuracy at 31 In tests of 145 addmoml nh I An nrprlirtinn 7 air gt as proteolysis Pros25 Pros2s1 Pasrlr7 actin binding 39 transport CG32810 C40A117 CherW04D21 ion C52B112 and axon guidance ra Taken together these 0 I C n ding and ion transport within an intricate network of protein interactions Validzlion ol Pruiided Inleradions Our tworhybrid tests of pre dicted interactions yielded a suu ess rate in the range of 40752 identifying protein interactions at random 0024 m an earlier tworhybrid screen 4 of 192 baits x 6000 preys e ing pairs Second tworhybrid n of interactions 11 rotein 3913 were R030C have eenshownto interact genetically in syntheticrlethal screens 27 suggesting a possible physical ins 4717439 man Nnczacacm R2 23 Shavan ets nnna 39 439 Yu et ul 29 where the accuraciesofthe interolog approach and an e ens39 39 10 ion of it were estimated at 3073 0 Conclusion Nearly all comparative genomic studies of multiple species have een based on DNA and protein sequence anal is Here we transcend that framework by resenting withlnt protemn a over evolution and at these circuits cover a va ety of well defined functio al categories B cause easure ents pr t interacti t b noisy and i omplete it would have been difficult if not impossible to fin the a is ylook tonly single sp ies M ilarit39 a d the etwork onnections t e 1 1d not have been su ed b quence similarity alone as the pro ems involved are frequently not best sequence matches emultiple net ork e a allows us to asc ibe unique functions to man prote39 d predict previously unobserved proteinrprotein interactio Therefore comparative network analysis is a powerful approach for elucidating network organization and function We thank Jacob Scott and ShenrHong Chen for help wrth the analyses and teven Brenner Jens Cram and Markus errgard for critical reading of the manuscript Thls research w s supported by grants fr m the National Science Foundation and National Institutes of Healt la 19 PNAS l February81m5 l vol 102 i no 6 l 1979 EVOLUTION Molecular Systems Biology 3 Article number 137 doi101038msh4100179 Citation Molecular Systems Biology 3 13 007 EMB nd Nature Publishing Group All rights reserved 1744 429207 www molecularsystemsbiology com EDITORIAL molecuar systems biology Towards a theory of biological robustness Molecular Systems Biology 18 September 2007 doi101038msb4100179 Robustness is one of the fundamental characteristics of biological systems Numerous reports have been published on how robustness is involved in various biological processes and on mechanisms that give rise to robustness in living systems Savageau 1985a b 1998 Barkai and Leibler 1997 Alon et al 1999 von Dassow et al 2000 Bhalla and Iyengar 2001 Csete and Doyle 2002 2004 Kitano et al 2004 2004ab Stelling et al 2004 Kitano and Oda 2006 Kitano 2007a With increasing interest in systems biology properties at the system level such as robustness have attracted serious scientific research Nevertheless a mathematical foundation that provides a unified perspective on robustness is yet to be established For systems biology to mature into a solid scientific discipline there must be a solid theoretical and methodological foundation Often systems biology is equated with computer simulation of cells and organs Although computer simulation is a powerful technique for clarifying the complex dynamics of biological systems it is also a useful tool for exploring the foundation of biological systems While investigation on the dynamic properties of specific aspects of organisms is scientifically signi cant and can be widely applied it is a study on specific instances of design within a design space that is shaped by fundamental principles structural environmental and evolutionary constraints The scientific goal of systems biology is not merely to create precision models of cells and organs but also to discover fundamental and structural principles behind biological systems that de ne the possible design space of life Figure 1 The value of understanding fundamental and structural theories is that they provide deeper insights into the governing principles that complex evolvable systems including biological systems follow Building a solid theoretical foundation of biological robustness and in particular defining a mathemae tical formulation of robustness represents a key challenge in Systems Biology Such a framework would be enormously useful as it would provide general constraints on possible architectural features of living organisms The concept of robustness First of all there must be a common understanding on what robustness means De ning any scientific term is a nontrivial issue but in this paper the following definition will be used robustness is a property that allows a system to maintain its functions against internal and external perturbations Kitano 2004a A similar definition with a slightly different phrasing was used by others such as robustness the ability to maintain performance in the face of perturbations and uncertainty is a longerecognized key property of living systems Stelling et al 2004 and is thus considered to be the most appropriate definition It is important to choose the most reasonable and 2007 ElVlBO and Nature Publishing Group appropriate definition rather than creating yet another definition of robustness To discuss robustness one must identify system function and perturbations It important to realize that robustness is concerned with maintaining functions of a system rather than system states which distinguishes robustness from stability or homeostasis Homeostasis is descri ed as ol ows Th coor inated physiological processes which maintain most of the steady states in the organism are so complex and so peculiar to living beingsiinvolving as they may the brain and nerves the heart lungs kidneys and spleen all working cooperativelyi that I have suggested a special designation for these states homeostasis The word does not imply something set and immobile a stagnation It means a conditionia condition which may vary but which is relatively constant Cannon 1932 According to this definition homeostasis is clearly a property that maintains the state of the system rather than maintains the state of the system In addition the robustness of a subsystem often contributes to homeostasis of the system at the higher level Such examples can be seen in yeast diauxic shift DeRisi et al 1997 and glycolytic shift in tumor metabolism Mazurek and Eigenbrodt 2003 in which the state of the system changes at the level of metabolic functions that maintain ATP production despite environmental perture bations This illustrates that robustnessinot stability or homeostasisiof subsystems may contribute to homeostasis of the whole system when the function maintained ATP production in our example is related to the stability of the system state at the higher level Whereas homeostasis and stability are somewhat related concepts robustness is a more general concept according to which a system is robust as long as it maintains functionality even if it transits through a new steady state or if instability actually helps the system to cope with perturbations Figure 2 Such transition between states is often observed in biological systems when facing stress conditions An extreme example can be seen in the anhye drobiosis of tardigrade that suspends metabolism almost completely if not entirely under extreme dehydration and enters the dormant state surviving for years Crowe an Crowe 000 This dormant state is attained by extensive production of trehalose and tardigrade become active again upon rehydration Such dramatic shifts can be observed in other organisms as well Singer and Lindquist 1998 and some have argued that this is a third form of life called cryptobiosis Clegg 2001 These examples of extreme robustness under harsh stress conditions show that organisms can attain an impressive degree of robustness by switching from one ste dy state to the other rather than trying to maintain a given state Such a phenotypic switch is also Molecular Systems Biology 2007 1 Editorial H Kitano Environmental constraints lelng organlsms as Instances ol deslgn Structural principles Theores on design constralnls Evolution Fundamental principles Theorles on elementary matters and Interactions Figure 1 Fundamental principles structural principles and design Living organisms are designed through evolution and perturbed under environmental constraints Each instance of design is an actual life form that exists in the past present and future Viable design is only possible within the constraints of fundamental principles and structural principles Fundamental principles include basic laws such as quantum theory Maxwell s equations basic chemistry and physics that apply to almost everything universally Structural principles govern properties of systems and have a specific architecture such as control theory communication theory and various principles applied to specific con gurations of components that are generally architecturespeci c and contextdependent For systems biology to be truly successtul not only studies on specific instances of litehlit39 quot quot39 39 L 39 quot 39 Steady state 3 Steady state 1 Steady state 2 Figure 2 Stability homeostasis and robustness Assume that the initial state of the system is at the center of steady state 1 A perturbation may drive the state of the system toward the boundary of the basin of attractor of steady state 1 When the state of the system returns to its original state it is called stability and homeostasis When it transits to steady state 2 stability is once lost and the s stem regains its stability in the new steady state It the system s functions are still intact such transition of state is considered a part of robust response The system is considered to be robust it it maintains tunctions regardless otwhether it is in steady state 1 or 2 On extreme case the system may continue to transit between multiple steady state points to cope with perturbations observed in bacteria and can be considered to be important for drugsresistance Balaban et al 2004 Robustness is also not identical to stability Some species gain robustness by increasing instability in a part of its system The HIV71 virus is robust against numerous therapeutic interventions due to a 2 Molecular Systems Biology 2007 high mutation domain Larder and Kemp 1989 Tisdale et al 1993 which is one of the general mechanisms for viral survivability Eigen 1993 and tumors are robust against various chemotherapies because chromosome instability enhances heterogeneity within a tumor cell population Baisse et al 2001 Rasnick 2002 In summary whereas robustness is a general concept homeostasis or stability can be considered as particular instances of robustness n er modern control theory a set of sophisticated methods generally called robust control has been developed Robust control assumes uncertainties in a model and de nes a method of applying stable control over the system such that proper control is guaranteed even if the model deviates from the real system due to modeling errors Zhou and Doyle 1997 Note that robust control assumes a control system that stabilizes the target system so as to be robust against model errors this mechanism for robustness is consistent with the definition of robustness given above Nevertheless control theory assumes a system that is designed to meet given criteria and so it cannot be directly applied to biological systems that have evolved and for which the desirable state of the system is not explicit In addition most of the mathematics used to describe robustness are mostly based on control theory which tend to focus on stability and performance of monostable systems A theory that take into account multistability and evolution of instable systems needs to be developed and new theoretical avenues need to be explored to provide a broad and unified account of robustness of biological systems A particularly interesting topic in the context of robustnesss is its tradesoffs What kind of tradesoff exists in biological d of co on c a oyle 2002 Highly optimized tolerance HOT theory demonstrates taking the example of a forest fire that a system that is optimized for a specific perturbation inevitably entails extreme fragility for unexpected perturbations Carlson and Doyle 1999 2002 see Box 1 Commercial jet airliners with flysbyswire control are highly robust against most component failures and atmospheric perturbations but become extremely fragile against highly improbable events such as a total power failure as they depend entirely on electric control The Wright Flyer on the other hand is a nonsrobust system but free from power failure problems because it does not use any electric system Biological examples of such tradesoffs are a un ant ome diseases can be considered as manifestations ofsuch tradesoffs Kitano et al 2004 Kitano 2004b Kitano and Oda 2006 and the efficacy and side effects of drugs may be related to robustness tradesoffs Kitano 2007b In addition biological tradesoffs may actually not only involve robustness and fragility but also resource demands and performance of the system For example having an entire n robustness of the system against certain perturbations is increased it may result in increased fragility against unexpected perturbation increased resource demands and degradation of performance A simultaneous increase of robustness and reduction of fragility 2007 ElVlBO and Nature Publishing Group Editorial H Kitano Box 1 Robustnessfragility tradeoffs in forest fire countermeasures A B Doyle 1999 2002 The HOT model argues that systems that have evolved to have a higher level of complexity are optimized for specific perturbations but at the same time are also inevitably extremely fragile against unexpected perturbations Carlson and Doyle 1999 2002 Carson and Doyle use the example of a forest fire to illustrate the intrinsic nature of the system Forest buffer zone patterns and tree planting patterns that are optimized for specific types of fire can be very fragile against unexpected types of fire and may ignite from unexpected directions A They also argued that a design that can utilize a large degree of freedom can be more optimal for anticipated perturbations and therefore can be extremely fragile against unexpected perturbations Reynolds et al 2002 Of course such tradeoffs are not limited to the simple design of buffer zone locations If one decides to plant all trees around a city in a circular manner to be able to cope with fire from any direction such design may actually allow a fire to spread and encircle the city causing major damage to the city as well B However if one cuts all the trees in the field the city will be very fragile against the rainy season as flooding will be more likely due to the resulting loss of water absorbing capacity of the trees C The point here is that such tradeoff is inherent and cannot be avoided This figure is inspired by the HOT theory Carlson and may be achieved when additional resources are integrated properly into the system or if system performance is reduced Alternatively a system s performance can be maximized by giving up robustness of the system against various perturba tions We should also note that these features are not independent Performance in terms of maneuverability of some animals in a hostile environment may translate into robustness against predator attacks Increased resource demands may translate into fragility against severe resource competition as well as perturbation on available resources The key issue is whether it is possible to nd a formalism in which robustness and its tradeoffs could be de ned so that robustness is a conserved quantity or whether the tradeoffs discussed above are bound to remain at the level of useful but empirical observations Understanding such tradeoffs would be critically important to understand the basic design principles of life at the level of individual organisms and cells It may also explain the origin of the diversity of life through evolutionary selection of design space under competitive environments Mammals have evolved to be highly robust against a broad range of perturbations but require important resources for their development and maintenance of their daily life Bacteria on the other hand have adopted a set of rather simple mechanisms at the individual level but can reproduce very quickly and sustain huge populations due to smaller resource demands than other species How can we map different evolutionary niches within a map based on robust ness and its tradeoffs Mathematical formulation of biological robustness The effort towards formalizing a theory of robustness and its tradeoffs is still in its infancy and much remains to be completed to build a mature theory For a theory to be useful it must be able to predict characteristics and behaviors of the 2007 EMBO and Nature Publishing Group system This means that the theory has to be framed to explicitly describe constraints that bind the system First of all mathematical de nitions of terms are given Robustness can be de ned as a system s characteristics that maintain one or more of its functions under external and internal perturbations Under this de nition robust ness R of the system s with regard to function a against a set of perturbations P can be mathematically described as Rap vltpgtDzltpgtdp lt1 The function 1ppis the probability for perturbation p to take place and this should be 1 when all perturbation to take place at equal probability D p is an evaluation function under perturbation p and P is the entire perturbation space The evaluation function determines if the system still maintains function under a perturbation and to what degree and is de ned as S Op E A C P W mmmm e 1M 2 A is a set of perturbations where the system failed to maintain its function This means that Dp is zero when a function does not meet a de ned criteria under perturbation p and Dp returns a relative Viability of a function under perturbation compared against nonperturbed condition otherwise For example ATP production drop 20 under a certain perturba tion compared with ATP production under unperturbed state then 08 shall be returned Note that p in this equation represents a speci c instance of a perturbation Figure 3 illustrates de nition of robustness A system Sl can be said to be more robust than a system SZ with regard to a function a against a certain set of perturbations Y when 81 82 RaY gtRaY Molecular Systems Biology 2007 3 Editorial H Kitano Degree of perturbation 9 To B r t 02 o 5 oz 9 m 02 D Features that are perturbed i gt 92 I Dp0 fpf0gt08 Dp 0 08 2 fplf0gt06 i Dp o 04 2 fpf0gt00 I Dp0 Dp 0 062 fpf0gt04 94 95 Features that are perturbed gs 96 Figure 3 Robustness Perturbations are imposed on each feature and at different degree if applicable The figure illustrate coarse grain View of perturbation space where there are six features to be perturbed each of which is perturbed at six different degree Colors on box for each perturbation indicate how system responded to each perturbation Fled box indicate that system fail to maintain its function Different blue colors show the level of degradation of the function Although the area the function is maintained is same in A and B A is considered more robust as the function is better preserved than B However considering an entire perturbation space F or sufficiently broad perturbation space robustnessefragility tradeeoff should hold thus difference of robustness AR between two systems shall be ARi sSZwtvltDiltv DSZdpRip Rpo lt3 P which is reminiscent of the Bode integral formula If robustness is conserved then above equation should be zero with equiprobability over the perturbation space see also Figure 4A Assuming that 81 and 82 are the same system but with parameters optimized for different subset of perturbae tions this equation implies that any increase in robustness against a subset of perturbation will be offset by decrease of robustness against other perturbations In fact the notion that tradeoffs between robustness and fragility represents a conservation of robustness fragility in biological systems was initially inspired by the soecalled Bode Integral formula Csete and Doyle 2002 10gl5ltwldmgt0 lt4 0 Where 50 is the sensitivity of a system at a frequency w The Bode integral formula represents conservation of sensitivity of a negative feedback NFB system along the frequency axis Bode 1945 the relevance of this theorem to biological systems is best described in Csete and Doyle 2002 see also Box 2 The Bode theorem indicates that an improvement of sensitivity gained by NFB in the lowefrequency range is traded off by increased instability in the highefrequency range In addition within the theoretical framework of Metabolic Control Analysis the summation and connectivity theorems represent constraints that are imposed on parametric changes in metabolic pathways Fell I992 implying that the sensitivity of the network is conserved for changes in rate constants As noted above tradeoff between robustness and perfor mance also need to be considered It is often the case that 4 Molecular Systems Biology 2007 systems that are particularly well tuned for a specific task under a given environment are fragile against change in the environment In contrast systems with moderate performance tend to be more robust and thus can remain functional under Width which represent similar tradeeoffs How such tradeeoff can be generalized to biological systems remains to be explored Similar argument apply to tradeoffs between robustness and resource use where robustness against component failure can be improved by having a greater level ofredlmdanm henceincrea ed A A An example of this is provided by reliability engineering which offers a mathematical basis for reduced fault rate Figure 4C Never theless it is unclear whether formulations for each tradeeoff can be integrated into a single unified system of equations However efforts to further elaborate such relationships shall provide us deeper mathematical insights into biological systems Future challenges This article briefly discussed a primitive concept of how biological robustness may be formulated mathematically and raised some of the key issues that remain to be resolved Although there are numbers of challenges ahead it is clearly understood that much of the basic mathematics are already in place provided we deal with a wellechosen set Further theoretical studies should be able to utilize such formalization as a starting point Bode integral theorem and a set of theorems from metabolic analysis have already illustrated the conserva tion of robustness to some extent and reliability engineering is a solid basis for component failure analysis Mathematical and experimental studies are still required to characterize the tradeoff relationship between robustness and performance It will be a major challenge to find out under which conditions tradeeoffs exist and how to calculate systemelevel properties such as robustness or performance when additional 2007 EMBO and Nature Publishing Group Editorial H Kitano A 31 a2 C 9 x 3 e 1 3 G t D G s r e m 0 G G 57 e 8 3 D 91 92 Q3 Q4 Q5 96 Features that are perturbed Features that are PenumEd Robustness fragility tradeoff B A lowperformance highrobustness case A highperformance lowrobustness case M vbi0b A A Flb s e E E U U C C U U E E O O t t 8 3991 3991 Degree of perturbation Degree of perturbation Robustness performance tradeoff 2 m E 0 CE Resource used Robustness resource tradeoff Figure 4 Robustness tradeoffs A If robustness is 39 r u then any 39 39 lUl quot L 39 L quot L in u in fragility elsewhere Left panel is a pro le of robustness o a r 39 mat q quot r L 39 L feature from gt to g5 Now ifthe system is tuned to cope better with perturbations of a subset of features gt g2 and g3 hen robustness against other subset of perturbations are significantly reduced right panel At both systems w er 39 39 r 39 q 39 B ll p tiaueOi i holds a y tern triati turieu tU attain 39 39 let sasslrme Ya f 0 for quot quot 39 performance of the function of the system under perturbation 0 no perturbation and Ft5 is the robustness of the system over some defined perturbations Although the figure simply refers to the colored areas for ease of understanding the exact needs to be calibrated based on Equation 1 The horizontal dashed lines indicate thethreshold under which the system fails to perform the function considered A robustnessperformance tradeoff would then imply that Y Fl Fl Identical circuits with slightly difference resource use are shown Both use NFB loop but one uses only one resistor in the loop whereas the other one uses two resistors in parallel Parallel use of components significantly improves robustness of the system against component failure but requires more resources Here the probability of m I L r r 1 1 U39 r like this one It is however challenging to derive an expression for more complex systems under various perturbations The question is how can we compute ARE Umsl msz where my and m it y tern DI and Q9 resner tivelv Inc run tlullU 39 39 39 i 39 39 39 39 tU wriiui Whether it is at all possible to define such a function and more fundamentally whether such conservation actually exists in either relative or strict manner remains open resource use is accompanied With changes in system con gue of de nitions based on biological network properties and the ration In the long run the theory should be extended to deal development of a comprehensive set of innovative computae With major structural changes This Will require the elaboration tional methods to derive such characteristic quantities for 2007 EMBO and Nature Publishing Group Molecular Systems Biology 2007 5 Editorial H Kitano Box 2 tradeoffs in and physics Frequency Normalized fragility gt Gain anilitv Therefore a larger gain reduces the sensitivit and 39 increase in robustness increases fragility in iriuea eiri quot fragility in the middle With a larger 1 v8 examples which they describe asfollows given the output of the system I H I a I IIH V IHIII I I without feedback This normalized sensitivity can be described by the equation Do 0 Frequency the tradeoff between stability in specific frequency range provided by a NFB loop is compensated by increased instability in overall gain of the amplifier This is a central issue in electric circuit design and has been intensively investigated in control theory Assuming a simple feedback circuit as seen in amplifiers the steadystate sensitivity 8 against a perturbation d of the system having feedback gain G the robustness main tnemlrhafinn i W i y a specific frequency domain A and B Sensitivity of the system against perturbations fragility is conserved An 1 r quot39 m ma u bll r I 1 7 7 r B adapted from Yi et al 2002 The mathematics behind this tradeoff is well known but is particularly well documented by Yi et al2002 related to biological l L n L I I Q Du Let 80m be a sensitivity function of the open loop system then we can define a base line sensitivity as longowl logSwdw 2 0 This inequality is essential as it implies that feedback control cannot improve overall sensitivity it only improves sensitivity in one place in a tradeoff for fragility elsewhere In addition theories that integrate tradeoffs between robustness and fragility in a feedback system with a feedback channel of limited capacity have been developed recently thus expanding the horizon of intrinsic tradeoffs involved Martins eta 2004 2007 It has been argued that the same tradeoff may Frequency Normalized fragility W 39 f In amplifier design higher frequency region and less is defined by S 39 I 39 u that such 11G WFVFV m tuuillty m y un Iliu39llidli eu iguili a would be larger asa result i itrie y tern WW I the system anu roa I extended using NFB to 100kHz by reducing the gain to 10 GBWP 100 X 10 the same time using NFB reduces the overall gain of the amplifier Tradeoffs between robustness and performance have also been thoroughly investigated In amplifier design it is well known that the gainbandwidth product GBWP is conserved C In this case the gain of the amplifier is considered as performance of r Y region where the sensitivity is reduced by the feedback loop For an amplifier with a gain of 1000 at 1 kHz GBWP 1000 X 1 nagin sete and Doyle 2002 At 39 in en illle quot this circuitwithinthefrequency 1000 the bandwidth can be 1000 This highfrequency cutoff is extended due to the feedback loop large systems This is an important undertaking as it may bring abstract theory to practical utility by providing speci c constraints underlying the organization of biological organ isms and subsystems As progresses in theoretical research will derive more concrete constraints we should be able to better predict and reverseeengineer the structures of biological networks Combined with various highethroughput experi mental data we should be able to derive the structures an dynamics of networks with higher accuracy The current mathematical formulations are mostly con cerned with the stability and maintenance of the system s functions against perturbations As discussed at the outset robustness is a broader concept than stability A theory 6 Molecular Systems Biology 2007 that would account for phase transition and instability as means to achieve robustness would need to be formulated and integrated with theories on stability Although instability based robustness involves survival of the fittest under selective pressure it needs to be integrated with mathematical framework on evolution genetics and game theory Maynard Smith 1982 Ultimately the theory will have to be interfaced with ermodynamics Studies on nonequilibrium dissipative systems are mostly focused on chemical reactions and some are trying to extend theories on nonequilibrium dissipative systems to the principles of life Prigogine et al 1974 However the theories still do not take into account the 2007 EIVIBO and Nature Publishing Group heterogeneity and structured nature of biological systems as well as selection through evolution and it is a major challenge to attempt bridging this gap The situation is similar for the elds of nonlinear dynamics and chaos for which theories that embrace the characteristics of biological systems are yet to emerge Formulation of a fundamental theory of biological systems is one of the grand challenges in biology In very general terms this will involve resolving the gap between the level of description used in t sciencesifor example the properties of ensemb e 0 mo e7 cules in a mediumiand the abstraction level used to define the concepts elaborated in this article which involve networks of biological interactions Hopefully the ideas and concepts discussed in this article will stimulate discussions and provide some stepping stones for research directed towards these ambitious objectives Acknowledgements This research is supported in part by ERATOSORST program Japan Science and Technology Agency JST Sweden Japan Strategic Collaboration Program JST Genome Network Project Ministry of Education Sports Culture Science and Technology References Alon U Surette MG Barkai N Leibler S 1999 Robustness in bacterial chemotaxis Nature 397 168 171 Baisse B Bouzourene H Saraga EP Bosman FT Benhattar J 2001 Intratumor genetic heterogeneity in advanced human colorectal adenocarcinoma Int Cancer 93 346 352 Balaban NQ Merrin J Chait R Kowalik L Leibler S 2004 Bacterial persistence as a phenotypic switch Science 305 1622 1625 Barkai N Leibler S 1997 Robustness in simple biochemical networks Nature 387 913 917 Bhalla US lyengar R 2001 Robustness of the bistable behavior of a biological signaling feedback loop Chaos 11 221 26 Bode HW 1945 Network Analysis and Feedback Ampli er Design Melbourne FL Krieger Cannon W 1932 The Wisdom of the Body New York WW Norton 8 C mpany Inc Carlson JM Doyle J 1999 Highly optimized tolerance a mechanism 39 39 rns Phys Rev E Stat Phys Plasmas 27 Carlson JM DoyleJ 2002 Complexity and robustness ProcNatlAcad Sci USA 99 Suppl 1 2538 2545 Clegg JS 2001 Cryptobiosis a peculiar state of biological organization Comp Biochem Physiol B Biochem Mol Biol 128 4 Crowe JH Crowe LM 2000 Preservation of mammalian cells learning nature s tricks Nat Biotechnol 18 145 146 Csete ME Doyle J 2004 Bow ties metabolism and disease Trends Biotechnol22 446 450 Csete ME Doyle JC 2002 Reverse engineering of biological complexity Science 295 1664 1669 DeRisi JL lyer VR Brown PO 1997 Exploring the metabolic and quot of 39 ona 39 Q ien 4778 680 686 2007 ElVlBO and Nature Publishing Group Editorial H Kitano Eigen M 1993 Viral quasispecies Sci Am 269 42 49 Fell DA 1992 Metabolic control analysis a survey of its theoretical and experimental development Biochem J286 Part 2 313 330 Kitano H 2004a Biological robustness Nat Rev Genet 5 826 837 Kitano H 2004b Cancer as a robust system implications for anticancer therapy Nat Rev Cancer 4 227 235 Kitano H 2007a Biological robustness in complex host pathogen systems ProgDrugRes 64 239 241 263 Kitano H 2007b A robustness based approach to system oriented drug design Nat Rev Drug Disc 6 202 210 Kitano H Oda K 2006 Robustness trade offs and host microbial symbiosis in the immune system Mol Syst Biol 2 0022 Kitano H Oda K Kimura T Matsuoka Y Csete M Doyle J Muramatsu M 2004 Metabolic syndrome and robustness tradeoffs Diabetes 53 Suppl3 S6 S15 Larder BA Kemp SD 1989 Multiple mutations in H1V1 reverse transcriptase confer high level resistance to zidovudine AZT cience 246 1155 1158 Martins NC Dahleh MA Doyle JC 2007 Fundamental limitations of disturbance attenuation in the presence of side information IEEE ansactions on Automatic Control 52 S6 66 Martins NC Dahleh MA Elia N 2004 Stabilization of uncertain systems in the presence of a stochastic digital link IEEE Conference on Decision and Control Nassau 1 Maynard Smith J 1982 Evolution and the Theory of Games Cambridge UK Cambridge University Press Mazurek S Eigenbrodt E 2003 The tumor metabolome Anticancer R s 23 114 1 Prigogine 1 Nicolis G Babloyantz A 1974 Nonequilibrium problems in biological phenomena Ann NYAcad Sci 231 99 105 Rasnick D 2002 Aneuploidy theory explains tumor formation the ce of immune surveillance and the failure of chemotherapy Cancer Genet Cytogenet 136 66 72 Reynolds D Carlson JM Doyle J 2002 Design degree of freedom and mechanisms for complexity Phys Rev E 016108 Savageau MA 1985a Mathematics of organizationally complex systems Biomed Biochim Acta 44 839 844 Savageau MA 1985b A theory of alternative designs for biochemical 39 39 him Acta 44 875 880 ry of gene regulation 1 Quantitative development of the theory Genetics 149 1665 1676 Singer MA Lindquist S 1998 Thermotolerance in Saccharomyces cerevisiae the Yin and Yang of trehalose ends Biotechnol 16 460 4 68 StellingJ Sauer U SzallasiZ Doyle 111 FJ DoyleJ 2004 Robustness of cellular functions Cell 118 675 685 Tisdale M Kemp SD Parry NR Larder BA 1993 Rapid in v39dro 39 human immunodeficiency virus type 1 resistant to 3 thiacytidine inhibitors due to a mutation in the YMDD region of reverse transcriptase Proc Natl Acad Sci USA 90 S653 5656 von Dassow G Meir E Munro EM Odell GM 2000 The segment polarity network is a robust developmental module Nature 406 188 192 YiTM Sauro HM 1ngalls B 2002 Limits of control and Bode s integral formula Basic Control Theo or Biologists 2002 Zhou K Doyle J 1997 Essentials OfRobust Control New Jersey USA Prentice Hall Hiroaki Kitanol392393 Sony Computer Science Laboratories Inc Shinagawa Tokyo Japan The Systems Biology Institute Shibuya Tokyo Japan and 3Department of Cancer Systems Biology Cancer Institute of Japanese Foundation of Cancer Research Koutouku Tokyo Japan Molecular Systems Biology 2007 7 Biclustering Algorithms A Survey Amos Tanay Roded SharanJr Ron Shamir May 2004 Abstract Analysis of large scale geonomics data notably gene expression has initially focused on clustering methods Recently biclustering techniques were proposed for revealing submatrices showing unique patterns We review some of the algorithmic approaches to biclustering and discuss their properties 1 Introduction Gene expression pro ling has been established over the last decade as a standard technique for obtaining a molecular ngerprint of tissues or cells in different biological conditions 18 7 Based on the availability of whole genome sequences the technology of DNA chips or microarrays allows the measurement of mRNA levels simultaneously for thousands of genes The set or vector of measured gene expression levels under one condition or sample are called the pro le of that condition Gene expression pro les are powerful sources of information and have revolutionized the way we study and understand function in biological systems 1 Given a set of gene expression pro les organized together as a gene expression matrix with rows corresponding to genes and columns corresponding to conditions a common analysis goal is to group conditions and genes into subsets that convey biological signi cance In its most common form this task translates to the computational problem known as clustering Formally given a set of elements with a vector of attributes for each element clustering aims to partition the elements into possibly hierarchically ordered disjoint sets called clusters so that within each set the attribute vectors are similar while vectors of disjoint clusters are dissimilar For example when analyzing a gene expression matrix we may apply clustering to the genes as elements given the matrix rows as attributes or cluster the conditions as elements given the matrix columns as attributes For reviews on clustering see an earlier chapter in this book Analysis via clustering makes several a priori assumptions that may not be perfectly adequate in all circumstances First clustering can be applied to either genes or samples implicitly directing the analysis to a particular School of Computer Science Tel Aviv University Tel Aviv 69978 Israel amosrshamir posttauacil TInternational Computer Science Institute 1947 Center St Berkeley CA 94704 USA rodedicsiberkeleyedu conditions conditions conditions D saua gene clusters condition clusters biclusters Figure 1 Clustering and biclustering of a gene expression matrix Clusters correspond to disjoint strips in the matrix A gene cluster must contain all columns and a condition cluster must contain all rows Biclusters correspond to arbitrary subsets of rows and columns shown here as rectangles Note that since gene condition clusters are disjoint the rows columns of the matrix can be reordered so that each cluster is a contiguous strip Similar reordering of rows and columns that shows all the biclusters as rectangles is usually impossible aspect of the system under study e g groups of patients or groups of co regulated genes Second clustering algorithms usually seek a disjoint cover of the set of elements requiring that no gene or sample belongs to more than one cluster The notion of a bicluster gives rise to a more exible computational framework A bicluster is de ned as a submatrix spanned by a set of genes and a set of samples compare Figure 1 Alter natively a bicluster may be de ned as the corresponding gene and sample subsets Given a gene expression matrix we can characterize the biological phenomena it embodies by a collection of biclusters each representing a different type of joint behavior of a set of genes in a corresponding set of samples Note that there are no a priori constraints on the organization of biclusters and in particular genes or samples can be part of more than one bicluster or of no bicluster The lack of structural constrains on biclustering solutions allows greater freedom but is consequently more vulnerable to over tting Hence biclustering algorithms must guarantee that the output biclusters are meaningful This is usually done by an accompanying statistical model or a heuristic scor ing method that de ne which of the many possible submatrices represent a signi cant biological behavior The biclustering problem is to nd a set of signi cant biclusters in a matrix In clinical applications gene expression analysis is done on tissues taken from patients with a medical condition Using such assays biologists have identi ed molecular ngerprints that can help in the classi cation and diagnosis of the patient status and guide treatment protocols 2 16 In these studies the focus is primarily on identifying pro les of expression over a subset of the genes that can be associated with clinical conditions and treatment outcomes where ideally the set of samples is equal in all but the subtype or the stage of the disease However a patient may be a part of more than one clinical group e g may suffer from syndrome A have a genetic background B and be exposed to environment C Biclustering analysis is thus highly appropriate for identifying and distinguishing the biological factors affecting the patients along with the corresponding gene subsets In functional genomics applications the goal is to understand the functions of each of the genes operating in a biological system The rationale is that genes with similar expression patterns are likely to be regulated by the same factors and therefore may share function By collecting ex pression pro les from many different biological conditions and identifying joint patterns of gene expression among them researchers have characterized transcriptional programs and assigned pu tative function to thousands of genes 23 ll 8 Since genes have multiple functions and since transcriptional programs are often based on combinatorial regulation biclustering is highly appro priate for these applications as well An important aspect of gene expression data is their high noise levels DNA chips provide only rough approximation of expression levels and are subject to errors of up to two fold the measured value 1 Any analysis method and biclustering algorithms in particular should therefore be robust enough to cope with signi cant levels of noise Below we survey some of the biclustering models and algorithms that were developed for gene expression analysis Our coverage is not exhaustive and is biased toward what we believe are the more practical methods We attempt to cover at least one method from each class of algorithms under development We do not review methods that are based on extended biological models eg inferring regulation or integrating data types 19 24 but focus on algorithms for biclustering per se Throughout we assume that we are given a set of genes V a set of conditions U and a gene expression matrix E 60 where cm is the expression level of gene 71 in sample 71 We assume that the matrix is normalized though some of the algorithms below perform additional normalization A bicluster B U V is de ned by a subset of genes V C V and a subset of conditions or samples U C U Different algorithmic approaches to the biclustering problem use different measures for the quality of a given biclustering solution We therefore de ne the goal function of each algorithm as part of its description 2 Cheng and Church s Algorithm Cheng and Church were the rst to introduce biclustering to gene expression analysis 6 Their algorithmic framework represents the biclustering problem as an optimization problem de ning a score for each candidate bicluster and developing heuristics to solve the constrained optimization problem de ned by this score function In short the constraints force the uniformity of the matrix the procedure gives preference to larger submatrices and the heuristic is a relaxed greedy algorithm Cheng and Church implicitly assume that gene condition pairs in a good bicluster have a constant expression level plus possibly additive row and column speci c effects After removing row column and submatrix averages the residual level should be as small as possible More formally given the gene expression matrix E a subset of genes I and a subset of conditions J ZjEJe i j M column subset average and 6U 6 we de ne 61 26 Z I39OW SUbSCt average 6L 2 III e 39 I I w submatr1x average We de ne the reszdue score of an element eij in a submatrix E I J as RSUQ j eij e 1 eU e 1 J and the mean square residue score of the entire submatrix as R82 H I J 26146 The intuition beh1nd th1s de n1tion can be understood v1a two examples a completely uniform matrix will have score zero More generally any submatrix in which all entries have the form eij b 03 would also have score zero Given the score de nition the maximum bicluster problem seeks a bicluster of maximum size among all biclusters with score not exceeding a threshold 6 The size can be de ned in several ways for example as the number of cells in the matrix 1 Ml or the number of rows plus number of columns 1 l lJl The maximum bicluster problem is NP hard if we force all solutions to be square matrices l1 Ml or if we use the total number of submatrix cells as our optimization goal Reductions are from Maximum Balanced Biclique or Maximum Edge Biclique Cheng and Church suggested a greedy heuristic to rapidly converge to a locally maximal submatrix with score smaller than the threshold The algorithm presented in Figure 2 can be viewed as a local search algorithm starting from the full matrix Given the threshold parameter 6 the algorithm runs in two phases In the rst phase the algorithm removes rows and columns from the full matrix At each step where the current submatrix has row set I and column set J the algorithm examines the set of possible moves For rows it calculates l l 296 RSIJ77 j and for columns it calculates ej l h 261 RSIJ77 j It then selects the highest scoring row or column and removes it from the current submatrix as long as H I J gt 6 The idea is that rowscolumns with large contribution to the score can be removed with guaranteed improvement decrease in the total mean square residue score A possible variation of this heuristic removes at each step all rowscolumns with a contribution to the residue score that is higher than some threshold In the second phase of the algorithm rows and columns are being added using the same scoring scheme but this time looking for the lowest square residues d ej at each move and terminat ing where none of the possible moves increases the matrix size without crossing the threshold 6 Upon convergence the algorithm outputs a submatrix with low mean residue and locally maximal size To discover more than one bicluster Cheng and Church suggested repeated application of the biclustering algorithm on modi ed matrices The modi cation includes randomization of the values in the cells of the previously discovered biclusters preventing the correlative signal in them to be bene cial for any other bicluster in the matrix This has the obvious effect of precluding the identi cation of biclusters with signi cant overlaps An application of the algorithm to yeast and human data is described in 6 The software is available at httparepmedharvardedubiclustering Cheng ChurchU V E 6 U conditions V genes E Gene expression matrix 6 maximal mean square residue score eij De ne e 2261 I III e De ne 6U w eij De ne e quotEI JEJ U IIIIJI De ne RSJZj eij j 6139 J R53 De ne HL J ZiEIJEJ W Initialize a bicluster I J With I U J V Deletion phase While HI J gt 6 do Compute forz E I l h 236 RSJ7Jj Compute forj E J ej l h Zia RSJ7Jj If mawiejd gt maijJeU assign I Ia7 gma13id7 Else J Ja7 gma13jej Addition phase assign I I J J While HI J lt 6 do Assign I I J J39 Compute forz E U I l h 236 RSJ7Jj Compute forj E V J ej l h Zia RSJ7Jj If mawiejd lt maijJeU assign I39 I U argmaidi Else J39 J Ua7 gma13jej Report I J Figure 2 The Cheng Church algorithm for nding a single bicluster 3 Coupled Twoway Clustering Coupled two way clustering CTWC introduced by Getz Levine and Domany 9 de nes a generic scheme for transforming a one dimensional clustering algorithm into a biclustering algo rithm The algorithm relies on having a one dimensional standard clustering algorithm that can discover signi cant termed stable in 9 clusters Given such an algorithm the coupled two way clustering procedure Will recursively apply the one dimensional algorithm to submatrices aiming to nd subsets of genes giving rise to signi cant clusters of conditions and subsets of conditions giving rise to signi cant gene clusters The submatrices de ned by such pairings are called stable submatrices and correspond to biclusters The algorithm Which is shown in Figure 3 operates on a set of gene subsets V and a set of condition subsets 21 Initially V V and 21 U The algorithm then iteratively selects a gene subset V E V and a condition subset U E 21 and applies the one dimensional clustering algorithm twice to cluster V and U on the submatriX U x V If stable clusters are detected their genecondition subsets are added to the respective sets 1221 The process is repeated until no new stable clusters can be found The implementation makes sure that each pair of subsets is not encountered more than once Note that the procedure avoids the consideration of all rows and column subsets by starting from an established row subset when forming subclusters of established column subsets and vice versa The success of the coupled two way clustering strategy depends on the performance of the given one dimensional clustering algorithm We note that many popular clustering algorithms e g K means Hierarchical SOM cannot be plugged as is into the coupled two way machinery as they do not readily distinguish signi cant clusters from non signi cant clusters or make a priori assumption on the number of clusters Getz et al have reported good results using the SPC hierarchical clustering algorithm 10 The results of the algorithm can be viewed in a hierarchical form each stable gene condition cluster is generated given a condition resp gene subset This hierarchical relation is important when trying to understand the context of joint genes or conditions behavior For example when analyzing clinical data Getz et al have focused on gene subsets giving rise to stable tissue clusters that are correlative to known clinical attributes Such gene sets may have an important biological role in the disease under study The CTWC algorithm has been applied to a variety of clinical data sets see e g 17 the software can be downloaded via the site httpctwcweizmannacil TWOWAYU V E ALG U conditions V genes E Gene expression matrix ALG one dimensional clustering algorithm Inputs a matrix and outputs signi cant stable clusters of columns or rows Initialize a hash table weight Initialize U1 2 U V1 2 V Initialize U Z V Z Initialize the sets hierarchy table HV storing for gene clusters the condition subsets used to generate them Initialize the sets hierarchy table HU storing for condition clusters the gene subsets used to generate them While U1 75 Z or V1 75 ll do Initialize empty sets U2 V2 For all U V E U1 gtlt V1 U U1 x V U U x V1 do Run ALGEUV to cluster the genes in V Add the stable gene sets to V2 Set HVVquot U for all new clusters Vquot Run ALGEUV to cluster the conditions in U Add the stable condition sets to U2 Set HUUquot V for all new clusters Uquot AssignU UUU1V VUV1 Assign U1 2 U2 V1 V2 Report U V and their hierarchies HU H V Figure 3 Coupled two way clustering 4 The Iterative Signature Algorithm In the Iterative Signature Algorithm ISA 12 5 the notion of a signi cant bicluster is de ned intrinsically on the bicluster genes and samples the samples of a bicluster uniquely de ne the genes and vice versa The intuition is that the genes in a bicluster are co regulated and thus for each sample the average gene expression over all the biclsuter s genes should be surprising unusually high or low and for each gene the average gene expression over all biclusters samples should be surprising This intuition is formalized using a simple linear model for gene expression assuming normally distributed expression levels for each gene or sample as shown below The algorithm presented in Figure 4 uses two normalized copies of the original gene expres sion matrix The matrix E0 has rows normalized to mean 0 and variance 1 and the matrix EC has columns normalized similarly We denote by 63V the mean expression of genes from V in the sample 11 and by 6 the mean expression of the gene 71 in samples from U A bicluster B U V is required to have U u e U legwl gt T000 V v e V egg gt TGaG 1 Here T0 is the threshold parameter and 00 is the standard deviation of the means 63 where v ranges over all possible genes and U is xed Similarly TC 00 are the corresponding parameters for the column set V The idea is that if the genes in V are up or down regulated in the conditions U then their average expression should be signi cantly far ie TG standard deviations from its expected value on random matrices which is 0 since the matrix is standardized A similar 1 1 being a linear sum of lU l or IV I independent standard random variables Alternatively the argument holds for the conditions in U The standard deviations can be predicted as standard deviations can be estimated directly from the data correcting for possible biases in the statistics of the speci c condition and gene sets used In other words in a bicluster the z score of each gene measured wrt the bicluster s samples and the z score of each sample measured wrt the bicluster s samples should exceed a threshold As we shall see below ISA will not discover biclusters for which the conditions 1 hold strictly but will use a relaxed version The algorithm starts from an arbitrary set of genes V0 Vin The set may be randomly gener ated or selected based on some prior knowledge The algorithm then repeatedly applies the update equations UZ39 I u E U I gt TCO39C W1 Z 7 E V 63 gt TGag The iterations are terminated at step 77 satisfying iVn i Vn i I iVn i U Vn i I for all 239 smaller than some 777 The ISA thus converges to an approximated xed point that is considered to be a bicluster The actual xed point depends on both the initial set Vin and the threshold parameters TC T0 To generate a representative set of biclusters it is possible to run ISA with many different initial conditions including known sets of associated genes or random sets and to vary the thresholds After eliminating redundancies xed points that were encountered several times the set of xed points can be analyzed as a set of biclusters The ISA algorithm can be generalized by assigning weights for each genesample such that genes samples with a signi cant behavior higher z score will have larger weights In this case the simple means used in l and 2 are replaced by weighted means The signature algorithm has been applied for nding cis regulatory modules in yeast 12 and for detecting conserved transcriptional modules across several species 4 For software see httpbarkai servweizmannacilGroupPage ISAU V E Vin TC TC m e U conditions V genes E Gene expression matrix Vin Initial gene set TC TC gene and condition z score thresholds m e stopping criteria Construct a column standardized matrix EC Construct a row standardized matrix EC Initialize counters n 0 n 0 Initialize the current genes set V 2 Vi Initialize an empty condition set U While n n39 lt m do Compute 65V 2 206V 65 for u E U U u e U legwl gt T0 xIV I Compute 6371 ZueU 65 for v E V VII VI G T VI UEV eUv gt VI VII 1f lt 6 then 77 n n n 1 Report U V Figure 4 The ISA algorithm for nding a single bicluster 5 The SAMBA Algorithm The SAMBA algorithm Statistical Algorithmic Method for Bicluster Analysis 24 20 uses prob abilistic modeling of the data and graph theoretic techniques to identify subsets of genes that jointly respond across a subset of conditions where a gene is termed responding in some condition if its expression level changes signi cantly at that condition wrt its normal level Within the SAMBA framework the expression data are modeled as a bipartite graph whose two parts correspond to conditions and genes respectively with edges for signi cant expression changes The vertex pairs in the graph are assigned weights according to a probabilistic model so that heavy subgraphs correspond to biclusters with high likelihood Discovering the most signi cant biclusters in the data reduces under this weighting scheme to nding the heaviest subgraphs in the model bipartite graph SAMBA employs a practical heuristic to search for heavy subgraphs The search algorithm is motivated by a combinatorial algorithm for nding heavy bicliques that is exponential in the maximum gene degree in the graph In the following we describe the probabilistic model used by SAMBA and the theoretical algo rithm on which the search method is based Finally the full SAMBA algorithm is presented Applications of SAMBA for gene expression data are described in 25 SAMBA was also applied to highly heterogeneous data including expression phenotype growth sensitivity protein protein interaction and ChIP chip data 24 The software is available as part of the Expander package 20 21 51 Statistical Data Modeling The SAMBA algorithm is based on representing the input expression data as a bipartite graph G U V In this graph U is the set of conditions V is the set of genes and u o E E iff o responds in condition 71 that is if the expression level of 71 changes signi cantly in u A bicluster corresponds to a subgraph H U V E of G and represents a subset V of genes that are co regulated under a subset of conditions U The weight of a subgraph or bicluster is the sum of the weights of gene condition pairs in it including edges and non edges Coupled with the graph representation is a likelihood ratio model for the data Let H U V E be a subgraph of G and denote F U x V E For a vertex w E U U V let dw denote its degree in G The null model assumes that the occurrence of each edge u o is an independent Bernoulli variable with parameter mm The probability pm is the fraction of bipartite graphs with degree sequence identical to G that contain the edge u o In practice one estimates pm using a Monte Carlo process This model tries to capture the characteristics of the different genes and conditions in the data The alternative model assumes that each edge of a bicluster occurs with constant high prob ability 190 This model re ects the belief that biclusters represent approximately uniform relations between their elements The log likelihood ratio for H is therefore logLH 2 log pc 2 log 1 190 uvEE puav WEE 1 pun Setting the weight of each edge u o to log p0 gt O and the wei ht of each non ed e u o to puw g g log i lt 0 one concludes that the score of H is simply its weight 1 puv 52 Finding Heavy Subgraphs Under the above additive scoring scheme discovering the most signi cant biclusters in the data reduces under this scoring scheme to nding the heaviest subgraphs in the bipartite graph Since the latter problem is NP hard SAMBA employs a heuristic search for such subgraphs The search uses as seeds heavy bicliques and we now present the underlying algorithm to nd good seeds In the rest of the section it will be convenient to assume that the degree of every gene is bounded by d Let G U V E be a bipartite graph with n lVl genes Let w U x V e 72 be a weight function For a pair of subsets U Q U V g V we denote by wU V the weight of the subgraph induced on U U V ie wU V ZuEUWEV The neighborhood of a vertex v denoted N v is the set of vertices adjacent to o in G 10 The Maximum Bounded Biclique problem calls for identifying a maximum weight complete subgraph of a given weighted bipartite graph G such that the vertices on one side of G have degrees bounded by d This problem can be solved in 0n2d time and space as we show next Observe that a maximum bounded biclique H U V E in G must have lUl g d Figure 5 describes a hash table based algorithm that for each vertex v E V scans all 02d subsets of its neighbors thereby identifying the heaviest biclique Each hash entry corresponds to a subset of conditions and records the total weight of edges from adjacent gene vertices The algorithm can be shown to spend 0n2d time on the hashing and nding U best Computing Vbest can be done in 0nd time so the total running time is 0n2d MaxBoundBiCliqueU V E d Initialize a hash table weight weightbest e O For all u E V do For all S g Nv do weigh S eweigh SH max0 wS If weight3 gt weightbest Ubest w 3 weightbest e weigh S Compute Vbest UEUbestN OUtPUt Ubesta Vbest Figure 5 An algorithm for the maximum bounded biclique problem Note that the algorithm can be adapted to give the k condition subsets that induce solutions of highest weight in 0712 l log k time using a priority queue data structure 53 The Full Algorithm Having described the two main components of SAMBA we are now ready to present the full algorithm which is given in Figure 6 SAMBA proceeds in two phases First the model bipartite graph is formed and the weights of vertex pairs are computed Second several heavy subgraphs are sought around each vertex of the graph This is done by starting with good seeds around the vertex and expanding them using local search The seeds are found using the hashing technique of the algorithm in Figure 5 To save on time and space the algorithm ignores genes with degree exceeding some threshold D and hash for each gene only subsets of its neighbors of size ranging from N1 to N2 The local improvement procedure iteratively applies the best modi cation to the current bicluster addition or deletion of a single vertex until no score improvement is possible The greedy process is restricted to search around the biclique without performing changes that would eliminate vertices in it or make vertices in it redundant having a total negative contribution ll to the bicluster score To avoid similar biclusters whose vertex sets differ only slightly a nal step greedily lters similar biclusters with more than L overlap SAMBAU V E w d N1 N2 k U conditions V genes E graph edges w edgenon edge weights N1 N2 condition set hashed set size limits k maX biclusters per genecondition Initialize a hash table weight For all 1 E V with lNvl g ddo For all S Q Nv with N1 g lSl g N2 do weigh S E weigh S wS For each 71 E V set bestv1 k to the k heaviest sets S such that v E S For each 71 E V and each of the k sets S bestv V uESNUJJ B E S U V Do a argmaxmevodww U 10 b argmawx63wB grI IfwBUa gtwBbthenBBUaelseBBb while improving Store B Post process to lter overlapping biclusters Figure 6 The SAMBA biclustering algorithm 6 Spectral Biclustering Spectral biclustering approaches use techniques from linear algebra to identify bicluster structures in the input data Here we review the biclustering technique presented in Kluger et al 13 In this model it is assumed that the eXpression matriX has a hidden checkerboard like structure that we try to identify using eigenvector computations The structure assumption is argued to hold for clinical data where tissues cluster to cancer types and genes cluster to groups each distinguishing a particular tissue type from the other types To describe the algorithm suppose at rst that the matriX E has a checkerboard like structure see Figure 7 Obviously we could discover it directly but we could also infer it using a technique from linear algebra that will be useful in case the structure is hidden due to row and column shuf ings The technique is based on a relation between the block structure of E and the block structure of pairs of eigenvectors for EET and ETE which we describe next First observe that the eigenvalues of EET and ETE are the same Now consider a vector 1 that is stepwise ie piecewise constant and whose block structure matches that of the rows of E Applying E to 1 we get a stepwise vector 3 If we now apply ET to y we get a vector with the same block structure 12 as 13 The same relation is observed when applying rst ET and then E see Figure 7 Hence vectors of the stepwise pattern of 1 form a subspace that is closed under E TE This subspace is spanned by eigenvectors of this matrix Similarly eigenvectors of E E T span the subspace formed by vectors of the form of 3 More importantly taking now 1 to be an eigenvector of E TE with an eigenvalue A we observe that y E3 is an eigenvector of EET with the same eigenvalue 0 8866 39 887788 a d 8866 d 39 887788 6 d T 7744 d 639 Z E Z E33 664455 6 e y y 7744 e 639 3 664455 c e 8855 e c39 c 3355 c Figure 7 An example of a checkerboard like matrix E and the eigenvectors of E E T and ETE The vector 1 satis es the relation ETEa ETy 13 A513 Similarly 3 satis es the equation EETy EM Ag In conclusion the checkerboard like structure of E is re ected in the stepwise structures of pairs of EET and ETE eigenvectors that correspond to the same eigenvalue One can nd these eigenvector pairs by computing a singular value decomposition of E Singular value decom position is a standard algebraic technique cf 15 that expresses a real matrix E as a product E AABT where A is a diagonal matrix and A and B are orthonormal matrices The columns of A and B are the eigenvectors of EET and ETE respectively The entries of A are square roots of the corresponding eigenvalues sorted in a non increasing order Hence the eigenvector pairs are obtained by taking for each 239 the 2th columns of A and B and the corresponding eigenvalue is the 42 For any eigenvector pair one can check whether each of the vectors can be approximated using a piecewise constant vector Kluger et al use a one dimensional k means algorithm to test this t The block structures of the eigenvectors indicate the block structures of the rows and columns of E In the general case the rows and columns of E are ordered arbitrarily and the checkerboard like structure if E has one is hidden To reveal such structure one computes the singular value decomposition of E and analyzes the eigenvectors of EET and ETE A hidden checkboard struc ture will manifest itself by the existence of a pair of eigenvectors one for each matrix with the same eigenvalue that are approximately piecewise constant One can determine if this is the case by sorting the vectors or by clustering their values as done in 13 Kluger et al further discuss the problem of normalizing the gene expression matrix to reveal checkerboard structures that are obscured e g due to differences in the mean expression levels of genes or conditions The assumed model for the data is a multiplicative model in which the expression level of a gene 239 in a condition j is its base level times a gene term which corresponds to 13 the gene s tendency of expression under different conditions times a condition term that represents the tendency of genes to be expressed under condition j The normalization is done using two normalizing matrices R a diagonal matrix with the mean of row 239 at the ith position and C a diagonal matrix with the mean of column j at the jth position The block structure of E is now re ected in the stepwise structure of pairs of eigenvectors with the same eigenvalue of the normalized matrices M R 1E01ET and M T These eigenvector pairs can be deduced by computing a singular value decomposition of R 12E012 Due to the normalization the rst eigenvector pair corresponding to an eigenvalue of l is constant and can be discarded A summary of the biclustering algorithm is given in Figure 8 The spectral algorithm was applied to human cancer data and its results were used for classi cation of tumor type and identi cation of marker genes 13 SpectralU V E U conditions V genes Enxm Gene expression matrix Compute R diagE 1m and C diag1 E Compute a singular value decomposition of R 12EC12 Discard the pair of eigenvectors corresponding to the largest eigenvalue For each pair of eigenvectors u v of R 1EC 1ET and C 1ETR1E with the same eigenvalue do Apply k means to check the t of u and v to stepwise vectors Report the block structure of the p u v with the best stepwise t Figure 8 The spectral biclustering algorithm 7 Plaid Models The Plaid model 14 is a statistically inspired modeling approach developed by Lazzeroni and Owen for the analysis of gene expression data The basic idea is to represent the genes conditions matrix as a superposition of layers corresponding to biclusters in our terminology where each layer is a subset of rows and columns on which a particular set of values takes place Different values in the expression matrix are thought of as different colors as in false colored heat maps of chips This metaphor also leads to referring to color intensity in lieu of expression level The horizontal and vertical color lines in the matrix corresponding to a layer give the method its name The model assumes that the level of matrix entries is the sum of a uniform background grey and of k biclusters each coloring a particular submatrix in a certain way More precisely the expression matrix is represented as K Aij 0 Z Qijkpik jk k1 14 where a0 is a general matrix background color and 9131 ak oak 63 where m describes the added background color in bicluster k or and 5 are row and column speci c additive constants in bicluster k pi 6 01 is a gene bicluster membership indicator variable ie Oik 1 iff gene 239 belongs to the gene set of the k th bicluster Similarly EM 6 0 l is a sample bicluster membership indicator variable Hence similar to Cheng and Church 6 a bicluster is assumed to be the sum of bicluster background level plus row speci c and column speci c constants When the biclusters form a k partition of the genes and a corresponding k partition of the samples the disjointrtess constraints that biclusters cannot overlap can be formulated as 2 HM g l for all j 2k pi g l for all i Replacing g by would require assignment of each row or column to exactly one bicluster Generalizing to allow bicluster overlap simply means removing the disjointness constraints The general biclustering problem is now formulated as nding parameter values so that the re sulting matrix would t the original data as much as possible Formally the problem is minimizing K EMU Z Qijkpik jklZ 4 t7 k20 where aO dig0 If oak or 63 are used then the constraints Zipikaik O or 27 ij jk O are added to reduce the number of parameters Note that the number of parameters is at most k l kn km for the 9 variables and kn km for the H and 0 variables This is substantially smaller than the am variables in the original data if k ltlt maan 71 Estimating Parameters Lazzeroni and Owen propose to solve problem 4 using an iterative heuristic New layers are added to the model one at a time Suppose we have xed the rst K 1 layers and we are seeking for the K th layer to minimize the sum of squared errors Let K l Zzg39K l Aij Z Qtjkptj jk 5 k0 be the residual matrix after removing the effect of the rst K 1 layers In iteration K we wish to solve the following quadratic integer program min Qm i321 1Zi39K 1 QinPz39K leV piK E 0717 K jK E 07 The proposed heuristic method to solve 6 is again iterative To avoid confusion we call the iterations for xed K cycles and indicate the cycle number by a superscript in parentheses e g W The integrality constraints are ignored throughout and the goal is to solve corresponding relaxation of it A cycle is done as follows Compute the best values of the 9 parameters given xed 0 and K values Compute the best values of the 0 parameters given new 9 and the old K values 15 Compute the best values of the H parameters given the new 9 and the old 0 values In order to avoid locking in of the membership variables to O or 1 their values are changed only modestly on the rst cycle and they are allowed to become integral only at the nal cycle The following optimal parameter values in the relaxed version of 6 are obtained by using Lagrange multipliers ZiZjPiKHjKZg1 K 2 p3Kgtlt n31 ZjltZiltK 1 MKPiKHjK jK O iK 8 piK ZjK 33K 7 ZiltZ JK1 MKpiKHjK0iK 2 HjK ZiK 10m jK 9 So in cycle 8 we use these equations to update 98 using the old values p31 and mfg 1 The values for pm and HjK that minimize Q are K l Zj einKjKZz39j 2 2 Zj 9 in Zz39 9inP KZg1 H 39K 7 2 2 2i szsz39K MK I 10 11 At cycle 8 we use these equations to update 0 from 93 and H81 and update HS from 93 and p31 The complete updating process is repeated a prescribed number of cycles 72 Initialization and Stopping Rule The search for a new layer K in the residual matrix Zij ZzgK requires initial values of 0 and H These values are obtained by nding vectors 1 and v and a real value A so that AuvT is the best rank one approximation of Z We refer the readers to the original paper for details Intuitively each iteration peels off another signal layer and one should stop after K 1 it K erations if the residual matrix Zij Zi contains almost only noise Lazzeroni and Owen de ne the importance of layer k by 03 21 21 pikcjkdfjk The algorithm accepts a layer if it has signi cantly larger importance than in noise To evaluate a on noise repeat the following pro cess T times Randomly permute each row in Z independently and then randomly permute each column in the resulting matrix independently Apply the layer nding algorithm on the resulting matrix and compute the importance of that layer If a exceeds the importance obtained for all the T randomized matrices add the new layer K to the model The complete algorithm is outlined in gure 9 Plaid models have been applied to yeast gene expression data 14 The software is available at httpwww statstanfordeduNowenplaid l6 PlaidU V E S U conditions V genes E Gene expression matrix S maximum cycles per iteration Set K 0 adding a new layer KK1 Compute initial values of leg2 pg Set 3 1 While 3 g S do Compute Mg 04 using equations 7 9 8 Compute HZK using equations 1 1 Compute pg using equations 10 If pg gt 05 set pg 2 05 328 else set pg 2 05 325 If Hg gt 05 set Hg 2 05 328 else set Hg 2 05 325 If the importance of layer K is non random then record the layer and repeat Else exit Report layers 1 K 1 Figure 9 The Plaid model algorithm 8 Discussion The algorithms presented above demonstrate some of the approaches developed for the identi ca tion of bicluster patterns in large matrices and in gene expression matrices in particular One can roughly classify the different methods a by their model and scoring schemes and b by the type of algorithm used for detecting biclusters Here we brie y review how different methods tackle these issues 81 Model and score To ensure that the biclusters are statistically signi cant each of the biclustering methods de nes a scoring scheme to assess the quality of candidate biclusters or a constraint that determines which submatrices represent signi cant bicluster behavior Constraint based method include the iterative signature algorithm the coupled two way clustering method and the spectral algorithm of Kluger et al In the rst two we search for gene property sets that de ne stable subsets of proper ties genes In the last the requirement is for compatibility of certain eigenvectors to a hidden checkboard like matrix structure Scoring based methods typically rely on a background model for the data The basic model assumes that biclusters are essentially uniform submatrices and scores them according to their deviation from such uniform behavior More elaborate models allow different distributions for each 17 condition and gene usually in a linear way Such are for example the Cheng Church algorithm and the Plaid model and the alternative formulation in 22 A more formal statistical model for an extended formulation of the biclustering problem was used in 19 3 In this family of algorithms a complete generative model including a set of biclusters and their regulation model is optimized for maximum likelihood given the data Another approach for the modeling of the data is used in SAMBA where a degree preserving random graph model and likelihood ratio score are used to ensure biclusters signi cance 82 Algorithmic approaches The algorithmic approaches for detecting biclusters given the data are greatly affected by the type of scoreconstraint model in use Several of the algorithms alternate between phases of gene sets and condition sets optimization Such are for example the iterative signature algorithm and the coupled two way clustering algorithm Other methods use standard linear algebra or optimization algorithms to solve key subproblems Such is the case for the Plaid model and the Spectral algo rithm A heuristic hill climbing algorithm is used in the Cheng Church algorithm and is combined with a graph hashing algorithm in SAMBA Finally EM or sampling methods are used for for mulations introducing a generative statistical model for biclusters 19 3 22 The overall picture seems to support a view stressing the importance of statistical models and scoring scheme and re stricting the role of the searchoptimization algorithm to discovering relatively bold structures A current important goal for the research community is to improve our understanding of the pros and cons of the various modeling approaches described here and to enable more focused algorithmic efforts on the models that prove most effective 83 Quo vadis biclustering Biclustering is a relatively young area in contrast to its parent discipline clustering that has a very long history going back all the way to Aristo It has great potential to make signi cant contributions to biology and to other elds Still some of the dif culties that haunt clustering are present and are even exacerbated in biclustering Multiple formulations and objective functions lack of theoretical and complexity analysis for many algorithms and few criteria for comparing the quality of candidate solutions Still the great potential of the paradigm of biclustering as demonstrated in studies over the last ve years guarantees that the challenge will continue to be addressed In time the concrete advantages and disadvantages of each formulation and algorithm will be made clearer We anticipate an exciting and fruitful next decade in biclustering research 9 Acknowledgment R Shamir was supported in part by the Israel Science Foundation grant 30902 R Sharan was supported in part by NSF ITR Grant CCR 0121555 A Tanay was supported in part by a scholar ship in Complexity Science from the Yeshaia Horvitz Association 18 References 1 The chipping forecast II Special supplement to Nature Genetics Vol 32 2002 2 AA Alizadeh et al Distinct types of diffuse large B cell lymphoma identi ed by gene expression pro ling Nature 4036769503 511 2000 3 A Battle E Segal and D Koller Probabilistic discovery of overlapping cellular processes and their regulation In Proceedings of the Sixth Annual International Conference on Com putational Molecular Biology RECOMB 2002 2004 4 S Bergman J Ihmels and N Barkai Similarities and differences in genome wide expression data of six organisms PLoS 21E9 2004 5 S Bergmann J Ihmels and N Barkai Iterative signature algorithm for the analysis of large scale gene expression data Phys Rev E Stat Nonlin Soft Matter Phys 673 Pt 103190201 18 2003 6 Y Cheng and GM Church Biclustering of expression data In Proc ISMB OO pages 93 103 AAAI Press 2000 7 J DeRisi L Penland PO Brown et al Use of a cDNA microarray to analyse gene expression patterns in human cancer Nat Genet 14457 460 1996 8 AP Gasch et al Genomic expression responses to DNA damaging agents and the regulatory role of the yeast ATR homolog mec1p Mol Biol Cell 12102987 3003 2001 9 G Getz E Levine and E Domany Coupled two way clustering analysis of gene microarray data Proc Natl Acad Sci USA 9722 12079 84 2000 10 G Getz E Levine E Domany and MQ Zhang Super paramagnetic clustering of yeast gene expression pro les Physica A279457 2000 11 J D Hughes PE Estep S Tavazoie and GM Church Computational identi cation of cis regulatory elements associated with groups of functionally related genes in Saccharomyces Cerevisiae J Mol Biol 296 1205 1214 2000 http atlas med harvard edu 12 J Ihmels G Friedlander S Bergmann O Sarig Y Ziv and N Barkai Revealing modular organization in the yeast transcriptional network Nature Genetics 314370 7 2002 13 Y Kluger R Barsi JT Cheng and M Gerstein Spectral biclustering of microarray data coclustering genes and conditions Genome Res 134703 16 2003 14 L Lazzeroni and A Owen Plaid models for gene expression data Statistica Sinica 1261 86 2002 15 William H Press Brian P Flannery Saul A Teukolsky and William T Vetterling Numerical Recipes The Art of Scienti c Computing Cambridge University Press Cambridge UK and New York 2nd edition 1992 16 S Ramaswamy et al Multiclass cancer diagnosis using tumor gene expression signature Proc Natl Acad Sci USA 9826 15149 15154 2001 17 T Rozovskaia O Ravid Amir S Tillib G Getz E Feinstein H Agrawal A Nagler EF Rappaport 1 Issaeva Y Matsuo UR Kees T Lapidot F Lo Coco R Foa A Mazo T Nakamura CM Croce G Cimino E Domany and E Canaani Expression pro les of acute lymphoblastic and myeloblastic leukemias with all 1 rearrangements Proc Natl Acad Sci USA 100137853 8 2003 18 M Schena D Sharon RW Davis et al Quantitative monitoring of gene expression patterns with a complementary DNA microarray Science 270467 470 1995 19 E Segal M Shapira A Regev D Pe er D Botstein D Koller and N Friedman Module networks identifying regulatory modules and their condition speci c regulators from gene expression data Nat Genet 342 166 76 2003 20 R Sharan A Maron Katz N Arbili and R Shamir EXPANDER EXPres sion ANalyzer and DisplayER 2002 Software package Tel Aviv University http www cs tau ac ilrshamirexpanderexpander html 21 R Sharan A Maron Katz and R Shamir CLICK and EXPANDER a system for clustering and visualizing gene expression data Bioinformalics 2003 22 Q Sheng Y Moreau and B De Moor Biclustering gene expression data by Gibbs sampling Bioinformalics 19Supp 2i196 i205 2003 23 P T Spellman G Sherlock et al Comprehensive identi cation of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization Mol Biol Cell 93273 3297 1998 24 A Tanay R Sharan M Kupiec and R Shamir Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data Proc Natl Acad Sci U S A 10192981 6 2004 25 A Tanay R Sharan and R Shamir Biclustering gene expresion data Submitted for publi cation 2002 20 The Emergence of Systems Biology Concepts and Issues COMP572 Spring 2008 Outline 0 What is systems biology 0 What is a system 0 Types of networks and databases for systems biology What Is Systems Biology uuuuuuuuuuuuuuuuuuuu 08 Systems Biology Systems biology is an approach by which a system of interacting entities is analyzed as a whole rather than by analyzing its individual constituent entities separately Systems biology is not new in I934 the Austrian biologist Ludwig von Bertalanffy applied general systems theory to biology as well as other elds Sunday February 10 2008 Systems Biology 0 To fully understand the functioning of cellular processes whole cells organs and even organisms it is not enough to simply assign functions to individual genes proteins and other cellular components 0 We need to analyze the organization and control of the system in an integrated way by looking at the dynamic networks of genes and proteins and their interactions with each other 0 These interacting pathways are complex dynamic systems and often behave in a nonlinear and adaptive way Sunday February 10 2008 5 Systems Biology 0 Nonlinearity means for example that doubling a stimulus the input does not necessarily double the response and may even cause a qualitatively different response 0 Adaptive systems can modify themselves to respond in a more appropriate way in the light of previous stimuli Sunday February 10 2008 Systems Biology 0 The general goal of theoretical systems biology is to develop computer models that predict the properties of the large adaptive interconnected networks that are found in living things Sunday February 10 2008 Converging Towards Systems Biology H nu tnml hm un i t wri u F Ir39rlrnlwmnl 1 Val 1111igt i l h J39liJ 1lii i39v39 quotl l39l lr n Mchngloqv quot lm l l l JiilJlfi mnnrm wiM 39 UNA 39 mquirlu ma l 39 39 lr l y quotJCI HJ quotshill llSl Hiltn Liv V l I I l L I 139 I I V r 1 V y v m in H qmti rcuqnput at L l IJHK vLl l LIL lll lquot Lupingr 39me 1300 39VI mica miclogy Feedback regulation will l l39r39ll lwk In nmiaboiism rm lkllyznl 111 l r l 39Ml WM l L A A A A A L I l l l I l i 7 Maw11ml mum in l 39 39 39 Uw39mod39vmnms orqammuon 739 39 llquot liq 395 WWW quot39539 quot1 V I39 1015leittl 1Lill l39 ml 1 aimrtmra quot H l39q wanting kl quotllZZ r39wdrla quot w v J xrmli 57 aluminium 4quot quotr quot Hquot is m 39 1133 lac up won Two lines of inquiry led from the approximate onset of molecular biological thinking to presentday systems biology The top timeline represents the root of systems biology in mainstream molecular biology with its emphasis on individual macromolecules Scaled up versions of this effort then induced systems biology as a way to look at all those molecules simultaneously and consider their interactions The lower timeline represents the lesser known effort that constantly focused on the formal analysis of new functional states that arise when multiple molecules interact simultaneously Source HVWesterhoff and BC Palsson The evolution of molecular biology into systems biology Nature Biotechnology 22l0 l249l252 2004 Sunday February 10 2008 8 What Is a System uuuuuuuuuuuuuuuuuuuu 08 A System Is More Than the Sum of Its Parts SYSTEMS LEVEL interactions of components in the biological system are studied cells tissues etc BIOLOGICAL genome RESEARCH APPROACH bioinformatics sequencing DNA arrays proteomics systems biology modelingsimulation MOLECULAR LEVEL based analysis geneprotein structure and function is studied at the molecular level The description of a fully functional system must take into account the spatial organization of elements their interactions and their response to external stimuli including those processes that control and stabilize the system Sunday February 10 2008 A Biological System Is a Living Network The components eg proteins of the system are connected by interactions along which signals pass and have an effect on the function of the whole network These networks too have inbuilt controls that activate or terminate particular signals The biological system can also be perturbed by outside interference ranging from the food we eat to therapeutic drugs Once we have a quantitative mathematical model that can be manipulated and understand the normal functioning of the system we can see what the effects of various perturbations are in silico Sunday February 10 2008 11 Executable cell biology nun amp I39lmnnsA Heluingu In Camp glyimn hiulagical heluvms In Ms leview we dis nzmshhelween1mm clhialngicalmndelHnnlhemnlicaiandcnmvula amk mm Mn 41 thequot mnmsmla nm n bialagical mmmm w call me apumm almnmucti zcnmvumtlonal mndels n h w m mudelm mndels in bmluaiml march and highlvalu 39 39 39 ablehmln x A damn bmlogvmxmm mmblms mowmmmpmzs Mm mum mums haduimy msmmmmmnilrnllnzanunslw mnnmlhtmnur Explore inquot mm mum bmlugymadnls and mm some mungw bung m M mm Amman mndels an 1 5mm 1nd pagan mm m m lnmmhtmnutdmodtlvnhlmmkxiumnun hnzh Sign gffugm jggm an my m m ompm gummy Camper mmcmnuml m cl m Emma Lemma Unmwcmmma w mm mnsxmmd mmush m wvosmon simmer WNWYiddms gamma us Emv wan enu 5mm be adde m r mm quotmark a mkldkpcndcm quantum 1m mnslmmu w quotmy nmmmr mmn H Hawk aw mmkrlmmm quotmmmmmpk whmammml Source Fm andTA Henzmgen Execuuhe en hm ugy Nature Bmmdmu ugy ZSU 93942412004 Suggest new experiments The components and interactions in W W the glycolysis pathway as obtained if from the KEGG database mm Glycolysis is the sequence of reactions WM that converts glucose into pyruvate with the production of a relativer small amount ofATP which is a source of energy for many cellular functions GLYCOLYS IS 27159 0 Arum6 Amw m exmpeimaro 27169 Sahcm O z 7 i 59 o eximellulq Sam Objecxs Arrows 5 gene DYDGUCI musz Pmusvmsw 5311 gt m m Cinrri a quotJ glycemnerp gr EM j WW Dumcmmg RNA oiecua 1 era 0 o Bah n Bahama i gt imk lo anozney map minimum IE O umevmoiecme masz Gmmnpm Glyczmzrl n i Chewca WG DOUV G gt pointer used in iagend rmtahohsm i gt missinginlerac1icneg by muiauun amine map E i V x gt GlycemirSF IE Gene axpressmn ralanans iigt pnospnwam Hmgtll Wm i P uephasphsryialvnn Bp39ess39m m Ubiquihn iquot gtl Expression i WCI gicasyiaiion h 4i mam em El 4 methylaiinn 341 5mm Enzymarnnzyma mam 391 i lt m we Successive pyruvate inhibiliun 63mm 5mg pmywm Mmmmm 3 id mam eWecx g gt mm mum ii l umumg x associahon 7 dissouaxiun w ES e 77lti mum rmahaksm H tment2P Tm th mmamm ell f7 5 Lygm hmsynlhzsls 1 1 ACEIYI CUA Q dmy mhpnamld o Inn mum QIZ NEII Sunday February10 2008 14 A System Model Is More Than a Network 0 A network structure topology is first de ned typically using information from databases andor literature 0 To complete the network definition the interactions between components must be defined 0 Determine which components interact with each other as well as the kinetics of these interactions Sunday February 10 2008 Three Approaches to Model Construction 0 Bottomup Construct a network and predicts its behavior starting with a collection of experimental data 0 Topdown Starts from observed behavior and then fills in the components and interactions required to generate these observations by iterative experimental results and simulations 0 Middleout Starts at any point for which data are available as long as it is supported by a hypothesis and then expand either up or down in terms of both resolution and coverage Sunday February 10 2008 16 Types of Networks and Network Databases uuuuuuuuuuuuuuuuuuuu 08 Networks 0 A network is a set of vertices or nodes and edges where each edge connects two vertices Sunday February 10 2008 18 Gene Expression Networks 0 Vertices are genes and there is an edge between two vertices if the genes represented by the vertices have similar expression patterns 39 39 0 Edges are undirected ProteinProtein Interaction PPI Networks Vertices are proteins and there is an edge between two vertices if the proteins represented by the vertices have interact Edges are undirected Transcription Regulation Networks 0 Two types of vertices proteins transcription factors orTF s and genes 0 Edges are directed from TF s to genes there is an edge from TF X to geneY if X transcribesY Signal Transduction Networks 0 Vertices are proteins and edges are directed 0 An edge connects vertex X to vertexY if X changes the activity level on under certain conditions NR LPA TGFcx EGF epiregulin cellulin HBEGF G4 NRG39Z NRG3 NRG4 quot 32quot m 0 xx m rm 392 0 quot mg m m V quotquot 5 quot9 d5 my a x xx xx w w w 2 w w x a I HEAV39I 11le SHEAV39I NBGGIH SRC CBL PLC PI3K 5HP2 p21GDP SHC NCK VAV GRB7 CRK d aaaa rs GA gt GRBZ RAC d p21GTP SOS enzymes I RAF AKT MEK JNKiAK ABL signallng PKC BAD 56K MAPK cascade JNK l JUN SP1 MYC F05 ELK ERG1 STAT tttttt ription APOPTOSIS Y HER1 1 HER2 Y HER3 Y HER4 I I gyofCancer Garland ScienceZOU7 MIGRATION GROWTH ADHESION DIFFERENTIATION HEAV I inclan Sunday February 10 2008 22 Metabolic Networks 0 Vertices are metabolites and edges are biochemical reactions O Edges are directed 39 Metabolism is divided into Catabolism breaking down large molecules for example to harvest energy in cellular respiration Anabolism using energy to construct components of cells such as proteins and nucleic acids Sunday February 10 2008 28 Some Molecular Interaction Databases Name Web Link Data Types STRING httpstringemblde PPI MINT httgzllmintbiouniroma2itmint PPI BIND httpwwwbindcaAction PPI DIP httgzlldigdoembiuclaedu PPI MIPS httpmipsgsfdeproippi mammalian PPI KEGG httpwwwgenomeipkegg signaling metabolism BioGRID httpwwwthebiogridorg physical amp genetic BioCyc httpwwwbiocxcorg metabolic amp others Sunday February 10 2008 24 Acknowledgments 0 Materials in this lecture are mostly based on the book Understanding Bioinformatics by M Zvelebil and J Baum lst Edition Garland Science 2007 Sunday February 10 2008 GraphTheoretic Properties of Biological Networks COMP572 Spring 2008 Ehnday Januai y 2mg Outline 0 Architectural features 0 Motifs modules and hierarchical networks 0 Scalefree or geometric Sunday January 6 2008 Architectural Features aaaaaaaaaaaaaaaaaa 08 Cellular Networks Are Scalefree 0 Analysis of cellular networks of various types have indicated scalefree topologies O The rst evidence came from analyzing metabolism vertices are metabolites edges are enzymecatalyzed biochemical reactions and the edges are directed Sunday January 6 2008 UMP j Hf ME V Dmnphmmate 10339 d In ma f E Q 01 104 E E 101 E E E 1 39 2 g 1 E We 3 111B 131 1E 103 1 1 1 1 2 103 m k k ExpBrimmtal ux 1 91 at GLO uptake rate Figuna 1 Chm zhg mewDin mtwarlmTD udytha murkd39iarazzteris muftra mataboiism agaph mra c daacrip nn reads in be astabiishad Hare the graph theorem dammi on fora Binnie Wigwam by infdependant mm is ii ustrahad all Intha moat abslrmtappmech b al hmmhg rmtabdiimam onr sideredequaiht39lii e lin ie between mdea reactior sthat 39rda39mmertmesubstratein a anther Forrnanybbbgicai apd mtiona t is Lisa wa igimm dnrsm1as the high mmdnsphata rderTF whim rasuita in 1 award type D f mapping in 1hat samba mi the train sauna rr ah itaa to 1113 main prancimm dl The Gimme diatribu m Pk tithe manianit rebwkilmtatas itaawia aemdbgf e The mairg nftha chatarirg maff39ciant 231111111 ha degraakmmmsti hmmmmwmw3 39I39ha datasrmnind and rapnaaent Ian average M43 gmmawj f diatth inthamntai matab km dm i m fdm a pumrlawmhbh ir39dbatas 1mm reactions hmesmall mtabdbiiwgmaaaafamem cxmwith gh m my most ofthe rnatahoicaativitygifl39 a plat is basad an daba w mlbcbad by Emmering dd it Bhndd be noted iatonail hee pictatha axis is imaritl39i39nic and admightline in such hug i119 plate ir icaim a marlaw easing CTR wtidina hipimphats GL6 ahaimmune glmuaa UDP Lridha pbnsprata UMP urid39m mapimpi aia UTF uidina triprmphata Sunday January 6 2008 Scalefree Metabolic Networks 0 The analysis of metabolic networks of 43 different organisms from all three domains of life eukaryotes prokaryotes and archaea indicates that the cellular metabolism has a scalefree topology in which most metabolic substrates participate in only one or two reactions but a few such as pyruvate or coenzyme A participate in dozens and function as metabolic hubs Sunday January 6 2008 Scalefree PPI Networks 0 Several recent publications indicate that PPI networks have a scalefree topology PPI network of the yeast Saccharomyces cerevisiae Ehnday Jamal y 2mg Other Types of Networks 0 Genetic regulatory networks vertices are genes and edges are expression correlations also exhibit scalefree topologies O Transcription regulatory networks vertices are genes and transcription factors and edges are interactions exhibit mixed scalefree and exponential distributions The distribution of the number of genes that a transcription factor interacts with follows a powerlaw scalefree MostTFs regulate only a few genes but a few TF s regulate many genes The distribution of the number of transcription factors that interact with a given gene follows an exponential distribution Most genes are regulated by l3 TFs Sunday January 6 2008 0 While establishing scalefree properties is hard when information is available on only a few nodes a salient feature of cellular networks is the presence of hubs from regulatory webs to the p53 module r W Prevenlion m new blood vessel wam mm Apnpmsis immaimn Source Sur ng the p53 network Vogelstein et al Nature 408 3073 l 0 2000 EIAnday Jamal y zuua Smallworld Effect 0 Although the smallworld effect is a property of random networks scalefree networks are ultra small 0 In metabolism paths of only 34 reactions can link most pairs of metabolites implication local perturbations in metabolite concentrations could reach the whole network very quickly 0 Interestingly the metabolic network of a parasitic bacterium has the same mean path length as the much larger and more developed network of a large multicellular organism implication certain evolutionary mechanisms have maintained the average path length during evolution Sunday January 6 2008 Assortativity 0 Cellular networks seem to be disassortative hubs avoid linking directly to each other and instead connect to vertices with only a few connections 0 The origin of disassortativity in cellular networks remains unexplained Sunday Jawualy a 2008 Evolutionary Origin of Scalefree Networks 0 Recall the two processes underlying the development of real networks a growth new nodes joining the network and preferential attachment nodes prefer to connect to nodes that have many edges 0 In protein networks growth and preferential attachment have a possible evolutionary explanation that is rooted in gene duplication Proteins Before duplication After duplication quotExquot 39 if 3 3quot Proteins x ERR 39 x 3 Sunday January 6 2008 Evolutionary Origin of Scalefree Networks 0 Duplicated genes produce identical proteins that interact with the same protein partners 0 Highly connected nodes get more new links not that they have a higher probability of duplicating but a higher probability to have a link to a duplicated gene 0 The role of gene duplication has been shown only for PPl networks but not for regulatory or metabolic networks Sunday January 6 2008 Evolutionary Origin of Scalefree Networks 0 An inspection of the metabolic hubs indicates that the remnants of the RNA world such as coenzyme A NAD and GTP are among the most connected substrates of the metabolic network as are elements of some of the most ancients metabolic pathways such as glycolysis and TCA cycle 0 Recall the correlation between the age and degree of a node in the scalefree model Sunday January 6 2008 Motifs Modules and Hierarchical Networks A Brif Overview Two separate lectures will be dedicated to the issues of modularity and motifs aaaaaaaaaaaaaaaaaa 08 0 Cellular functions are likely to be carried out in a highly modular manner a group of physically or functionally linked molecules nodes work together to achieve a distinct function Biology is full of examples of modularity Questions of interest Is a given network modular What are the modules in a network What are their relationships in a given network Sunday January 6 2008 High Clustering in Cellular Networks In a network representation a module appears as a highly interconnected group of nodes Each module can be reduced to a set of triangles and the clustering coef cient can be computed to quantify modularity In the absence of modularity the clustering coef cient of the real and random networks are comparable Metabolic PPl and protein domain networks have all exhibited high clustering Sunday Januar y6 2008 Motifs in Cellular Networks 0 Not all subgraphs occur with equal frequency 0 Motifs are subgraphs that are overrepresented compared to a randomized version of the same network 0 To identify motifs 0 Identify all subgraphs of n nodes in the network 0 Randomize the network while keeping the number of nodes edges and degree distribution unchanged 0 Identify all subgraphs of n nodes in the randomized version 0 Subgraphs that occur signi cantly more frequently in the real network as compared to the randomized one are designated to be the motifs Sunday January 6 2008 Recall Special directed subgraphs lgtltl Feedforward loop two nonintersecting directed paths from a start to an endpoint Bifan Biparallel two nonintersecting paths of identical length from a start to an endpoint Q Q Feedback loop a directed cycle Sunday Januar y6 2008 Fig ltL Network motifs and themes in the integrated S ccmvisine network Edges denote transcriptional regulation E protein interaction P sequence homology H correlated expression X or synthetic llethal interactions t S a Motifs corresponding to the feed torward theme are based on transcriptional feed fomv39ard loops th motifs in the co pointing theme consist of interacting transcription tactors that regulate the same target gene to motifs corresponding to the regulonie eomplea theme include co regniation of members of a protein complex cl motifs in the protein complex theme represent interacting and coeapressed protein cliques For a given motif NM is the mnnher of corresponding subgraphs in the real network and de is the number of corresponding subgraphs in a randomized network Figure reproduced with permission from BioMed lCentral t Zhang et 21 21005 a Mom sate A Mr H E R I t 5 A H 1 H N errm sntm39 5quot on 6 NV 137530 rc l 543 9 a mom example A theme example Mont set E Ca A moat exam pro 7 mm set I Hirt H H th i infill HhH A mclll exempt r d shin sewn sauna rezxro39 Ale an MP 12CIIEIE1 1 1D hi 1 I A mam example A there 1 example Sunday January 6 2008 Motif Clusters The motifs in a network are not independent 0 The gure shows the 209 bifan motifs in the E coli transcription regulatory network 208 of the 209 motifs form two extended motif clusters and only one motif remains isolated Motif clusters seem to be a a general property of all real 39 o 39 networks Sunday January 5 2008 Hierarchical Organization of Topological Modules 0 At face value the scalefree property and modularity seem to be contradictory the former implies the existence of nodes that are connected to a high fraction of nodes which makes the existence of relatively isolated modules unlikely and the latter implies the existence of groups of nodes that are relatively isolated from the rest of the system 0 However clustering and hubs naturally coexist which indicates that topological modules are not independent but combine to form a hierarchical network Sunday January 6 2008 Hierarchical Networks l a cluster of four densely connected nodes is rst created blue 2 three replicas of this module are generated and the three external nodes of the replicated clusters are connected to the central node ofthe old cluster green 3 three replicas ofthe l6node module are generated and the l6 peripheral nodes are connected to the central node of the old module red The model combines scalefree and modularity properties EIAnday Jamal ye zuua Identifying Modules 0 Various clustering techniques have been developed or adapted to identifying modules in networks 0 Different methods return different decompositions of the networks 0 At present there is no objective mathematical criteria for deciding that one decomposition is better than another Sunday January 6 2008 Scalefree or Geometric aaaaaaaaaaaaaaaaaa 08 0 Priqu et al studied the t of four different network models to PPI networks of Saccharomyces cerevisiae yeast and Drosophila melanogaster fruitfly 0 Findings 0 The scalefree model fails to t the data and a random geometric model provides a much more accurate model Sunday January 6 2008 Geometric Random Graphs A geometric graph GVr with radius r is a graph with node set V of points in a metric space and edge set EuvuvEV 0SduvSr where d is an arbitrary distance norm in this space In other words imagine a set of points in a metric space with an edge between two points if the distance between them is at most r Usually twodimensional space is considered containing points in the unit square 0l2 or unit disc and Oltrltl Typical distance norms between two points XIy and X2y2 Ll norm IXIX2yIy2 L2 norm XIX22yIy22quot2 Loo norm maX IXIXzHyl y2 A random geometric graph Gnr is a geometric graph with n nodes which correspond to n independently and uniformly distributed points in a metric space Sunday January 6 2008 Graphlet Analysis of PPI Networks Graphlets 0 Priqu et aconsidered graphlets connected network with a small number of nodes and used an approach similar to that of identifying motifs to assess the t Laud g pmets Jenede ggyhlets Aieg e 3 same grapi els 3xaeygi 11 7 13 14 1s 16 e ewwe o 1 none none and Senode conneued netwmks gmphr me1easuo me mos dense mm pact e f ednes when compaled to me maxmmm FDSS JlE number of edges 11 me glnphleK they are numbered ham 1 lo 29 Sunday Januai y 2mg Graphlet Analysis of PPI Networks The Network Models Considered 0 Compared the frequency of the appearance of these graphlets in PPI networks with the frequency of their appearance in four different types of random networks 0 ER ErdosR nyi randm networks with same number of nodes and edges as the corresponding PP networks 0 ERDD ErdosR nyi randm networks with same number of nodes edges and degree distribution as the corresponding PP networks 0 SF Scalefree random networks with the same number of nodes and the number of edges within l of those of the corresponding PP networks 0 GEO several types of geometric random graphs with the same number of nodes and the number of edges within l of those of the corresponding PP networks three versions GEO2D GEO3D and GEO4D with Euclidean distance Sunday January 6 2008 Graphlet Analysus of PPI Networks Data and Fitness Measure 0 They analyzed four PPI networks of the yeast C cerevisiae and fruitfly D melanogaster 0 They quanti ed the t using the measure DGHZil39r0i Hi 1 where 755 7103NGTG NAG is the number of graphlets of type i 70 zi iiim G and H are the PPI network and its randomized counterpart Sunday Januai ye 2mg Graphkt Frequencies i1 Yeast PM and ER Ntlwcrks Fqu my High Con dence PPI Graph Rnndcm l Rmdml 2 Random 3 Filndcm 7 V I 39v Graphic Fmauencies in le PP uni GED2D Namath I I I I I I I I I I I I I I I I 9 II ll 3 I3 1 I5 IISI39 15193021222334 Graphic Graphlet Frequenci in Tau PH and EILDD NellInt High Con dence PPIGrzph Random l Rmdnrn 3 R ndorn3 I39Jnd ull 5 I l I I 35352723 3921393 I39m 1395 13139 1393 JIQEIZIE39 39 ESEIE39EEETEBZD Graphic n Guth Frequenti in Yeast F39PI GEDED GEOSD and GEDID Namix ks chncncg H39 h Ccu denoe PFI Gr I Rutqu Anmm 3 Rnndum3 Rnndm 4 Rindcm 7 litquart I I I I39nI39L 151311113 13951397131192025i3 i4239s2527za Guphlr Gruphlet Frequencies iI Yam PM and SF Nam HipCm umePPIGIEh Rnndnm l Rnndnm 3D 1 ILInIInm 4D I I I I I I I I I I I I I I I I I I I I I I 39n i 9 lEIIl1113141516139I 15192 2112132415262723 3 Gt qnhkl GHFIIIEI Frequth in Tau PM and 3 limes dtnstr GED3D Hawk lI39mque my Hi hC on dcme PP39JGI g Rani33h Ili IIC un dence FFIGr E GEDJDd IF Fraque Icy Sunday January 6 2008 l T i B DH1213141516LSLQ EIJELEEEJE ln Gumbel marl I 39n 5 9 l lll z BlHS 1517131931 Grnphlgl Acknowledgments 0 Materials in this lecture are mostly based on 0 Network Biology Understanding the Cell s Functional Organization by Barabasi and Oltvai o Scalefree Networks in Cell Biology by R Albert 0 Modeling interactome Scalefree or Geometric by Priulj Corneil and Jurisica Sunday January 6 2008 shrg mp mp when mumpom 2 S letter Network mot39fs39 ransc network of Escherichia coli nal legula 39on m mro xmmm Wonangm sltUnAlnni2 mhmzdam 22 mzmzvox m mum x1 x2 x3 x W 11 Z2 Z ZA ZM m s mam 2mmquot 1ng vnnz vlstwmvnxnp mm mm Mm max mmm gene uvrssnn m ms m m mm mm mm mummy unvmmnenmnzmun sm mum m regu an new E h 3 2 mmquot w mm mums Mm mwen m nemammmwsxvaur New mmmmmmyammmmm wank was mm mm m m mu mama mqu 51 mg n rws enmnw mm mmmmonemme mmmm ummmng was 3 an mm m m m 1 iv M m M gnmum mmquot E n m msmmmmmm mggeneu n2 Mmgmwm mm mhmmmwm mmndn mum mmde my 3 mm m momma as x mud WWW x Wmmwmmmdtmmnmmnmmmt ndnzummpmdmmmms marks mud nwwmmmmm Whammy WWWWWMWMN kmme mm x w m WWW Y m WW quot9 m m mm mm wmnyvuwmwmwvnmv mv mnwkuw l ngmvmrqmqwmmlm mlt WWWW umww emmmmmnmmn shrg mp mp when mumpom 2 S mmam m nmmamm mhzmmn m m smzmhvm mmmnmmwmmmm WWW am mud 1171201111 me gunm ztm mm m m mlwmmmwdauhwenmmta 5 mm w 1 gm mm mm M m 3 ms m mam m we a z m H mum mum hm mm m z a W MMn m letter m mau mzhv w Wan auxcw 2002 Nature Publishing Group httpgeneticsnaturecom letter 1 1 1 1 1 1 H input X Fig 2 Dynamic features of the coherent feedforward loop and SIM motifs a Consider a coherent feedforward loop circuit with an 39AND gate39 like control of the output operon Z This circuit can reject rapid variations in the activity of the input X and respond only to persistent activation profiles This is because Y needs to integrate the input X over time to pass the activation threshold for Z thin line A similar 1 rejection of rapid fluctuations can be achieved by a cascade X gtY gtZ however the cascade has a slower shutdown than the feedforward loop thin red line in the Z dynamics panel b Dynamics of the SIM motif This motif can show a temporal program of expression accord ing to a hierarchy of activation thresholds of the genes When the activity of X the master activator rises and falls with time the genes with the lowest threshold are activated earliest and deactivated lat output 2 39 est TIme is in units of protein lifetimes or of cell cycles in the case of longlived proteins 0246810121416 time l Xl l Z1 22 Z3 time from X and a delayed one through Y If the activation of X is tran sient Y cannot reach the level needed to signi cantly activate Z and the input signal is not transduced through the circuit Only when X signals for a long enough time so that Y levels can build up will Z be activated Fig 261 Once X is deactivated Z shuts down rapidly This kind of behavior can be useful for making decisions based on uctuating external signals The SIM motif is found in systems of genes that function sto chiometrically to form a protein assembly such as agella or a metabolic pathway such as amino acid biosynthesis In these cases it is useful that the activities of the operons are determined by a single transcription factor so that their proportions at steady state can be xed In addition mathematical analysis sug gests that SIMs can show a detailed temporal program of expres sion resulting from differences in the activation thresholds of the different genes Fig 2b Built into this design is a pattern in which the rst gene activated is the last one to be deactivated Such temporal ordering can be useful in processes that require several stages to complete This type of mechanism may explain the experimentally observed temporal program in the expression of agella biosynthesis genes The motifs allow a representation of the E coli transcriptional network Fig 3 in a compact modular form for an image of the full network see Web Fig A online By using symbols to represent the different motifs Fig 1 the network is broken down to its basic building blocks A single layer of DORs connects most of the transcription factors to their effector operons Feedforward loops and SIMs often occur at the outputs of these DORs The DORs are interconnected by the global transcription factors which typically control many genes in one DOR and few genes in several DORs An important step in visualizing the network was to allow each global transcription factor to appear multiple times whenever it is an input to a structure This reduces the complexity of the inter connections while preserving all the information There are few 66 18 20 long cascades3 usually involving G factors such as cas cades of depth 5 in the agella and nitrogen systems Over 70 of the operons are connected to the DORs the rest of the operons are in small disjoint systems Most disjoint systems have only 1 to 3 operons The remaining disjoint systems have up to 25 operons and show many SIMs and feedforward loops A notable feature of the overall organi zation is the large degree of overlap within DORs between the short cascades that control most operons The layer of DORs may therefore represent the core of the computa tion carried out by the transcriptional network Cycles such as feedback loops are an important feature of regulatory networks Transcriptional feedback loops occur in various organisms such as the genetic switch in k phageS In the E coli data set there are no examples of feedback loops of direct transcriptional interactions except for auto regulatory loops3 However the absence of feedback loops is not statistically signi cant as over 80 of the randomized networks also have no feedback loops Table 1 The many regulatory feedbacks loops in the organism are carried out at the post transcriptional level We considered only transcription interactions speci cally manifested by transcription factors that bind regulatory sites3gt14 This transcriptional network can be thought of as the slow part of the cellular regulation network time scale of minutes An additional layer of faster interactions which include interactions between proteins often subsecond timescale contributes to the full regulatory behavior and will probably introduce additional network motifs Characterization of additional transcriptional interactions may change the present motif assignment for spe ci c systems However our conclusions regarding the high fre quencies of feedforward loops SIMs and overlapping regulation compared with randomized networks are insensitive to the addi tion or removal of interactions from the data set These features are still highly signi cant even when 25 of the connections in the E coli network are removed or rearranged at random The concept of homology between genes based on sequence motifs has been crucial for understanding the function of uncharacterized genes Likewise the notion of similarity between connectivity patterns in networks based on network motifs may be helpful in gaining insight into the dynamic behavior of newly identi ed gene circuits The present analysis may serve as a guideline for experimental study of the functions of the motifs It would be useful to determine whether the net work motifs found in E coli can characterize the transcriptional networks of other cell types In higher eukaryotes for example there will be many more regulators affecting each gene and addi tional types of circuits may be found The ndings presented here also raise the possibility that motifs can be de ned in other biological networks7 such as signal transduction metabolic19 and neuron connectivity networks threshold threshold nature genetics volume 31 may 2002 00 Kew L5 ammon sonaua aimeu l quotw 2002 Nature Publishing Group httpgeneticsnaturecom drug and superoxide DOR osmotic stress DOR carbon utilization DNA metabolism DOR DOR stationary phase DOR E omaltmumm Con rm gm lt1 ma a l gt eag gaeegl heat shock a maltose legend 0 transcription factor TF global TF 223 dense overlapping regulons DOR postive regulation 1 single input module SIM negative regulation A coherent feedfonlvard loop dual regulation A incoherent feedforvvard loop 0 multiinput module 0 single operon Fig 3 Part of the network of direct transcriptional interactions in the E coli data set represented using network motifs Nodes represent operons and lines represent transcriptional regulation directed so that the regulating tran scription factor is above the regulated operons Network motifs are represented by their corresponding symbols Fig 1 The DORs are named according to the common function of their output operons Each transcription factor appears in only a single subgraph except for transcription factors regulating more than ten operons 39global transcription factors39 which can appear in several subgraphs For an image of the entire network see Web Fig A online 121121 letter Mmhad WMKWWWWMMWMWMp WMWMWMR mmmmmn mm mmxmauu wmdqaw mm mm W W W smwmmm WA WW wa mm mmme Mmmm mmwmmw mmm Mammal m mwnw m m 9 39m mm m mime an MAW kw a f lquotquot w39 quotqk 3 1 lawmmmmW 39m w WM WWquot WW w M w Mmg mama Mummy mud mummy mmammmww u mm W 1 my 2 w w MM b w w MM h WM 3 my and b a MWWW MW w m we quotmm quotWi mwmememn m Mm wmmwwmmnwwm mum in m m w W 5 mm M new Mm quotmm hm hm magma wwwmm m M mm mm wmjm wauduWnnwamm Q Waamiwruuuw 1mm mud5231 mm ucm pm ammmiwumdm Jm lwfum a mumm39wnhm w quot53qu mmxmmkmanmmWMmm wwmmmtmmmm shrg mp mp when mumpom mmmmmg m I mmwummm m mummuymmmpaumm mum wm wmmmhw mmwam mmvswmm kmwm 8 me a wwwmmwmw 1 x mm NW mmLsww MN mmm mymm z mmmmmmmmmmm umam x m n E mymws Nahum mkaw my 2nn2mm7 3 2 x g E 2 EE manam w Wham 7me van 4 in Wm A mummxuayoqummv M H mm we mm Anusw MW W M M quotW343L 1 ii mm m x w W mm w Wamw 5 quotmg mm v w w m WWKWMWW wmwm m mmmgmgmz f w quot immm quot 75 W m quot 1 wssrmrammmmm 2mm M mmdmwummww mm mmmmm w MmMMWMWWWWMW emgrpmaw Am quotWWWquot mmmwmmlmxzmm w m WWEMMMW Mu marM mm WWW m m g m m MWWWWWW mm mm mumummmmmmmw rmwwm m wwm u x mm Krnzma 1 mu wa MRWMHMMWMMM mmmzf mwmusrwm WAN wmmmmmmkmm mam gram mmW Ww m wwm rmrsr mmi xswmwmmvm a WMMWMEM ga Emmmmmmmmmmm M K mm mewaumm WWuwvivfmd awf uw mm m mwwmammum a mm 14 w mum msd m 1 1 aming 333331 mmim WWW Wamdede WWWamygumm w M MM v d A m Equot Wm m Wy w jm mgmvgm fm m k w mm x w MW M mmmwmnmmummmm 3231 mm MM W W W