Computational Biophysics and Systems Biology
Computational Biophysics and Systems Biology CSE 60531
This 0 page Class Notes was uploaded by Mrs. Damaris Hyatt on Sunday November 1, 2015. The Class Notes belongs to CSE 60531 at University of Notre Dame taught by Staff in Fall. Since its upload, it has received 13 views. For similar materials see /class/232757/cse-60531-university-of-notre-dame in Computer Science and Engineering at University of Notre Dame.
Date Created: 11/01/15
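These notes describe the adaptive biasing force (ABF) idea: accumulate the instantaneous force acting on the reaction coordinate in bins, and apply the negative of the running bin average as a bias, so that sampling along the coordinate becomes uniform. A minimal sketch of that loop follows. Everything specific in it is an assumption made for illustration and is not taken from the notes: a one-dimensional double-well potential U(x) = 5 (x^2 - 1)^2 with the reaction coordinate equal to x (so the instantaneous force is simply -dU/dx), overdamped Langevin dynamics standing in for molecular dynamics, reflecting walls, and the hypothetical function name `run_abf_1d`.

```python
import numpy as np


def run_abf_1d(n_steps=200_000, n_bins=30, dt=1e-3, kT=1.0, seed=0):
    """Toy adaptive-biasing-force run on U(x) = 5 (x^2 - 1)^2 (illustrative only).

    Returns the estimated free energy at the bin edges (shifted so its
    minimum is zero) and the number of samples collected in each bin.
    """
    lo, hi = -1.5, 1.5
    width = (hi - lo) / n_bins
    force_sum = np.zeros(n_bins)          # running sum of the force per bin
    counts = np.zeros(n_bins, dtype=int)  # number of samples per bin
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(2.0 * kT * dt), size=n_steps)

    x = -1.0  # start at a free energy minimum
    for step in range(n_steps):
        k = min(int((x - lo) / width), n_bins - 1)
        f = -20.0 * x * (x * x - 1.0)     # instantaneous force, -dU/dx
        force_sum[k] += f
        counts[k] += 1
        # Adaptive bias: minus the average force seen so far in this bin.
        bias = -force_sum[k] / counts[k]
        # Overdamped Langevin step (unit mobility), an assumption of this sketch.
        x += (f + bias) * dt + noise[step]
        # Reflecting walls keep the coordinate inside [lo, hi].
        if x < lo:
            x = 2.0 * lo - x
        elif x > hi:
            x = 2.0 * hi - x

    mean_force = force_sum / np.maximum(counts, 1)
    dA = -mean_force                      # dA/dxi is minus the mean force
    a_edges = np.concatenate(([0.0], np.cumsum(dA * width)))
    return a_edges - a_edges.min(), counts


if __name__ == "__main__":
    profile, counts = run_abf_1d()
    print("estimated barrier (kT):", profile[len(profile) // 2])
    print("samples per bin:", counts)
```

In this toy setting the exact barrier between a well and x = 0 is 5 (in units of kT), so the value printed for the middle bin edge should land near that. In a real ABF run, the instantaneous force would come from the molecular dynamics engine, and a ramp is often applied to the bias while a bin still has few samples.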
Vector free energy calculation with adaptive biasing force
Eric Darve

Reaction coordinate $\xi$
[Figure: illustration of a reaction coordinate; image not recoverable]

Free energy as a function of a reaction coordinate
Probability density function of $\xi$:
\[ P(\xi) = \frac{\int e^{-\beta H}\, \delta(\xi - \xi(x))\, dx\, dp}{\int e^{-\beta H}\, dx\, dp} \]
Free energy function:
\[ A(\xi) = -k_B T \log P(\xi) \]
$-dA/d\xi$ represents the mean force acting on $\xi$.

Direct calculation using histograms
[Figure: histogram of $\xi$ and the corresponding free energy profile, labeling the transition state, the free energy barrier, and the meta-stable sets]

Applying a bias
A biasing force is added to the force acting on $\xi$. Uniform sampling along $\xi$ is obtained.

Force acting on $\xi$
The derivative of the free energy can be shown to be equal to
\[ \frac{dA}{d\xi} = -\left\langle \frac{d}{dt}\left( m_\xi \frac{d\xi}{dt} \right) \right\rangle_\xi , \qquad \frac{1}{m_\xi} = \sum_k \frac{1}{m_k} \left( \frac{\partial \xi}{\partial x_k} \right)^2 \]
[Figure: histogram of the force; the average is taken in each bin]
Sampling can be done using Molecular Dynamics.

Computing the adaptive bias
Ideally, the biasing force should be
\[ F_{\mathrm{bias}}(\xi) = \frac{dA}{d\xi} = -\left\langle \frac{d}{dt}\left( m_\xi \frac{d\xi}{dt} \right) \right\rangle_\xi \]
Instead, we use an approximation based on the samples obtained so far. With $\xi_l$ the value of the reaction coordinate at sample $l$, and $n_k(N)$ the number of the first $N$ samples that fall in bin $k$:
\[ F_{\mathrm{bias}}^{(N)}(k) = -\frac{1}{n_k(N)} \sum_{l=1}^{N} \left[ \frac{d}{dt}\left( m_\xi \frac{d\xi}{dt} \right) \right]_l \mathbf{1}_{\mathrm{bin}\,k}(\xi_l) \]
This leads asymptotically to a uniform sampling.
External force added to the system:
\[ F = F_{\mathrm{bias}}\, \nabla \xi \]
[Figure: free energy estimates during the simulation]
The simulation starts at a minimum of the free energy and progressively moves away from it.

Vector free energy
The result can be extended to multiple dimensions. This is important since, in practice, the real reaction coordinate is rarely known. Therefore, free energy is often computed in terms of collective variables, which are assumed to be closely related to the reaction coordinate.
\[ \nabla_\xi A = -\left\langle \frac{d}{dt}\left( M_\xi \frac{d\xi}{dt} \right) \right\rangle_\xi , \qquad M_\xi^{-1} = J_\xi M^{-1} J_\xi^{T} \]
where $J_\xi$ is a fat matrix defined by $[J_\xi]_{ij} = \partial \xi_i / \partial x_j$.

Alanine dipeptide
[Figure: free energy (kcal/mol) of alanine dipeptide; axes in degrees]

Insertion of LS peptide in a membrane
[Figure: LS peptide at a membrane interface; image not recoverable]

Calculation with short-range cutoff
[Figure: free energy as a function of distance (Angstrom)]
The energy barrier is too high because we lack long-range electrostatic forces, which stabilize the system.

Choice of reaction coordinates
Simplest choices:
Distance $z$ between the center of mass of the helix and the membrane.
Hydration parameter: number of water molecules in contact with hydrophilic residues.
Angle between the helix and the normal to the water-octane interface.

Angle
Eigenvector of the moment of inertia tensor corresponding to the smallest eigenvalue. This vector points in the direction of maximum extension.
\[ I = \begin{pmatrix} \sum_i m_i (y_i^2 + z_i^2) & -\sum_i m_i x_i y_i & -\sum_i m_i x_i z_i \\ -\sum_i m_i x_i y_i & \sum_i m_i (x_i^2 + z_i^2) & -\sum_i m_i y_i z_i \\ -\sum_i m_i x_i z_i & -\sum_i m_i y_i z_i & \sum_i m_i (x_i^2 + y_i^2) \end{pmatrix} \]
The reaction coordinate is defined by the angle between this eigenvector and a vector normal to the water-octane interface.
The gradient of the reaction coordinate can be computed from the gradient of this eigenvector.
[Equation: gradient of the eigenvector; not recoverable from the source]
This definition avoids spurious artifacts due to the definition of the reaction coordinate.
Examples of bad definitions: a direction defined by two atoms (fast coordinate), or defined by two groups of atoms (non-smooth acceleration field).

Properties of this gradient
Contained in the plane defined by the eigenvector and the interface normal.
Small near the center of mass.
Corresponds to a solid body rotation.

Alchemical transformations
Glycine to Alanine: $H_0 \rightarrow H_1$

Free energy difference
The free energy difference is defined as
\[ \Delta A = -k_B T \log \frac{\int e^{-\beta H_1}\, dx\, dp}{\int e^{-\beta H_0}\, dx\, dp} \]
An auxiliary variable $\lambda$, along with a momentum $p_\lambda$, can be introduced, which defines a path from $H_0$ to $H_1$:
\[ H_\lambda = \frac{p_\lambda^2}{2 m_\lambda} + (1 - \lambda) H_0 + \lambda H_1 \]
Hamiltonian equations:
\[ \frac{dx}{dt} = \frac{p}{m}, \qquad \frac{dp}{dt} = -(1 - \lambda) \frac{\partial U_0}{\partial x} - \lambda \frac{\partial U_1}{\partial x} \]
\[ \frac{d\lambda}{dt} = \frac{p_\lambda}{m_\lambda}, \qquad \frac{dp_\lambda}{dt} = U_0 - U_1 \]
The same method can be applied to this extended system:
\[ \frac{dA}{d\lambda} = -\left\langle \frac{dp_\lambda}{dt} \right\rangle_\lambda = \left\langle U_1 - U_0 \right\rangle_\lambda \]
Even simpler!

1 Biomolecular Structure and Modeling
Historical Perspective

Chapter 1 Notation
SYMBOL    DEFINITION
Vectors
h         unit cell identifier (crystallography)
r         position
F_h       structure factor (crystallography)
φ_h       phase angle (crystallography)
Scalars
d         distance between parallel planes in the crystal
I_h       intensity, magnitude of structure factor (crystallography)
V         cell volume (crystallography)
θ         reflection angle (crystallography)
λ         wavelength of the X-ray beam

physics, chemistry, and biology have been connected by a web of causal explanation organized by induction-based theories that telescope into one another. Thus, quantum theory underlies atomic physics, which is the foundation of reagent chemistry and its specialized offshoot biochemistry, which interlock with molecular biology (essentially, the chemistry of organic macromolecules) and hence, through successively higher levels of organization, cellular, organismic, and evolutionary biology. Such is the unifying and highly productive understanding of the world that has evolved in the natural sciences.
Edward O. Wilson, "Resuming the Enlightenment Quest", in The Wilson Quarterly, Winter 1998.

1.1 A Multidisciplinary Enterprise

1.1.1 Consilience

The exciting field of modeling molecular systems by computer has been steadily drawing increasing attention from scientists in varied disciplines. In particular, modeling large biological polymers (proteins, nucleic acids, and lipids) is a truly multidisciplinary enterprise. Biologists describe the cellular picture; chemists fill in the atomic and molecular details; physicists extend these views to the electronic level and the underlying forces; mathematicians analyze and formulate appropriate numerical models and algorithms; and computer scientists and engineers provide the crucial implementational support for running large computer programs on high-speed and extended-communication platforms. The many names for the field (and related disciplines) underscore its cross-disciplinary nature: computational biology,
computational chemistry, in silico biology, computational structural biology, computational biophysics, theoretical biophysics, theoretical chemistry, and the list goes on.

As the pioneer of sociobiology Edward O. Wilson reflects in the opening quote, some scholars believe in a unifying knowledge for understanding our universe and ourselves, or consilience (see footnote 1), that merges all disciplines in a biologically-grounded framework [225]. Though this link is most striking between genetics and human behavior (through the neurobiological underpinnings of states of mind and mental activity, with shaping by the environment and lifestyle factors), such a unification that Wilson advocates might only be achieved by a close interaction among the varied scientists at many stages of study. The genomic era has such immense ramifications on every aspect of our lives (from health to technology to law) that it is not difficult to appreciate the effects of the biomolecular revolution on our 21st-century society. Undoubtedly, a more integrated synthesis of biological elements is needed to decode life [101].

Footnote 1: Consilience was coined in 1840 by the theologian and polymath William Whewell in his synthesis The Philosophy of the Inductive Sciences. It literally means the alignment, or jumping together, [text illegible in source] further recently by advocating in his 1998 book Consilience [225] that the world is orderly and can be explained by a set of natural laws that are fundamentally rooted in biology.

In biomolecular modeling, a multidisciplinary approach is important not only because of the many aspects involved (from problem formulation to solution) but also since the best computational approach is often closely tailored to the biological problem. In the same spirit, close connections between theory and experiment are essential: computational models evolve as experimental data become available, and biological
theories and new experiments are performed as a result of computational insights. (See [157, 22 7], for example, in connection to the [text illegible in source] of protein folding.)

Although few theoreticians in the field have expertise in experimental work as well, the classic example of Werner Heisenberg's genius in theoretical physics but naiveté in experimental physics is a case in point: Heisenberg required the resolving power of the microscope to derive the uncertainty relations. In fact, an error in the experimental interpretations was pointed out by Niels Bohr, and this eventually led to the "Copenhagen interpretation of quantum mechanics".

If Wilson's vision is correct, the interlocking web of scientific fields rooted in the biological sciences will succeed ultimately in explaining not only the functioning of a biomolecule or the workings of the brain, but also many aspects of modern society, through the connections between our biological makeup and human behavior.

1.1.2 What is Molecular Modeling?

Molecular modeling is the science and art of studying molecular structure and function through model building and computation. The model building can be as simple as plastic templates or metal rods, or as sophisticated as interactive, animated color stereographics and laser-made wooden sculptures. The computations encompass ab initio and semi-empirical quantum mechanics, empirical (molecular) mechanics, molecular dynamics, Monte Carlo, free energy and solvation methods, structure/activity relationships (SAR), [text illegible in source] information and databases, and many other established procedures. The refinement of experimental data, such as from nuclear magnetic resonance (NMR) or X-ray crystallography, is also a component of biomolecular modeling.

I often remind my students of Pablo Picasso's statement on art: "Art is the lie that helps tell the truth." This view applies aptly to biomolecular modeling. Though our models represent a highly-simplified version of the complex cellular environment, systematic studies based on tractable quantitative tools
can help discern patterns and add insights that are otherwise difficult to observe. The key in modeling is to develop and apply models that are appropriate for the questions being examined with them.

The questions being addressed by computational approaches today are as intriguing and as complex as the biological systems themselves. They range from understanding the equilibrium structure of a small biopolymer subunit, to the energetics of hydrogen-bond formation in proteins and nucleic acids, to the kinetics of protein folding, to the complex functioning of a supramolecular aggregate. As experimental triumphs are being reported in structure determination (from ion channels to single-molecule biochemistry; see footnote 2), modeling approaches are needed to fill in many gaps and to build better models and theories that will ultimately make testable predictions.

1.1.3 Need For Critical Assessment

The field of biomolecular modeling is relatively young, having started in the 1960s, and only gained momentum since the mid-1980s with the advent of supercomputers. Yet the field is developing with dazzling speed. Advances are driven by [text illegible in source] structural databases, as well as in force fields, algorithms for conformational sampling and molecular dynamics, computer graphics, and the increased computer power and memory capabilities. These impressive technological and modeling advances are steadily establishing the field of theoretical modeling as a partner to experiment and a widely used tool for research and development.

Yet, as we witness the tantalizing progress, a cautionary usage of molecular modeling tools is warranted, as well as a critical perspective of the field's strengths and limitations. This is because the current generation of users and application scientists in the industrial and academic sectors may not be familiar with some of the caveats and inherent approximations in biomolecular modeling and simulation approaches. Indeed,
the tools and programs developed by a handful of researchers thirty years ago have now resulted in extensive profit-making software for genomic information, drug design, and every aspect of modeling. More than ever, a [text illegible in source] in the framework is necessary for sound studies in the exciting era of computational biophysics that lies on the horizon.

Footnote 2: Examples of recent triumphs in biomolecular structure determinations include elucidation of the nucleosome (essential building block of the DNA/protein spools that make up the chromosomal material) [131]; ion channel proteins (regulators of membrane electrical potentials in cells, thereby generating nerve impulses and controlling muscle contraction, hormone production, and cardiac rhythm) [54, 132, 145, 241, 59]; and the ribosome (the cell's protein-synthesis factory [75], the machine bundle of 54 proteins and three RNA strands that moves along messenger RNA and synthesizes polypeptides). The complete 70S ribosome system was first solved at low [34] and moderate [237] resolution; its larger [36, 9, 153] and smaller subunits [224, 227, 31, 186] were then solved at moderate resolution; see perspective in [21]. Other important examples of experimental breakthroughs involve overstretched DNA (as seen in single-molecule force versus extension measurements) [199, 28] and competing folding and unfolding pathways for proteins (as obtained by kinetic studies using spectroscopic probes), e.g., [240, 122].

Table 1.1: Structural Biology Chronology
1865   Genes discovered by Mendel
1910   Genes in chromosomes shown by Morgan's fruit fly mutations
1920s  Quantum mechanics theory develops
1926   Early reports of crystallized proteins
1930s  Reports of crystallized proteins continue and stimulate Pauling & Corey to compile bond lengths and angles of amino acids
1944   Avery proves genetic transformation via DNA (not protein)
1946   Molecular mechanics calculations reported (Westheimer, others)
1949   Sickle cell anemia identified as 'molecular disease' (Pauling)
1950   Chargaff determines near-unity A:T and G:C ratios in many species
1951   Pauling & Corey predict protein α-helices and β-sheets
1952   Hershey & Chase reinforce genetic role of DNA (phage experiments)
1952   Wilkins & Franklin deduce that DNA is a helix (X-ray fiber diffraction)
1953   Watson & Crick report the structure of the DNA double helix
1959   Myoglobin & hemoglobin deciphered by X-ray (Kendrew & Perutz)
1960s  Systematic force fields develop (Allinger, Lifson, Scheraga, others)
1960s  Genetic code deduced (Crick, Brenner, Nirenberg, Khorana, Holley, coworkers)
1969   Levinthal paradox on protein folding posed
1970s  Biomolecular dynamics simulations develop (Stillinger, Karplus, others)
1970s  Site-directed mutagenesis techniques developed by M. Smith; restriction enzymes discovered by Arber, Nathans, and H. Smith
1971   Protein Data Bank established
1974   tRNA structure reported
1975   Fifty solved biomolecular structures available in the PDB
1977   DNA genome of the virus φX174 (5.4 kb) sequenced; soon followed by human mitochondrial DNA (16.6 kb) and λ phage (48.5 kb)
1980s  Dazzling progress realized in automated sequencing, protein X-ray crystallography, NMR, recombinant DNA, and macromolecular synthesis
1985   PCR devised by Mullis; numerous applications follow
1985   NSF establishes five national supercomputer centers
1990   International Human Genome Project starts; spurs others
1994   RNA hammerhead ribozyme structure reported; other RNAs follow
1995   First non-viral genome completed (bacterium H. influenzae), 1.8 Mb
1996   Yeast genome (Saccharomyces cerevisiae) completed, 13 Mb
1997   Chromatin core particle structure reported; confirms earlier structure
1998   Roundworm genome (C. elegans) completed, 100 Mb
1998   Crystal structure of ion channel protein reported
1998   Private Human Genome initiative competes with international effort
1999   Fruit fly genome (Drosophila melanogaster) completed (Celera), 137 Mb
1999   Human chromosome 22 sequenced (public consortium)
1999   IBM announces petaflop computer to fold proteins by 2005
2000   First draft of human genome sequence announced, 3300 Mb
2000   Moderate-resolution structures of ribosomes reported
2001   First annotation of the human genome (February)
2002   First draft of rice genome sequence, 430 Mb (April)
2003   Human genome sequence completed (April)

1.1.4 Text Overview

This text aims to provide this critical perspective for field assessment while introducing the relevant techniques. Specifically, the elementary background for biomolecular modeling will be introduced: protein and nucleic-acid structure tutorials (Chapters 3-6), overview of theoretical approaches (Chapter 7), details of force field construction and evaluation (Chapters 8 and 9), energy minimization techniques (Chapter 10), Monte Carlo simulations (Chapter 11), molecular dynamics and related methods (Chapters 12 and 13), and similarity/diversity problems in chemical design (Chapter 14).

As emphasized in this book's Preface, given the enormously broad range of these topics, depth is often sacrificed at the expense of breadth. Thus, many specialized texts (e.g., in Monte Carlo, molecular dynamics, or statistical mechanics) are complementary, such as those listed in Appendix C; the representative articles used for the course (Appendix B) are important components. For introductory texts to biomolecular structure, biochemistry, and biophysical chemistry, see those listed in Appendix C, such as [30, 44, 70, 201, 18]. For molecular simulations, a solid grounding in classical statistical mechanics, thermodynamic ensembles, time-correlation functions, and basic simulation protocols is important. Good introductory texts for these subjects, including biomolecular applications, are [140, 5, 76, 178, 19, 81, 171, 84, 137, 24].

The remainder of this chapter and the next chapter provide a historical context for the field's development. Overall, this chapter focuses on a historical account of the field and the experimental progress that made biomolecular modeling possible; Chapter 2 introduces some of the field's challenges
as well as practical applications of their solution.

Specifically, to appreciate the evolution of biomolecular modeling and simulation, we begin in the next section with an account of the milieu of growing experimental and technical developments. Following an introduction to the birth of molecular mechanics (Section 1.2), experimental progress in protein and nucleic-acid structure is described (Section 1.3). A selective reference chronology to structural biology is shown in Table 1.1.

The experimental section of this chapter discusses separately the early days of biomolecular instrumentation (as structures were emerging from X-ray crystallography) and the modern era of technological developments (stimulating the many sequencing projects and the rapid advances in biomolecular NMR and crystallography). Within this presentation, separate subsections are devoted to the techniques of X-ray crystallography and NMR and to the genome projects.

Chapter 2 continues this perspective by describing the computational challenges that naturally emerge from the dazzling progress in genome projects and experimental techniques, namely deducing structure and function from sequence. Problems are exemplified by protein folding and misfolding. Students unfamiliar with basic protein structure are urged to reread Chapter 2 after the protein minitutorial chapters. The sections that follow mention some of the exciting and important biomedical, industrial, and technological applications that lend enormous practical utility to the field. These applications represent a tangible outcome of the confluential experimental, theoretical, and technological advances.

Since the material presented in these introductory chapters is changing rapidly (e.g., the status of the genome projects, theoretical and instrumentational progress), the author anticipates periodic updating and placement on the text web page.

1.2 The Roots of Molecular Modeling in Molecular Mechanics

The roots of molecular modeling began with the
notion that molecular geometry, energy, and various molecular properties can be calculated from mechanical-like models subject to basic physical forces. A molecule is represented as a mechanical system in which the particles (atoms) are connected by springs (the bonds). The molecule then rotates, vibrates, and translates to assume favored conformations in space as a collective response to the inter- and intramolecular forces acting upon it. The forces are expressed as a sum of harmonic-like (from Hooke's law) terms for bond-length and bond-angle deviations from reference equilibrium values; trigonometric torsional terms to account for internal rotation (rotation of molecular subgroups about the bond connecting them); and nonbonded van der Waals and electrostatic potentials. See Chapter 8 for a detailed discussion of these terms, as well as of more intricate cross terms.

1.2.1 The Theoretical Pioneers

Molecular mechanics arose naturally from the concepts of molecular bonding and van der Waals forces. The Born-Oppenheimer approximation assuming fixed nuclei (see Chapter 7) followed in the footsteps of quantum theory developed in the 1920s. While the basic idea can be traced to 1930, the first attempts of molecular mechanics calculations were recorded in 1946. Frank Westheimer's calculation of the relative racemization rates of biphenyl derivatives illustrated the success of such an approach. However, computers were not available at that time, so it took several more years for the field to gather momentum.

In the early 1960s, pioneering work on development of systematic force fields (based on spectroscopic information, heats of formation, structures of small compounds sharing the basic chemical groups, other experimental data, and quantum-mechanical information) began independently in the laboratories of the late Shneior Lifson at the Weizmann Institute of Science (Rehovot, Israel) [126], Harold Scheraga at Cornell University (Ithaca, New York), and Norman Allinger at Wayne State University (Detroit, Michigan) and then the University of Georgia (Athens). These researchers began to develop force field parameters for families of chemical compounds by testing calculation results against experimental observations regarding structure and energetics.

Table 1.2: The evolution of molecular mechanics and dynamics.
[Table body not recoverable from the source; surviving column headings: System and Size (a), CPU Time/Computer (c).]
(a) The examples for each period are representative. The first five systems are modeled in vacuum and the others in solvent. Except for the dinucleoside, simulations refer to molecular dynamics (MD). The two system sizes for the β-heptapeptide [47] reflect two temperature-dependent simulations. See text for definitions of abbreviations and further entry information.
(b) The 38 µs β-hairpin simulation in 2001 represents an ensemble (or aggregate dynamics) simulation, as accumulated over several short runs, rather than one long simulation [238].
(c) The computational time is given where possible; estimates for the vacuum DNA, heptapeptide, β-hairpin, and channel protein simulations [125, 47, 238, 203] were kindly provided by M. Levitt, W. van Gunsteren, V. Pande, and K. Schulten, respectively.

In the early 1970s, Rahman and Stillinger reported the first molecular dynamics work of a polar molecule, liquid water [169, 170]; results offered insights into the structural and dynamic properties of this life-sustaining molecule. Rahman and Stillinger built upon the simulation technique described much earlier (1959) by Alder and Wainwright but applied to hard spheres [4].

In the late 1970s, the idea of using molecular mechanics force fields with energy minimization as a tool for refinement of crystal structures was presented [103] and developed [115]. This led to the modern versions employing simulated annealing and related methods [40, 117].

It took a few more years, however, for the field to gain some 'legitimacy' (see footnote 3). In fact, these pioneers did not receive much general support at first, partly because their work could not easily be classified as a traditional discipline of
chemistry (e.g., physical chemistry, organic chemistry). In particular, spectroscopists criticized the notion of transferability of the force constants, though at the same time they were quite curious about the predictions that molecular mechanics could make. In time, it indeed became evident that force constants are not generally transferable; still, the molecular mechanics approach was sound since nonbonded interactions are included, terms that spectroscopists omitted.

Ten to fifteen more years followed until the first generation of biomolecular force fields was established. The revitalized idea of molecular dynamics in the late 1970s, propagated by Martin Karplus and colleagues at Harvard University, sparked a flame of excitement that continues with full force today with the fuel of supercomputers. Most programs and force fields today, for both small and large molecules, are based on the works of the pioneers cited above (Allinger, Lifson, and Scheraga) and their coworkers. The water force fields developed in the late 1970s/early 1980s by Berendsen and coworkers (e.g., [180]) and by Jorgensen and coworkers [106] (SPC and TIP3P/TIP4P, respectively) laid the groundwork for biomolecular simulations in solution.

Peter Kollman's legacy is the development and application of force field methodology and computer simulation to important biomolecular, as well as medicinal, problems such as enzyme catalysis and protein/ligand design [218]; his group's free energy methods and combined quantum/molecular mechanics approaches have opened many new doors of applications. With Kollman's untimely death in May 2001, the community mourns the loss of a great leader and innovator.

Footnote 3: Personal experiences shared by Norman L. Allinger on those early days of the field form the basis for the comments in this paragraph. I am grateful for his sharing these experiences with me.

1.2.2 Biomolecular Simulation Perspective

Table 1.2 and Figures 1.1 and 1.2 provide a perspective of biomolecular simulations. Specifically, the selected examples illustrate the growth in time of system complexity (size and model resolution) and simulation length. The three-dimensional (3D) rendering in Figure 1.1 shows 'buildings' with heights proportional to system size. Figure 1.2 offers molecular views of the simulation subjects and extrapolations for long-time simulations of proteins and cells, based on the projections discussed later in this section.

Representative Progress

Starting from the first entry in the table, dinucleoside GpC (guanosine-3',5'-cytidine monophosphate) posed a challenge in the early 1970s for finding all minima by potential energy calculations and model building [198]. Still, clever search strategies and constraints found a correct conformation (dihedral angles in the range of helical RNA and sugar in C3'-endo form) as the lowest energy minimum. Global optimization remains a difficult problem! (See Chapter 10.)

The small protein BPTI (Bovine Pancreatic Trypsin Inhibitor) was the subject of a 1977 pioneering dynamic simulation applied to a protein [136]. It showed substantial atomic fluctuations on the picosecond timescale.

The 12 and 24-base-pair (bp) DNA simulations in 1983 [125] were performed in vacuum without electrostatics, and that of the DNA pentamer system in 1985, with 830 water molecules and 8 sodium ions and full electrostatics [189]. Stability problems for nucleic acids emerged in the early days (unfortunately, in some cases the strands untwisted and separated) [125]. Stability became possible with the introduction of scaled phosphate charges in other pioneering nucleic-acid simulations [168, 91, 207] and the introduction, a decade later, of more advanced treatments for solvation and electrostatics; see [38], for example, for a discussion.

The linear decapeptide GnRH (gonadotropin-releasing hormone) was studied in 1984 for its pharmaceutical potential, as it triggers LH and FSH hormones [200].

The 300 ps dynamics simulation of the protein myoglobin in 1985 [127] was considered three times longer than the longest previous MD simulation of a protein. The results indicated
a slow convergence of many thermodynamic properties.

The large-scale phospholipid aggregate simulation in 1989 [221] was an ambitious undertaking: it incorporated a hydrated micelle (i.e., a spherical aggregate of phospholipid molecules) containing 85 LPE molecules (lysophosphatidylethanolamine) and 1591 water molecules.

The HIV protease system simulated in solution in 1992 [89] captured an interesting flap motion at the active site. See also Figure 2.5 and a discussion of this motion in the context of protease inhibitor design.

The 1997 estrogen/DNA simulation [118] sought to understand the mechanism underlying DNA sequence recognition by the protein. It used the multipole electrostatic treatment, crucial for simulation stability, and also parallel processing for speedup [185].

The 1998 DNA simulation [235] used the alternative Particle Mesh Ewald (PME) treatment for consideration of long-range electrostatics (see Chapter 9) and uncovered interesting properties of A-tract sequences.

The 1998 peptide simulation in methanol used periodic boundary conditions (defined in Chapter 9) and captured reversible, temperature-dependent folding [47]; the 200 ns time reflects four 50 ns simulations at various temperatures.

The 1998 1 µs villin-headpiece simulation (using periodic boundary conditions) [55] was considered longer by three orders of magnitude than prior simulations. A folded structure close to the native state was approached; see also [57].

The solvated protein bc1 embedded in a phospholipid bilayer [102] was simulated in 1999 for over 1 ns by a 'steered molecular dynamics' algorithm (45,131 flexible atoms) to suggest a pathway for proton conduction through a water channel. As in villin, the Coulomb forces were truncated.

By 2002, an aquaporin membrane channel protein in the glycerol-conducting subclass (E. coli glycerol channel, GlpF) in a lipid membrane (106,189 total atoms) was simulated for 5 ns (as well as a mutant), with all nonbonded interactions considered, using the PME approach [203]. The simulations
suggested details of a selective mechanism by which water transport is controlled; see also [105] for simulations examining the glycerol transport mechanism.

By early 2002, the longest simulation published was 38 µs, but for aggregate (or ensemble) dynamics (usage of many short trajectories to simulate the microsecond timescale), set for the C-terminal β-hairpin from protein G (16 residues) in 2001 [238]. Whereas the continuous 1 µs villin simulation required months of dedicated supercomputing, the β-hairpin simulation (177 atoms, using implicit solvation and Langevin dynamics) was performed to analyze folding kinetics on a new distributed computing paradigm which employs personal computers from around the world (see Folding@home, folding.stanford.edu, and [191]). About 5000 processors were employed, and with the effective production rate of 1 day per nanosecond per processor, about 8 days were required to simulate the 38 µs aggregate time. See also [195] for a later set of simulations (aggregate time of 700 µs) for mutants of the designed mini-protein BBA5, reviewed in [23].

Trends

Note from the table and figure the transition from simulations in vacuum (first five entries) to simulations in solvent (remaining items). Observe also the steady increase in simulated system size, with a leap increase in simulation lengths made only recently.

Large system sizes or long simulation times can only be achieved by sacrificing other simulation aspects. For example, truncating long-range electrostatic interactions makes possible the study of large systems over short times [102], or small systems over long times [55]. Using implicit solvent and cutoffs for electrostatic interactions also allows the simulation of relatively small systems over long times [238]. In fact, with the increased awareness of the sampling problem in dynamic simulation (see Chapter 12), we now see the latter trend more often, namely studying smaller solvated molecular systems for longer times; one long
simulation is often replaced by several trajectories, leading to overall better sampling statistics.

For reviews and perspectives on dynamics simulations, see [45, 88], for example. The former discusses progress to date and future challenges in long-timescale simulations of peptides and proteins in solution, and the latter summarizes progress in various macromolecular systems, including membranes and channels, and simulation methodologies, including drug design applications. See also the June 2002 issue of Accounts of Chemical Research (volume 35), devoted to molecular dynamics simulations of biomolecules.

Duan et al. make an interesting "fanciful" projection on the computational capabilities of modeling in the coming decades [56]: they suggest the feasibility, in 20 years, of simulating a second in the lifetime of medium-sized proteins and, in 50–60 years, of following the entire life cycle of an E. coli cell (1000 seconds, or 20 minutes, for 30 billion atoms). This estimate was extrapolated on the basis of two data points, the 1977 BPTI simulation [136] and the 1998 villin simulation [55, 57] discussed above, and relied on the assumption that computational power increases by a factor of 10 every 3–4 years. These projections are displayed by entries for the years 2020 and 2055 in Figure 1.2.

1.3 Emergence of Biomodeling from Experimental Progress in Proteins and Nucleic Acids

At the same time that molecular mechanics developed, tremendous progress on the experimental front also began to trigger further interest in the theoretical approach to structure determination.

1.3.1 Protein Crystallography

The first records of crystallized polypeptides or proteins date back to the late 1920s and early 1930s (1926: urease, James Sumner; 1934: pepsin, J. D. Bernal and Dorothy Crowfoot-Hodgkin; 1935: insulin, Crowfoot-Hodgkin). However, only in the late 1950s did John Kendrew (Perutz' first doctoral student) and Max Perutz succeed in deciphering the X-ray diffraction pattern from the crystal structure of the protein: 1958, myoglobin (Kendrew); 1959,
hemoglobin (Perutz). This was possible by Perutz' crucial demonstration in 1954 that structures of proteins can be solved by comparing the X-ray diffraction patterns of a crystal of a native protein to those associated with the protein bound to heavy atoms like mercury (i.e., by isomorphous replacement). The era of modern structural biology began with this landmark development.

[Figure 1.1: The evolution of molecular dynamics simulations with respect to system sizes and simulation lengths (see also Table 1.2). Vertical axis: number of atoms.]

As glimpses of the first X-ray crystal structures of proteins came into view, Linus Pauling and Robert Corey began in the mid-1930s to catalogue bond lengths and angles in amino acids. By the early 1950s, they had predicted the two basic structures of amino acid polymers on the basis of hydrogen bonding patterns: α helices and β sheets [163, 162]. As of 1960, about 75 proteins had been crystallized, and immense interest began on relating the sequence content to catalytic activity of these enzymes.

By then, the exciting new field of molecular biology was well underway. Perutz, who founded the Medical Research Council Unit for Molecular Biology at the Cavendish Laboratory in Cambridge in 1947, created in 1962 the Laboratory of Molecular Biology there. Perutz and Kendrew received the Nobel Prize for Chemistry for their accomplishments in 1962.^4

^4 See the formidable electronic museum of science and technology, with related lectures and books, that emerged from Nobel-awarded research on the website of the Nobel Foundation (www.nobel.se). This virtual museum was recently constructed to mark the 100th anniversary in 2001 of Alfred
B

Inspired by the 1928 work of the British medical officer Fred Griffith, Oswald Avery and coworkers Colin MacLeod and Maclyn McCarty studied pneumonia infections. Griffith's intriguing experiments showed that mice became fatally ill upon infection from a live but harmless (coatless) strain of pneumonia-causing bacteria mixed with the DNA from heat-killed pathogenic bacteria; thus, the DNA from heat-killed pathogenic bacteria transformed live harmless into live pathogenic bacteria. Avery and coworkers mixed DNA from virulent strains of pneumococci with harmless strains and used enzymes that digest DNA but not proteins. Their results led to the cautious announcement that the transforming agent of traits is made exclusively of DNA.^5

Their finding was held with skepticism until the breakthrough, Nobel Prize-winning phage experiments of Alfred Hershey and Martha Chase eight years later, which demonstrated that only the nucleic acid of the phage entered the bacterium upon infection, whereas the phage protein remained outside.^6

Much credit for the transforming agent evidence is due to the German theoretical physicist and Nobel laureate Max Delbrück, who brilliantly suggested to use bacterial viruses as the model system for the genome demonstration principle. Delbrück shared the Nobel Prize in Physiology or Medicine in 1969 with Hershey and Salvador Luria for their pioneering work that established bacteriophage as the premier model system for molecular genetics.

In 1950, Erwin Chargaff demonstrated that the ratios of adenine-to-thymine and guanine-to-cytosine bases are close to unity, with the relative amount of each kind of pair depending on the DNA source.^7 These crucial data, together with the X-ray fiber diffraction photographs of hydrated DNA taken by Rosalind Franklin^8 and Raymond Gosling (both affiliated with Maurice Wilkins, who was engaged in related research) [107], led directly to Watson and Crick's ingenious proposal of the structure of DNA in 1953. The photographs were
crucial as they suggested a helical arrangement.

Although connecting these puzzle pieces may seem straightforward to us now that the DNA double helix is a household word, these two ambitious young Cambridge scientists deduced from the fiber diffraction data and other evidence that the observed base-pairing specificity, together with steric restrictions, can be reconciled in an anti-parallel double-helical form with a sugar-phosphate backbone and nitrogenous-bases interior. Their model also required a key piece of information from the organic chemist Jerry Donohue regarding the tautomeric states of the bases.^9

^5 Interested readers can visit the virtual gallery of Profiles in Science at www.profiles.nlm.nih.gov for a profile on Avery.

^6 A wonderful introduction to the rather recluse Hershey, who died at the age of 88 in 1997, can be enjoyed in a volume edited by Franklin W. Stahl titled We can sleep later: Alfred D. Hershey and the origins of molecular biology (Cold Spring Harbor Press, New York, 2000). The title quotes Hershey from his letter to contributors of a volume on bacteriophage λ which he edited in 1971, urging them to complete and submit their manuscripts.

^7 Chargaff died in June 2002 at the age of 96. Sadly, he was a sardonic man who did not easily fit into the sharply focused world of most scientists; he further isolated himself when he denounced the molecular biology community in the late 1950s.

^8 See an outstanding study on the "dark lady of DNA" in a recent biography [133] and [62].

Though many other DNA forms besides the classic Crick and Watson (B-DNA) form are now recognized, including triplexes and quadruplexes, the B-form is still the most prevalent under physiological conditions. Indeed, the 50th anniversary in April 2003 of Watson and Crick's seminal paper was celebrated with much fanfare throughout the world.

RNA crystallography is at an earlier stage, but has recently made quantum leaps with the solution of several significant RNA molecules
and identified novel roles for RNA (e.g., gene regulation; see Chapter 6) [188, 53]. These developments followed the exciting discoveries in the 1980s that established that RNA, like protein, can act as an enzyme in living cells. Sidney Altman and Thomas Cech received the 1989 Nobel Prize in Chemistry for their discovery of RNA biocatalysts, ribozymes.

The next two subsections elaborate upon the key techniques for solving biomolecular structures: X-ray crystallography and NMR. We end this section on experimental progress with a description of modern technological advances and the genome sequencing projects they inspired.

1.3.3 The Technique of X-ray Crystallography

Much of the early crystallographic work was accomplished without computers and was inherently very slow. Imagine calculating the Fourier series by hand! Only in the 1950s were direct methods for the phase problem developed, with a dramatic increase in the speed of structure determination occurring about a decade later.

Structure determination by X-ray crystallography involves analysis of the X-ray diffraction pattern produced when a beam of X-rays is directed onto a well-ordered crystal. Crystals form by vapor diffusion from purified protein solutions under optimal conditions. See [18, 155] for overviews.

The diffraction pattern can be interpreted as a reflection of the primary beam source from sets of parallel planes in the crystal. The diffracted spots are recorded on a detector (electronic device or X-ray film), scanned by a computer, and analyzed on the basis of Bragg's law^10 to determine the unit cell parameters.

^9 Proton migrations within the bases can produce a tautomer. These alternative forms depend on the dielectric constant of the solvent and the pH of the environment. In the bases, the common amino group (–NH2) can tautomerize to an imino form (=N–H), and the common keto group (–C=O) can adopt the enol state (=C–O–H); the fraction of bases in the rare imino and enol tautomers is only about 0.01% under regular conditions.

^10 The Braggs, father William Henry
and son Sir William Lawrence, observed that if two waves of electromagnetic radiation arrive at the same point in phase and produce a maximal intensity, the difference between the distances they traveled is an integral multiple of their wavelengths. From this they derived what is now known as Bragg's law, specifying the conditions for diffraction and the relation among three key quantities: d (distance between parallel planes in the crystal), λ (the wavelength of the X-ray beam), and θ (the reflection angle). Bragg's condition requires that the difference (footnote 10 continues below)

Each such recorded diffraction spot has an associated amplitude, wavelength, and phase; all three properties must be known to deduce atomic positions. Since the phase is lost in the X-ray experiments, it must be computed from the other data. This central obstacle in crystal structure analysis is called the phase problem (see Box 1.1). Together, the amplitudes and phases of the diffraction data are used to calculate the electron density map; the greater the resolution of the diffraction data, the higher the resolution of this map and hence the atomic detail derived from it.

Both the laborious crystallization process [139] and the necessary mathematical analysis of the diffraction data limit the amount of accurate biomolecular data available. Well-ordered crystals of biological macromolecules are difficult to grow, in part because of the disorder and mobility in certain regions. Crystallization experiments must therefore screen and optimize various parameters that influence crystal formation, such as temperature, pH, solvent type, and added ions or ligands.

The phase problem was solved by direct methods for small molecules (roughly ≤ 100 atoms) by Jerome Karle and Herbert Hauptman in the late 1940s and early 1950s; they were recognized for this feat with the 1985 Nobel Prize in Chemistry. For larger molecules, biomolecular crystallographers have relied on the method pioneered by Perutz, Kendrew, and their coworkers termed multiple
isomorphous replacement (MIR). MIR introduces new X-ray scatterers from complexes of the biomolecule with heavy elements such as selenium, or heavy metals like osmium, mercury, or uranium. The combination of diffraction patterns for the biomolecule, heavy elements or metals, and biomolecule/heavy-metal complex offers more information for estimating the desired phases. The differences in diffracted intensities between the native and derivative crystals are used to pinpoint the heavy atoms, whose waves serve as references in the phase determination for the native system.

To date, advances in the experimental, technological, and theoretical fronts have dramatically improved the ease of crystal preparation and the quality of the obtained three-dimensional (3D) biomolecular models [18, last chapter]. Techniques besides MIR to facilitate the phase determination process, by analyzing patterns of heavy-metal derivatives using multiwavelength anomalous diffraction (MAD) or by molecular replacement (deriving the phase of the target crystal on the basis of a solved related molecular system) [95, 96], have been developed. Very strong X-ray sources from synchrotron radiation (e.g., with light intensity that can be 10,000 times greater than conventional beams generated in a laboratory) have become available. New techniques have made it possible to visualize short-lived intermediates in enzyme-catalyzed reactions at atomic resolution by time-resolved crystallography [165, 78, 143]. And improved methods for model refinement and phase determination are continuously being reported [212]. Such advances are leading to highly refined biomolecular structures^11 (resolution ≤ 2 Å) at much greater numbers [11], even for nucleic acids [152].

(Footnote 10, continued) in distance traveled by the X-rays reflected from adjacent planes is equal to the wavelength λ. The associated relationship is λ = 2d sin θ.

Box 1.1: The Phase Problem

The mathematical phase problem in crystallography [93, 110] involves resolving the
phase angles $\phi_h$ associated with the structure factors $F_h$ when only the intensities (squares of the amplitudes) of the scattered X-ray pattern, $I_h = |F_h|^2$, are known. The structure factors $F_h$, defined as

\[ F_h = |F_h| \exp(i \phi_h), \tag{1.1} \]

describe the scattering pattern of the crystal in the Fourier series of the electron density distribution:

\[ \rho(\mathbf{r}) = \frac{1}{V} \sum_{h} F_h \exp(-2\pi i \, \mathbf{h} \cdot \mathbf{r}). \tag{1.2} \]

Here $\mathbf{r}$ denotes position, $\mathbf{h}$ identifies the three defining planes of the unit cell (e.g., h, k, l), $V$ is the cell volume, and $\cdot$ denotes the dot (scalar) product of two vectors. See [174], for example, for details.

1.3.4 The Technique of NMR Spectroscopy

The introduction of NMR as a technique for protein structure determination came much later (early 1960s), but since 1984 both X-ray diffraction and NMR have been valuable tools for determining protein structure at atomic resolution. Kurt Wüthrich was awarded the 2002 Nobel Prize in Chemistry^12 for his pioneering efforts in developing and applying NMR to biological macromolecules.

Nuclear magnetic resonance is a versatile technique for obtaining structural and dynamic information on molecules in solution. The resulting 3D views from NMR are not as detailed as those that can result from X-ray crystallography, but the NMR information is not static and incorporates effects due to thermal motions in solution.

^11 The resolution value is similar to the quantity associated with a microscope: objects (atoms) can be distinguished if they are separated by more than the resolution value. Hence, the lower the resolution value, the more molecular architectural detail that can be discerned.

^12 The other half of the 2002 Chemistry prize was split between John B. Fenn and Koichi Tanaka, who were recognized for their development of ionization methods for analysis of proteins using mass spectrometry.

In NMR, powerful magnetic fields and high-frequency radiation waves are applied to probe the magnetic environment of the nuclei. The local environment of the nucleus determines the frequency of the resonance absorption. The resulting NMR spectrum contains
information on the interactions and localized motion of the molecules containing those resonant nuclei.

The absorption frequency of particular groups can be distinguished from one another when high-frequency NMR devices are used (high-resolution NMR). Until recently, this requirement for nonoverlapping signals to produce a clear picture has limited the protein sizes that can be studied by NMR to systems with masses in the range of 50 to 100 kDa. However, dramatic increases (such as a tenfold increase) have been possible with novel strategies for isotopic labeling of proteins [236] and detection of signals from disordered residues with fast internal motions by cross-correlated relaxation-enhanced polarization transfer [72]. For example, the Horwich and Wüthrich labs collaborated in 2002 to produce a high-resolution solution NMR structure of the chaperonin/co-chaperonin GroEL/GroES complex (~900 kDa) [72]. Advances in solid-state NMR techniques may be particularly valuable for structure analysis of membrane proteins.

As in X-ray crystallography, advanced computers are required to interpret the data systematically. NMR spectroscopy yields a wealth of information: a network of distances involving pairs of spatially proximate hydrogen atoms. The distances are derived from Nuclear Overhauser Effects (NOEs) between neighboring hydrogen atoms in the biomolecule, that is, for atom pairs separated by less than ~6 Å.

To calculate the 3D structure of the macromolecule, these NMR distances are used as conformational restraints in combination with various supplementary information: primary sequence, reference geometries for bond lengths and bond angles, chirality, steric constraints, spectra, and so on. A suitable energy function must be formulated and then minimized, or surveyed by various techniques, to find the coordinates that are most compatible with the experimental data. See [40] for an overview. Such deduced models are used to back-calculate the spectra inferred from the distances, from which iterative improvements of
the model are pursued to improve the matching of the spectra. Indeed, the difficulty of using NMR data for structure refinement in the early days can be attributed to this formidable refinement task, formally an overdetermined or underdetermined global optimization problem.^13

The pioneering efforts of deducing peptide and protein structures in solution by NMR techniques were reported between 1981 and 1986; they reflected year-long struggles in the laboratory. Only a decade later, with advances on the experimental, theoretical, and technological fronts, 3D structures of proteins in solution could be determined routinely for monomeric proteins with less than 200 amino acid residues. See [65, 231] for texts by modern NMR pioneers, [83] for a historical perspective of biomolecular NMR, and [40, 193, 41] for recent advances.

^13 Solved NMR structures are usually presented as sets of structures since certain molecular segments can be overdetermined while others underdetermined. The better the agreement for particular atomic positions among the structures in the set, the more likely it is that a particular atom or component is well determined.

Today's clever methods have been designed to facilitate such refinements, from formulation of the target energy to conformational searching, the latter using tools from distance geometry, molecular dynamics, simulated annealing, and many hybrid search techniques [26, 40, 83, 32]. The ensemble of structures obtained is not guaranteed to contain the "best" (global) one, but the solutions are generally satisfactory in terms of consistency with the data. The recent technique of residual dipolar coupling also has great potential for structure determination by NMR spectroscopy without the use of NOE data [208, 194].

1.4 Modern Era of Technological Advances

1.4.1 From Biochemistry to Biotechnology

The discovery of the elegant yet simple DNA double helix not only led to the birth of molecular biology; it led to the crucial
link between biology and chemistry. Namely, the genetic code relating triplets of RNA (the template for protein synthesis) to the amino acid sequence was decoded ten years later, and biochemists began to isolate enzymes that control DNA metabolism.

One class of those enzymes, restriction endonucleases, became especially important for recombinant DNA technology. These molecules can be used to break huge DNA into small fragments for sequence analysis. Restriction enzymes can also cut and paste DNA (the latter with the aid of an enzyme, ligase) and thereby create spliced DNA of desired transferred properties, such as antibiotic-resistant bacteria that serve as informants for human insulin makers. The discovery of these enzymes was recognized by the 1978 Nobel Prize in Physiology or Medicine to Werner Arber, Daniel Nathans, and Hamilton O. Smith.

Very quickly, X-ray, NMR, recombinant DNA technology, and the synthesis of biological macromolecules improved. The 1970s and 1980s saw steady advances in our ability to produce, crystallize, image, and manipulate macromolecules. Site-directed mutagenesis, developed in the 1970s by Canadian biochemist Michael Smith (1993 Nobel laureate in Chemistry), has become a fundamental tool for protein synthesis and protein function analysis.

1.4.2 PCR and Beyond

The polymerase chain reaction (PCR), devised in 1985 by Kary Mullis (winner of the 1993 Chemistry Nobel Prize, with Michael Smith) and coworkers [149], revolutionized biochemistry: small parent DNA sequences could be amplified exponentially in a very short time and used for many important investigations. DNA analysis has become a standard tool for a variety of practical applications. Noteworthy classic and current examples of PCR applications are collected in Box 1.2. See also [172] for stories on how genetics teaches us about history, justice, diseases, and more.

Beyond amplification, PCR technology made possible isolation of gene fragments and their usage to clone whole genes; these genes could then be inserted into
viruses or bacterial cells to direct the synthesis of biologically active products. With dazzling speed, the field of bioengineering was born.

Automated sequencing efforts continued during the 1980s, leading to the start of the International Human Genome Project in 1990, which spearheaded many other sequencing projects (see next section).

Macromolecular X-ray crystallography and NMR techniques are also improving rapidly in this modern era of instrumentation, both in terms of obtained structure resolution and system sizes [144]. Stronger X-ray sources, higher-frequency NMR spectrometers, and refinement tools for both data models are leading to these steady advances. The combination of instrumentational advances in NMR spectroscopy and protein labeling schemes is suggesting that the size limit of protein NMR may soon reach 100 kDa [232, 236].

In addition to crystallography and NMR, cryogenic electron microscopy (cryo-EM) contributes important macroscopic views at lower resolution for proteins that are not amenable to NMR or crystallography (see Box 1.3 and Figure 1.3) [74, 211].

Together with recombinant DNA technology, automated software for structure determination, and supercomputer and graphics resources, structure determination at a rate of one biomolecule per day (or more) is on the horizon.

Box 1.2: PCR Application Examples

• Medical diagnoses of diseases and traits. DNA analysis can be used to identify gene markers for many maladies, like cancer (e.g., BRCA1/2, p53 mutations), schizophrenia, late Alzheimer's, or Parkinson's disease. A classic story of cancer markers involves Vice President Hubert Humphrey, who was tested for bladder cancer in 1967 but died of the disease in 1978. In 1994, after the invention of PCR, his cancerous tissue from 1976 was compared to a urine sample from 1967, only to reveal the same mutations in the p53 gene, a cancer-suppressing gene, that escaped the earlier recognition. Sadly, if PCR technology had been available in 1967, Humphrey may have been saved.

• Historical analysis. DNA is now being
used for genetic surveys in combination with archaeological data to identify markers in human populations.^14 Such analyses can discern ancestors of human origins, migration patterns, and other historical events [179]. These analyses are not limited to humans; the evolutionary metamorphosis of whales has recently been unraveled by the study of fossil material combined with DNA analysis from living whales [229].

^14 Time can be correlated with genetic markers through analysis of mitochondrial DNA or segments of the Y-chromosome. Both are genetic elements that escape the usual reshuffling of sexual reproduction; their changes thus reflect random mutations that can be correlated with time.

Historical analysis by French and American viticulturists also recently showed that the entire gene pool of 16 classic wines can be conserved by growing only two grape varieties: Pinot noir and Gouais blanc. Depending on your occupation, you may either be comforted or disturbed by this news.

PCR was also used to confirm that the Irish famine (potato crops were devastated in 1845–1846) was caused by the fungus P. infestans, a water mold; infected leaves were collected during the famine [175]. Studies showed that the Irish famine was not caused by a single strain called US-1, which causes modern plant infections, as had been thought. Significantly, the studies taught researchers that further genetic analysis is needed to trace recent evolutionary history of the fungus spread.

• Forensics and crime conviction. DNA profiling (comparing distinctive DNA sequences, aberrations, or numbers of sequence repeats among individuals) is a powerful tool for proving with extremely high probability the presence of a person (or related object) at a crime, accident, or another type of scene. In fact, from 1989 through 2002, 123 prisoners have been exonerated in the US, including 12 from death row and several others serving more than a decade in prison, and
many casualties from disasters like plane crashes and the 11 September 2001 New York World Trade Center terrorist attacks were identified from DNA analysis of assembled body parts. In this connection, personal objects analyzed for DNA [...] A new breed of high-tech detectives is emerging with modern scientific tools; see www.uio.no/[...], for example, for the use of bugs in crime research, and the 11 August 2000 issue of Science (volume 289) for related news articles.

• Family lineage and paternity identification. DNA fingerprinting can also be used to match parents to offspring. In 1998, DNA from the grave confirmed that President Thomas Jefferson fathered at least one child by his slave mistress, Sally Hemings, 200 years ago. The remains of Tsar Nicholas' family, executed in 1918, were recently identified by DNA. In April 2000, French historians with European scientists solved a 205-year-old mystery by analyzing the heart of Louis XVII, preserved in a crystal urn, confirming that the 10-year-old boy died in prison after his parents Marie Antoinette and Louis XVI were executed, rather than spirited out of prison by supporters (Antoinette's hair sample is available). Similar post-mortem DNA analysis proved false a paternity claim against Yves Montand.

See also the book by Reilly [172] for many other examples.

^15 George Johnson, "OJ's Blood and The Big Bang, Together at Last", The New York Times, Sunday, May 21, 1995.

Box 1.3: Cryogenic Electron Microscopy (Cryo-EM)

Proteins that are difficult to crystallize or study by NMR may be studied by cryogenic electron microscopy (cryo-EM) [74, 204, 211]. This technique involves imaging rapidly frozen samples of randomly oriented molecular complexes at low temperatures and reconstructing 3D views from the numerous planar EM projections of the structures. Adequate particle detection imposes a lower limit on the subject of several hundred kDa, but cryo-EM is especially good for large molecules with symmetry, as size and symmetry
facilitate the puzzle-gathering (3D image reconstruction) process.

Though the resolution is low compared to crystallography and NMR, new biological insights may be gained, as demonstrated for the ribosome [196, 21] and the recent cryo-EM solution of the 520-kDa tetramer of α-latrotoxin at 14 Å resolution [159], as shown in Figure 1.3. This solution represents an experimental triumph, as the system is relatively small for cryo imaging.

This toxic protein in the venom of black widow spiders (so called because the "cruel" females eat their mates) forms a tetramer only in the presence of divalent cations. This organization enables the toxin to adhere to the lipid bilayer membrane and form channels through which neurotransmitters are discharged. This intriguing system has long been used to study mechanisms of neurotransmitter discharge (synaptic vesicle exocytosis), by which the release of particles too large to diffuse through membranes triggers responses that lead to catastrophic neuro- and cardiovascular events.

[Figure 1.3: Top and side cryo-EM views of the 520-kDa tetramer of α-latrotoxin solved at 14 Å resolution [159]: (a) tetramer, top view; (b) side view. Images were kindly provided by Yuri Ushkaryov.]

With faster computers and improvements in 3D reconstruction algorithms, cryo-EM should emerge as a greater contributor to biomolecular structure and function in the near future.

1.5 Genome Sequencing

1.5.1 Projects Overview: From Bugs to Baboons

Spurred by this dazzling technology, thousands of researchers worldwide have been, or are now, involved in hundreds of sequencing projects for species like the cellular slime mold, roundworm, zebrafish, cat, rat, pig, cow, and baboon. Limited resources focus efforts into the seven main categories of genomes besides Homo sapiens: viruses, bacteria, fungi, Arabidopsis thaliana ("the weed"), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (roundworm), and M. musculus (mouse). For an overview of sequence data, see
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome, and for genome landmarks, readers can search the online collection available on www.sciencemag.org/feature/plus/sfg/special/index.shtml. The Human Genome Project is described in the next section.

The first completed genome reported was of the bacterium Haemophilus influenzae in 1995 (see also Box 1.4). Soon after came the yeast genome (Saccharomyces cerevisiae) in 1996 (see genome-www.stanford.edu/Saccharomyces/), the bacterium Bacillus subtilis in 1997, and the tuberculosis bacterium (Mycobacterium tuberculosis) in 1998. Reports of the worm, fruit fly, mustard plant, and rice genomes (described below) represent important landmarks, in addition to the human genome (next section).

Roundworm, C. elegans (1998)

The completion of the genome deciphering of the first multicellular animal, the one-millimeter-long soil worm C. elegans, made many headlines in 1998 (see the 11 December 1998 issue of Science, volume 282, and www.wormbase.org/). It reflects a triumphant collaboration of more than eight years between Cambridge and Washington University laboratories.

The nematode genome paves the way to obtaining many insights into genetic relationships among different genomes, their functional characterization, and associated evolutionary pathways. A comparison of the worm and yeast genomes, in particular, offers insights into the genetic changes required to support a multicellular organism.

A comparative analysis between the worm and human genome is also important. Since it was found that roughly one third of the worm's proteins (> 6000) are similar to those of mammals, automated screening tests are already in progress to search for new drugs that affect worm proteins that are related to proteins involved in human diseases. For example, diabetes drugs can be tested on worms with a mutation in the gene for the insulin receptor. Opportunities for studying and treating human obesity by targeting relevant proteins also exist: in early 2003, biologists have identified the genes that
regulate fat storage and metabolism in the roundworm (i.e., 305 genes that reduce and 112 genes that increase body fat) in a landmark experiment that inactivated nearly all of the animal's genes (i.e., expression was disrupted for 16,757 worm genes out of the predicted total of 19,757 that code for proteins) in a single experiment using new RNA interference technology [108] (see Chapter 6 on RNA).

The remarkable roundworm also made headlines in the Fall of 2002, when the Nobel Prize in Physiology or Medicine was awarded to Sydney Brenner, Robert Horvitz, and John Sulston for their collective work over 30 years on C. elegans on programmed cell death, apoptosis. This process, by which healthy cells are instructed to kill themselves, is vital for proper organ and tissue development and also leads to diseases like cancer and neurodegenerative diseases. Better knowledge of what leads to cell death and how it can be blocked helps to identify agents of many human disorders and eventually to develop treatments.

Fruit Fly, Drosophila (1999)

The deciphering of most of the fruit fly genome in 2000 by Celera Genomics, in collaboration with academic teams in the Berkeley and European Drosophila Genome projects, made headlines in March 2000 (see the 24 March 2000 issue of Science, volume 287, and www.fruitfly.org), in large part due to the ground-breaking "annotation jamboree" employed to assign functional guesses to the identified genes.

Interestingly, the million-celled fruit fly genome has fewer genes than the tiny, 1000-celled worm C. elegans (though initial reports of the number of worm genes may have been overestimated) and only twice the number of genes as the unicellular yeast. This is surprising given the complexity of the fruit fly, with wings, blood, kidney, and a powerful brain that can compute elaborate behavior patterns. Like some other eukaryotes, this insect has developed a nested set of genes with alternate splicing patterns that can produce more than one meaning from a given DNA sequence, i.e., different mRNAs from the same
gene). Indeed, the number of core proteins in both fruit flies and worms is roughly similar (8000 vs. 9500, respectively).

Fly genes with human counterparts may help to develop drugs that inhibit encoded proteins. Already, one such fly gene is p53, a tumor-suppressor gene that, when mutated, allows cells to become cancerous. The proteins of the humble baker's yeast are also being exploited to assess activity of cancer drugs.

Mustard plant: Arabidopsis (2000)

Arabidopsis thaliana is a small plant in the mustard family, with the smallest genome and the highest gene density so far identified in a flowering plant (125 million base pairs and roughly 25,000 genes). Two out of the five chromosomes of Arabidopsis were completed by the end of 1999, and the full genome (representing 92% of the sequence) was published one year later, a major milestone for genetics. See the 14 December 2000 issue of Nature, volume 408, and www.arabidopsis.org/, for example.

This achievement is important because gene-dense plants (25,000 genes versus 19,000 and 13,000 for the brain- and nervous-system-containing roundworm and fruit fly, respectively) have a complex repertoire of genes for the needed chemical reactions involving sunlight, air, and water. Understanding these gene functions and comparing them to human genes will provide insights into other flowering plants, like corn and rice, and will aid in our understanding of human life. Plant sequencing analysis should lead to improved crop production (in terms of nutrition and disease resistance) by genetic engineering and to new plant-based ingredients in our medicine cabinets. For example, engineered crops that are larger, more resistant to cold, and faster growing have already been produced.

Arabidopsis's genome is also directly relevant to human biological function, as many fundamental processes of life are shared by all higher organisms. Some common genes are related to cancer and premature aging. The much more facile manipulation and study of those
disease-related genes in plants, compared to human or animal models, is a boon for medical researchers.

Interestingly, scientists found that nearly two-thirds of the Arabidopsis genes are duplicates, but it is possible that different roles for these apparently duplicate genes within the organism might eventually be found. Others suggest that duplication may serve to protect the plants against DNA damage from solar radiation; a spare could become crucial if a gene becomes mutated. Intriguingly, the plant also has little "junk" (i.e., not gene-coding) DNA, unlike humans.

The next big challenge for Arabidopsis aficionados is to determine the function of every gene by precise experimental manipulations that deactivate or overactivate one gene at a time. For this purpose, the community launched a 10-year gene-determination project (a "2010 Project") in December 2000. Though guesses based on homology sequences with genes from other organisms had been made for roughly one half of the genes by the time the complete genome sequence was reported, much work lies ahead to nail down each function precisely. This large number of "mystery genes" promises a vast world of plant biochemistry awaiting exploration.

Mouse (2001, 2002)

The international Mouse Genome Sequencing Consortium (MGSC), formed in late fall of 2000, followed in the footsteps of the human genome project [48]. The mouse represents one of five central model organisms that were planned at that time to be sequenced. Though coming in the backdrop of the human genome, draft versions of the mouse genome were announced by both the private and public consortia in 2001 and 2002, respectively. At the end of 2002, the MGSC published a high-quality draft sequence and analysis of the mouse genome (see the 5 December 2002 issue of Nature, volume 420). The 2.5-billion-base mouse genome is slightly smaller than the human genome (3 billion bases in length), and the estimated number of mouse genes, around 30,000, is roughly similar to the number believed for humans.
Intriguingly, the various analyses reported in December 2002 revealed that only a small percentage (1%) of the mouse's genes has no obvious human counterpart. This similarity makes the mouse an excellent organism for studying human diseases and proposed treatments. But the obvious dissimilarity between mice and men (and women) also begs for further comparative investigations. Why are we not more like mice? Part of this question may be explained through an understanding of how mouse and human genes might be regulated differently. Related to this control of gene activation and function are newly discovered mechanisms for transcription regulation: the mouse genome analyses suggested that a novel class of genes called RNA genes (RNA transcripts that do not code for proteins) has other essential regulatory functions that may play significant roles in each organism's survival (see the discussion of noncoding RNAs in Chapter 6). As details of these mechanisms, as well as comparisons among human and other closely related organisms, emerge, explanations may arise.

In the meantime, genetic researchers have a huge boost of resources and are already exploiting similarities to generate expression patterns for genes of entire chromosomes, like chromosome 21, as a way to research specific human diseases like Down's syndrome, which occurs when a person inherits three instead of two copies of chromosome 21.

Rice (2002)

The second largest genome sequencing project, for the rice plant (see a description of the human genome project below), has been underway since the late 1990s in many groups around the world. The relatively small size of the rice genome makes it an ideal model system for investigating the genomic sequences of other grass crops like corn, barley, wheat, rye, and sugarcane. Knowledge of the genome will help create higher quality and more nutritious rice and other cereal crops. Significant impact on agriculture and economics is thus expected.

By May 2000, a rough draft (around 85%) of the
rice genome (430 million bases) was announced, another exemplary cooperation between the St. Louis-based agrobiotechnology company Monsanto (now part of Pharmacia) and a University of Washington genomics team.

In April 2002, two groups, a Chinese team from the Beijing Genomics Institute led by Yang Huanming and the Swiss agrobiotech giant Syngenta led by Stephen Goff, published two versions of the rice genome (see the 5 April 2002 issue of Science, volume 296), for the rice subspecies indica and japonica, respectively. Both sequences contain many gaps and errors, as they were solved by the whole-genome shotgun approach (see Box 1.4), but represent the first detailed glimpse into a world food staple. The complete sequences of chromosomes 1 and 4 of rice were reported in late 2002 (see the 21 November 2002 issue of Nature, volume 420).

Pufferfish: Fugu (2002)

The tiger pufferfish, a highly poisonous delicacy prepared by trained Japanese chefs, has long been a genomic model organism for Sydney Brenner, a founder of molecular biology and recipient of the 2002 Nobel Prize in Physiology or Medicine for his work on programmed cell death (see above, under Roundworm). The compact Fugu rubripes genome is only one-ninth the size of the human genome but contains approximately the same number of genes; shorter introns and less repetitive DNA account for this difference. The whole-genome shotgun approach (see Box 1.4) was used to sequence Fugu (see the 23 August 2002 issue of Science, volume 297). Comparative genomics analyses of this ancient vertebrate genome and of many others help us understand the extent of protein evolution (through common and divergent genes) and help interpret many human genes.

Other Organisms

Complete sequences and drafts of many genomes are now known (check websites such as www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome for status reports). Included are bacterial genomes of a microbe that can survive environments lethal for most
organisms and might prove useful as a metabolizer of toxic waste (D. radiodurans R1), a nasty little bacterium that causes diseases in oranges, grapes, and other plants (Xylella fastidiosa, decoded by a Brazilian team), and the bugs for human foes like cholera, syphilis, tuberculosis, malaria, the plague, typhus, and SARS (severe acute respiratory syndrome). Proteins unique to these pathogens are being studied, and disease treatments will likely follow (e.g., cholera vaccine).

Implications

The genomic revolution and the comparative genomics enterprises now underway will not only provide fundamental knowledge about the organization and evolution of biological systems in the decades to come [116] but will also lead to medical breakthroughs.

Already, the practical benefits of genomic deciphering have emerged (e.g., [151, 92]). A dramatic demonstration in 2000 was the design of the first vaccine to prevent a deadly form of bacterial meningitis, using a two-year gene-hunting process at Chiron Corporation. Researchers searched through the computer database of all the bacterium's genes and found several key proteins that, in laboratory experiments, stimulated powerful immune responses against all known strains of the Neisseria meningitidis Serogroup B Strain MC58 bug [166].

In April 2003, just two months after the first inklings of a deadly disease called SARS emerged from Asia, a global effort coordinated by the World Health Organization announced that it had mapped the genome of the coronavirus that causes this highly infectious disease. This was made possible by the new high-tech science era of internet links and sequencing methods. Finding a treatment for SARS remains a challenge, but the global virus hunt is a model par excellence of the current potential of genomics initiatives.

1.5.2 The Human Genome

The International Human Genome Project was launched in 1990 to sequence all three billion bases in human DNA [48]. The public consortium has contributions from many parts of the world (such as the United States,
United Kingdom, Japan, France, Germany, China, and more) and is coordinated by academic centers funded by NIH and the Wellcome Trust of London, headed by Francis Collins and Michael Morgan (with groups at the Sanger Centre near Cambridge, UK, and four centers in the United States; see www.nhgri.nih.gov/HGP).

In 1998, a competing private enterprise led by Craig Venter's biotechnology firm Celera Genomics and colleagues at The Institute for Genomic Research (TIGR), both at Rockville, Maryland (owned by the PE Corporation; see www.celera.com), entered the race. Eventually, this race to decode the human genome turned into a collaboration, due in part not only to international pressure but also to the fact that the different approaches for sequencing taken by the public and private consortia are complementary (see Box 1.4 and related articles [219, 82, 150] comparing the approaches for the human genome assembly).

Milestones

A first milestone was reached in December 1999, when 97% of the second smallest chromosome, number 22, was sequenced by the public consortium (the missing 3% of the sequence is due to 11 gaps in contiguity); see the 2 December 1999 issue of Nature, volume 402. Though small (43 million bases, < 2% of genomic DNA), chromosome 22 is gene rich and accounts for many genetic diseases (e.g., schizophrenia).

Chromosome 21, the smallest, was mapped soon after (11 May 2000 issue of Nature, volume 405) and found to contain far fewer genes than the 545 in chromosome 22. This opened the possibility that the total number of genes in human DNA is less than the 100,000 previously estimated. Chromosome 21 is best known for its association with Down's syndrome; affected children are born with three rather than two copies of the chromosome. Learning about the genes associated with chromosome 21 may help identify genes involved in the disease and, eventually, develop treatments. See a full account of the chromosome 21 story at www.sciam.com/explorations/2000/051500chrom21/.

Completion of the first draft of the human genome sequence project broke
worldwide headlines on 26 June 2000 (see, for example, the July 2000 issue of Scientific American, volume 283). This draft reflects 97% of the genome cloned and 85% of it sequenced accurately, that is, with 5- to 7-fold redundancy. Actually, the declaration of the "draft" status was arbitrary and even fell short of the 90% figure set as target. Still, there is no doubt that the human genome represents a landmark contribution to humankind, joining the ranks of other "Big Science" projects like the Manhattan Project and the Apollo space program. The June 2000 announcement also represented a truce between the principal players of the public and private human genome efforts and a commitment to continue to work together for the general cause.

A New York Times editorial by David Baltimore on the Sunday before the Monday announcement was expected underscored this achievement but also emphasized the work that lies ahead:

"The very celebration of the completion of the human genome is a rare day in the history of science: an event of historic significance is recognized not in retrospect, but as it is happening ... it is a moment worthy of the attention of every human; we should not mistake progress for a solution. There is yet much work to be done. It will take many decades to fully comprehend the magnificence of the DNA edifice built over four billion years of evolution and held in the nucleus of each cell of the body of each organism on earth."
David Baltimore, New York Times, 25 June 2000

Baltimore further explains that the number of proteins, not genes, determines the complexity of an organism. The gene number should ultimately explain the complexity of humans. Already in June 2000, the estimated total number of genes (50,000) was not too far from the number in a fly (14,000) or a worm (18,000). Several months after the June announcement, this estimate was reduced to 30,000 to 40,000 (see the 15 February 2001 issue of Nature, volume 409, and the 16 February 2001 issue
of Science, volume 291). This implies an equivalence of sorts between each human and roughly two flies. However, the estimated number has been climbing since then; see [29] for a discussion of the shortcomings of the gene identification process on the basis of the available genomic sequence. Intriguing recent findings that human cells make far more RNA than can be accounted for by the estimated 30,000 to 40,000 human genes may also suggest that the number of human genes is larger [109, 181]. Another explanation for this observation is that hidden levels of complexity exist in the human genome; for example, there may be more genes than previously thought that produce RNA as an end product in itself rather than as a messenger for protein production, as for other known roles of RNA [135]. Clearly, the final word on the number of human genes and the conserved genes that humans share with flies, mice, or other organisms awaits further studies.

Footnote 16: It has been said [202] that this day happened to be free in the diaries of US President Bill Clinton and Britain's Prime Minister Tony Blair, who proclaimed victory over the genome along with leading scientists.

Three years after the working draft of the human genome sequence was announced with much fanfare, the Human Genome Project as originally devised was declared complete to an accuracy of 99.9%: the international consortium of genome sequencing centers put all the fragments of the 3.1 billion DNA units of the human genome in order and closed nearly all of the gaps. The April 2003 date of this declaration was timed to coincide with the 50th anniversary of Watson and Crick's report of the structure of the DNA double helix. The number of human genes still remains uncertain, but most scientists believe it to be around 30,000. Some chromosome segments of the human genome, like those in chromosome Y, are more difficult to characterize as they are highly repetitive; fortunately, these segments may be relatively insignificant for the genome's overall function.
With determination of the human genome sequence for several individuals, variations (polymorphisms) in the DNA sequence that contribute to disease in different populations will be identified and analyzed. The next task is to define the proteins produced by each gene and understand the cellular interactions of those proteins. This understanding immediately impacts the development of disease diagnostics and treatments and opens new avenues for designer drugs. Undoubtedly, the determination of sequences for 1000 major species in the next decade will shed further light on the human genome.

Box 1.4: Different Sequencing Approaches

Two synergistic approaches have been used for sequencing. The public consortium's approach relies on a clone-by-clone approach: breaking DNA into large fragments, cloning each fragment by inserting it into the genome of a bacterial artificial chromosome (BAC), sequencing the BACs once the entire genome is spanned, and then creating a physical map from the individual BAC clones. The last part, rearranging the fragments in the order they occur on the chromosome, is the most difficult. It involves resolving the overlapped fragments sharing short sequences of DNA (sequence-tagged sites).

The alternative approach, pioneered by Venter's Celera, involves reconstructing the entire genome from small pieces of DNA without a prior map of their chromosomal positions. The [...] equipment. Essentially, this gargantuan jigsaw puzzle is assembled by matching sequence pieces as the larger picture evolves. The first successful demonstration of this piecemeal approach was reported by Celera for decoding the genome of the bacterium Haemophilus influenzae in 1995. This bacterium has a mere 1.8 million base pairs with an estimated 1700 genes, versus three billion base pairs for human DNA with at least 30,000 genes. The sequence of Drosophila followed in 1998 (140 million base pairs, 13,000 estimated genes) and was released to the public in early 2000 (see the 24 March 2000 issue
of Science, volume 287).

This shotgun approach has been applied to the human genome, which is more challenging than the above organisms for two reasons. The human genome is larger (requiring the puzzle to be formed from 70 million pieces) and has many more repeat sequences, complicating accurate genome assembly. For this reason, the public data were incorporated into the whole-genome assembly [82]. The whole-genome shotgun approach has also been applied to obtain a draft of the mouse (2001), rice (2002), and pufferfish (2002) genomes, for example.

The two approaches are complementary, since the rapid deciphering of small pieces by the latter approach relies upon the larger picture generated by the clone-by-clone approach for overall reconstruction. See a series of articles [219, 82, 150] scrutinizing those approaches as applied to the human genome assembly.

A Gold Mine of Biodata

The most up-to-date information on sequencing projects can be obtained from the U.S. National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine, which is developing a sophisticated analysis network for the human genome data. For information, see the Human Genome Resources Guide (www.ncbi.nlm.nih.gov/genome/guide/human; click on Map Viewer), the U.S. National Human Genome Research Institute's site (www.nhgri.nih.gov), that of the Department of Energy (DOE) at www.ornl.gov/hgmis, the site of the University of California at Santa Cruz at genome.ucsc.edu/, and others (see footnote 17).

Since 1992, NCBI has maintained the GenBank database of publicly available nucleotide sequences (www.ncbi.nlm.nih.gov). A typical GenBank entry includes information on the gene locus and its definition, organism information, literature citations, and biological features like coding regions and their protein translations. Many search and analysis tools are also available to serve researchers.

As the sequencing of each new human chromosome is being completed, the biological revolution is beginning to affect many
aspects of our lives [43], perhaps not too far away from Wilson's vision of "consilience". A gold mine of biological data is now amassing, likened to "orchards just waiting to be picked" (see footnote 18). This rich resource for medicine and technology also provides new foundations, as never before, for computational applications.

Footnote 17: Some useful web sites for genomic data include www.arabidopsis.org, www.ncbi.nlm.nih.gov/Sitemap/index.html, the Agricultural Genome Information System, Caenorhabditis elegans Genetics and Genomics, Crop Genome Databases at Cornell University, FlyBase, The Genome Database, Genome Sequencing Center (Washington University), GenomeNet, US National Agricultural Library, Online Mendelian Inheritance in Man, Pseudomonas aeruginosa Community Annotation Project, Saccharomyces Genome Database, The Sanger Centre, Taxonomy Browser, and UniGene.

Footnote 18: B. Sinclair in The Scientist, 19 March 2001.

Consequently, in fifty years' time, we anticipate breakthroughs in protein folding, medicine, cellular mechanisms, regulation, gene interactions, development and differentiation, history, population genetics, origin of life, and perhaps new life forms, through analysis of conserved and vital genes as well as new gene products. See the 5 October 2001 issue of Science, volume 294, for a discussion of new ideas, projects, and scientific advances that followed the sequencing of the human genome. Among the promising medical leaps are personalized and molecular medicine, perhaps in large part due to the revolutionary DNA microarray technology (see [73] and Box 1.5), and gene therapy.

Of course, information is not knowledge, but rather a road that can lead to perception. Therefore, these aforementioned achievements will require concerted efforts to extract information from all the sequence data concerning gene products.

Many initiatives are underway to process genetic data with the goal of understanding, and eventually treating, human diseases. For example, in 2003 Britain launched a "genetic census" Biobank project: the assembly of a database
of medical information based on 500,000 Britons (representing Britain's demographics), aimed at quantifying the combined genetic and environmental (e.g., pollution, smoking, exercise, diet) influence on common human ailments. Other national genetic database projects, with corresponding numbers of participants, are underway in Iceland (275,000), Sweden (80,000), Estonia (1 million), and Latvia (50,000). Private genomic database projects are also being assembled by the American Cancer Society (110,000), Mayo Clinic (200,000), and CARTaGENE (50,000). Companies like the Icelandic Decode Genetics are hunting for disease genes in these genealogies.

When leading scientists were queried in 2002 by the publisher of the web site Edge.org, devoted to science, to advise on action to take concerning the most pressing scientific issues in the world, physicist Freeman Dyson boldly suggested a planetary genome sequencing project to identify all the segments of the genomes of all the millions of species that live together on the planet. Dyson's vision for completing the sequencing of the biosphere within less than half a century aims to profoundly increase our understanding of the ecology of the planet, which could lead to environmental improvements and cures for human diseases. There is no doubt that a creative and well-engineered project combining technological innovations with biological data can have enormous ramifications on our lives.

Many societal, ethical, economic, legal, and political issues will also have to be addressed with these developments. Still, like the relatively minor Y2K (Year 2000) anxiety, these problems could be resolved in stride through multidisciplinary networks of expertise. For more on the ethical, legal, and social implications of human genome research, visit www.nhgri.nih.gov/ELSI.

In a way, sequencing projects make the giant leap directly from sequence to function, possible only when a homologous sequence is available whose function is known. However, the crucial middle aspect, structure, must be relied upon to make systematic functional links. This systematic interpolation and extrapolation between sequence and structure depends upon advances in biomolecular modeling, in addition to high-throughput structure technology (the human proteomics project).

The next chapter introduces some current challenges in modeling macromolecules and mentions important applications in medicine and technology.

Box 1.5: Genomics & Microarrays

DNA microarrays, also known as gene chips, DNA chips, and biochips, are becoming marvelous tools for linking gene sequence to gene products. They can provide, in a single experiment, an expression profile of many genes [73]. As a result, they have important applications to basic and clinical biomedicine. Particularly exciting is the application of such genomic data to personalized medicine or pharmacogenomics: prescribing medication based on genotyping results of both patient and any associated bacterial or viral pathogen [66]. Prescribing specific diets to affect health based on genetic responses to diet (nutritional genomics, or nutrigenomics) is another application gaining momentum.

Essentially, each microarray is a grid of DNA oligonucleotides (called probes) prepared with sequences that represent various genes. These probes are directed to a specific gene or to mRNA samples (called targets) from tissues of interest (e.g., cancer cells). Binding between probe and target occurs if the target RNA is complementary to the probe's nucleic acid sequence. Thus, probes can be designed to bind a target mRNA if the probe contains certain mutations. Single nucleotide polymorphisms, or SNPs, which account for 0.1% of the genetic difference among individuals, can be detected this way [138].

The hybridization event (the amount of RNA that binds to each cell grid) reflects the extent of gene expression (gene activity in a particular cell). Such measurements can be detected by fluorescence tagging of oligonucleotides. The color and intensity of the
resulting base-pair matches reveal gene expression patterns.

Different types of microarray technologies are now used (e.g., using different types of DNA probes), each with strengths and weaknesses. The technique of principal component analysis (PCA; see Chapter 14) has been shown to be useful in analyzing microarray data. [...] accurately. For example, false positives or false negatives can result from irregular target/probe binding (e.g., mismatches) or from self-folding of the targets, respectively. The problem of accuracy of the oligonucleotides has stimulated various companies to develop appropriate design techniques. Affymetrix Corporation, for example, has developed technology for designing silicon chips with oligonucleotide probes synthesized directly onto them, with thousands of human genes on one chip. All types of DNA microarrays rely on substantial computational analysis of the experimental data to determine absolute or relative patterns of gene expression.

Such patterns of gene expression (induction and repression) can prove valuable in drug design. An understanding of the enzymatic pathway affected by proven drugs, for example, may help screen and design novel compounds with similar effects. This potential was demonstrated for the bacterium M. tuberculosis based on experimental profiles obtained before and after exposure to the tuberculosis drug isoniazid [226].

For further information on microarray technology and available databases, see www.gene-chips.com, www.ncbi.nlm.nih.gov/geo, ihome.cuhk.edu.hk/~b400559/array.html, and industry.ebi.ac.uk/~alan/MicroArray, for example.

Computational Biophysics and Systems Biology
Introduction to Systems Biology
Prof. Jesús A. Izaguirre

Overview
- Stochastic vs. deterministic models
- Example: simple model of circadian cycle
- Reference: Vilar JMG, Kueh HY, Barkai N, Leibler S. Mechanisms of noise-resistance in genetic oscillators. PNAS, April 30, 2002, vol. 99, no. 9, 5988-5992.
- Stochastic vs. deterministic differential equations
- Explicit solvent vs. implicit solvent molecular dynamics
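simulations

Transcription and Translation
[Figure: transcription and translation in a prokaryotic cell and in a eukaryotic cell]

Fig. 1. Biochemical network of the circadian oscillator model. $D'_A$ and $D_A$ denote the number of activator genes with and without A bound to its promoter, respectively; similarly, $D'_R$ and $D_R$ refer to the repressor promoter; $M_A$ and $M_R$ denote mRNA of A and R; A and R correspond to the activator and repressor proteins; and C corresponds to the inactivated complex formed by A and R. The constants $\alpha$ and $\alpha'$ denote the basal and activated rates of transcription, $\beta$ the rates of translation, $\delta$ the rates of spontaneous degradation, $\gamma$ the rates of binding of A to other components, and $\theta$ the rates of unbinding of A from those components. Except if otherwise stated, in this paper we have assumed the following values for the reaction rates: $\alpha_A = 50\ \mathrm{h}^{-1}$, $\alpha'_A = 500\ \mathrm{h}^{-1}$, $\alpha_R = 0.01\ \mathrm{h}^{-1}$, $\alpha'_R = 50\ \mathrm{h}^{-1}$, $\beta_A = 50\ \mathrm{h}^{-1}$, $\beta_R = 5\ \mathrm{h}^{-1}$, $\delta_{M_A} = 10\ \mathrm{h}^{-1}$, $\delta_{M_R} = 0.5\ \mathrm{h}^{-1}$, $\delta_A = 1\ \mathrm{h}^{-1}$, $\delta_R = 0.2\ \mathrm{h}^{-1}$, $\gamma_A = 1\ \mathrm{mol}^{-1}\,\mathrm{h}^{-1}$, $\gamma_R = 1\ \mathrm{mol}^{-1}\,\mathrm{h}^{-1}$, $\gamma_C = 2\ \mathrm{mol}^{-1}\,\mathrm{h}^{-1}$, $\theta_A = 50\ \mathrm{h}^{-1}$, $\theta_R = 100\ \mathrm{h}^{-1}$, where mol means number of molecules. The initial conditions are $D_A = D_R = 1$ mol and $D'_A = D'_R = M_A = M_R = A = R = C = 0$, which require that the cell has a single copy of the activator and repressor genes: $D_A + D'_A = 1$ mol and $D_R + D'_R = 1$ mol. The cellular volume is assumed to be unity, so that concentrations and numbers of molecules are equivalent. Note that we assume that the complex breaks into R because of the degradation of A and, therefore, the parameter $\delta_A$ appears twice in the model.

Original set of equations

$$
\begin{aligned}
\frac{dD_A}{dt} &= \theta_A D'_A - \gamma_A D_A A \\
\frac{dD_R}{dt} &= \theta_R D'_R - \gamma_R D_R A \\
\frac{dD'_A}{dt} &= \gamma_A D_A A - \theta_A D'_A \\
\frac{dD'_R}{dt} &= \gamma_R D_R A - \theta_R D'_R \\
\frac{dM_A}{dt} &= \alpha'_A D'_A + \alpha_A D_A - \delta_{M_A} M_A \\
\frac{dA}{dt} &= \beta_A M_A + \theta_A D'_A + \theta_R D'_R - A\,(\gamma_A D_A + \gamma_R D_R + \gamma_C R + \delta_A) \\
\frac{dM_R}{dt} &= \alpha'_R D'_R + \alpha_R D_R - \delta_{M_R} M_R \\
\frac{dR}{dt} &= \beta_R M_R - \gamma_C A R + \delta_A C - \delta_R R \\
\frac{dC}{dt} &= \gamma_C A R - \delta_A C
\end{aligned}
$$

Transforming to stochastic equation
- Reaction rates such as $\delta$ become probability transition rates
- Concentrations such as A become numbers of molecules
- The result is called a Master Equation
- This can be simulated using, for example, Gillespie's algorithm
- Fluctuations, or noise, arise due to probabilistic behavior

Comparison of results
[Figure: time courses obtained from the deterministic equations and from the stochastic equations]

Model simplification
[Equations: reduced two-variable model for R and C, from Vilar et al.]

Results of model reduction
[Figure: full model vs. 2-variable model]

Steady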
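The stochastic simulation step named on the slide (Gillespie's algorithm applied to a Master Equation) can be illustrated on a much smaller system than the circadian oscillator. The following is a minimal sketch in Python, assuming a hypothetical one-species birth-death process: molecules are produced at a constant rate `k_prod` and each molecule degrades at rate `k_deg`. The function names, the reaction scheme, and the parameter values are illustrative assumptions and are not taken from Vilar et al.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, seed=0):
    """Gillespie direct method for the hypothetical reactions
    0 -> A (propensity k_prod) and A -> 0 (propensity k_deg * n)."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while t < t_end:
        a_prod = k_prod          # propensity of the production reaction
        a_deg = k_deg * n        # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total == 0.0:
            break                # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        # Pick which reaction fires, with probability proportional to its propensity.
        if rng.random() * a_total < a_prod:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


def time_average(times, counts):
    """Time-weighted mean of a piecewise-constant trajectory."""
    span = times[-1] - times[0]
    if span == 0.0:
        return float(counts[0])
    total = 0.0
    for i in range(len(times) - 1):
        total += counts[i] * (times[i + 1] - times[i])
    return total / span


if __name__ == "__main__":
    times, counts = gillespie_birth_death(k_prod=50.0, k_deg=1.0, n0=0, t_end=200.0, seed=1)
    print("events simulated:", len(times) - 1)
    print("time-averaged copy number:", time_average(times, counts))
```

With these illustrative values the time-averaged copy number should stay close to `k_prod / k_deg = 50`, the steady state of the corresponding deterministic rate equation, while an individual trajectory fluctuates around it. The same loop extends to a larger reaction network by keeping one propensity per reaction and choosing among them in proportion to their values.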
state vs. oscillation
[Figure: number of repressor molecules vs. time]

Fig. 5. Time evolution of R for the deterministic (Eq. [1]; a) and stochastic (b) versions of the model. The values of the parameters are as in the caption of Fig. 1, except that now we set $\delta_R = 0.05\ \mathrm{h}^{-1}$. For these parameter values, $\tau < 0$, so that the fixed point is stable.

Biological model databases
We will find this model in www.biomodels.net

Equations of motion
Atomic positions $\vec r_i$ obey
$$ m_i \frac{d^2}{dt^2}\vec r_i(t) = -\nabla_i U(\vec r_1(t), \vec r_2(t), \ldots, \vec r_N(t)), \qquad i = 1, 2, \ldots, N, $$
where $m_i$ are masses and the potential energy $U(\vec r_1, \vec r_2, \ldots, \vec r_N)$ is a sum of
- $O(N)$ potentials for bonded forces,
- $O(N^2)$ potentials for nonbonded forces.
Source: NSF Summer School on Theoretical and Computational Biophysics

Concise notation
- $x$: collection of positions $\vec r_i$
- $M$: diagonal matrix of masses
- $v$: collection of velocities
- $p = Mv$: collection of momenta
- $F(x) = -\nabla U(x)$: collection of forces
Equations are a Hamiltonian system:
$$ \frac{d}{dt}x(t) = M^{-1}p(t), \qquad \frac{d}{dt}p(t) = F(x(t)), $$
with Hamiltonian
$$ H(x, p) = \tfrac{1}{2}\, p^{T} M^{-1} p + U(x). $$

Implicit solvent
Dynamical equations are those of Langevin dynamics:
$$ \frac{d}{dt}x(t) = v(t), \qquad M\frac{d}{dt}v(t) = -\nabla U(x(t)) - k_B T\, D(x(t))^{-1} v(t) + \sqrt{2}\, k_B T\, \left(D^{-1/2}(x(t))\right)^{T} \frac{d}{dt}W(t), $$
where $U(x)$ includes a Poisson-Boltzmann solution, $D = D^{1/2} D^{T/2}$ is a diffusion tensor, and $W(t)$ is a standard Wiener process (its formal derivative is standard white noise), i.e., $W(t)$ is Gaussian with $\mathrm{E}\,W(t) = 0$ and $\mathrm{E}\,W_i(s) W_j(t) = \min(s, t)\,\delta_{ij}$.

Implicit solvent deviates from explicit solvent unless four layers of explicit solvent are included. Cheaper and less accurate is a generalized Born potential.

Conclusions
- Random fluctuations might be important to describe robustness of biological systems, as illustrated in the circadian cycle oscillator model of Vilar et al.
- Stochastic descriptions such as the Master Equation or stochastic differential equations generally do not have analytical solutions, so simulation is imperative.

Computational Biophysics
Amino Acids & Proteins
Instructor: Prof. Jesús A. Izaguirre
Textbook: Tamar Schlick, Molecular Modeling and Simulation: An Interdisciplinary Guide, Springer Verlag,
BerlinNew York 2002 Chapter 3 References Various websites indicated in the text Outline Machinery of life Amino acids Protein conformation framework Basic Definitions Gene expression whether or not the product a gene codes for is produced and in what amounts Basic dogma of molecular biology DNA is transformed into mRNA transcription mRNA codes for amino acids converted into a protein translation Proteins may undergo post translational modi cations for activity etc see below Protein The Machinery of Life Genes to Proteins DNA Lifetime Planquot mRNA Task List Protein Machines Wmms39rys39rmmammz Figure by MiT ocw Figu39e by M OCW Identi cation Relative Expresswn sin n modi cation S ants Levels Relative expression levels HarvardMIT Division of Health Source HPCGG Science amp Technoio Source MIT Open Course Ware MIT DNA Sequencing Transcription DNA gtRNA GGuanine ne TThymine DNA only UUraci RNA only DNA T U RNA DNA RNA sequence of nucleotide bases Parity Bitquot Analogy 397 Redundant information in second strand for error correc ion DNA deoxyribonucleic acid Figure by MIT OCW RNA ribonucleic acid Source MIT Open Course Ware MIT 6092 Translation SECOND BASE U C A G um pm u an I Lev WW Carbm39leuv Live a r A r coon m r mm mm m Lem r r 39 swam ma mm c AU AC w m Slum e AGE MC Asparngne AGE Sam M A 5quotquot 1 Subaru xud GU sac 65c mm A W n an n mum E E A BSA 5G m mm m iopnc mime swan Can Figure by MIT OCW HarvardMIT RNA Protein D n o Heanh Protein Sequence of Amino Acids Smence amp Technolo Source MIT Open Course Ware MIT 6092 nNg jMVAVAVA Uri er m Source Figure 3 from L Hunter ibidern Proteins Machine Examples We present some examples of proteins membranebound in these cases that serve a number of roles in the cell besides enzymes All of these have been modeled using computers Types of Proteins Type Examples Structural tendons cartilage hair nails Contractile muscles Transport hemoglobin Storage milk Hormonal insulin growth hormone Enzyme catalyzes reactions in cells 
Protection   immune response

Biological Membranes
Structure, function, composition, physicochemical properties.
[Figure: a patch of simulated POPE bilayer.]
Source for this and the following membrane protein slides: www.ks.uiuc.edu/emad/BIOPHYS490M

Protein-Membrane Interaction

Rhodopsins
- Visual receptors: vision; G-protein coupled receptor; sensitivity (signaling cascade); color vision (spectral tuning).
- Phototaxis: sensory rhodopsins.
- Light energy storage.

ATP synthase
- Proton gradient -> ATP (energy of life).
- Mechanical and chemical energy conversion.
- Hydrolysis and synthesis.

Rotation of F1-ATPase
[Figure: central stalk rotation in a simulated system of about 327,000 atoms.]

Aquaporin Water Channels
- Water transport and glycerol transport
- Permeation rate
- Substrate selectivity and stereoselectivity
- Filtering of ions and protons

MscL: Mechanosensitive Channel
- MscL is a bacterial safety valve (osmotic downshock, membrane tension, MscL gating).
- Roles in higher organisms include hearing, balance, excretion, and cardiovascular regulation.

Outline
1. Machinery of life
2. Amino acids
3. Protein conformation framework

Amino Acids
Proteins are macromolecules made up from 20 different amino acids. The heart of the amino acid is the so-called Cα, to which are bound an amino group, a carboxyl group, a hydrogen, and the side chain. The Cα, C, N, and O atoms are called backbone atoms.

Amino Acids
- Building blocks of proteins
- Carboxylic acid group
- Amino group
- Side group R gives unique characteristics
General structure: H2N-CH(R)-COOH, where R is the side chain.

The 20 Amino Acids
One-letter  Three-letter  Name
A           Ala           Alanine
C           Cys           Cysteine
D           Asp           Aspartic acid (Aspartate)
E           Glu           Glutamic acid (Glutamate)
F           Phe           Phenylalanine
G           Gly           Glycine
H           His           Histidine
I           Ile           Isoleucine
K           Lys           Lysine
L           Leu           Leucine
M           Met           Methionine
N           Asn           Asparagine
P           Pro           Proline
Q           Gln           Glutamine
R           Arg           Arginine
S           Ser           Serine
T           Thr           Threonine
V           Val           Valine
W           Trp           Tryptophan
Y           Tyr           Tyrosine

Some Amino Acid Links
- http://prowl.rockefeller.edu/aainfo/contents.htm : chemical structures, chemical classifications, peptide bond geometry
- http://en.wikipedia.org/wiki/Amino_acid : peptide bond formation, table of side chain properties, the 20 proteinogenic amino acids, post-translational modifications

20 Amino Acids
The 20 amino acids can, for example, be classified as follows:
- Aliphatic/hydrophobic: Ala, Leu, Ile, Val
- Polar: Asn, Gln
- Alcoholic: Ser, Thr, (Tyr)
- Sulfur-containing: Met, Cys
- Aromatic: Phe, Tyr, Trp, (His)
- Charged: Arg, Lys, Asp, Glu, (His)
- Special: Gly (no R), Pro (cyclic imino acid)
Several amino acids belong in more than one category.

Types of Amino Acids
- Nonpolar: R = H, CH3, alkyl groups, aromatic
- Polar: R contains polar groups with O, SH, or N
- Acidic: R = CH2COOH or COOH

Learning Check AA1
Identify each as (1) polar or (2) nonpolar:
A. NH2-CH2-COOH (Glycine)
B. NH2-CH(R)-COOH with a hydroxyl-bearing side chain R (Serine)

Solution AA1
A. Glycine: (2) nonpolar
B. Serine: (1) polar

Amino Acids as Acids and Bases
- Ionization of the NH2 and the COOH group
- A zwitterion has both a + and a - charge
- A zwitterion is neutral overall
Glycine: H2N-CH2-COOH. Zwitterion of glycine: +H3N-CH2-COO-.

pH and ionization
  +H3N-CH2-COOH (positive ion, low pH)  <->  +H3N-CH2-COO- (zwitterion, neutral pH)  <->  H2N-CH2-COO- (negative ion, high pH)
Adding H+ moves the equilibrium to the left; adding OH- moves it to the right.

Learning Check AA2
Structure 1: +H3N-CH(CH3)-COOH
Structure 2: H2N-CH(CH3)-COO-
Select from the above structures:
A. Alanine in base
B. Alanine in acid

Solution AA2
A. Alanine in base: H2N-CH(CH3)-COO- (structure 2)
B. Alanine in acid: +H3N-CH(CH3)-COOH (structure 1)

The Peptide Bond
An amide bond formed by the COOH of an amino acid and the NH2 of the next amino acid.
[Structure: +H3N-CH2-COOH and +H3N-CH(CH3)-COO- condensing into a dipeptide.]
Amino acids bind to form a protein. Upon binding, two protons from the NH3 and one oxygen from the carboxyl join to form a water. So the peptide bond has on one side a C=O and on the other side an N-H. Only the ends of the chain are NH3 or carboxylic. Which dipeptide is this?

The Peptide Bond
[Structure: H2N-CH(R1)-C(=O)-N(H)-CH(R2)-C(=O)-OH]
- phi (φ): torsion angle around the N-Cα bond
- psi (ψ): torsion angle around the Cα-C bond
- omega (ω): torsion angle around the bond between two amino acids (the peptide bond); almost always 180 degrees (trans), sometimes 0 degrees (cis)

Phi/Psi

Learning Check AA4
Write the three-letter abbreviations for the following tetrapeptide.
[Structure: tetrapeptide drawn with its four side chains.]

Solution AA4
Ala-Leu-Cys-Met

Outline
1. Machinery of life
2. Amino acids
3. Protein conformation framework

Ramachandran plots: all residues, and non-Gly non-Pro residues
[Figures: colour-ramped, log-scaled Ramachandran plots, (c) G. Kleywegt 1992-1999. All residues: total number of residues 81782. Non-Gly, non-Pro residues: total number of residues 71319.]

Secondary structure elements
- Helices: α-helix, 3_10 helix, π-helix, polyproline helix, collagen helix
- Extended: β-strand, turn, beta hairpin, beta bulge
- Supersecondary: coiled coil, helix-turn-helix, EF hand
Secondary structure propensities of amino acids:
- Helix-favoring: Methionine, Alanine, Leucine, Glutamic acid, Glutamine, Lysine
- Extended-favoring: Threonine, Isoleucine, Valine, Phenylalanine, Tyrosine, Tryptophan
- Disorder-favoring: Glycine, Serine, Proline, Asparagine, Aspartic acid
- No preference: Cysteine, Histidine, Arginine
Retrieved from http://en.wikipedia.org/wiki/Beta_sheet

Ramachandran plots of Ala and Val
[Figures: colour-ramped, log-scaled plots with phi on the x axis and psi on the y axis, both from -180 to 180 degrees. Ala: total number of residues 7179. Val: total number of residues 5331.]

Ramachandran Plots
- http://xray.bmc.uu.se/gerard/supmat/ramarev.html
- http://eds.bmc.uu.se/ramachan.htm (try PDB entries 3PTE, 6PTI, 1I6C)

Representation
A network is represented by a graph G = (V, E): V is a set of vertices and E is a set of edges between the vertices, namely $E = \{(u, v) \mid u, v \in V\}$.
Node = vertex; arc = edge.
- Directed vs. undirected (no directionality; assume bidirectional)
- Cyclic vs. acyclic (no path exists from any vertex to itself)
- Directed acyclic graph: Bayesian network
(Harvard-MIT Division of Health Sciences & Technology)

Networks
Communication networks
- Nodes are routers/phones
- Edges are phone lines
[Image removed due to copyright considerations.]

Networks
Biological networks
- Protein interaction networks: nodes are yeast proteins, edges are protein-protein interactions
- Gene regulation networks
- Metabolism: biochemical reactions
[Image removed due to copyright considerations: yeast protein interaction network.]

Types
Type: Detail
- Correlation graph (undirected graph): describes the positive or negative correlation between genes. Two related genes are connected with an undirected arc.
- Cause-effect graph (directed graph): describes the relationship caused by a gene acting upon another gene. Causality is represented by a directed arc whose direction shows the cause and effect.
- Weighted graph (in the broad sense): some qualitative meaning is attached to the arcs of the graph, e.g. an S-system or a Bayesian network.
Source: Genome Informatics 14: 104-113 (2003)

Adjacency
- Represent the graph as an n x n matrix A_ij, where n = number of vertices.
- Place a 1 (or other weight) for each edge in matrix element A_ij, where the edge goes from i -> j.

How many edges?
- There are n^2 elements in the matrix.
- There are no edges between a vertex and itself (no edge from A to A, etc.), leaving n^2 - n elements.
- However, since edges are bidirectional, we are double counting each edge. Use only one of the triangles.
- Maximum number of edges for n nodes: $(n^2 - n)/2$.

Properties
- Neighbors: vertices that have an edge between them.
- Degree: number of edges linking a given vertex to its neighbors. E.g., the degree is 3 for vertex C.
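
The adjacency matrix, the edge count, and the degree described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the course: the four-vertex graph and the function names are made up for the example, and the graph is chosen so that vertex C has degree 3, as in the slide.

```python
def adjacency_matrix(n, edges):
    """Return a symmetric n x n 0/1 matrix for an undirected graph."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = 1  # edge i -> j
        adj[j][i] = 1  # undirected, so also j -> i
    return adj


def max_edges(n):
    """(n^2 - n) / 2: drop the diagonal, then keep one triangle."""
    return (n * n - n) // 2


def count_edges(adj):
    """Count edges from the upper triangle only, to avoid double counting."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))


def degree(adj, i):
    """Number of edges linking vertex i to its neighbors (row sum)."""
    return sum(adj[i])


# Hypothetical example graph with edges A-B, A-C, B-C, C-D.
labels = ["A", "B", "C", "D"]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = adjacency_matrix(len(labels), edges)

print(max_edges(len(labels)))          # 6 possible edges for 4 vertices
print(count_edges(adj))                # 4 edges present
print(degree(adj, labels.index("C")))  # 3
```

A dense matrix is used here only because it mirrors the slide; for large sparse interaction networks an adjacency list is usually the more economical choice.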
Clustering Coefficient
- Cluster: reflects the tendency for the neighbors of a given vertex to be connected.
- Cluster coefficient: the number of edges n_i between the neighbors of vertex i, divided by the total possible number of edges between the k_i neighbors of vertex i:
  $C_i = \frac{2 n_i}{k_i (k_i - 1)}$
  If i = A, then k = 3 and $C_A = \frac{2(1)}{3(3 - 1)} = \frac{1}{3}$.
- Average cluster coefficient: the tendency of the graph to form clusters, mean(C_i) over all vertices i.

Erdos-Renyi Model: Random Network
- Growth model: edges to new nodes are added from existing nodes with equal probability.
- Degree distribution P(k), where k is the degree of a node: Poisson distribution.
- Average path length ~ ln N, where N is the number of nodes.
[Figure by MIT OCW.]
R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Rev. Mod. Phys. 74, 47 (2002)

Scale-free Network
- $P(k) \sim k^{-\gamma}$; γ < 3 implies scale-free.
- Growth model: add a new node with m edges to the existing network. The probability Π_i of adding an edge from vertex i to a new vertex increases as vertex i's degree k_i increases.
- Average path length ~ ln ln N, where N is the number of nodes. Therefore signaling is more efficient than in a random network.
[Figure by MIT OCW.]

Scale-free Network: power law distribution
[Figure by MIT OCW: degree distributions P(k) of (A) a random network, where most vertices have degree close to the average degree, (B) a scale-free network with hubs (high degree vertices), and (C) a hierarchical network. An airline route map illustrates hubs; courtesy of America West Airlines, used with permission.]

Robustness Under Attack
- Measure of network operation: S, the number of vertices in the largest subgraph in which a path exists from any vertex to any other vertex, normalized by the original size of the graph.
- If the failure is a random hit (remove a random node), a scale-free network is more likely to survive than a random network.
- If the failure is a targeted hit (remove the node that causes maximum damage), a scale-free network is more vulnerable than a random network.
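
The cluster coefficient C_i and the robustness measure S described above can be sketched as follows. This is a hedged illustration: both small graphs and all function names are assumptions made for the example. The first graph reproduces the worked value C_A = 1/3; the second is a star graph standing in for a network with a single hub.

```python
from collections import deque


def clustering_coefficient(adj, i):
    """C_i = 2 * n_i / (k_i * (k_i - 1)) for vertex i of an undirected graph.

    adj maps each vertex to the set of its neighbors.
    """
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no neighbor pairs to connect
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))


def largest_component_fraction(adj, removed=()):
    """S: size of the largest connected component after removing vertices,
    normalized by the original number of vertices."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over one component
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best / len(adj)


# Hypothetical graph: A has neighbors B, C, D, and only B-C are linked.
small = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
c_a = clustering_coefficient(small, "A")  # 2*1 / (3*2) = 1/3

# Hypothetical star graph: one hub connected to five leaves.
leaves = ["l1", "l2", "l3", "l4", "l5"]
star = {"hub": set(leaves)}
star.update({leaf: {"hub"} for leaf in leaves})

s_leaf_removed = largest_component_fraction(star, removed=["l1"])  # 5/6
s_hub_removed = largest_component_fraction(star, removed=["hub"])  # 1/6
print(c_a, s_leaf_removed, s_hub_removed)
```

Losing a leaf barely changes S, while losing the hub fragments the graph, which is the qualitative contrast between random and targeted failure drawn on the slide.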
[Images removed due to copyright considerations: failure and attack in random and scale-free networks.]

Application: Protein-Protein Interactions
- Proteins (vertices) with high degree, which interact with many other proteins directly, are more essential than ones with a low degree.
- Knocking out high degree proteins is more likely to result in catastrophic system failure.
- Drug target applications.
[Image removed due to copyright considerations: sample protein interaction network from a yeast study.]

Lethality and Centrality for Yeast Proteins
- Proteins are vertices; 2240 interactions are edges.
- Of proteins with degree ≤ 5, 21% are essential to yeast survival.
- 0.7% of proteins have degree > 15; 62% of these are essential.
- Lethality and connectivity are positively correlated: the correlation coefficient is 0.75.
[Image removed due to copyright considerations: complete yeast protein interaction network.]
Nature 2001 May 3; 411(6833): 41-2

Amy Keating
The Protein Interactome: a critical framework underlying systems biology
- Overview: the many levels of systems biology
- Experimental methods for measuring protein-protein interactions, and their limitations
- Data sources for information about proteins and their interactions
- Computational methods for assessing and predicting protein-protein interactions

Spectrum of Systems Biology
From most to least detailed:
- Detailed models: describe rates, concentrations, structure
- Low-resolution models: describe information flow, logic, mechanism
- Circuitry (logic/control): positive and negative regulation
- Connectivity/topology: who talks to whom; the interaction scaffold
- Parts list: protein and DNA sequences and structures
Recommended reading: Ideker & Lauffenburger, TRENDS in Biotechnology 2003, 21: 255-262

Spectrum of Systems Biology: methods
- Detailed models: differential equations (allow simulation and comparison with data)
- Low-resolution models: Boolean and Markov models
- Circuitry (logic/control): Bayesian networks
- Connectivity/topology: graph theory
- Parts list: databases
Recommended reading: Ideker &
Lauffenburger, TRENDS in Biotechnology 2003, 21: 255-262

Spectrum of Systems Biology: data and questions
- Detailed models: rates of individual reactions, protein concentrations in the cell, extent of phosphorylation, diffusion rates
- Low-resolution models: which elements are most crucial; combinatorial dependencies
- Circuitry (logic/control): expression profiling; post-translational modifications in response to different stimuli; identify pathways and clusters; does an interaction activate or repress; are multiple components required
- Connectivity/topology: protein-protein, protein-DNA, and protein-small molecule interactions
- Parts list: genome sequencing projects, gene finding algorithms, EST libraries

Protein-protein and protein-DNA interactions at the genomic level
Saccharomyces cerevisiae as a model organism
- A very simple eukaryote: yeast as a model for human
- Genome: 12053 kb, sequenced in 1996
- 5800 protein-coding genes
- Easy to do genetics in yeast
- Many regulatory and metabolic pathways are at least partly conserved between yeast and higher eukaryotes
- Many human disease genes have yeast orthologs
[Saccharomyces cerevisiae image from SGD(TM), provided by Peter Hollenhorst and Catherine Fox. Used with permission.]

Small-scale interaction experiments
- Protein-protein interactions: pull-down (GST, Ni affinity), co-immunoprecipitation, cross-linking; more biophysical and quantitative: fluorescence, CD, calorimetry, surface plasmon resonance
- Protein-DNA interactions: mostly by gel shift assay
Many, many thousands of such experiments have been done and reported in the literature, but how do you get the information out? This is hard, and it is an important problem in modern biology.
PreBIND is a machine learning application that can automatically extract from the literature information about whether two proteins interact: http://www.blueprint.org/products/prebind/prebind.html
Small-scale experiments are generally the most reliable, though still rife with false negatives and false positives.
Molecular Interaction Networks
- Importance of identifying interactions: they define protein function
- Types: protein-protein interactions, regulatory circuits, complex pathways of metabolic reactions
(University of Notre Dame, Laboratory for Computational Life Sciences, LCLS)

Types of Molecular Networks in the Cell
- Protein-protein physical interaction networks: physical interaction
- Protein-protein genetic interaction networks: enhanced or suppressed phenotype on mutation
- Expression networks: coexpressed genes are connected; not necessarily coregulated
- Regulatory networks: protein-DNA interactions, e.g. mRNA translation and DNA transcription [figure labels: RNA polymerase, transcription factors, mRNA, large subunit, small subunit, 16S rRNA]
- Metabolic networks: biochemical reactions within metabolic pathways
- Signaling networks: signal transduction pathways, e.g. ligands binding the extracellular domain of receptors, internalization, the cytosolic domain, and transduction events

Mapping Networks: Techniques
Yeast two-hybrid (Y2H) screens
- No protein purification, isolation, or manipulation
- Applicable to any protein pair
- Based on the Independent Domain Model (IDM)
[Figure labels: RNA polymerase II, reporter gene.]
Y2H screen approaches
- Matrix (one-one)
- Array (one-many)
- Pooling (many-many)
Y2H in other organisms: C. elegans

Yeast 2-hybrid assay
[Figure: a vector with an activation domain-ORF fusion and a vector with a DNA-binding domain-B fusion; plate on -His media. Images from Stan Fields's web site (depts.washington.edu/sfields), courtesy of Stanley Fields, used with permission.]

Yeast 2-hybrid assay
Pros:
- easy and fast; no purification required
- in vivo conditions
- can be adapted for high-throughput screens
- can detect transient interactions
Cons:
- prone to false negatives: protein doesn't fold; protein doesn't localize to the nucleus; interference from endogenous protein; fusion protein doesn't interact like the native protein; fusion may be toxic to the cell
- prone to false positives: auto-activation; indirect interactions
- no control over post-translational modification
- only tests binary interactions
- not quantitative

Yeast 2-hybrid assay for an entire genome
Uetz et al., Nature 2000, 403: 623-627. Two strategies:
1. Array approach: 6000 activation domain hybrid transformants mated to 192 DNA-binding domain fusion transformants. Only 20% of interactions (281) reproducible; many auto-activate; 3.3 positives per interaction-competent protein.
2. "High-throughput screen" approach: 5345 ORFs cloned separately into DNA-binding and activation domain plasmids; 2 reporter genes; DBD fusions pooled and mated to AD fusions; 12 clones per pool sequenced. Gave 692 unique interactions, 472 seen more than once; 1.8 positives per interaction-competent protein.
Ito et al., PNAS 2001, 98: 4569-4574. For both DBD and AD, make 62 pools of 96 proteins and mate all pools against all. Gave 4549 interactions, 841 observed ≥ 3 times (core data).
The potential number of interactions is huge, and the number of real interactions is probably very large (> 10000); these studies characterize only a tiny fraction (low coverage).

Example: screen of the AD array with RPC19 as a bait
[Figure: the array is mated to the RPC19 bait, and diploids are pinned to selective plates on which only colonies containing an interacting pair grow. From Stan Fields's web site.]
Courtesy of Stanley Fields. Used with permission.

Additional cons when you do a large-scale 2-hybrid screen
- PCR amplification gives mutations; generally one doesn't sequence everything to confirm.
- Cloning and transformation inefficiencies.
- If baits are pooled, slow-growing cells will lose to faster ones, giving false negatives.
- An all-vs-all assay contains many implausible interactions: proteins that aren't colocalized or expressed at the same time.
- Can only sequence a small fraction of the positive clones.
High-throughput Y2H screens miss as many as 90% of Y2H interactions observed in focused small-scale studies.

Affinity Purification
What do you mean by an interaction?
- Most proteins interact with several other proteins (estimate 2-10).
- Many proteins in the cell are found in complexes.
- For some purposes, knowing the identities of the members of the clusters is as useful as, or more useful than, knowing the directly interacting partners.
- Affinity purification is a method for characterizing the clusters directly, rather than one interaction at a time.

Affinity Purification / Mass Spectrometry
1. DNA encodes bait + tag.
2. The bait is expressed in the cell and forms part of a complex.
3. Lyse the cells; fish for the complex with an affinity column that binds the tag.
4. Separate the components (a, b, c, d, e, BAIT) by SDS-PAGE gel.
5. Extract bands and digest with trypsin.
6. Mass spectrometry and database search give the identities of the proteins in the complex.

Affinity purification/mass spectrometry for an entire genome
Gavin et al., Nature 2002, 415: 141-147 (Cellzome)
- 1167 bait proteins; TAP tag inserted at the 3' end of the gene; proteins under endogenous promoter; 2 rounds of purification
- 232 distinct complexes with 2 to 83 proteins per complex
- New cellular role proposed for 344 proteins
- To assess confidence: repeat the experiment (only 70% reproducible using the same bait); use different proteins in the complex as the bait and see if you recover the same proteins in the complex.
Ho et al., Nature 2002, 415: 180-183 (MDS Proteomics)
- 725 bait proteins, 1578 interacting proteins
- FLAG tag; proteins transiently overexpressed
- To assess confidence: 74% of interactions reproducible in small-scale co-IP/blot.

Affinity/ms assay
Pros:
- get the whole complex
- proteins that purify together are likely to share a function
- very sensitive (can detect 15 copies per cell)
- in vivo conditions
- can be adapted for high-throughput screens
Cons:
- doesn't determine direct interactions
- not reliable for small proteins (< 15 kD)
- affinity tag may interfere with interactions or with the function of essential proteins
- prone to false positives, e.g. sticky proteins
- prone to false negatives: won't get every protein every time; the complex must survive purification
- not quantitative

Array Detection of Protein-Protein Interactions
[Figure: purified peptides or proteins printed N x N on a glass or Ni surface and probed with labeled proteins.]
Highly purified proteins were denatured using GdnHCl and printed onto aldehyde-derivatized glass slides using a commercial split-pin arrayer. GdnHCl was used to prevent homodimerization on the surface. 49 human proteins plus 3 duplicates plus 10 yeast proteins were printed in quadruplicate 62 times. The 62 proteins were independently labeled with Cy3 dye and denatured with GdnHCl. Peptides were diluted from GdnHCl as they were added to the arrays. Following a brief incubation, slides were washed, dried, and scanned, yielding N x N measurements, in quadruplicate, of coiled-coil (cc) interactions. The assay was repeated at concentrations ranging from 160 pM to XXX nM.

Array Detection of Protein-Protein Interactions
MacBeath & Schreiber, Science 2000: proof of principle for three types of interactions
- protein-protein: protein G with IgG, FRAP with FKBP12, p50 with IκBα
- protein-small molecule: biotin with streptavidin, Ab with DIG (steroid) ligand
- enzyme-substrate: kinases PKA, Erk2
Zhu et al., Science 2001: assay of 5800 yeast genes with calmodulin and phospholipids
Newman & Keating, Science 2003: assay of 48 x 48 human bZIP transcription factor coiled coils, plus 10 x 10 yeast

Protein microarrays
Pros:
- N x N interactions at once
- direct interaction assay
- reagents can be well characterized
- solution conditions are controlled
- can be quantitative
- requires very little protein
- can be adapted for high-throughput screens
- few false positives
Cons:
- tedious purification required, or else interactions may not be direct
- surface may perturb folding or interactions
- doesn't mimic in vivo conditions
- not yet a mature technology; possibly not a good general approach

Overlap of high-throughput interaction studies is LOW
                  Ito (Y2H)  Uetz (Y2H)  Gavin (TAP/ms)  Ho (FLAG/ms)
Ito (2-hybrid)    4363       186         54              63
Uetz (2-hybrid)              1403        54              56
Gavin (affinity)                         3222            198
Ho (affinity)                                            3596
Small scale       442        415         528             391
Data from Salwinski & Eisenberg, Current Opinion in Structural Biology 2003, 13: 377-382

Lesson
Lots of protein-protein interaction data is now available for yeast, but it is not very reliable and it is not nearly comprehensive. Nevertheless, these data have inspired the development of many computational methods. To facilitate computational analysis, the data need to be disseminated in a usable form. This is often a rate-limiting step in systems biology.

Databases that store interaction data
- Database of Interacting Proteins (DIP)
- Biomolecular Interaction Network Database (BIND)
- Molecular Interactions Database (MINT)
- INTERACT
- MIPS: contains interaction data, both direct and clusters, for yeast

DIP
[Screenshot: DIP browse page listing interaction partners of murine p53. Lab of David Eisenberg, http://dip.doe-mbi.ucla.edu]

DIP interaction details
[Screenshot: DIP interaction detail pages for cellular tumor antigen p53, including its interaction with the p19ARF tumor suppressor protein, with evidence, method, and source fields.]

DIP interaction statistics as of May 2004
[Table: database statistics, including number of proteins (17043), number of organisms (107), and number of interactions (44349), with per-organism counts of proteins, interactions, and experiments. Figure: yeast interactions by experiment type; bars mark interactions identified in more than one experiment.]

BIND
- Designed to hold direct interaction, cluster, and pathway data
- Interactions are written in ASN.1 (Abstract Syntax Notation) for computational efficiency
Bader GD, Betel D, Hogue CW (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 31: 248-50

Gene Ontology (GO): an organizational framework for storing interaction and function data
http://www.geneontology.org
What is the function of a protein? Not an easy question to answer. There are many aspects to function, and different people might (and do) describe the function of one protein in many different ways.
In GO, gene products (mostly proteins) are described using descriptors in three categories:
- Molecular function: the activity carried out by the gene product at the molecular level, e.g. histidine kinase, alcohol dehydrogenase
- Biological process: a multi-step process such as cell division, DNA replication, signal transduction
- Cellular component: part of a cell that is a part of a larger structure, e.g. ribosome, spindle pole, kinetochore

The hierarchical structure of GO
Molecular function, biological process, and the cellular compartment where a gene product is found can be described at many levels of detail. GO uses a hierarchical description in which each protein gets a set of terms at different levels, from less detailed to more detailed. The ontology structure is that of a directed acyclic graph.

Examples (p38, fly)
- Molecular function: MAP kinase activity (accession GO:0004707). [Screenshot: synonyms, definition, and term lineage.]
- Biological process: regulation of innate immune response (accession GO:0045088). Definition: any process that modulates the frequency, rate, or extent of the innate immune response, the organism's first line of defense against infection. [Screenshot: term lineage.]
- Cellular component: nucleus (accession GO:0005634). Definition: a membrane-bounded organelle of eukaryotic cells that contains the chromosomes. It is the primary site of DNA replication and RNA synthesis in the cell. [Screenshot: term lineage.]

Where do the GO terms come from?
A team of experts is responsible for assigning GO annotation. Every term comes with an evidence code describing where the annotation came from. Sample evidence codes:
- IDA: inferred from direct assay (enzyme assay, cell fractionation)
- IPI: inferred from physical interaction (2-hybrid)
- IGI: inferred from genetic interaction (suppressor, synthetic lethal)
- IEP: inferred from expression pattern (microarray)
- IMP: inferred from mutant phenotype
- ISS: inferred from sequence or structure similarity
- TAS: traceable author statement
- NAS: non-traceable author statement

Advantages of GO
- Controlled vocabulary: everyone can communicate using the same terms.
- Designed to apply across species. Linked to genomic databases like TIGR, FlyBase, SGD (yeast), WormBase, MGD (mouse), RGD (rat), TAIR (Arabidopsis), ZFIN (zebrafish), and more.
- Makes it possible to compute on protein function and localization. Formal relationships between proteins can be defined based on their GO annotation.
- Flexible: GO is always growing and changing. New terms can be added to the hierarchy.
- Free and open for the use of the community.

Computational methods for improving the quality of interaction data
1. Assessment and validation: improve accuracy
2. Prediction: improve coverage

Assessing and filtering interaction data
1. Promiscuity criteria
In most high-throughput interaction studies a few proteins are observed to interact promiscuously. Generally these are removed from the analysis. Problem: some of these interactions may be real.
Examples:
- Affinity purification/ms: even with no bait, 17 proteins were found in pull-downs by Gavin et al. 49 other proteins found to have a similar frequency of interaction to these false positives were thrown out.
- Yeast 2-hybrid: proteins observed to make many interactions in many screens are usually discarded as probable false positives.

Assessing and filtering interaction data
2. Overlap criteria
A. With other interaction data: the intersection is low. In 2001, 2000
high-throughput measurements were confirmed by small-scale experiments.
B. With non-interaction data, e.g. annotations in YPD (yeast protein databank; YPD is now proprietary at Incyte).
Please see figures 1 and 2 of Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios, and David Eisenberg. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Mol Cell Proteomics 1 (May 2002): 349-356.

Overlap with expression data: Expression Profile Reliability (EPR)
Please see figure 4 of Deane et al. 2002.
Note: proteins involved in true protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Assessing and filtering interaction data: Expression Profile Reliability (EPR)
- Assume the observed distribution results from true interactions and false-positive interactions. The observed distribution is expressed as a weighted sum of these two contributions (see figure 4 of Deane et al. 2002).
- Estimate the distribution for non-interactions using all protein-protein pairs (assume interactions are rare).
- Estimate the distribution for true interactions using small-scale experiments.
- Fit a parameter α_EPR to estimate how many high-throughput interactions are true positives vs. false positives:

  $F_{exp}(d^2) = \alpha_{EPR} \cdot F_{int}(d^2) + (1 - \alpha_{EPR}) \cdot F_{noint}(d^2)$

- Best fit: α is about 31%, so roughly 70% of high-throughput pairs are false positives. But the method doesn't tell you which interactions these are.
- Other methods have estimated that 50% of yeast 2-hybrid pairs are true positives.

Assessing and filtering interaction data: homology methods
Paralogous Verification PVM Sequence A candidate interact0n Sequence B PSIBLAST Win genome list ofparaogs PVM score 2 non AB interactions Deane et al Mol amp Cell Proteomics 2002 15 349356 PVM is very specific but not very sensitive three different highconfidence interaction datasets WP indicates proetin paris with Z 1 paralog for A or B Please see figure 5 of Deane Charlotte M Lukasz Salwinski loannis Xenarios and David Eisenberg quotProtein Interactions Two Methods for Assessment of the Reliability of High Throughput Observationsquot Mol Cell Proteomics 1 May 2002 349356 Points on this plot come from using different PVM score cutoffs to designate a true interaction It is an example of a receiveroperator characteristic ROC curve which is commonly used to illustrate the tradeoff between sensitivity vs specificity PVM is very selective if a pair scores by PVM it is almost certainly a true positive xaxis gt low false positive rate However PVM does not achieve good coverage it is not sensitive yaxis At most PVM can confirm 50 of highconfidence examples This is at least partly because many examples of paralogous complexes are sparse Assessing and ltering interaction data DIPCORE is a set of 3003 interactions considered higher confidence DIPCORE interactions either 1 Have been observed in a smallscale experiment 2246 2 Have been observed in more than one experiment 1179 3 Have been confirmed by PVM 1428 Pmmns V Swasrmw u deDestriplion cellulanumor Amigaquot s w rm Namemescriplinn plQARF mum snppm e Hcip Meduan DrlaiJs Suurce venflcatlon eld indicates that one 1 smallscale experiment supports this interaction Deane et aI Mol ampCell Proteomis 2002 15 349356 Predicting proteinprotein interactions 1 Sequence methods How can you predict that an Interact0n might occur between two protems based purely on sequence data Review Valencia amp Pazos Current Opinion in Structural Biology 2002 12 368373 Predicting proteinprotein interactions 1 Sequence methods phvlooenetic profiles 
based on the joint presence/absence of a pair of proteins in a large number of genomes (recall the first literature discussion class, Pellegrini et al.)
- coevolution, as assessed by similarity of phylogenetic trees: the mirrortree method compares the distance matrices used for generating trees; requires lots of sequences and a good alignment
- gene fusions: genes encoding interacting proteins in one organism are sometimes fused into a single gene in another. Look for these occurrences.
- gene neighborhood: for bacteria, the arrangement of genes in operons means that interacting proteins are often encoded in adjacent sites in the genome

Review: Valencia & Pazos, Current Opinion in Structural Biology 2002, 12, 368-373

Predicting protein-protein interactions 1: Sequence methods

- correlated mutations: the idea is that interacting positions on different proteins should co-evolve so as to maintain the interface. Look for correlation between sequence changes at one position and those at another position in a multiple sequence alignment (MSA).

Recall Süel et al., "Evolutionarily conserved networks of residues mediate allosteric communication in proteins":

$$\Delta\Delta G_{i,j} = \Delta G_j - \Delta G_{j|i}, \qquad \text{where } \Delta G = kT \ln\left(P^{x}_{\text{at } j} / P^{x}_{MSA}\right)$$

Pazos & Valencia, "In silico two-hybrid systems for the selection of physically interacting protein pairs," PROTEINS 2002, 47, 219-227

The Pearson coefficient

$$r_{ij} = \frac{\sum_{k,l} \left(S_{ikl} - \langle S_i \rangle\right)\left(S_{jkl} - \langle S_j \rangle\right)}{\text{normalization}}$$

describes the correlation between amino acid positions $i$ and $j$ in two proteins. Here $S_{ikl}$ is a measure of the similarity of the amino acids at position $i$ in sequences $k$ and $l$, and $\langle S_i \rangle$ is the average of these values. $k$ and $l$ are sequences taken from an MSA that has the same number of sequences from the same species for sites $i$ and $j$.

Problem: need lots of sequences, and the method is very sensitive to the alignment used.

Review: Valencia & Pazos, Current Opinion in Structural Biology 2002, 12, 368-373

Predicting protein-protein interactions 2: Structure-based methods

Docking is a large field in and of itself, which involves predicting how two known structures
will interact. It even has its own prediction contest, CAPRI (like CASP). The main issues in docking are, as always when modeling structure: 1) sampling the conformational space and 2) selecting the correct solution.

- Docking approaches require structures of both interacting components.
- Frequently, conformational changes accompany protein interactions. Docking methods generally require a structure of the bound conformation to predict interactions correctly. Modeling conformational flexibility is hard.
- We don't have enough structures or good enough docking methods to make high-throughput prediction of protein-protein interactions practical at this point.

Predicting protein-protein interactions 2: Structure-based methods

What do you do when you don't have a structure? Homology modeling methods (Aloy & Russell, PNAS 2002, 99, 5896-5901)

For target proteins that have homologs that form a complex of known structure:
1. Identify pairs of positions that form interactions in the known structure.
2. Align the target proteins to the template proteins and score the interacting residue pairs identified in step 1 with a knowledge-based potential.
3. Normalize using the scores for pairs of random sequences.
4. Z-scores above a certain cutoff indicate that a complex is likely.

~65% accuracy when assessing whether different fibroblast growth factors bind to various receptors (4 structures available, 252 possible pairings evaluated). Not practical to apply at the genome level due to the lack of homologous complexes with structures.

Predicting protein-protein interactions 2: Structure-based methods

What do you do when you don't have a structure? Threading methods (Lu et al., MULTIPROSPECTOR)

Phase I: Thread each target sequence onto a library of folds using a permissive cutoff.
Phase II: Take pairs of fold assignments and thread the targets onto complexes of these folds (complexes of known structure). Evaluate an interfacial score to determine how complementary the fit is:

$$S_{interface} = -\log\left[\frac{N_{\text{obs in PDB}}(i,j)}{N_{\text{expected by chance}}(i,j)}\right]$$

Used a library of 768 complexes;
predicted 7321 interactions for yeast proteins.

Hard to assess performance. One way is to look at some property that you believe should correlate with interactions, e.g. co-localization or function.

Lu et al., PROTEINS 2002, 49, 350-364; Genome Research 2003, 13, 1146-1154

2 Structure-based methods: Threading methods

Lu, L., H. Lu, and J. Skolnick. "MULTIPROSPECTOR: An Algorithm for the Prediction of Protein-protein Interactions by Multimeric Threading." Proteins 49, no. 3 (15 November 2002): 350-64.

Co-localization: are the proteins found in the same part of the cell?

Please see figure 2 of Lu, Long, Adrian K. Arakaki, Hui Lu, and Jeffrey Skolnick. "Multimeric Threading-Based Prediction of Protein-Protein Interactions on a Genomic Scale: Application to the Saccharomyces cerevisiae Proteome." Genome Res 13 (June 2003): 1146-1154.

Predicting protein-protein interactions 3: Methods based on data

Jansen et al. (next class)

Next class: literature about protein-protein interaction assessment and prediction

Bader et al., "Gaining confidence in high-throughput protein interaction networks," Nature Biotechnology 2004, 22, 78-85
Jansen et al., "A Bayesian networks approach for predicting protein-protein interactions from genomic data," Science 2003, 302, 449-453

Focus on:
1. What are they trying to do?
2. What do they use as a set of positive and negative examples?
3. What is their basis for deciding if an interaction is good or not?
4. How well do the methods work? How can you tell?
5. Do they learn anything new or exciting about interactions in the proteome?

Statistical Mechanics and Molecular Dynamics

IMA Workshop on Classical and Quantum Approaches in Molecular Modeling

Mark Tuckerman, Dept. of Chemistry and Courant Institute of Mathematical Sciences, 100 Washington Square East, New York University, New York, NY 10003

Lecture Outline
- Hamiltonian systems and Liouville's theorem
- The Liouville equation and equilibrium solutions
- The microcanonical ensemble
- The canonical ensemble
- Linear response theory and transport properties

Molecular
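Dynamics

A physical system described at an atomistic level consists of $N$ atoms. The dynamics of the system can be described using the laws of classical mechanics. Each atom experiences a force due to all the other atoms in the system and any other external influences. Hence, at any instant in time, there will be $N$ forces $\mathbf{F}_1, \ldots, \mathbf{F}_N$. The forces give rise to accelerations $\mathbf{a}_1, \ldots, \mathbf{a}_N$ according to Newton's second law of motion:

$$\mathbf{F}_i = m_i \mathbf{a}_i = m_i \frac{d^2 \mathbf{r}_i}{dt^2}, \qquad i = 1, \ldots, N$$

From these equations, we seek to determine the positions $\mathbf{r}_1(t), \ldots, \mathbf{r}_N(t)$ and velocities $\mathbf{v}_1(t), \ldots, \mathbf{v}_N(t)$ of all atoms in the system as functions of time.

Hamiltonian Mechanics

Hamiltonian:

$$H = \sum_{i=1}^{N} \frac{\mathbf{p}_i^2}{2 m_i} + U(\mathbf{r}_1, \ldots, \mathbf{r}_N)$$

Equations of motion:

$$\dot{\mathbf{r}}_i = \frac{\partial H}{\partial \mathbf{p}_i} = \frac{\mathbf{p}_i}{m_i}, \qquad \dot{\mathbf{p}}_i = -\frac{\partial H}{\partial \mathbf{r}_i} = -\frac{\partial U}{\partial \mathbf{r}_i} = \mathbf{F}_i, \qquad i = 1, \ldots, N$$

Initial conditions: $\mathbf{p}_1(0), \ldots, \mathbf{p}_N(0), \mathbf{r}_1(0), \ldots, \mathbf{r}_N(0)$

Energy conservation:

$$\frac{dH}{dt} = \sum_{i=1}^{N} \left[\frac{\partial H}{\partial \mathbf{p}_i} \cdot \dot{\mathbf{p}}_i + \frac{\partial H}{\partial \mathbf{r}_i} \cdot \dot{\mathbf{r}}_i\right] = \sum_{i=1}^{N} \left[\frac{\mathbf{p}_i}{m_i} \cdot \mathbf{F}_i - \mathbf{F}_i \cdot \frac{\mathbf{p}_i}{m_i}\right] = 0$$

Collect all momenta and coordinates into a Cartesian vector $x = (\mathbf{p}_1, \ldots, \mathbf{p}_N, \mathbf{r}_1, \ldots, \mathbf{r}_N)$ that lives in a $6N$-dimensional space called phase space. For a one-dimensional system with coordinate $q$ and momentum $p$, phase space can be visualized directly.

[Figure: trajectory in the $(q, p)$ plane starting from $(q(0), p(0))$]

Solution of Hamilton's equations yields $x_t$ given initial conditions $x_0$.

Phase space volume evolution

Generic recasting of Hamilton's equations: $\dot{x} = \eta(x)$

Time evolution as a coordinate transformation (one-parameter diffeomorphism): $x_t = x_t(x_0)$

Phase-space volume evolution depends on the Jacobian: $dx_t = J(x_t; x_0)\, dx_0$

Phase-space volume evolution

Jacobian of the transformation $x_0 \to x_t$:

$$J(x_t; x_0) = \det(M) = e^{\mathrm{Tr} \ln M}, \qquad M_{ij} = \frac{\partial x_t^i}{\partial x_0^j}$$

Take the time derivative of both sides:

$$\frac{d}{dt} J(x_t; x_0) = J(x_t; x_0)\, \mathrm{Tr}\left[M^{-1} \frac{dM}{dt}\right] = J(x_t; x_0) \sum_{i,j} \frac{\partial x_0^i}{\partial x_t^j} \frac{\partial \dot{x}_t^j}{\partial x_0^i} = J(x_t; x_0)\, \nabla_{x_t} \cdot \dot{x}_t \equiv J(x_t; x_0)\, \kappa(x_t)$$

Phase-space volume evolution

Equation of motion for the Jacobian:

$$\frac{d}{dt} J(x_t; x_0) = \kappa(x_t)\, J(x_t; x_0), \qquad J(x_0; x_0) = 1$$

Hamiltonian systems are incompressible:

$$\kappa(x) = \sum_{i=1}^{N} \left[\nabla_{\mathbf{p}_i} \cdot \dot{\mathbf{p}}_i + \nabla_{\mathbf{r}_i} \cdot \dot{\mathbf{r}}_i\right] = \sum_{i=1}^{N} \left[-\nabla_{\mathbf{p}_i} \cdot \nabla_{\mathbf{r}_i} H + \nabla_{\mathbf{r}_i} \cdot \nabla_{\mathbf{p}_i} H\right] = 0$$

Phase-space volume is conserved (Liouville's theorem): $dx_t = dx_0$

The ensemble concept

Each phase-space point $x = (\mathbf{p}_1, \ldots, \mathbf{p}_N, \mathbf{r}_1, \ldots, \mathbf{r}_N)$ is a complete specification of a system and is therefore called a microstate.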
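As a concrete illustration of a microstate evolving under Hamilton's equations, the sketch below follows one phase-space point $(q, p)$ of a pendulum and checks numerically that the Jacobian determinant of the flow map $x_0 \to x_t$ stays equal to 1, as Liouville's theorem requires. The choices here are assumptions for illustration, not part of the lecture: the Hamiltonian $H = p^2/2m - \cos q$, the velocity Verlet integrator, and the finite-difference estimate of the Jacobian.

```python
import numpy as np

def flow(q0, p0, t=5.0, dt=1e-3, m=1.0):
    """Map (q0, p0) -> (q_t, p_t) for H = p^2/(2m) - cos(q) (illustrative choice)."""
    q, p = float(q0), float(p0)
    f = -np.sin(q)                      # force = -dU/dq with U = -cos(q)
    for _ in range(int(round(t / dt))):
        p += 0.5 * dt * f               # half kick
        q += dt * p / m                 # drift
        f = -np.sin(q)
        p += 0.5 * dt * f               # half kick
    return np.array([q, p])

def jacobian_det(q0, p0, h=1e-5):
    """det of d(q_t, p_t)/d(q_0, p_0), estimated by central finite differences."""
    col_q = (flow(q0 + h, p0) - flow(q0 - h, p0)) / (2 * h)
    col_p = (flow(q0, p0 + h) - flow(q0, p0 - h)) / (2 * h)
    return float(np.linalg.det(np.column_stack([col_q, col_p])))

det_J = jacobian_det(1.0, 0.5)
print(f"det J = {det_J:.8f}")   # expected to be very close to 1
```

The determinant is unity up to finite-difference error because each kick and drift step is a shear in phase space, so the discrete map preserves volume just as the exact Hamiltonian flow does.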
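Macroscopic matter consists of $\sim 10^{23}$ particles. Macroscopic observables should not depend sensitively on the specific details of each particle's motion. Many microstates give rise to the same macroscopic observables, e.g. temperature:

$$T \Longleftrightarrow \left\langle \sum_{i=1}^{N} \frac{\mathbf{p}_i^2}{2 m_i} \right\rangle$$

Ensemble concept

Imagine a collection of systems governed by the same Hamiltonian $H$ and sharing common macroscopic properties (e.g. same total energy, volume, number of particles). Each system evolves according to the microscopic laws of motion from a different initial condition, so that at each instant in time, each system in the ensemble is in a unique microstate. Macroscopic observables are expressible as averages over the systems in a given ensemble.

Ensembles and the Liouville equation

Fraction of ensemble members in a phase space volume $dx$ at time $t$: $f(x, t)\, dx$, with

$$f(x, t) \ge 0, \qquad \int dx\, f(x, t) = 1$$

Fraction of ensemble members in $\Omega$: $\int_\Omega dx_t\, f(x_t, t)$

Rate of decrease of ensemble members in $\Omega$:

$$-\frac{d}{dt} \int_\Omega dx_t\, f(x_t, t) = -\int_\Omega dx_t\, \frac{\partial}{\partial t} f(x_t, t)$$

Flux out of the surface:

$$\int_S dS\, \hat{n} \cdot \dot{x}_t\, f(x_t, t) = \int_\Omega dx_t\, \nabla_{x_t} \cdot \left(\dot{x}_t\, f(x_t, t)\right)$$

Ensembles and the Liouville equation

$f(x, t)$ has a constant normalization, so the two expressions must balance:

$$\int_\Omega dx_t \left[\frac{\partial}{\partial t} f(x_t, t) + \nabla_{x_t} \cdot \left(\dot{x}_t\, f(x_t, t)\right)\right] = 0$$

Since $\nabla_{x_t} \cdot \dot{x}_t = 0$ and the choice of $\Omega$ is arbitrary, we obtain the Liouville equation:

$$\frac{\partial}{\partial t} f(x_t, t) + \dot{x}_t \cdot \nabla_{x_t} f(x_t, t) = 0$$

The Liouville equation implies that $f(x_t, t)$ is conserved along a trajectory: $df/dt = 0$.

Passive form of the Liouville equation:

$$\frac{\partial}{\partial t} f(x, t) + \eta(x) \cdot \nabla_x f(x, t) = 0$$

Ensembles and the Liouville equation

Poisson bracket:

$$\dot{x} \cdot \nabla_x f(x, t) = \sum_{i=1}^{N} \left[\frac{\partial f}{\partial \mathbf{r}_i} \cdot \frac{\partial H}{\partial \mathbf{p}_i} - \frac{\partial f}{\partial \mathbf{p}_i} \cdot \frac{\partial H}{\partial \mathbf{r}_i}\right] = \{f(x, t), H(x)\}$$

Liouville equation in terms of the Poisson bracket:

$$\frac{\partial}{\partial t} f(x, t) + \{f(x, t), H(x)\} = 0$$

Equilibrium condition: $\partial f / \partial t = 0$

Equilibrium solution: $\{f(x), H(x)\} = 0 \;\Rightarrow\; f(x) = F(H(x))$

Because of Liouville's theorem, we can freeze the ensemble at any instant in time and compute an observable according to

$$\langle O \rangle = \int dx\, O(x)\, F(H(x))$$

Microcanonical Ensemble

A microcanonical ensemble is an ensemble of systems isolated from their surroundings. The evolution of each system is therefore governed by Hamilton's equations. The macroscopic variables that are invariant in such an ensemble are the total energy $E$, the volume $V$ and the total number of particles $N$. We first seek to describe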
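the thermodynamics of this ensemble, so we seek a state function that depends on $N$, $V$ and $E$. A state function is defined as a thermodynamic function whose change is independent of the path taken in the space of thermodynamic variables.

Microcanonical Ensemble

First law of thermodynamics: $\Delta E = Q + W$, where $Q$ = heat absorbed by the system and $W$ = work done on the system.

Small changes along a reversible path: $dE = dQ_{rev} + dW_{rev}$

Heat absorbed is related to the entropy change at temperature $T$: $dS = dQ_{rev} / T$

Work performed by compressing or adding particles: $dW_{rev} = -P\, dV + \mu\, dN$ (for a one-component system)

Microcanonical Ensemble

Combining work and heat with the First Law:

$$dE = T\, dS - P\, dV + \mu\, dN$$

Thus

$$dS = \frac{1}{T}\, dE + \frac{P}{T}\, dV - \frac{\mu}{T}\, dN$$

The entropy $S = S(N, V, E)$ is the state function we seek:

$$dS = \left(\frac{\partial S}{\partial N}\right)_{V,E} dN + \left(\frac{\partial S}{\partial V}\right)_{N,E} dV + \left(\frac{\partial S}{\partial E}\right)_{N,V} dE$$

Microcanonical Ensemble

The connection to microstates is provided by Boltzmann's relation:

$$S(N, V, E) = k \ln \Omega(N, V, E)$$

$\Omega(N, V, E)$ is the number of microstates available to a system. To find this number, return to the equilibrium solutions of Liouville's equation. For a microcanonical ensemble, the condition $H(x) = E$ must be obeyed:

$$F(H(x)) \propto \delta(H(x) - E)$$

All points on the constant-energy hypersurface are equally probable, while all points off the surface have zero probability. A microcanonical ensemble is therefore one for which all accessible states have equal a priori probability of being accessed, the probability being $1/\Omega(N, V, E)$. $1/\Omega(N, V, E)$ is the normalization, with

$$\Omega(N, V, E) = M_N \int dx\, \delta(H(x) - E) = \frac{E_0}{N!\, h^{3N}} \int d^N\mathbf{p}\, d^N\mathbf{r}\, \delta(H(\mathbf{p}, \mathbf{r}) - E) \qquad \text{(partition function)}$$

Microcanonical Ensemble

Thermodynamics:

$$\frac{1}{kT} = \left(\frac{\partial \ln \Omega}{\partial E}\right)_{N,V}, \qquad \frac{P}{kT} = \left(\frac{\partial \ln \Omega}{\partial V}\right)_{N,E}, \qquad \frac{\mu}{kT} = -\left(\frac{\partial \ln \Omega}{\partial N}\right)_{V,E}$$

Equilibrium observables:

$$\langle O \rangle = \frac{M_N \int dx\, O(x)\, \delta(H(x) - E)}{M_N \int dx\, \delta(H(x) - E)} = \frac{M_N}{\Omega(N, V, E)} \int dx\, O(x)\, \delta(H(x) - E)$$

Now suppose $S(N, V, E) = C\, G(\Omega(N, V, E))$ for some function $G$. If the inverse temperature is itself an ensemble average of a phase-space function $O_T(x)$,

$$\frac{1}{T} = \frac{M_N}{\Omega(N, V, E)} \int dx\, O_T(x)\, \delta(H(x) - E),$$

while thermodynamics gives

$$\frac{1}{T} = \frac{\partial S}{\partial E} = C\, G'(\Omega(N, V, E))\, M_N \int dx\, \delta(H(x) - E)\, \frac{\partial \ln \delta(H(x) - E)}{\partial E},$$

then, if $O_T(x) \propto \partial \ln \delta(H(x) - E) / \partial E$ (with proportionality constant $C$), we need $G'(\Omega) = 1/\Omega$, i.e. $G(\Omega) = \ln \Omega$.

Microcanonical Ensemble

The microcanonical ensemble can be generated by solving Hamilton's equations:

$$\dot{\mathbf{r}}_i = \frac{\partial H}{\partial \mathbf{p}_i} = \frac{\mathbf{p}_i}{m_i}, \qquad \dot{\mathbf{p}}_i = -\frac{\partial H}{\partial \mathbf{r}_i} = -\frac{\partial U}{\partial \mathbf{r}_i}$$

Phase space averages are computed as time averages:

$$\langle O \rangle = \frac{\int dx\, O(x)\, \delta(H(x) - E)}{\int dx\, \delta(H(x) - E)} = \lim_{\tau \to \infty} \frac{1}{\tau} \int_0^{\tau} dt\, O(x_t)$$

[Figure: potential $U(x)$ vs. $x$ with a barrier] Not ergodic if $E$ is below the barrier height.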
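The remark that time averages can fail to equal microcanonical phase-space averages when the energy lies below a barrier can be illustrated with a small numerical sketch. Everything specific here is an assumption chosen for illustration: the double-well potential $U(x) = (x^2 - 1)^2$ (barrier height 1 at $x = 0$), the energy $E = 0.5$, and the velocity Verlet integrator. By symmetry, the average of $x$ over the full constant-energy surface (both wells) is zero, but a single trajectory started in the right-hand well never crosses the barrier.

```python
import numpy as np

def force(x):
    """Force for the illustrative double well U(x) = (x^2 - 1)^2."""
    return -4.0 * x * (x**2 - 1.0)

def time_average_x(x0, p0, n_steps=200_000, dt=1e-3, m=1.0):
    """Time average of x along one trajectory, plus the smallest x visited."""
    x, p = float(x0), float(p0)
    f = force(x)
    total, x_min = 0.0, x
    for _ in range(n_steps):
        p += 0.5 * dt * f       # velocity Verlet: half kick
        x += dt * p / m         # drift
        f = force(x)
        p += 0.5 * dt * f       # half kick
        total += x
        x_min = min(x_min, x)
    return total / n_steps, x_min

E = 0.5                               # below the barrier height U(0) = 1
x0 = np.sqrt(1.0 + np.sqrt(E))        # outer turning point of the right well
avg_x, x_min = time_average_x(x0, 0.0)
print(f"time average of x = {avg_x:.3f}, smallest x visited = {x_min:.3f}")
```

The time average of $x$ stays well away from zero because the trajectory is confined to one well, which is exactly the failure of ergodicity noted on the slide.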
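Canonical Ensemble

Using $N$, $V$, and $E$ as thermodynamic control variables for an ensemble is not a natural choice, as experiments in the condensed phase are never performed under these conditions. More natural choices are $N, V, T$ or $N, P, T$, corresponding to the canonical and isothermal-isobaric ensembles.

Recall $\frac{1}{T} = \frac{\partial S}{\partial E}$, and $S = S(N, V, E) \;\Rightarrow\; E = E(N, V, S) \;\Rightarrow\; T = \frac{\partial E}{\partial S}$

Energy as a function of $N, V, T$ by Legendre transformation:

$$\tilde{E}(N, V, T) = E(N, V, S(N, V, T)) - S(N, V, T)\, \frac{\partial E}{\partial S} = E - TS \equiv A(N, V, T)$$

$A(N, V, T)$ is called the Helmholtz free energy (we will talk more about it tomorrow).

Canonical Ensemble

Small change in $A(N, V, T)$:

$$dA = dE - S\, dT - T\, dS = (T\, dS - P\, dV + \mu\, dN) - S\, dT - T\, dS = -P\, dV + \mu\, dN - S\, dT$$

$$dA = \left(\frac{\partial A}{\partial N}\right)_{V,T} dN + \left(\frac{\partial A}{\partial V}\right)_{N,T} dV + \left(\frac{\partial A}{\partial T}\right)_{N,V} dT$$

Thermodynamic relations:

$$\mu = \left(\frac{\partial A}{\partial N}\right)_{V,T}, \qquad P = -\left(\frac{\partial A}{\partial V}\right)_{N,T}, \qquad S = -\left(\frac{\partial A}{\partial T}\right)_{N,V}$$

Canonical Ensemble

Microscopic picture: system 1 in contact with a much larger bath, system 2:

$$N_2 \gg N_1, \qquad V_2 \gg V_1, \qquad E_2 \gg E_1, \qquad E = E_1 + E_2, \qquad H(x) = H_1(x_1) + H_2(x_2)$$

Microcanonical partition function:

$$\Omega(N, V, E) = M_N \int dx_1\, dx_2\, \delta(H_1(x_1) + H_2(x_2) - E) \approx \Omega_1(N_1, V_1, E_1)\, \Omega_2(N_2, V_2, E_2)$$

Distribution function of system 1:

$$F(H_1(x_1)) \propto \int dx_2\, \delta(H_1(x_1) + H_2(x_2) - E)$$

$$\ln F(H_1(x_1)) = \text{const} + \ln \int dx_2\, \delta(H_2(x_2) - E + H_1(x_1)) \approx \text{const} + \ln \int dx_2\, \delta(H_2(x_2) - E) - H_1(x_1)\, \frac{\partial}{\partial E} \ln \int dx_2\, \delta(H_2(x_2) - E)$$

$$= \text{const} + \frac{S_2(N_2, V_2, E)}{k} - \frac{H_1(x_1)}{kT}$$

Canonical Ensemble

Canonical distribution (with $\beta = 1/kT$):

$$F(H(x)) = \frac{C_N}{Q(N, V, T)}\, e^{-\beta H(x)}, \qquad Q(N, V, T) = C_N \int dx\, e^{-\beta H(x)}, \qquad C_N = \frac{1}{N!\, h^{3N}}$$

Thermodynamics:

$$A = E - TS = E + T\, \frac{\partial A}{\partial T} \;\Rightarrow\; A(N, V, T) = -\frac{1}{\beta} \ln Q(N, V, T)$$

$$\frac{\mu}{kT} = -\left(\frac{\partial \ln Q}{\partial N}\right)_{V,T}, \qquad \frac{P}{kT} = \left(\frac{\partial \ln Q}{\partial V}\right)_{N,T}, \qquad S = k \ln Q + kT \left(\frac{\partial \ln Q}{\partial T}\right)_{N,V}$$

Equilibrium properties:

$$\langle O \rangle = \frac{C_N}{Q(N, V, T)} \int dx\, O(x)\, e^{-\beta H(x)}$$

Canonical Ensemble

In the canonical ensemble, energy is not conserved. Therefore, Hamilton's equations cannot be used to generate a canonical distribution. We need to supplement them with an effect that mimics the thermal reservoir. There are many ways to do this.

Langevin dynamics:

$$d\mathbf{r}_i = \frac{\mathbf{p}_i}{m_i}\, dt, \qquad d\mathbf{p}_i = -\frac{\partial U}{\partial \mathbf{r}_i}\, dt - \gamma\, \mathbf{p}_i\, dt + \sqrt{2 m_i \gamma kT}\; d\mathbf{W}_i$$

Corresponding Fokker-Planck equation:

$$\frac{\partial P(p, r, t)}{\partial t} = \sum_i \left[-\frac{\mathbf{p}_i}{m_i} \cdot \frac{\partial}{\partial \mathbf{r}_i} + \frac{\partial U}{\partial \mathbf{r}_i} \cdot \frac{\partial}{\partial \mathbf{p}_i} + \gamma\, \frac{\partial}{\partial \mathbf{p}_i} \cdot \left(\mathbf{p}_i + m_i kT\, \frac{\partial}{\partial \mathbf{p}_i}\right)\right] P(p, r, t)$$

Stationary solution: $P(p, r) \propto e^{-H(p, r)/kT}$

Canonical Ensemble

Nosé Hamiltonian (S. Nosé, J. Chem. Phys. 81, 511 (1984))

Consider a Hamiltonian of the form

$$H_N(p, p_s, r, s) = \sum_{i=1}^{N} \frac{\mathbf{p}_i^2}{2 m_i s^2} + U(r) + \frac{p_s^2}{2Q} + gkT \ln s = H(p/s, r) + \frac{p_s^2}{2Q} + gkT \ln s$$

Microcanonical partition function:

$$\Omega(N, V, E) = \int dp\, dp_s\, dr\, ds\; \delta\!\left(H(p/s, r) + \frac{p_s^2}{2Q} + gkT \ln s - E\right)$$

Change variables, $\mathbf{p}_i' = \mathbf{p}_i / s$:

$$\Omega(N, V, E) = \int dp'\, dp_s\, dr\, ds\; s^{3N}\, \delta\!\left(H(p', r) + \frac{p_s^2}{2Q} + gkT \ln s - E\right)$$

Canonical Ensemble

Delta function identity: $\delta(f(s)) = \dfrac{\delta(s - s_0)}{|f'(s_0)|}$, where $f(s_0) = 0$

$$\Omega(N, V, E) = \frac{1}{gkT} \int dp\, dp_s\, dr\; e^{(3N+1)\left(E - H(p, r) - p_s^2/2Q\right)/gkT} = \frac{e^{E/kT} \sqrt{2\pi Q kT}}{(3N+1)\, kT} \int dp\, dr\; e^{-H(p, r)/kT}$$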
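where the last equality holds for the choice $g = 3N + 1$.

By solving Hamilton's equations for the extended Hamiltonian, a canonical phase-space average can be computed as a time average:

$$\langle O \rangle = \frac{\int dx\, O(x)\, e^{-\beta H(x)}}{\int dx\, e^{-\beta H(x)}} = \lim_{\tau \to \infty} \frac{1}{\tau} \int_0^{\tau} dt\, O(x_t)$$

Classical non-Hamiltonian statistical mechanics

MET, Mundy, Martyna, Europhys. Lett. 45, 149 (1999)
MET, Ciccotti, Martyna, Liu, J. Chem. Phys. 115, 1678 (2001)

Equation of motion for the Jacobian:

$$\frac{d}{dt} J(x_t; x_0) = \kappa(x_t)\, J(x_t; x_0), \qquad J(x_0; x_0) = 1$$

Note that for Hamiltonian systems $\kappa(x) = 0 \;\Rightarrow\; J(x_t; x_0) = 1 \;\Rightarrow\; dx_t = J(x_t; x_0)\, dx_0 = dx_0$ (Liouville's theorem).

If $\kappa(x) \neq 0$, the system is non-Hamiltonian and $J(x_t; x_0) \neq 1$.

Let the non-Hamiltonian phase space be a general Riemannian manifold with a metric tensor $g_{ij}(x, t)$ and determinant $g(x, t)$. Then Liouville's theorem can be generalized to

$$\sqrt{g(x_t, t)}\; dx_t = \sqrt{g(x_0, 0)}\; dx_0, \qquad \text{since } J(x_t; x_0) = \frac{\sqrt{g(x_0, 0)}}{\sqrt{g(x_t, t)}}$$

Classical non-Hamiltonian statistical mechanics

Equation of motion for the Jacobian: $\dfrac{d}{dt} J(x_t; x_0) = \kappa(x_t)\, J(x_t; x_0)$, $J(x_0; x_0) = 1$

Solution:

$$J(x_t; x_0) = \exp\left[\int_0^t ds\, \kappa(x_s)\right]$$

Define $\kappa(x_t) = \dfrac{d}{dt} w(x_t, t)$. Then

$$J(x_t; x_0) = e^{w(x_t, t) - w(x_0, 0)}$$

Whence

$$dx_t = J(x_t; x_0)\, dx_0 \;\Rightarrow\; e^{-w(x_t, t)}\, dx_t = e^{-w(x_0, 0)}\, dx_0$$

Canonical ensemble

Nosé-Hoover equations:

$$\dot{\mathbf{r}}_i = \frac{\mathbf{p}_i}{m_i}, \qquad \dot{\mathbf{p}}_i = -\frac{\partial U}{\partial \mathbf{r}_i} - \frac{p_\eta}{Q}\, \mathbf{p}_i, \qquad \dot{\eta} = \frac{p_\eta}{Q}, \qquad \dot{p}_\eta = \sum_{i=1}^{N} \frac{\mathbf{p}_i^2}{m_i} - 3NkT$$

Non-Hamiltonian, with compressibility

$$\kappa = \sum_{i=1}^{N} \left[\nabla_{\mathbf{p}_i} \cdot \dot{\mathbf{p}}_i + \nabla_{\mathbf{r}_i} \cdot \dot{\mathbf{r}}_i\right] + \frac{\partial \dot{\eta}}{\partial \eta} + \frac{\partial \dot{p}_\eta}{\partial p_\eta} = -3N\, \frac{p_\eta}{Q} = -3N \dot{\eta}$$

Metric factor: $\sqrt{g} = e^{-w} = e^{3N\eta}$

Conserved energy:

$$H' = H(p, r) + \frac{p_\eta^2}{2Q} + 3NkT\, \eta$$

Canonical Ensemble

$$\Omega(N, V, E) = \int dp\, dp_\eta\, dr\, d\eta\; e^{3N\eta}\, \delta\!\left(H(p, r) + \frac{p_\eta^2}{2Q} + 3NkT\, \eta - E\right) \propto \int dp\, dr\; e^{-\beta H(p, r)} \propto Q(N, V, T)$$

One-dimensional harmonic oscillator:

$$H = \frac{p^2}{2m} + \frac{1}{2} m \omega^2 x^2, \qquad f(p) = \sqrt{\frac{\beta}{2\pi m}}\; e^{-\beta p^2/2m}, \qquad f(x) = \sqrt{\frac{\beta m \omega^2}{2\pi}}\; e^{-\beta m \omega^2 x^2/2}$$

[Figure: distributions $f(p)$ and $f(x)$ for the harmonic oscillator]

Nosé-Hoover chains

Martyna, Tuckerman, Klein, JCP 97, 2635 (1992)

Equations of motion:

$$\dot{\mathbf{r}}_i = \frac{\mathbf{p}_i}{m_i}, \qquad \dot{\mathbf{p}}_i = \mathbf{F}_i - \frac{p_{\eta_1}}{Q_1}\, \mathbf{p}_i, \qquad \dot{\eta}_k = \frac{p_{\eta_k}}{Q_k} \quad (k = 1, \ldots, M)$$

$$\dot{p}_{\eta_1} = \left[\sum_{i=1}^{N} \frac{\mathbf{p}_i^2}{m_i} - 3NkT\right] - \frac{p_{\eta_2}}{Q_2}\, p_{\eta_1}$$

$$\dot{p}_{\eta_k} = \left[\frac{p_{\eta_{k-1}}^2}{Q_{k-1}} - kT\right] - \frac{p_{\eta_{k+1}}}{Q_{k+1}}\, p_{\eta_k} \quad (k = 2, \ldots, M-1), \qquad \dot{p}_{\eta_M} = \frac{p_{\eta_{M-1}}^2}{Q_{M-1}} - kT$$

Conserved energy:

$$H' = H(p, r) + \sum_{k=1}^{M} \frac{p_{\eta_k}^2}{2 Q_k} + 3NkT\, \eta_1 + kT \sum_{k=2}^{M} \eta_k$$

Conserved volume element:

$$dx = e^{3N\eta_1 + \sum_{k=2}^{M} \eta_k}\; d^N\mathbf{p}\; d^N\mathbf{r}\; d^M\eta\; d^M p_\eta$$

One-dimensional canonical harmonic oscillator

$$H = \frac{p^2}{2m} + \frac{1}{2} m \omega^2 x^2, \qquad f(p) = \sqrt{\frac{\beta}{2\pi m}}\; e^{-\beta p^2/2m}, \qquad f(x) = \sqrt{\frac{\beta m \omega^2}{2\pi}}\; e^{-\beta m \omega^2 x^2/2}$$

[Figure: position and momentum distributions of the harmonic oscillator with 1 chain element, 3 chain elements, and 4 chain elements]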
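The Nosé-Hoover chain equations can be tried out on the one-dimensional harmonic oscillator with a short script. This is a minimal sketch under stated assumptions, not the integrator from the cited paper: it uses a chain of length $M = 2$, one degree of freedom (so $kT$ replaces $3NkT$), unit parameters, and a generic fourth-order Runge-Kutta step. Production codes use a reversible, measure-preserving scheme instead; RK4 is swapped in here only to keep the example short. The script monitors the conserved energy $H'$ as a check that the equations are coded consistently.

```python
import numpy as np

m, omega, kT = 1.0, 1.0, 1.0   # assumed unit parameters
Q1 = Q2 = 1.0                  # assumed thermostat masses

def rhs(y):
    """Nose-Hoover chain (M = 2) equations for one 1D particle."""
    x, p, eta1, eta2, pe1, pe2 = y
    return np.array([
        p / m,                                 # dx/dt
        -m * omega**2 * x - (pe1 / Q1) * p,    # dp/dt
        pe1 / Q1,                              # d(eta1)/dt
        pe2 / Q2,                              # d(eta2)/dt
        p**2 / m - kT - (pe2 / Q2) * pe1,      # d(p_eta1)/dt
        pe1**2 / Q1 - kT,                      # d(p_eta2)/dt
    ])

def conserved(y):
    """H' = H + sum_k p_eta_k^2 / (2 Q_k) + kT * (eta1 + eta2)."""
    x, p, eta1, eta2, pe1, pe2 = y
    return (p**2 / (2 * m) + 0.5 * m * omega**2 * x**2
            + pe1**2 / (2 * Q1) + pe2**2 / (2 * Q2) + kT * (eta1 + eta2))

def rk4_step(y, dt):
    k1 = rhs(y)
    k2 = rhs(y + 0.5 * dt * k1)
    k3 = rhs(y + 0.5 * dt * k2)
    k4 = rhs(y + dt * k3)
    return y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

y = np.array([0.0, 1.0, 0.0, 0.0, 1.0, -1.0])   # arbitrary initial state
h0 = conserved(y)
for _ in range(10_000):
    y = rk4_step(y, 0.005)
drift = abs(conserved(y) - h0)
print(f"drift in conserved energy H' after t = 50: {drift:.2e}")
```

To compare with the canonical distributions $f(p)$ and $f(x)$, one would histogram $x$ and $p$ over a much longer run; that step is left out to keep the sketch brief.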
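Radial distribution functions of water

CP-BLYP, $75^3$ grid, 30 ps, NVT, 300 K, $\mu$ = 500 au

[Figure: radial distribution functions $g(r)$ vs. $r$ (angstrom)]

NVT fluctuations are half those of NVE.

Driven dynamics and transport properties

Driven harmonic oscillator:

$$\dot{x} = \frac{p}{m}, \qquad \dot{p} = -m \omega^2 x + F_0 \cos(\Omega t)$$

After a short time, transient behavior gives way to steady-state behavior that resembles equilibrium in a different region of phase space. The steady state allows transport properties to be computed.

General driven equations of motion:

$$\dot{\mathbf{r}}_i = \frac{\mathbf{p}_i}{m_i} + \mathbf{C}_i(p, r)\, F_e(t), \qquad \dot{\mathbf{p}}_i = \mathbf{F}_i + \mathbf{D}_i(p, r)\, F_e(t)$$

Assume incompressibility:

$$\sum_i \left[\nabla_{\mathbf{p}_i} \cdot \mathbf{D}_i(p, r) + \nabla_{\mathbf{r}_i} \cdot \mathbf{C}_i(p, r)\right] = 0$$

Driven dynamics and transport properties

Liouville equation:

$$\frac{\partial}{\partial t} f(x, t) + \dot{x} \cdot \nabla f(x, t) = \frac{\partial}{\partial t} f(x, t) + iL f(x, t) = 0$$

Linearization scheme:

$$f(x, t) = f_0(H(x)) + \Delta f(x, t), \qquad iL = iL_0 + i\Delta L(t)$$

Equilibrium condition: $iL_0 f_0(H(x)) = 0$

Linearized Liouville equation:

$$\left(\frac{\partial}{\partial t} + iL_0\right) \Delta f(x, t) = -i\Delta L(t)\, f_0(H(x))$$

Driven dynamics and transport properties

Solution:

$$\Delta f(x, t) = -\int_0^t ds\; e^{-iL_0 (t - s)}\; i\Delta L(s)\, f_0(H(x))$$

Simplification:

$$i\Delta L(s)\, f_0(H(x)) = \left(iL(s) - iL_0\right) f_0(H(x)) = iL(s)\, f_0(H(x))$$

$$iL(s)\, f_0(H(x)) = \frac{\partial f_0}{\partial H}\; \dot{x} \cdot \nabla H(x) = -\frac{\partial f_0}{\partial H}\; j(x)\, F_e(s)$$

$$j(x) = -\sum_i \left[\mathbf{D}_i(p, r) \cdot \frac{\partial H}{\partial \mathbf{p}_i} + \mathbf{C}_i(p, r) \cdot \frac{\partial H}{\partial \mathbf{r}_i}\right] \qquad \text{(dissipative flux)}$$

Take the equilibrium distribution to be a canonical distribution, $f_0(H(x)) = \dfrac{C_N}{Q(N, V, T)}\, e^{-\beta H(x)}$:

$$iL(s)\, f_0(H(x)) = \beta\, f_0(H(x))\, j(x)\, F_e(s)$$

Driven dynamics and transport properties

Nonequilibrium observable:

$$\langle O \rangle_t = \int dx\, O(x)\, f(x, t) = \int dx\, O(x)\, f_0(H(x)) + \int dx\, O(x)\, \Delta f(x, t) = \langle O \rangle_0 + \int dx\, O(x)\, \Delta f(x, t)$$

From the linearized solution:

$$\langle O \rangle_t = \langle O \rangle_0 - \beta \int_0^t ds \int dx\; f_0(H(x))\, O(x)\, e^{-iL_0 (t - s)}\, j(x)\, F_e(s)$$

Let $x_t$ be the unperturbed evolution of the phase-space vector. Evolution of $O(x_t)$:

$$\frac{dO(x_t)}{dt} = iL_0\, O(x_t) \;\Rightarrow\; O(x_t) = e^{iL_0 t}\, O(x_0), \qquad iL_0 = \dot{x} \cdot \nabla_x$$

Driven dynamics and transport properties

Nonequilibrium observable:

$$\langle O \rangle_t = \langle O \rangle_0 - \beta \int_0^t ds\; F_e(s) \int dx\; f_0(H(x))\, O(x_{t-s}(x))\, j(x) = \langle O \rangle_0 - \beta \int_0^t ds\; F_e(s)\, \langle O(t - s)\, j(0) \rangle_0$$

The quantity $\langle O(t - s)\, j(0) \rangle_0$ is called an equilibrium time correlation function.

Properties:

$$\langle A(0)\, B(t) \rangle = \lim_{T \to \infty} \frac{1}{T} \int_0^T d\tau\; A(x_\tau)\, B(x_{t + \tau})$$

Driven dynamics and transport properties

Example: shear viscosity

[Figure: fluid sheared along $x$ with shear rate $\gamma$]

Equations of motion:

$$\dot{\mathbf{r}}_i = \frac{\mathbf{p}_i}{m_i} + \gamma\, y_i\, \hat{\mathbf{e}}_x, \qquad \dot{\mathbf{p}}_i = \mathbf{F}_i - \gamma\, p_{y_i}\, \hat{\mathbf{e}}_x$$

Dissipative flux:

$$j(p, r) = \sum_i \left[\frac{p_{x_i}\, p_{y_i}}{m_i} + F_{x_i}\, y_i\right] = V P_{xy}$$

Driven dynamics and transport properties

Coefficient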
of shear viscosity $\eta$

From the linear response formula (with $F_e = \gamma$ and $\langle P_{xy} \rangle_0 = 0$):

$$\langle P_{xy} \rangle_t = -\beta \gamma V \int_0^t ds\; \langle P_{xy}(0)\, P_{xy}(t - s) \rangle_0$$

Viscosity:

$$\eta = -\lim_{t \to \infty} \frac{\langle P_{xy} \rangle_t}{\gamma} = \frac{V}{kT} \int_0^\infty d\tau\; \langle P_{xy}(0)\, P_{xy}(\tau) \rangle_0$$

This is known as a Green-Kubo formula: a transport coefficient is related to the time integral of an equilibrium autocorrelation function.

IMA Summer Program: Classical and Quantum Approaches in Molecular Modeling

Lecture 1: Introduction to Molecular Dynamics

Robert D. Skeel, Department of Computer Science (and of Mathematics), Purdue University

http://bionum.cs.purdue.edu/2007July23.pdf

Acknowledgments: V. Dadarlat, NIH

Application areas: condensed matter physics, physical chemistry, materials science, mechanical engineering, molecular biophysics, structural biology

Twenty years ago

At the IMA Workshop on Atomic and Molecular Structure and Dynamics (week 5, July 13-17, 1987), Wilfred F. van Gunsteren expected most improvements to come from hardware rather than algorithms. Since then:
- a 10,000-fold improvement in processor speed,
- a factor of 20, typically, for parallelism, and
- a factor of 25, typically, for algorithms.

Possibilities for algorithms

Dramatic improvements in algorithms seem possible without radical innovations, by deploying innovative ideas scattered in the literature, improving them through analysis and abstraction, and combining them. Massive parallelization of algorithms is another opportunity, but MPI is too low level for regular use.

Twenty years from now

What is possible 20 years from now, for the same accuracy, compared to current practice?
- integrators and fast force evaluation: factor of 5
- sampling: factor of 10
- coarse graining: factor of 10
- parallelism: factor of 50
- processor speed: factor of 10

Misleading ideas
- Structure determination is the minimization of potential energy.
- Molecular dynamics is the calculation of a real trajectory.

Remedy: Lecture 2, Statistical Mechanics and Molecular Dynamics

Outline
I. Equations of motion
II. Boundary effects
III. Initial conditions
IV. Computational tasks: thermodynamics and structure
V. Computational tasks: kinetics
VI. Practicalities

Atomistic models

$x$: collection of $N$ atomic
positions $\mathbf{r}_i$
$M$: diagonal matrix of masses $m_i$
$v$: collection of velocities
$p = Mv$: collection of momenta
$\Gamma = (x, p)$, possibly with other variables, e.g. volume: microscopic state

Force field

The potential energy $U(x)$ is a sum of
- $\mathcal{O}(N)$ few-body potentials for covalent bonded forces,
- $\mathcal{O}(N^2)$ 2-body potentials for nonbonded forces,
which approximate quantum mechanics.

Nonbonded energy terms
1. Coulombic: $\text{const} \cdot \dfrac{q_i q_j}{r_{ij}}$, where $q_i$, $q_j$ are partial charges
2. London dispersion (van der Waals): $-\dfrac{\text{const}}{r_{ij}^{6}}$
3. excluded volume: $+\dfrac{\text{const}}{r_{ij}^{12}}$

Bonded energy terms
1. bond stretching, for bond $i$-$j$: energy $= \text{const} \cdot (r_{ij} - r_0)^2$
2. angle bending, for bonds $i$-$j$-$k$: energy $= \text{const} \cdot (\theta - \theta_0)^2$, where $\theta$ is the angle between the two bonds
3. Consider bonds $i$-$j$-$k$-$l$. With the 3 bond lengths and 2 bond angles fixed, rotation about the middle bond $j$-$k$ remains possible. The torsion (aka dihedral) angle $\varphi$ is the clockwise rotation of $i$-$j$ about the $k$-$j$ axis needed to minimize the distance from $i$ to $l$.

   energy $= \sum_n V_n \left(1 + \cos(n\varphi - \varphi_n)\right)$, where the sum runs over a few small integer values of $n$.

   [Figure: a typical torsion potential plotted against $\varphi$]
4. Miscellaneous: improper dihedral, CMAP correction

Equations of motion

$F(x) = -\nabla U(x)$: collection of forces

The equations are a Hamiltonian system,

$$\dot{x}(t) = M^{-1} p(t), \qquad \dot{p}(t) = F(x(t)),$$

with Hamiltonian

$$H(x, p) = \frac{1}{2}\, p^T M^{-1} p + U(x)$$

A numerical integrator

The velocity Verlet scheme is

$$x^{n+1} = x^n + \Delta t\, M^{-1} p^n + \frac{\Delta t^2}{2}\, M^{-1} F(x^n), \qquad p^{n+1} = p^n + \frac{\Delta t}{2}\left(F(x^n) + F(x^{n+1})\right)$$

Equivalent to the truncated Störmer method and the leapfrog method. An ancient method. Discovered, not invented.

Alternative formulation:

$$M\, \frac{x^{n+1} - 2x^n + x^{n-1}}{\Delta t^2} = F(x^n), \qquad v^n = \frac{x^{n+1} - x^{n-1}}{2 \Delta t}$$

Trajectory error $\propto \Delta t^2\, e^{t/\tau}$, with $\tau \approx 50$ periods of the fastest mode. (More later.) $\Delta t \propto$ period.

Refined models

employ a more complicated force field:
- polarizable forces
- bond-breaking force fields, e.g. ReaxFF
- quantum mechanics / molecular mechanics (QM/MM)

Coarse-grained models

use fewer degrees of freedom:
- constraints: remove highest-frequency motions
- implicit solvent
- reduced models: bottom-up coarse graining
- continuum models: top-down coarse graining
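The velocity Verlet scheme given under "A numerical integrator" above can be sketched in a few lines. The setup here is assumed purely for illustration: two uncoupled harmonic degrees of freedom with different masses, a diagonal mass matrix stored as a vector of inverse masses, and the function name `velocity_verlet`. The check at the end looks at the relative energy error, which should stay small and bounded for a stable step size.

```python
import numpy as np

def velocity_verlet(x, p, force, inv_mass, dt, n_steps):
    """Advance (x, p) by n_steps of velocity Verlet.

    inv_mass holds the diagonal of M^{-1}; force(x) returns F(x) = -grad U(x).
    """
    f = force(x)
    for _ in range(n_steps):
        x = x + dt * inv_mass * p + 0.5 * dt**2 * inv_mass * f
        f_new = force(x)
        p = p + 0.5 * dt * (f + f_new)
        f = f_new                      # reuse: one force evaluation per step
    return x, p

k = 1.0                                # assumed spring constant

def force(x):
    return -k * x                      # U(x) = k |x|^2 / 2

def energy(x, p, inv_mass):
    return 0.5 * np.sum(inv_mass * p**2) + 0.5 * k * np.sum(x**2)

x0 = np.array([1.0, 0.0])
p0 = np.array([0.0, 1.0])
inv_mass = np.array([1.0, 0.5])        # masses 1 and 2

x1, p1 = velocity_verlet(x0, p0, force, inv_mass, dt=0.01, n_steps=10_000)
e0, e1 = energy(x0, p0, inv_mass), energy(x1, p1, inv_mass)
rel_err = abs(e1 - e0) / e0
print(f"relative energy error after 10,000 steps: {rel_err:.2e}")
```

Storing the previous force and reusing it is the usual design choice, since force evaluation dominates the cost in molecular dynamics.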
Modeling the surroundings
- if thermal contact: temperature is specified
- if mechanical contact: pressure is specified
- if material contact: chemical potential is specified for each species

In the thermodynamic limit $N \to \infty$, boundary effects diminish as $\mathcal{O}(N^{-1/2})$: modeling an isolated system.

Spherical boundary restraints

Soft spherical wall centered at the origin. To $U(x)$ add a term

$$\tfrac{1}{2}\, k\, \max\{0,\; |\mathbf{r}_i| - \text{radius}\}^4$$

for every atom $i$. Seldom used in practice. Instead, to avoid artificial boundary effects, use periodic boundaries.

Periodic boundaries

The simulation box is replicated infinitely often. Forces are sums over infinitely many images. Sums of Coulombic forces are not well defined; one needs to use a reasonable limiting process. One such process leads to the Ewald sum, which has the special property of being continuous as a function of $x$. Be assured: no computational penalty.

Modeling thermal contact

A method that is easy to implement: instead of spherical restraints, identify the atoms $i$ in the outermost layer and
- harmonically restrain them to their initial positions (add a restraint force to $F_i$), and
- thermostat them stochastically.

Stochastically?

Wiener processes

A standard Wiener process $W(t)$, $t \ge 0$, is a family of Gaussian random variables fully characterized by their expectations $E[W(t)] = 0$ and covariances $E[W(s)\, W(t)] = \min\{s, t\}$.

Stochastic thermostating

$$m_i\, \frac{d^2}{dt^2} \mathbf{r}_i = \mathbf{F}_i - \gamma_i m_i\, \frac{d}{dt} \mathbf{r}_i + \sqrt{2 k_B T\, \gamma_i m_i}\; \frac{d}{dt} \mathbf{W}_i(t)$$

where $k_B$ is Boltzmann's constant, $T$ is temperature, $\gamma_i$ are damping constants (how to choose?), and $\mathbf{W}_i(t)$ are independent standard Wiener processes.

More discretely, add to $\mathbf{F}_i^n$:

$$-\gamma_i m_i\, \mathbf{v}_i^n + \sqrt{2 k_B T\, \gamma_i m_i}\; \frac{\mathbf{W}_i(t^{n+1/2}) - \mathbf{W}_i(t^{n-1/2})}{\Delta t}$$

We have

$$\mathbf{W}_i(t^{n+1/2}) - \mathbf{W}_i(t^{n-1/2}) = \sqrt{\Delta t}\; \mathbf{Z}_i^n,$$

where the $\mathbf{Z}_i^n$ are independent standard Gaussian random numbers.

Initial conditions

The microstate of a system is largely unknown, so initial values $\Gamma_0$ are chosen at random from some prescribed distribution
with probability density function (pdf) $\rho(\Gamma)$ consistent with macroscopic observables such as $T$. The appropriate distribution is known mathematically under certain hypotheses.

For a system in thermal contact with its surroundings (the canonical ensemble), use the Boltzmann-Gibbs distribution:

$$\rho(\Gamma) = \frac{e^{-H(\Gamma)/k_B T}}{\int e^{-H(\Gamma)/k_B T}\, d\Gamma}$$

Probability depends on energy. At physiological temperature:
- a factor of 10 $\Longleftrightarrow$ a difference of 1.4 kcal/mol
- a factor of $10^5$ $\Longleftrightarrow$ a difference of 7.0 kcal/mol

Due to random initial values, one must study an ensemble of systems.

Equilibration

A random microstate can be obtained for the canonical ensemble by performing a long episode of Langevin dynamics:

$$M\, \frac{d^2}{dt^2} x = F(x) - CM\, \frac{d}{dt} x + (2 k_B T\, CM)^{1/2}\, \frac{d}{dt} W(t)$$

where $C$ is a diagonal matrix of damping constants and $W(t)$ is a set of $3N$ independent standard Wiener processes. Definitely nonphysical.

Thermodynamics and structure
- Most quantities of interest are defined only in terms of a stationary distribution, and the purpose of the simulation is to sample configuration space.
- In some cases, kinetic quantities are of interest, and realistic Newtonian dynamics must be used.

In the former case, only $\rho(\Gamma)$ is needed, and equations of motion are not essential.

An observable is a $\rho$-weighted average of some function of the microstate:

$$\langle A \rangle = \int A(\Gamma)\, \rho(\Gamma)\, d\Gamma$$

e.g. if $U^{int}(x)$ is the energy exclusive of outside contributions, $\langle U^{int}(x) \rangle$ is the internal energy.

This might be calculated as

$$\frac{1}{N_{trials}} \sum_{\nu=1}^{N_{trials}} A(\Gamma_\nu),$$

which requires random sampling of phase space.

Representative tasks
- thermodynamics, e.g. pressure vs. temperature
- structure, e.g. radial distribution function
- energetics, e.g. free energy differences, potentials of mean force

Structure

The meaning of this term ranges from geometry (the configuration of a system modulo uniform translation and rotation) to topology (bonding patterns), e.g. for proteins:
1. primary
structure: covalently bonded sequence of amino acids
2. secondary structure: backbone hydrogen bonds (strong noncovalent associations N-H···O=C)
3. tertiary structure: other hydrogen bonds, salt bridges

Conformations

[Figure: trans and gauche conformations]

- clusters of configurations/structures
- better still: regions of configuration space such that transitions between them are rare
- more conveniently: dihedral angle ranges

Free energy of binding

Consider a protein in a dilute solution of ligands. The free energy of binding $\Delta G$ is defined by the relation

$$\frac{\Pr(\text{a ligand is bound to protein})}{\Pr(\text{no ligand is bound to protein})} = c\, e^{-\Delta G / k_B T}$$

where $c$ is the ligand concentration in mol/liter. It can be modeled with a single protein and ligand in solution, e.g. KIX and pKID.

[Figure: KIX and pKID]

Potentials of mean force

Let $R = \xi(x)$ be a reaction coordinate, e.g. the distance between the centers of mass of 2 molecules. Let $\rho_\xi(R)$ be the pdf for $\xi(x)$, with $x$ taken to be random:

$$\rho_\xi(R) = \int \delta(\xi(x) - R)\, \rho(x, p)\, dx\, dp$$

The potential of mean force $w(R)$ is defined by

$$e^{-w(R)/k_B T} = \text{const} \cdot \rho_\xi(R)$$

Example: with $R$ the distance between the centers of mass of KIX and pKID (or KID), the potential of mean force is

[Figure: $w(R)$ in kcal/mol vs. $R$]

Kinetics

In principle, this requires an ensemble of, say, 2 to 20,000 realistic trajectories with random initial conditions:

$$\Phi_t(\Gamma_\nu), \qquad \nu = 1, 2, \ldots, N_{trials},$$

where $\Phi_t(\Gamma)$ denotes the $t$-flow of the dynamics (the phase space trajectory with initial value $\Gamma$).

Representative tasks
- short animations, e.g. impact of a projectile on a material (bullets move 15 Å/ps)
- time correlation functions
- transition paths
- transition rates, conformational dynamics

Coping with chaos

Motion is chaotic, and trajectories are swamped with error. This is compensated by the fact that initial values are unknown. Times can be lengthened by shadowing arguments. However, for very long times, only time correlation functions and time averages may be
computable.

Bottom line: formulate questions in terms of computable quantities.

Time correlation functions

Unnormalized time correlation function:

$$\langle A(\Phi_t(\Gamma))\, B(\Gamma) \rangle = \int A(\Phi_t(\Gamma))\, B(\Gamma)\, \rho(\Gamma)\, d\Gamma$$

In principle, this might be calculated as

$$\langle A(\Phi_t(\Gamma))\, B(\Gamma) \rangle \approx \frac{1}{N_{trials}} \sum_{\nu} A(\Phi_t(\Gamma_\nu))\, B(\Gamma_\nu)$$

Autocorrelation functions can be used to compute transport coefficients like diffusion, thermal conductivity, and viscosities, e.g.

velocity autocorrelation function $\Rightarrow$ diffusion coefficient

$$D = \frac{1}{3} \int_0^\infty \langle \mathbf{v}_i(t) \cdot \mathbf{v}_i(0) \rangle\, dt$$

For greater efficiency, average over all indistinguishable atoms $i$.

Transition pathways

[Figure: active kinase domain and inactive kinase domain]

Defining a pathway

The problem is to calculate a representative path from metastable state A in $x$ space to metastable state B.

Committor function: $q(x) = \Pr(\text{trajectory starting at } x \text{ with random } p \text{ reaches B before A})$

On each isocommittor, consider the distribution of crossing points from reactive trajectories. Choose the center of this distribution to be representative. This is illustrated in the following figure,

[Figure: potential energy contours with isocommittors and a representative path]

where shading indicates contours of potential energy, thin curves denote isocommittors, ellipses enclose concentrations of crossing points from reactive trajectories, and the thick curve is the center.

Diffusion-limited reactions

Enzyme + Substrate

Problem: calculate the rate constant $k$, where

$$\frac{\text{reaction rate}}{\text{unit volume}} = k \cdot [\text{substrate concentration}] \cdot [\text{enzyme concentration}]$$

Practicalities

The practicalities of doing such calculations involve three steps:
- structure building: Setting up the input files is best done interactively, with scripts and visual feedback. Visualization programs: RasMol, VMD, PyMOL.
- simulation: Generating dynamics or sampling trajectories is best done in background or remotely. Simulation programs: CHARMM, Amber, Gromacs, NAMD, LAMMPS, NWChem, Tinker.
- analysis: Analyzing
trajectory data.

Simulation specifications
- Specify molecular system & surroundings
- Specify computational tasks
- Select computational model (uncontrolled approximations and error tolerances):
  - internal forces
  - external forces, e.g. temperature and pressure control
  - dynamics: sampling or real
- Override defaults for performance parameters
- Design simulation protocol

References
- M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, 1987.
- D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications, 2nd edition, 2002.
- A. R. Leach, Molecular Modelling: Principles and Applications, 2nd edition, 2001.
- T. Schlick, Molecular Modeling and Simulation: An Interdisciplinary Guide, 2002.
- Journal of Chemical Physics