INTRO TO BIOINFORMATICS BNFO 301
Virginia Commonwealth University
These 21 pages of class notes were uploaded by Vivienne Dickens on Wednesday, October 28, 2015. The notes belong to BNFO 301 at Virginia Commonwealth University, taught by Jeffrey Elhai in Fall.
Date Created: 10/28/15
BioBIKE Language Syntax
Working with large numbers of items: Mapping and Loops

II. Loops

II.A. Overview of loops by example

Implicit mapping is simple: just replace a single item with a set of items. Explicit mapping is not too bad: just define a function, f(x), and provide a list of x's. In contrast, looping uses what seems like a separate language.

A loop executes one set of instructions repeatedly. Each time through the instructions is called an iteration. Here's a previous example rendered as a loop:

[Code figure not reproduced.]

Translation:
a. Consider each protein in the set of all proteins of ss120, one at a time.
b. Accumulate the molecular weights of each protein.
c. When the last protein has been considered, return the sum.

That gets you the total molecular weight. Or this code gets you the entire answer by means of a more complicated loop:

[Code figure not reproduced.]

Translation:
a. Consider each protein in the set of all proteins of ss120, one at a time.
b. Before the loop begins, set the sum of molecular weights to zero. (The initialization occurs only once.)
c. Before the loop begins, set the number of proteins. This will be a constant.
d. Find the molecular weight of the one protein you're considering at the moment. This assignment is repeated each time through the loop, for each protein.
e. Add that molecular weight to the growing total.
f. Loop: Repeat steps d and e until you've considered each protein in the set.
g. When you've finished considering each protein, calculate the average molecular weight and use that as the value returned by the FOR-EACH function.

II.B. Overview of anatomy of the loop

BioBIKE supports two functions to describe loops: FOR-EACH and the very similar but more general LOOP function. This section will discuss only FOR-EACH. Loops can be divided into the following (mostly optional) parts:

[Figure: color-coded anatomy of a FOR-EACH loop. Legible labels include "Body: Forms to be iterated" and "Results Section: Determines the result returned by the loop".]

The Initialization section (green) is the first to be executed and is executed only once, before
the loop begins. The Primary iteration control (first section) and Control section (light green) are then set up before the loop begins, but they are considered again with each iteration. The next three sections (blue) are executed each iteration. The Final Action section (red) is executed only once, after the loop is finished. In the sections that follow, I'll discuss each part.

II.C. Primary iteration control (top line of FOR-EACH)

The primary iteration consists of two parts: the variable and the values that will be assigned to the variable. Each time through the loop, the variable will take only one of the values. The rest of the loop describes how that value will be used.

Most people who have learned other computer programming languages expect loop variables to take only numeric values, counting sequentially from one number to another number. BBL loops can do this, but it's more common in bioinformatics to want to go through a list of things, like a list of genes or organisms or the nucleotides of a sequence. The example at the beginning of this section (calculating molecular weight) is a typical case.

You can change the primary iteration (how the variable is assigned values) to achieve the following forms. (In this and many other examples shown here, I use a version of the language that is not what is currently available.)

[Figure: code fragments showing different formats for primary iteration; code not reproduced.]

A. Consider each organism within the list of organisms called all-cyanobacteria.
B. Consider each letter within the string called my-sequence.
C. Consider each header-sequence pair in the list of such pairs read from the FastA file "my-reads.txt".
D. Consider each position, taking values from 15 up to 1.
E. Consider each position, taking values from the variable called gene-end down to the same number less 30.
F. Consider each codon-number, starting from 3 and proceeding by 3 (3, 6, 9, ...).

SQ1. What values do you predict will be displayed in the following loop? [Code figure not reproduced.]
SQ2. What values do you predict will be displayed in the following loop? [Code figure not reproduced.]
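As an aside to formats A through F listed above (and not an answer to the self-quiz questions), here is a rough Python analogue of the main kinds of primary iteration. It is only a sketch: Python is not BioBIKE, and the function names, organism labels, and FastA-style pairs are hypothetical stand-ins chosen for illustration.

```python
from itertools import count, islice


def iterate_items(items):
    """Formats A-C: visit each element of a list, or each character of a string."""
    visited = []
    for item in items:
        visited.append(item)
    return visited


def count_down(start, span):
    """Format E: step downward from start to (start - span), inclusive."""
    return list(range(start, start - span - 1, -1))


def count_by(start, step, how_many):
    """Format F: count upward by step with no built-in end.

    itertools.count never stops on its own, so islice caps the number of
    values taken. This plays the role of an extra control on the loop.
    """
    return list(islice(count(start, step), how_many))


# Hypothetical data, for illustration only.
organisms = ["organism-1", "organism-2"]
pairs = [("header-1", "ATGC"), ("header-2", "GGCC")]

print(iterate_items(organisms))      # each organism in a list (format A)
print(iterate_items("ATG"))          # each letter in a string (format B)
for header, sequence in pairs:       # each header-sequence pair (format C)
    print(header, len(sequence))
print(count_down(100, 3))            # a descending numeric range (format E)
print(count_by(3, 3, 3))             # 3, 6, 9 (format F, capped at three values)
```

The cap in count_by matters: without it, a format-F style iteration has no natural stopping point.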
II.D. Additional Control section

Format F above makes clear that sometimes additional control over the loop is necessary, besides that provided by the primary iterator. If the loop were allowed to proceed as shown in that format, it would begin at 3 and continue forever. Through the Control section you can impose additional FOR-EACH conditions and conditions of two new types: WHILE and UNTIL.

The additional FOR-EACH conditions proceed in parallel with the first. Thus

[Code figure not reproduced.]

displays only 10 numbers (1, 4, 9, ..., 100), not the 10 x 10 table you might expect. The loop shown above may be understood as follows:

Translation:
a. Consider each number, one at a time, from 1 to 10. Call it number1.
b. At the same time, consider each number, one at a time, from 1 to 10. Call it number2. (Notice that when explained in this way, the folly of this loop becomes clear. There's no difference between number1 and number2 and no sense in inventing both. They will never differ from one another.)
c. Display the following on a single fresh line: the value of number1 multiplied by number2.
d. Loop: Repeat step c for each number1 and number2, changing in lockstep.

There is a way of breaking the connection between number1 and number2 so that they increase not in lockstep but like columns in a speedometer. We'll cover that in the Body section.

In contrast to FOR-EACH, WHILE and UNTIL do not create a new variable. They describe a condition. In the case of WHILE, that condition must be met or the loop is terminated. In the case of UNTIL, if that condition is met, then the loop is terminated. For example:

[Code figure not reproduced: two loop fragments, one using WHILE and one using UNTIL.]

Both loop fragments will have the same effect. In the first case, codon-position will increase by 3 so long as that position remains less than the full length of some gene. In the second case, codon-position will increase by 3 up until the point that position exceeds the length of the gene. Sometimes it is easier to think about continuing the loop WHILE something remains true, and sometimes it is easier to think about continuing it UNTIL something
becomes true. It is possible to combine any number of controls into a loop.

The first loop fragment may be understood as follows:

Translation:
a. Consider each codon position, one at a time, starting with the number 3 and proceeding upwards, counting by 3.
b. Continue with the loop so long as the codon position is less than the length of my-gene. (This example presumes that you have previously defined my-gene to be some gene.)
c. The loop continues in some way not shown here.

Using WHILE or UNTIL can be a bit dangerous. If you specify a condition that is always true or that never can be met, you might end up with a loop that goes on forever (or until your time allocation is exceeded; the default is 40 seconds).

SQ3. How many numbers will the following loop display? [Code figure not reproduced.]

II.E. Initialization section

The first thing a loop does is to initialize the variables you specify to be initialized. It does this even before initializing the primary iterator. You often may make loops without this section, but it does have its moments. For example, you might want to initialize variables to set a constant. Or you might want to initialize a variable that will change over the course of the loop.

[Code figure not reproduced.]

It is possible to define any number of variables and constants.

SQ4. What will the following loop display? [Code figure not reproduced.]

II.F. Variable update section

The Initialization section initializes variables only once, at the beginning. In the Variable Update section, variables are modified every iteration through the loop. This is most commonly employed to set up quantities that will be used within the loop but depend on the value of the iteration variable. For example:

[Code figure not reproduced.]

which may be understood as follows:

Translation:
a. Consider each gene amongst those with shared evolutionary antecedents as pro0047.
b. Retrieve the short form of the name of the gene under consideration and assign that value to name.
c. Retrieve the short name of the organism of the gene under consideration and
assign that value to organism.
d. Retrieve the description of the gene under consideration and assign that value to description.
e. Display the following listing on a new line: the name of the variable and its value, for each of name, organism, and description.
f. Loop: Repeat steps b through e until the set of genes has been exhausted.

Note that the assignments (steps b through d) are redone each iteration. Each time through the loop, name, organism, and description will take on different values, because they are derived from the primary iterator, gene, which takes on a different value each iteration.

Variables may also be updated using values that have been defined in the Initialization section or updated earlier in the Variable Update section. You could use this section to assign values to variables that don't change, but that's best left to the Initialization section.

Sometimes the entire loop is calculating variables, and there's no need for the Body section. It is enough to collect one or more of the variables.

SQ5. Consider the example below of how initialized variables differ from updated variables. Predict what will be displayed.

[Code figure not reproduced; it compares a variable called initialized-number with one called updated-number.]

II.G. Body

In all the sections discussed thus far, there are constraints as to what kind of actions may be taken (e.g., definitions in the Initialization section). In the Body section you have almost free rein. It is important to realize, however, that actions in the Body section do not cause a result to be returned by the loop (unless there is an explicit RETURN statement). You may multiply one variable by another, but that multiplication will not necessarily find its way into the results.

Sometimes you would like to have two loop variables running separately from one another, unlike the two variables locked together in the example shown in SQ3. You can do this by having one loop nested within another. Nested loops work like a speedometer, with the variable of the inner nested loop like the 1's digit and
the variable of the outer nested loop like the 10's digit. The inner loop runs to completion for each iteration of the outer loop. Here's an example:

SQ6. What does the function below display? [Code figure not reproduced.]

Or, more interesting:

SQ7. What does the function below display? [Code figure not reproduced.]

II.H. Results section

Like all BioBIKE functions, loops return a value. If you pay no attention to what it returns (for example, if you're concerned only with what the loop displays, as in the last example), then NIL will be returned. Most of the loops used as examples thus far return NIL. They just display something as output. In the real world, however, you'll usually want the loop to return a value or a list of values. On the first page of the notes there are examples of loops that do return values, one because of the SUM keyword and one because of the RETURN function.

Here are five ways of returning values:
  COLLECT returns a list of values saved over the course of the loop.
  COUNT returns the number of times the clause is invoked.
  SUM returns the sum of a number of items.
  MAX returns the largest number considered in the loop.
  MIN returns the smallest number considered in the loop.
(RETURN is another way, but that will be discussed in the next section, Final Actions.)

You may use no more than one of these methods in a specific loop.

Each of these five ways of developing a result can be activated conditionally, using the very useful WHEN option. Here are some examples, some using WHEN, some not.

How to make a list of large genes

[Code figure not reproduced.]

Translation:
a. Consider, one at a time, each gene in the set of genes of Anabaena PCC 7120.
b. Retrieve the length of the gene and assign that value to a variable called length.
c. When the length is greater than 2000, put the gene and its length into a collection that will eventually be the result of the completed loop.
d. Loop: Repeat steps b and c until the set of genes has been exhausted.

How to count the number of small proteins

[Code figure not reproduced.]

Translation:
a. Consider, one at a time, each protein in the set of proteins of Prochlorococcus
marinus ss120.
b. Retrieve the molecular weight of the protein and assign that value to a variable called MW.
c. When the MW is less than or equal to 10000, count the protein, i.e., add one to a count which will eventually be the result of the completed loop.
d. Loop: Repeat steps b and c until the set of proteins has been exhausted.

How to get a sum of the number of nucleotides devoted to tRNA

[Code figure not reproduced.]

Translation:
a. Consider, one at a time, each gene in the set of noncoding genes of Anabaena PCC 7120.
b. Retrieve the length of the gene and assign that value to a variable called length.
c. Retrieve the description of the gene and assign that value to a variable called description.
d. When a match is found for the text "tRNA" within the description, add the length of the gene to a running sum, which will eventually be the result of the completed loop.
e. Loop: Repeat steps b through d until the set of genes has been exhausted.

What is the size of the largest gene?

[Code figure not reproduced.]

Translation:
a. Consider, one at a time, each gene in the set of genes of Anabaena PCC 7120.
b. Retrieve the length of the gene and assign that value to a variable called length.
c. If the length is bigger than any previously considered in this loop, remember it, and eventually return this largest length as the result of the completed loop.
d. Loop: Repeat steps b and c until the set of genes has been exhausted.

SQ8. Predict the result you would get by replacing XXX in the example below with each of the five Result Section options. [Code figure not reproduced.]

II.I. Final action and the RETURN command (incomplete)

Sometimes the ways of constructing a result offered by the Result Section are insufficient. The RETURN command makes it possible to return a result of any form at any time.

It is possible to stop the loop during an iteration and return a value by placing RETURN within the Body Section. For example, if you want to find the first instance of a long upstream sequence in a genome, you might do something like this:

[Code figure not reproduced.]

Translation:
a. Consider, one at a time,
each gene in the set of genes of Anabaena PCC 7120.
b. Retrieve the sequence upstream of the gene (the sequence extending from the beginning of the gene backwards until it encounters the previous gene) and assign that value to a variable called upstream-seq.
c. Retrieve the length of the gene and assign that value to a variable called length.
d. If the length of the upstream region is greater than 1000 nucleotides, then return the current gene and exit the loop.
e. Loop: Repeat steps b through d until the above condition is met or until the set of genes has been exhausted.

In some loops you may want to perform some actions after all the iterations have been completed. Most commonly the action to perform is to return a value, using quantities developed by the loop. An example of this type is given on the first page of these notes.

III. Translating ideal loop code into VPL 1.0

At present, if you click on FOR-EACH in the Flow-Logic menu, you don't get the lollipop-flavored display shown in Section II.B of these notes. Instead you get a more confusing (though perhaps less damaging to the retina) image, as shown on the next page. This is because the language is undergoing substantial revision, and while you are using VPL Version 1.0, another version is being constructed along the lines of these notes.

Although the appearance is different, all the functionality is the same, so the trick is to match tools in one packaging with tools in another. The mapping is sometimes not direct:
  The functionality of the Control section and the Variable update section (described in Section II) both may be found by clicking on the arrow in the controls box.
  The functionality of the Results section is split between the cond box (for conditional accumulation of results) and the agg arrow.

SQ9. Translate each of the examples shown in these notes and run them within BioBIKE/VPL.

Introduction to Bioinformatics
Genome Analysis: Sequence contrasts (Dinucleotide and Codon Frequencies)

The previous set of notes introduced the article (Samuel
Karlin (2001). Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends in Microbiology 9:335-343) but focused only on one method described in it, the detection of genomic islands by differences in GC-frequency. Unfortunately, we found that what worked well with 50-Kb-sized fragments did not work well at all with much smaller fragments, those about the size you might expect to find from a metagenome project. Here we'll discuss two other analytical methods considered by Karlin (2001): comparisons of dinucleotide biases (which he calls genomic signatures) and comparisons of codon frequencies.

Dinucleotide Biases

Dinucleotide? Why start there? What happened to mononucleotide? Well, we've already covered that. Since in double-stranded DNA A = T, G = C, and A + G = T + C, one piece of information is all that is needed to describe the frequencies of nucleotides. You can express it as A, or as G, or as G + C (i.e., the GC-frequency), but the information content is the same.

If there's too little information in this single measurement to identify an organism from a short (500-nt) fragment, maybe we'll have better luck with multiple measurements. Dinucleotides give you 15 independent frequencies for a DNA fragment. Maybe that will provide enough information to do the job.

SQ1. Where did the number 15 come from?
SQ2. Use a BioBIKE function to generate all possible dinucleotides.
SQ3. Find the counts of each dinucleotide in the PMed4 chromosome.
SQ4. Convert the counts to frequencies.
SQ5. Relate the frequencies to the dinucleotides. Which dinucleotides have the highest frequencies? Which have the lowest? Does this make sense? What is the GC fraction of PMed4?

Unfortunately, it should make all too much sense. The dinucleotide frequencies at first glance tell us not much more than we already knew from GC-frequencies. But there's a lot more information hidden, waiting to get out. To see it, we need to take into account the frequencies of the individual nucleotides. Of course AA has a high frequency in PMed4, since
A has a high frequency. What we want to know is whether AA has a higher frequency than one would expect, given the frequency of A. The expected frequency of AA is frequency(A) x frequency(A).

SQ6. What is the frequency of AA in PMed4? How close is it to the expected frequency? What is the ratio of the frequency to the expected frequency?
SQ7. Repeat the previous question, but considering the GG dinucleotide.

Now we have a way of giving a fair measure of dinucleotide frequencies. Dividing the frequency by the expected frequency (based on mononucleotide frequencies) tells us how unusual the dinucleotide is, discounting the organism's biases for A, C, G, and T.

Not surprisingly, someone has thought of all this before. Sam Karlin, for one. Look at Box 2 of the article (remember the article?) and, without swallowing your eyeballs, consider the equation on the left side of the box. For the most part it should look familiar, especially if you ignore the detail about concatenating the inverted complement of the sequence (i.e., considering both strands of DNA). We'll call the ratio of observed to expected frequencies the bias.

SQ8. Calculate the dinucleotide frequencies and dinucleotide biases for two organisms: a low-GC organism like Prochlorococcus PMed4 and a high-GC organism like Prochlorococcus P9313. How do they differ from one another?

Comparisons of Dinucleotide Biases

Back to the main question: Can this measure be useful in identifying pieces of DNA that are part of larger pieces of DNA? (Of course, Karlin would ask the opposite question: whether it can be useful in identifying pieces of DNA that are foreign to a larger piece of DNA.) We addressed this question with GC-fraction by looking at the range of values in a genome as the fragment size got progressively smaller. We could do the same thing with dinucleotide biases, but how do you compare sixteen numbers? Do we need sixteen separate graphs, one for each of the sixteen dinucleotides? I hope not. It would be far better if we could come up with a single
number that combines the information from the sixteen dinucleotides. Given the dinucleotide biases for a fragment of DNA, we'd like to know how close they are to the biases of a different fragment of DNA, perhaps the genome it's part of. How to make that comparison?

Now return to Box 2, this time focusing on the equation on the right side of the box. Read the description of the equation, which lies both before and after the equation.

SQ9. Can you make sense of the equation? (It's the right side of the equation that's important. The left side is just a name.)
SQ10. Write a function that will take dinucleotide biases and from them derive a single number representing their similarity.

Why should organisms maintain constant dinucleotide biases over their genomes? I've never heard a convincing explanation, but evidently they do.

Codon usage contrasts

It's a fact of life (at least life on earth) that most amino acids can be encoded by more than one codon. How does an organism decide which one to use? On one hand, the choice makes no difference. A phenylalanine encoded by UUU is just as much a phenylalanine as one encoded by UUC. Nonetheless, organisms do make choices, and their choices are definitely not random.

The codon usage chart below gives an example of the highly divergent choices of two bacteria, Borrelia burgdorferi and Mycobacterium tuberculosis. Note that the table is given in terms of RNA codons. You can transform them into DNA codons by replacing U with T. The first number for each codon is the ratio of the instances of that codon to the total number of codons for that amino acid. The second, less useful, number is the number of times that codon is used per 1000 codons.

Borrelia burgdorferi: 2294 CDS's (612759 codons)
fields: [triplet] [amino acid] [fraction] [frequency per thousand]

UUU Phe 0.88 48.3   UCU Ser 0.32 24.1   UAU Tyr 0.77 31.6   UGU Cys 0.68  4.9
UUC Phe 0.12  6.3   UCC Ser 0.05  3.4   UAC Tyr 0.23  9.2   UGC Cys 0.32  2.3
UUA Leu 0.41 41.5   UCA Ser 0.24 17.6   UAA  *  0.65  2.4   UGA  *  0.16  0.6
UUG Leu 0.16 16.3   UCG Ser 0.03  2.3   UAG  *  0.19  0.7   UGG Trp 1.00  4.4

CUU Leu 0.28 29.0   CCU Pro 0.42 10.0   CAU His 0.73  8.6   CGU Arg 0.07  2.1
CUC Leu 0.02  2.3   CCC Pro 0.15  3.7   CAC His 0.27  3.2   CGC Arg 0.04  1.1
CUA Leu 0.10 10.6   CCA Pro 0.37  8.9   CAA Gln 0.84 22.8   CGA Arg 0.06  1.8
CUG Leu 0.03  2.7   CCG Pro 0.06  1.3   CAG Gln 0.16  4.2   CGG Arg 0.02  0.5

AUU Ile 0.54 53.1   ACU Thr 0.39 17.4   AAU Asn 0.80 60.0   AGU Ser 0.22 16.5
AUC Ile 0.07  7.2   ACC Thr 0.12  5.6   AAC Asn 0.20 15.1   AGC Ser 0.14 10.4
AUA Ile 0.39 38.0   ACA Thr 0.44 19.9   AAA Lys 0.80 87.8   AGA Arg 0.65 20.1
AUG Met 1.00 18.1   ACG Thr 0.05  2.2   AAG Lys 0.20 22.2   AGG Arg 0.18  5.5

GUU Val 0.55 27.9   GCU Ala 0.44 21.2   GAU Asp 0.79 42.0   GGU Gly 0.28 13.7
GUC Val 0.05  2.4   GCC Ala 0.11  5.1   GAC Asp 0.21 11.3   GGC Gly 0.16  7.7
GUA Val 0.30 15.1   GCA Ala 0.39 18.9   GAA Glu 0.75 53.9   GGA Gly 0.41 20.0
GUG Val 0.11  5.4   GCG Ala 0.06  2.7   GAG Glu 0.25 17.8   GGG Gly 0.15  7.4

Coding GC 29.27%   1st letter GC 38.52%   2nd letter GC 28.30%   3rd letter GC 21.01%

Mycobacterium tuberculosis CDC1551: 4187 CDS's (1329826 codons)
fields: [triplet] [amino acid] [fraction] [frequency per thousand]

UUU Phe 0.21  6.2   UCU Ser 0.04  2.3   UAU Tyr 0.30  6.1   UGU Cys 0.26  2.4
UUC Phe 0.79 22.9   UCC Ser 0.21 11.6   UAC Tyr 0.70 14.5   UGC Cys 0.74  6.9
UUA Leu 0.02  1.7   UCA Ser 0.07  3.8   UAA  *  0.15  0.5   UGA  *  0.55  1.7
UUG Leu 0.19 18.1   UCG Ser 0.35 19.5   UAG  *  0.30  1.0   UGG Trp 1.00 14.8

CUU Leu 0.06  5.6   CCU Pro 0.06  3.6   CAU His 0.29  6.6   CGU Arg 0.12  8.7
CUC Leu 0.18 17.2   CCC Pro 0.29 17.0   CAC His 0.71 16.0   CGC Arg 0.38 28.7
CUA Leu 0.05  4.8   CCA Pro 0.11  6.4   CAA Gln 0.26  8.2   CGA Arg 0.10  7.6
CUG Leu 0.51 49.7   CCG Pro 0.54 31.7   CAG Gln 0.74 22.9   CGG Arg 0.33 24.9

AUU Ile 0.15  6.5   ACU Thr 0.07  3.8   AAU Asn 0.21  5.2   AGU Ser 0.07  3.7
AUC Ile 0.79 33.4   ACC Thr 0.59 34.6   AAC Asn 0.79 19.4   AGC Ser 0.26 14.6
AUA Ile 0.05  2.3   ACA Thr 0.08  4.8   AAA Lys 0.26  5.4   AGA Arg 0.02  1.4
AUG Met 1.00 18.6   ACG Thr 0.27 15.7   AAG Lys 0.74 15.1   AGG Arg 0.05  3.4

GUU Val 0.10  8.2   GCU Ala 0.08 11.2   GAU Asp 0.28 15.9   GGU Gly 0.19 18.6
GUC Val 0.38 32.4   GCC Ala 0.45 59.0   GAC Asp 0.72 41.9   GGC Gly 0.51 49.3
GUA Val 0.06  4.9   GCA Ala 0.10 13.0   GAA Glu 0.35 16.2   GGA Gly 0.10 10.0
GUG Val 0.47 40.1   GCG Ala 0.37 48.4   GAG Glu 0.65 30.4   GGG Gly 0.20 18.9

Coding GC 65.77%   1st letter GC 67.82%   2nd letter GC 50.22%   3rd letter GC 79.27%

Tables 1 and 2. Codon usage in Borrelia burgdorferi and Mycobacterium tuberculosis, derived from sequence analysis of each genome. The number of coding sequences (CDS's) and codons from which these frequencies were derived are indicated above each table.
  fraction: proportion of occurrences of a particular amino acid encoded by a particular codon. For each amino acid, the fractions associated with each codon sum to 1.
  frequency per thousand: the number of times each codon was used per 1000 codons in the genome.
  * = stop codon

Introduction to BioLingua
BioLingua Syntax

A. General consideration of syntax
B. Symbols and symbol boundaries
C. Form boundaries
D. Lisp, BioLingua, and BioLingua Lite
E. Loops

A. General consideration of syntax

You might think that people who seem to know a computer language possess either some special knowledge or are endowed with some magical ability to sense what's right. No. Suppose you didn't know English. Then the first sentence ("You might think...") might look to you like this:

  Zpv njhiu uijol uibu qfpqmf xip tffn up lopx b dpnqvufs mbohvbhf qpttftt fjuifs tpnf tqfdjbm lopxmfehf ps bsf foepxfe xjui tpnf nbhjdbm bcjmjuz up tfotf xibu't sjhiu.

What would you do then? To have any hope of understanding this sentence, you'd have to know some metasyntax, by which I mean the overall structure of sentences. You might know that many English sentences take the form:

  SUBJECT - VERB

So you look up the first word in your dictionary. Can Zpv be a noun? Yes! So you can plug that in and begin looking for the verb. After some scrounging around, you find in your dictionary that uijol is indeed a verb and that njhiu is a verb modifier, so you have

  [Box diagram not reproduced.]

but that verb has generated a new box, because uijol, says your dictionary, is a type of verb that takes an object. Can you really fit the rest of the sentence into the object box? You examine the next word, uibu. The dictionary says that it can introduce a sentence, and the whole thing can function as an object. So now you have

  [Box diagram not reproduced.]

And you go
through another round to figure out the inside of the sentence. That's just understanding English. To create new sentences from scratch, armed with only a dictionary, would be nearly impossible. English syntax is just too complex.

Fortunately, BioLingua syntax is not at all complex. There are only two basic metasyntax forms:

Form 1: atom
Examples:
  47       --> 47
  gene     --> value of gene
  "hello"  --> "hello"

Form 2: (function ...)
Examples:
  (BIOLITE-VERSION)
  (GENES-OF A7120)
  (DEFINE x AS 47)

Just as some English verbs can generate new boxes, so can some BioLingua functions. You can learn the syntactical requirements of a BioLingua function by typing (HELP function-name). For example, enter (HELP GENES-OF). You learn that GENES-OF is synonymous with GENE-OF. [1] OK, enter (HELP GENE-OF). From the documentation that HELP provides you may discern the following syntax of GENES-OF:

  (GENES-OF        entity      WITHIN coord1 coord2)
   Function name   Argument    Clause

The function GENES-OF generates a box that must be filled with an entity and optional clauses (only the WITHIN clause is shown here). The three types of boxes are:
  Function name, which must be filled with a single word that is the name of a legal function.
  Argument, which must be filled with a legal atom or function.
  Clause, which must be filled with a legal keyword (legality determined by the specific function) and the object of the keyword, which must be a legal atom or function.

The function may impose further restrictions. GENES-OF demands that its argument be (if an atom) or produce (if a function) the name of a gene, protein, contig, replicon, or organism, or a list of such names.

The following fit into this syntactical scheme:
  (GENES-OF A7120)
  (GENES-OF prA WITHIN 10000 20000)
  (GENES-OF (ORGANISM-OF all4312))

The following do not fit into the scheme:
  GENES-OF A7120
      Functions are bounded by ( ). GENES-OF is interpreted as an atom.
  (GENES OF A7120)
      Function names are single words. GENES is interpreted as the function name.
  (GENES-OF organism A7120)
      The 3rd symbol must be a keyword and must be followed by the object of the keyword.
  (GENES-OF A7120 SMALLER-THAN 100)
      SMALLER-THAN is not a keyword recognized by GENES-OF.

[1] I hope that in the not too distant future HELP will work directly on all functions, without requiring a second trip to a synonym.

SQ1. For each statement below, find the syntax pattern of the function and identify each part of the statement as the function name, a required argument, or a clause. If the statement doesn't fit the syntax pattern, modify it so that it does. [Parentheses and other punctuation in these statements were lost in extraction.]
  1a. LEFT SEQUENCE-OF all4312 10
  1b. DEFINE "mole" AS 602 10A23
  1c. SEQUENCE-OF all4312 FROM 1 TO 20
  1d. COUNT OF GENES OF A7120

Finding the syntactical pattern of a function is like finding a very complete entry of a word in a dictionary. But how do you find the word in the dictionary if you're not sure what the word is? Not easy with English, but not so bad with BioLingua. You have the following strategies:

1. (HELP word)
   This gives you documentation of the function named word. If there is no such function, then you get a list of all the functions BioLingua knows about that relate to word.
2. (HELP "word")
   This gives you a list of all functions BioLingua knows about with the word in the name or documentation.
3. BioLingua Help: Description of Functions (see Resources & Links)
   Functions are organized by subject. You can scan their brief descriptions to find one you like. Clicking on the function brings you (sometimes) to documentation and examples. If such is lacking, at least you know what function name to use with HELP.
4. Find a model
   Think back on programs you've seen in the notes or elsewhere. One of them might do something like what you want. Get in the habit of figuring out other people's programs. Then steal shamelessly.
5. Ask someone who knows
   Hit the panic button (Who We Are on the course web page), reaching Jen and me. Ask anyone who happens to be on BioLingua:
   (MESSAGE-TO ALL "message")
   (MESSAGE-TO user-name "message")
   (MESSAGE-TO user-name user-name
"message")

B. Symbols and symbol boundaries

You make sense out of English sentences only because you can tell where each word begins and ends. This may seem like a minor trick, but you're more complicated than a mere space recognizer. Apart from spaces, you also recognize some punctuation (like commas and periods) but not others (like hyphens and apostrophes) as boundaries of words, and sometimes judgment and experience are needed.

BioLingua can't use judgment and experience and so must rely on well-defined rules instead. And here they are:

Spaces separate symbols. The number of spaces between symbols is not important (except within a quoted string), so long as the number is at least one. Thus the following are equivalent:
  (GENES-OF A7120)
  (GENES-OF      A7120)
  ( GENES-OF A7120 )

Parentheses separate symbols. However, they also carry special meaning, delimiting functions or lists. They cannot be optionally thrown in for readability, as they can in mathematics. So the following are equivalent:
  (GENES-OF (ORGANISM-OF all4312))
  (GENES-OF(ORGANISM-OF all4312))
but not:
  (GENES-OF ((ORGANISM-OF all4312)))

Double quotes delimit a string that is used literally, not as a symbol. It can contain any symbol, even internal double quotes (through a trick). So the following are different:
  A7120      A symbol containing the name of the organism Anabaena PCC 7120
  "A7120"    The letter "A" followed by the digits "7", "1", "2", and "0"

Some other characters [list lost in extraction] also delimit symbols, but they have special meanings, and so you shouldn't use them as delimiters unless you know what you're doing.

Comma is a very special character, so much so that you will never see it in BioLingua statements unless you happen to wander into macros. Therefore, elements of a list are separated by spaces, not commas.

All other characters may be used within symbols, and those symbols may be defined however you like. The following are all legal symbols:
  This-is-a-symbol?    Yes it is, including the question mark.
  1+1                  Just a
symbol. It's not necessarily equal to 2.
    3951 s l             Perfectly OK, if not pronounceable in polite company

SQ2. For each statement below, predict the outcome, then try it out in BioLingua. Fix statements that need fixing. Figure out why things happen as they do.
    1a. DEFINE 11 AS 3
    1b. DEFINE "dozen" AS 12
    1c. 3 1
    1d. 3 1
    1e. 1 3
    1f. DEFINE 5 5 3
    1g. 3 "2"

BioLingua makes no distinction between upper case and lower case unless the characters lie between double quotes. The following are therefore equivalent:
    (DEFINE my-proteins AS (PROTEINS-SIMILAR-TO p-all4312 IN S6803))
    (define MY-PROTEINS as (proteins-similar-to P-aLL4312 in s6803))

I have chosen to render function and keyword names in capital letters and variables in lower case to improve readability, but that is just my choice.

Even though BioLingua doesn't care about case, the system in which it's running (Linux) does care, and it is the system that worries about files and file names. Therefore, when you're saving or loading a file, you need to pay attention to upper/lower case. Since only strings (e.g., characters between double quotes) retain case distinction in BioLingua, you must refer to file names as strings. For example,
    (LOAD-SHARED-FILE "MyFavoriteProteins")
will work only if the file has this exact name, capitals and all.

C. Form boundaries

The BioLingua Web Listener executes one form at a time. That form can be a single symbol or five pages of code. Either way, you get one and only one form executed. That's why entering the following code does not give an error, even though you might expect it to:
    1 1 2
The Listener encounters the first form (the atom 1), returns its value, and ignores the rest.

It's obvious what the extent of a form is when the form happens to be an atom: it's just one symbol long. The case is often not so clear if the form is instead a function. A function extends from its opening parenthesis to the matching closing parenthesis. In the case of complex functions like loops, that closing parenthesis may be quite far
away and difficult to recognize. Fortunately, there are tools to aid your eye. If you put code in the large program window and place the cursor after a closing parenthesis, the Web Listener will tell you, in the information box below, where the opening parenthesis is. There are also more sophisticated programming aids (see Links & Resources).

SQ3. What result does the code below return when executed? Why?
    FOR-EACH number FROM 1 TO 100 DO DISPLAY number " " SUM number

D. The levels of BioLingua: Lisp, BioLingua, BioLingua Lite

Now the secret comes out: the language you have been using is a dialect of an extension of the general-purpose programming language called Lisp. Lisp offers unusually powerful tools that enable users to build extensions of the language, which is why we're using it. Lisp and BioLingua-proper use a very strict and very simple (some would say very beautiful) syntax. BioLingua-proper adds to Lisp a large number of functions useful to biologists. BioLingua-Lite syntax is fuzzier at the boundaries but makes fewer demands on humans. The language you use is an amalgam of functions from all three sources. It may therefore be helpful to understand the syntactical requirements of each.

Syntax of Lisp and BioLingua-proper functions

    (Function-name Argument1 Argument2 ... Clause1 Clause2 ...)
    where each argument is a form and each clause is :keyword form

Function name
o Must be the first symbol, the name of a Lisp or BioLingua-proper function
o One function has one name

Arguments
o The function may require any number of arguments and allow any number of optional arguments
o Each argument consists of a form (atom or function) that evaluates to a defined value. This means that gene names won't work. The Lisp function below fails unless you have previously defined the gene name to contain a value:
    (EXTRACT-SEQUENCE all4312)  --> Error: undefined variable
  but
    (EXTRACT-SEQUENCE #$A7120.all4312)  --> ATG...
o An argument usually must be of a certain type: it must be a list or it cannot be a list, it must be a
string or it cannot be a string, it must be a gene or cannot be a gene, etc.
    (MOLECULAR-WEIGHT-OF "MARGGRC...")  --> molecular weight of sequence
  but
    (MOLECULAR-WEIGHT-OF #$A7120.p-all4312)  --> Error: wrong type

Clauses
o The function may not require clauses but may allow any number of optional clauses
o A clause begins with a keyword, defined by the function, preceded by a colon. The keyword is followed by a form (atom or function) that evaluates to a defined value (not a gene name).

Explicit lists
o A list consists of items enclosed in parentheses. To distinguish it from a function, the opening parenthesis is preceded by a single quote:
    (SETF x '(1 2 3 4))  --> Assigns the list of four numbers to the variable x
o The items must all evaluate to a defined value (not a gene name)

Syntax of BioLingua Lite functions

    (Function-name Argument1 Clause1 Clause2 ...)
    where the argument is a form and each clause is keyword form (the colon is optional)

Function name
o Must be the first symbol, the name of a BioLingua Lite function
o Many functions have multiple synonymous names (e.g., GENES-OF and GENE-OF)

Arguments
o The function requires either zero or one argument and allows no optional arguments
o The argument, if it exists, consists of a form (atom or function) that may or may not evaluate to a defined value. Undefined symbols, like gene names, are interpreted as you probably want them to be. For example:
    (SEQUENCE-OF all4312)  --> "ATG..."
o When possible, the function converts an argument to the type needed by the function:
    (MW-OF "MARGGRC...")  --> molecular weight of sequence
  and
    (MW-OF p-all4312)  --> molecular weight of named protein

Clauses
o The function may or may not require clauses and may allow any number of optional clauses
o A clause begins with a keyword defined by the function (the colon is optional). The keyword is followed by a form (atom or function) that may or may not evaluate to a defined value.
o A clause may accept any of a number of possible types and direct the appropriate action:
    (COUNT-OF "M" IN p-All4312)  --> How many M's in protein
sequence
  and
    (COUNT-OF "ATG" IN "ATGACAGGGA")  --> How many ATG's in sequence
  and
    (COUNT-OF "ATG" IN (SEQUENCE-OF (GENES-OF ss120) FROM 1 TO 3))  --> How many ATG's in list of start codons

Explicit lists
o A list consists of items enclosed in parentheses. A BioLingua Lite function distinguishes a list from a function by whether the first item is a function name.
o The items in a list need not evaluate to a defined value (gene names are OK)

It would be nice if a programming language were coherent, that is, if the entire language behaved in the same way. If it's coherence you want, then you must stick, at our present stage of evolution, to Lisp or BioLingua-proper. If you want to avail yourself of the conveniences offered by BioLingua-Lite, then you must accept that it is an evolving and as yet incomplete language.

Still, it might seem pretty important to be able to distinguish between, say, a Lisp function and a BioLite function. What clues are there? Actually, you can go pretty far assuming that all the functions you use adhere to BioLite conventions. Most of the time you'll be right, and error messages will quickly amend your thinking if you are wrong. If you'd like to reduce the frequency of errors, the documentation of any function will tell you the function's requirements, and you'll probably visit the documentation before using the function for the first time anyway. Furthermore, the names of BioLite functions are often phrases ending in prepositions, like SEQUENCE-OF, instead of BioLingua-proper commands like EXTRACT-SEQUENCE. The coexistence of three languages is not as big a deal as you might think.

SQ4. Look back on a half-dozen functions you've used and identify which language they come from.
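
The boundary rules from sections B and C are mechanical enough to imitate in a few lines. What follows is only a rough Python sketch of those rules, not the way the BioLingua Listener is actually implemented. The names tokenize and first_form are made up for this illustration, and the characters with special meanings (comma, single quote, and the like) are deliberately ignored. It shows that spaces and parentheses end a symbol, that double quotes protect case, and that only the first form of an entry is used.

```python
def tokenize(text):
    """Split text into symbols, parentheses, and quoted strings (hypothetical helper)."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # any number of spaces counts as one boundary
        elif ch in "()":
            tokens.append(ch)  # a parenthesis is a boundary and a token of its own
            i += 1
        elif ch == '"':
            end = text.find('"', i + 1)
            if end == -1:
                raise ValueError("unterminated string")
            tokens.append(text[i:end + 1])  # strings keep their quotes and their case
            i = end + 1
        else:
            start = i
            while i < len(text) and not text[i].isspace() and text[i] not in '()"':
                i += 1
            tokens.append(text[start:i].upper())  # case is ignored outside quotes
    return tokens


def first_form(tokens):
    """Return the first complete form: one atom, or everything from an
    opening parenthesis through its matching closing parenthesis."""
    if not tokens:
        return []
    if tokens[0] == ")":
        raise ValueError("closing parenthesis without an opening one")
    if tokens[0] != "(":
        return tokens[:1]  # an atom is exactly one symbol long
    depth = 0
    for index, token in enumerate(tokens):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth == 0:
                return tokens[:index + 1]
    raise ValueError("missing closing parenthesis")


# Spacing and case do not matter outside quoted strings.
print(tokenize("(GENES-OF A7120)") == tokenize("(  genes-of    a7120 )"))  # prints True

# Only the first form is used; the rest of the entry is ignored.
print(first_form(tokenize("1 1 2")))  # prints ['1']
```

Trying first_form on a long loop with a misplaced closing parenthesis is a quick way to see why the Listener can quietly ignore part of what you typed.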