# STAT GENET I STAT 550

UW

This 40 page Class Notes was uploaded by Providenci Mosciski Sr. on Wednesday September 9, 2015. The Class Notes belongs to STAT 550 at University of Washington taught by Elizabeth Thompson in Fall. Since its upload, it has received 23 views. For similar materials see /class/192499/stat-550-university-of-washington in Statistics at University of Washington.

Date Created: 09/09/15

## CHAP 2: RELATED INDIVIDUALS, ONE LOCUS

### 2.1 PEDIGREES AND RELATIONSHIPS

#### 2.1.1 TERMINOLOGY

- Founders and nonfounders: founders have no parents specified. They are assumed unrelated.
- Related individuals: having a common ancestor (implies a biological relationship).
- Inbred individuals: individuals whose parents are related (implies the maternal and paternal genes can descend from a single ancestral gene).
- Unilateral (one-sided) and bilateral (two-sided) relationships. Unilateral: half-sibs, aunt-niece, cousins. Bilateral: sibs, double first cousins, etc.
- Cousin-type relationships: half, full and double cousin relationships. Cousins of different degree: $n$th cousins, $k$ times removed.
- Complex relationships: quadruple half first cousins, quadruple second cousins.

#### 2.1.2 GENE IDENTITY BY DESCENT (ibd)

RELATIVES ARE SIMILAR because they have ibd genes, that is, genes that are copies of the same gene in a common ancestor.

NOTE: ibd is defined relative to a given pedigree or time point. ibd genes are of the same allelic type; non-ibd genes are of independent types. See the mother-baby pairs in 1.3.6.

A pedigree or relationship determines probabilities of ibd, which determine probabilities of joint genotypes, which determine probabilities of joint phenotypes, that is, similarity among relatives.

#### 2.2.1 KINSHIP and INBREEDING

The simplest pedigree-defined probabilities of gene ibd are the coefficients of kinship $\psi$ and inbreeding $f$, which measure ibd between two genes:

$$\psi(B,C) = \Pr(\text{homologous genes segregating from } B \text{ and } C \text{ are ibd})$$

$$f_B = \Pr(\text{homologous genes in } B \text{ are ibd}) = \psi(M_B, F_B)$$

where $M_B$ and $F_B$ are the parents of $B$.

#### 2.2.2 EXAMPLES OF PATH COUNTING

- Half sibs: $\frac12 \times \frac12 \times \frac12 = \frac18$
- Two genes from an inbred parent: $1 \times f + \frac12 \times (1-f) = \frac12(1+f)$
- Half sibs with inbred parent: $(1+f)/8$
- Full sibs: $\frac18 + \frac18 = \frac14$
- First cousins: $\frac14 \times \frac12 \times \frac12 = \frac1{16}$
- Double first cousins: $\frac1{16} + \frac1{16} = \frac18$

General formula (Wright, 1922):

$$\psi(B,C) = \sum_{A} \sum_{P_A} \left(\tfrac12\right)^{n_1(P_A) + n_2(P_A) + 1} (1 + f_A)$$

where the sums are over common ancestors $A$ and over paths $P_A$ through $A$, with $n_1(P_A)$ and $n_2(P_A)$ meioses on the two sides of the path.

EXAMPLE: The JV pedigree (Goddard et al., 1996).

[Figure: the JV pedigree.]

2 ancestors, each with 3 paths, each with $m = n = 3$; and 2 ancestors, each with 1 path, each with $m = n = 2$:

$$2 \times 3 \times \left(\tfrac12\right)^7 + 2 \times 1 \times \left(\tfrac12\right)^5 = \tfrac{7}{64}$$
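The path-counting examples above can be checked numerically. The following is a minimal sketch, assuming Wright's formula is read as a sum of terms $(1/2)^{n_1+n_2+1}(1+f_A)$ over paths; the function name `path_kinship` and the `(n1, n2, f_a)` tuple layout are illustrative choices, not part of the notes.

```python
from fractions import Fraction


def path_kinship(paths):
    """Sum Wright's path-counting terms.

    `paths` is a list of (n1, n2, f_a) tuples (hypothetical layout):
    n1, n2 are the numbers of meioses on the two sides of a path through
    a common ancestor, and f_a is that ancestor's inbreeding coefficient.
    """
    half = Fraction(1, 2)
    return sum(half ** (n1 + n2 + 1) * (1 + Fraction(f_a)) for n1, n2, f_a in paths)


# One shared parent, one meiosis on each side.
half_sibs = path_kinship([(1, 1, 0)])
# Two shared parents.
full_sibs = path_kinship([(1, 1, 0), (1, 1, 0)])
# Two shared grandparents, two meioses on each side.
first_cousins = path_kinship([(2, 2, 0), (2, 2, 0)])
# Half sibs whose shared parent has inbreeding coefficient 1/4.
half_sibs_inbred = path_kinship([(1, 1, Fraction(1, 4))])
# JV pedigree as described: 2 ancestors x 3 paths with m = n = 3,
# plus 2 ancestors x 1 path with m = n = 2.
jv = path_kinship([(3, 3, 0)] * 6 + [(2, 2, 0)] * 2)

print(half_sibs, full_sibs, first_cousins, half_sibs_inbred, jv)
```

Exact fractions are used so that the results can be compared directly with the hand calculations.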
#### 2.2.3 RECURSIVE METHOD

$$\psi(B,C) = \tfrac12\left(\psi(M_B, C) + \psi(F_B, C)\right)$$

provided $B$ is not $C$ nor an ancestor of $C$, and

$$\psi(B,B) = \tfrac12(1 + f_B) = \tfrac12\left(1 + \psi(M_B, F_B)\right).$$

Boundary conditions: $\psi(A,A) = \frac12$, and $\psi(A,C) = 0$, if $A$ is a founder and not an ancestor of $C$.

Expanding up the JV pedigree, among the grandparents we have 3 first-cousin pairs and a sib pair. The kinship of first cousins is $1/16$ and of sibs is $1/4$, so overall we have

$$\tfrac14\left(3 \times \tfrac1{16} + \tfrac14\right) = \tfrac7{64}.$$

#### 2.2.4 INBREEDING and GENOTYPE FREQUENCIES

MYTHS: Recessive diseases are more frequent in genetic isolates. This is because isolates are more inbred.

TRUTH: the more inbred individuals within any population have higher probability of homozygosity.

Consider a recessive allele $a$ with freq $q$ and an individual with inbreeding coefficient $f$:

$$\Pr(aa) = (1-f)q^2 + fq = q^2 + fq(1-q)$$

$$\Pr(Aa) = 2q(1-q)(1-f)$$

$$\Pr(AA) = (1-q)^2 + fq(1-q)$$

See population mixtures and Wahlund variance. In population subdivision, people marry those more similar, hence more homozygosity in offspring. In inbreeding, people marry relatives, and hence those more similar, and hence more homozygosity. Inbreeding is a form of population subdivision.

Autozygous $\equiv$ having ibd genes; inbred $\equiv$ having nonzero prob of being autozygous.

$$\Pr(\text{IBD} \mid \text{affected}) = \frac{fq}{q^2 + fq(1-q)} = \frac{f}{f + q(1-f)}$$

TRUTH 2: the affected people in a population have higher probability of being inbred.

Suppose a proportion $\alpha$ of the population (Pop$_1$) has inbreeding coefficient $f$ and others (Pop$_2$) are not inbred.

$$\Pr(\text{affected } (aa)) = (1-\alpha)q^2 + \alpha\left(q^2 + fq(1-q)\right) = q^2 + \alpha f q(1-q)$$

$$\Pr(\text{Pop}_1 \mid \text{affected}) = \frac{\alpha\left(q^2 + fq(1-q)\right)}{q^2 + \alpha f q(1-q)} = \frac{\alpha(q + f - fq)}{q + \alpha f - \alpha f q}$$

which is always $\ge \alpha$ and $\to 1$ as $q \to 0$.

TRUTH 3: The affected inbred people in a population have higher probability of being autozygous.

$$\Pr(\text{autoz} \mid \text{affected}) = \frac{\alpha f q}{q^2 + \alpha f q(1-q)} = \frac{\alpha f}{\alpha f + q(1 - \alpha f)}$$

Same form as before, with $f$ now becoming $\alpha f$.

#### 2.3.1 ibd OF MORE THAN TWO GENES

Label the $2k$ genes of $k$ individuals successively, giving each the label previously assigned to genes to which it is ibd, and otherwise the next available integer.

Example: in the state 1 2 1 3 4 4 1 5, the paternal genes of individuals 1, 2, 4 are ibd and the two genes of individual 3 are ibd.

Reduce to genotypically equivalent classes of states:

12 13 44 15 ≡ 12 31 44 15 ≡ 12 31 44 51 ≡ 12 13 44 51 ≡ 12 23 44 25 ≡ 12 32 44 25 ≡ 12 32 44 52 ≡ 12 23 44 52

Note that when the two genes of the first individual are interchanged, we must relabel the genes ($1 \leftrightarrow 2$) to obtain a legal state label.

The case of 4 genes of two individuals is shown in the Table: there are 15 states and 9 state classes. For 12 genes in 6 individuals there are more than 4 million states, but only about 198,000 state classes.

#### 2.3.2 Table for two individuals

Genes are listed in the order paternal, maternal for $B_1$, then paternal, maternal for $B_2$. (The dot diagrams of the ibd patterns are not reproduced.)

| ibd label (state) | ibd group (class) | individuals autozygous | genes shared |
|---|---|---|---|
| 1111 | 1111 | $B_1$, $B_2$ | 4 genes ibd |
| 1112, 1121 | 1112 | $B_1$ | 3 genes ibd |
| 1211, 1222 | 1211 | $B_2$ | 3 genes ibd |
| 1122 | 1122 | $B_1$, $B_2$ | none |
| 1123 | 1123 | $B_1$ | none |
| 1233 | 1233 | $B_2$ | none |
| 1212, 1221 | 1212 | none | 2 genes shared |
| 1213, 1231, 1223, 1232 | 1213 | none | 1 gene shared |
| 1234 | 1234 | none | none |

#### 2.3.3 RELATIONSHIPS BETWEEN TWO NON-INBRED RELATIVES

For two non-inbred relatives: 7 states, 3 classes, 2 probabilities.

$$\kappa_i = \Pr(i \text{ genes ibd}), \quad i = 0, 1, 2; \qquad \kappa_0 + \kappa_1 + \kappa_2 = 1$$

| Pairwise relationship | $\kappa_0$ | $\kappa_1$ | $\kappa_2$ | $\psi$ |
|---|---|---|---|---|
| Unrelated | 1.00 | 0 | 0 | 0 |
| Parent-offspring | 0 | 1.00 | 0 | 0.25 |
| Monozygous twin | 0 | 0 | 1.00 | 0.50 |
| Full sib | 0.25 | 0.50 | 0.25 | 0.25 |
| Half sib, grandparent, aunt | 0.50 | 0.50 | 0.00 | 0.125 |
| First cousin | 0.75 | 0.25 | 0 | 0.0625 |
| Double first cousin | 0.5625 | 0.375 | 0.0625 | 0.125 |
| Quadruple half first cousin | 0.5312 | 0.4375 | 0.0312 | 0.125 |

Relationships may be represented as points in an equilateral triangle.

[Figure: triangle of $(\kappa_0, \kappa_1, \kappa_2)$ values, with the impossible region marked.]

The following equations relate $\psi$ and $\kappa_i$, $i = 0, 1, 2$:

$$\psi = \tfrac12\kappa_2 + \tfrac14\kappa_1 = \tfrac14(1 + \kappa_2 - \kappa_0)$$

$$\psi(B_1,B_2) = \tfrac14\left(\psi(M_1,M_2) + \psi(M_1,F_2) + \psi(F_1,M_2) + \psi(F_1,F_2)\right)$$

$$\kappa_2(B_1,B_2) = \psi(M_1,M_2)\,\psi(F_1,F_2) + \psi(M_1,F_2)\,\psi(F_1,M_2)$$

$$\kappa_1(B_1,B_2) = 4\psi(B_1,B_2) - 2\kappa_2(B_1,B_2)$$

$$\kappa_0(B_1,B_2) = 1 - \kappa_1(B_1,B_2) - \kappa_2(B_1,B_2)$$

Applying the Arithmetic-Geometric means inequality to these same equations shows $\kappa_1^2 \ge 4\kappa_0\kappa_2$ for all real relationships. See book, p. 38, for details.

#### 2.3.4 EXAMPLE OF QUAD HALF FIRST COUSINS

It is possible for all four of $\psi(M_1,M_2)$, $\psi(F_1,F_2)$, $\psi(M_1,F_2)$ and $\psi(F_1,M_2)$ to be nonzero without the children being inbred. That is, each of the mother and the father of each child is related to both the mother and the father of the other. But, for each child, the mother is not related to the
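father.

For QHFC, $\psi(M_1,M_2) = \psi(F_1,F_2) = \psi(M_1,F_2) = \psi(F_1,M_2) = 1/8$, so

$$\kappa_2 = 2 \times \left(\tfrac18\right)^2 = \tfrac1{32}, \quad \psi = \tfrac18, \quad \kappa_1 = 4\psi - 2\kappa_2 = \tfrac7{16}, \quad \kappa_0 = 1 - \kappa_1 - \kappa_2 = \tfrac{17}{32}.$$

### 2.4 DATA ON RELATIVES

#### 2.4.1 SPECIFYING INHERITANCE

Segregation of genes is fully specified by meiosis indicators:

$$S_i = \begin{cases} 0 & \text{if gene is parent's maternal gene} \\ 1 & \text{if gene is parent's paternal gene} \end{cases}$$

where $i = 1, \ldots, m$ indexes the meioses. The $S_i$ are iid with $\Pr(S_i = 0) = \Pr(S_i = 1) = \frac12$. ibd at a locus is a function of the $S_i$ at that locus.

#### 2.4.2 The general formula

EXAMPLE: DATA ON 1 INDIVIDUAL

Suppose we observe someone who is $A_1A_1$:

$$\Pr(Y) = \sum_S \Pr(Y \mid S)\Pr(S) = \sum_S \Pr(Y \mid J(S))\Pr(S) = \sum_J \Pr(Y \mid J)\Pr(J)$$

Here $J \in \{I, N\}$ (the two genes are ibd, or not), with $\Pr(I) = f$ and $\Pr(N) = 1 - f$. With $q$ the frequency of allele $A_1$,

$$\Pr(A_1A_1) = \Pr(A_1A_1 \mid I)\,f + \Pr(A_1A_1 \mid N)(1-f) = qf + q^2(1-f).$$

$\Pr(Y \mid J(S))$ is the sum, over all possible assignments $\mathcal{A}$ of allelic types to genes, of the product of allele frequencies $q(a_k)$ of assigned alleles $a_k$:

$$\Pr(Y \mid J(S)) = \sum_{\mathcal{A}} \prod_k q(a_k)$$

EXAMPLE: DATA ON TWO INDIVIDUALS

We know the relationship between two individuals, so suppose we can compute the probabilities $\Delta_1, \ldots, \Delta_9$ of the 9 IBD classes (groups of states). Suppose we observe the individuals to be $AA$ and $AC$.

| Class | ibd label $J$ | $\Pr(AA, AC \mid J)$ |
|---|---|---|
| $\Delta_1$ | 1 1 1 1 | 0 |
| $\Delta_2$ | 1 1 1 2 | $q_A q_C$ |
| $\Delta_3$ | 1 2 1 1 | 0 |
| $\Delta_4$ | 1 1 2 2 | 0 |
| $\Delta_5$ | 1 1 2 3 | $q_A \cdot 2 q_A q_C$ |
| $\Delta_6$ | 1 2 3 3 | 0 |
| $\Delta_7$ | 1 2 1 2 | 0 |
| $\Delta_8$ | 1 2 1 3 | $q_A^2 q_C$ |
| $\Delta_9$ | 1 2 3 4 | $q_A^2 \cdot 2 q_A q_C$ |

Total probability of observing $(AA, AC)$ is

$$\Pr(AA, AC) = \sum_{k=1}^{9} \Delta_k \Pr(AA, AC \mid J_k) = \Delta_2\, q_A q_C + \Delta_5\, 2 q_A^2 q_C + \Delta_8\, q_A^2 q_C + \Delta_9\, 2 q_A^3 q_C.$$

#### 2.4.3 DATA ON A NON-INBRED PAIR

$$\Pr(G_1, G_2) = \sum_J \Pr(Y \mid J)\Pr(J \mid R)$$

$$= \kappa_0(R)\Pr(G_1,G_2 \mid J_0) + \kappa_1(R)\Pr(G_1,G_2 \mid J_1) + \kappa_2(R)\Pr(G_1,G_2 \mid J_2)$$

$$= \kappa_0(R)\Pr(G_1,G_2 \mid \text{Unrel}) + \kappa_1(R)\Pr(G_1,G_2 \mid \text{Par-offsp}) + \kappa_2(R)\Pr(G_1,G_2 \mid \text{MZ-twins})$$

#### 2.4.4 Example showing the general formula

Consider the following segregation pattern of genes. Consider the possible allelic types of these genes, given the genotypes of the 5 individuals shown.

[Figure: segregation pattern of genes and genotypes of 5 individuals.]

## CHAPTER 4: MULTILOCUS LINKAGE

### 4.1 MEIOSIS AND INHERITANCE AT LINKED LOCI

#### 4.1.1 The process of meiosis

[Figure (a): The outcomes of the process of meiosis, shown for a single pair of homologous chromosomes in the nucleus of a cell of a diploid organism. Figure: The processes of mitosis and meiosis.]

#### 4.1.2 Genetic distance and Mather's formula

Each chiasma involves 2 of the 4 chromatids, i.e. the potential gametes. In these gametes a chiasma results in a crossover. The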
genetic distance $d$ (in Morgans) is the expected number of crossovers between the loci on a given gamete. Hence the expected number of chiasmata between the loci on the tetrad is $2d$. Usually we measure genetic distance in centimorgans: 100 cM = 1 Morgan. Genetic distance is ALWAYS additive, since expectations are additive. Genetic distance has little to do with physical distance, but 1 cM $\approx 10^6$ bp.

Mather's formula: If we assume no chromatid interference, each chiasma results in a crossover in a given gamete independently with probability $\frac12$. In a given chromosome interval of genetic length $d$, suppose there are $N(d)$ chiasmata. $N(d)$ can have any probability distribution. If $N(d) = 0$, there are no chiasmata, no crossovers, and hence no recombination. If $N(d) = n > 0$, the probability of an odd number of crossovers is $\frac12$ (see homework). Thus we have Mather's formula:

$$\rho(d) = \tfrac12 \Pr(N(d) > 0) = \tfrac12\left(1 - \Pr(N(d) = 0)\right)$$

The only assumption here is the absence of chromatid interference; under this assumption $\rho(d)$ is an increasing function of $d$ and is bounded above by $\frac12$.

#### 4.1.3 Map functions: Haldane's map function

$\rho(d)$ as a function of $d$ is the map function. In Haldane's model, crossovers are assumed to occur as a Poisson process, rate 1 per Morgan. Thus there is no interference. The number of crossovers $C(d)$ is Poisson with mean $d$, the numbers of crossovers in disjoint intervals are independent, and conditionally on the number occurring their locations are uniformly and independently distributed.

Under Haldane's model, $\rho(d)$ is the probability that a Poisson random variable with mean $d$ is odd:

$$\rho(d) = \sum_{k \text{ odd}} e^{-d}\frac{d^k}{k!} = \tfrac12 e^{-d}\left(\sum_{k=0}^{\infty}\frac{d^k}{k!} - \sum_{k=0}^{\infty}\frac{(-d)^k}{k!}\right) = \tfrac12\left(1 - \exp(-2d)\right)$$

Note that under this model $\rho(d)$ is an increasing function of $d$, $\rho(d) \to \frac12$ as $d \to \infty$, and $\rho(d) \approx d$ as $d \to 0$.

Note also that under Haldane's model the number of chiasmata $N(d)$ is Poisson with mean $2d$. Then $\Pr(N(d) = 0) = \exp(-2d)$, and Mather's formula applies.

#### 4.1.4 Interference and other map functions

In fact, interference exists, mainly in crossovers inhibiting the nearby presence of others, and the requirement (for reliable meiosis) that there is at
least one chiasma on every chromosome pair. There are lots of models: the model determines the map function. The reverse is not true.

Historically, people would use their favorite map function to transform $\rho$ to $d$. But if $d$ is small, $\rho \approx d$. The important point is how multilocus computations are done, not what map function is used: almost all multilocus computations assume no interference.

### 4.2 Multilocus recombination probabilities

#### 4.2.1 Meiosis indicators at multiple loci

For multiple loci $j$, $j = 1, \ldots, L$:

$$S_{i,j} = \begin{cases} 0 & \text{if gene at meiosis } i \text{, locus } j \text{ is parent's maternal} \\ 1 & \text{if gene at meiosis } i \text{, locus } j \text{ is parent's paternal} \end{cases}$$

We define

$$S_{\bullet,j} = \{S_{i,j};\ i = 1, \ldots, m\},\ j = 1, \ldots, L; \qquad S_{i,\bullet} = \{S_{i,j};\ j = 1, \ldots, L\},\ i = 1, \ldots, m$$

where $m$ is the number of meioses in the pedigree and $L$ the number of loci along the chromosome.

Dependence of the $S_{i,j}$:

- $S_{i,\bullet}$ are independent over $i$, $i = 1, \ldots, m$.
- $S_{i,j}$ are independent for loci on different chromosome pairs.
- $S_{i,j}$ are dependent among loci $j$ on the same chromosome pair.

#### 4.2.2 Conditional independence: no interference

Assume that the $L$ loci are ordered $1, \ldots, L$ along the chromosome. Let the intervals between successive loci be $I_1, \ldots, I_{L-1}$. Let $T_j = 1$ if a gamete is recombinant on interval $I_j$, and $T_j = 0$ otherwise, $j = 1, \ldots, L-1$. Then, in a given meiosis $i$, $T_j = 1$ if $S_{i,j} \ne S_{i,j+1}$ and $T_j = 0$ if $S_{i,j} = S_{i,j+1}$, $j = 1, \ldots, L-1$. A model for $S_{i,\bullet}$ is equivalent to a model for $(T_1, \ldots, T_{L-1})$.

The simplest models for meiosis assume no interference, i.e. that the $T_j$ are independent. Then the $S_{i,j}$ are first-order Markov over loci $j$, with meioses $i$ being independent. One way to express this is that

$$\Pr(S_{i,j} \mid S_{i,1}, \ldots, S_{i,j-1}) = \Pr(S_{i,j} \mid S_{i,j-1})$$

so that

$$\Pr(S_{i,\bullet}) = \Pr(S_{i,1}) \prod_{j=2}^{L} \Pr(S_{i,j} \mid S_{i,j-1})$$

or, combining the meioses,

$$\Pr(S) = \Pr(S_{\bullet,1}) \prod_{j=2}^{L} \Pr(S_{\bullet,j} \mid S_{\bullet,j-1})$$

(see also 4.2.4).

Another way of expressing this Markov dependence is through the probability of any given indicator $S_{i,j}$ conditional on all the others, $S_{-(i,j)} = \{S_{k,l};\ (k,l) \ne (i,j)\}$, which depends only on the indicators for the same meiosis and the two neighboring loci. For $s = 0, 1$:

$$\Pr(S_{i,j} = s \mid S_{-(i,j)}) = \Pr(S_{i,j} = s \mid S_{i,j-1}, S_{i,j+1}) \propto \rho_{j-1}^{|s - S_{i,j-1}|}(1 - \rho_{j-1})^{1 - |s - S_{i,j-1}|}\ \rho_{j}^{|s - S_{i,j+1}|}(1 - \rho_{j})^{1 - |s - S_{i,j+1}|}$$

where $\rho_j = \Pr(T_j = 1) = \Pr(S_{i,j} \ne S_{i,j+1})$ is the recombination frequency in $I_j$. Note that
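the equation just counts the recombination/non-recombination events in intervals $I_{j-1}$ and $I_j$ implied by the three indicators $S_{i,j-1}$, $S_{i,j} = s$, $S_{i,j+1}$.

Recall that in Chapter 2 we discussed, for a single locus, the equations

$$\Pr(Y) = \sum_S \Pr(Y \mid S)\Pr(S) = \sum_S \Pr(Y \mid J(S))\Pr(S) = \sum_J \Pr(Y \mid J)\Pr(J)$$

There are fewer ibd patterns $J$ than values of $S$. However, although the components $S_{\bullet,j}$ are Markov over loci $j$, gene ibd patterns are not. Different values of $S$ may give rise to the same ibd pattern. Grouping the states of a Markov chain does not, in general, produce a Markov chain. So to use the Markov dependence, we have to use $S$.

#### 4.2.3 The hidden Markov structure

[Figure: the chain $S_{\bullet,1}, \ldots, S_{\bullet,j-1}, S_{\bullet,j}, S_{\bullet,j+1}, \ldots$ with data $Y_{\bullet,j}$ attached to each $S_{\bullet,j}$. Caption: The conditional independence structure of data in the absence of genetic interference.]

Note: $Y^{(j-1)}$, $Y_{\bullet,j}$ and $S_{\bullet,j+1}$ are mutually independent given $S_{\bullet,j}$.

#### 4.2.4 Baum algorithm for total probability

We can go forwards. For data observations $Y = (Y_{\bullet,j},\ j = 1, \ldots, L)$ we want to compute $\Pr(Y)$. Due to the first-order Markov dependence of the $S_{\bullet,j}$ we have

$$\Pr(Y) = \sum_S \Pr(S, Y) = \sum_S \Pr(Y \mid S)\Pr(S) = \sum_S \left(\Pr(S_{\bullet,1}) \prod_{j=2}^{L} \Pr(S_{\bullet,j} \mid S_{\bullet,j-1}) \prod_{j=1}^{L} \Pr(Y_{\bullet,j} \mid S_{\bullet,j})\right)$$

Let $Y^{(j)} = (Y_{\bullet,1}, \ldots, Y_{\bullet,j})$, the data along the chromosome up to and including locus $j$. Note $Y = Y^{(L)}$. Now define the joint probability

$$R_j(s) = \Pr(Y_{\bullet,k},\ k = 1, \ldots, j-1,\ S_{\bullet,j} = s) = \Pr(Y^{(j-1)}, S_{\bullet,j} = s)$$

with $R_1(s) = \Pr(S_{\bullet,1} = s)$. Then

$$R_{j+1}(s) = \sum_{s^*} \Pr(S_{\bullet,j+1} = s \mid S_{\bullet,j} = s^*)\ \Pr(Y_{\bullet,j} \mid S_{\bullet,j} = s^*)\ R_j(s^*)$$

for $j = 1, 2, \ldots, L-1$, with

$$\Pr(Y) = \sum_{s^*} \Pr(Y_{\bullet,L} \mid S_{\bullet,L} = s^*)\ R_L(s^*)$$

#### 4.2.5 Lander-Green algorithm

We can compute $\Pr(Y_{\bullet,j} \mid S_{\bullet,j})$ for simple traits (recall the example at the end of Chapter 2). Then the computation method of 4.2.4 can be applied. However, this exact computation is limited to small pedigrees.

If there are $m$ meioses on the pedigree, then $S_{\bullet,j}$ can take $2^m$ values. Computations involve, for each locus, transitions from the $2^m$ values of $S_{\bullet,j}$ to the $2^m$ values of $S_{\bullet,j+1}$. Computation is of order $L\,2^{2m}$. For Genehunter, for a pedigree with $n$ individuals, $f$ of whom are founders, $m = 2n - 3f$ and $m \le 16$.

Additionally, for each locus, and for each value of $S_{\bullet,j}$, we must compute $\Pr(Y_{\bullet,j} \mid J(S_{\bullet,j}))$. Although this is easy for given $S_{\bullet,j}$, this limits the size of pedigree.

Actually, better algorithms using independence of meioses give us a factored HMM, which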
means we can get an algorithm of order $mL2^m$, but it is still exponential in pedigree size.

The map-specific lod score is $\log_{10}(L(d)/L(\infty))$, where $d$ is the hypothesized chromosomal location of the trait locus, measured in genetic distance, and $d = \infty$ corresponds to $\rho = \frac12$, or absence of linkage. For Genehunter, distances are relative to the first marker at $d = 0$.

The location score is defined as $2\log_e(L(d)/L(\infty))$. Under appropriate conditions, this statistic has approximately a chi-squared distribution in the absence of linkage. We consider lod scores for location, rather than location scores.

Software: Genehunter, Allegro, MERLIN.

#### 4.2.6 EM algorithm for estimating genetic maps

Consider the complete-data log-likelihood:

$$\log \Pr(S, Y) = \log \Pr(S_{\bullet,1}) + \sum_{j=2}^{L} \log \Pr(S_{\bullet,j} \mid S_{\bullet,j-1}) + \sum_{j=1}^{L} \log \Pr(Y_{\bullet,j} \mid S_{\bullet,j})$$

Now recombination parameters enter through

$$\log \Pr(S_{\bullet,j} \mid S_{\bullet,j-1}) = R_{m,j-1}\log(\rho_{m,j-1}) + (M_m - R_{m,j-1})\log(1 - \rho_{m,j-1}) + R_{f,j-1}\log(\rho_{f,j-1}) + (M_f - R_{f,j-1})\log(1 - \rho_{f,j-1})$$

where $R_{m,j-1} = \sum_{i\ \text{male}} |S_{i,j} - S_{i,j-1}|$ is the number of recombinations in interval $I_{j-1}$ in male meioses, and $M_m$ is the total number of male meioses scored in the pedigree. Similarly for female meioses.

The expected complete-data log-likelihood requires only computation of

$$\tilde{R}_{m,j-1} = E(R_{m,j-1} \mid Y) = \sum_{i\ \text{male}} \Pr(S_{i,j} \ne S_{i,j-1} \mid Y)$$

and similarly $\tilde{R}_{f,j-1}$. Since this is a simple binomial log-likelihood, the M-step sets the new estimate of $\rho_{m,j-1}$ to $\tilde{R}_{m,j-1}/M_m$, and similarly for all intervals $j = 2, 3, \ldots, L$ and for both the male and female meioses. The EM algorithm is thus readily implemented to provide estimates of recombination frequencies for all intervals and for both sexes.

Note that

$$\Pr(S_{\bullet,j-1}, S_{\bullet,j} \mid Y) = \Pr(S_{\bullet,j-1}, S_{\bullet,j}, Y)/\Pr(Y)$$

and

$$\Pr(S_{\bullet,j-1}, S_{\bullet,j}, Y) = R_{j-1}(S_{\bullet,j-1})\ \Pr(Y_{\bullet,j-1} \mid S_{\bullet,j-1})\ \Pr(S_{\bullet,j} \mid S_{\bullet,j-1})\ \Pr(Y_{\bullet,j} \mid S_{\bullet,j})\ \Pr(Y_{\bullet,j+1}, \ldots, Y_{\bullet,L} \mid S_{\bullet,j})$$

The first term is just the $R_{j-1}(S_{\bullet,j-1})$ we had in the Baum algorithm, the second and fourth are just single-locus probabilities of data given inheritance, the third is just the recombination/non-recombination transitions, and the final one can be computed by a backwards version of the Baum algorithm.

### 4.3 BIGGER PEDIGREES

#### 4.3.1 Elston-Stewart algorithm

Similar ideas underlie pedigree peeling. We don't have a Markov chain, but for pedigrees with no loops, we do have that
conditionally on the genotype of each individual, data "above" is independent of data "below". Using functions $R(g) = \Pr(\text{data below} \mid G = g)$ and $R^*(g) = \Pr(\text{data above}, G = g)$, we can form similar equations to 4.2.4 and compute the likelihood. BUT these are multilocus genotypes, so computation is now linear in pedigree size, but exponential in number of loci. Methods can be extended to pedigrees with loops, but this is even more computationally intensive.

[Figure: Pedigree without loops. Shaded individuals are those for whom phenotypic data are assumed to be available.]

Software: LIPED, FASTLINK, VITESSE.

#### 4.3.2 Monte Carlo given trait data: SIMLINK

In the early 1980s, often we had trait data, and trait data had simple known models (e.g. dominant, recessive). Marker maps were just starting; marker typing was expensive. How to persuade NIH to fund it?

We can simulate trait and marker data jointly quite easily, assuming some trait model. But it would be much better to simulate what marker data would look like, conditional on the trait data. Ploughman and Boehnke solved this problem:

- Peel up at the trait locus, saving the partial sums.
- Assign founder types at the top, in accordance with trait model and upward peeling.
- Simulate back down at the trait locus, using the saved partial sums.
- Simulate at marker loci linked to the trait, at some assumed marker allele freq and recombination fraction.
- Do the lod score for each simulated data set.
- Compute an empirical Elod.

If this Elod is big enough, then this indicates that the trait data are sufficient for marker typing to be worth doing. For some time, NIH required SIMLINK evidence in proposals.

#### 4.3.3 Monte Carlo Baum and Monte Carlo EM

As well as giving the marginal distributions $\Pr(S_{\bullet,j} = s \mid Y)$, the Baum algorithm also provides a realization from the joint distribution $\Pr(S \mid Y)$. The forward computation is exactly as before. The backward computation is replaced by sampling. First $S_{\bullet,L}$ is sampled from

$$\Pr(S_{\bullet,L} = s \mid Y) \propto \Pr(Y_{\bullet,L} \mid S_{\bullet,L} = s)\ R_L(s).$$

Then, given a realization of $(S_{\bullet,j} = s^*, S_{\bullet,j+1}, \ldots, S_{\bullet,L})$, a straightforward application of Bayes Theorem gives

$$\Pr(S_{\bullet,j-1} = s \mid S_{\bullet,j} = s^*, S_{\bullet,j+1}, \ldots, S_{\bullet,L}, Y) = \Pr(S_{\bullet,j-1} = s \mid S_{\bullet,j} = s^*, Y^{(j-1)})$$

$$\propto \Pr(S_{\bullet,j} = s^* \mid S_{\bullet,j-1} = s)\ \Pr(Y_{\bullet,j-1} \mid S_{\bullet,j-1} = s)\ R_{j-1}(s)$$

where proportionality is with respect to $s$, and $R_{j-1}(s) = \Pr(Y^{(j-2)}, S_{\bullet,j-1} = s)$ as in 4.2.4. Normalizing these probabilities, we can realize $S_{\bullet,j-1}$. This is done for each $j = L, L-1, \ldots, 3, 2$ in turn, providing an overall realization $S = (S_{\bullet,1}, \ldots, S_{\bullet,L})$ from $\Pr(S \mid Y)$.

#### 4.3.4 Monte Carlo EM

An alternative is Monte Carlo EM. Instead of computing the bivariate distributions of $(S_{\bullet,j-1}, S_{\bullet,j})$, $N$ realizations of $S$, $S^{(\tau)}$, $\tau = 1, \ldots, N$, are obtained from the conditional distribution $\Pr(S \mid Y)$ under the current parameter values. These are scored exactly as above:

$$R^{(\tau)}_{m,j-1} = \sum_{i\ \text{male}} |S^{(\tau)}_{i,j} - S^{(\tau)}_{i,j-1}|$$

A Monte Carlo estimate of $\tilde{R}_{m,j-1}$ is $\sum_{\tau=1}^{N} R^{(\tau)}_{m,j-1}/N$, and the new estimate of $\rho_{m,j-1}$ is $\tilde{R}_{m,j-1}/M_m$ as before, again with analogous formulae for all intervals and both sexes. This Monte Carlo EM is readily implemented and, like many Monte Carlo EM procedures, performs as well as the deterministic version. Initially the Monte Carlo sample size $N$ need not be large, although for the final EM steps it should be increased.

#### 4.3.5 Sampling ibd conditional on data

ibd is a function of $S$. By computing $\Pr(S_{\bullet,j} = s \mid Y)$, $j = 1, \ldots, L$, we can compute probabilities of ibd given data $Y$. Alternatively, by sampling $S$ from $\Pr(S \mid Y)$ we can estimate gene IBD on pedigrees. However, this is only feasible on very small pedigrees.

Instead, we can choose a random meiosis $i$, and resample $S_{i,\bullet}$ given $Y$ and given all the $S_{k,\bullet}$, $k \ne i$. This is EASY (see the previous equations): now there are just two possible values at each locus, $S_{i,j} = 0$ or $1$.

This defines a Markov chain over the space of $S$ values. Subject to various conditions, the equilibrium distribution is $\Pr(S \mid Y)$. So we can just keep repeating the resampling process to get (dependent) realizations from $\Pr(S \mid Y)$.

#### 4.3.6 MCMC for lod scores on big pedigrees

Lange-Sobel approach: lm_markers.

Let $Z$ be trait data, $Y$ the marker data, and $S$ the inheritance patterns at the markers. Let $\gamma$ be the hypothesized position of the trait locus. Then

$$\Pr_\gamma(Z, Y) \propto \Pr_\gamma(Z \mid Y) = \sum_S \Pr_\gamma(Z \mid S)\ \Pr(S \mid Y)$$

So we can sample $S$ given $Y$, and estimate $\Pr_\gamma(Z \mid Y)$ by averaging the resulting values of $\Pr_\gamma(Z \mid S)$. Hence we will have a Monte Carlo estimate of the lod score curve.
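The forward (Baum) recursion and the backward sampling step used throughout this chapter can be sketched for the smallest possible case: a single meiosis, so each locus has just two states. This is only an illustration under stated assumptions. The recursion is taken to be the one with $R_j(s) = \Pr(Y^{(j-1)}, S_j = s)$, and the inputs `rho` (recombination fractions per interval) and `emit` (pairs $\Pr(Y_j \mid S_j = 0), \Pr(Y_j \mid S_j = 1)$) are hypothetical numbers, not data from the notes.

```python
import itertools
import random


def transition(rho, s_from, s_to):
    """Pr(S_{j+1} = s_to | S_j = s_from) under no interference."""
    return rho if s_from != s_to else 1.0 - rho


def forward(rho, emit):
    """Return the list of R_j (0-based over loci) and the total probability Pr(Y)."""
    n_loci = len(emit)
    R = [[0.5, 0.5]]  # R_1(s) = Pr(S_1 = s)
    for j in range(n_loci - 1):
        R.append([
            sum(transition(rho[j], t, s) * emit[j][t] * R[j][t] for t in (0, 1))
            for s in (0, 1)
        ])
    total = sum(emit[n_loci - 1][s] * R[n_loci - 1][s] for s in (0, 1))
    return R, total


def sample_backward(rho, emit, R, rng):
    """Draw one realization of (S_1, ..., S_L) from Pr(S | Y)."""
    n_loci = len(emit)
    s = [0] * n_loci
    weights = [emit[n_loci - 1][t] * R[n_loci - 1][t] for t in (0, 1)]
    s[n_loci - 1] = rng.choices((0, 1), weights=weights)[0]
    for j in range(n_loci - 1, 0, -1):
        # Weight each candidate state t at locus j-1 given the sampled s[j].
        weights = [
            transition(rho[j - 1], t, s[j]) * emit[j - 1][t] * R[j - 1][t]
            for t in (0, 1)
        ]
        s[j - 1] = rng.choices((0, 1), weights=weights)[0]
    return s


def brute_force(rho, emit):
    """Pr(Y) by summing over all 2^L indicator vectors (check only)."""
    n_loci = len(emit)
    total = 0.0
    for s in itertools.product((0, 1), repeat=n_loci):
        p = 0.5
        for j in range(n_loci):
            p *= emit[j][s[j]]
            if j > 0:
                p *= transition(rho[j - 1], s[j - 1], s[j])
        total += p
    return total


rho = [0.1, 0.2]                                 # hypothetical interval recombination fractions
emit = [(0.9, 0.2), (0.5, 0.5), (0.1, 0.7)]      # hypothetical Pr(Y_j | S_j = 0 or 1)
R, prob_y = forward(rho, emit)
draw = sample_backward(rho, emit, R, random.Random(1))
```

With $m$ meioses the same code structure applies with $2^m$ states per locus in place of 2, which is where the exponential cost noted in 4.2.5 comes from.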
#### 1.1.1 GENETIC TERMINOLOGY

- Chromosome: long string of double-strand DNA.
- Cell nucleus: has 46 chromosomes, 22 pairs of autosomes and 2 sex chromosomes (X, Y).
- Locus: position on a chromosome, or DNA at that position, or the piece of DNA coding for a trait.
- Allele: type of the DNA at a particular locus on a particular chromosome.
- Genotype: (unordered) pair of alleles at a particular locus in a particular individual.
- Homozygote: a genotype with two like alleles.
- Heterozygote: a genotype with two unlike alleles.
- Phenotype: observable characteristics of an individual.

#### 1.1.2 EXAMPLE: ABO blood types

The ABO locus is on chromosome 9. The main alleles at the locus are A, B and O.

The 6 genotypes are AA, AO, BB, BO, AB and OO. Homozygotes are AA, BB, OO. Heterozygotes are AO, BO and AB.

The 4 phenotypes are blood types A, B, AB and O. The O allele is recessive to A and to B. A and B are each dominant to O. A and B are codominant.

What is a gene? The chunk of DNA coding for a functional protein? Not a locus. Not an allele.

#### 1.1.3 MENDEL's LAWS (1866)

1. At any given locus, each individual has two genes, one maternal and the other paternal. Each individual segregates a randomly chosen one of its two genes to each offspring, independently to each offspring, independently of the gene segregated by the spouse, independently of the gene segregated from the parent.
2. Independently for different loci. Not true: segregation of genes at loci on the same chromosome are dependent.

Mendel's first law says all meioses are independent. Meiosis is the biological process of offspring gamete formation. The total number of meioses $m$ is just the total count of parent-offspring transmissions occurring in the data. For every individual with both parents specified, there are two meioses: the one from his/her father (the paternal meiosis) and the one from his/her mother (the maternal meiosis). This will be clearer when we start to talk more about pedigree data.

#### 1.2.1 GRAPHICAL REPRESENTATION OF PEDIGREES

- Three graphical representations: the parent-offspring links; the sibship representation; the marriage-node graph.
- Founders and nonfounders (no half-founders).
- Gender: male, female and unknown (square, circle, diamond).
- Shading or labelling of individuals.

#### 1.2.2 SPECIFICATION OF PEDIGREES

- Unique individual identifiers (names).
- Parent-offspring trios; default: ind, dad, mom.
- Specification of founders: parent names 0.
- Gender: male, female and unknown: 1, 2, 0 or M, F, U.
- Other data: phenotypic, covariate, and marker data.
- Chronological partial ordering of pedigrees.

| name | dad | mom | sex |
|---|---|---|---|
| 101 | 0 | 0 | 1 |
| 102 | 0 | 0 | 2 |
| 201 | 101 | 102 | 2 |
| 204 | 101 | 102 | 1 |
| 206 | 101 | 102 | 1 |
| fred | 0 | 0 | 1 |
| 203 | 0 | 0 | 2 |
| joe | fred | 201 | 1 |
| jane | 204 | 203 | 2 |
| dave | 204 | 203 | 1 |
| hugh | joe | jane | 1 |
| etc. | | | |

#### 1.3.1 A SAMPLE OF GENES

Consider a single genetic locus, with two (codominant) alleles A and B. Suppose each independent gene has allelic type A with probability $q$. We say $q$ is the population allele frequency of allele A.

For a random sample of $n$ genes from the population, the number of A alleles is $T \sim \text{Bin}(n, q)$. The MLE of $q$ is $T/n$, which is unbiased since $E(T/n) = nq/n = q$. The variance of the MLE is $q(1-q)/n$, which is the smallest possible variance for any unbiased estimator.

Since $\Pr(T = t) \propto q^t(1-q)^{n-t}$, the log-likelihood is

$$\ell = t\log q + (n-t)\log(1-q)$$

So, differentiating the log-likelihood,

$$\frac{\partial \ell}{\partial q} = \frac{t}{q} - \frac{n-t}{1-q} = \frac{n}{q(1-q)}\left(\frac{t}{n} - q\right)$$

So the MLE of $q$ is $t/n$. Also

$$\frac{\partial^2 \ell}{\partial q^2} = -\frac{t}{q^2} - \frac{n-t}{(1-q)^2}, \qquad E\left(-\frac{\partial^2 \ell}{\partial q^2}\right) = \frac{n}{q} + \frac{n}{1-q} = \frac{n}{q(1-q)}$$

So the Fisher information is $n/(q(1-q))$ and the large-sample variance of the MLE is $q(1-q)/n$. In this example, it is the variance for any sample size. For large $n$, MLEs are approximately unbiased and have approximately the smallest possible variance.

#### 1.3.2 A SAMPLE OF INDIVIDUALS

Suppose we sample $n$ individuals, and that $n_1$ have genotype AA, $n_2$ have genotype AB, and $n_3$ have genotype BB ($n_1 + n_2 + n_3 = n$). Then we have $2n_1 + n_2$ genes of allelic type A in a sample of size $2n$. We can estimate $q$ by $(2n_1 + n_2)/2n$, but properties of the estimator depend on the genotype frequencies:

$$\ell = n_1\log\Pr(AA) + n_2\log\Pr(AB) + n_3\log\Pr(BB)$$

TWO extreme cases:

(i) Complete positive dependence: there are no AB individuals in the population ($n_2 = 0$).
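The two homologous genes in an individual are of the same allelic type. The estimator is $n_1/n$ and in effect we have a sample of $n$ genes.

(ii) Hardy-Weinberg equilibrium: There is independence of the allelic types of the two homologous genes within an individual. So $\Pr(AA) = q^2$, $\Pr(AB) = 2q(1-q)$ and $\Pr(BB) = (1-q)^2$.

(iii) See next page for a model of intermediate dependence.

#### 1.3.3 POPULATION STRUCTURE

Suppose populations $i$, each in HWE, with $q_{ij}$ the freq of allele $A_j$ in population $i$, and $\alpha_i$ the proportion of population $i$. So

$$\Pr(A_j) = q_j = \sum_i \alpha_i q_{ij}$$

$$\Pr(A_jA_j) - \Pr(A_j)^2 = \sum_i \alpha_i q_{ij}^2 - \Big(\sum_i \alpha_i q_{ij}\Big)^2 = \sum_i \alpha_i (q_{ij} - q_j)^2 \ge 0$$

$$\Pr(A_jA_l) - 2\Pr(A_j)\Pr(A_l) = 2\sum_i \alpha_i q_{ij}q_{il} - 2q_jq_l = 2\sum_i \alpha_i (q_{ij} - q_j)(q_{il} - q_l)$$

Thus population subdivision results in homozygote excess relative to HWE. This excess is known as the Wahlund variance. In total we therefore have heterozygote deficiency, but NOT necessarily for each heterozygote.

For two alleles, let $q = \sum_i \alpha_i q_i$. If $\sigma^2 = \sum_i \alpha_i (q_i - q)^2$, then the three genotype freqs are $q^2 + \sigma^2$, $2q(1-q) - 2\sigma^2$, and $(1-q)^2 + \sigma^2$.

#### 1.3.4 ESTIMATION in the case of HWE

$$\ell = n_1\log(q^2) + n_2\log(2q(1-q)) + n_3\log((1-q)^2) = \text{const} + (2n_1 + n_2)\log q + (n_2 + 2n_3)\log(1-q)$$

The MLE of $q$ is $(2n_1 + n_2)/2n$. If $T = 2n_1 + n_2$, then $T \sim B(2n, q)$ and $\text{var}(T/2n) = q(1-q)/2n$: back to binomial sampling.

Note: One generation of random mating establishes HWE, since by definition the two genes in an individual are copies of independently sampled parental genes.

#### 1.3.5 CASE OF A RECESSIVE ALLELE A

Suppose $t$ individuals are of type AA, and $n - t$ are not of type AA. Assuming HWE, $\Pr(AA) = q^2$, so

$$\ell = t\log(q^2) + (n-t)\log(1 - q^2)$$

Differentiating,

$$\frac{\partial \ell}{\partial q} = \frac{2t}{q} - \frac{(n-t)2q}{1 - q^2} = \frac{2(t - nq^2)}{q(1 - q^2)}$$

So the MLE of $q$ is $\sqrt{t/n}$. Why should this be expected?

Variance and information: Now $T \sim \text{Bin}(n, q^2)$, but how can we find the variance of this MLE?

$$\frac{\partial^2 \ell}{\partial q^2} = -\frac{2t}{q^2} - \frac{2(n-t)(1 + q^2)}{(1 - q^2)^2}, \qquad E\left(-\frac{\partial^2 \ell}{\partial q^2}\right) = 2n + \frac{2n(1 + q^2)}{1 - q^2} = \frac{4n}{1 - q^2}$$

The variance of the MLE of $q$ is approximately $(1 - q^2)/4n$. Note this is larger than $q(1-q)/2n$. We have to make assumptions. Variance of the estimator is larger. We can measure the information lost.

#### 1.3.6 ESTIMATING FROM DATA ON RELATIVES

For simplicity we consider just mother-baby pairs, and assume HWE. See the tables of conditional and joint probabilities below.

$$\ell = \sum_{i,j} n_{ij}\log\Pr(g_i, g_j)$$

$$= n_{00}\log(q^3) + n_{01}\log(q^2(1-q)) + n_{10}\log(q^2(1-q)) + n_{11}\log(q(1-q)) + n_{12}\log(q(1-q)^2) + n_{21}\log(q(1-q)^2) + n_{22}\log((1-q)^3)$$

$$= (3n_{00} + 2n_{01} + 2n_{10} + n_{11} + n_{12} + n_{21})\log q + (3n_{22} + 2n_{21} + 2n_{12} + n_{11} + n_{10} + n_{01})\log(1-q)$$

$$= m_A\log q + m_B\log(1-q)$$

The MLE of $q$ is $m_A/(m_A + m_B)$, where $m_A + m_B = 3n - n_{11}$ and $m_A = 3n_{00} + 2(n_{01} + n_{10}) + n_{11} + n_{12} + n_{21}$.

Pr(child | parent):

| parent genotype | parent probability | child AA | child AB | child BB |
|---|---|---|---|---|
| AA | $q^2$ | $q$ | $1-q$ | 0 |
| AB | $2q(1-q)$ | $q/2$ | $1/2$ | $(1-q)/2$ |
| BB | $(1-q)^2$ | 0 | $q$ | $1-q$ |

Pr(parent, child) and data counts:

| parent genotype | child AA | child AB | child BB | count, child AA | count, child AB | count, child BB |
|---|---|---|---|---|---|---|
| AA | $q^3$ | $q^2(1-q)$ | 0 | $n_{00}$ | $n_{01}$ | 0 |
| AB | $q^2(1-q)$ | $q(1-q)$ | $q(1-q)^2$ | $n_{10}$ | $n_{11}$ | $n_{12}$ |
| BB | 0 | $q(1-q)^2$ | $(1-q)^3$ | 0 | $n_{21}$ | $n_{22}$ |

#### 1.3.7 ALTERNATIVES TO THE MLE

The MLE is best, but there are simpler estimators that are not so bad.

(a) Use only founders (the moms): estimate $q$ by $(2n_{AA} + n_{AB})/2n$, where $n_{AA}$ is the number of AA moms and $n_{AB}$ is the number of AB moms ($n_{AA} = n_{00} + n_{01}$).

(b) Use everyone, disregarding relationship: estimate $q$ by $(2m_{AA} + m_{AB})/4n$, where $m_{AA}$ is the total number of AA individuals and $m_{AB}$ is the total number of AB individuals ($m_{AA} = 2n_{00} + n_{01} + n_{10}$).

These are both unbiased estimators, but asymptotically the MLE has smaller variance.

#### 1.4.1 TESTING Hardy-Weinberg PROPORTIONS

Consider the following three samples, each of 100 individuals. Each has 120 A alleles, so the MLE of $q$ is 0.6, but they have different genotypic counts $n_c$ in genotype class $c$.

| $n$ | AA | AB | BB | $\hat\ell$ (general) | $\hat q$ | $\hat\ell_0$ (HWE) | $2\log\Lambda$ |
|---|---|---|---|---|---|---|---|
| 100 | 36 | 48 | 16 | −101.33 | 0.6 | −101.33 | 0 |
| 100 | 30 | 60 | 10 | −89.79 | 0.6 | −93.01 | 6.5 |
| 100 | 45 | 30 | 25 | −106.71 | 0.6 | −113.81 | 14.2 |

With probability $p_c$ for class $c$, $\ell = \text{const} + \sum_c n_c\log p_c$, with $\sum_c p_c = 1$.

With no constraints, the MLE of $p_c$ is $n_c/n$ and the maximized value of the log-likelihood is

$$\hat\ell = \sum_c n_c\log(n_c/n) = \sum_c n_c\log n_c - n\log n$$

Assuming HWE, $\hat q = 0.6$, $(1 - \hat q) = 0.4$, and

$$\hat\ell_0 = \sum_c n_c\log p_c(\hat q), \qquad p_c(\hat q) = (0.36, 0.48, 0.16)$$

Now, if HWE is true, $2\log\Lambda = 2(\hat\ell - \hat\ell_0)$ is approximately $\chi^2_1$, and larger otherwise. In our three examples, the values are 0, 6.5 and 14.2. What do we conclude?

#### 1.4.2 TESTING THE ABO BLOOD GROUP MODEL

| | factor freq A | factor freq B | phenotype A | phenotype B | phenotype AB | phenotype O |
|---|---|---|---|---|---|---|
| Data | | | 0.422 | 0.206 | 0.078 | 0.294 |
| $H_1$ theory | $p$ | $q$ | $p(1-q)$ | $(1-p)q$ | $pq$ | $(1-p)(1-q)$ |
| $H_1$ fitted | 0.500 | 0.284 | 0.358 | 0.142 | 0.142 | 0.358 |
| $H_2$ theory | $p$ | $q$ | $p^2 + 2pr$ | $q^2 + 2qr$ | $2pq$ | $r^2$ |
| $H_2$ fitted | 0.295 | 0.155 | 0.411 | 0.194 | 0.091 | 0.303 |

Bernstein reported ABO blood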
types on a sample of 502 individuals: 42.2% type A, 20.6% type B, 7.8% type AB and 29.4% type O. (Did he drop 2 individuals?) For the general model,

$$\hat\ell = 502(0.422\log 0.422 + 0.206\log 0.206 + 0.078\log 0.078 + 0.294\log 0.294) = -626.71$$

#### 1.4.3 TESTING GOODNESS OF FIT

$H_1$: A and B are independently inherited factors. The frequency of individuals having the factor A is 0.500 and of B is 0.284. Independence of the factors would give an AB frequency of $0.500 \times 0.284 = 0.142$, much larger than the 0.078 observed. Under $H_1$ the estimated frequencies are as shown in the Table, and the log-likelihood is

$$\hat\ell_1 = 502(0.422\log 0.358 + 0.206\log 0.142 + 0.078\log 0.142 + 0.294\log 0.358) = -647.50$$

Twice the log-likelihood difference is 41.58, and would be the value of a $\chi^2_1$ random variable if $H_1$ were true. Clearly, $H_1$ is rejected.

Testing another alternative, $H_2$: Under $H_2$, A and B are the two non-null alleles of a single system. Assuming HWE, if the three alleles A, B and O have frequencies $p$, $q$ and $r$ ($p + q + r = 1$), then the frequencies of the four blood types are $p^2 + 2pr$, $q^2 + 2qr$, $2pq$ and $r^2$.

Bernstein pointed out that the sum of the A and O blood type frequencies is $(p + r)^2$, so one minus the square root of this frequency is $1 - p - r = q$. Similarly, one minus the square root of the sum of the B and O blood type frequencies is $p$, and the square root of the O blood type frequency is $r$. The sum of these three numbers should be one. For his data,

$$\left(1 - \sqrt{0.422 + 0.294}\right) + \left(1 - \sqrt{0.206 + 0.294}\right) + \sqrt{0.294} = 0.99$$

which is close to one, suggesting a good fit.

Likelihood ratio test for $H_2$: More formally, we may perform a likelihood ratio test. Finding the MLEs of the parameters $p$, $q$ and $r$ is not simple; in fact, we shall see later that these MLEs are $\hat p = 0.2945$ and $\hat q = 0.1547$, with the resulting fitted frequencies given in the table. The log-likelihood is

$$\hat\ell_2 = 502(0.422\log 0.4114 + 0.206\log 0.1942 + 0.078\log 0.0911 + 0.294\log 0.3033) = -627.52$$

Twice the log-likelihood difference between this and the general alternative is now only 1.62. Again this is the value of a $\chi^2_1$ random variable if $H_2$ is true. $H_2$ is not rejected.

#### 1.5.1 GENE COUNTING: CASE OF RECESSIVE TRAIT

| current $q$ | current $2q/(1+q)$ | recessive phenotype: $t_1$ (AA) | dominant phenotype: $t_2$ (AB) | dominant phenotype: $t_3$ (BB) | new $q$ |
|---|---|---|---|---|---|
| 0.5 | 0.667 | 36 | 42.67 | 21.33 | 0.573 |
| 0.573 | 0.729 | 36 | 46.64 | 17.36 | 0.593 |
| 0.593 | 0.745 | 36 | 47.66 | 16.34 | 0.598 |
| 0.598 | 0.749 | 36 | 47.91 | 16.09 | 0.600 |
| 0.600 | 0.750 | 36 | 48.00 | 16.00 | 0.600 |

The three genotypes are AA, AB and BB, with counts say $t_i$, $i = 1, 2, 3$. Now $n_1 = t_1$, but the counts of AB and BB are unobservable, since B is dominant to A.

The counting algorithm: If counts $t_2$ and $t_3$ were known, then the number of A alleles is $m_1 = 2t_1 + t_2$ and the MLE of $q$ would be $(2t_1 + t_2)/2n$. Further,

$$\Pr(AB \mid AB \text{ or } BB) = \frac{2q(1-q)}{1 - q^2} = \frac{2q}{1 + q}, \qquad n_2 = t_2 + t_3 = 64$$

The EM-algorithm implements the sequence of iterates shown. Starting from an arbitrary initial value $q = 0.5$, the proportion $2q/(1+q)$ is computed, and the 64 individuals of dominant phenotype are divided into the expected numbers $t_2$ and $t_3$ that are AB and BB respectively: E-step. Then a new value of $q$ is estimated as $(2t_1 + t_2)/2n$: M-step.

#### 1.5.2 EM ALGORITHM FOR MULTINOMIAL DATA

In latent variable problems, suppose the actual data are $Y$ and the ideal data that would make the problem easy are $(Y, X)$. The complete-data log-likelihood is

$$\ell^* = \log P_\theta((Y, X) = (y, x))$$

The actual log-likelihood to be maximized is

$$\ell = \log P_\theta(Y = y) = \log\sum_x P_\theta((Y, X) = (y, x))$$

E-step (expectation): At the current estimate $\theta^*$, compute the ECDLL

$$H_y(\theta; \theta^*) = E_{\theta^*}\left(\log P_\theta(X, Y) \mid Y = y\right)$$

M-step (maximization): Maximize $H_y(\theta; \theta^*)$ with respect to $\theta$ to obtain a new estimate $\theta^\dagger$.

Theoretical result: $\ell(\theta^\dagger) \ge \ell(\theta^*)$.

Thus the EM algorithm for finding MLEs alternates E-steps and M-steps. The likelihood is non-decreasing over the process. Where the likelihood surface is unimodal, convergence to the MLE is assured, although it may be slow. Where computable, evaluate the log-likelihood to assess convergence.

For multinomial data, let $n_c$ be the actual data counts, and $m_{cx}$ the complete-data counts for idealized data. So $\ell^* = \sum_{c,x} m_{cx}\log q_{cx}$, and finding the ECDLL just means finding

$$E_{\theta^*}(m_{cx} \mid n_c) = n_c\Pr_{\theta^*}(x \mid c) = n_c\,\frac{q_{cx}(\theta^*)}{\sum_{x'} q_{cx'}(\theta^*)}$$

#### 1.5.3 The ABO log-likelihood

- $\ell = \sum(\text{obs counts})\log P(\cdot)$
- $\ell^* = \sum(\text{all counts})\log P(\cdot)$
- Data: $Y = (N_A, N_B, N_{AB}, N_O)$. Complete-data: $X = (n_{AA}, n_{AO}, \ldots)$.
- Do not confuse $\ell$ and $\ell^*$. $\ell^*$ is just a tool that lets us maximize $\ell$.
- Compute $E(\ell^* \mid Y)$: in the multinomial case this just
involves imputing the hidden counts, but only because $\ell^*$ is a linear function of these counts.

1.5.4 ESTIMATION OF ABO ALLELE FREQUENCIES

For ABO blood group allele frequencies, the EM algorithm is one of the easiest ways to find the MLEs (see the tables below).

E-step: partition the A phenotypes into expected counts of AA and AO genotypes, and similarly B into BB and BO:

$$P(AO \mid \text{type } A) = \frac{2pr}{p^2 + 2pr} = \frac{2r}{p + 2r}, \qquad P(BO \mid \text{type } B) = \frac{2qr}{q^2 + 2qr} = \frac{2r}{q + 2r}.$$

M-step: then $\hat p = \Pr(AA) + (\Pr(AO) + \Pr(AB))/2$ and $\hat q = \Pr(BB) + (\Pr(BO) + \Pr(AB))/2$.

Note: $\hat p$ does not change monotonely, but $\ell$ does. ($\ell$ is the current value of $\ell$, not of $\ell^*$.)

Current values and E-step, with phenotype A ($\Pr(A) = 0.422$) split into AA and AO, and phenotype B ($\Pr(B) = 0.206$) split into BB and BO:

| current $p$ | current $q$ | $P(AO \mid A)$ | $P(BO \mid B)$ | AA | AO | BB | BO |
|---|---|---|---|---|---|---|---|
| 0.3 | 0.3 | 0.73 | 0.73 | 0.115 | 0.307 | 0.056 | 0.150 |
| 0.308 | 0.170 | 0.77 | 0.86 | 0.096 | 0.326 | 0.029 | 0.177 |
| 0.298 | 0.156 | 0.79 | 0.87 | 0.091 | 0.331 | 0.026 | 0.180 |
| 0.295 | 0.155 | 0.79 | 0.88 | 0.089 | 0.333 | 0.025 | 0.181 |

Phenotypes AB and O, new values, and log-likelihood (the first row gives $\ell$ at the initial values $p = q = 0.3$):

| $\Pr(AB)$ | $\Pr(OO)$ | new $p$ | new $q$ | $\ell$ |
|---|---|---|---|---|
| 0.078 | 0.294 | | | −687.12 |
| 0.078 | 0.294 | 0.308 | 0.170 | −629.00 |
| 0.078 | 0.294 | 0.298 | 0.156 | −627.57 |
| 0.078 | 0.294 | 0.295 | 0.155 | −627.53 |
| 0.078 | 0.294 | 0.295 | 0.155 | −627.52 |

1.6 HAPLOTYPES AND ALLELIC ASSOCIATION

1.6.1 ESTIMATING PHASE: 2 loci

Consider diallelic loci (e.g. SNPs). Label the alleles 0 and 1. At two loci there are 4 haplotypes: 00, 01, 10, 11. There are 10 phased two-locus genotypes, e.g. 10/00, 11/00, 10/01.

The observable (unphased) genotype is a pair of pairs. There are 9 (3 × 3) observable two-locus genotypes, e.g. (00, 00), (10, 11). Only the double heterozygote (10, 10) is ambiguous: it can be 11/00 or 10/01. For other genotype pairs we just count.

Suppose the current estimates of the haplotype frequencies are $q_{00}, q_{01}, q_{10}, q_{11}$, and suppose there are $H$ double heterozygotes. Then

$$E\big(\#(11/00) \mid H\big) = H \, \frac{q_{00} q_{11}}{q_{00} q_{11} + q_{10} q_{01}}.$$

So EM is easily implemented. For two or a very few loci it works well.

1.6.2 ESTIMATING PHASE AND HAPLOTYPE FREQUENCIES

Consider diallelic loci (e.g. SNPs). Label the alleles 0 and 1. Then the genotypes are a set of pairs, e.g. (00, 10, 11, 10, 10), and haplotypes a string such as 01010. Determining phase is determining which of the 4 possibilities

- 01111 and 00100
- 01110 and 00101
- 01101 and 00110
- 01100 and 00111

holds. For convenience we may write an unphased
genotype as the pair of rows (01111, 00100), in which the order of the two alleles at each locus carries no meaning, and the phased version as 01111/00100, in which each row is a haplotype.

For large samples and/or small numbers of SNPs, we may use the EM algorithm to estimate haplotype frequencies, and this also provides probabilities of phasings given the estimated frequencies. However, for large numbers of SNPs this does not work well: many sample haplotype frequencies are 0, and the likelihood surface is high-dimensional and multimodal.

1.6.3 HAPLOTYPING: TWO OTHER ALGORITHMS

Clark's algorithm: note that where individuals are homozygous, haplotyping is trivial. It is also trivial if an individual is heterozygous at just 1 locus. Use individuals heterozygous at at most 1 locus to identify haplotypes that must be present. Assuming these, see which other individuals can be explained by one of these haplotypes plus a new one; add these new ones to the collection and continue for as long as possible.

Problems:

- May not be able to start.
- May not be able to finish.
- Final guess may depend on the order in which one adds haplotypes to the pool.

Stephens' algorithm (PHASE): use a model that summarizes similarities of haplotypes in a population; the idea is that haplotypes should look like each other, in chunks. Use Monte Carlo to simulate alternative phasings under the model. Produces probable phasings, with estimated probabilities. Now also fastPHASE.
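
The first of the two haplotyping algorithms above (Clark's) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the published implementation: each genotype is assumed to be coded as a tuple giving the number of copies (0, 1 or 2) of allele 1 at each locus, and the function name `clark_phase` is hypothetical.

```python
def clark_phase(genotypes):
    """Sketch of Clark-style haplotype inference.

    genotypes: list of tuples; entry = copies (0, 1, 2) of allele 1 at a locus
               (an assumed coding, chosen for this illustration).
    Returns (phased, known): phased maps individual index -> (hap1, hap2),
    known is the pool of haplotypes in order of discovery.
    """
    known = []    # haplotype pool
    phased = {}   # resolved individuals

    # Step 1: individuals heterozygous at <= 1 locus are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(1 for a in g if a == 1) <= 1:
            h1 = tuple(1 if a == 2 else 0 for a in g)  # het locus carries 0
            h2 = tuple(1 if a >= 1 else 0 for a in g)  # het locus carries 1
            phased[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Step 2: explain remaining genotypes as (known haplotype) + (new one),
    # adding each new haplotype to the pool, for as long as possible.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            for h in known:
                comp = tuple(a - b for a, b in zip(g, h))  # complement of h
                if all(c in (0, 1) for c in comp):         # valid haplotype?
                    phased[i] = (h, comp)
                    if comp not in known:
                        known.append(comp)
                    progress = True
                    break
    return phased, known


# Example: one homozygote, one single-locus heterozygote, two ambiguous.
example = [(0, 0, 0), (2, 2, 1), (1, 1, 1), (1, 2, 1)]
phased, known = clark_phase(example)
```

In this example the triple heterozygote (1, 1, 1) is resolved as 000 plus 111 only because 000 is tried first; 110 plus 001 would explain it equally well, which illustrates the order dependence listed among the problems. With no unambiguous individuals the pool stays empty and nothing is phased, which is the "may not be able to start" case.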
