Adv Studies Comp Linguistics CSE 875
This 5 page Class Notes was uploaded by Donnell Kertzmann on Saturday September 19, 2015. The Class Notes belongs to CSE 875 at Michigan State University taught by Staff in Fall. Since its upload, it has received 40 views. For similar materials see /class/207425/cse-875-michigan-state-university in Computer Science and Engineering at Michigan State University.
Statistics and Linguistic Applications
Hale, April 22, 2008

Proportion test

In Chapter 3 of his book, Baayen reports word frequencies from CELEX. These are based on a corpus sample of 18,580,121 tokens.

word          frequency    relative frequency
the           1,093,547    0.05885575
president         2,469    0.000132884
hare                153    0.0000082
harpsichord          15    0.00000081

Even though we know better, let us for the moment conceptualize the observation of a word like "president" as a SUCCESS and the observation of any other word as a FAILURE. Our probability model is thus a Binomial with parameter p = 0.000133. We have seen (e.g. Vasishth, p. 27) that when the corpus size n and the success probability p are not too close to zero, the Binomial distribution is closely approximated(1) by a Normal distribution with mean np and variance npq, where the failure probability q = 1 − p.

We can now view other corpus samples as results from a kind of language-production experiment. From such a sample we can compute a statistic, the sample proportion, and look up how probable this statistic's value is under the assumed parameterization. Is the 1-million-word Brown corpus, with 382 attestations of "president", a wacky or run-of-the-mill sample? Across many, many corpora, what fraction would attest "president" that many times if the parameter were really p = 0.000133?

Let us compute the standardized score and make a judgment. A standardized score looks like this:

    z = (x − μ) / σ

where x is the value to be standardized, σ is the population standard deviation, and μ the population mean. In this case x is our observed proportion p̂, and we know σ and μ in virtue of approximating the Binomial with the Normal. The standard deviation of the Binomial sample proportion is σ_p̂ = sqrt(pq/n). (The kind of decreasing proportional variability as sample size goes up is suggested in the diagram labeled "Aha" on page 17, section 2.4, of the Vasishth notes.) This yields

    z = (p̂ − p) / sqrt(pq/n)                                   (1)

Multiplying by a form of 1 translates the proportion-based z score into one based on an absolute number of successes. Define the success count x in terms of the proportion of success, such that p̂ = x/n:

    z = (x/n − p) / sqrt(pq/n)
      = n(x/n − p) / (n · sqrt(pq/n))
      = (x − np) / sqrt(n² · pq/n)
      = (x − np) / sqrt(npq)

The denominator now shows exactly the standard deviation of the Normal approximation to the Binomial. In our Brown corpus example,

    z = (382 − 0.000133 × 1,000,000) / sqrt(1,000,000 × 0.000133 × (1 − 0.000133))

> success <- 0.000133
> failure <- 1 - success
> n <- 1e06
> (382 - success * n) / sqrt(n * success * failure)
[1] 21.59247

Wow, a z score of 21.59! That is twenty-one standard deviations above the mean. What is the probability of getting a sample that extreme or more? Let us look it up in our table:

> 1 - pnorm(21.59)
[1] 0

This sample is highly unlikely under the null hypothesis that "president" appears with relative frequency 0.000133. Baayen dryly remarks that "the resulting probability is indistinguishable from zero, given machine precision, and provides ample reason for surprise." He gets to this same conclusion via a different route, calculating the Binomial probabilities directly with pbinom. In fact, the Normal was originally introduced by de Moivre as a way of approximating the Binomial (computers were expensive in 1733). Moreover, the Normal approximation leads us far beyond mere success/failure proportions, as we shall see.

(1) A derivation of the Binomial approximation to the Normal is given on the Mathworld Binomial page.

The multinomial

In the "president" case there were only two possible outcomes, identified with SUCCESS and FAILURE (producing "president" or any other word, respectively). In a particular corpus sample, the count x is often called the observed frequency of success (as opposed to failure). The expected frequency of success is np. The more general Multinomial distribution describes k different categories of events A1, A2, ..., Ak with probabilities p1, p2, ..., pk. But the notions of observed and expected are the same. If we draw a sample of size n from a Multinomial population, the observed frequencies for the events A1, ..., Ak can be described by random variables X1, ..., Xk, whose specific values x1, x2, ..., xk would be the observed frequencies in the sample. The expected frequencies would just be np1, np2, ..., npk.
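The one-proportion arithmetic above can be cross-checked outside R. Below is a minimal Python sketch of the same z score; the Normal upper tail comes from the standard identity P(Z ≥ z) = erfc(z/√2)/2, and the variable names are mine, not from the notes:

```python
import math

# One-proportion z score for 'president' in the Brown corpus:
# z = (x - np) / sqrt(npq), the Normal approximation to the Binomial.
p = 0.000133        # null-hypothesis relative frequency (from CELEX)
q = 1 - p           # failure probability
n = 1_000_000       # Brown corpus size in tokens
x = 382             # observed count of 'president'

z = (x - n * p) / math.sqrt(n * p * q)
print(round(z, 5))    # 21.59247, matching the R session above

# Upper-tail probability P(Z >= z) via the complementary error function;
# far too small to distinguish from zero at any useful precision.
tail = 0.5 * math.erfc(z / math.sqrt(2))
print(tail < 1e-100)  # True
```

This reproduces the R result exactly and shows why `1 - pnorm(21.59)` prints 0: the true tail probability is nonzero but astronomically small.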
Event                 A1     A2     ...    Ak
Observed frequency    x1     x2     ...    xk
Expected frequency    np1    np2    ...    npk

Table 1: The Multinomial assigns probability to k kinds of events.

As an example of a Multinomial, consider bags of M&Ms. How many of each color (blue, brown, green, ...) are there in a bag of 30? The count of one color constrains the others: if 28 are red, then none of the other colors can claim more than 2 of the candies for its own color.

    P(X1 = x1, ..., Xk = xk) = C(n, x1) · C(n − x1, x2) ··· C(n − x1 − ··· − x(k−1), xk) · p1^x1 · p2^x2 ··· pk^xk

where C(n, x) is the binomial coefficient "n choose x".

In the one-proportion z test statistic, equation (1), there are exactly two outcomes. Viewed as a special case of the Multinomial, we might think of them as just the first two of potentially many more outcomes.

The square of the Z score

What if we wanted to generalize beyond SUCCESS and FAILURE? Consider the square of the z score:

    z² = (x − np)² / (npq)

To prepare notationally for a larger set of k event categories, rename the success count X1 and the failure count X2. In the two-event case we have X1 + X2 = n, analogous to p + q = 1. Then

    X1 − np = (n − X2) − np = n(1 − p) − X2 = nq − X2 = −(X2 − nq)

and so the squares of these quantities stand in the relationship

    (X1 − np)² = (X2 − nq)²

These equalities make it possible to rewrite the squared z score as a sum of two addends (note that 1/(np) + 1/(nq) = (p + q)/(npq) = 1/(npq)):

    z² = (x − np)² / (npq) = (X1 − np)² / (np) + (X2 − nq)² / (nq)        (2)

The denominators in equation (2) are the expected frequency of success and the expected frequency of failure, respectively. The numerators represent the discrepancy between observed and expected counts. Generalizing this to the Multinomial,

    (X1 − np1)²/(np1) + (X2 − np2)²/(np2) + ··· + (Xk − npk)²/(npk)

gives a test statistic X² whose distribution, as a sum of squares of Normal deviates, is known:

    X² = Σ_{j=1..k} (Xj − npj)² / (npj) = Σ (observed − expected)² / expected        (3)

The chi-square statistic has k − 1 degrees of freedom if the expected frequencies can be computed without having to estimate the population parameters from statistics. The degrees of freedom in this case is one less than the number of probabilities, reflecting the constraint that they must add up to 1. If it is necessary to estimate m population parameters to specify the null hypothesis, then the statistic follows a χ² distribution with k − 1 − m degrees of freedom.

Example: letter probabilities

If you are a computer at Fort Meade in Maryland, your job might well be to guess which language a sample of text is from. A simple way to do this is to ask whether the letter distribution is very different from what we would expect of language L1, L2, .... Verzani's book includes official frequencies for English as specified in the Scrabble board game. Let us take a look at the overall letter distribution in this quote from this morning's New York Times:

    "The Democratic presidential contest between Senator Hillary Rodham Clinton and Senator Barack Obama in Pennsylvania today will offer a new test of what, exactly, a win is. There are many potential different outcomes, and you can be sure the campaigns will be pointing to all kinds of things in trying to claim victory in this first contest in six weeks."
        (Adam Nagourney, NYT, April 22nd 2008, 10:04am)

> quote.vector <- unlist(strsplit(tolower(quote), ""))
> letter.dist <- sapply(letters, function(x) sum(quote.vector == x))
> barplot(rbind(letter.dist, scrabble$freq))

[Figure: barplot comparing the quote's letter counts with the Scrabble frequencies, letters a through z on the horizontal axis, counts from 0 to 60 on the vertical axis.]

And in particular, the signature vowel distribution that Scrabble predicts:

> vowel.freq <- scrabble$freq[scrabble$piece == "A" | scrabble$piece == "E" |
+                             scrabble$piece == "I" | scrabble$piece == "O" |
+                             scrabble$piece == "U"]
> quote.vowels <- sapply(c("a", "e", "i", "o", "u"),
+                        function(x) sum(quote.vector == x))
> chisq.test(quote.vowels, p = vowel.freq / sum(vowel.freq))

        Chi-squared test for given probabilities

data:  quote.vowels
X-squared = 6.6604, df = 4, p-value = 0.155

The null hypothesis cannot be rejected on the basis of a sample that would be expected fifteen percent of the time. The Multinomial Scrabble model fits just fine. The chi-square does not care which distribution you think generated the data; it is nonparametric. If you can work out the expected frequencies, you can calculate the goodness of fit and then see how far out you are on the χ² distribution.
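Equation (3) is simple enough to compute directly. The Python sketch below uses the English Scrabble tile counts for the vowels (A:9, E:12, I:9, O:8, U:4) as the null distribution; the observed counts are hypothetical stand-ins, since the exact vowel counts of the Times quote are not reproduced here:

```python
# Chi-squared goodness of fit, equation (3): sum over categories of
# (observed - expected)^2 / expected, with expected = n * p_j.
scrabble_vowel_tiles = {"a": 9, "e": 12, "i": 9, "o": 8, "u": 4}
observed = {"a": 20, "e": 28, "i": 18, "o": 20, "u": 10}  # hypothetical sample

n = sum(observed.values())                    # sample size (96)
total = sum(scrabble_vowel_tiles.values())    # 42 vowel tiles in all

chisq = 0.0
for letter, count in observed.items():
    expected = n * scrabble_vowel_tiles[letter] / total
    chisq += (count - expected) ** 2 / expected

df = len(observed) - 1   # k - 1: the five probabilities must sum to 1
print(round(chisq, 4), df)  # 0.5903 4
```

A statistic of 0.59 on 4 degrees of freedom is unremarkable, so a sample like this one would, as with the real quote, give no reason to reject the Scrabble model.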
In this way one may ascertain how well the entire sample, not just its mean, looks like it came from a particular distribution.

Contingency Tables

From the perspective of a Multinomial over k different event types, everything is a 1 × k (one-by-k) table. If we extend into a second dimension, we have an nr × nc contingency table. Frequently in linguistics we cross-classify attestations of a certain sound, word, etc. in two or more ways; these are contingency tables. Even though these data are arranged in a square-shaped table, we can still ask whether the table as a whole shows a large discrepancy as compared to some expected values. To do this, compute equation (3) over the nr · nc cells in the table and compare the obtained X² statistic to a chi-square distribution with the appropriate degrees of freedom:

    (nr − 1)(nc − 1)       if the expected frequencies can be computed without having to estimate population parameters from sample statistics
    (nr − 1)(nc − 1) − m   if the expected frequencies can be computed only by estimating m population parameters from sample statistics

One fascinating hypothesis is that the column variables are probabilistically independent of the row variables. Remember: if two random variables are independent, then their joint distribution is the product of their individual distributions.

    H0: pij = pi · pj
    H1: the pij are not independent

For example, Cooper and Hale (2004) examined "ah", "you know" disfluencies in the Switchboard corpus, a sample of spoken English. Looking at pairs of conjoined constituents (affectionately known as "lobes"), they tabulated whether or not each one contains any disfluency.

                                Lobe 2
                     Disfluent            Fluent
Lobe 1  Disfluent    126 (exp. 99.12)     150 (exp. 176.8)
                     15.3% of total       18.3% of total
        Fluent       145 (exp. 195.18)    400 (exp. 349.12)
                     17.7% of total       48.7% of total

Table 2: Disfluency status of conjoined lobes (N = 821), obtained with tgrep2. A significant association: p < .001.

To work out what we expect under the null hypothesis, consider the marginals. Ignoring Lobe 2 for a moment, there are 150 + 126 = 276 observations where Lobe 1 is disfluent. Call p1,disf = 276/821 = 0.336. Of course, avoiding disfluency is all there is to fluency, so p1,fluent = 1 − p1,disf. Likewise we have p2,disf = 0.359. Under H0 we should see N · p1,disf · p2,disf ≈ 99.1 in the upper-left cell of the contingency table (these expected values are pre-parenthesized for you in Table 2). But in fact all the squared deviations, divided by the expected counts, add up to over sixty. It is highly improbable that Cooper and Hale would have observed this pattern if disfluency in the first constituent had no influence on disfluency in the second.

Arranging for R to calculate your chi-squared test of independence

Let us borrow an example from D. G. Altman, Practical Statistics for Medical Research. As quoted in Dalgaard, this data concerns caffeine consumption among women giving birth. The women are classified by marital status.

> caff.marital <- matrix(c(652, 1537, 598, 242,
+                          36, 46, 38, 21,
+                          218, 327, 106, 67),
+                        nrow = 3, byrow = T)
> colnames(caff.marital) <- c("0", "1-150", "151-300", ">300")
> rownames(caff.marital) <- c("Married", "Prev.married", "Single")
> caff.marital
               0 1-150 151-300 >300
Married      652  1537     598  242
Prev.married  36    46      38   21
Single       218   327     106   67
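The expected counts that the independence test works from can be computed by hand: under independence, each cell's expectation is (row total × column total) / grand total, and equation (3) summed over all 3 × 4 cells gives the statistic on (3 − 1)(4 − 1) = 6 degrees of freedom. A Python sketch of that computation on the caffeine table above (a cross-check of what `chisq.test(caff.marital)` would compute, not the notes' own code):

```python
# Chi-squared test of independence "by hand" on the Altman/Dalgaard
# caffeine-by-marital-status table quoted above.
table = [
    [652, 1537, 598, 242],  # Married
    [ 36,   46,  38,  21],  # Prev.married
    [218,  327, 106,  67],  # Single
]

grand = sum(sum(row) for row in table)            # grand total N
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Equation (3) over all cells, expected = row total * col total / N.
chisq = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chisq += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(col_totals) - 1)
print(round(chisq, 3), df)  # 51.656 6
```

A statistic near 51.7 on 6 degrees of freedom lies far out in the χ² tail, so the test reports a vanishingly small p-value: in these data, caffeine consumption is not independent of marital status.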