Adv Categorical Data Analysis
Adv Categorical Data Analysis STA 7853
This 18 page Class Notes was uploaded by Jacinto Carter Sr. on Thursday October 29, 2015. The Class Notes belongs to STA 7853 at University of Texas at San Antonio taught by Staff in Fall.
Date Created: 10/29/15
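Chapter 1: More on Exact Confidence Intervals for Discrete Distributions

Lecture 3

Confidence Interval by Pivoting

Theorem (Pivoting a discrete cdf). Let $T$ be a discrete statistic with cdf $F_T(t \mid \theta) = P(T \le t \mid \theta)$. Let $\alpha_1 + \alpha_2 = \alpha$, with $0 < \alpha < 1$, be fixed values. Suppose that for each value $t$ of $T$, $\theta_L(t)$ and $\theta_U(t)$ can be defined as follows.

(i) If $F_T(t \mid \theta)$ is a decreasing function of $\theta$ for each $t$, define $\theta_L(t)$ and $\theta_U(t)$ by
$$P(T \le t \mid \theta_U(t)) = \alpha_1, \qquad P(T \ge t \mid \theta_L(t)) = \alpha_2.$$

(ii) If $F_T(t \mid \theta)$ is an increasing function of $\theta$ for each $t$, define $\theta_L(t)$ and $\theta_U(t)$ by
$$P(T \ge t \mid \theta_U(t)) = \alpha_1, \qquad P(T \le t \mid \theta_L(t)) = \alpha_2.$$

Then the random interval $[\theta_L(T), \theta_U(T)]$ is a $1 - \alpha$ confidence interval for $\theta$.

An Example: Poisson Interval Estimator

Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with parameter $\lambda$ and let $T = \sum_{i=1}^{n} X_i$. Then we know that $T$ is a sufficient statistic for $\lambda$ and $T \sim \text{Poisson}(n\lambda)$. Suppose we observe $T = t_0$ and let $\alpha_1 = \alpha_2 = \alpha/2$. Since the Poisson cdf is a decreasing function of $\lambda$, case (i) applies, and to obtain the CI for $\lambda$ we need to solve the following two equations:
$$\sum_{k=0}^{t_0} \frac{e^{-n\lambda}(n\lambda)^k}{k!} = \frac{\alpha}{2} \qquad \text{and} \qquad \sum_{k=t_0}^{\infty} \frac{e^{-n\lambda}(n\lambda)^k}{k!} = \frac{\alpha}{2}.$$

This can be simplified by noting the relationship between the Poisson tail and the chi-square distribution:
$$\frac{\alpha}{2} = \sum_{k=0}^{t_0} \frac{e^{-n\lambda}(n\lambda)^k}{k!} = P(T \le t_0 \mid \lambda) = P\left(\chi^2_{2(t_0+1)} > 2n\lambda\right).$$
The solution of this equation is
$$\lambda_U = \frac{1}{2n}\,\chi^2_{2(t_0+1),\,\alpha/2},$$
where $\chi^2_{\nu,\,p}$ denotes the point of the chi-square distribution with $\nu$ df that has upper-tail probability $p$.

Similarly, applying the other identity, we have
$$\frac{\alpha}{2} = \sum_{k=t_0}^{\infty} \frac{e^{-n\lambda}(n\lambda)^k}{k!} = P(T \ge t_0 \mid \lambda) = P\left(\chi^2_{2t_0} < 2n\lambda\right).$$
The solution of this equation is
$$\lambda_L = \frac{1}{2n}\,\chi^2_{2t_0,\,1-\alpha/2}.$$

Hence a $1 - \alpha$ confidence interval for $\lambda$ is
$$\left[\frac{1}{2n}\,\chi^2_{2t_0,\,1-\alpha/2},\; \frac{1}{2n}\,\chi^2_{2(t_0+1),\,\alpha/2}\right].$$

A Numerical Example

Suppose a random sample of size $n = 10$ from a Poisson distribution gives $t_0 = \sum x_i = 6$. Find a 90% CI for $\lambda$.

We have $2t_0 = 12$ and $2(t_0 + 1) = 14$, $\alpha/2 = 0.05$ and $1 - \alpha/2 = 0.95$. From the chi-square table, $\chi^2_{14,\,0.05} = 23.68$ and $\chi^2_{12,\,0.95} = 5.23$. Hence $\lambda_L = 5.23/20 = 0.2615$ and $\lambda_U = 23.68/20 = 1.184$, and a 90% CI is $(0.262, 1.184)$.

Binomial Distribution

Similarly we can construct a $100(1 - \alpha)\%$ CI for $\pi$ of the Bernoulli distribution. If $X_1, X_2, \ldots, X_n$ is a random sample from a Bernoulli distribution ...

Chapter 1: Historical Perspectives & Basics (see Chapter 16)

STA 7853 Analysis of Categorical Data, Spring 2007

Lecture 1

Methods of analyzing continuous data attracted the attention of researchers in the late 19th to early 20th centuries on such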
topics as regression analysis and analysis of variance (Francis Galton, R. A. Fisher, G. Udny Yule). Although Karl Pearson and Yule wrote articles on measures of association between categorical variables, there was little work done on categorical response data. Despite important contributions of R. A. Fisher, Jerzy Neyman, William Cochran, and Maurice Bartlett on methods of analyzing continuous data, the analog of regression models for categorical data received little attention until late in the 20th century. Most of the developments in categorical data research that we will study took place after 1960.

Most of the developments in categorical data research were stimulated by the increasing need and sophistication in sociological and biomedical research.

In sociological research, some common examples of categorical variables are attitudes and opinions on various issues such as abortion, gender, political, behavioral, etc.

In biomedical research, some common examples of categorical variables are severity of injuries, degree of recovery from surgery, and stages of diseases.

Regression-type approaches were developed for analyzing multivariate discrete data only much later.

Some leading researchers in the social sciences are Leo Goodman, Shelby Haberman, Frederick Mosteller, and Stephen Fienberg. Some leading researchers in the biomedical field are Joseph Berkson, Jerome Cornfield, and Gary Koch.

For an entertaining history of developments in categorical data analysis (CDA), see Chapter 14 of the text.

Categorical Variable

A categorical variable is one for which the measurement scale consists of a set of categories or classes.

Examples:
1. Student classification: Undergraduate, Graduate
2. Income level: Low, Middle, High
3. Injury: Low, Moderate, Severe
4. Gender: Female, Male

Classification of variables

Variables can be classified based on many considerations. Some of these are given below.

1. Response and Explanatory variables

• Response variable: A variable which is affected by some other variable and whose values cannot be determined at will is called a response variable. It is also called a
dependent variable in the regression setting.

• Explanatory variable: A variable which explains or affects the values of the response variable. Often we are able to select its values as we please, depending on the experiment. In the regression setting it is also called an independent variable.

In this course the response variable is categorical, whereas the explanatory variables can be of any type.

2. Scales of measurement

Variables can also be identified based on the scale of measurement. There are four measurement scales, given below in order from weakest to strongest.

• Nominal scale: Here the values of the variable are denoted by arbitrary labels, symbols, or classes. Some examples are:
  Employment status: Employed, Unemployed (U, E or 0, 1)
  Gender: Female, Male
  Opinion on an issue: Support, Oppose, No opinion

• Ordinal scale: This measurement scale is similar to the nominal scale, but the category or name can be associated with the extent to which some underlying property is possessed. The order of the names or symbols symbolizing the categories is meaningful. Any set of distinct numbers of increasing magnitude can be used to represent the outcomes of an ordinal scale. However, the differences of the category values are not meaningful. The methods developed for nominal data can be used on ordinal data, but the methods developed for ordinal data cannot be used for nominal data. Some examples of ordinal data are:
  Injury: Moderate, Severe
  Recovery from surgery: Slow, Moderate, Fast
  Income level: Low, Medium, High

• Interval scale: It can be thought of as an ordinal scale where outcomes are labeled with numbers rather than with categories, and where the numerical difference between any two numbers is a measure of the amount of difference in the underlying characteristic. The position of zero on the scale is arbitrary. An example of a variable on the interval scale is temperature: a temperature of zero degrees does not mean absence of temperature. All the methods for the nominal and ordinal scales work for this scale.
• Ratio scale: Similar to the interval scale, but the position of zero is unique and indicates the absence of the measured characteristic. Ratios of two numbers also make sense. Most variables of quantitative type are measured on this scale. Examples of ratio scale include:
  Income in consecutive years employed
  Earnings per share
  Height, weight, distance, area

3. Discrete and Continuous variables

• Discrete variable: A variable is discrete if the number of values it takes is finite or countable. A discrete variable has gaps in its values. Examples:
  Total number of accidents in an industry per week
  Total number of defective items in a sample of N items

• Continuous variable: A continuous variable can take all possible values in an interval. Examples:
  Time it takes to drive from home to UTSA
  Distance of homes from UTSA
  Heights and weights of people

Note that many continuous variables are discretized because the measurement scale limits how precisely values can be recorded.

Categories of Multivariate Problems by the Type of Response and Explanatory Variables

Response variable | Explanatory variables
                  | Categorical                                | Continuous                      | Mixed
Categorical       | Cross-classified categorical data problems | Linear logistic response models | Linear logistic response models
Continuous        | ANOVA models                               | Regression models               | ANCOVA models
Mixed             | --                                         | --                              | --

where "--" indicates a lack of generally accepted classes of multivariate models and methods designed to deal with mixtures of discrete and continuous variables.

Distributions for Categorical Data

Data analysis requires assumptions regarding the random mechanism generating the data. For example, in regression models with a continuous response, the normal distribution is central. For categorical data the following three models are very important.

Bernoulli and the Binomial distributions

Consider a sequence of $n$ independent and identical trials where each trial results in a binary observation, 0 or 1, often called failure and success. Let $Y_1, Y_2, \ldots, Y_n$ be the $n$ responses of these trials and let, for $i = 1, 2, \ldots, n$, $P(Y_i = 1) = \pi$ and $P(Y_i = 0) = 1 - \pi$, where $Y_i = 1$ if
the $i$th outcome results in a Success and $Y_i = 0$ if the $i$th outcome results in a Failure.

Each of the $n$ trials is called a Bernoulli trial and each $Y_i$ is said to have a Bernoulli distribution with the probability function
$$P(Y_i = y) = \pi^{y}(1 - \pi)^{1 - y}, \qquad y = 0, 1.$$

Binomial Distribution

Let $Y = \sum_{i=1}^{n} Y_i$, where $Y$ represents the total number of successes in the $n$ trials. Then $Y$ has a binomial distribution with $n$ trials and probability of success $\pi$: $Y \sim \text{Bin}(n, \pi)$. The probability function (pf) for the binomial distribution with parameters $n$ and $\pi$ is
$$P(Y = y) = \binom{n}{y}\,\pi^{y}(1 - \pi)^{n - y}, \qquad y = 0, 1, \ldots, n.$$
The mean and the variance are
$$E(Y) = n\pi, \qquad V(Y) = n\pi(1 - \pi).$$

When the sampling is without replacement from a finite population, the trials are neither independent nor identical. Then we use a distribution called the hypergeometric distribution.

Binomial Approximation to Normal

When $n$ is large and $\pi$ is such that $n\pi$ and $n(1 - \pi)$ are both greater than 5, the binomial distribution approaches the normal distribution with $\mu = n\pi$ and $\sigma^2 = n\pi(1 - \pi)$. Using the correction for continuity improves the approximation.

Hypergeometric Distribution

Consider a population with $N$ items, of which $D$ items result in success and $N - D$ items result in failure. We select $n$ items out of $N$ without replacement. Let $Y$ denote the number of successes out of $n$. Then $Y$ has a hypergeometric distribution with pf
$$P(Y = y) = \frac{\binom{D}{y}\binom{N - D}{n - y}}{\binom{N}{n}}, \qquad \max(0,\, n - (N - D)) \le y \le \min(n, D).$$
The mean and the variance of the hypergeometric distribution are
$$\mu = n\,\frac{D}{N} \qquad \text{and} \qquad \sigma^2 = n\,\frac{D}{N}\left(1 - \frac{D}{N}\right)\left(\frac{N - n}{N - 1}\right).$$

Hypergeometric Approximation to Binomial

As $N \to \infty$ such that $D/N \to \pi$, the hypergeometric distribution approaches $\text{Bin}(n, \pi)$.

Multinomial Distribution

Many experiments are such that each trial can result in $c$ outcomes with $c > 2$. In this situation we have a distribution called the multinomial distribution.

Consider a sequence of $n$ independent, identical trials such that each trial can result in one of $c$ possible mutually exclusive outcomes. Let $Y_{ij} = 1$ if the $i$th trial results in outcome $j$ and $Y_{ij} = 0$ otherwise. Also let $P(Y_{ij} = 1) = \pi_j$ with $\sum_{j=1}^{c} \pi_j = 1$. Then $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{ic})$ represents a multinomial trial
with $\sum_{j=1}^{c} Y_{ij} = 1$. Let $n_j = \sum_{i=1}^{n} Y_{ij}$ be the number of trials out of $n$ which result in outcome $j$, with $\sum_{j=1}^{c} n_j = n$. Then the vector of counts $(n_1, n_2, \ldots, n_c)$ has the multinomial distribution $\text{Mult}(n;\, \pi_1, \pi_2, \ldots, \pi_c)$. The joint pf of $(n_1, n_2, \ldots, n_c)$ is given by
$$P(n_1, n_2, \ldots, n_c) = \frac{n!}{n_1!\, n_2! \cdots n_c!}\,\pi_1^{n_1}\pi_2^{n_2}\cdots\pi_c^{n_c}.$$
This is a $(c - 1)$-dimensional distribution with
$$E(n_j) = n\pi_j, \qquad \text{Var}(n_j) = n\pi_j(1 - \pi_j), \qquad \text{Cov}(n_i, n_j) = -n\pi_i\pi_j \quad (i \ne j).$$

The setup of the multinomial distribution is as follows:

Trial  | Categories                                   | Total
       | 1        2        3        ...   c           |
1      | y_{11}   y_{12}   y_{13}   ...   y_{1c}      | 1
2      | y_{21}   y_{22}   y_{23}   ...   y_{2c}      | 1
...    | ...                                          | ...
n      | y_{n1}   y_{n2}   y_{n3}   ...   y_{nc}      | 1
Totals | n_1      n_2      n_3      ...   n_c         | n

Note that marginally each $n_j \sim \text{Bin}(n, \pi_j)$, where $n_j = \sum_{i=1}^{n} y_{ij}$ and $\pi_j$ is the probability of outcome $j$ on each trial.

Poisson Distribution

Let $Y$ represent the total number of events occurring at the rate of $\mu$ per unit of time. Then $Y$ has the Poisson distribution with mean $\mu$, and its pf is given by
$$P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$
For this distribution, $E(Y) = \mu$ and $\text{Var}(Y) = \mu$.

Poisson Approximation to Normal

As $\mu$ gets large, the Poisson distribution approaches the normal distribution with mean and variance $\mu$.

A Relationship between Poisson and Multinomial Distribution

Let $Y_1, Y_2, \ldots, Y_c$ be $c$ independent Poisson random variables with means $\mu_1, \mu_2, \ldots, \mu_c$ respectively. Then the conditional distribution of $(Y_1, Y_2, \ldots, Y_c)$ given $\sum_{j=1}^{c} Y_j = n$ is multinomial, $\text{Mult}(n;\, \pi_1, \ldots, \pi_c)$, where $\pi_j = \mu_j / \sum_{k=1}^{c} \mu_k$. This result is useful in estimating parameters for contingency tables.

Overdispersion

In many practical applications, data exhibit more variability than predicted by the binomial or the Poisson distributions, for the following reasons.

Binomial: The binomial distribution assumes that the probability of success remains constant for all the $n$ trials. In practice, however, this assumption is often violated. Among $n$ individuals, the probability of supporting a certain proposal may vary from person to person. In a family with $n$ members, the probability of infection due to cold may not be the same for each member because of varying levels of tolerance, etc. This situation can also arise with multinomial experiments.

Poisson: If $Y$ represents the total number of accidents in an
industry in any one-hour period, it is assumed that the accident rate for each worker is the same during that time. However, the accident rates differ from worker to worker due to inherent causes such as accident proneness, etc.

These factors cause overdispersion in the data that is not accounted for by the assumed models such as the binomial or the Poisson.

To see the overdispersion for the Poisson distribution, suppose $\mu$ is random with $E(\mu) = \theta$. Then
$$E(Y) = E[E(Y \mid \mu)] = E(\mu) = \theta,$$
$$\text{Var}(Y) = \text{Var}[E(Y \mid \mu)] + E[\text{Var}(Y \mid \mu)] = \text{Var}(\mu) + E(\mu) = \text{Var}(\mu) + \theta > \theta.$$

Statistical Inference for Categorical Data

Inference for categorical data is based on the method of maximum likelihood estimation. Hence most of the inference is asymptotic and is based on the large sample properties of the MLE. There are very few exact methods available for CDA.

Method

Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a distribution with parameter $\theta$. Let $\ell(\theta) = \ln L(\theta)$ be the log likelihood of the sample, which is a function of the parameters given the data. The mle of $\theta$ is obtained by maximizing $\ell(\theta)$ with respect to $\theta$, which is in turn obtained by solving the system of equations
$$\frac{\partial \ell(\theta)}{\partial \theta} = 0.$$
Let the solution be $\hat{\theta}$.

Standard Error of $\hat{\theta}_i$

Let $\theta_i$ be the $i$th element of $\theta$. Let $I$ be the information matrix with the $(i, j)$ element
$$-E\left[\frac{\partial^2 \ell(\theta)}{\partial \theta_i\, \partial \theta_j}\right].$$
Let $I^{-1}$ be the inverse of $I$, with $I^{ij}$ as the $(i, j)$ element of $I^{-1}$. Then the standard error of $\hat{\theta}_i$ is
$$SE(\hat{\theta}_i) = \sqrt{I^{ii}}.$$

Large Sample Properties of MLEs

Under some general regularity conditions, the mle $\hat{\theta}$ of $\theta$ has the following properties:

1. Asymptotically, for large $n$, $\hat{\theta} \sim N(\theta, I^{-1})$.
2. $\hat{\theta}$ is a consistent estimator of $\theta$.
3. $\hat{\theta}$ is asymptotically efficient, in the sense that for large $n$ the standard error $SE(\hat{\theta})$ is no greater than the standard error of any other estimator.

Some Examples

1. MLE of the Bernoulli Parameter $\pi$

Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a Bernoulli distribution with probability of success $\pi$. Then it can be shown that the mle of $\pi$ is
$$\hat{\pi} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i \qquad \text{and} \qquad SE(\hat{\pi}) = \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}.$$
For large $n$, $\hat{\pi} \sim N\left(\pi,\, \pi(1 - \pi)/n\right)$.

2. MLE for the Poisson Parameter $\mu$

Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a Poisson distribution with parameter $\mu$. Then it
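can be shown that the MLE of $\mu$ is $\hat{\mu} = \bar{Y}$ with $SE(\hat{\mu}) = \sqrt{\hat{\mu}/n}$.

Large sample property of $\hat{\mu}$: The Poisson family satisfies the regularity conditions; therefore, asymptotically for large $n$, $\hat{\mu} \sim N(\mu, \mu/n)$.

3. MLE for the Multinomial distribution

Let $(n_1, n_2, \ldots, n_c) \sim \text{Mult}(n;\, \pi_1, \ldots, \pi_c)$. Show that the MLE of $\pi_i$ is $\hat{\pi}_i = n_i/n$ with $\text{Var}(\hat{\pi}_i) = \pi_i(1 - \pi_i)/n$ for $i = 1, 2, 3, \ldots, c - 1$.

Relationship between normal and chi-square distributions

(a) Let $Z$ have a standard normal distribution, $Z \sim N(0, 1)$. Then $Z^2 \sim \chi^2_1$, a chi-square distribution with 1 df.

(b) Reproductive property of the chi-square distribution: Let $X_1, X_2, \ldots, X_k$ be independent chi-square random variables with dfs $\nu_1, \nu_2, \ldots, \nu_k$ respectively. Let $X = \sum_{i=1}^{k} X_i$; then $X \sim \chi^2_{\nu_1 + \nu_2 + \cdots + \nu_k}$, i.e., $X$ has a chi-square distribution with $\nu_1 + \nu_2 + \cdots + \nu_k$ df.

(c) Let $X \sim N_k(\mu, \Sigma)$, a $k$-variate multivariate normal distribution ...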
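As a numerical check on the Poisson interval example from Lecture 3 ($n = 10$, $t_0 = 6$, 90% level), the two tail equations of the pivoting theorem can be solved directly instead of through a chi-square table. The following is a minimal sketch under stated assumptions, not the method used in the notes: it uses plain bisection with the Python standard library only, and the names `poisson_cdf`, `bisect_root`, `exact_poisson_interval`, and the search bracket are illustrative choices.

```python
import math


def poisson_cdf(t, mean):
    """P(T <= t) for T ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)          # P(T = 0)
    total = term
    for k in range(1, t + 1):
        term *= mean / k            # P(T = k) obtained from P(T = k - 1)
        total += term
    return total


def bisect_root(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi] by bisection; assumes f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)


def exact_poisson_interval(t0, n, alpha):
    """Equal-tailed exact interval for lambda, given T = sum(X_i) = t0 and T ~ Poisson(n * lambda)."""
    # Search bracket for lambda: an assumption of this sketch, generous for moderate alpha.
    bracket = (t0 + 10.0 * math.sqrt(t0 + 1.0) + 10.0) / n
    # Upper limit solves P(T <= t0 | lambda_U) = alpha / 2.
    upper = bisect_root(lambda lam: poisson_cdf(t0, n * lam) - alpha / 2, 0.0, bracket)
    if t0 == 0:
        lower = 0.0                 # P(T >= 0) = 1 for every lambda, so the lower limit is 0
    else:
        # Lower limit solves P(T >= t0 | lambda_L) = 1 - P(T <= t0 - 1 | lambda_L) = alpha / 2.
        lower = bisect_root(
            lambda lam: 1.0 - poisson_cdf(t0 - 1, n * lam) - alpha / 2, 0.0, bracket
        )
    return lower, upper


lower, upper = exact_poisson_interval(t0=6, n=10, alpha=0.10)
print(round(lower, 4), round(upper, 4))
```

The printed limits should agree with the table-based values $\lambda_L = 0.2615$ and $\lambda_U = 1.184$ up to the rounding of the chi-square table. Bisection is used here only because both tail probabilities are monotone in $\lambda$, so a sign change brackets a unique root.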