Data Methods in Economics
Data Methods in Economics ECO 725
This 103-page set of class notes was uploaded by Reba Terry on Sunday, October 25, 2015. The notes belong to ECO 725 at the University of North Carolina at Greensboro, taught by Staff in Fall. Since its upload, it has received 21 views. For similar materials see /class/229064/eco-725-university-of-north-carolina-at-greensboro in Economics at the University of North Carolina at Greensboro.
Introduction to the Panel Study of Income Dynamics (PSID)

Source material: "An Overview of the Panel Study of Income Dynamics," http://psidonline.isr.umich.edu/Guide/Overview.html

Overview
- The PSID is a longitudinal survey of U.S. families and their members that has been ongoing since 1968.
- The panel primarily collects economic and demographic information, though it also collects sociological and psychological measures, and other measures through occasional supplements.
- Over time, the sample size has grown from 4,800 families to more than 7,000 families in 2001.
- The PSID is conducted by the Survey Research Center in the Institute for Social Research at the University of Michigan.

The Sample
- The original sample in 1968 consisted of two subsamples:
  - a cross-sectional national sample of about 3,000 families, and
  - a national sample of about 2,000 low-income families.
- From 1968-1996, the PSID interviewed and re-interviewed individuals from families in the core sample every year, including people who split off from their original families to form new families and people born into these families; other members of new families are interviewed while they are in these families but are not followed if the family dissolves.
- Changes in 1997: the survey switched to biennial interviewing, the core sample was reduced, and a refresher sample of immigrant families was introduced.

Data Collection
- Interviews: face-to-face interviews using paper-and-pencil questionnaires from 1968-1972, mostly phone interviews from 1973-1993, and computer-assisted phone interviews since then.
- Core content: the central focus is economic and demographic, with detailed data on income sources and amounts, employment, family composition changes, and residential location. Sociological and psychological content is included in some waves. Information applies to the family as a whole, the head of household, the spouse of the head, and (with less information) other individuals. Measures and methods are largely comparable over time, though there have been some changes.
- Core topics: income sources and amounts, including poverty status and public assistance; other financial matters (e.g., taxes, interhousehold transfers, expenditures); family structure and demographic measures, including marital events, births and adoptions, and children forming households (nest-leaving); labor outcomes and housework; housing (e.g., own/rent, house value/rent payment, size); geographic mobility; socioeconomic background (e.g., education, ethnicity, religion, military service, parents' education, occupation, and poverty status); general health status, disability, and emotional distress.

Supplements to the PSID and Supplemental Files
- In addition to the core content, the PSID has collected additional information on either a one-shot or irregular basis.
- Major supplemental topics: housing and neighborhood characteristics (1968-1972, 1977-1987); achievement motivation (1972); child care (1977); child support and child development (1997, 2002); job training and job acquisition (1978); retirement plans (1981-1983); health (health status, health expenditures, health care of the elderly, and parents' health: 1986, 1990, 1991, 1993, 1995, 1999-2003); kinship (financial situation of parents, time and money help to and from parents: 1980, 1988); wealth (assets, savings, pension plans, fringe benefits: 1984, 1989, 1994, 1999-2003); education (grade failure, private/public school, extracurricular activities, school detention, special education, Head Start programs) and criminal offense (1995); military combat experience (1994); risk tolerance (1996); immigration history (1997); time use (1997, 2002); philanthropic giving (2001-2003).

File Structure of the PSID Data
- Before 1990, the PSID main files for each interviewing wave consisted of:
  - a Cross-Year Family-Individual Response File,
  - a Cross-Year Family-Individual Nonresponse File, and
  - a Cross-Year Family File.
- From the 1990 data on, there is a new structure with:
  - separate single-year files with family-level data collected in each wave (i.e., 26 family files for data collected from 1968 through 1993); each contains one record per family interviewed in that wave, identified and sorted by Family ID;
  - one cross-year individual file with individual-level data collected from 1968 to the most recent wave; it contains one record for each person ever in a PSID family through the current year, is identified by 1968 Family ID and Person Number, and contains the Family ID for each year.
- Special supplemental files were created for the convenience of researchers or because the information doesn't fit well with the core files; some of these are restricted-use:
  - Active Saving File (1984-1989 and 1989-1994): intended to measure flows of money into and out of different assets.
  - Estimating Risk Tolerance (1996): asked employed respondents about their willingness to take jobs with different income prospects and other events.
  - 1994-2001 Family Income Plus Files.
  - 1993 Health Care Burden File: contains the detailed data on health events of the elderly and their burden on immediate and extended families.
  - 1993 OFUM Income Detail File: contains information on other family unit members (OFUMs).
  - 1994-2001 Hours of Work and Wage Files.
  - Wealth Files: permit researchers to ask questions about household saving over five-year periods (1984-1989, 1989-1994, and 1994-2001).
  - 1988 Time and Money Transfers File.
  - Several health supplements covering 1990-1995:
    - the 1990 Self-Administered Health Supplement contains data on health status, health-care coverage, and long-term care coverage of heads and wives 50 and over;
    - the 1990 Telephone Health Supplement contains detailed data on health care costs and utilization for heads and wives aged 65 and over, as well as data on health services provided or available to the elderly;
    - the Parent Health Supplement has data about the health status, health care utilization experiences, health needs, and resources of the parents and parents-in-law of the head of the family.
  - Demographic History Files, including the 1985-2001 Childbirth and Adoption History File and the 1985-2001 Marriage History File.
  - 1985 Ego-Alter File: provides retrospective event history data on substitute-parenting events and usage of public assistance programs.
  - 2001 Parent Identifier File.
  - 1968-1985 Relationship File: clarifies the crude relationships in the main PSID file in early years and relates all pairs of individuals (including co-residence status) associated with a given family.
  - Work History Files: contain detailed information about employment and unemployment and the timing of those events.
  - 1968-1980 Retrospective Occupation-Industry Files: contain three-digit 1970 Census occupation and industry codes for selected PSID heads and wives.
  - Geocode Match Files: contain the identifiers necessary to link the main PSID data to detailed Census data (restricted use).
  - 1968-1999 Death File (restricted use).
  - Medicare File: in 1990, individuals age 55 or older who indicated they were Medicare beneficiaries were asked to sign permission forms for access to Medicare claim records between 1984 and 1990; additional permissions were sought in 1991-1995 (restricted use).

The PSID Data Center
- It is possible to download entire files from the PSID web site and to work with these directly; to do this:
  - go to the PSID site, http://psidonline.isr.umich.edu;
  - on the left-side menu, select Data & Documentation;
  - follow the links to either Packaged Main Data, Documentation and Questionnaires (http://simba.isr.umich.edu/Zips/ZipMain.aspx) or Packaged Supplemental Data and Documentation (http://simba.isr.umich.edu/Zips/ZipSupp.aspx);
  - you should be able to download the files directly.
- However, the site also has a Data Center that allows researchers to search for and select the specific variables that they need:
  - the organization of the PSID is complex, and linking across files can be tricky; the search tools allow you to search across files, and family and individual data are automatically linked by the Data Center;
  - from the Data & Documentation page, select the link to the Data Center (http://simba.isr.umich.edu).
- Both the direct-download and Data Center pages require you to register before you can download data; initial searches can be done without registering.
- Search and data-selection features: when you enter the Data Center, you will get a welcome page. (The original notes include screenshots of the Data Center pages here.)
  - The welcome page lists four ways to add variables to your cart: by file, by index, by search, and by cart (retrieving data carts created by you or others).
  - Choosing variables by file presents a list of files (e.g., Active Savings All Years, Time Diary Aggregates) with their numbers of observations and variables.
  - The index allows you to easily view, compare, and select variables.
  - The search page lets you search the codebooks by word or phrase (e.g., searching "total family income" returns the family-level variable ER28037, TOTAL FAMILY INCOME-2004, from the PSID Main Family Data 2005) and add matching variables to your cart.
  - Clicking on the symbol next to a variable name brings up its codebook entry; for example, the entry for TOTAL FAMILY INCOME-2004 explains that the variable is the sum of seven income components and lists its valid value ranges and special codes.
  - You select variables by checking
items and adding them to your cart. (A screenshot of the data cart appears here in the original notes; the cart lists the selected variables, such as 1968 INTERVIEW NUMBER, SEX OF INDIVIDUAL, and TOTAL FAMILY INCOME-2004, and lets you delete checked items or empty the cart.)
- When you check out, you will be prompted for the format of your codebook (HTML, PDF, XML, or no codebook), the format of the data file (ASCII data with SAS statements, Microsoft Excel spreadsheet, ASCII data with Stata statements, or SAS V9 transport data file), any subsetting, and other options.
- After you submit your request, you will be e-mailed when your codebook, data file, and accompanying code (if needed) are available.

In-class exercise
1. Go to the PSID website and create a cart with the following variables:
   a. Summary Variables / Sampling Variables / Family History Variables:
      i. 1968 INTERVIEW NUMBER
      ii. PERSON NUMBER 68
      iii. SEX OF INDIVIDUAL
   b. PSID Main Family Data 2005:
      i. TOTAL FAMILY INCOME-2004
   c. PSID Individual Data:
      i. 2005 INTERVIEW NUMBER
      ii. SEQUENCE NUMBER 05
      iii. RELATION TO HEAD 05
      iv. AGE OF INDIVIDUAL 05
      v. EMPLOYMENT STATUS 05
      vi. YEARS COMPLETED EDUCATION 05
      vii. COREIMM INDIVIDUAL LONGITUDINAL WT 05
   (Variable codes: ER30001, ER30002, ER32000, ER28037, ER33801, ER33802, ER33803, ER33804, ER33813, ER33817, ER33848.)
2. Check out, creating:
   a. a PDF codebook, and
   b. an ASCII data file with SAS code.
3. Download the codebook, data, and SAS code.
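The SAS code that accompanies the download reads the fixed-width ASCII extract into a SAS data set with an INFILE/INPUT step. A minimal sketch of what that generated code looks like follows; the file path and the column positions are hypothetical placeholders (your downloaded code lists the actual ones), and only a few of the cart's variables are shown.

```sas
/* Hypothetical sketch of the Data Center's generated code.       */
/* The INFILE path and column positions are placeholders; use the */
/* values in your downloaded SAS code.                            */
DATA psid_extract;
  INFILE 'c:\temp\psid_cart.txt' LRECL=40;
  INPUT
    ER30001  1-4     /* 1968 interview number      */
    ER30002  5-7     /* person number (1968)       */
    ER32000  8       /* sex of individual          */
    ER28037  9-15    /* total family income, 2004  */
  ;
RUN;

/* Verify the observation and variable counts against the codebook */
PROC CONTENTS DATA=psid_extract; RUN;
```

Running PROC CONTENTS after the DATA step is a quick check that what was read in matches the codebook you downloaded.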
Introduction to Tables and Graphs in SAS

1. Descriptive statistics are valuable as analytical tools and in developing data.
- In the introduction, we went over basic reporting commands: frequencies (PROC FREQ), univariate statistics (PROC MEANS and PROC UNIVARIATE), and correlations (PROC CORR).
- This class will discuss these routines in more detail, focusing on how to produce simple tables and cross-tabulations and conditional tables of statistics; it will also introduce a general tabulation procedure (PROC TABULATE) and discuss some plotting and graphing procedures.

2. Printing information from a data set
- The simplest reporting procedure is PROC PRINT, which prints the values from a data set.
- The syntax for this procedure is

    PROC PRINT <option-list>;
      VAR <variable-list>;

- This procedure is helpful for analyzing modestly sized data sets.
- You need to be careful not to call this procedure with large data sets, such as the PSID, because the resulting output can overwhelm the OUTPUT window.
- The VAR statement will help to control the width of the output.
- Data set options and general options are two ways to control the length of the output.

3. Univariate statistics
- In empirical analyses, we often include a table of univariate statistics to document the data set; we also want to be sure that the measures that we are using have reasonable values and distributions (e.g., all the values are in range, and there is some variation in the values).
- We can use the MEANS and UNIVARIATE procedures to generate these statistics; the advantage of these procedures is that they produce a great deal of standard output with very few commands.
- PROC MEANS: the syntax of the PROC MEANS statement is

    PROC MEANS <option-list> <statistic-keyword-list>;

- If you don't specify any statistic keywords (the default), the MEANS procedure reports the number of nonmissing values, means, standard deviations, minimum values, and maximum values for all of the variables in your data set, or for all of the variables listed in an accompanying VAR statement.
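As a quick, self-contained illustration of these two procedures (the data set and variable names below are made up for the example):

```sas
/* Toy data set: print it, then summarize it. */
DATA toy;
  INPUT id wage educ;
  DATALINES;
1 12.50 12
2 18.00 16
3  9.75 10
;
RUN;

PROC PRINT DATA=toy;
  VAR id wage;    /* the VAR statement limits the width of the output */
RUN;

PROC MEANS DATA=toy N MEAN STD MIN MAX;
  VAR wage educ;  /* default statistics for the listed variables */
RUN;
```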
- You can also specify alternative statistics, such as the median (MEDIAN), the sum (SUM), the variance (VAR), the standard error (STDERR), and the coefficient of variation (CV); a complete list is available in the Base SAS documentation at support.sas.com.
- The MEANS procedure generally outputs one line per variable; however, it may output multiple lines if your variables are labeled or if you request lots of statistics.
- PROC UNIVARIATE: the syntax for the PROC UNIVARIATE statement is

    PROC UNIVARIATE <option-list>;

- Although it is not required, it is a good idea to include a VAR statement specifying the variables for which statistics will be generated.
- The main difference between the UNIVARIATE and MEANS procedures is that the UNIVARIATE procedure estimates many more statistics (usually 1-2 pages of output per variable).
- To produce simple distributions of categorical variables, we can use the FREQuency procedure; the syntax for a univariate distribution is

    PROC FREQ;
      TABLE <row_var>;

  where row_var is the categorical variable. This will produce a table with rows containing the numbers, percentages, cumulative numbers, and cumulative percentages of observations in each category of row_var.

4. Two-way cross-tabulations
- Cross-tabulations are tables of statistics that are computed conditionally. For example, in an empirical analysis we might want to list simple descriptive statistics or distributions of variables for separate subsamples of our larger analysis sample (e.g., list results separately for women and men, or separately by ethnicity).
- Many statistical procedures in SAS have simple commands that allow for conditional processing.
- PROC FREQ: in a previous lecture, we showed how the FREQuency procedure could be used to produce two-way tables; the syntax is

    PROC FREQ;
      TABLE <row_var> * <col_var> / <options>;

  This will produce a two-way table with cells corresponding to each possible combination of row_var and col_var.
- Be careful in your table requests: you generally want to avoid using this with variables with lots of different outcomes, especially continuous variables.
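A small, self-contained example of both the one-way and the two-way forms (data and names made up):

```sas
/* Toy data set with two categorical variables. */
DATA toy2;
  INPUT gender $ employed;
  DATALINES;
F 1
M 0
F 1
M 1
;
RUN;

PROC FREQ DATA=toy2;
  TABLE gender;             /* one-way distribution of gender     */
  TABLE gender * employed;  /* two-way gender-by-employment table */
RUN;
```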
  - Note: if needed, you can specify formats to control the display of continuous variables.
  - Also, to minimize output, you generally want the row variable to be the variable with the most potential outcomes.
- In the output, each cell will contain:
  - the cell frequency (the number of observations in that cell);
  - the table percentage (cell observations as a percent of all nonmissing observations in the dataset);
  - the row percentage (cell observations as a percent of observations with a given value of row_var); and
  - the column percentage (cell observations as a percent of observations with a given value of col_var).
- Options can be specified to suppress the calculation of some of these statistics:
  - NOFREQ suppresses the printing of cell frequencies;
  - NOPERCENT suppresses the printing of table percentages;
  - NOROW suppresses the printing of row percentages;
  - NOCOL suppresses the printing of column percentages.
- You can also request formal statistics of the association between the row and column variables, such as the Pearson and Spearman correlation coefficients; the general option for these is MEASURES.
- Unless you specify otherwise, the FREQuency procedure ignores missing values; if you would like missing values to be included as an additional category, include the MISSING option.
- The CLASS statement in the MEANS and UNIVARIATE procedures: these procedures also will produce conditional statistics; in newer versions of SAS, this can be done using a CLASS statement, whose syntax is

    CLASS <variable-list>;

  With this, the procedures will calculate statistics for subgroups defined by the CLASS variables:
  - if there is just one CLASS variable, the procedures will calculate statistics for each different value;
  - if there are two CLASS variables, the procedures will calculate statistics for each observed combination of values.
- For example, suppose that you had a data set with two categorical variables, gender and education. Then

    PROC MEANS;
      CLASS gender education;

  produces simple descriptive statistics for each combination in the cross-product of gender and education.
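A self-contained version of this example, with a made-up data set:

```sas
/* Conditional statistics via the CLASS statement. */
DATA workers;
  INPUT gender $ education $ wage;
  DATALINES;
F HS   11.00
F Coll 19.50
M HS   12.25
M Coll 21.00
;
RUN;

PROC MEANS DATA=workers MEAN STD;
  CLASS gender education;  /* one row of statistics per gender-education cell */
  VAR wage;
RUN;
```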
- Tests of differences of means (the TTEST procedure): when we calculate conditional means and other statistics, we are often interested in formally testing whether these statistics differ across groups.
- The TTEST procedure tests for differences in means and variances across two groups; the syntax is

    PROC TTEST;
      VAR <variable-list>;
      CLASS <class_var>;

  The CLASS variable needs to be restricted to two outcomes; if the VAR statement is omitted, TTEST calculates tests for all of the numerical variables in your data set.
- PROC TABULATE: the TABULATE procedure is a general procedure for generating tables; as such, it combines features of the MEANS and FREQuency procedures, but with a lot more flexibility, including multiple conditioning. The general syntax is

    PROC TABULATE <option-list>;
      CLASS <class-variable-list>;
      VAR <analysis-variable-list>;
      TABLE <page-expression>, <row-expression>, <column-expression> / <table-options>;

- The CLASS statement is needed to identify potential conditioning variables, and the VAR statement is needed to identify potential outcome variables; the actual use of the variables depends on the TABLE statement.
- Expressions consist of variables, operators, statistical keywords, and formatting instructions. Some TABLE operators:
  - a comma (,) is used to distinguish page, row, and column dimensions;
  - an asterisk (*) requests cross-product conditioning within a dimension;
  - parentheses are used to group elements.
- Consider a dataset with two categorical CLASS variables, c and d, and two analytical variables, x and y.
- Example 1: the specification

    PROC TABULATE;
      CLASS c d;
      VAR x y;
      TABLE c, d*(x y)*(MEAN STD);

  would produce a table of means and standard deviations of x and y, conditional on the possible combinations of c and d, where the values of c would appear in the rows and d would appear in the columns.
- Example 2: changing the TABLE statement to

    TABLE c, (ALL d)*(x y)*(MEAN STD);

  would add an initial column where the means of x and y were only conditioned on c (i.e., weren't conditioned
on d).
- Example 3: changing the TABLE statement to

    TABLE (ALL c), (ALL d)*(x y)*(MEAN STD);

  would additionally add an initial row where the means of x and y weren't conditioned on c.
- Example 4: changing the TABLE statement to

    TABLE c*d, (x y)*(MEAN STD);

  would produce the same statistics as example 1; however, all of the conditioning would take place by rows instead of by rows and columns.
- Example 5: changing the TABLE statement to

    TABLE c, d*N;

  would produce frequencies of the combinations of c and d.
- Additional information on PROC TABULATE, including examples and more elaborate formatting options, is available in the Base SAS documentation at support.sas.com.

5. Creating data sets that contain statistics with standard statistical procedures
- The MEANS procedure allows you to include an OUTPUT statement; the syntax for this statement is

    OUTPUT OUT=<SAS-data-set> <other-specifications>;

- Unless you specify otherwise, the OUTPUT statement for the MEANS procedure will generate one observation:
  - per standard statistic (N, MEAN, STD, MIN, MAX), with the type of statistic identified by the _STAT_ variable; and
  - per possible classification group.
- You can request specific statistics; to do this, you list the type of statistic that you want and name the output variables. For example,

    PROC MEANS;
      CLASS c d;
      VAR x y;
      OUTPUT OUT=tsum MEDIAN=x_med y_med;

  produces output with one observation per possible classification combination (all observations, classified by c, classified by d, and classified by c and d) and with the MEDIANs for x and y stored in the variables x_med and y_med, respectively.
- Note that requesting specific statistics results in one observation per classification group.
- OUTPUT statements can also be used with the UNIVARIATE and FREQuency procedures.
- The SUMMARY procedure can also be used to produce data sets with the same types of statistics as the MEANS procedure; the primary difference between the two procedures is that PROC SUMMARY does not produce listing output. Thus, the SAS data set tsum from the previous example could have
also been produced with, for example,

    PROC SUMMARY;
      CLASS c d;
      VAR x y;
      OUTPUT OUT=tsum MEDIAN=x_med y_med;

6. Graphs and SAS/GRAPH procedures
- SAS has an extensive graphics capability through the SAS/GRAPH module; SAS/GRAPH supports many types of graphs, including bar charts, line charts, pie charts, and maps.
- Scatter and line plots (the GPLOT procedure): PROC GPLOT produces two-way scatter and line plots; it will also produce bubble plots. The syntax for scatter and line plots is

    PROC GPLOT <options>;
      PLOT <plot-request> / <plot-options>;

- The plot request for an outcome variable y along the vertical axis and an independent variable x along the horizontal axis would be y*x.
- You can also specify a series of conditional plots:
  - let c be the conditioning variable;
  - the request for plotting y against x, conditional on different values of c, would be y*x=c.
- Unless you specify otherwise, the PLOT command in GPLOT produces scatter plots; specifying line plots is a little tricky:
  - you need to define the SYMBOLs for the PLOT and specify that the SYMBOLs will INTERPOLATE between values;
  - the syntax for requesting that the symbols for the first category in your graph be joined is SYMBOL1 I=JOIN;
- Example of a simple scatter plot:

    PROC GPLOT;
      PLOT y*x;

- Example of a simple line plot:

    SYMBOL1 I=JOIN;
    PROC GPLOT;
      PLOT y*x;

- Unless otherwise specified, output from SAS/GRAPH procedures will be directed to a graph window; from there, you can save the graphs as JPEG or other types of graphics files using FILE > EXPORT AS IMAGE.
- SAS/GRAPH involves numerous procedures; only some simple examples for a particular type of graph have been shown here. For more information about the SAS/GRAPH procedures, see the SAS/GRAPH documentation at support.sas.com.

7. Looking for other ways to store your output
- SAS sends listing output, including graph output, to its Output Delivery System (ODS).
- The default for the ODS is the LISTING destination: results from most procedures are sent to the OUTPUT window, and results from SAS/GRAPH are sent to a GRAPH window.
- It is possible to
redirect SAS output to other sources through ODS commands. The ODS destinations are:
  - LISTING (the default);
  - HTML, which produces HTML code and graphics files;
  - PDF, which creates a PDF file with the output;
  - RTF, which creates an RTF (MS Word) file with the output.
- Syntax: there are two ODS commands, one to open a destination and one to close it.
- To open a destination, the command is

    ODS <destination> <options>;

  where <destination> is LISTING, HTML, PDF, or RTF, and the <options> are style options for the output; one of the options is to specify a FILE, e.g., FILE='<windows-file>'.
- The destination is closed when SAS terminates; if you want to close a destination before then and write the output to it (i.e., have the output available to read), you type an ODS CLOSE command:

    ODS <destination> CLOSE;

- Example: to redirect the output from the LISTING to a PDF file, c:\temp\ex1.pdf, you would type

    ODS LISTING CLOSE;               /* closes default LISTING dest.    */
    ODS PDF FILE='c:\temp\ex1.pdf';  /* opens PDF as an ODS destination */
    /* ... SAS statements that produce output here ... */
    ODS PDF CLOSE;                   /* closes PDF dest., writes output */
    ODS LISTING;                     /* reopens LISTING as destination  */

- For more information on the SAS Output Delivery System, see the ODS documentation at support.sas.com.

Multivariate Procedures in SAS

1. Introduction
- The goal of the data methods that we have been discussing is to prepare a data set (analysis file) that can be used in an empirical investigation.
- We have discussed some simple statistics and cross-tabulations that can be performed in SAS; these procedures indicate general associations in the data.
- These procedures do not account for indirect, and possibly confounding, influences from other observed variables; estimates of direct or partial associations that account for these indirect influences come from multivariate procedures, such as regression analyses, probit procedures, etc.
- SAS has many multivariate routines; some of these will be immediately recognizable, while many others are available but tougher to find:
  - SAS is written for
a broad set of users;
  - the descriptions that other scientists use, and that SAS uses, do not always line up with the descriptions that economists use.
- Should you actually use these routines? It can be convenient to use the routines available in SAS; however, economists often find the routines in packages like Stata to be more useful. You can't make a good choice without knowing at least some of what's available in SAS.

2. PROC REG (see the SAS/STAT documentation for PROC REG at support.sas.com)
- The SAS regression procedure, PROC REG, is the workhorse of multivariate procedures; it estimates OLS regressions.
- Basic syntax and operation:

    PROC REG <regression-options>;
      <model-label>: MODEL <dep_var> = <list-of-explanatory-variables> / <model-options>;

  where dep_var is a SAS variable with the dependent (or outcome) variable in the regression; the list of explanatory variables is a list, separated by spaces, of independent variables in the regression; and model-label is a label that will appear in all of the output associated with this model (it is useful for producing readable output).
- Unless asked to do otherwise, the REG procedure automatically includes an intercept term; to drop the intercept, use the NOINT model option.
- The REG procedure outputs estimated coefficients and coefficient standard errors, as well as test statistics and p-values under the null hypothesis that the true coefficient is zero, for the intercept and independent variables.
- The REG procedure also calculates the residual sum of squares, the R-squared and adjusted R-squared statistics, the mean square error, F test statistics, and other statistics.
- Estimating many models: PROC REG can be modified to estimate many models in the same procedure (i.e., without calling another regression procedure):
  - multiple MODEL statements can be included in the same procedure;
  - also, a list of dependent variables can be provided instead of a single dependent variable; all variables that appear before the equals sign in the MODEL statement are treated as dependent variables.
- Testing
and correcting for heteroskedasticity:
  - including the SPEC model option leads to a heteroskedasticity specification test (an artificial regression test) being performed;
  - including the ACOV model option leads to a heteroskedasticity-consistent variance-covariance matrix being calculated;
  - note that newer versions of SAS (starting with version 9.2) have more convenient and flexible options.
- Testing for first-order autocorrelation: the DWPROB model option will lead to a first-order Durbin-Watson test, with a corresponding p-value, for the null hypothesis of no autocorrelation; note that the DWPROB option assumes that the data are sorted in chronological order.
- Tests of model restrictions: restrictions on the coefficients from the last specified model can be tested using the TEST statement; the syntax is

    <test-label>: TEST <test-expression-1> <, test-expression-2, ..., test-expression-k>;

  The test expressions are mathematical expressions involving the explanatory variables (actually, their coefficients from the regression model):
  - if equals signs are included in an expression, TEST tests the actual specification;
  - if equals signs are omitted, TEST tests the hypothesis that all of the expressions are jointly equal to zero.
- Multiple TEST statements can be issued in the same REGression procedure.
- Examples:

    PROC REG;
      MODEL y = x1 x2 x3;
      TEST x2 = 0, x3 = 0;

  - This estimates a regression with y as the dependent variable and x1, x2, and x3 as the independent variables.
  - It then tests the null hypothesis that the coefficients on x2 and x3 are jointly equal to zero; note that TEST x2, x3; would have tested the same hypothesis.
  - TEST x2 - x3; and TEST x2 - x3 = 0; test the null hypothesis that the coefficients on x2 and x3 are equal.
  - TEST x2 - 1; and TEST x2 - 1 = 0; test the null hypothesis that the coefficient on x2 equals 1.
- Outputting predictions and residuals: a data set with predictions and residuals from the regression can be produced using the OUTPUT statement; the OUTPUT statement will produce these statistics for the last MODEL estimated, and multiple OUTPUT statements can be included.
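Putting these REG features together in one sketch; the data set and variable names (wage, educ, exper) are made up, so treat this as a template rather than a specific analysis:

```sas
/* Regression with a joint test, heteroskedasticity diagnostics, */
/* and output data sets; variable names are illustrative.        */
PROC REG DATA=workers OUTEST=betas;
  wagemod: MODEL wage = educ exper / ACOV SPEC; /* robust VCV + spec. test    */
  jointzero: TEST educ = 0, exper = 0;          /* H0: both coefficients zero */
  OUTPUT OUT=regout P=wage_hat R=wage_res;      /* predictions and residuals  */
RUN;
QUIT;
```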
- The OUTPUT statement must include at least one statistics keyword:
  - P=<prediction-variable> generates and stores predictions;
  - R=<residual-variable> generates and stores estimated residuals;
  - see the documentation for additional supported statistics.
- The OUTPUT data set will contain one observation for every observation read into the REGression procedure; it also will contain copies of all of the dependent and independent variables used in the procedure.
- Outputting coefficient estimates: including the option OUTEST=<SAS-data-set> in the PROC REG statement causes the procedure to create a SAS data set containing the coefficients for all of the estimated models:
  - the data set contains one observation per model estimated;
  - models are identified by a _MODEL_ variable and by the names of the dependent variables;
  - coefficients are stored in variables with the same names as the explanatory variables.

3. PROC LOGISTIC
- The primary procedure for running binary choice models is PROC LOGISTIC; as the name suggests, PROC LOGISTIC estimates logit models, but it also estimates probit and other types of models. (SAS has a PROC PROBIT that can be configured to estimate binary choice probit models, but its standard specification estimates another type of model.)
- Syntax:

    PROC LOGISTIC <options>;
      MODEL <response-specification> <(resp-variable-options)> = <list-of-independent-variables> / <model-options>;

- The response specification can be of two types:
  - a binary variable or an ordered categorical variable; or
  - <positive_outcomes>/<total_outcomes>, where the first term is a SAS variable with the number of positive outcomes in a grouped observation and the second term is the total number of outcomes (this form is used to estimate grouped binary data).
- Note that you can only specify one MODEL per LOGISTIC procedure; to estimate multiple models, you need to call the procedure again.
- A quirk in PROC LOGISTIC is that its default is to treat the lowest value in a response variable as the outcome of interest; thus, the default for a 0/1 binary variable is to model
the probability that the outcome is 0; that is the exact opposite of how most social scientists specify these models
o there are several ways to fix this
- one is to include the response variable option EVENT='1'
- another is to include the response variable option DESCENDING
o example: let y be a 0/1 binary variable, and let x1, x2, and x3 be independent variables; we could specify a model
  PROC LOGISTIC;
  MODEL y (DESCENDING) = x1 x2 x3;
o using the model options, you can also specify the type of distribution that you want to use for the model
- the default is the logistic distribution, producing a logit model
- to get probit estimates you would use the LINK=PROBIT model option
- from the example above
  PROC LOGISTIC;
  MODEL y (DESCENDING) = x1 x2 x3 / LINK=PROBIT;
- to estimate a complementary log-log binary model (useful in event-history analyses) you would use the LINK=CLOGLOG model option
o if you provide an ordered categorical response, the LOGISTIC procedure will estimate an ordered choice model
- again, the choices will be modeled focusing on the lowest value
- to estimate a model focusing on high values, use the DESCENDING response variable option
o outputting predictions
- a data set with predictions can be produced using the OUTPUT statement; two common statistics are
  - P=<prediction_variable> generates and stores predicted probabilities
  - XBETA=<latent_pred_variable> generates and stores the predicted latent variable
  - see the documentation for additional supported statistics
- the OUTPUT data set will contain
  - one observation for every observation read into the LOGISTIC procedure, for binary data
  - one observation for each possible response level and observation, for ordered categorical data
- it will also contain copies of all of the dependent and independent variables used in the procedure
o outputting coefficient estimates
- including the option OUTEST=<SAS_data_set> in the PROC LOGISTIC statement causes the procedure to create a SAS data set containing the coefficients for the estimated
model
- the data set contains one observation
- coefficients are stored in variables with the same names as the explanatory variables

4. PROC QLIM
o the QLIM procedure is a newer addition to SAS (part of SAS/ETS; see the SAS/ETS documentation for details); it estimates a variety of qualitative and limited dependent variable models, including logit, probit, tobit, and selectivity-corrected models
o binary-choice specifications
- the syntax for a logit model is
  PROC QLIM <options>;
  MODEL <dep_var> = <list_of_explanatory_variables> / DISCRETE(D=LOGIT) <other_model_options>;
- note that, unlike the LOGISTIC procedure, QLIM models the probability that the dependent variable equals 1
- a probit specification would be estimated using just the DISCRETE option; it is the default binary choice model in QLIM
- binary choice specifications with controls for heteroskedasticity can also be estimated using the HETERO statement
  HETERO <dep_var> ~ <list_of_het_explanatory_variables> / <h_options>;
  - the h_options include the type of LINK (LINEAR or EXPonential)
  - they also include whether a SQUARE of the linear combination of the explanatory variables will be used
  - they also include whether the constant should be dropped (NOCONST)
o tobit specification
- the syntax for a standard tobit (censored regression) model, censored from below at zero, is
  PROC QLIM <options>;
  MODEL <dep_var> = <list_of_explanatory_variables>;
  ENDOGENOUS <dep_var> ~ CENSORED(LB=0);
- with this specification QLIM will estimate a standard ML tobit
- more generally, the lower bound (LB=) in the CENSORED option can be specified as
  - a numerical constant
  - a variable
- upper bounds can also be specified by including a UB= in the CENSORED option
- for example, let y be censored from below at 0 and from above at 100,000, and let x1, x2, and x3 be independent variables; we could specify a model
  PROC QLIM;
  MODEL y = x1 x2 x3;
  ENDOGENOUS y ~ CENSORED(LB=0 UB=100000);
o in addition to these models, QLIM also supports estimation of selection models, bivariate probit models, Box-Cox models, and other models
o it is
possible to output predictions and coefficients for this procedure

An Introduction to Using SAS on PCs
Dave Ribar

1. SAS is a system of software packages; some of its basic functions and uses are
o database management: inputting, cleaning, and manipulating data
o statistical analysis: calculating simple statistics such as means, variances, and correlations; running standard routines such as regressions
o graphics: producing two- and three-dimensional plots and charts; producing maps
o statistical programming: SAS allows you to program your own statistical routines
2. execution
o after logging onto the PC, invoke SAS from a desktop shortcut, from the start menu, or from a class menu
o SAS will open five windows
- Program Editor: SAS commands are entered here; you can create and submit a new set of commands, or open, edit, and submit an existing file with commands
- Log: this window describes the commands that have been run; it also describes the resources that were used and the errors that were encountered while SAS processed your commands
- Output: this window lists the statistical results or other program output (if any) from the commands; output will only be generated if your program ran without errors
- Explorer: this window helps you work directly with SAS files
- Results: this window helps you work directly with different types of SAS output
o you can toggle between the different windows by clicking each window's minimization and resize icons (upper right corner of the window area), by clicking on the window itself, or by using the Window menu from the top menu bar
o operating sequence: typically, when you use SAS you will follow this sequence
- enter program commands, either by typing the commands in the Program Editor or by opening an existing program (.sas) file
- submit the commands, using either the submit icon on the tool bar or going to the menu bar and selecting Run / Submit
- review the Log window for errors
- review results in the Output window
o when the commands come from or are stored in a file, we refer to this
sequence as a batch sequence; you submit all of your commands in a batch instead of one at a time
o after the program has run, if you want to keep a permanent copy of the material in any of the windows, open (click on) the window, go to the menu bar, and select File / Save or File / Save As
- it's a good idea to keep copies of your programs; data processing for statistical analysis usually involves many commands, and these will need to be repeated if you want to recreate or modify the data set; saving a copy of the program keeps you from having to retype these commands each time you want to correct or modify your processing steps
- for final versions of analyses, it's also good practice to save results from the Log and Output windows for documentation and record-keeping purposes
o if you don't want to keep the material in a particular window, open the window and at the menu bar select Edit / Clear All
- after each program runs, SAS appends the relevant information to the Log and Output windows
- if you are debugging or developing a program, the Log and Output windows can get clogged with lots of results
o to leave the SAS system, go to the menu bar and select File / Exit
3. all SAS commands that are entered into the Program Editor share the same basic syntax
o note: in the write-up that follows, terms marked in brackets <> represent things that you need to enter (don't actually type the brackets)
o regular command
  <command>;
- note: commands can extend across multiple lines; the semicolon marks the end of the command
o non-executable comment
  * <comment>;   OR   /* <comment> */
o commands in SAS are typically grouped into procedures (PROCs), although some commands can be used anywhere in a program
4. commands that can be used anywhere in a SAS program
o FILENAME command
  FILENAME <fileref> '<Windows_file_name>';
- the FILENAME command assigns the internal identifier <fileref> to the specified Windows file; this command is used in conjunction with reading or writing non-SAS data files (i.e., text or ASCII files)
o LIBNAME command
  LIBNAME
<internal_name> '<Windows_directory>';
- SAS reads and writes SAS data files from libraries; the LIBNAME command assigns the internal identifier <internal_name> to the Windows directory where the SAS file is or will be stored
o OPTIONS command
  OPTIONS <list of options>;
- the OPTIONS command resets SAS options, such as the width of the program output (LINESIZE=), the maximum number of observations to be processed (OBS=), and whether SAS files should be compressed (COMPRESS=)
o TITLE command
  TITLE '<title_information>';
- the TITLE command causes <title_information> to be printed at the top of each page of output; this is a great way to document your output
- TITLE<n> changes the information for the nth title line
o RUN command
  RUN;
- when you submit a set of commands to be executed, SAS will execute all of the commands in every complete procedure; one way to end a procedure is to invoke a new one; another way is to put a RUN command at the end of the procedure
- placing a RUN command at the end of your program ensures that SAS will run all of the commands
5. DATA step
o this is a special type of procedure that is used to input, manipulate, and create data
o if data are input or read into a DATA step, the observations are processed sequentially; the DATA step effectively loops over the observations
o DATA statement
  DATA <sasdata>;
- the DATA statement is the first statement in a DATA step; it creates a SAS data file called <sasdata>
o SET statement
  SET <sasdata>;
- the SET statement inputs (reads) data from an existing SAS data file called <sasdata>
o INFILE statement
  INFILE <fileref>;
- the INFILE statement is used to begin reading an external (non-SAS) file <fileref>, where <fileref> has been defined using the FILENAME command; INFILE identifies the file to be read
- you can skip the FILENAME and <fileref> and simply type INFILE '<Windows_file_name>';
o INPUT statement
  INPUT <variable1> <variable2> ...;
- the INPUT statement reads the data from the external file specified by the INFILE command into the variables <variable1>, <variable2>,
etc.
- as written, this statement assumes that the INFILE data are all numeric, with at least one blank space between the variables; alternative formatting can be specified
- note: for non-SAS files you must use an INPUT statement and explicitly describe the data layout; for SAS files this is handled automatically by the SET command
o FILE statement
  FILE <fileref>;
- the FILE statement is used to begin writing to an external (non-SAS) file <fileref>, where <fileref> has been defined using the FILENAME command; FILE identifies the file to be written to
- as with INFILE, you can skip the FILENAME and <fileref> and simply type FILE '<Windows_file_name>';
o PUT statement
  PUT <variable1> <variable2> ...;
- the PUT statement writes the data from the variables <variable1>, <variable2>, etc. to the external file specified by the FILE statement; if no FILE statement is specified, the data are written to the SAS Log
o assignment statement
  <variable1> = <math_expression>;
- the assignment statement places the value from <math_expression> into <variable1>
- a math expression consists of a numeric or character constant, a variable, or a formula involving constants, variables, and mathematical operators
o IF/THEN/ELSE statements
  IF <condition1> THEN <command1>;
  ELSE IF <condition2> THEN <command2>;
- these statements are used to call commands conditionally; the logical condition usually tests a variable against another variable or a constant; commands are executed if the condition is true
- the ELSE clause is optional; if ELSE is used without IF <condition2> THEN, <command2> is executed anytime that <condition1> is not true
- logical operators: = < > <= >= ^= & (AND) | (OR)
- note: the symbols can be replaced by EQ LT GT LE GE NE AND OR; this sometimes improves the readability of programs
- if you want to execute multiple commands for the same condition, you need to use a DO/END construct, as follows
  IF <condition1> THEN DO;
    <command1>;
    <command2>;
    ...
    <commandn>;
  END;
o subsetting IF statement
  IF <condition>;
- this tells SAS to
keep only the observations that satisfy <condition>
- this statement is useful for cleaning data and removing out-of-range observations; it is included here because its syntax is similar to an IF/THEN statement, but it produces very different results
o KEEP and DROP statements
  KEEP <variable1> <variable2> ...;
  DROP <variable1> <variable2> ...;
- the KEEP statement keeps only the listed variables in the SAS data file that is created by the DATA step; the DROP statement drops the listed variables from the data file
- these statements are useful in paring down the width (number of variables) of a file
o the DATA step is terminated by a RUN statement, the start of another DATA step, or a procedure
6. some utility procedures
o CONTENTS procedure
  PROC CONTENTS DATA=<sasdata>;
- this procedure lists the variables in, and other characteristics of, the SAS data file <sasdata>; the procedure is useful for examining unfamiliar external SAS data sets and for checking the structure of data sets that you have created
o SORT procedure
  PROC SORT;
  BY <variable1>;
- this procedure sorts the preceding SAS data set in order of <variable1>; for large files this procedure can be very time-consuming
7. procedures to generate simple statistics
o MEANS procedure
  PROC MEANS;
- despite the name, this procedure actually calculates several statistics: means, standard deviations, minimum and maximum values, and counts of non-missing values
- unless you specify otherwise, PROC MEANS produces statistics for all of the variables in the preceding data set; if you want to produce statistics for just a subset of variables, add the following statement
  VAR <variable1> <variable2> ...;
o UNIVARIATE procedure
  PROC UNIVARIATE;
  VAR <variable1> <variable2> ...;
- this procedure generates a very detailed set of univariate statistics; the statistics fill up an entire page of output per variable; because of the large amount of output, you should use a VAR statement to limit the number of variables
o CORRelation procedure
  PROC CORR;
  VAR <variable1> <variable2> ...;
- the procedure
calculates correlation coefficients for pairs of variables from the preceding SAS data set; it also calculates simple statistics; again, because of the extensive amount of output, the use of a VAR statement is recommended
o FREQuency procedure
  PROC FREQ;
  TABLE <variable1>;
- this procedure reports the numbers of observations with different values of the variable listed in the TABLE statement; it is very valuable for checking the values of discrete variables
- the procedure can also be used to calculate cross-tabulations, using the following TABLE statement
  TABLE <variable1> * <variable2>;
8. running regressions: the REG procedure
o the syntax of the REG procedure is
  PROC REG;
  MODEL <dependent_variable> = <list_of_independent_variables>;
- the <dependent_variable> is the name of the SAS variable corresponding to the outcome of interest; the list of independent variables contains the names of the SAS variables corresponding to the explanatory variables
- if more than one explanatory variable is used, the REG procedure performs multiple regression
- unless asked to do otherwise, the REG procedure automatically includes an intercept term; including an intercept is the default, and SAS does not need to be told to include one
- the REG procedure outputs estimated coefficients, coefficient standard errors, and test statistics and p-values (under the null hypothesis that the true coefficient is zero) for the intercept and independent variables
- the REG procedure also calculates the residual sum of squares, the R2 and adjusted R2 statistics, the mean square error, F test statistics, and other statistics
9. additional reading and information
o SAS has online help and tutorials
o a variety of documentation exists, including Language References and User Guides for different SAS components

Data Cleaning
1. General approach to cleaning data
o the objective is to use the available data to form an analysis data set which
- has measures that correspond as closely as possible to the variables in our conceptual model,
hypothesis, or research question
- can support an analysis of the behaviors and relationships of the conceptual model or hypothesis
- is representative of the population of interest (or a population of interest)
- has measures with good properties (avoids outliers, mistaken reports, etc.)
o raw data from surveys are rarely in a form that permits immediate analysis
o the process of taking the data from its raw form to an analysis form is called cleaning the data
o in this class we will
- talk about some general data-checking procedures
- discuss some SAS commands that are helpful for cleaning and transforming data
- discuss some SAS commands for documenting measures
2. Some practical data cleaning tips
o first and most importantly, make sure that the data have been read in correctly
- a potential problem with column-formatted data: if the column specifications get mixed up, the data can be read incorrectly; data can also be only partially downloaded
- check that the number of observations matches the documentation; also check that the ranges of your variables match the values from the codebook
o you need to determine how the relevant sample was collected and who was a respondent
- some data sets are simple and only contain information on respondents
- other data sets have different respondents at different times or for different data items; for example, longitudinal surveys have people who are respondents in some waves but not others
- you need to find the variables that identify respondents, such as the "why nonresponse" variables in the PSID, and make sure that you can reproduce the original sample totals
o a related issue is that you need to familiarize yourself with any skip patterns or special universes for data that you may be interested in
- a skip pattern refers to questions in a survey that are only asked conditionally; people who don't meet the conditions skip the questions
- the universe for a data item refers to the types of study subjects that are actually asked a question or for whom the item is constructed
- you should check the conditioning
variables and verify that you can reproduce the universes for all data items
o top codes and other special values in the data
- many data sets reserve special values to represent missing data, refusals, don't knows, or not being in the data universe
- we usually remember to look for these in categorical data; however, they also sometimes appear in continuous data
- one particular type of special code is a top code, which can be used to indicate data values beyond a certain range (for instance, incomes or weekly hours above certain thresholds) or other conditions
- for example, some data sets record valid usual weekly work hours as being between 0 and 97, code usual hours in excess of 97 with a 98, and code variable hours (i.e., people who work but have irregular schedules) with a 99
- for privacy reasons, many government data sets recode incomes above a certain amount at either that particular threshold or at the mean of the censored incomes
o as much as is reasonable and possible, you want to check the consistency and reasonableness of the data items
- to do this, you need related or conceptually conditional variables; for example, you might compare people's reported hours of work and earnings with their reported employment status: are there nonworkers who are reporting positive hours or earnings?
- examine the distributions of variables for implausibly high or low values, such as wages that are substantially below the minimum wage or levels of education or experience that exceed people's ages
- more generally, look for outliers in the data
- in longitudinal data, look for consistency in distributions and definitions of variables over time
o examine the data for unusual patterns of missing responses
- for example, an examination of the employment status information in the 2005 PSID individual file revealed that there were a surprisingly high number of missing responses and almost no valid responses for heads and wives; a closer look through the documentation revealed that heads' and wives' information was stored
in another file and had not been cleaned
- unusual patterns may indicate a skip pattern that you've overlooked or some other misunderstanding of the data
o some analysis variables will need to be constructed
- you want to apply the same consistency and reasonableness checks to these data as to the raw data
- because these data are constructed from other underlying data, additional consistency checks are possible; for example, if you construct a categorical measure from another underlying measure, you could use PROC FREQ or PROC TABULATE to look at the conditional distributions and make sure that they match up with your intended coding
- the consistency checks for constructed variables are intended to catch programming mistakes, that is, to help debug your programs
- the more complicated the data construction procedures, the more thorough you need to be in your checking
- for especially complicated or tricky procedures, selected hand-checking may be necessary; in a hand check you would actually go line by line through the data and the construction steps, making sure that all of the programming and logic is sound
o compare your data to other data sets
3. Conditional processing
o as the preceding discussion suggests, a lot of the processing in data cleaning is conditional
o we have already gone over IF/THEN/ELSE statements; these are the workhorse statements of conditional processing
o an alternative for processing mutually exclusive conditions involving a single variable or expression is the SELECT sequence; the syntax for the statement is
  SELECT (<select-expression>);
    WHEN (<when-expression1> <, more when-expressions>) <statement>;
    <WHEN (<when-expressionn> <, more when-expressions>) <statement>;>
    <OTHERWISE <statement>;>
  END;
o the select expression can be any SAS expression that produces a single value, including character values
- the select-expression is optional; you don't have to include one
- if an expression is used, it must appear in parentheses
o similarly, the when expressions can be any SAS expressions, including constants,
that produce a single value
- when-expressions are not optional; at least one must be included
- these expressions also appear in parentheses
o if a select expression is provided, SAS compares the when and select expressions
o if a select expression is not provided, the when expressions must be logical expressions, and they are evaluated on their own
o the when expressions are evaluated sequentially; once a when expression is true and the corresponding statement is executed, SAS steps out of the SELECT sequence
o the OTHERWISE statement is optional and operates like an unconditional ELSE statement
o SAS expects a single statement to follow the WHEN and OTHERWISE conditions
- to execute multiple statements within a single condition, use the DO/END construct
- you can also include a null statement (just a semicolon)
o note: the conditions in the SELECT sequence must be exhaustive of the possible conditions; if SAS steps through the entire sequence without finding a true condition, it will issue an error message and stop the DATA step at that point
o a SELECT statement is easier to read and more efficient (faster) than a comparable IF/THEN/ELSE if you have many conditions
o examples: suppose that you have a categorical variable x that takes on values 1-5
- example with a select-expression
  SELECT (x);
    WHEN (1) y = x;
    WHEN (2,3,4) y = x*2;
    OTHERWISE;
  END;
- y will have values if x is less than 5 and will be missing if x = 5
- equivalent example without a select-expression
  SELECT;
    WHEN (x EQ 1) y = x;
    WHEN (x IN (2,3,4)) y = x*2;
    OTHERWISE;
  END;
o you might consider using a null statement with the OTHERWISE condition to trap for unexpected results from the expression comparisons and to keep SAS from stopping in the middle of your DATA step
o alternatively, you could write an error message to the Log, e.g.
  OTHERWISE PUT 'Unexpected SELECT at obs ' _N_;
o to debug conditional processes, it is important that you go through them and consider what will happen under all circumstances
4. SAS functions
o SAS has a number of functions that help with manipulating data
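As a bridge between conditional processing and functions, here is a small DATA step sketch that uses a SELECT sequence to recode the top-coded hours values discussed earlier and a nested function call in an assignment statement; the data set and variable names (clean1, clean2, hours, income) are hypothetical, not from the notes:

```sas
DATA clean2;
  SET clean1;                      /* hypothetical input data set            */
  /* recode special hours values (98 = hours > 97, 99 = variable hours)     */
  SELECT (hours);
    WHEN (98, 99) hours = .;       /* set special codes to missing           */
    OTHERWISE;                     /* null statement: leave valid hours as-is */
  END;
  loginc = LOG(MAX(income, 1));    /* functions can be nested in assignments */
RUN;
```

Because the OTHERWISE clause is a null statement, valid hours values (0-97) pass through unchanged and SAS never stops with an unmatched-condition error.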
o general use
- functions are most commonly used in assignment statements; however, they can also be used in most places where SAS evaluates expressions
- the general syntax for a function is
  function_name(argument_list)
- the arguments themselves can be SAS expressions, including functions; multiple arguments are delimited by commas
- some functions, such as the DATE function, don't use arguments; for these you would just include parentheses but nothing between them
- invoking a function will cause a value to be returned
o main categories of functions are listed below, along with some specific functions
o arithmetic, mathematical, trigonometric, and hyperbolic
  ABS(argument): returns the absolute value of the argument
  EXP(argument): returns the exponential of the argument
  LOG(argument): returns the natural logarithm of the argument
  LOG10(argument): returns the base-10 logarithm of the argument
  MAX(argument_list): returns the maximum value from the list
  MIN(argument_list): returns the minimum value from the list
  MOD(arg1, arg2): returns the remainder from arg1/arg2
  N(argument_list): returns the number of nonmissing values from the list
  NMISS(argument_list): returns the number of missing values from the list
  SQRT(argument): returns the square root of the argument
  SUM(argument_list): returns the sum of the values from the list
o character
  LEFT(argument): left-aligns a character expression
  RIGHT(argument): right-aligns a character expression
  SUBSTR(arg, pos, n): extracts n characters from arg starting at pos
o date and time
  DATE(): returns an integer representing today's date
  DAY(argument): returns the day of the month from a SAS date value
  MDY(mm, dd, yyyy): returns the SAS date value for mm/dd/yyyy
  MONTH(argument): returns the month (1-12) from a SAS date value
  YEAR(argument): returns the year from a SAS date value
o geographic
  FIPSTATE(argument): converts a FIPS code to a state postal abbreviation
  STFIPS(argument): converts a state postal abbreviation to a FIPS code
  ZIPCITY(argument): converts a ZIP code to a city name and postal abbreviation
o probability and quantile (inverse probability)
  PROBCHI(arg1, df): chi-square cumulative distribution function
  PROBF(arg1, ndf, ddf): F cumulative distribution function
  PROBNORM(argument): standard normal cumulative distribution function
o random number
  NORMAL(seed): returns a pseudo-random standard normal variable using seed
  RANEXP(seed): returns a pseudo-random exponential variable using seed
  RANUNI(seed): returns a pseudo-random uniform (0,1) variable using seed
o sample statistics
  CV(argument_list): calculates the coefficient of variation of the arguments
  MEAN(argument_list): calculates the mean of the arguments
  STD(argument_list): calculates the standard deviation of the arguments
o truncation
  CEIL(argument): rounds the argument up to the next integer
  FLOOR(argument): rounds the argument down to the next integer
  INT(argument): truncates the argument to an integer
  ROUND(argument): rounds the argument to the nearest integer
o for more information and the complete list of SAS functions, see the SAS online documentation
5. Documenting your data
o programs should be documented with comments
o output should be documented using the TITLE and FOOTNOTE commands
o additionally, variables themselves should be labeled and possibly formatted
o LABEL statement
- SAS restricts the length of a variable name to 32 characters and restricts the types of characters and symbols that can be used in a name; LABELs allow you to provide more detailed names
- the syntax for the LABEL statement is
  LABEL <variable1> = '<label1>'
        <variable2> = '<label2>'
        ...
        <variablen> = '<labeln>';
- LABEL statements can be used in DATA steps or in other procedures
  - if a LABEL statement is used in a DATA step, the LABEL stays with the variable when the SAS data set from the step is created
  - if a LABEL statement is used in a reporting procedure, such as GPLOT or TABULATE, the LABEL only remains with the variable for that procedure
- to remove a previously applied label, use the command
  LABEL <variable1> = ;
- it is sometimes necessary to remove or modify a label to get output to print reasonably, e.g., to get tables in a MEANS or
TABULATE procedure to fit on a single page

Arrays and DO Loops in SAS
1. ARRAYS
o in computer science, an ARRAY is a structure for holding multiple related items that allows the user to reference the items using matrix notation
o like matrices, ARRAYs can be unidimensional or multidimensional
o the array itself has a name; the elements of the array are referenced through subscripts; the number of subscripts corresponds to the number of dimensions
o example: suppose that we had survey information regarding a respondent's monthly participation in the Food Stamp Program over the preceding calendar year
- suppose the name of the array was fspart; the array would have 12 elements, a series of binary indicators corresponding to participation in each of the months of the year
- the structure for this unidimensional ARRAY could be depicted
  Month:     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  subscript:  1   2   3   4   5   6   7   8   9  10  11  12
- so the January element of the array would be fspart{1}, the February element would be fspart{2}, and so on
o example: suppose that we had similar survey information regarding monthly Food Stamp Participation, but that it covered two years instead of one
- the structure for this multidimensional ARRAY could be depicted with year subscripts (1 for two years ago, 2 for one year ago) in the first dimension and month subscripts 1-12 in the second
- the array has 24 elements, binary indicators for every month of two years; the element corresponding to participation in May two years ago would be fspart{1,5}; the element for May one year ago would be fspart{2,5}
- note that the user usually supplies the interpretation of the subscripts; as should be clear, there are many potential ways to arrange the data
2. ARRAYs in SAS
o SAS permits the construction and use of ARRAYs in DATA steps
o the first step in using an ARRAY in SAS is to declare one; this is done through an ARRAY statement(1)
(1) SAS allows explicit and implicit definitions of ARRAYs; the implicit definition is an older and less-used approach; this course will
only cover the explicit definition
o syntax
  ARRAY <arrayname> {<subscripts>} <$> <length> <array_elements> <(initial values)>;
- at a minimum, the ARRAY definition must contain the name of the ARRAY and either an indication of the subscripts or a description of the elements themselves
- the arrayname must be a valid SAS name, similar to a variable name
- the subscript or subscripts, which must appear in braces, are either a numerical constant, an asterisk, or a range of numbers
  - a numerical constant is the most common specification and would simply give the number of elements in the ARRAY
  - if multiple dimensions are desired, the numbers specifying each dimension are separated by commas
  - the asterisk is a wild-card character and indicates that the ARRAY has as many elements as are defined subsequently in the ARRAY statement; if the asterisk is used, elements must be defined AND the ARRAY can only have one dimension
  - ranges take the form lower:upper, where lower is the value of the first subscript in the range and upper is the value of the last; the number of elements would be upper - lower + 1; ranges are useful when the subscripts are tied to other values, such as years
- the array elements are usually SAS variables
  - the elements must all be of the same type
  - they may include existing SAS variables, in which case the ARRAY serves mainly as an alternative way to reference the variables
  - however, they can also be new variables
  - if you do not define the elements, or if there are fewer elements than are defined by the subscript, SAS creates new variables that have the arrayname as the prefix and the subscript number as the suffix
- other ways to define elements
  - _NUMERIC_ will define the elements as all of the previously defined numeric variables from the DATA step
  - _CHARACTER_ does the same thing with character variables
  - _ALL_ will define the elements as all of the previously defined variables from the DATA step; all of the previously defined variables would have to be of the same type
  - _TEMPORARY_ will define a list of temporary variables; these do not have names, are automatically RETAINed, and are not transferred to the SAS data set that is created by the DATA step
- you can fill in the initial values of the elements by listing the values, in parentheses and separated by blanks or commas, at the end of the ARRAY statement
- you can also specify whether the elements are character or numeric (by using the $) and specify the LENGTH; if these are not specified, the ARRAY statement will use the existing types of the variables
o some examples
- for the unidimensional Food Stamp Participation ARRAY that we had discussed earlier,
  ARRAY fspart {12};
  would create variables fspart1-fspart12
- suppose that we already had variables called fsp1-fsp12; we could also specify the ARRAY as
  ARRAY fspart {12} fsp1 fsp2 fsp3 fsp4 fsp5 fsp6 fsp7 fsp8 fsp9 fsp10 fsp11 fsp12;
  or, even more compactly, as
  ARRAY fspart {12} fsp1-fsp12;
- suppose that we wanted to initialize an ARRAY fspart1-fspart12 with zeroes; we would issue the command
  ARRAY fspart {12} (0 0 0 0 0 0 0 0 0 0 0 0);
- now consider the two-dimensional example
  - suppose that you had a calendar with 24 months of data, starting with January from two years ago (fsp1) and continuing to December from last year (fsp24)
  - you could specify a two-dimensional ARRAY as
    ARRAY fspart {2,12} fsp1-fsp24;
  - note that if you had issued the statement
    ARRAY fspart {2,12};
    SAS would have constructed variables fspart1-fspart24
o referencing ARRAY elements
- after an ARRAY is defined, you can reference an element by typing array_name{subscript}, where array_name is the name that was defined earlier by the ARRAY statement and subscript is a numerical constant or a numerical variable with the desired subscript index
- this specification works just like any other variable specification in SAS and can be used in the same ways that variables are (e.g., in assignment statements, in conditioning statements, etc.)
- in assignment statements, ARRAY elements can appear on the left or right side, or both sides; using our
previous unidimensional example the code 3 ARRAY fspart 12 fsp1fsp 12 X fspart7 creates a 12element unidimensional ARRAY of Food Stamp Participation indicators and assigns the July value to the SAS variable x this code does the same thing ARRAY fspart 12 fsp1fsp12 i 7 X fsparti o referencing all of the elements in an ARRAY as a list if you use the asterisk as the subscript reference SAS treats the ARRAY as a list of variables for example to INPUT the elements of fspart you could type ARRAY fspart 12 fsp1fsp12 INPUT fspart to determine whether someone had participated at all in the Food Stamp Program during the preceding year you could type anyfsprt MAXof fspart o DIM function the function DIMarray name returns the number of elements in array name 0 if the ARRAY is multidimensional DIM returns the number of elements in the first dimension 0 to obtain the number of elements in higher dimensions use DIMn where n is the dimension that you are interested in this function can be useful for writing general code that does not have to be updated if your ARRAY specifications change 0 outofrange references each time that you try to reference an element in an ARRAY SAS checks the subscripts against the definition for the ARRAY if the subscript is below 1 or the lower value if a range is specified or above the maximum possible subscript value SAS will issue a runtime error message that an array subscript is out of range and stop processing the DATA step 3 DO loops 0 the use of variables as subscript references is a big advantage of ARRAY processing variable references mean that you can pick out an element of an ARRAY relative to some characteristic of the current observation or relative to some condition that you are processing an additional tool that extends variable subscript references is the iterative DO loop the standard syntax for a DO loop is DO ltindex7variablegt ltstart7valuegt TO ltstop7valuegt ltBY ltincrement7valuegtgt SAS commands END the DO loop causes the SAS 
commands to be performed repeatedly.
- The DO loop first sets the index-variable to the start-value.
  - The index-variable must be a SAS variable.
  - The start-value can be either a numerical constant or a numerical variable.
- The DO loop then compares the index-variable to the stop-value, which itself can be a numerical constant or a numerical variable.
  - If the index-variable is not yet past the stop-value (greater than the value if positive increments are used, or less than the value if negative increments are used), SAS performs the statements in the loop.
  - If the index-variable is past the stop-value, SAS exits the DO loop by going to the next statement following the END.
- The DO loop is iterative.
  - After SAS performs the statements, it returns to the top of the loop (returns to the DO statement).
  - The index-variable is then changed by the amount of the increment-value, or by 1 if no increment-value is specified; note that the increment-value can be negative.
  - The loop continues until the index-variable is past the stop-value.
- DO loops are useful in lots of contexts, but they are especially useful for traversing (accessing all the elements of) ARRAYs.
- Example 3.1: suppose that we have a unidimensional ARRAY fspart that describes Food Stamp participation over the preceding year, and suppose that we want to count the number of months out of the year that someone participated. We could use the code
      ARRAY fspart{12} fsp1-fsp12;
      fsmnths = 0;
      DO month = 1 TO 12;
        fsmnths = fsmnths + fspart{month};
      END;
- Examples 3.2 and 3.3: consider the same ARRAY, but now assume that we want to measure the number of quarters out of the last year that someone was receiving Food Stamps. We could use code with nested DO loops (example 3.2):
      ARRAY fspart{12} fsp1-fsp12;
      fsqtrs = 0;
      month = 0;
      DO qtr = 1 TO 4;             /* loop over quarters */
        fsthisqtr = 0;
        DO qtrmonth = 1 TO 3;      /* loop over months in qtr */
          month = month + 1;
          fsthisqtr = MAX(fsthisqtr, fspart{month});
        END;                       /* end of qtrmonth loop */
        fsqtrs = fsqtrs + fsthisqtr;
      END;                         /* end of qtr loop */
  or code with incremental indexing (example 3.3):
      ARRAY fspart{12} fsp1-fsp12;
      fsqtrs = 0;
      DO month = 3 TO 12 BY 3;
        fsqtrs = fsqtrs + MAX(fspart{month}, fspart{month-1}, fspart{month-2});
      END;
- Example 3.4: consider a two-dimensional ARRAY, and assume that we want to measure the number of quarters out of the last two years that someone was receiving Food Stamps. We could use code with nested DO loops:
      ARRAY fspart{2,12};
      fsqtrs = 0;
      DO yearindx = 1 TO 2;        /* loop over years */
        DO month = 3 TO 12 BY 3;   /* loop over quarters */
          fsqtrs = fsqtrs + MAX(fspart{yearindx,month}, fspart{yearindx,month-1}, fspart{yearindx,month-2});
        END;                       /* end of month loop */
      END;                         /* end of year loop */

4. Other DO loops

- SAS supports two other types of DO loops.
- DO WHILE loop:
  - Syntax:
        DO WHILE (<logical-expression>);
          SAS commands
        END;
  - This loop executes the commands while the logical expression is true; if the logical expression is false when the loop is entered, SAS skips the commands. Effectively, the condition is checked at the top of the loop.
  - For example, we could redo example 3.1 from above with the following code:
        ARRAY fspart{12} fsp1-fsp12;
        fsmnths = 0;
        month = 1;
        DO WHILE (month LE 12);
          fsmnths = fsmnths + fspart{month};
          month = month + 1;
        END;
- A similar construct is the DO UNTIL loop.
  - Syntax:
        DO UNTIL (<logical-expression>);
          SAS commands
        END;
  - This loop executes the commands until the logical expression is true; if the logical expression is false when the loop is entered, SAS executes the commands. Effectively, the condition is checked at the bottom of the loop; this means that DO UNTIL loops are always executed at least once.
  - For example, we could redo the previous example with the following code:
        ARRAY fspart{12} fsp1-fsp12;
        fsmnths = 0;
        month = 1;
        DO UNTIL (month GT 12);
          fsmnths = fsmnths + fspart{month};
          month = month + 1;
        END;
- WHILE and UNTIL conditions can also be added to incremental DO loops. The syntax is
      DO <index-variable> = <start-value> TO <stop-value> <BY <increment-value>>
         <WHILE (<logical-expression>)> <UNTIL (<logical-expression>)>;
        SAS commands
      END;
  The DO loop would check
  - the iterative condition,
  - the WHILE condition, if it is included, and
  - the UNTIL condition, if it is included.
- WARNING: loops can be
both tricky and dangerous to program.
  - You need to consider the conditions very carefully; as the foregoing discussion indicates, the distinctions between different types of loops can be very subtle.
  - Just as importantly, you need to verify that the conditions that terminate a loop will be met at some point; if the conditions are not met, you could create an endless loop.
  - Consider the following code:
        ARRAY fspart{12} fsp1-fsp12;
        fsmnths = 0;
        month = 1;
        DO WHILE (month LE 12);
          fsmnths = fsmnths + fspart{month};
        END;
    We've made a mistake: the variable month is never incremented. Because of this, the WHILE condition always remains true. SAS will continue to process this statement until you either
    - issue a break (the red exclamation point in the black circle in the SAS tool bar across the top of the SAS window), or
    - shut down SAS.

Sources of Data in Economics

- General description: economic data refer to pieces of information, or collections of information, that describe different aspects of economic processes.
  - Economic outcomes: the end results of economic processes, usually examined as endogenous variables (variables determined by, or inside, the model). Examples include employment levels, price levels, quantities demanded, and quantities supplied.
  - Economic constraints: recall that we describe economic decision-making in the context of making the most out of the resources available, i.e., maximizing subject to constraints.
    - Budget constraint information includes prices, incomes, wage rates, and interest rates.
    - Institutional constraint information includes laws, regulations, practices, and tax systems.
    - Technological constraints include production processes; there are also time constraints and information constraints.
    - Constraints can be exogenous (determined outside, or imposed on, the model); however, they can also be endogenous (e.g., price levels are a constraint on individual behavior but also a market outcome).
  - Economic objectives: this information describes what is being maximized or how a decision is made.
    - Individual objectives include maximizing preferences or wealth; firm objectives are to maximize profits; government objectives might be to maximize social welfare, stabilize prices, or maximize revenues.
    - Objectives are typically assumed to be fixed or exogenous.
    - Sometimes the objectives are not clear and thus become the focus of research (e.g., what are the goals of a nonprofit organization like a university or a symphony?).
  - Demographic information: information about births, deaths, family formation, population levels, age distributions, racial composition, etc.
    - Sometimes examined as economic outcomes; sometimes considered as indicators for preferences (e.g., cultural or age differences) or constraints (e.g., potential labor supply, marriage opportunities).
    - For these and other non-economic data, we must be careful to specify how they fit into an economic model or contribute to an economic interpretation.
- The remainder of the discussion focuses on other general categorizations of data.
- Are the data measured at the micro or macro level? That is, do the data refer to individual decision-makers or to an aggregation of decision-makers?
  - Macro data are easier to work with: macro data have fewer observations and can typically be entered into a spreadsheet, while with micro data you often have to work directly with a large survey or administrative source.
  - However, it is harder to answer certain questions with macro data (e.g., because of aggregation bias in demand and supply functions).
- Are the data quantitative or qualitative?
  - Quantitative data refer to measures of outcomes that can either be counted or mapped to the set of real numbers.
    - This characterizes most items that we consider to be pure economic data, such as quantities transacted, prices charged, wages paid, employees hired, etc.
    - Easily amenable to statistical analysis, that is, to using data from a sample (numerous cases) to make inferences about the world as a whole.
      - The focus is on testing numerical hypotheses.
      - You get more cases here, but have relatively little detail per case and little understanding of the individual cases.
    - Relatively easy to collect (you just need to record a number).
    - Usually also collected in large quantities.
  - Qualitative data refer to other measures and information.
    - Sometimes the measures can be categorized (mapped into a small number of classes).
      - Examples include racial and ethnic measures or location measures.
      - These measures can be treated using quantitative methods.
    - Other measures are harder to categorize.
      - Examples include interview responses, writing passages, and open-ended answers to questions.
      - Economists seldom work with these data.
    - One type of qualitative study that economists sometimes employ is the case study.
      - Case studies use data from one or a few cases and examine these data in great detail; they substitute depth for breadth.
      - These data can come from just about anywhere and cover just about anything: companies, people, governments, countries, crises, etc.
      - The goals are to get a complete picture of the case and to get data that address the predictions of the model.
      - Case studies are often used when situations aren't well understood or models aren't available; here the process works backwards: the data are collected, and then theories are developed to explain patterns in the data.
    - Other types of qualitative data that are sometimes collected or used by economists are
      - semi-structured interviews,
      - open-ended questions, and
      - focus groups.
    - An important strain of qualitative research outside of economics is ethnographic research.
- Are the data experimental or observational?
  - Many statistical techniques assume that data are produced experimentally.
    - In an experimental design, the experimenter assigns the treatment to each case; treatments are beyond the control of the subjects; and the experiment can be repeated.
    - This does not characterize a lot of economic data: we don't assign a depression or inflation "treatment" to some countries or regions, and we don't assign schooling levels to individuals. In some ways, both of these examples involve subjects selecting their own treatments.
    - We describe these data as observational; statistical analysis of observational data can be more complicated.
    - This is not to say that there are no genuine economic experiments; it's just that most economic data are generated non-experimentally.
  - Examples of experiments:
    - Small-scale laboratory experiments are used to examine a variety of situations, such as how people interact in simple games; these games are often very artificial and abstract.
    - Program experiments make a new program available to some people (but not to everyone) and examine how outcomes differ across those who do and do not get into the program; they have been used to examine welfare reform, job training, and adolescent interventions.
    - Social experiments: these are larger and more expensive, and hence are not conducted often; the best known are the income-maintenance experiments from the 1960s and 1970s.
- Time series, cross-section, or panel?
  - These descriptions are usually applied to quantitative data.
  - Time series: one individual or observational unit followed across many time periods.
    - Examples: quarterly GDP, annual unemployment.
    - Disadvantages: typically small sample sizes; data may be highly aggregated; the data-generating process may change over time; observations may not be entirely independent from one time period to the next.
  - Cross-section: many individuals measured for one time period.
    - Examples: political polls, consumer confidence surveys.
    - Disadvantages: complicated to use; need to learn how to cope with nonresponse and invalid responses; may not have good measures of key economic variables like prices or local wage levels.
  - Repeated cross-section: a series of cross-section samples across several time periods, using different individuals in each period.
    - Example: the Current Population Survey.
    - Disadvantages: can be more complicated than simple cross-sections; may limit the number of variables that are collected.
  - Longitudinal or panel data: follows a single cross-section of individuals over time.
    - Examples: the Panel Study of Income Dynamics, Compustat.
    - Disadvantages: difficult and costly to track individuals over time; the data sets are more complicated than other cross-section data sets; unavoidable random
and nonrandom attrition.
  - We also need to distinguish between data that are collected prospectively (as events happen) and retrospectively (recalled).
- Are the measures direct or indirect?
  - Again, this is a classification that is usually applied to quantitative data.
  - Direct measures come directly from the data.
    - For example, we could look at the average income level for a county using the sample average from the Decennial Census; similarly, we could look at employment levels by aggregating the reports from the Current Population Survey.
    - Direct measures are relatively straightforward to use and interpret. The main questions with data quality are:
      - Is the sample representative?
      - Do we understand how the question was answered, or what the measure captured?
      - Were the responses accurate?
  - We use indirect measures (measures built up from other measures) when direct measures are not available or are of very low quality.
    - We cannot obtain direct measures if the necessary specific data were not collected.
    - Sometimes only a few observations are available, so direct measures have intolerably large sampling variances.
    - Common examples of indirect techniques are
      - interpolation: making an educated guess about a value when data on either side of the value are available (e.g., using data from 1998 and 2000 to estimate a value for 1999); and
      - extrapolation: making an educated guess about a value when data are available only on one side (e.g., using data from 1998 and 1999 to make an estimate for 2000).
    - More sophisticated indirect measures are also possible.
- What is the source of reporting? Do the data come from self-reports, proxy reports, or administrative sources?
  - We use surveys and interviews to obtain self-reports and proxy reports.
    - In a proxy report, someone reports for another person; examples are one person describing outcomes for other family or household members, an employer categorizing his or her employees, or a person describing neighbors.
    - The key advantage of self-reported data is that the instruments can be very flexible; you can ask people just about anything.
    - The disadvantages are that
      - people might err in their answers (e.g., mistakes and recall errors), and
      - people might choose not to provide information (item and unit nonresponse).
  - Alternatively, economists sometimes turn to administrative records, such as tax returns, employment records, and program rolls, to obtain data.
    - The quality of the data is usually higher.
      - Exceptions are records where people have an incentive to misreport (e.g., tax records).
      - Also, sometimes the administrators do not care about particular items and either record them haphazardly or do not keep them up to date.
    - Data items are limited to those of interest to the particular agency; for example, demographic data are often missing.
    - The universe is limited to people who participate in the program or fill out a form.
      - Administrative records on welfare participants only cover the participants, not others who are eligible but don't participate or those who leave the rolls.
      - Certain types of tax information are limited to people who itemize.
    - The data are much more sensitive.
      - People can choose whether or not to complete a survey; they might not have a choice about providing other types of information to the government.
      - Researchers need to be much more careful to preserve confidentiality.
- Are the data primary or secondary?
  - Data we collect ourselves are primary, while data others collect are secondary.
  - Flexibility: for primary data, we can collect what we want; for secondary data, we are limited to what is available.
  - Cost: it takes a lot of time and effort to collect primary data.
    - Obviously, you have to field the instrument and record/code the responses.
    - Prior to that, however, you have to design the instrument and create the sampling frame.
    - Using someone else's sample is much less time-consuming, although there is still a need to understand the data and possibly clean the data.

Missing Data

(material drawn from Little & Schenker, 1995)

A. Introduction

1. Social scientists commonly encounter problems of missing, incomplete, and noncomparable data.
2. Several specific problems:
   a. Unit nonresponse: the subject fails to participate
in an interview.
      i. The subject may be unavailable.
      ii. The subject may refuse to participate.
      iii. In this case, no interview data are available.
      iv. Some other data may be available, such as information from the sampling frame or information from other interviews.
   b. Item nonresponse: the subject fails to provide a usable answer to a particular survey question.
      i. The subject may not know the answer.
      ii. The subject may refuse.
      iii. The question may have been skipped by the interviewer.
      iv. Some questions depend on the interview mode; sensitive questions might be asked in person or through a mail survey, but not over the phone.
      v. Responses may be partly or completely masked to prevent inadvertent disclosure (e.g., MSA status is purposefully missing for some respondents in the Current Population Survey).
      vi. In this case, we have other information (responses by the subject from the interview itself).
   c. Noncomparable data in repeated surveys or panels: questions, response categories, and definitions can change over time.
      i. Occupational and industrial codes change every decade or so.
      ii. Questions are revised.
      iii. Response categories are expanded or collapsed.
3. These problems, in turn, lead to analytical concerns.
   a. The subjects with missing observations or responses may be systematically different from subjects with full information, leading to a form of selectivity bias.
   b. Even if the data are randomly missing, there is still a loss of information.
   c. Problems of missing data may interact with other survey design issues.

B. First steps: after inputting a data set, we first check the extent of missing data and data quality problems.

1. We first look for unit nonresponse, if indicators are available in the data.
   a. For example, the PSID has "why nonresponse" variables that indicate why a person or household did not participate in a particular interview wave.
   b. Nonresponse indicators often appear in longitudinal and other complex data sets, where subjects (or parts of observational units) may participate in one portion of the study but not in others (e.g., attrition,
responses from some family members but not others).
2. We next look at item nonresponse and the consistency of responses.
   a. Examine how many subjects have usable responses for the items of interest.
   b. Examine how many subjects are missing different combinations of items.
3. We also look for hidden data problems.
   a. Many surveys take steps to "fix" data problems; we will discuss these below.
   b. We want to know which data may have been affected or altered through this process.
   c. Problems are indicated by allocation flags.
4. The potential for bias usually increases with the extent of missing data.
   a. If only a small portion of the data for an analysis population are missing (say, ten percent or less), the biases and other analytical problems are likely to be negligible.
   b. Note that the operative term here is "analysis population."
      i. The overall extent of missing data might be small,
      ii. but the problems may be concentrated in a particular analysis subgroup.

C. More formal discussion of patterns of missing data

1. Descriptions of data:
   a. Let X be an N x V data matrix with elements x_ij, where i indexes the respondent (observational unit) and j indexes the type of data item.
   b. Let M be a similarly dimensioned matrix containing binary indicators m_ij that equal one if x_ij is missing.
2. Monotone and non-monotone patterns of missing data:
   a. Consider the columns of X; denote these X_1, ..., X_V.
   b. Suppose that X_V is the only column with missing data; we would call this a univariate pattern.
   c. Suppose now that several columns, X_{V-m} through X_V, have missing data.
      i. Arrange the columns from fewest missing elements (X_{V-m}) to most (X_V).
      ii. Consider all adjacent columns X_{V-j} and X_{V-j+1}, where 1 ≤ j ≤ m.
      iii. If every nonmissing element in X_{V-j+1} has a corresponding nonmissing element in X_{V-j}, the missing data follow a monotone pattern: X_1, ..., X_{V-m-1} are fully observed, and X_{V-m}, ..., X_V are progressively more incomplete.
   d. If there are multiple columns of missing data that do not follow the above pattern, the data are non-monotone.
3. We also want to consider whether X and M are related.
4. Let p(M | X, θ) be the conditional distribution of M given X and a set of parameters θ.
5. Data are said to be missing completely at random (MCAR) if p(M | X, θ) = p(M | θ) for all X and θ.
6. We can assess these relationships by examining distributions of the available X for people with and without missing data.
7. Suppose that differences exist, or that they cannot be adequately examined.
   a. The next consideration is whether differences between respondents with and without missing data can be accounted for using the available (observed) variables. Divide X into two components:
      i. X_obs, the observed portion, and
      ii. X_mis, the portion with missing data.
   b. Data are said to be missing at random (MAR) if p(M | X, θ) = p(M | X_obs, θ) for all X_mis and θ.
   c. Example:
      i. Suppose that X consists of two variables, education and age, and that education is missing for some individuals but age is available for everyone.
      ii. Education is MAR if the pattern of missing responses depends only on age.
      iii. Education is not MAR if the pattern of missing responses depends at least partly on education itself.

D. Naïve approaches for addressing missing data

1. Suppose that data are missing; what do we do next? Economists typically apply the following naïve approaches (listed by frequency of use).
2. Complete-case analysis:
   a. In a complete-case analysis, we discard observations that are missing any of the items that we are interested in for our general analysis.
   b. We have used this approach so far in our homework and example assignments.
   c. Strengths:
      i. The approach is very easy to implement; all we need to do is check the data for missing observations.
      ii. Analyses that adopt this approach are valid (unbiased) if the data are MCAR.
   d. Weaknesses:
      i. The approach may drop usable data and therefore be inefficient.
      ii. Analyses are biased if the data are not MCAR.
      iii. The bias depends on the strength of the association between M and X and on the extent of missing data (standard selectivity results).
3. Available-case analysis:
   a. Use the largest sets of available information for estimating individual parameters.
   b. For example, in a 2SLS analysis:
      i. We might be missing data on the dependent variable in the second-stage equation.
      ii. We could run the first-stage equation without accounting for the patterns of missing data for this variable.
      iii. The first-stage parameters can all be estimated without this variable.
      iv. We would then run the second-stage equation on the restricted sample.
   c. A more general example is a method-of-moments estimator.
      i. We could estimate each moment (means, variances, covariances) using the data available for that particular moment.
      ii. We would combine these moments in our final calculation of estimates.
   d. The available-case approach makes more use of the existing data.
   e. However, there are potential problems.
      i. Estimates may be biased if the data are not MCAR.
      ii. There may be other computational problems, such as estimated variance-covariance matrices not being positive definite.
      iii. Standard error calculations are usually very specialized and difficult.
4. Unconditional mean imputation:
   a. In this approach, we replace the missing values of X with the unconditional means from the nonmissing values.
   b. This sometimes produces reasonable results for simple descriptive statistics.
   c. However, it does not respect the associations between the different elements of X, which can lead to problems in multivariate analyses.
   d. A modification for multivariate analyses is to use unconditional mean imputation but to also include the elements of M as explanatory variables.

E. Reweighting

1. Let's reconsider the complete-case approach.
   a. A problem with this approach is that it may be biased if the data are not MCAR.
   b. The reweighting approach uses complete cases but applies weights to adjust the sample to reflect the distribution of the analysis population.
   c. Reweighting is especially attractive for cases of unit nonresponse.
2. Basic approach:
   a. Form cells based on the X variables that are available for all respondents (i.e., that don't suffer from any item nonresponse problems).
      i. Note that the list of available variables may be thin.
      ii. Variables from the sampling frame may be used.
   b. Calculate the proportion of respondents with complete cases; this is an estimate of the probability of response.
   c. Use the inverse of the response rate as a weight.
   d. Instead of a cell-based approach, one could estimate a binary choice model of response and use the predictions as the response probabilities.
3. Advantages:
   a. The approach is relatively easy to implement.
   b. It potentially reduces bias.
   c. It does not require a model of the joint distribution of the data.
4. Disadvantages:
   a. It does not address biases from variables that do not appear in the weight calculations.
   b. Estimates can have very high variance.
   c. Some weights can be unreasonably high, especially if the data were weighted to begin with; informal trimming procedures are sometimes needed.
   d. It can be difficult to calculate some other statistics.

F. Overview of more formal imputation procedures

1. Many survey data sets, especially survey data sets prepared by the government, use imputation procedures when responses are missing; these are also sometimes referred to as allocation procedures.
   a. Imputation refers to assigning a response when none is given.
   b. The alternative is to simply identify a response as missing, i.e., to leave the solution to the user.
   c. How do you use the available information to impute a response?
2. Several advantages of imputation:
   a. It recreates the original rectangular data set.
   b. When carried out by the researchers who created the data, imputation can use their special knowledge of the data; it may also use variables that are masked from public-release versions.
   c. Another advantage, when carried out by the original researchers, is that it produces a data set that is consistent for different users.
3. Another use for imputation is to assign values for variables that were not originally included in a survey, i.e., to add variables to the data set.
   a. Example: the Census Bureau's experimental poverty measures.
      i. The primary data set is the Current Population Survey.
      ii. Key variables for the new poverty measures, such as work and child care expenses, are not recorded in the CPS.
      iii. Use
imputation to add these measures to the CPS.
   b. Example: adding price and local employment information into cross-section data sets.
4. Basic principles:
   a. Imputations should be based on predictive distributions of the missing values given the observed values, preferably using as much of the observed data as possible.
   b. Imputations should also preserve as many parts of the distributions of the missing data as possible.
5. Conditional mean imputation:
   a. Replace the missing values of X with the conditional means from the nonmissing values.
      i. The means would be conditioned on a small number of values.
      ii. Example: in the Three-City Study, missing data on earnings are conditioned on work status; the observed mean for working respondents is used for workers with missing earnings.
   b. More generally, divide the data set into cells based on the other observable variables, and then use the means of the observed data within those cells to impute.
   c. This procedure is nearly as easy to use as unconditional mean imputation, but it uses some of the predictive power from the available data.
   d. Disadvantages:
      i. It uses only a small slice of the available data.
      ii. It does not preserve other parts of the distribution.
6. Simple model-based approaches (explicit models):
   a. Description:
      i. Estimate a regression or other model using the observed data; obtain coefficients.
      ii. Combine the coefficients with the available information on the other variables to predict values of the missing variable.
   b. Critique:
      i. Straightforward to implement; it is also easy to determine the statistical properties of the imputations.
      ii. May be sensitive to specification issues, especially if the outcome variable is discrete or limited.
      iii. Simple predictions do not capture the full distribution; they only have the variation associated with the predictors.
   c. Stochastic regression:
      i. Follow the same procedure as above, but also include a pseudo-random error with the same variance as the standard error in the regression.
      ii. The variance of the imputations will follow that of the original data more closely.
      iii. You want to be careful to provide specific seeds to the pseudo-random functions so that results can be exactly reproduced.
7. Random categorical matching (implicit model):
   a. Description:
      i. Randomly draw an outcome from observations that share characteristics with the problem observation.
      ii. Use this draw as an imputation.
   b. Critique:
      i. Usually preserves the characteristics of the marginal distribution (e.g., the unconditional mean and variance).
      ii. Can be viewed as a nonparametric procedure with weaker assumptions than the explicit modeling approach.
      iii. It is harder to capture conditional distributions; the imputations may align well for the matching variables but not for other variables.
      iv. Which characteristics should be used to form the match, and how should they be used?
   c. Example: "hot deck" procedures in the CPS and other Census Bureau data sets.
      i. The procedure tries to group observations into cells using many criteria.
      ii. If no matches are found, the criteria are broadened.
      iii. Lillard, Smith & Welch (1986) examined hot-deck procedures in CPS data for wages.
      iv. More recently, Bollinger & Hirsch (2006) found that CPS imputed earnings data did a poor job in analyses involving variables, like union status, that were not included in the imputation.
8. Approximate matching:
   a. A problem with the exact matching approach is that you may have cells with no matches; this can happen if
      i. X is continuous, or
      ii. X has many dimensions.
   b. We may instead have to form "close" matches.
   c. Categorical matching:
      i. We could construct categories based on the X variables.
      ii. It can be hard or impossible to find exact matches.
      iii. Somewhat arbitrary, but may be important and useful if there is some strong conceptual or empirical basis for stratifying.
      iv. Still limited.
   d. Simple distance metrics:
      i. If there is only one X variable (or one continuous variable after stratifying), we could look at a simple distance metric.
      ii. The problem is harder if there are multiple variables.
      iii. For multiple variables, we can use the Mahalanobis metric:

          d = (X_A − X_B)' Cov(X_B)^{-1} (X_A − X_B)

9. Multiple imputation:
   a. A
problem with all of the imputation procedures listed above is that they understate the uncertainty associated with a given imputation.
   b. The idea behind multiple imputation is to impute several values for each missing item and then combine the results across the imputations.
   c. A procedure for multiple imputation (PROC MI) is available in SAS.

References

Bollinger, Christopher, and Barry T. Hirsch. "Match Bias from Earnings Imputation in the Current Population Survey: The Case of Imperfect Matching." Journal of Labor Economics 24:3 (July 2006), 483-519.

Lillard, Lee, James P. Smith, and Finis Welch. "What Do We Really Know About Wages? The Importance of Nonreporting and Census Imputation." Journal of Political Economy 94:3, part 1 (June 1986), 489-506.

Little, Roderick J., and Nathaniel Schenker. "Missing Data." In G. Arminger, C. Clogg, and M. Sobel, eds., Handbook of Statistical Modeling for the Social and Behavioral Sciences. New York: Plenum, 1995.

Schafer, Joseph L., and John W. Graham. "Missing Data: Our View of the State of the Art." Psychological Methods 7:2 (June 2002), 147-177.

By Statements and Nonrectangular File Structures in SAS

1. Processing nonrectangular files in SAS

- Recall the standard execution phase in a SAS DATA step: execution proceeds as an implicit loop over the observations being read, with each observation being processed as follows:
  - processing starts from the DATA statement;
  - a record is read;
  - additional statements in the DATA step are executed on that observation;
  - at the end of the DATA step, the observation is output, processing returns to the top of the step, all information in the program data vector is reset, and the iteration counter _N_ is updated;
  - flow continues until the end of the input file is reached.
- This execution sequence is especially well-suited for processing rectangular files, i.e., files with the same record layout for each observation and with no relationships among the observations.
- DATA steps and other SAS procedures can be written to accommodate nonrectangular files; a key tool in processing these alternative data structures is BY-group
processing

this class will cover
o the general use of the BY statement in DATA steps and other procedures
o alternative data structures
o general processing techniques for these structures
o methods for sorting and merging files
o methods for traversing other file structures

2 BY-group processing (see the Base SAS documentation on BY-group processing)

refers to techniques in SAS for working with data that are ordered or grouped in terms of the values of one or more variables

requires a BY statement

BY <DESCENDING> <ordering_var1> <<DESCENDING> <ordering_var2> ... <<DESCENDING> <ordering_var_n>>> <by_options>;

where ordering_var1 is the first variable that is used to order or group the data, ordering_var2 is the second variable that is used to order or group the data, etc.

for multiple variables, the ordering or grouping is assumed to be nested; for instance, in a data set with two ordering variables, SAS expects that observations are ordered or grouped by ordering_var2 within groups of ordering_var1

the first step in any BY-group processing is to order or group the data; this is generally done through the SORT procedure, though there are other ways that a data set may be ordered

BY variables, values, and groups
o BY variables are the ordering variables listed in the BY statement
o a BY value is the value of an ordering variable within a BY group
o a BY group is the set of observations defined by a particular BY value or combination of BY values

BY-group processing treats all of the observations within a BY group as a group
o the actual processing varies depending on the procedure or the context

DATA step processing
o a BY statement causes a DATA step to look for changes in the values of the BY variables
o when SAS encounters the first instance of a BY value, it sets a temporary FIRST. logical (binary 0/1) indicator to 1
o when SAS encounters the last instance of a BY value, it sets a temporary LAST. logical indicator to 1
o because there can be multiple BY variables, there need to be multiple FIRST. and LAST.
indicators; for a given BY variable, the corresponding indicators would be FIRST.variable and LAST.variable

consider the following sequence from a data set that is sorted according to BY variables i and j and that contains an additional variable x

               initial variables      variables created by BY processing
  _N_    i    j      x       FIRST.i  FIRST.j  LAST.i  LAST.j
   1   100    2   25002         1        1       0       0
   2   100    2   25009         0        0       0       1
   3   100    5   25837         0        1       0       0
   4   100    5   26387         0        0       0       0
   5   100    5   26287         0        0       0       0
   6   100    5   26053         0        0       1       1
   7   200    2   26135         1        1       0       0
   8   200    2   25915         0        0       0       1
   9   200    3   26248         0        1       0       0
  10   200    3   26730         0        0       0       1
  11   200    4   25880         0        1       0       0
  12   200    4   25911         0        0       0       1
  13   200    5   26184         0        1       1       1
  14   300    1   26282         1        1       1       1

processing can then be done based on the FIRST. and LAST. variables

match merging
o when a BY statement and MERGE statement are used together in the same DATA step, the BY statement causes the MERGE to perform match merging
o match merging combines observations from two data sets that share the same values of the BY variables
o SAS supports many-to-one and one-to-many match merging; in other words, one of the merged data sets must contain only one observation per BY-value combination
o when a SAS data set is created by match merging, the resulting data set is automatically sorted according to the BY statement

statistical and reporting procedure processing
o a BY statement in a statistical or reporting procedure causes the procedure to calculate separate statistics or list separate output for each BY group
o if an OUTPUT statement is used, only one data set is created; however, this data set
  - will contain the BY variables and
  - will have separate observations for each BY group
o the processing for several procedures for BY and CLASS groups is similar

SORT procedure processing
o a BY statement in a SORT procedure causes the data set to be arranged in order of the BY variables
o the syntax for the SORT procedure is

PROC SORT <options>;
  BY <DESCENDING> <ordering_var1> <<DESCENDING> <ordering_var2> ... <<DESCENDING> <ordering_var_n>>>;

o a BY statement must accompany the SORT procedure
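As a sketch of the sequence described above — sort first, then process BY groups with the FIRST. and LAST. indicators; the data set and variable names here are hypothetical:

```sas
/* Sort a hypothetical data set "raw" by i, then count the           */
/* observations in each BY group using the FIRST./LAST. indicators.  */
PROC SORT DATA=raw OUT=srt;
  BY i;
RUN;

DATA counts;
  SET srt;
  BY i;
  IF FIRST.i THEN nobs = 0;   /* reset counter at the start of a group */
  nobs + 1;                   /* sum statement: increments and retains */
  IF LAST.i;                  /* keep one record per BY group          */
RUN;
```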
if the DESCENDING option is used in front of an ordering variable, the data set is sorted in DESCENDING (reverse) order of that variable
o the default is to sort the data in ascending order
o DESCENDING must be used in front of every BY variable that is to be sorted in DESCENDING order

the SORT procedure can be very time- and resource-intensive
o the procedure reorders the data set
o it also completely recopies the data set
o you want to program carefully to use as few SORTs as possible

3 Database structures

database structures refer to the organization of a database's observations and records

in a rectangular or flat structure, all of the observations have the same record layout and refer to the same types of units
o suppose that we had a data set in which each observation had a unique identifier obs_id and accompanying variables x, y, and z; a diagram of the record structure would be

  obs_id | x | y | z

o as you can see, the structure for a single observation is flat; when the observations are stacked on each other, the structure becomes a rectangle

in nonrectangular structures, there are different types of records referring to different types of organizations

in a hierarchical structure, the relationship between the different types of records can be described in terms of a tree; the tree structure reflects nesting of logical relationships
o a good example of a hierarchical structure is the Current Population Survey, which has records describing households, families within households, and persons within families
o for a given household, assume that there is a household identifier hh_id and household descriptors regn (region of residence) and urbn (urban residence)
o for a given family, assume that there is a family identifier that is unique within the household, fam_id, and family descriptors inc (family income) and tanf (family TANF receipt)
o for a given person, assume that there is a person identifier pers_id and personal descriptors age and educ
o in a diagram of the hierarchy, the household-level record would sit at the top, with family-level records beneath it and
person-level records beneath those

the advantage of this structure is that it does not repeat information across lower levels; e.g., information for the household is only stored once, and information for each family is stored once

the shortcoming is that these types of files can be hard to work with

in a relational structure, data are stored in several different flat files
o each flat file corresponds to a type of observation
o the relationships between the records and files are determined by key (or index) variables
o for example, the CPS data could be stored in three different files
  - a household file with variables hh_id, regn, and urbn
  - a family file with variables hh_id, fam_id, inc, and tanf
  - a person file with variables hh_id, fam_id, pers_id, age, and educ
o families can be linked (or matched) to households through hh_id
o people can be linked to families through fam_id and linked to households through hh_id
o consider the relationship between the household and family files

  [diagram: household-file records (hh_id = 1, 2, 3, 4, each with regn and urbn) linked through hh_id to family-file records (each with hh_id, fam_id, inc, and tanf)]

relational structures are more flexible than hierarchical structures; new relationships can be added through additional key variables

rectangular files can be drawn (or created) from nonrectangular files
o for example, with the information above, we could construct person-level files that also contain all of the available family and household information for each person
o creating rectangular files from nonrectangular files is the main way that we create analysis data sets in SAS

4 Match-merging in SAS

files that are related by key variables can be match-merged in SAS

the files should be sorted in order of the key variables
o you can use PROC SORT
o you can work with a permanent file that had previously been sorted
o you can create a file that is sorted
o if a file is sorted, this will be indicated in output from a CONTENTS procedure, so you can use PROC CONTENTS to check the sorting

merging is a form of reading, so it would take the place of a SET
or INPUT statement in a DATA step

the syntax for merging two SAS data sets would be

MERGE <SAS_data_set_1> (<data_set_options>) <SAS_data_set_2> (<data_set_options>);
BY <DESCENDING> <ordering_var1> <<DESCENDING> <ordering_var2> ... <<DESCENDING> <ordering_var_n>>> <by_options>;

each of the input data sets would need to include, and be arranged by, the ordering variables

because MERGE is a type of reading command, you can use all of the input data set options with it

MERGE will produce a data set with at least one observation for each read combination of BY values
o if MERGE can't find a match, it will read in the variables and values from the available record and pad the unmatched information with missing values
o MERGE continues reading data until both contributing files are exhausted

the IN= data set option can be used to create a temporary variable that indicates whether a particular data set contributed an observation to the merged data set

example: consider the three files from our previous example; suppose that
o the household file is a SAS data set hhdata that is sorted by hh_id and
o the family file is a SAS data set famdata that is sorted by hh_id and fam_id

code to merge the data sets is

DATA hhfmrg;
  MERGE hhdata (IN=hhin) famdata (IN=famin);
  BY hh_id;
  IF hhin EQ 0 THEN PUT 'No household match for HH ' hh_id ' family ' fam_id;
  IF famin EQ 0 THEN PUT 'No family match for HH ' hh_id;
RUN;

assuming that every family matches to a household and that there are no households without families, the MERGE should produce one record per family (same length as the original family data set) but with all of the family and household variables (hh_id, fam_id, inc, tanf, regn, and urbn) in each record

make sure that you include a BY statement when using the MERGE; if you don't, SAS will conduct one-to-one merging, linking the first records in each data set, the second records, etc.

5 Working with a sorted data set

computing and storing simple statistics for BY groups is straightforward; these statistics can be computed by either PROC MEANS or PROC SUMMARY using a BY statement
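The household-family merge above is many-to-one; the same pattern extends down the hierarchy to attach family and household information to each person. A sketch, assuming the merged family-level data set from above (here called hhfmrg) and persdata are both sorted by hh_id and fam_id:

```sas
/* Attach family and household information to each person record.     */
/* hhfmrg is assumed to be the family-level merge produced above.     */
DATA persfull;
  MERGE hhfmrg (IN=fmin) persdata (IN=prin);
  BY hh_id fam_id;
  IF prin;                        /* keep only records with a person  */
  IF fmin EQ 0 THEN PUT 'No family match for person ' pers_id;
RUN;
```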
for example, consider the person-level data set from above
o suppose that it is stored in a SAS data set persdata that is sorted by hh_id and fam_id
o we could calculate the highest education attained among the family members, the age of the youngest person in the family, and the age of the oldest person in the family with the code

PROC SUMMARY DATA=persdata;
  BY hh_id fam_id;
  VAR age educ;
  OUTPUT OUT=perssum MAX=fmoldest fmmosted MIN(age)=fmyngest;
RUN;

note: the data set perssum would contain one record per family; this file could be merged directly (one-to-one) with famdata or merged (one-to-many) with persdata

it is also possible to accomplish the same thing using a DATA step; to do this, we need to introduce one additional DATA step command, the RETAIN statement
o in a normal execution phase, SAS assigns missing values to (erases the values in) all of the variables when it returns to the DATA statement at the end of an observation iteration
o the RETAIN statement keeps SAS from doing this
o the syntax for the statement is

RETAIN <list_of_variables>;

o the listed variables would not be overwritten with missing codes when SAS returns to the top of the DATA step

now consider the following code that reads persdata (assuming that persdata is sorted by hh_id and fam_id)

DATA perssum;
  SET persdata;
  BY hh_id fam_id;
  IF FIRST.fam_id THEN DO;   /* Initial obs for family: initialize   */
    fmoldest = age;          /* family values with information for   */
    fmyngest = age;          /* the first person read                */
    fmmosted = educ;
  END;
  ELSE DO;                   /* Subsequent obs for family: compare   */
    fmoldest = MAX(fmoldest, age);   /* ages and educs to current    */
    fmyngest = MIN(fmyngest, age);   /* high/low values              */
    fmmosted = MAX(fmmosted, educ);
  END;
  RETAIN fmoldest fmyngest fmmosted;
  IF LAST.fam_id;            /* Output last obs for family */
RUN;

the above example is very artificial, because the same thing can be produced using the SUMMARY procedure; however, it illustrates how BY-group processing works

BY-group processing within a single data set is very useful when working with longitudinal data
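As a sketch of the longitudinal use just mentioned — the panel data set, wave variable, and income variable here are all hypothetical — the same FIRST./RETAIN pattern computes each person's change since the first wave:

```sas
/* Hypothetical panel data set "panel", sorted by pers_id and wave,  */
/* with an income measure inc in each wave.                          */
DATA growth;
  SET panel;
  BY pers_id wave;
  RETAIN baseinc;
  IF FIRST.pers_id THEN baseinc = inc;  /* income in the first wave  */
  incchg = inc - baseinc;               /* change since first wave   */
RUN;
```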
if the data in two files are grouped but not sorted, you would use the following BY statement

BY <ordering_var1> <<ordering_var2> ...> NOTSORTED;

Survey Design and Weighting

(material drawn from Korn & Graubard 1999)

A Some basic design considerations and their implications

1 Many considerations when conducting a survey
  a want the final results to be representative of an analysis population
  b may want the survey to include specific subpopulations
    i racial/ethnic diversity
    ii economic diversity
    iii age diversity
    iv programmatic diversity
  c want to keep costs down by surveying many people in a limited number of areas
  d may have other design issues (e.g., AddHealth samples within schools and also samples peer networks)
  e possible differential unit nonresponse

2 These design issues affect subsequent statistical analyses
  a affect the representativeness of the observations
  b also affect the independence of the observations

3 Consider some general issues
  a clustering: selecting subjects from a few areas may lead to spatial correlation in the data
    i although this might not lead to biased estimates of population parameters, it could affect calculations of standard errors
    ii essentially, there is less variation than in a sample taken completely at random
  b oversampling of particular populations, which is done to ensure adequate representation of those individual populations, leads to a sample that doesn't represent the general population
  c different rates of unit nonresponse can also lead to a loss in representativeness

4 In analyses, we address these issues by
  a including survey weights
  b including design variables

B Sampling plans

1 Simple random sampling
  a consider a given population of size N
  b choose a subset of n individuals, where each possible subset is equally likely to be sampled
  c individuals are chosen without replacement (we won't refer to this distinction subsequently)
  d the ratio n/N is called the sampling rate or inclusion probability
  e most sample estimators, such as the usual mean and variance
estimators, assume this type of sampling

2 Stratified simple random sampling
  a the population is first divided into mutually exclusive and exhaustive strata
  b simple random sampling is then carried out within each stratum
  c sampling rates will likely vary across strata
    i for example, if the populations of the strata differ but the sample sizes don't, the sampling rates will vary
    ii sampling rates may vary for other reasons
  d subsequent population estimators would weight each observation by the inverse of its sampling rate
  e if the observations are more homogeneous within strata than across them, stratified random sampling can reduce the variance of the population estimates

3 Multistage sampling
  a the population is first divided into cells, or primary sampling units (PSUs), usually on the basis of geography
  b a sample of those units is taken
  c sampling of individuals then takes place within those selected PSUs
  d advantages
    i reduces the costs of conducting a survey by restricting it to a smaller number of areas (very important if in-person interviews are used)
    ii also may be the only feasible way to construct a sampling frame
  e the disadvantage is that observations within clusters may be correlated
  f weighting is again used to adjust estimates; weights are the product of
    i the inverse of the PSU inclusion probability and
    ii the sampling rate within the PSU
  g here we've described a two-stage process, but the process can be carried out iteratively

4 Unit nonresponse
  a this is a distinct issue from the sampling plan; however, the correction would be similar
  b within PSUs and clusters, we may well have unit nonresponse
  c estimated response probabilities can be formed, possibly conditioning on additional information
  d the inverse of these probabilities can then be multiplied times the other components of the weight calculation to form a new weight

5 In all of the cases except simple random sampling, we have unequal sampling rates for individuals within the overall population, but can use weights to obtain
unbiased estimates of the population parameters

C Post-stratification

1 Another strategy to improve the accuracy and representativeness of a sample is to post-stratify the sample, or the sampling weights

2 In post-stratification, weights are developed so that the totals or proportions of different types of respondents match known population figures

3 The advantage of this technique is that it brings in additional information about the population
  a it can be effective in dealing with differential nonresponse and undersampling
  b it also helps to make surveys more comparable by reweighting them to a common analysis population

D Use of weights in statistical analyses

1 If observations are sampled unequally, respond unequally, or can be adjusted to reflect the larger population, it makes sense to use weights to reduce biases

2 Weights are usually the product of
  a inverse sampling probabilities, sometimes referred to as the base weight (note: these sampling probabilities can themselves be products of probabilities if multistage sampling is used)
  b inverse response probabilities, sometimes referred to as nonresponse adjustments, and
  c post-stratification adjustments

3 For most cross-sectional statistics, calculations for weighted estimates are straightforward
  a let X_i be a variable for subject i, and let W_i be the associated weight
  b the weighted mean is

     Xbar_w = ( SUM_i W_i X_i ) / ( SUM_i W_i )

  c let x_i and y_i be deviations of variables X and Y from their weighted means; the slope coefficient from a weighted regression of Y on X is

     Bhat_w = ( SUM_i W_i x_i y_i ) / ( SUM_i W_i x_i^2 )

4 In SAS, most statistical procedures include a WEIGHT statement
  a syntax: WEIGHT <weight_variable>;
  b the weight_variable would be the SAS variable containing the weights

5 If the sampling design indicates that weights should be included, you should include them

E An alternative to weighting

1 An alternative to weighting is to model the survey design in your statistical procedure

2 In a multivariate model, this is accomplished by including measures of the characteristics that enter the weighting procedure as additional explanatory variables in an unweighted model
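A minimal sketch of the WEIGHT statement from section D — the data set name mydata and the variable names here are hypothetical; wgt is assumed to hold the final survey weight:

```sas
/* Weighted descriptive statistics */
PROC MEANS DATA=mydata MEAN;
  VAR earnings;
  WEIGHT wgt;
RUN;

/* Weighted regression of earnings on age and education */
PROC REG DATA=mydata;
  MODEL earnings = age educ;
  WEIGHT wgt;
RUN;
```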
  a coefficients on these variables will confound genuine effects and survey design effects
  b however, coefficients on the other variables should be purged of the influence of design effects
  c this assumes that you have modeled the design correctly
  d it also assumes that the design variables don't introduce other problems

3 Weights are often based on characteristics that we would include in models anyway, such as age, race/ethnicity, and socioeconomic status

4 Suggests that weights might not be especially useful in multivariate analyses

5 Other considerations might lead us to drop weights
  a dropping incomplete cases (cases with item nonresponse) from a weighted sample changes the response pattern in that sample
    i formally, the sample should be reweighted to reflect the new sample
    ii however, this is rarely done
    iii the result is incorrect weights that might not reduce bias
  b in some cases, weights can increase, rather than reduce, the variance of estimators
  c we might consider bias vs. efficiency (MSE) tradeoffs

F Adjusting for clustering

1 If a clustered survey design is used, observations may not be independent within clusters
  a this can lead to incorrect (usually downward-biased) standard errors
  b it also means that the estimation procedure is inefficient
  c in the simplest cases, the problem is similar to that from a random-effects specification; issues become more complicated if weights and other design issues enter

2 Two types of corrections are possible
  a just fix the standard errors; this leads to correct standard errors but does not address the efficiency concerns
  b estimate a FGLS specification
    i addresses standard errors and efficiency
    ii however, it requires you to take a stand regarding the precise source of spatial correlation
  c most researchers simply choose to address the first problem, using a robust method for calculating standard errors

3 A practical difficulty in adjusting for clustering is that not all public-use surveys include the necessary identifiers
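As a sketch of how design information enters SAS's dedicated survey procedures, which are covered next — the data set and the strata, cluster, and weight variable names here are hypothetical:

```sas
/* Weighted mean of earnings, with standard errors that account for   */
/* stratification (variable strat) and clustering (variable psu).     */
PROC SURVEYMEANS DATA=mydata MEAN STDERR;
  VAR earnings;
  STRATA strat;
  CLUSTER psu;
  WEIGHT wgt;
RUN;
```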
G SAS procedures

1 As mentioned, most SAS statistical procedures allow you to incorporate WEIGHTs

2 SAS has a number of procedures that are designed to accommodate survey data with complex designs

3 PROC SURVEYMEANS (see the SAS/STAT documentation for the SURVEYMEANS procedure)
  a syntax

     PROC SURVEYMEANS <options>;
       VAR <list_of_analysis_variables>;
       STRATA <list_of_stratifying_variables>;
       CLUSTER <list_of_clustering_variables>;
       WEIGHT <weight_variable>;

  b the sample design information is specified in the STRATA, CLUSTER, and WEIGHT statements
  c if multiple STRATA or CLUSTER variables are specified, SURVEYMEANS examines the available combinations of these variables
  d if STRATA and CLUSTER variables are both specified, SURVEYMEANS adjusts for clustering within STRATA
  e for multistage designs, the first (highest) STRATA and CLUSTERs should be used
  f the procedure allows for other options, such as CLASS analyses and BY processing

4 PROC SURVEYREG (see the SAS/STAT documentation for the SURVEYREG procedure)
  a syntax

     PROC SURVEYREG <options>;
       STRATA <list_of_stratifying_variables>;
       CLUSTER <list_of_clustering_variables>;
       WEIGHT <weight_variable>;
       MODEL <dependent_variable> = <list_of_independent_variables> </ options>;

  b the sample design information is specified in the same way as in the SURVEYMEANS procedure
  c unlike the standard REG procedure, only one MODEL statement can be specified

5 SURVEY procedures are also available for
  a frequency distributions (SURVEYFREQ)
  b logistic regression (SURVEYLOGISTIC)
  c selecting samples with given design features (SURVEYSELECT)

References

Carrington, William, John L. Eltinge, and Kristin McCue. "An Economist's Primer on Survey Samples." Working Paper no. 00-15. Suitland, MD: Center for Economic Studies, U.S. Bureau of the Census, October 2000.

Korn, Edward L., and Barry I. Graubard. Analysis of Health Surveys. New York: Wiley, 1999.

SAS Macros

1 Introduction

SAS comes with a macro processor and a macro language that can make your programs more flexible

macros provide a
way of writing higher-level statements and programs that generate SAS code

in processing code, the macro processor reads the macro language and substitutes revised code into the SAS program; after this, the generated SAS statements are executed

two delimiters tell the macro processor that macro code is about to follow
o &name - is used to indicate a macro variable
o %name - is generally used to indicate a macro or a macro command (the %INCLUDE, %LIST, and %RUN SAS commands follow this format but are not macros; these commands are more helpful in interactive line mode and will not be discussed here)

2 Basics of SAS macro processing

replacing text strings using macro variables

suppose that we wanted to write generalized code that annotated program output to indicate whether we had conducted a weighted or unweighted statistical analysis; we could use a macro %LET statement as follows

%LET wgtrun=Weighted Analysis;

this would assign the text string "Weighted Analysis" to the macro variable wgtrun

later in the program, we could use that macro variable as follows

TITLE "Descriptive Statistics &wgtrun";

the macro processor would take the text string "Weighted Analysis" and substitute it for &wgtrun, so that SAS would execute the statement

TITLE "Descriptive Statistics Weighted Analysis";

macro variables can be global or local
o global variables are used throughout a SAS session
o local variables are used within a particular macro

generating SAS code using macros

there are macro commands, like the %LET statement described above; you can also create macros

the basic syntax is

%MACRO <macro_name>;
  macro definition here
%MEND <macro_name>;

the macro would substitute the text from the macro definition anywhere that the macro is invoked; to invoke the macro, you would type %macro_name

for example, typing

%MACRO wgt_type;
Weighted Analysis
%MEND wgt_type;

TITLE "Descriptive Statistics %wgt_type";

would define a macro wgt_type that would write the string "Weighted Analysis" wherever it was invoked
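As a sketch of a slightly more realistic macro than the string example above — the data set name and variables are hypothetical — a macro body can contain whole SAS steps:

```sas
/* Hypothetical: wrap a weighted PROC MEANS call in a reusable macro */
%MACRO wmeans;
  PROC MEANS DATA=mydata MEAN;
    VAR earnings;
    WEIGHT wgt;
    TITLE "Descriptive Statistics Weighted Analysis";
  RUN;
%MEND wmeans;

%wmeans   /* generates and executes the PROC MEANS step above */
```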
o this would exactly replicate the macro string example that we examined earlier
o generally, we would use macros for something more sophisticated than this

the macro statements can include SAS statements

it is also possible to pass arguments into macros

%MACRO <macro_name>(argument1, ..., argumentn);
  macro definition that includes &argument1, &argument2, ...
%MEND <macro_name>;

the arguments argument1 through argumentn would be passed into the macro and used as local macro variables

to invoke a macro with arguments, you would type

%macro_name(argument1=<value1>, ..., argumentn=<valuen>)

arguments make macros much more generalizable

other macro features and techniques

macro statements can be used to generate SAS code conditionally

%IF <condition1> %THEN %DO;
  <conditional text 1>
%END;
%ELSE <%IF <condition2> %THEN> %DO;
  <conditional text 2>
%END;

the syntax is similar to a regular IF/THEN/ELSE statement but would be used within a macro definition

repeated SAS code can also be generated

%DO <index_var> = <start_number> %TO <end_number>;
  <macro text to be repeated>
%END;

the macro language also has %DO %WHILE and %DO %UNTIL constructs

3 Macro variables

system-defined variables
o the macro processor has a number of variables already defined
o examples
  - &SYSDATE contains the current date (with two digits for years)
  - &SYSDSN contains the name of the most recently used SAS data set
o the complete list is available in the SAS macro language reference documentation

user-defined macro variables
o macro variables can be created in macro assignment statements using %LET
o they can also be defined using %GLOBAL and %LOCAL statements
o they can be created as arguments in macro definitions
o there are some other statements, such as the iterative %DO statement, that create macro variables

assigning values
o in assignment (%LET) statements, the macro processor generally fills macro variables with the character string that follows, ignoring any leading or trailing blanks
o for example, all three of these expressions are
equivalent

%LET wgtrun=Weighted Analysis;
%LET wgtrun =Weighted Analysis;
%LET wgtrun = Weighted Analysis;

the statements also treat most other characters, including digits, as text; for example

%LET n=2+2;

puts the text string "2+2" into n

the only exceptions to this are the characters % and &, which tell the system that a macro follows; for example

%LET var1=value1;
%LET var2=&var1;

puts "value1" into var1 and subsequently into var2

assignment statements can also include macros and special macro functions
o to perform integer evaluations of numbers, you can use the %EVAL() function

  %LET n=%EVAL(2+2);

  assigns the text string "4" to n
o to perform floating-point (real) evaluations of numbers, use the %SYSEVALF() function

a null value is assigned by leaving the expression on the right side of the %LET statement blank

%LET n=;

thereafter, any time that &n is used, it will generate nothing

sometimes you want macro variables to contain special characters, such as blanks, semicolons, parentheses, quotations, &, and %; this process is called macro quoting
o the %STR() function masks all special characters except & and %
o the %NRSTR() function masks these as well as & and %

using macro variables

once a variable is created, you would use it (or "resolve" it) by typing it with an ampersand; for example

%LET wgtrun=Weighted Analysis;
TITLE "Descriptive Statistics &wgtrun";

resolves to

TITLE "Descriptive Statistics Weighted Analysis";

an exception to this is a macro variable that is used within single quotes (this is an example where single and double quotes matter); for example

TITLE 'Descriptive Statistics &wgtrun';

resolves to

TITLE 'Descriptive Statistics &wgtrun';

the macro variable is effectively ignored

macro variables can be used with other text
o macro variables can be appended to the end of a text expression; no delimiter is used

  TITLE "Descriptive Statistics Un&wgtrun";

  resolves to

  TITLE "Descriptive Statistics UnWeighted Analysis";

o to append a macro variable to the beginning of a text expression, you would use a
period as a delimiter

  TITLE "Descriptive Statistics &wgtrun.XX";

  resolves to

  TITLE "Descriptive Statistics Weighted AnalysisXX";

o to append a macro variable to the beginning of a text expression and also include a period, you would use two periods as delimiters; this is useful for writing macros that reference external SAS data sets; for example

  %LET libref=cartdata;
  DATA &libref..hw1;

  resolves to

  DATA cartdata.hw1;

listing the values of macro variables
o you can write the values in macro variables to the SAS log using the %PUT statement
o this is helpful in checking and debugging your code

scope of macro variables
o as mentioned, macro variables can be global (available throughout a program) or local (available only within a particular macro)
o generally, macro variables that are created outside of a macro will automatically be global, while macro variables created inside of a macro or created as arguments for a macro will automatically be local
o to specify that macro variables created inside of a macro are global, you would use the %GLOBAL statement
o if you try to reference a local variable outside the macro in which it is created, you will get a cryptic warning message that the "apparent symbolic reference" for that variable is not resolved
o the macro variables that are currently available to the system are listed in the local and global symbol tables; to write these variables to the log, you would use the %PUT statement with these options
  - _ALL_ would write all of the macro variables
  - _AUTOMATIC_ would write all of the automatically generated variables
  - _GLOBAL_ would write all of the macro variables from the global table
  - _LOCAL_ would write all of the macro variables from the local table
  - _USER_ would write all of the user-created macro variables

passing information from a DATA step to a macro variable
o sometimes it is useful to take values that are produced by the processing of the program and to pass these into subsequent macros
o the SYMPUT routine does this; the syntax is

CALL
SYMPUT(<macro_variable>, <value>);

o the macro variable argument is either a character string or a character variable with the name of the macro variable that you want to pass information into
o the value argument is a character string or character expression with the information that will be passed

example

DATA t1;
  SET t0 END=final;
  /* SAS commands */
  IF final THEN CALL SYMPUT('numobs', TRIM(LEFT(_N_)));
RUN;

FOOTNOTE "Data set t0 has &numobs observations";

passing information this way can be tricky because it raises issues of when information is available; note the use of the RUN statement before the FOOTNOTE statement above

4 Some debugging tools

macros are trickier to program than regular SAS code

there are some OPTIONS that help trace code
o MLOGIC - causes SAS to trace the flow of the execution of macros
o MPRINT - causes SAS to write each statement generated by a macro to the SAS log
o SYMBOLGEN - causes SAS to describe how each macro variable resolves

you can also use the %PUT statement to selectively write information to the SAS log

note: the trace options can generate lots of output, so use them selectively and turn them off once you are sure that your macros are working correctly

5 For more information on macros, see the SAS Macro Language: Reference documentation

Reading and Writing Data in SAS

1 Basic input/output (i/o) commands

in the introduction, we went over basic i/o commands

basic input
o to read a SAS data set, you would use the SET command
o to read a space-delimited ASCII data set, you would use the INFILE and simple (variable list only) INPUT commands

basic output
o to create a SAS data set, you would use the DATA statement
o to create a space-delimited ASCII data set, you would use the FILE and simple PUT commands

this class will cover more advanced operations, including
o creating and reading permanent SAS data sets
o options for creating and reading SAS data sets
o different types of formatted ASCII data sets
o working with SAS transport files

2 What does a DATA step normally
do? (see the Base SAS documentation on DATA step processing)

the Compilation Phase
o checks syntax of submitted statements
o creates the input buffer, program data vector, and descriptor information

the Execution Phase - an implicit loop over the observations being read; observations are processed starting from the DATA statement
o a record is either read into the input buffer (if INPUT is used) or directly into the program data vector (if a SAS data set is read through a SET or MERGE)
o additional statements in the DATA step are executed on that observation
o at the end of the DATA step
  - the observation is output
  - processing returns to the top of that step
  - all information in the program data vector is reset
o the iteration counter _N_ is updated
o flow continues until the end of the input file is reached

processing is best-suited for rectangular data sets but can accommodate other structures; lots of variations are possible

3 The OUTPUT and DELETE commands

unless told otherwise, SAS outputs one record to the SAS data set specified in the DATA statement at the end of the DATA step for each record read

if you do not want SAS to output a record to the SAS data set, you would type the DELETE command
o when this command is used, the current observation is not output
o also, SAS immediately returns to the top of the DATA step to begin a new iteration; the commands following a DELETE statement are not executed
o the DELETE command shortens the length of a SAS data set; recall that the DROP and KEEP commands reduced the width of a data set

if you want to write an observation to a SAS data set somewhere other than the end of a DATA step, you would use the OUTPUT command

4 Temporary and permanent SAS data sets

SAS distinguishes between temporary and permanent SAS data sets

temporary SAS data sets
o only last as long as a SAS session; they are automatically deleted when the SAS session closes
o have a one-part name, e.g., <temp_data_set_nm>
o to read a temporary data set, you would use the command SET <temp_data_set_nm>;
o to create a temporary data set, you would begin the DATA step with the statement DATA <temp_data_set_nm>;
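A sketch of the DELETE and OUTPUT commands described above — the data set and variable names here are hypothetical:

```sas
/* Drop records with missing income and split the rest by region.   */
DATA south other;
  SET persons;                 /* hypothetical input data set       */
  IF inc EQ . THEN DELETE;     /* skip record; nothing below runs   */
  IF regn EQ 'South' THEN OUTPUT south;
  ELSE OUTPUT other;           /* explicit OUTPUT replaces implicit */
RUN;
```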
set, you would begin the DATA step with the statement DATA <temp_data_set_nm>
  • permanent SAS data sets
    - are stored in libraries and can outlive a SAS session
    - have a two-part name, <lib_ref>.<perm_data_set_nm>, where
      · the first part, <lib_ref>, is the internal library name defined as the first argument in a LIBNAME statement, and
      · the second part, <perm_data_set_nm>, is the permanent data set name
    - in Windows, the permanent data set would be stored as <perm_data_set_nm>.sas7bdat in the directory defined in the second argument of the LIBNAME statement
    - to read a permanent data set, you would use the command SET <lib_ref>.<perm_data_set_nm>; to create a permanent data set, you would begin the DATA step with the statement DATA <lib_ref>.<perm_data_set_nm>

5. SAS data set options (see the SAS Language Reference: Dictionary in the SAS OnlineDoc at support.sas.com)
  • options can be used when a SAS data set is created or read; most options can be used for either operation, but some are specific to creating or reading
  • the options would appear in parentheses after the SAS data set name
  • some common options:

      Option      Description                                                 Input/Output
      DROP=       drops listed variables from data set                        both
      FIRSTOBS=   data processing begins with specified observation number    input
      KEEP=       keeps only the listed variables                             both
      RENAME=     renames the listed variables                                both
      WHERE=      only keeps observations that satisfy the listed condition   input

  • some examples:
      SET <sas_data> (DROP=<var1> <var2>);        drops <var1> and <var2> from <sas_data>
      SET <sas_data> (KEEP=<var1> <var2>);        only inputs <var1> and <var2> from <sas_data>
      SET <sas_data> (RENAME=(<var1>=<new1>));    renames input variable <var1> as <new1>

6. Changing general specifications of input and output files
  • we have previously discussed the syntax for input and output statements for simple space-delimited text (ASCII) files
  • changing the length of the line: one way that text input and output can become more complicated has to do with the length of the line being read; the Windows
version of SAS assumes that input and output will take up no more than 256 character spaces on each line
    - unless it is told otherwise, SAS will ignore input past the 256th character position of a line and jump to the next line to continue reading a record; similarly, when writing output, SAS will begin a new line (insert a line break) after the 256th position
    - to specify an alternative logical record length, or LRECL, you would add the LRECL= option to the end of either the INFILE or FILE statement
  • changing delimiters: another potential complication is the use of alternative delimiters in a file, such as commas; to change the delimiters that SAS will recognize, add the DELIMITER= (or DLM=) option; the new delimiter or delimiters should appear as a character constant, e.g., DLM=',', or as a character variable
  • reading tabbed data: some datasets include tabs as delimiters; in a file, tabs are simply stored as characters, so you could adjust the delimiter to recognize tabs; another approach is to include the EXPANDTABS option, which converts tab characters to a series of spaces (converts tab-delimited input into space-delimited input)

7. Formatted input and output files
  • the syntax that we discussed for INPUT and PUT statements was very simplified; a more complete syntax is
      INPUT <specification_1> <specification_2> ... <@ or @@>;
      PUT <specification_1> <specification_2> ... <@ or @@>;
  • each specification can consist of several parts:
    - column and line pointer controls
    - a variable name or list
    - column specifications
    - format modifiers and informats
  • in addition, @ and @@ characters can appear at the end of the statements to hold the line until the next INPUT or PUT statement, in the same DATA step iteration (@) or possibly across iterations (@@)
  • pointer controls
    - column pointer controls precede variable names in an INPUT or PUT statement; these controls direct SAS to begin reading or writing the variable information starting at the specified column
    - column pointer controls can be absolute or relative
      · absolute controls are written @n, where n
is the column position where you want to next read or write
      · relative controls are written +n, where n is the number of additional column positions from the current position
    - examples:
        PUT @12 variable1;
      this would begin writing variable1 in the 12th column position of an output line
        INPUT @4 variable1 4. +5 variable2;
      this would read variable1 from the 4th-7th column positions and begin reading variable2 in the 13th column position of the input line
    - line pointers advance the input or output by a line; the symbol for a line advance is /
        INPUT variable1 / variable2;
      would read variable1 from one line and variable2 from the next line
  • line holds: normally, at the end of an INPUT or PUT command, SAS advances to the next line; line-hold commands prevent this
    - these commands are useful if the formatting of data (the record layout) varies across lines; you can determine what type of record you are reading with one INPUT command and then specify the correct format for the rest of the record using a subsequent command
    - to hold a line within the current DATA step iteration, you would leave an @ at the end of the line; for example
        INPUT variable1 @;
        IF variable1 EQ 1 THEN INPUT variable2;
        ELSE INPUT variable3;
      would read variable1 from the beginning of a record (line); depending on the value of variable1, the code would read and interpret the next piece of data on the same line as either variable2 or variable3
    - to hold a line across DATA step iterations, you would leave an @@ at the end of the line
  • informat specifications
    - informat specifications immediately follow either a variable or a list of variables
    - to specify the input or output as character data, you would use $ as the informat specification; to specify something as a character of a particular length, you would use $n. as the informat, where n gives the length of the character
    - to specify the input or output as an integer of a particular length, you would use n.
    - to specify the input or output as a real number of a particular length and precision, you would use n.d,
where n is the total length of the number, including the decimal point, and d is the number of decimal places in the number
    - example:
        INPUT variable1 $8. variable2 5. variable3 8.3;
      would read variable1 as a character variable of length 8, variable2 as an integer of length 5, and variable3 as a real variable of length 8 and with 3 decimal places
    - SAS has MANY other informat specifications; see the SAS Language Reference: Dictionary in the SAS OnlineDoc (support.sas.com) for a complete list
    - for lists of variables with the same informat, you would group the variables and the informat into parentheses; for example
        INPUT (variable1 variable2 variable3) (5.);
  • column specifications
    - another way to read or write data is to specify the beginning and ending column positions
    - these specifications would follow the variable name in an INPUT or PUT command
        PUT variable1 12-15 variable2 16-18;
      writes variable1 in the 12th-15th column positions and variable2 in the 16th-18th positions

8. SAS transport files
  • the structure of SAS files varies across computer platforms; a SAS file created for a Windows computer will not be readable on a UNIX computer
  • SAS does have a facility for creating files that can be moved across systems; these are called transport files
  • transport files cannot be read directly in a DATA step; you need to first create a copy of the file in your system's format
  • suppose that you had a SAS transport file called newdata.xpt in a directory C:\mysasdata, and suppose that you wanted to create a SAS permanent data set based on this in the same directory; you could do this by typing the code
      LIBNAME outlib "C:\mysasdata";
      LIBNAME trandata XPORT "C:\mysasdata\newdata.xpt";
      PROC COPY IN=trandata OUT=outlib;
      RUN;
  • transport files can also be read (and in some cases must be read) using the CIMPORT procedure; if the COPY procedure does not work, you might try the CIMPORT procedure
  • to perform the same steps using the CIMPORT procedure, you would type
      LIBNAME outlib "C:\mysasdata";
      FILENAME tran_in "C:\mysasdata\newdata.xpt";
      PROC CIMPORT LIBRARY=outlib
        INFILE=tran_in;
      RUN;
  • note: transport files can be tricky to use, and some of the specifications for these files can vary across versions of SAS
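Several of the tools above (permanent data sets, the LRECL= option, a trailing @ line hold, pointer controls, informats, and data set options) can be combined in one small program. The sketch below is not from the notes: the library path, file name, variable names, and record layout are all hypothetical, chosen only to illustrate how the pieces fit together.

```sas
/* Hypothetical example combining tools from this section.          */
/* Assumes the directory C:\mysasdata exists and that rawdata.txt   */
/* has a record-type flag in column 1 (1 = wage record, 2 = state   */
/* record), an id in columns 3-7, and the remaining field starting  */
/* in column 9.                                                     */

LIBNAME mylib "C:\mysasdata";                   /* library for permanent data sets  */

DATA mylib.earnings (DROP=rectype);             /* two-part name -> permanent;      */
                                                /* data set option drops the flag   */
  INFILE "C:\mysasdata\rawdata.txt" LRECL=500;  /* allow lines longer than 256 cols */
  INPUT rectype 1. @;                           /* read the type, hold the line (@) */
  IF rectype EQ 1 THEN
    INPUT @3 id 5. @9 wage 8.2;                 /* cols 3-7 integer; cols 9-16 real */
                                                /* with 2 implied decimal places    */
  ELSE
    INPUT @3 id 5. @9 state $2.;                /* same id, then 2-char state code  */
RUN;
```

Because the output data set has a two-part name (mylib.earnings), it survives the SAS session; replacing it with a one-part name (DATA earnings;) would create a temporary data set instead.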