INTRO TO STATISTICAL COMPUTING
INTRO TO STATISTICAL COMPUTING STAT 321
Virginia Commonwealth University
These 66 pages of class notes for STAT 321 at Virginia Commonwealth University, taught by James Davenport in Fall, were uploaded by Verdie Hauck PhD on Wednesday, October 28, 2015.
Date Created: 10/28/15
Creating multiple observations from the same record

Use the double trailing @ (@@). This is another of the pointer controls: it not only prevents a new record from being read into the buffer when a new INPUT statement is encountered, but it also prevents the record from being released when the program loops back to the top of the DATA Step. Carefully read section 2.13 in your textbook. Let's look at some examples involving the census data of the 50 states and the falls data.

See Program SASLec04input01.sas
See Program SASLec04input02.sas

Reading multiple records to create a single observation

Suppose the data is recorded as follows:

    1023 David Shaw
    red
    189 195
    1049 Amelia Serrano
    yellow
    145 124

There are four ways to read this data:
1. a single input statement
2. multiple input statements
3. using the / line pointer control
4. using the #n line pointer control

1.
    input idno 1-4 name $ 6-23 team $ strtwght endwght;

2a.
    input idno 1-4 name $ 6-23;
    input team $ 1-6;
    input strtwght 1-3 endwght 5-7;

2b. (this reads only idno and the weights)
    input idno 1-4;
    input;
    input strtwght 1-3 endwght 5-7;

3. "/" forces a new record to be read into the input buffer and the pointer to return to column 1.
    input idno 1-4 / / strtwght 1-3 endwght 5-7;
This does the same thing as 2b above.

See Program SASLec04input03a.sas
See Program SASLec04input03b.sas
See Program SASLec04input03c.sas

4. #n points to specific lines in the input data records. It allows you to read the variables in any order, no matter which record contains the data values.
    data club2;
       input #2 team $ 1-6
             #1 name $ 6-23 idno 1-4
             #3 strtwght 1-3 endwght 5-7;
       cards;
    etc.

Variables List Input

    input Q1-Q4;
is equivalent to
    input Q1 Q2 Q3 Q4;

See Program SASLec04input04.sas

Another example, with Formatted Input:
    data new;
       input (name1-name3) ($6. +4) @7 (score1-score3) (3. +7);
       lines;
    Bob   100 Phred  73 Geoff  94
This is an example of Variables List Input.

See Program SASLec04input05.sas

Reading Data from External Files

This requires the use of another SAS holy word, namely INFILE. The following statement must be placed in the Data
Step before the INPUT statement:

    infile 'E:\My Current Work Files\stat321\my data files\census.dat';

See Program SASLec04censusdata1.sas (re: the census data examples)

Creating Permanent SAS Data Sets for Future Use

This again introduces a new SAS holy word, namely LIBNAME. This statement must be placed outside the Data Step; it is usually best to place it early in the program, up front near the OPTIONS statement.

    libname jim "E:\My Current Work Files\stat321\mysdl";

Note the use of the library reference name jim. This is used in SAS's two-level naming system. This level (i.e., the first level) has meaning only in the current program.

See Program SASLec04censusdata2.sas
See Program SASLec04censusdata3.sas
Pay attention to the log file in the examples.

Accessing Permanent SAS Data Sets

You must first specify the location and library reference name in the LIBNAME statement. Then, within the Data Step, use the SET statement:

    data newone;
       set librefname.datasetname;
    proc contents;

See Program SASLec04censusdata4.sas

Note: to bring an existing SAS data set into our work area, we use the SET statement within the Data Step. How do we know what's in a permanent SAS data set? For that we need a procedure called PROC CONTENTS.

USING MORE THAN ONE OBSERVATION IN A CALCULATION

See program SASCreatetourrev.sas

This material deals with calculations that require more than one observation:
o Accumulating a total across a data set or a BY group
o Saving a value from one observation in order to compare it to a value in a later observation

Accumulating a Total for an Entire Data Set

o First, let's find the total number of tours that have been booked. We include a sum statement of the form
    variable + expression;
  e.g.,
    totbook + bookings;

See program SASAccumtourrev.sas

If you just want the total number of bookings, you can use a combination of the KEEP= option in the DATA statement and the END= option in the SET statement. END= creates a variable whose value is 1 when the Data Step is processing the last observation and 0 for
other observations. The basic syntax is

    SET SAS-data-set <END=variable>;

o The SAS system does not add the END= variable to the data set being created. It is in the program data vector, but the SAS system does not output the variable to the SAS data set.
o You can use the END= variable to find the last observation.

See program SASUseEndtourrev.sas

Getting a Total for Each BY Group

Suppose you need the number of tour bookings for each vendor, and you want this information to be in a SAS data set format.
o If we want to group by VENDOR, then we must use a BY statement in the Data Step.
o If we use a BY statement, then we must first sort by VENDOR.
o To accumulate the number of bookings, you must use a sum statement in the Data Step, and the sum variable must be reset to zero at the beginning of each new value of the BY variable VENDOR.
o You can use FIRST.VENDOR in an IF-THEN statement to set the sum variable to 0 in the first observation of each BY group.

See program SASAccumBytourrev.sas

To report the total bookings by vendor group in a SAS data set format, you need only the variables VENDOR and VENDORBK from the last observation in each BY group. Use the DROP= option in the DATA statement to purge the other variables (COUNTRY, LANDCOST & BOOKINGS). Use LAST.VENDOR in a subsetting IF statement to write the last observation in each BY group.

See program SASKeepLasttourrev.sas

Writing Totals and Detail Reports

o Suppose you want more information than just total bookings: maybe you want to see information about the cost of each tour, and you want it all in one SAS data set.
o In a second SAS data set, you want to see the total number of bookings with each vendor and the total money spent with that vendor.
o Begin the program by creating two SAS data sets using the DATA statement and the SET statement. See the first part of program SASMoreInfotourrev.sas
o The only calculation you need to make for the individual tours is the amount of money spent on each tour. Therefore, calculate the amount using an assignment statement and output
the record to the SAS data set details. See the middle part of program SASMoreInfotourrev.sas

Recall that after an OUTPUT statement executes, the observation remains in the program data vector; hence we can continue to operate on this observation. Use the FIRST.VENDOR variable to tell when the SAS System is processing the first observation in each group, and set the sum variables equal to 0 there. Then accumulate the sums with appropriate sum statements. Then output the last observation by using the LAST.VENDOR variable in an IF-THEN OUTPUT statement. Finally, include the DROP= and KEEP= options so that each data set contains just the variables that you want.

See program SASMoreInfotourrev.sas

Using a Value in a Later Observation

Suppose you wish to create a new SAS data set that contains the tour that generated the most revenue (Revenue = bookings * landcost).

Method 1: Calculate the revenue in a Data Step. Sort the revenue in descending order. In another Data Step, use the OBS= option to output the first observation.

Method 2 (more efficient): Compare the revenue from all observations in a single Data Step. The SAS System can retain a value from the current observation to use in future observations. When the processing of the Data Step reaches the next observation, the held value represents information from the previous observation.

A sum variable is an example of a variable that retains its value from one observation to the next. The RETAIN statement causes a variable created in the DATA Step to keep its value from the current observation into the next observation, rather than being set to missing at the beginning of each iteration of the Data Step.

Handling Character Comparisons

Some special situations arise when making comparisons among character variables:
o Compare uppercase and lowercase characters
o Select all values beginning with a particular group of characters
o Select all values beginning with a particular range of characters
o Find a particular value anywhere within another character
value

Comparing Uppercase and Lowercase Characters

The SAS system distinguishes between uppercase and lowercase letters in comparisons: Madrid and MADRID are not the same value. The UPCASE function produces an uppercase value; you can then make the comparison between the uppercased variable and an uppercase constant.

See program SASNewGuidearttour.sas

Selecting All Values Beginning with the Same Group of Characters

In many situations you need to select a group of character values, such as all persons whose last name begins with a D. By default, the SAS system compares values of different lengths by adding blanks to the end of the shorter value and testing the result against the longer value.

first attempt -->
    if guide='D' then chosen='Yes';
    else chosen='No';

The SAS system interprets the comparison as guide='D       ', where D is followed by seven blanks, because GUIDE has a length of 8 bytes. Because the value of GUIDE never consists of the single letter D and 7 blanks, the comparison is never true. To compare a long value to a shorter standard, put a colon (:) after the operator.

correct method -->
    if guide=:'D' then chosen='Yes';
    else chosen='No';

The colon causes the SAS system to compare the same number of characters in the shorter value and the longer value. In this case the shorter string contains one character; therefore the SAS system tests only the first character from the longer value. All names beginning with a D make the comparison true. If you aren't sure that all the values of GUIDE begin with a capital letter, use the UPCASE function.

See program SASDGuidearttour.sas

Selecting a Range of Character Values

Suppose you want to select values beginning with a range of characters, such as all names beginning with the letters A through L or M through Z.
o In computer processing, letters have magnitude. A is the smallest letter in the alphabet and Z is the largest. Therefore the comparison 'A' < 'B' is true; so is the comparison 'D' > 'C'.
o A blank is smaller than any letter.
o What about special characters? (In a future lecture.)

See program
SASGuideGrouparttour.sas

Finding a Value Anywhere within Another Character Value

Suppose you need to select tours that visit other attractions in addition to museums and galleries. In data set ARTS.ARTTOUR, the variable DESCRIBE refers to those events as "other". However, the position of the word "other" varies in different observations. How can you find out whether "other" exists anywhere in the value of DESCRIBE for a given observation?

The INDEX function determines whether a specified character string (the excerpt) is present within a particular character value (the source):

    INDEX(source, excerpt)

Both source and excerpt can be any kind of character expression, including character strings enclosed in quotes, character variables, and other character functions. If excerpt does occur within source, the function returns the position of the first character of excerpt, which is a positive number. If it doesn't, the function returns 0. By testing for a value greater than 0, you can determine whether a particular string is present in another character value:

    if index(describe,'other') > 0 then otherev='Yes';
    else otherev='No';

OR it can be written

    if index(describe,'other') then otherev='Yes';
    else otherev='No';

See program SASOtherEventsarttour.sas

CREATING SUBSETS OF OBSERVATIONS

See SASCreatearts.sas

Selecting Observations for a New Data Set

There are two ways to select specific observations when creating a new SAS data set from an existing SAS data set:
o Delete the observations that meet a condition, which implies that all others are kept.
o Accept only those observations that meet a condition.

Deleting

    if condition then delete;

This causes the SAS system not to output that observation from the program data vector to the new data set, and to return to the top of the Data Step. This deletion of observations works with the SET statement (reading a permanent data set), the INFILE statement, or in-stream data entry (CARDS or LINES statements).

See program SASRemove1arts.sas

Let's assume a permanent SAS data set exists in the SAS data library called "asdl"
(libref name) and the data set is called one. We already know how to create and examine the contents of such a data set, but how do you access and modify/alter an existing data set?

    data two;
       set asdl.one;

This produces sequential processing of the observations in one. There are no INPUT or CARDS statements. While creating data set two, we can:
o create new variables via transformations
o choose selected observations
o choose selected variables
And we can create additional permanent SAS data sets within a single Data Step.

See Program SASModifyVariablescensusdata5.sas

The following examples use a data set called origin.

See Program SASCreateorigin.sas
See Program SASProcContentsorigin.sas

Let's now focus on accessing a permanent SAS data set and forming new SAS data sets that are subsets of the original data set. (Go over the subsetting diagram.)

o Selecting subsets of observations: FIRSTOBS=n & OBS=m, e.g.,

    data asdl.originsubsets1;
       set asdl.origin(firstobs=7);          * first obs kept;
or
    data asdl.originsubsets3;
       set asdl.origin(obs=10);              * last obs kept;
or
    data asdl.originsubsets2;
       set asdl.origin(firstobs=4 obs=10);

  Note that OBS= gives the last observation read, not the number of observations.

See program SASSubsetobsorigin.sas

We could also use IF statements (see example).

See program SASSubsetiforigin.sas

We will discuss this in more detail later in the semester.

o Keeping or Dropping Variables

    data asdl.originsubset1;
       set asdl.origin(keep=name a b c);

See Program SASSubsetKeep1origin.sas

or

    data asdl.originsubset2;
       set asdl.origin;
       keep name a b c;

See Program SASSubsetKeep2origin.sas

The latter is an example of a KEEP/DROP statement, while the former is an example of the KEEP=/DROP= option within another SAS statement (DATA or SET).

o Choosing between Data Set Options or Statements
With options you control the size of the program data vector. Also, data set options can sometimes be used in cases where the statement will not work.

o Choosing between DROP or KEEP
It depends. In a simple case, you may base your decision to use the DROP or KEEP
options on which method allows you to specify fewer variables. If you are working with large production jobs that read data sets to which variables may be added between the times your job runs, you may want to use the KEEP= option to be sure which variables are in the subset data set.

o Creating Two Data Sets in a Single Data Step

    data asdl.originfirst(keep=name a b c d)       <-- no semicolon
         asdl.originsecond(keep=name a d e f);
       set asdl.origin;

In this case you must use the KEEP= data set option. If you use the KEEP statement, then all data sets created in this Data Step would contain the same variables.

See Program SASTwoSubsetsorigin.sas

Differences in DROP & KEEP as Options in the DATA Statement vs. the SET Statement

o Using these options in the SET statement determines which variables are read from the permanent SAS data set being used as input; hence they determine how the program data vector is built. Excluded variables are never read into the program data vector at all.
o Using these options in the DATA statement determines which variables are written from the program data vector to the resulting SAS data set.

o Rules for SAS Statements (see Online Help & Textbook)
1. Every SAS statement must end in a semicolon.
2. SAS statements can be in upper- or lowercase.
3. Statements can continue on the next line, as long as you don't split words in two.
4. Statements can be on the same line as other statements.
5. Statements can start in any column.

o Comment statements
Everything between /* and */ is a comment.
-->  * This is a one-line comment statement;

o Rules for SAS Names (see Online Help & Textbook)
1. Names must be 32 characters or fewer in length, but there are exceptions: librefs, filerefs, etc. must be 8 characters or fewer in length, informats must be 31 characters or fewer, and format names may be up to 32 characters.
2. Names must start with a letter or an underscore.
3. Names can contain only letters, numerals, or underscores.
4. Names can contain upper- and lowercase letters.

o SAS Data Set: up to 32,767 variables; the number of rows is only
limited by the resources on your computer.
1. What is it? What creates it?
2. How does the Data Step work?
3. What do you supply to SAS so it can construct a SAS Data Set?

o Consider Figures 2.1, 2.2, 2.3 & 2.7.

Raw data (outside the computer) can come from:
1. any word processor
2. Microsoft's Access
3. Epi Info (Make View and Enter Data)
4. dBase III
5. SAS/FSP
6. Spreadsheets (Lotus, Quattro Pro, NCSS, etc.), either as ASCII files or directly to SAS Data Sets

o SAS Data Structure
  o rectangular array (casewise data entry)
    - each row (record, etc.): observations on a single experimental unit (case, subject, item, individual, etc.)
    - each column (field, etc.): a different variable
  o missing data: In SAS every variable EXISTS for every observation. Thus missing values must be assigned some sort of missing-value character, i.e., a period.
o SAS Data Set descriptor portion
o temporary vs. permanent SAS Data Sets (work.yourfile); number of observations & variables in the LOG
o We will eventually discuss how to read existing files, create permanent SAS Data Sets in a SAS Data Library, and how to come back and use a permanent SAS Data Set.

SAS PROGRAMS

Two parts: Data Steps and Proc Steps. Using v. 9.1.3, these should end with a RUN statement when using the PC version of SAS.

How the Data Step Works

o names the data set
o compiles (checks syntax)
o executes (if correct)

The Data Step in its simplest form is a loop with an automatic output and return action.

During the Compile Phase, setup also takes place:
o syntax is checked, and each statement is compiled if correct
o the input buffer is set up (records, cards, data lines, etc.)
o the program data vector is set up and all variables are set to missing values
o the descriptor portion is initialized; among other things, the six characteristics of each variable are initialized:
    - Name
    - Type
    - Length
    - Informat
    - Format
    - Label

During the Execution Phase, each record is read, processed, and written to the program data vector. When the end of a Data Step is encountered:
1. the current observation is written to the data set
2. the program loops to the top of the Data Step
3. variables in the
data vector are set to missing values (a period for numeric data and a blank for character variables).

The Data Step continues and iterates as many times as there are records to read. Then the Data Step is closed and SAS goes on to the next Data Step or Proc Step.

Info that you supply to create a SAS Data Set:
o input statements to tell SAS how to read the records
o definitions of variables and indicators of whether they are numeric or character
o specific location of the raw data (cards, file, tape, cassette, etc.)

TELLING SAS HOW TO READ THE DATA

Styles of Input

There are three basic input styles, as well as various format modifiers and pointer controls.

List Input: simply place variable names in the INPUT statement and at least one space between each value in the data lines. While it may not appear so, this type of input places restrictions on your data.

    input name $ test1 test2 test3;

See Program SASLec1input01.sas

When creating more than one new data set in a single Data Step, using the DROP= or KEEP= options in the DATA statement allows you to drop or keep variables in each of the new data sets. A DROP or KEEP statement, on the other hand, applies to all of the new data sets being created.

When you create permanent SAS data sets, they should have:
1. Data Set Labels
2. Variable Labels
3. Variable formats (either SAS-supplied or customized)
4. Variable Value Labels (if needed)
Additional labels and formats may be locally modified by procedures. These can be done:
a. in PROC DATASETS
b. during a DATA Step (can use the LABEL statement; can use the FORMAT statement)
c. in PROC FORMAT

Quick ways to find out what's in a SAS Data Set:
1. PROC CONTENTS
2. PROC PRINT LABELS
3. PROC FREQ TABLES

Assume a permanent SAS data set exists in the SAS data library called "save" (libref name) and the data set is called "travel".

How the SAS System Processes a DATA Step

During compilation, the SAS system reads the entire step through once, from beginning to end:
o Checks syntax, organization, and keyword spelling
o Builds the program
data vector, the input buffer (if needed), and the descriptor portion of the SAS data set (work.name).

Execution consists of a loop: by default, the SAS system goes through the Data Step loop once for each observation it processes.
o Most DATA Step statements tell the SAS system how to change or add to the input data, i.e., update a value that is currently entered.
o Some statements tell the SAS system more about the data set, such as storage space.
o Some statements delete observations that are not needed, output additional observations, or change the order in which program statements are carried out.

Adding Information to Observations with a Data Step

How do you add information to observations with a DATA Step?

See Program SASCreatetravel.sas

The basic method is to create a new variable in a DATA Step using an assignment statement:

    variable = expression;

o Adding the Same Kind of Information to All Observations

    data newair1;
       set save.travel;
       newcost = aircost + 10;
    run;
    proc print data=newair1;
       var country aircost newcost;
       title 'Increasing Air Fare by $10 in All Observations: Creating a New Variable';
    run;

See Program SASNewAirCost1travel.sas

Note: the VAR statement in PROC PRINT specifies the variables to be printed. Since the observation for India has a missing value for aircost, SAS assigns a missing value to newcost.

o Adding Information to Some Observations but Not Others

Basic use of the IF-THEN and ELSE statements:

    data bonusinfo;
       set save.travel;
       if vendor='Hispania' then remarks='For 10+ people';
       else if vendor='Mundial' then remarks='Yes';
    run;
    proc print data=bonusinfo;
       var country vendor remarks;
       title 'The SAS System Creates Variables for All Observations';
    run;

Note: remarks is a character variable, and because the first value that SAS encounters for remarks contains 14 characters, the SAS system sets aside 14 bytes of storage in each observation for remarks, whether the actual value is a 14-character phrase or simply a blank.

See Program SASBonusInfotravel.sas

o Changing Information without Adding Variables

Instead of creating a new variable, you can
modify an existing variable:

    data newair2;
       set save.travel;
       aircost = aircost + 10;
    run;
    proc print data=newair2;
       var country aircost;
       title 'Changing the Information in a Variable';
    run;

See Program SASNewAirCost2travel.sas

Using Variables Efficiently

Avoid creating variables that have many empty cells or missing values (an inefficient use of storage space).

    data tourinfo;                               <-- inefficient use of variables
       set save.travel;
       if vendor='Hispania' then bonuspts='For 10+ people';
       else if vendor='Mundial' then bonuspts='Yes';
       else if vendor='Major' then discount='For 30+ people';
    run;
    proc print data=tourinfo;
       var country vendor bonuspts discount;
       title 'Using Variables Inefficiently';
    run;

See Program SASTourinfotravel.sas

A BETTER WAY

    data newinfo;                                <-- efficient use of variable space
       set save.travel;
       if vendor='Hispania' then remarks='For 10+ people';
       else if vendor='Mundial' then remarks='Yes';
       else if vendor='Major' then remarks='30+ people contact for info';
    run;
    proc print data=newinfo;
       var country vendor remarks;
       title 'Using Variables Efficiently';
    run;

See Program SASTourinfoNewtravel.sas

Describing the Data Set Being Created

Sometimes you need to tell the SAS System the amount of storage to allow for a particular variable. Suppose we want the REMARKS variable to contain miscellaneous information about tours, and we want to allow 30 characters to store this information. By default, SAS allocates the length of the first value it encounters for the variable. To override this, use the LENGTH statement. This statement takes effect during the compile phase, not the execution phase; it is called a declarative statement.

Creating New Character Values

You can divide long values into pieces, combine existing values to make a longer value, read pieces, and so on.

Extracting a Portion of a Character Value

How do you isolate one piece of a character variable? For example, the value of OTHRGATE contains two cities: the city of arrival and the city of departure. How do you divide that value so that you can create separate variables for the two cities? The SCAN function gives you this
capability. The SCAN function selects a term from a character value. The term can be any character string, and the divider for terms, called the delimiter, can be any character or any list of characters. The general syntax is

    SCAN(source, n<, list-of-delimiters>)

o source can be any kind of character expression, including character variables, character constants, and so on.
o n is the number of the term to be selected from the source.
o list-of-delimiters gives one or more delimiters. If you specify more than one delimiter, then the SAS system uses any of them; if you omit the delimiter, the SAS system divides the words according to a default list of delimiters, including the blank and some special characters.

    arvgate = scan(othrgate, 1, ',');
    deptgate = scan(othrgate, 2, ',');

It's better to use nonblank delimiters, because a value such as Rio de Janeiro contains blanks.

See program SASScanLeft1depart2.sas

Aligning New Values

Recall that when you create new character variables using assignment statements, the SAS system maintains the existing alignment. It does not do any truncation or padding. To left-align the values, use the LEFT function:

    deptgate = scan(othrgate, 2, ',');
    deptgate = left(deptgate);
or
    deptgate = left(scan(othrgate, 2, ','));

SAS performs the innermost nested operation first. It uses that result as the argument of the next function.

See program SASScanLeft2depart2.sas

Assigning Lengths to Variables Created by the SCAN Function

The SCAN function causes the SAS system to assign a length of 200 bytes to the result variable in an assignment statement. Most of the other character functions cause the target variable to have the same length as the original value. In the program SASScanLeft2depart2.sas, ARVGATE has a length of 200 bytes, and DEPTGATE also has a length of 200 bytes because the LEFT function has the SCAN function as an argument.

See program SASScanLeft2bdepart2.sas

Set the lengths of these variables using the LENGTH statement to save on empty storage space.

See program SASScanLeft3depart2.sas

Combining Character Values Using Concatenation

Concatenation combines character values by
placing them one after the other. The operator is the double vertical bar (||). The length of the result is the sum of the lengths of the pieces, up to a maximum of 200 bytes.

See program SASConcatAllGatesdepart.sas

Removing Interior Blanks

ALLGATES contains many interior blanks. Why? When a character value is shorter than the length of the variable to which it belongs, the SAS system pads the value with trailing blanks. The length of USGATE is 13 bytes, but only San Francisco uses all of them. Therefore the other values contain blanks at the end, and the value for Brazil is entirely blank. The SAS system concatenates USGATE and OTHRGATE without change; therefore the middle of ALLGATES contains blanks for most observations. Of course, most of the values of OTHRGATE contain trailing blanks as well. If you concatenate COUNTRY after OTHRGATE, you will see these trailing blanks.

To remove these interior blanks, use the TRIM function. General syntax:

    TRIM(source)

The TRIM function produces a value without the trailing blanks in the source. However, other rules about trailing blanks in the SAS system still apply. If the trimmed result is shorter than the length of the variable to which the result is assigned, the SAS system pads the result with new blanks as it makes the assignment.

See program SASTrimAllGatesdepart.sas

Adding Additional Characters

Notice that the values in ALLGATES now run immediately together. In the observation for Brazil, the value of OTHRGATE comes at the beginning of the value. To make the result easier to read, concatenate a comma and a blank between the trimmed value of USGATE and the value of OTHRGATE. Use an IF-THEN statement to set ALLGATES equal to OTHRGATE in the case of Brazil.

See program SASCommaTrimAllGatesdepart.sas

Troubleshooting When New Variables Appear Truncated

What do you do when you have concatenated values and the result appears to have lost part of a value? Earlier we used the SCAN function to divide OTHRGATE into two new variables, ARVGATE and
DEPTGATE, with default lengths of 200 bytes.

See program SASScanLeft2depart2.sas

Now suppose we wish to reverse the division and put ARVGATE and DEPTGATE back together into a new variable called OTHRGATE2, using the concatenation operator. However, we forgot to use the TRIM function to remove the padded blanks.

See program SASTruncationdepart.sas

The value of OTHRGATE2 contains only the value of ARVGATE. It appears that DEPTGATE has been lost. This occurs because the sum of the lengths of ARVGATE, DEPTGATE, the comma, and the blank is 402 bytes, but the maximum length is 200 bytes. The SAS system performs the concatenation, but it truncates the result on the right in order to make it fit into the length of 200 bytes. Because ARVGATE alone occupies 200 bytes, only its value is left after the truncation.

There are two ways to correct this: use the TRIM function on ARVGATE and DEPTGATE before concatenation, or use the LENGTH statement to set the lengths of ARVGATE and DEPTGATE before the SCAN function is called.

See program SASTroubledepart.sas

Treating Numbers as Characters

The SAS system uses 8 bytes of storage for every numeric value in the Data Step; by default, it also uses 8 bytes of storage for numeric values in an output data set. However, a character value can contain as little as one character; in that case the SAS system uses one byte for the character variable, both in the program data vector and in the output data set. In addition, the SAS system treats the digits 0 through 9 in a character value like any other character. This can save space in the stored values.

Suppose we have an additional variable in this travel data set, namely HOTELRNK, that provides the ranking of the hotel rooms as a 2, 3, or 4. This variable can be read as a character variable.

See program SASHotels.sas

Writing Observations to Multiple SAS Data Sets

The SAS system allows you to create multiple SAS data sets in a single Data Step. The basic tool is the OUTPUT statement. The basic syntax is OUTPUT
<SAS-data-set-name>;

o If you use an OUTPUT statement without specifying a data set name, SAS will output that observation to all data sets named in the current DATA statement.
o If you want to write to a specific data set, then it must be named in the OUTPUT statement.
o Any data set named in an OUTPUT statement MUST be listed in the DATA statement.

Suppose you want to output two data sets: one for tours guided by Lucas and one for tours with any other guide.

See program SASOutput1arts.sas

Note that when you create more than one data set in a DATA statement, the last one listed will be the most recently created data set and hence will be the current work data set (in this case, othrtour). To use another data set in a procedure, you must use the DATA=SAS-data-set-name option.

Using an OUTPUT statement suppresses the automatic output of observations at the end of a DATA Step. Therefore, if you use an OUTPUT statement in a DATA Step, then you must program all output for that step with OUTPUT statements.

See program SASOutput2arts.sas

Understanding the OUTPUT Statement

An OUTPUT statement tells the SAS system to output the observation when the OUTPUT statement is processed, NOT at the end of the DATA Step. This can cause problems if you are not careful.

See program SASOutput3arts.sas

The problem with the example is that the assignment statement that computes the variable days is misplaced in the programming stream.

See program SASOutput4arts.sas

After the SAS system processes an OUTPUT statement, the observation remains in the program data vector, so you can still continue to program with that observation. You can even output it again to the same SAS data set or to a different one.

See program SASOutput5arts.sas

WORKING WITH GROUPED OR SORTED OBSERVATIONS

See program SASCreatetourtypes.sas

Sometimes we need to group certain observations together according to their values of a particular variable.
o How do you put observations into groups?
o How do we sort observations?

Working with Grouped Data

1. The observations must be in a SAS data
set, not an external file. (This does not mean that they have to be in a permanent SAS data set; they can be in a WORK data set.)
2. The variables that define the groups must appear in the BY statement.
3. All observations in a group must appear together in the data set, i.e., the observations must be sorted before you use the BY statement.

Grouping Observations with the SORT Procedure

The basic syntax is

   proc sort data=oldone out=newone;
      by variable;
   run;

If the values of the variable consist of only letters, the sorting is done alphabetically, in ascending order by default. If you omit the OUT=newone option, the sorted version of the data set is named oldone and becomes the current version, i.e., the original is replaced. The SORT procedure writes a message to the SAS log telling you that the sort was executed. See program SASSortltourtypessas.

Grouping BY More Than One Variable

The first variable is sorted; then, within each value of the first, the second is sorted; then, within those, the third is sorted; and so on. See program SASSort2tourtypessas.

Arranging in Descending Order

   proc sort data=oldone out=newone;
      by descending tourtype vendor landcost;
   run;

(DESCENDING applies only to the variable that immediately follows it, here tourtype.)

Finding the First or Last Observation in a Group

Suppose you want to create a data set containing the least expensive tour that features architecture and the least expensive tour that features scenery. How do you do this without first displaying the data set and seeing which observations to select?

o First sort by TOURTYPE and LANDCOST.
o When you use a BY statement, the SAS System automatically creates two additional hidden variables for each variable in the BY statement:
   1. FIRST.variable (where variable is your variable name)
   2. LAST.variable

Their values are either one or zero: one if true and zero if false. They exist in the program data vector and are available for use in program statements. However, FIRST. and LAST. variables are NOT written to the SAS output data sets. See program SASSort3tourtypessas. See program SASSort4tourtypessas.

Sorting Data with the SORT Procedure
Sometimes it is more important to work with the actual sorted observations instead of just the grouped data. Also, you may want to delete any duplicate observations if they exist.

Well, that last example with the errors didn't work. So we are going to need ways to better control our input options. List input is nice since you don't have to worry about the specific columns in which the data fields lie. Question: Can we combine the features of LIST input and FORMATTED input?

COLON FORMAT MODIFIER

Yes, using the colon (:) format modifier. You can use list input rather than column or formatted input to read data even when they contain values that require an informat to be read correctly. The colon (:) format modifier allows you to use list input for reading character data containing more than eight characters, or data requiring informats (numeric data that contain invalid characters). See Program SASLecO3inputOl sas.

   data jansales;
      input item : $10. amount : comma5.;
      lines;
   trucks 1,382
   jeeps 1,235
   landrovers 2,391
   lincoln navigators 1,014
   ;
   proc print data=jansales;
      title 'January Sales in Thousands';
   run;
   quit;

item has length 10, and amount requires an informat to strip the comma. Place a colon between the variable name and the informat. As in simple list input, at least one blank must separate each value from the next, and character values cannot contain embedded blanks.

What if you do have embedded blanks? Use the ampersand (&) format modifier. This allows single embedded blanks in a character field, AND at least two blanks must separate that value from the next data value in the record. After that, you go back to the regular list input requirement of at least one blank.

   input idno name & $18. team $ strtwght endwght;
   cards;
   1023 David Shaw  red 189 165
   ;

See Program SASLecO3input02sas. See Program SASLecO3inputO3sas.

With & and :, the default input style is LIST input. You may mix styles as long as you mix them in a way that appropriately describes the records in the raw data:

   input idno name $18. team $ 25-30
   strtwght endwght;

HOWEVER, when you mix styles of input in a single INPUT statement, you can get unexpected results if you don't understand where the input pointer is left after a value in the input buffer is read.

COLUMN POINTER CONTROLS (especially useful with mixed input and formatted-style input): @n and +n

Let us first consider the @n and the +n pointer controls.

@n in the INPUT statement moves the pointer to column n:

   input item $10. @17 amount comma5.;
   cards;
   trucks          1,382
   jeeps           1,235
   landrovers      2,391
   ;

(amount starts in column 17)

OR

   input item $10. +6 amount comma5.;
   cards;
   trucks          1,382
   jeeps           1,235
   landrovers      2,391
   ;

+n in the INPUT statement moves the pointer n columns to the right. The data must be aligned in columns.

When you have
   column input, you specify the columns;
   formatted input, you specify the length with informats and move the pointer with pointer controls.

Points to Remember

o Formatted input reads until it has read the number of columns indicated by the informat. It overrides the default of stopping when a blank is read.
o You can position the pointer with pointer controls (such as @n and +n).
o You can read data stored in nonstandard forms with informats.
o You retain all of the flexibility and features of column input.

Recall that when reading with list input, the pointer is left in the column immediately after the last read column. What about pointer location with COLUMN and FORMATTED input styles? It's the same: the pointer is left in the column immediately following the last read column.

   data one;
      input id $ 1-11 @5 dob mmddyy6. @11 gender 1.;
      lines;
   JARM0404501
   ;

(JARM0404501 is the Hogben code. After gender is read, the pointer moves to column 12.)

   input team $ 1-6 score 12-13;
   input team $6. +5 score 2.;

+5 adds 5 to the current pointer position. After the first read the pointer is in column 7, so 7 + 5 = 12.

So what about the three types of input?

o Column input uses column specifications to move the pointer to each data field.
o Formatted input can use informats and pointer controls to position the pointer.
o List input (the default) is a scanning method; the pointer stops in the NEXT column after the first blank is read.

Let's now consider the other types of pointer control.
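The two column pointer controls described above can be compared side by side. The following is a minimal sketch, not one of the course programs: the data set names sales_at and sales_plus are hypothetical, and the data lines assume that item sits in columns 1-10 and amount in columns 17-21.

```sas
/* Sketch only: data set names are hypothetical and the values are illustrative. */
data sales_at;
   /* item occupies columns 1-10; @17 moves the pointer to column 17 */
   input item $10. @17 amount comma5.;
   datalines;
trucks          1,382
jeeps           1,235
landrovers      2,391
;
run;

data sales_plus;
   /* after item $10. the pointer is in column 11; +6 advances it to column 17 */
   input item $10. +6 amount comma5.;
   datalines;
trucks          1,382
jeeps           1,235
landrovers      2,391
;
run;

/* Both steps read the same columns, so the two data sets should match */
proc compare base=sales_at compare=sales_plus;
run;
```

Because @17 is an absolute position and +6 is relative to wherever the previous read left the pointer, the second form would silently read the wrong columns if the width of the item informat were changed; that is one reason to prefer @n when the column layout is fixed.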
Using the #n Pointer Control to Read Multiple Records into the Buffer

You can read multiple records to create a single observation by pointing to a specific record in a set of input records with the #n line-pointer control. Using the #n control allows you to read the variables in any order, no matter how the variables are listed in the records or which record contains the variable. It is also useful if you want to skip lines of raw data.

   data one;
      input #2 team $ 1-6
            #1 name $ 6-23 idno 1-4
            #3 strtwght 1-3 endwght 5-7;
      cards;
   1023 David Shaw
   red
   189 195
   1049 Amelia Serrano
   yellow
   145 124
   1051 Tim Jones
   green
   190 178
   ;
   proc print;
      var idno name team strtwght endwght;
   run;
   quit;

See Program SASLec03input04sas.

Reading a Record Twice: the Trailing @

CONCATENATING SAS DATA SETS WITH PROC APPEND

When Variables Have Different Attributes

If a variable has different attributes in the BASE= data set than it does in the DATA= data set, the attributes in the BASE= data set prevail. In the cases of differing formats, informats, and labels, the concatenation succeeds. If the length of a variable is longer in the BASE= data set than in the DATA= data set, the concatenation succeeds. If the length of the variable is longer in the DATA= data set than in the BASE= data set, or if the same variable is a character variable in one and numeric in the other, PROC APPEND fails to concatenate the files unless you specify the FORCE option.

Using the FORCE option has the following consequences:

o The length specified in the BASE= data set prevails. Therefore, the SAS System truncates values from the DATA= data set to fit them into the length specified in the BASE= data set, or pads them with blanks.
o The type specified in the BASE= data set prevails. The procedure replaces values of the wrong type (all values for the variable in the DATA= data set) with missing values.

Choosing between PROC APPEND and the SET Statement

If two data sets contain the same variables and the variables possess the same attributes, the file that results from concatenating
them with PROC APPEND is the same as the file that results from concatenating them with the SET statement. However, PROC APPEND does this much faster, especially if the BASE= data set is large, because you avoid processing all of that data. The two methods differ enough when the variables or their attributes don't match that you must consider the differences in behavior before you decide which method to use.

Condition: Number of data sets you can concatenate
   SET statement: Any number.
   PROC APPEND:   Two.

Condition: Data sets contain different variables
   SET statement: Uses all variables. Assigns missing values where appropriate.
   PROC APPEND:   Uses all variables in the BASE= data set. Assigns missing values to observations from the DATA= data set where appropriate. Requires the FORCE option to concatenate if the DATA= data set contains variables that aren't in the BASE= data set. Can't include variables found only in the DATA= data set when concatenating the data sets.

Condition: Different formats, informats, or labels
   SET statement: Uses explicitly defined formats, informats, and labels over defaults. If two or more data sets explicitly define a format, informat, or label, uses the definition from the data set that you name first in the SET statement.
   PROC APPEND:   Uses formats, informats, and labels from the BASE= data set.

Condition: Different lengths
   SET statement: If the same variable has a different length in two or more data sets, uses the length from the data set you name first in the SET statement.
   PROC APPEND:   Requires the FORCE option if the length of a variable is longer in the DATA= data set. Truncates the values of the variable to match the length in the BASE= data set.

Condition: Different types
   SET statement: Doesn't concatenate.
   PROC APPEND:   Requires the FORCE option to concatenate. Uses the type from the BASE= data set and assigns missing values to the variable in observations from the DATA= data set.

Interleaving SAS Data Sets

Interleaving combines individually sorted SAS data sets into one sorted data set using SET and BY statements. The number of observations in the new data set is the sum of the numbers of observations in the original data sets.

o How
to use the BY statement
o How to sort data sets to prepare for interleaving
o How to use the SET and BY statements together to interleave observations

Using BY-Group Processing

The BY statement specifies the variable or variables by which you want to interleave the data sets. To understand this, we first review BY variables, BY values, and BY groups.

o BY variable is a variable named in the BY statement.
o BY value is the value of a BY variable.
o BY group is all observations with the same value for all BY variables. In discussions of interleaving, BY groups commonly span more than one data set. If you use more than one variable in a BY statement, a BY group is a group of observations with a unique combination of values for those variables.

When interleaving, the SAS System creates a new data set as follows:

1. Before executing the SET statement, the SAS System reads the descriptor portion of each data set you name in the SET statement and creates a program data vector that, by default, contains all the variables from all data sets as well as any variables created by the DATA step. It sets the value of each variable to missing.
2. It looks at the first BY group in each data set you name in the SET statement in order to determine which BY group should appear first in the new data set.
3. It copies all observations in that BY group from each data set containing observations in the BY group to the new data set. It copies from the data sets in the same order as they appear in the SET statement.
4. It looks at the next BY group in each data set to determine which BY group should appear next in the new data set.
5. It sets the value of each variable in the program data vector to missing.
6. It repeats steps 3 through 5 until it has copied all observations to the new data set.

Preparing to Interleave

Before you can interleave data sets, the data must be sorted by the same variable or variables you will use with the BY statement that accompanies your SET statement. We have two data sets
from two divisions, each containing the variables

o Project is a unique code that identifies the project
o Dept is the name of the department involved in the project
o Manager is the name of the manager of the dept
o Headcoun is the number of people working for the manager on the project

See program SASCreateinterleavranddsas. Note: Data set randd is already sorted by PROJECT.
See program SASCreateinterleavpubssas. Note: Data set pubs has variables in a different order and is not sorted by PROJECT.

We want to combine the data sets by PROJECT so that the new data set shows the resources that both divisions are devoting to each project. Both data sets must be sorted by PROJECT before you can interleave them.

Defining Variables

Defining variables is done in the INPUT statement, which sets the attributes: name, type, length, informat (if any), format (if any), and label (if any). The default type is numeric and the default length is 8 bytes. The length of numeric variables is not affected by informats or column specifications (numeric values are stored as 8-byte floating-point numbers, i.e., double precision); we will discuss this later.

Following the variable name with a $ changes the type of the variable to a character variable with a length of 8 bytes for list input. You can push this to 200 characters by using either

   input name $ 1-200;
or
   input name $200.;

We will talk about the format and label attributes at the end of the lecture.

[Figure: line 1 of the input records in the input buffer, with the pointer at column 1]

The INPUT statement brings data into the input buffer and gives you control over how to move the data from the buffer to the program data vector via the data pointer.

Things that affect your choice of input style (see Chapter 2 of your text):

o How is the data entered on the records (cards, etc.)?
o How would you like to enter the data?
o Do character variables have embedded blanks?
o Do numeric variables contain nonnumeric characters?
o Do the data contain time or date values that require special instructions?
o Is there more than one observation per input record?
o Is the data for a single observation spread
over several input records?

Input type: LIST

Restrictions:

o Data values must be separated by blanks.
o Data fields must be in the same order as the variables listed in the INPUT statement.
o Missing values are represented by a "." for numeric variables and must be present.
o Character variables can't have embedded blanks.
o The default length for character variables is 8 bytes; longer values are truncated when SAS writes to the program data vector. You can change this with the LENGTH statement and still use LIST input.
o Data must be in standard character or numeric form, i.e., no nonnumeric characters in numeric values, etc.

List input requires the fewest specifications in the INPUT statement. However, you cannot read all data sets with LIST input; therefore, we must learn other styles.

Pointer Movement

   input idno name $ team $ strtwght endwght;
   cards;
   1023 David Red 189 165
   ;

See Program SASLec2inputOl sas.

The pointer is left in the column immediately following the last read column.

Input type: COLUMN INPUT

Data values occupy the same fields within each record. You specify these column locations in the INPUT statement.

o Data must be in standard character or numeric form.
o When using column input, you aren't required to indicate missing values with a placeholder such as a period.
o The columns specified determine the length of character variables.
o Unlike LIST input, it reads data until it reaches the end of the last specified column, not until it reaches a blank.
o You can skip columns altogether, read columns in any order, and read only part of a value or reread a value.
o Character variables can be up to 200 characters in length.

See Program SASLec2input02sas.

Input type: FORMATTED INPUT

Data can be stored in special forms such as binary, packed decimal, times and dates, or values with embedded commas and monetary symbols, so we need ways to input this data. You must use SAS's preprogrammed informats. See Program SASLec2inputO3 sas.

Merging SAS Data Sets

Merging combines observations from two or more SAS data sets into a single observation in a new data set. The
new data set contains all variables from the original data sets unless you specify otherwise.

o One-to-one merging: you don't use a BY statement; observations are combined based on their positions in the input data sets.
o Match-merging: you use a BY statement to combine observations from the input data sets based on common BY groups.

The MERGE Statement

   merge SAS-data-set-list;
      (the list may contain any number of data sets)
   by variable-list;
      (a list of variables by which to merge; the data sets must previously have been sorted by these variables in the proper sequence)

ONE-TO-ONE MERGING

o Combines the first observation of all data sets into the first observation of the new data set, combines the second observation of all data sets into the second observation of the new data set, and so on.
o The number of observations in the new data set is equal to the number of observations in the largest data set you name in the MERGE statement.

improv.class:    name, year, major (n=6)
improv.timeslot: date, time, room (n=6)

See SASCreateimprovclasslsas. See SASCreateimprovtimeslotsas.

   data improv.schedule;
      merge improv.class improv.timeslot;

See SASScheduleaimprovsas.

This is straightforward one-to-one merging: 1st with 1st, 2nd with 2nd, etc. Note that the numbers of observations are equal and the variables in each data set are different.

DATA SETS WITH THE SAME VARIABLES

improv.class2: name, year, major (n=7)

See SASCreateimprovclass2sas.

Instead of scheduling conferences for one class, you want to schedule exercises for pairs of students, one student from each class. You want to create a data set in which each observation contains the name of one student from each class and the date, time, and location of the conference. You don't want the variables year and major in the new data set.

   data improv.exercise;
      merge improv.class (drop=year major)
            improv.class2 (drop=year major rename=(name=name2))
            improv.timeslot;

See SASSchedulebimprovsas.

o If you have variables with the same name and do not use the RENAME= option, the value from the last data set read is the value of that
variable in the new data set, even if it is a missing value.
o Once the SAS System has processed all observations from a data set, the program data vector and all subsequent observations in the new data set have missing values for the variables that are unique to that data set.

MATCH-MERGING

Merging with a BY statement allows you to match observations according to the values of the BY variables you specify. As always, the data sets must first be sorted if you are going to use a BY statement.

theater.company: name, age, sex (n=7)
theater.finance: name, ssn, salary (n=7)

See SASCreatetheatercompanyasas. See SASCreatetheaterfinancesas.

We want to merge these two data sets, matching on the variable name; both data sets have already been sorted by name.

   data theater.compfin;
      merge theater.company theater.finance;
      by name;

Take special note of observation 4. See SASCreatetheatercompfinsas.

DATA SETS WITH MULTIPLE OBSERVATIONS IN A BY GROUP

When neither data set has more than one observation in a BY group, merging is simple. What happens when a data set has more than one observation in a BY group?

theater.reptory: play, role, ssn (n=16)

See SASCreatetheaterreptorysas.

We want to replace SSN with the names of the individuals. Use the data set theater.finance, which contains both SSN and NAME.

   proc sort data=theater.finance;
      by ssn;
   proc sort data=theater.reptory;
      by ssn;

NOTE: theater.finance has n=7 observations and 7 BY groups; theater.reptory has n=16 observations and 7 BY groups. Among the 23 total observations there are 7 unique values of SSN.

   data theater.finrep;
      merge theater.finance theater.reptory;
      by ssn;

See SASCreatetheaterfinrepsas.

Note: the number of observations in theater.finrep is n=16.

The SAS System processes this merge as follows:

o creates the program data vector
o determines which BY group comes first
o reads and copies from the 1st data set to the program data vector
o reads and copies from the 2nd data set to the program data vector
o writes to the data set and retains the values in the program data vector, except that variables created either by you or by the SAS System in the DATA step are set to
missing
o looks for a 2nd observation in the current BY group, changes only the new information that is needed, and writes to the data set
o continues until it exhausts all observations in the BY group
o sets the program data vector to missing values
o continues processing all BY groups until all observations are processed

Controlling the variables in the new data set

   data newrep (drop=ssn);
      merge theater.finance (drop=salary) theater.reptory;
      by ssn;

See SASNewReptheatersas.

Note: (drop=ssn) as a data set option on the DATA statement retains ssn in the program data vector, so it is available for use, but drops it when the observation is written to the new data set. (drop=salary) as a data set option in the MERGE statement excludes salary from the program data vector, so it is not available for use in the DATA step.

Data sets with the same variables

Use the RENAME= data set option. If you don't use RENAME= and there exist variables with the same name in several data sets, the value in the new data set will be the last one read.

Data sets that lack a common variable

If you are going to match-merge, the data sets MUST all have a common variable.

theater.company: name, age, sex
theater.finance: name, ssn, salary
theater.reptory: play, role, ssn

   proc sort data=theater.finance;
      by name;
   proc sort data=theater.company;
      by name;
   data temp;
      merge theater.company theater.finance;
      by name;
   proc sort data=temp;
      by ssn;
   proc sort data=theater.reptory;
      by ssn;
   data theater.all;
      merge temp theater.reptory;
      by ssn;
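The match-merge behavior described above (one observation in one data set matched to several in the other, plus the two placements of DROP=) can be tried on a tiny example. This is a minimal sketch under stated assumptions: the WORK data sets finance and reptory and every data value below are made up for illustration and are not the course's theater data.

```sas
/* Sketch only: data set names and all values are hypothetical. */
data finance;
   /* the colon modifier lets list input read an 11-character ssn */
   input name $ ssn : $11. salary;
   datalines;
Kim 111-11-1111 30000
Lee 222-22-2222 32000
;
run;

data reptory;
   input play $ role $ ssn : $11.;
   datalines;
Hamlet Ophelia 111-11-1111
Macbeth Witch 111-11-1111
Hamlet Horatio 222-22-2222
;
run;

/* Both data sets must be sorted by the BY variable before merging */
proc sort data=finance;
   by ssn;
run;

proc sort data=reptory;
   by ssn;
run;

data newrep (drop=ssn);                   /* ssn is usable in the step, dropped on output */
   merge finance (drop=salary) reptory;   /* salary never enters the program data vector */
   by ssn;
run;

proc print data=newrep;
run;
```

With this input, newrep has three observations (name, play, role): the first BY group has one finance observation and two reptory observations, so the name is retained in the program data vector and written twice, which mirrors the n=16 result described in the notes.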