## 1. Introduction

This module illustrates how to collapse across variables using
retained variables.
First, let’s read in a sample dataset named **kids** which includes the variables
**famid** (family id) and **wt** (kids weight in
pounds).

DATA kids; LENGTH kidname $ 4 sex $ 1; INPUT famid kidname birth age wt sex ; CARDS; 1 Beth 1 9 60 f 1 Bob 2 6 40 m 1 Barb 3 3 20 f 2 Andy 1 8 80 m 2 Al 2 6 50 m 2 Ann 3 2 20 f 3 Pete 1 6 60 m 3 Pam 2 4 40 f 3 Phil 3 2 20 m ; RUN; PROC PRINT DATA=kids; RUN;

The output is shown below.

OBS KIDNAME SEX FAMID BIRTH AGE WT 1 Beth f 1 1 9 60 2 Bob m 1 2 6 40 3 Barb f 1 3 3 20 4 Andy m 2 1 8 80 5 Al m 2 2 6 50 6 Ann f 2 3 2 20 7 Pete m 3 1 6 60 8 Pam f 3 2 4 40 9 Phil m 3 3 2 20

## 2. Computing a running total with implicitly retained variables

There are times when a running total for a particular variable is desired.
For example, suppose that a variable representing the running total of the weights for each person in
the dataset needs to be computed. This can be done by using implicitly retained
variables in a data step. In the example below, the implicitly retained variable is
**sumwt**, where
the weight of the current observation (**wt**) is added to the last value of
**sumwt**. This
results in a new total for each observation. This is why it is called a running total, because the
value of **sumwt** at each observation is the sum of all the previous observations plus the current observation,
NOT the sum of ALL observations in the dataset. The value of **sumwt** at the last observation, however, IS the sum
for ALL observations in the dataset, because it is adding the sum of all the previous observations, plus its own value,
and hence is the sum across ALL observations in the dataset.

DATA sum ; SET kids ; sumwt + wt ; RUN; PROC PRINT DATA=sum; VAR famid wt sumwt ; RUN;

The output is shown below.

OBS FAMID WT SUMWT1 1 60 60 2 1 40 100 3 1 20 120 4 2 80 200 5 2 50 250 6 2 20 270 7 3 60 330 8 3 40 370 9 3 20 390

## 3. Computing a running count and average with implicitly retained variables

Implicitly retained variables can also be used to keep a running count. Hence, if one has the running total,
and the running count, the running mean then is simply the quotient of the two. Below is an example that computes
the running total as **sumwt**, the running count as the variable **cnt**, and the variable
**meanwt**, which is equal to
the **sumwt** divided by **cnt**. Note that **meanwt** is not
retained because it has an equals sign in its formula AND
it is not declared as a retained variable on a **RETAIN** statement. The variables
**sumwt** and **cnt** are
retained
(implicitly) because there is no equals sign, and the terms ‘sumwt + wt’ and ‘cnt + 1’ implicitly declare the variables
**sumwt** and **cnt** as retained variables, which will be used as counters at each observation.

DATA sum2 ; SET kids ; sumwt + wt ; cnt + 1 ; meanwt = sumwt / cnt ; RUN; PROC PRINT DATA=sum2 ; VAR famid wt sumwt cnt meanwt ; RUN;

The output is shown below.

OBS FAMID WT SUMWT CNT MEANWT1 1 60 60 1 60.0000 2 1 40 100 2 50.0000 3 1 20 120 3 40.0000 4 2 80 200 4 50.0000 5 2 50 250 5 50.0000 6 2 20 270 6 45.0000 7 3 60 330 7 47.1429 8 3 40 370 8 46.2500 9 3 20 390 9 43.3333

## 4. Computing a running total using first. variables

This section achieves the same goal as the above section, but uses a different approach. Here
the implicitly retained variables **sumwt** and **cnt** are initialized to zero for the first observation
within each family. This is what the **first.famid** variable is used for. If the current observation is
the first observation within a family, then **sumwt** and **cnt** are set to zero, and the observations that follow
within each family have **sumwt** and **cnt** defined by the terms ‘sumwt + wt’ and ‘cnt + 1’, each being a function
of the previous observations value for **sumwt** and **cnt**.
Note that the variable **first.famid** exists only
because **famid** was declared with the **BY** statement.

DATA sum3 ; SET kids ; BY famid ; * this resets the running total to 0 at the start of a family ; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt + wt ; cnt + 1 ; meanwt = sumwt / cnt ; RUN; PROC PRINT DATA=sum3 ; VAR famid wt sumwt cnt meanwt ; RUN;

The output is shown below.

OBS FAMID WT SUMWT CNT MEANWT1 1 60 60 1 60 2 1 40 100 2 50 3 1 20 120 3 40 4 2 80 80 1 80 5 2 50 130 2 65 6 2 20 150 3 50 7 3 60 60 1 60 8 3 40 100 2 50 9 3 20 120 3 40

## 5. Outputting observations using last. variables

This next section is almost identical to the above section, except that here ONLY the last observation
within each family is outputted to the dataset **sum4**. This is what the variable
**last.famid**
is used for. Note (again) that the variables **first.famid** and **last.famid** only exist because
**famid** was declared with the
**by**
statement. Lastly, only the variables **famid**, **sumwt**, **cnt** and
**meanwt** are kept in the dataset **sum4**.
This is achieved using the **keep** statement followed by the list of variables one wants to keep.

DATA sum4 ; SET kids ; BY famid ; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt + wt ; cnt + 1 ; meanwt = sumwt / cnt ; IF last.famid THEN DO; OUTPUT; END; KEEP famid sumwt cnt meanwt ; RUN; PROC PRINT DATA=sum4 ; RUN;

The output is shown below.

OBS FAMID SUMWT CNT MEANWT 1 1 120 3 40 2 2 150 3 50 3 3 120 3 40

## 6. Computing a running total with explicitly retained variables

In the above sections, all retained variables were implicitly declared with the terms ‘sumwt + wt’
and ‘cnt + 1’. retained variables can also be explicitly declared using the **retain** statement. In the example below
notice that the variables **sumwt** and **cnt** are listed in the
**retain** statement. Moreover, notice that the terms
‘sumwt + wt’ and ‘cnt + 1’ have been replaced with the equations ‘sumwt = sumwt + wt’ and ‘cnt = cnt + 1’. When variables
are declared as retained variables, explicitly, the counter equations must by given. However, when variables are
declared as retained variables implicitly, ONLY the terms on the right side of the counter equations are required.

DATA sum5 ; SET kids ; BY famid ; RETAIN sumwt cnt ; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt = sumwt + wt ; cnt = cnt + 1 ; meanwt = sumwt / cnt ; IF last.famid THEN OUTPUT; KEEP famid sumwt cnt meanwt ; RUN; PROC PRINT DATA=sum5 ; RUN;

The output is shown below.

OBS FAMID SUMWT CNT MEANWT 1 1 120 3 40 2 2 150 3 50 3 3 120 3 40

## 7. Sorting data before collapsing across observations

All of the previous sections have worked on the assumption that the data are sorted
**by** **famid**, which is true
of the sample dataset **kids** defined in section 1. However, if this is not the case, and the data are not sorted
**by ****famid**, then the results of a counter may be incorrect. Additionally, in some instances, you may need to
temporarily sort a dataset, but you may not want to sort the main data file. The example below sorts the dataset
**kids**
with **proc sort** and names the sorted output dataset **sortkids**. The dataset
**sum6** then uses the dataset **sortkids**
instead of the **kids** dataset.

PROC SORT DATA=kids OUT=sortkids ; BY famid ; RUN ; DATA sum6 ; SET sortkids ; RETAIN sumwt cnt ; BY famid ; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt = sumwt + wt ; cnt = cnt + 1 ; meanwt = sumwt / cnt ; IF last.famid THEN OUTPUT; KEEP famid sumwt cnt meanwt ; RUN; PROC PRINT DATA=sum6 ; RUN;

The output is shown below.

OBS FAMID SUMWT CNT MEANWT 1 1 120 3 40 2 2 150 3 50 3 3 120 3 40