Exploring the SET Statement in SAS

A majority of DATA steps use the SET statement. The primary function of the SET statement in SAS is to read observation from one or more SAS datasets.

Properly using the SET statement in SAS is one of the key techniques for improving the efficiency of SAS programs. The SET statement has options that can be used to control how the data are to be read.

SET statement options

END = It is used to detect the last observation from an incoming dataset.

= It is used to detect the last observation from an incoming dataset. KEY = It specifies an index to be used when reading SAS datasets.

= It specifies an index to be used when reading SAS datasets. INSDNAME = It is used to identify the current data source.

= It is used to identify the current data source. CUROBS = It creates and names a variable which contains the current observation number

= It creates and names a variable which contains the current observation number NOBS = Creates a temporary variable containing the total number of observations of the input dataset.

= Creates a temporary variable containing the total number of observations of the input dataset. POINT = Specifies a temporary variable whose value determines which observation is read

= Specifies a temporary variable whose value determines which observation is read UNIQUE = It is used with the KEY= option to read from the top of the index.

Using the NOBS= and POINT= options

The SET statement in SAS reads values in a sequential manner. i.e One observation after another. Using the POINT= option, you can perform a non-sequential read. The POINT= option tells SAS which observation to read next.

The POINT option allows direct access to the specified number of observations. For example, if you have to read only the 5th observation from a dataset.

The Point variable to be accessed must be declared with the observation number before the SET statement.

Since with the POINT, options data is being randomly accessed, you must tell the DATA step when to ‘OUTPUT’ and when to ‘STOP’.

The variable used for the POINT variable is a temporary variable and not added to the final dataset.

Since the POINT = option provides direct access to SAS data sets, it can not be used with a BY statement or the WHERE= dataset option.

The POINT= and NOBS-= are used together which returns the number of observations in the dataset.

data class; pt=5; set sashelp.class point=pt; put pt; output; stop; run;

You can download the example programs and the datasets from the link provided at the end of this article.

In the above example, the POINT= option tells SAS to read only the 5th observation from the input dataset.

In order to read more observations from the input data set, the POINT= option can be combined with a DO loop as in the below example.

data class; do Sliceobs=2, 3, 6, 10; set sashelp.class point=Sliceobs; output; end; stop; run;

The POINT= and NOBS= options can also be helpful when performing look-ahead or look-back of the data.

In the following example, we need to extract the student names who have scored less than 40 along with the previous semester and the next semester.

data class2(keep=name score semester cnt pt); array flag {100} $1 _temporary_; set datasets.class(keep=name score rename=(name=name1)); cnt+1; put cnt; if score lt 40 then do point=(cnt-1) to (cnt+1); put point=; pt=point; if 1 le point le noobs then do; set datasets.class point=point nobs=noobs; if name1=name and flag{point}=' ' then output class2; flag{point}='x'; end; end; run;

CNT will determine the range of value for the POINT variable.

If any student has a score value of less than 40 we need to print the previous observation and the next observation.

An array can be used to flag an observation once it has been used.

Using the INDSNAME= Option

With the INDSNAME option in the set statement in SAS, you can store the name of datasets from which the current observation is read.

data cars; set sashelp.cars indsname=dataset; put dataset=; run;

The dataset name is created in a temporary variable dataset. It contains a two-level name – SASHELP.CARS.

The default length of the variable created by INDSNAME is 41.

END= option

The END= option can be used to create a numeric (0 or 1) temporary variable that indicates that the last record has been read.

For the last record, the temporary variable would have a value of 1.

data ageTotal; set sashelp.class end=eof; end=eof; total+age; put eof=; if eof then put total=; run;

DATA Step with Two or more SET Statements

The DATA step may contain multiple SET statements. Multiple SET statements can give you more flexibility over the process of reading the data.

By using multiple SET statements, you can vertically stack datasets which will combine observations from two or more data sets into a single observation in the new data set.

Each of the data set is read sequentially from the first observation in the first data set to the last observation in the last data set. All the variables from both datasets are added to the Program Data Vector.

The corresponding values will be set to missing if all the variables are not present in all the datasets.

data data1; input name $ gender $; datalines; A F J M T M ; run; data data2; input name $ age; datalines; B 36 M 35 ; run; data combined; set data1 data2; run;

The output of Multiple SET Statements

Using the CUROBS

The CUROBS options create a variable which writes the current observation number in the SAS dataset.

data Female; set sashelp.class curobs=cobs; where sex='F'; obs=cobs; run;

KEY Option

The KEY= in SAS set statement allows us to access observation which is based on the value of an index variable or a key. The KEY option cannot be used with the POINT option.

This option is helpful if we have to perform a table lookup of values from another dataset. The KEY= option on the SET statement identifies an index that is to be used.

data combine; set data.ProductCat; set data.ProductDetails key=id/unique; if _iorc_ > 0 then price=' '; run;

Product Price Product Category Key Lookup

In order to use the key loop up, the data set must be indexed. The KEY option on statement identifies the index that is to be used.

To create an index, you can write the below proc SQL step.

proc sql; create unique index id on data.ProductPrice; quit; proc sql; create index id on data.ProductCat; quit;

_IORC_ is the automatic variable which stands for INPUT/OUTPUT Return Code. If an index value is found _IORC_ is set to 0.

KEYRESET

With the KEYRESET option in the SET statement in SAS, you can control whether a KEY= search should begin at the top of the index for the data set that is being read.

The index lookup begins at the top when the value of KEYRESET is 1. If the value of the KEYRESET variable is 0, the index lookup is not reset and the lookup continues where the prior lookup ended.

data a(index=(i)); do i=1, 2, 3, 3, 3, 4, 5; j=put(int(ranuni(4)*40), 6.)|| byte(int(65+26*ranuni(0))); output; end; run; data b; input i; datalines; 3 3 ; run; data _null_; set b; reset=2; set a key=i keyreset=reset; put i- j=; run;

When the KEYRESET is set to 0

i=3 j=27W i=3 j=6D

When the KEYRESET is set to 1

i=3 j=27W i=3 j=27W

IN= Option

The IN = data set option is used with multiple data sets to identify which data set contributed an observation.

DATA newdata; SET dataset1 ( in = a ) dataset2 ( in = b ) ; IF a THEN -------; ELSE IF b THEN ----------- ; RUN ;

A separate IN = variable can be specified for each data set defined with the SET statement. The

variable named by the IN = option has a value of 1 for every observation that originated from the corresponding datasets and the datasets are read one after the other.

The IN = option only exist for the duration of the DATA step for the variables defined. These variables and are not added to the output dataset. So, DROP or KEEP options are not applied to these variables.

With the WHERE = dataset option, SAS selects only those observations that meet the condition specified. It functions identically to the WHERE statement.

The WHERE = data set option is more efficient than the WHERE statement as only the matched observations are loaded into the Program Data Vector.

With the WHERE statement, all observations are read from the input data set and non-matched observations are discarded.

DATA newdata ; SET dataset1 ( where = ( condition ) ) dataset2 ( where = ( condition ) ) ; RUN ;

If both a WHERE = and a WHERE statement is used in the same data step, the WHERE statement is ignored for those datasets that have a WHERE = condition defined.

The WHERE = data set option can not be used with the POINT =, FIRSTOBS = , or OBS = data set options.

Download Exercise Files

Rate this post

0

Shares