Using PROC MEANS for detailed analysis of data

PROC MEANS, PROC SUMMARY and PROC FREQ in SAS are used to evaluate quantitative data and to create summary reports for analysis. With PROC MEANS, you can compute statistics such as the mean, standard deviation, minimum and maximum values, along with many other descriptive statistics.

Applications of PROC MEANS

Describing quantitative data for analysis.

Describing the means of numeric variables by group.

Identifying outliers and extreme values.

SYNTAX

PROC MEANS DATA=<dataset-name> <options> <statistics keywords>; <statements>

The most commonly used options in PROC MEANS are:

MAXDEC – Determines the number of decimal places to print in the output.

NOPRINT – Suppresses the output of descriptive statistics.

ALPHA – Sets the level for confidence limits (default is 0.05)

Statistic keywords are used to request statistical measures such as the mean, median and standard deviation. The full list of statistic keywords is available in the SAS documentation.
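As a minimal sketch of how the options and statistic keywords combine on the PROC statement, the step below uses the SASHELP.CLASS sample data. The choice of variables and of ALPHA=0.1 is only illustrative.

```sas
/* Summarize two numeric variables with selected statistic keywords. */
/* MAXDEC=2 limits the printed decimals.                              */
/* ALPHA=0.1 makes the CLM keyword report 90% confidence limits.      */
proc means data=sashelp.class maxdec=2 alpha=0.1
           n mean std min max clm;
   var height weight;
run;
```

If no statistic keywords are listed, PROC MEANS prints its default set (N, mean, standard deviation, minimum and maximum).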

Difference between PROC MEANS and PROC SUMMARY

The difference between PROC MEANS and PROC SUMMARY lies in their default printed output:

By default MEANS always creates a table to be printed. If you do not want a printed table you must explicitly turn it off (NOPRINT option).

On the other hand, the SUMMARY procedure never creates a printed table unless it is specifically requested (PRINT option).
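The contrast between the two defaults can be sketched as follows. The output dataset name means_out and the variable name MeanHeight are hypothetical and chosen only for illustration.

```sas
/* PROC MEANS prints by default; NOPRINT suppresses the table, */
/* so the results are available only in the output dataset.    */
proc means data=sashelp.class noprint;
   var height;
   output out=means_out mean=MeanHeight;
run;

/* PROC SUMMARY prints nothing by default; PRINT requests the table. */
proc summary data=sashelp.class print;
   var height;
run;
```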

Using the CLASS statement

The CLASS statement is used in both the MEANS and SUMMARY procedures. It can be written as a single statement or as a series of CLASS statements.

The order of variables in the CLASS statement determines the order of classification of variables.

Options can be applied in the CLASS statement by preceding the option with a slash.

The MISSING Option

By default, observations with a missing value for any classification variable are excluded from the analysis. When the MISSING option is used on the PROC statement, it is applied to all of the classification variables.

By using multiple CLASS statements along with the MISSING option on the CLASS statement, you can choose which classification variables use the MISSING option.

data class;
   set sashelp.class;
   if age < 14 then age=.;
run;

proc means data=class;
   class age / missing;
run;
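The selective use of MISSING across several CLASS statements might look like the sketch below. The dataset name class_miss and the artificially blanked SEX value are assumptions made only to have missing levels in both variables.

```sas
/* Create missing values in two classification variables. */
data class_miss;
   set sashelp.class;
   if age < 14 then age = .;
   if name = 'Alfred' then sex = ' ';
run;

proc means data=class_miss n mean;
   class age / missing;  /* missing ages form their own class level     */
   class sex;            /* rows with a missing sex are still excluded  */
   var height;
run;
```

Because MISSING appears only on the first CLASS statement, it applies to AGE but not to SEX.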

Using the Missing Option

The CLASS Statement

With the CLASS statement in PROC MEANS, you specify the variables whose values define the subgroups for analysis. The options below can be used with the CLASS statement. Several of them control the order in which the levels of the classification variables appear.

The ASCENDING/DESCENDING Options

These options set the direction in which the class levels are displayed, so you can reverse the order produced by the ORDER= option.

proc means data=class;
   class age / order=freq ascending;
run;

GROUPINTERNAL and EXCLUSIVE

These options determine how the formats associated with CLASS variables are used when forming groups.

When a classification variable is associated with a format, that format is used in the formation of the groups.

In the following example, the format weightclass is used to classify students (Underweight, Normal, Overweight, Obese) based on their BMI.

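proc format;
   /* -< excludes the upper bound, so the ranges leave no gaps */
   value weightClass
      low -< 18.5 = 'Underweight'
      18.5 -< 25  = 'Normal'
      25 -< 30    = 'Overweight'
      30 - high   = 'Obese';
run;

data class2;
   set sashelp.class;
   /* BMI from weight in pounds and height in inches */
   bmi = weight*703/(height**2);
   format bmi weightclass.;
run;

proc means data=class2 noprint;
   /* GROUPINTERNAL groups on the unformatted BMI values */
   class bmi / groupinternal;
   var height weight;
   output out=class_summary mean=MeanHT MeanWT;
run;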

Because GROUPINTERNAL is specified, the MEANS procedure forms the groups from the internal (unformatted) BMI values, so each distinct BMI value becomes its own level.

Without the GROUPINTERNAL option, the MEANS procedure uses the format to collapse the individual values of BMI into the levels of the formatted classification variable, and the output would look as below.

The output without the GROUPINTERNAL option

MLF

Multilevel formats allow you to have overlapping formatted levels.
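A minimal sketch of a multilabel format used with the MLF option is shown below. The format name agegrp and its ranges are hypothetical and chosen to fit the ages in SASHELP.CLASS.

```sas
/* The MULTILABEL option allows ranges to overlap. */
proc format;
   value agegrp (multilabel)
      11 - 12 = '11 to 12'
      13 - 14 = '13 to 14'
      15 - 16 = '15 to 16'
      11 - 16 = 'All ages';
run;

/* MLF tells PROC MEANS to use every label that matches a value, */
/* so each student is counted in an age band and in 'All ages'.  */
proc means data=sashelp.class n mean;
   class age / mlf;
   format age agegrp.;
   var height;
run;
```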

ORDER

With this option, you can control the order of the classification variable levels. The ORDER= option accepts the following values.

DATA – order is based on the order of the incoming data.

FORMATTED – values are formatted first and then ordered.

FREQ – order is based on the frequency of each class level.

INTERNAL – same as UNFORMATTED; levels are ordered by their internal (unformatted) values.

proc means data=class2;
   class age bmi / order=freq;
   var height weight;
run;

Difference between BY and CLASS Statements

The input dataset must be sorted by the BY variables, whereas the CLASS statement does not require sorted data. The BY statement provides summaries for the groups created by the combination of all BY variables, whereas the CLASS statement provides, in the output dataset, summarized values for each class variable separately and also for each possible combination of class variables, unless you use the NWAY option.

You can also use the CLASS and BY statements together to analyze the data by the levels of class variables within BY groups.
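Combining the two statements could look like the following sketch. The sorted dataset name class_sorted is an assumption.

```sas
/* BY processing requires the data to be sorted by the BY variable. */
proc sort data=sashelp.class out=class_sorted;
   by sex;
run;

/* One table per SEX value, with AGE levels summarized inside each. */
proc means data=class_sorted n mean;
   by sex;
   class age;
   var height weight;
run;
```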

OUTPUT options in PROC SUMMARY

The OUTPUT statement with the OUT= option is used to store the summary statistics in a SAS dataset. There are other options which you can use on the OUTPUT statement.

AUTONAME – Allows MEANS and SUMMARY to determine names for the generated variables.

AUTOLABEL – Allows MEANS and SUMMARY to apply a label to each generated variable.

LEVELS – Adds the _LEVEL_ variable to the summary dataset.

WAYS – Adds the _WAY_ variable to the summary dataset.
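The options above are placed after a slash on the OUTPUT statement. The sketch below assumes an output dataset named class_stats, which is not part of the original examples.

```sas
proc summary data=sashelp.class;
   class sex age;
   var height weight;
   /* MEAN= and STD= with no names: AUTONAME builds names such as */
   /* Height_Mean, and AUTOLABEL adds matching labels.             */
   output out=class_stats mean= std= / autoname autolabel levels ways;
run;

/* Inspect the generated variables, including _WAY_ and _LEVEL_. */
proc print data=class_stats;
run;
```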

IDENTIFYING EXTREME VALUES

To get a correct analysis, it is often necessary to exclude the observations containing the extreme lowest or extreme highest values.

These extreme values are automatically displayed in PROC UNIVARIATE but must be explicitly requested in the PROC MEANS and PROC SUMMARY procedures.

The MAX and MIN statistics show the extreme lowest or highest values, but they do not identify the observations which contain these extreme values.

MAXID and MINID

The two options MAXID and MINID, when used in the OUTPUT statement, identify the observations with the extreme values.

proc summary data=sashelp.class;
   class age;
   var height;
   output out=stats
          max=maxHeight
          maxid(height(name))=maxStudentName;
run;

In the above example, the output dataset contains the maximum height for each age group (the class variable), along with the name of the student who has that height.

Using the IDGROUP Option

The IDGROUP option displays a group of extreme values, unlike MAXID and MINID, which capture only a single extreme value.
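A sketch of IDGROUP that keeps the two tallest students in each age group is shown below. The output dataset name top2 is hypothetical, and the names after the equals sign are left blank so that SAS generates the output variable names itself by adding numeric suffixes.

```sas
proc summary data=sashelp.class;
   class age;
   var height;
   /* MAX(height) selects by largest height.              */
   /* OUT[2] keeps the top two observations per group.    */
   /* (height name) lists the variables copied from them. */
   output out=top2
          idgroup(max(height) out[2] (height name)=);
run;
```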

Using percentiles to create subsets

The percentile statistics are used to create search bounds for potential outlier boundaries. This can help us to find out if any observation falls outside of the defined percentile like 1% or 5%.

A percentile is the value below which a given percentage of the observations fall.

data outlier;
   /* Read the single row of percentile bounds once */
   set stats(keep=age_p1 age_p99);
   /* Then scan every observation against those bounds */
   do until(EOF);
      set sampledata end=EOF;
      if age_p1 ge age or age ge age_p99 then output outlier;
   end;
run;

options nobyline;

proc print data=outlier;
   by age_p1 age_p99;
run;

The 1st and 99th percentiles are calculated and saved in the dataset STATS. The IF condition checks whether age is at or below the 1st percentile, or at or above the 99th percentile.
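The step that builds STATS is not shown in the example. One way it might be written, assuming the input dataset is the same sampledata with a numeric variable age, is:

```sas
/* P1 and P99 are the statistic keywords for the 1st and 99th percentiles. */
/* The result is a single row holding both bounds.                         */
proc summary data=sampledata;
   var age;
   output out=stats p1=age_p1 p99=age_p99;
run;
```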

We can say that observations C, J and K lie outside the 1st to 99th percentile range of the data.

The automatic _TYPE_ variable

The _TYPE_ variable is automatically included in the summary dataset. It is a numeric variable which helps to track the level of summarization and to distinguish the groups of statistics.

proc summary data=sashelp.class;
   class age;
   output out=c1;
run;

With one class variable (Age), _TYPE_ takes the values 0 and 1. _TYPE_ is always 0 if the procedure does not have any CLASS variables.

The first observation has _TYPE_ = 0, which means that no classification is applied and the statistics are calculated across all observations.

The next observations, with _TYPE_ = 1, tell us that statistics such as the frequency have been calculated for each age.

As you add classification variables, the range of _TYPE_ values increases. With two class variables, for example CLASS SEX AGE, _TYPE_ ranges from 0 to 3.

For _TYPE_ = 2, the statistics are computed for each level of SEX (Male and Female). The classification from the AGE variable is not considered here.

For _TYPE_ = 3, the statistics are computed for each combination of SEX and AGE values. It tells us that both classification variables are used.
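The two-variable case described above can be reproduced with a sketch like the following. The output dataset name c2 is an assumption.

```sas
/* Two class variables produce _TYPE_ values 0 through 3:        */
/*   0 = overall, 1 = by AGE, 2 = by SEX, 3 = by SEX and AGE.    */
/* The rightmost CLASS variable maps to the lowest _TYPE_ value. */
proc summary data=sashelp.class;
   class sex age;
   output out=c2;
run;

proc print data=c2;
run;
```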

Using the NWAY option

You can use the NWAY option if you want statistics only for the full combination of all classification variables, rather than for each individual classification.

proc summary data=sashelp.class nway;
   class sex age;
   output out=c1;
run;

The NWAY option keeps only the observations with the highest TYPE value.
