Introduction to AWK

AWK is a powerful but simple scripting language designed for text extraction and processing. Thanks to its versatility and ease of use, it is a widely used tool, and entire books have been written about it.

Why use AWK?

Although it's possible to translate any AWK script into C for faster processing, AWK is often easier to write and debug. Thus, even though a C program may execute faster, AWK is often preferable for text-processing tasks due to its simplicity and ease of use.

Origins of the name

The name AWK comes from the initials of its authors, Alfred Aho, Peter Weinberger and Brian Kernighan, who developed it at Bell Labs in the 1970s. AWK is also pronounced like the name of its mascot, the auk bird.

The auk bird, which was also on the cover of the first AWK manual.

AWK provides command line users with a variety of functions. You may use a single AWK script to process several files within a pipeline or apply the commands to several files at once. Here are just a few of the features that AWK provides:

Text formatting

Formatted text outputs

Mathematical and string operations

Field extraction and rearrangement

In short, AWK can be of immense help with anything that has to do with text processing or data-table manipulation.

Variations of AWK

There have been a variety of AWK implementations as users started expanding on the language.

awk: Original awk.

nawk: New and improved awk. Used by OS X.

gawk: GNU awk. Ships with most Linux distributions.

mawk: Very fast AWK implementation based on a bytecode interpreter.

Installing gawk

For this tutorial, we'll be sticking to gawk, which is the GNU version of AWK (GNU is simply a suite of open-source utilities - learn about the history of UNIX and GNU). To install gawk on a Debian-based Linux platform, use the apt-get package manager.

$ sudo apt-get update
$ sudo apt-get install gawk

For RPM-based Linux distributions, use yum.

$ sudo yum install gawk

On Mac OS X, use Homebrew, a package manager for OS X.

$ brew install gawk

Commenting in AWK

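One last thing to mention before moving on - to comment in AWK, use the hash symbol ( # ). Everything from the # to the end of the line is ignored. Two forward slashes ( // ) do not start a comment in AWK; they are parsed as an empty regular expression.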

Notes before getting started

AWK is a difficult programming language to learn, as most of its concepts, syntax and notations are intertwined with each other. Thus, learning one piece of AWK involves having to know some other parts. Due to this, some of the lessons in this tutorial may introduce concepts that will be covered in more detail in a future lesson. So if you see something new and not explained in detail, try your best to understand it and sit tight until we cover it more formally later. Now let's get started!

Awk's Workflow BEGIN, BODY, and END blocks

Let's begin by looking at the step-by-step methodology of how AWK works. AWK starts by executing the BEGIN block. It then enters the BODY block, reading in some record, executing some command, and repeating until the file is exhausted. Finally, the END block is then executed.

1) Execute the awk commands in the BEGIN block.

2) Read a line from the input stream (a file or standard input) and store it in memory.

3) Execute the awk commands on that line.

4) Repeat steps 2 and 3 until the end of the file.

5) Execute the awk commands in the END block.

Awk runs in three blocks - BEGIN, BODY, and END. The BEGIN and END blocks provide the startup and cleanup actions of our program. The BODY block consists of pattern & action pairings.

Awk's workflow: BEGIN, BODY, END. Notice that the BODY block is not preceded by any keyword.

The BEGIN block executes just once, before any input is read, acting to initialize the program. Here, we can set variables such as FS, RS and ORS, which otherwise keep their default values. Additionally, we may print a header for a data table if one does not already exist.

BEGIN {
    # initialize variables and other commands
}

BODY Block

The BODY block runs on every input line that matches an optional pattern. Note that you don't need any keywords before the opening curly brace for the BODY block.

/pattern/ { actions }

END block

The END block is the last block of code to be executed, once the input is exhausted. It is often used to produce summary reports. Precede the block with the END keyword.

END {
    # cleanup
}

The BEGIN and END blocks can occur in any order within the awk program, but convention holds that BEGIN should come first and END should be last. If there are multiple BEGIN or END blocks, they are executed in the order in which they appear in the AWK file.
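The workflow described above can be observed with a small experiment. This is only a minimal sketch: it assumes a POSIX-compatible awk is on the PATH (gawk behaves the same here), and the two input records and the message texts are made up for illustration.

```shell
# Feed two records to a program with two BEGIN blocks, a body, and an END block.
workflow=$(printf 'alpha\nbeta\n' | awk '
BEGIN { print "first BEGIN block" }
BEGIN { print "second BEGIN block" }
      { print "BODY: record " NR " is " $0 }
END   { print "END: " NR " records read" }
')
printf '%s\n' "$workflow"
```

Both BEGIN blocks run once, in file order, before any input is read; the body runs once per record; the END block runs once after the last record.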

Example

Let's now look at an example AWK script. The syntax and variables have not been covered yet, but we wanted to give you a brief gist of what a basic AWK script would look like.

Assume we want to perform two tasks on the grades.txt data file below: 1) we want to create a header, and 2) we want to find out how many students received a B in the class.

# grades.txt

Gil Conrad 98 93 94 A
Vern Wynne 85 78 93 B
Ingram Dannie 84 85 94 B+
Wright Morty 75 76 79 C+
Johnnie Adair 78 94 87 B

Now we can write our awk script test.awk.

# test.awk
BEGIN {
    # Print the header out before starting anything
    printf "FName\tLName\tExam1\tExam2\tFinal\tGrade\n";
    # Initialize any variables
    n = 0;
}
{
    # Print each line (called "record")
    print $0
    # If the sixth column (called "field") is a B, then increment n
    if ($6 == "B") { ++n }
}
END {
    # Wrap things up and print out summary variables
    print "Number of students with a B in the class = " n;
}

To apply our awk script via the command line, use the -f option.

$ gawk -f test.awk grades.txt

FName LName Exam1 Exam2 Final Grade
Gil Conrad 98 93 94 A
Vern Wynne 85 78 93 B
Ingram Dannie 84 85 94 B+
Wright Morty 75 76 79 C+
Johnnie Adair 78 94 87 B
Number of students with a B in the class = 2

Now let's move on to Records and Fields, one of the main backbones of AWK.

Records and Fields RS, RT, ORS, FS, OFS, $n

The backbone of AWK's programming model consists of two pieces: 1) records & fields, along with 2) patterns & actions. Let's look at the first core component here, then move on to patterns & actions in the next lesson.

What are records and fields?

AWK views each input stream as a collection of records. Records can be thought of as individual lines, which are then divided into fields (the individual data cells). Take a look at the figure below, which displays the grades.txt file.

Our example grades.txt file. Each row is a record, and each data cell is a field.

Record separators (RS & RT)

To specify the character that separates records, we use the built-in RS variable. In the original AWK implementation, the RS variable had to be a single literal character such as the newline or an empty string. In other implementations such as gawk, RS may be a regular expression.

In the case we have a regular expression, RS will hold the literal regex, while RT will hold the matching string.

$ echo firstRecord 111111 secondRecord 222222 thirdRecord 333333 lastRecord |
> gawk 'BEGIN { RS = "([[:digit:]]+)" }
>       { print "RS = " RS " and RT = " RT }'
RS = ([[:digit:]]+) and RT = 111111
RS = ([[:digit:]]+) and RT = 222222
RS = ([[:digit:]]+) and RT = 333333
RS = ([[:digit:]]+) and RT = 

This code snippet sets the RS variable to a regular expression matching one or more digits. Notice how RS holds the literal regex, while RT holds the text that actually matched it. No separator follows the last record, so RT is empty there.

Output Record Separator (ORS)

The Output Record Separator (ORS) specifies what is printed after each record. The default is a newline character.

In this example, we read and print out the current record in our buffer (denoted by $0 ), followed by a plus ( + ) symbol.

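$ echo 'hello; nihao; hola; anyonghasaeyo' |
> gawk 'BEGIN { RS = ";"; ORS = " +"}
>       { print $0 }'
hello + nihao + hola + anyonghasaeyo
 +

The trailing newline added by echo belongs to the last record, which is why the final " +" lands on a line of its own.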

Field separators (FS)

Fields are separated by the FS variable. The default value is a single space, which AWK treats specially: fields are separated by runs of one or more whitespace characters, and leading and trailing whitespace on the line is ignored. Thus, the following two records look the same to AWK.

Joe John Johanna
   Joe    John     Johanna

To specify a literal single space, enclose the space in brackets, such that FS = "[ ]"

The field separator may be set with the -F option on the command line, or by assigning FS in the BEGIN block.

$ echo 'Joe John Johanna' |
> gawk -F' ' '{ print NF ":" $0 }'
3:Joe John Johanna

# Same command as above but using the BEGIN block
$ echo 'Joe John Johanna' |
> gawk 'BEGIN { FS=" " }
>       { print NF ":" $0 }'
3:Joe John Johanna

# Changing the FS character
$ echo '   Joe   John   Johanna   ' |
> gawk -F'[ ]' '{ print NF ":" $0 }'
13:   Joe   John   Johanna   

Here we can see that the -F option sets the FS variable straight from the command line. We'll formally learn how to use AWK via the command line in a future lesson.

Output Field Separator (OFS)

The Output Field Separator, or OFS, stores the string that separates the fields on output. By default, it is a single space.

$ echo 'John Mary; Jacob Teresa; Bob Claire' |
> gawk 'BEGIN { OFS=" loves "; RS=";" }
>       { print $1, $2 }'
John loves Mary
Jacob loves Teresa
Bob loves Claire

Field access ($n)

You may have noticed the use of the $0 variable in the previous examples. This variable stores the current record. To access an individual field, we simply use a $ followed by the field number (e.g. $1 for the first field, $2 for the second, and so on).

$ echo 'uno dos tres' | gawk -F' ' '{ print "The second field is: " $2; print "The entire record is: " $0 }'
The second field is: dos
The entire record is: uno dos tres

Note that field numbers start at 1 and not 0, unlike indices in most programming languages, which are zero-based.

Field to integer conversion

Field numbers are converted to integer values as needed. Thus, $(2*2) , $(8/2) , $"4.41" and $4 all refer to the fourth field. Note that negative field numbers have no meaning and are treated as an error.
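A quick way to see that the expression after the $ is evaluated before the field is looked up is shown below. It is a minimal sketch that assumes a POSIX-compatible awk (gawk works the same way); the five-letter input record is made up, and NF is the built-in count of fields in the current record.

```shell
# Each expression after $ is evaluated first; its value selects the field.
fields=$(echo 'a b c d e' | awk '{ print $4, $(2*2), $(8/2), $(NF-1), $NF }')
printf '%s\n' "$fields"
```

The first four expressions all evaluate to 4 and therefore print the fourth field, while $NF prints the last one.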

Patterns and Actions print

In the previous lesson, we looked at records & fields, and saw how AWK is able to parse and manipulate them. While learning this, you may have noticed that each action performed is enclosed within braces, which then applies to all records.

Patterns

But what if we just wanted to apply actions to lines that match a specific pattern? We can do this by preceding the action with a regular expression pattern.

/pattern/ { action }    # Apply the action only to records that match the pattern
/pattern/               # Print all lines matching /pattern/
{ action }              # Apply the action to all lines

This allows us to select which lines to apply our actions to. If you do not specify a pattern, then the action is applied to all lines. On the other hand, if there is no action, then all lines with the specified pattern are printed out.
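The three forms can be tried side by side. The following is a minimal sketch, assuming a POSIX-compatible awk; the three sample records follow the layout of grades.txt, and the shell variable names are only illustrative.

```shell
# Sample records in the layout of grades.txt.
grades='Gil Conrad 98 93 94 A
Vern Wynne 85 78 93 B
Wright Morty 75 76 79 C+'

# Pattern and action: print the last name of the records containing "Wynne".
with_both=$(printf '%s\n' "$grades" | awk '/Wynne/ { print $2 }')

# Pattern only: the default action prints the whole matching record.
pattern_only=$(printf '%s\n' "$grades" | awk '/Morty/')

# Action only: the action runs on every record.
action_only=$(printf '%s\n' "$grades" | awk '{ print $1 }')

printf '%s\n' "$with_both" "$pattern_only" "$action_only"
```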

Actions

Actions tell AWK how to process a specific record or part of its fields. Let's look at the print action, as it's the most basic thing you can do with a record.

Printing

When print is called, it prints the record followed by the output record separator ( ORS ), the default of which is a newline character. In the following example, all records will be printed. We have already seen how to refer to the entire record (with $0 ) and to a specific field n (with $n ) using the dollar symbol.

$ echo ' uno dos tres ' | gawk -F' ' '{ print $0 }'
 uno dos tres 

# Default is to print the record
$ echo ' uno dos tres ' | gawk -F' ' '{ print }'
 uno dos tres 

# Print a specific field only
$ echo ' uno dos tres ' | gawk -F' ' '{ print $2 }'
dos

Printing by pattern

Now we can specify a pattern and print only the records that match it. For this example, we'll use grades.txt, which is a file containing grade reports of five students.

$ gawk '$6 ~ /B/ { print $0 }' grades.txt
Vern Wynne 85 78 93 B
Ingram Dannie 84 85 94 B+
Johnnie Adair 78 94 87 B

# Print last name, first name for students with a B
$ awk '/B/ {print $2 "\t" $1}' grades.txt
Wynne   Vern
Dannie  Ingram
Adair   Johnnie

Here, the ~ operator selects the records whose sixth field matches /B/. In the second command, the bare /B/ pattern is matched against the entire record.

That's all for now...

This was just part 1 of our Awk series.

Calling GAWK from the command line

Let's take some time to formally learn how to call gawk from the command line. The formal listing of the gawk command and its parameters is:

The -F option sets the field separator (FS).

The -v var=value option assigns a value to a variable before the program starts executing. Such variables are already available in the BEGIN block.

Arguments that come after the -- are no longer treated as gawk options; they are left for the awk program to handle.

As a one-liner

We have already seen how to use one-liners to apply awk statements to one or more files.
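As a minimal sketch of a one-liner applied to more than one file, the snippet below creates two throwaway files and runs a single program over both. The file names class1.txt and class2.txt and their contents are hypothetical; a POSIX-compatible awk and mktemp are assumed.

```shell
# Create two small sample files in a temporary directory (hypothetical names).
tmpdir=$(mktemp -d)
printf 'Gil Conrad 98\nVern Wynne 85\n' > "$tmpdir/class1.txt"
printf 'Johnnie Adair 78\n' > "$tmpdir/class2.txt"

# One program, several files: FNR restarts for each file, NR keeps counting.
oneliner=$(cd "$tmpdir" && awk '{ print FILENAME, FNR, NR, $2 }' class1.txt class2.txt)
printf '%s\n' "$oneliner"
rm -r "$tmpdir"
```

The files are processed in the order given, as if they were one continuous input stream.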

Pipelining

You may also choose to incorporate awk within a pipeline. When no input file is given, awk reads from standard input, so simply place the awk command after the pipe, with the program as its first argument.
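A minimal sketch of awk inside a pipeline, assuming a POSIX-compatible awk and the standard sort utility; the three input lines are made up for illustration.

```shell
# awk reads standard input when no file is named, so it can consume sort's output.
piped=$(printf 'pear 3\napple 5\nfig 1\n' | sort | awk '{ print NR ": " $1 }')
printf '%s\n' "$piped"
```

Here sort orders the lines alphabetically and awk numbers them, printing only the first field of each.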

Applying an awk file

It can be a hassle to type out every awk statement on the command line. Instead, we can save our awk commands in a file and run them via the -f option.
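The following sketch saves a short program to a file and runs it with -f. The file name count.awk, the counter n, and the sample records are hypothetical; a POSIX-compatible awk and mktemp are assumed.

```shell
tmpdir=$(mktemp -d)

# Save the program in a file (count.awk is a hypothetical name).
printf '%s\n' \
  '# Count the records whose last field is exactly "B".' \
  '$NF == "B" { n++ }' \
  'END { print "B grades: " (n + 0) }' > "$tmpdir/count.awk"

# Run the saved program on piped input with the -f option.
from_file=$(printf 'Vern Wynne B\nGil Conrad A\nJohnnie Adair B\n' | awk -f "$tmpdir/count.awk")
printf '%s\n' "$from_file"
rm -r "$tmpdir"
```

Adding 0 to n makes the program print 0 rather than an empty string when no record matches.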

Assigning Options

We may also allow the user to declare some variable from the command line. To do so, use the -v option.
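As a minimal sketch of the -v option, the snippet below passes a cutoff score into the program. The variable name threshold and the sample scores are hypothetical; a POSIX-compatible awk is assumed.

```shell
# threshold is a hypothetical variable; -v assigns it before the BEGIN block runs.
above=$(printf 'Gil 98\nWright 75\nVern 85\n' | awk -v threshold=80 '
BEGIN { print "threshold is " threshold }
$2 > threshold { print $1 }
')
printf '%s\n' "$above"
```

Because the assignment happens before execution starts, the value is already visible inside BEGIN.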

Separate parameters per file

Say you have several files, but each file has its own awk variables that apply to it. We can handle this in one line by placing var=value assignments between the file names.
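One way to sketch this is to give each file its own field separator. The file names commas.txt and colons.txt and their contents are hypothetical; a POSIX-compatible awk and mktemp are assumed.

```shell
tmpdir=$(mktemp -d)
printf 'a,b,c\n' > "$tmpdir/commas.txt"
printf 'x:y:z\n' > "$tmpdir/colons.txt"

# An assignment placed before a file name takes effect when awk reaches that argument.
per_file=$(awk '{ print $2 }' FS=, "$tmpdir/commas.txt" FS=: "$tmpdir/colons.txt")
printf '%s\n' "$per_file"
rm -r "$tmpdir"
```

Unlike -v, these assignments are processed only when awk gets to them in the argument list, so they are not yet set in the BEGIN block.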

Modes

gawk also comes with the --profile option which can be used to gather profiling statistics from the execution of the program.

Furthermore, there is a debug mode, which is indicated by the --debug option.

To revert gawk back to the traditional (awk) mode, use the --traditional option.

Further reading

To obtain a further and more complete list of awk options, use the --help parameter, or the man command.

Predefined Variables in Awk

Awk contains a slew of helpful predefined variables. Let's look at them and how we can incorporate them into our awk scripts.

Command Line Arguments

Awk allows you to access any command line arguments that the user may have passed in via the command line. ARGC gives the argument count, while ARGV provides an array of argument values.

It is possible to modify ARGV and ARGC. When deleting from ARGV, be sure to decrement ARGC.

Only the arguments that follow the options and the program text (for instance, Three=3, file1, Four=4, file2 and file3) are available in ARGV.

$ awk 'BEGIN{ for (k = 0; k < ARGC; k++) print "ARGV[" k "] = [" ARGV[k] "]" }' a b c
ARGV[0] = [awk]
ARGV[1] = [a]
ARGV[2] = [b]
ARGV[3] = [c]

awk stops interpreting arguments as options as soon as it has seen either an argument containing the program text or the special -- option. Any following arguments that look like options should be handled by your program.

#!/bin/sh -
AWK=${AWK:-nawk}
AWKPROG='
    long program here
'
$AWK "$AWKPROG" "$@"

Environment variables

You may access the user's environment variables with the built-in array ENVIRON.

$ awk 'BEGIN { print ENVIRON["HOME"]; print ENVIRON["USER"] }'

You can add, delete and modify entries as needed. POSIX requires that subprocesses inherit the environment in effect when awk is started.

Scalar variables

The following are scalar variables that hold a single value.

FILENAME: Name of the current input file.

FNR: Record number in the current input file.

FS: Field separator (default: " ").

NF: Number of fields in the current record.

NR: Record number in the job.

OFS: Output field separator (default: " ").

ORS: Output record separator (default: "\n").

RS: Input record separator (default: "\n"); may be a regular expression in gawk and mawk only.

Printing records of a specific length

(FNR == 3) && (FILENAME ~ /[.][ch]$/)    # select record 3 in C source files
$1 ~ /jones/                             # select records whose first field contains "jones"
/[Xx][Mm][Ll]/                           # select records containing "XML", ignoring lettercase
$0 ~ /[Xx][Mm][Ll]/                      # same as above

There are built-in functions that we will see later. This example uses the built-in length function.
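A minimal sketch of selecting records by length, assuming a POSIX-compatible awk. The cutoff of 5 characters and the sample lines are made up for illustration.

```shell
# Print only the records longer than 5 characters; length($0) measures the whole record.
long_lines=$(printf 'short\nmuch longer line\nok\nexactly\n' | awk 'length($0) > 5')
printf '%s\n' "$long_lines"
```

Since the program consists of a pattern with no action, every matching record is printed as-is.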

Pattern Expressions

Each of the following programs prints every record:

1
NR > 0 { print }
1 { print }
{ print }
{ print $0 }

$ echo 'one two three four' | awk '{ print $1, $2, $3 }'
one two three

$ echo 'one two three four' | awk '{ OFS="..."; print $1, $2, $3 }'
one...two...three

$ echo 'one two three four' | awk '{ OFS="\n"; print $1, $2, $3 }'
one
two
three

$ echo 'one two three four' | awk '{ OFS="\n"; print $0 }'
one two three four

$ echo 'one two three four' | awk '{ OFS="\n"; $1 = $1; print $0 }'
one
two
three
four

Reassigning a field forces reassembly of the record with the new output field separator.

You may also use built-in awk variables within your pattern. Here is a short list of expressions you may use. We will go through built-in variables in a future lesson, but here's a quick sneak-peek.

NF == 0: Select empty records.

NF > 4: Select records containing more than 4 fields.

NF < 4: Select records that contain fewer than 4 fields.

$1 ~ /Ingram/: Select records whose first field contains "Ingram".

Pattern range expressions

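In addition to the patterns above, we can select a range of records. We do this with two patterns separated by a comma: the range starts at a record matching the first pattern and ends at the next record matching the second.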
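A minimal sketch of range patterns, assuming a POSIX-compatible awk. The START and STOP marker lines and the other sample records are made up for illustration.

```shell
sample='one
START
two
three
STOP
four'

# Regex range: from the first record matching /START/ through the next one matching /STOP/.
by_marker=$(printf '%s\n' "$sample" | awk '/START/, /STOP/')

# Expression range: records 2 through 3, selected by record number.
by_number=$(printf '%s\n' "$sample" | awk 'NR == 2, NR == 3')

printf '%s\n' "$by_marker" "$by_number"
```

Both endpoints are included in the range, so the marker lines themselves are printed.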