The GNU Awk User’s Guide

General Introduction

This file documents awk, a program that you can use to select particular records in a file and perform operations upon them.

Copyright © 1989, 1991, 1992, 1993, 1996–2005, 2007, 2009–2020

Free Software Foundation, Inc.

This is Edition 5.1 of GAWK: Effective AWK Programming: A User’s Guide for GNU Awk, for the 5.1.0 (or later) version of the GNU implementation of AWK.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with the Invariant Sections being “GNU General Public License”, with the Front-Cover Texts being “A GNU Manual”, and with the Back-Cover Texts as in (a) below. A copy of the license is included in the section entitled “GNU Free Documentation License”.

The FSF’s Back-Cover Text is: “You have the freedom to copy and modify this GNU manual.”


Foreword to the Third Edition

Arnold Robbins and I are good friends. We were introduced in 1990 by circumstances—and our favorite programming language, AWK. The circumstances started a couple of years earlier. I was working at a new job and noticed an unplugged Unix computer sitting in the corner. No one knew how to use it, and neither did I. However, a couple of days later, it was running, and I was root and the one-and-only user. That day, I began the transition from statistician to Unix programmer.

On one of many trips to the library or bookstore in search of books on Unix, I found the gray AWK book, a.k.a. Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger’s The AWK Programming Language (Addison-Wesley, 1988). awk ’s simple programming paradigm—find a pattern in the input and then perform an action—often reduced complex or tedious data manipulations to a few lines of code. I was excited to try my hand at programming in AWK.

Alas, the awk on my computer was a limited version of the language described in the gray book. I discovered that my computer had “old awk ” and the book described “new awk .” I learned that this was typical; the old version refused to step aside or relinquish its name. If a system had a new awk , it was invariably called nawk , and few systems had it. The best way to get a new awk was to ftp the source code for gawk from prep.ai.mit.edu . gawk was a version of new awk written by David Trueman and Arnold, and available under the GNU General Public License.

(Incidentally, it’s no longer difficult to find a new awk. gawk ships with GNU/Linux, and you can download binaries or source code for almost any system; my wife uses gawk on her VMS box.)

My Unix system started out unplugged from the wall; it certainly was not plugged into a network. So, oblivious to the existence of gawk and the Unix community in general, and desiring a new awk , I wrote my own, called mawk . Before I was finished, I knew about gawk , but it was too late to stop, so I eventually posted to a comp.sources newsgroup.

A few days after my posting, I got a friendly email from Arnold introducing himself. He suggested we share design and algorithms and attached a draft of the POSIX standard so that I could update mawk to support language extensions added after publication of The AWK Programming Language .

Frankly, if our roles had been reversed, I would not have been so open and we probably would have never met. I’m glad we did meet. He is an AWK expert’s AWK expert and a genuinely nice person. Arnold contributes significant amounts of his expertise and time to the Free Software Foundation.

This book is the gawk reference manual, but at its core it is a book about AWK programming that will appeal to a wide audience. It is a definitive reference to the AWK language as defined by the 1987 Bell Laboratories release and codified in the 1992 POSIX Utilities standard.

On the other hand, the novice AWK programmer can study a wealth of practical programs that emphasize the power of AWK’s basic idioms: data-driven control flow, pattern matching with regular expressions, and associative arrays. Those looking for something new can try out gawk ’s interface to network protocols via special /inet files.

The programs in this book make clear that an AWK program is typically much smaller and faster to develop than a counterpart written in C. Consequently, there is often a payoff to prototyping an algorithm or design in AWK to get it running quickly and expose problems early. Often, the interpreted performance is adequate and the AWK prototype becomes the product.

The new pgawk (profiling gawk) produces program execution counts. I recently experimented with an algorithm that, for n lines of input, exhibited ~ C n^2 performance, while theory predicted ~ C n log n behavior. A few minutes poring over the awkprof.out profile pinpointed the problem to a single line of code. pgawk is a welcome addition to my programmer’s toolbox.

Arnold has distilled over a decade of experience writing and using AWK programs, and developing gawk, into this book. If you use AWK or want to learn how, then read this book.

Michael Brennan

Author of mawk

March 2001

Foreword to the Fourth Edition

Some things don’t change. Thirteen years ago I wrote: “If you use AWK or want to learn how, then read this book.” True then, and still true today.

Learning to use a programming language is about more than mastering the syntax. One needs to acquire an understanding of how to use the features of the language to solve practical programming problems. A focus of this book is many examples that show how to use AWK.

Some things do change. Our computers are much faster and have more memory. Consequently, speed and storage inefficiencies of a high-level language matter less. Prototyping in AWK and then rewriting in C for performance reasons happens less, because more often the prototype is fast enough.

Of course, there are computing operations that are best done in C or C++. With gawk 4.1 and later, you do not have to choose between writing your program in AWK or in C/C++. You can write most of your program in AWK and the aspects that require C/C++ capabilities can be written in C/C++, and then the pieces glued together when the gawk module loads the C/C++ module as a dynamic plug-in. Writing Extensions for gawk has all the details, and, as expected, many examples to help you learn the ins and outs.

I enjoy programming in AWK and had fun (re)reading this book. I think you will too.

Michael Brennan

Author of mawk

October 2014

Preface

Several kinds of tasks occur repeatedly when working with text files. You might want to extract certain lines and discard the rest. Or you may need to make changes wherever certain patterns appear, but leave the rest of the file alone. Such jobs are often easy with awk. The awk utility interprets a special-purpose programming language that makes it easy to handle simple data-reformatting jobs.

The GNU implementation of awk is called gawk; if you invoke it with the proper options or environment variables, it is fully compatible with the POSIX specification of the awk language and with the Unix version of awk maintained by Brian Kernighan. This means that all properly written awk programs should work with gawk. So most of the time, we don’t distinguish between gawk and other awk implementations.

Using awk you can:

Manage small, personal databases

Generate reports

Validate data

Produce indexes and perform other document-preparation tasks

Experiment with algorithms that you can adapt later to other computer languages

In addition, gawk provides facilities that make it easy to:

Extract bits and pieces of data for processing

Sort data

Perform simple network communications

Profile and debug awk programs

Extend the language with functions written in C or C++

This Web page teaches you about the awk language and how you can use it effectively. You should already be familiar with basic system commands, such as cat and ls, as well as basic shell facilities, such as input/output (I/O) redirection and pipes.

Implementations of the awk language are available for many different computing environments. This Web page, while describing the awk language in general, also describes the particular implementation of awk called gawk (which stands for “GNU awk”). gawk runs on a broad range of Unix systems, ranging from Intel-architecture PC-based computers up through large-scale systems. gawk has also been ported to Mac OS X, Microsoft Windows (all versions), and OpenVMS.

• History: The history of gawk and awk.
• Names: What name to use to find awk.
• This Manual: Using this Web page. Includes sample input files that you can use.
• Conventions: Typographical Conventions.
• Manual History: Brief history of the GNU project and this Web page.
• How To Contribute: Helping to save the world.
• Acknowledgments: Acknowledgments.

History of awk and gawk

Recipe for a Programming Language

1 part egrep
1 part snobol
2 parts ed
3 parts C

Blend all parts well using lex and yacc. Document minimally and release.

After eight years, add another part egrep and two more parts C. Document very well and release.

The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions. This new version became widely available with Unix System V Release 3.1 (1987). The version in System V Release 4 (1989) added some new features and cleaned up the behavior in some of the “dark corners” of the language. The specification for awk in the POSIX Command Language and Utilities standard further clarified the language. Both the gawk designers and the original awk designers at Bell Laboratories provided feedback for the POSIX specification.

Paul Rubin wrote gawk in 1986. Jay Fenlason completed it, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from me, thoroughly reworked gawk for compatibility with the newer awk. Circa 1994, I became the primary maintainer. Current development focuses on bug fixes, performance improvements, standards compliance, and, occasionally, new features.

In May 1997, Jürgen Kahrs felt the need for network access from awk, and with a little help from me, set about adding features to do this for gawk. At that time, he also wrote the bulk of TCP/IP Internetworking with gawk (a separate document, available as part of the gawk distribution). His code finally became part of the main gawk distribution with gawk version 3.1.

John Haque rewrote the gawk internals, in the process providing an awk-level debugger. This version became available as gawk version 4.0 in 2011.

See section Major Contributors to gawk for a full list of those who have made important contributions to gawk.

A Rose by Any Other Name

The awk language has evolved over the years. Full details are provided in The Evolution of the awk Language. The language described in this Web page is often referred to as “new awk.” By analogy, the original version of awk is referred to as “old awk.”

On most current systems, when you run the awk utility you get some version of new awk. If your system’s standard awk is the old one, you will see something like this if you try the following test program:

$ awk 1 /dev/null
error→ awk: syntax error near line 1
error→ awk: bailing out near line 1

In this case, you should find a version of new awk, or just install gawk!

Throughout this Web page, whenever we refer to a language feature that should be available in any complete implementation of POSIX awk, we simply use the term awk. When referring to a feature that is specific to the GNU implementation, we use the term gawk.

Using This Book

The term awk refers to a particular program as well as to the language you use to tell this program what to do. When we need to be careful, we call the language “the awk language,” and the program “the awk utility.” This Web page explains both how to write programs in the awk language and how to run the awk utility. The term “awk program” refers to a program written by you in the awk programming language.

Primarily, this Web page explains the features of awk as defined in the POSIX standard. It does so in the context of the gawk implementation. While doing so, it also attempts to describe important differences between gawk and other awk implementations. Finally, it notes any gawk features that are not in the POSIX standard for awk.

This Web page has the difficult task of being both a tutorial and a reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross-references; they are for the expert user and for the Info and HTML versions of the Web page.

There are sidebars scattered throughout the Web page. They add a more complete explanation of points that are relevant, but not likely to be of interest on first reading. All appear in the index, under the heading “sidebar.”

Most of the time, the examples use complete awk programs. Some of the more advanced sections show only the part of the awk program that illustrates the concept being described.

Although this Web page is aimed principally at people who have not been exposed to awk , there is a lot of information here that even the awk expert should find useful. In particular, the description of POSIX awk and the example programs in A Library of awk Functions, and in Practical awk Programs, should be of interest.

This Web page is split into several parts, as follows:

Typographical Conventions

This Web page is written in Texinfo, the GNU documentation formatting language. A single Texinfo source file is used to produce both the printed and online versions of the documentation. Because of this, the typographical conventions are slightly different than in other books you may have read.

Examples you would type at the command line are preceded by the common shell primary and secondary prompts, ‘$’ and ‘>’, respectively. Input that you type is shown like this. Output from the command is preceded by the glyph “-|”. This typically represents the command’s standard output. Error messages and other output on the command’s standard error are preceded by the glyph “error→”. For example:

$ echo hi on stdout
-| hi on stdout
$ echo hello on stderr 1>&2
error→ hello on stderr

In the text, almost anything related to programming, such as command names, variable and function names, and string, numeric and regexp constants appear in this font. Code fragments appear in the same font and quoted, ‘like this’. Things that are replaced by the user or programmer appear in this font. Options look like this: -f. File names are indicated like this: /path/to/ourfile. Some things are emphasized like this, and if a point needs to be made strongly, it is done like this. The first occurrence of a new term is usually its definition and appears in the same font as the previous occurrence of “definition” in this sentence.

Characters that you type at the keyboard look like this. In particular, there are special characters called “control characters.” These are characters that you type by holding down both the CONTROL key and another key, at the same time. For example, a Ctrl-d is typed by first pressing and holding the CONTROL key, next pressing the d key, and finally releasing both keys.

For the sake of brevity, throughout this Web page, we refer to Brian Kernighan’s version of awk as “BWK awk.” (See section Other Freely Available awk Implementations for information on his and other versions.)

Dark Corners

Dark corners are basically fractal—no matter how much you illuminate, there’s always a smaller but darker one.

— Brian Kernighan

Until the POSIX standard (and GAWK: Effective AWK Programming), many features of awk were either poorly documented or not documented at all. Descriptions of such features (often called “dark corners”) are noted in this Web page with “(d.c.).” They also appear in the index under the heading “dark corner.”

But, as noted by the opening quote, any coverage of dark corners is by definition incomplete.

Extensions to the standard awk language that are supported by more than one awk implementation are marked “(c.e.),” and listed in the index under “common extensions” and “extensions, common.”

The GNU Project and This Book

The Free Software Foundation (FSF) is a nonprofit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M. Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.

The GNU Project is an ongoing effort on the part of the Free Software Foundation to create a complete, freely distributable, POSIX-compliant computing environment. The FSF uses the GNU General Public License (GPL) to ensure that its software’s source code is always available to the end user. A copy of the GPL is included in this Web page for your reference (see section GNU General Public License). The GPL applies to the C language source code for gawk. To find out more about the FSF and the GNU Project online, see the GNU Project’s home page. This Web page may also be read from GNU’s website.

A shell, an editor (Emacs), highly portable optimizing C, C++, and Objective-C compilers, a symbolic debugger and dozens of large and small utilities (such as gawk), have all been completed and are freely available. The GNU operating system kernel (the HURD), has been released but remains in an early stage of development.

Until the GNU operating system is more fully developed, you should consider using GNU/Linux, a freely distributable, Unix-like operating system for Intel, Power Architecture, Sun SPARC, IBM S/390, and other systems. Many GNU/Linux distributions are available for download from the Internet.

The Web page you are reading is actually free—at least, the information in it is free to anyone. The machine-readable source code for the Web page comes with gawk. (Take a moment to check the Free Documentation License in GNU Free Documentation License.)

The Web page itself has gone through multiple previous editions. Paul Rubin wrote the very first draft of The GAWK Manual ; it was around 40 pages long. Diane Close and Richard Stallman improved it, yielding a version that was around 90 pages and barely described the original, “old” version of awk .

I started working with that version in the fall of 1988. As work on it progressed, the FSF published several preliminary versions (numbered 0. x ). In 1996, edition 1.0 was released with gawk 3.0.0. The FSF published the first two editions under the title The GNU Awk User’s Guide .

This edition maintains the basic structure of the previous editions. For FSF edition 4.0, the content was thoroughly reviewed and updated. All references to gawk versions prior to 4.0 were removed. Of significant note for that edition was the addition of Debugging awk Programs.

For FSF edition 5.0, the content has been reorganized into parts, and the major new additions are Arithmetic and Arbitrary-Precision Arithmetic with gawk , and Writing Extensions for gawk .

This Web page will undoubtedly continue to evolve. If you find an error in the Web page, please report it! See section Reporting Problems and Bugs for information on submitting problem reports electronically.

How to Contribute

As the maintainer of GNU awk , I once thought that I would be able to manage a collection of publicly available awk programs and I even solicited contributions. Making things available on the Internet helps keep the gawk distribution down to manageable size.

The initial collection of material, such as it is, is still available at ftp://ftp.freefriends.org/arnold/Awkstuff.

In the hopes of doing something more broad, I acquired the awklang.org domain. Late in 2017, a volunteer took on the task of managing it.

If you have written an interesting awk program, that you would like to share with the rest of the world, please see http://www.awklang.org and use the “Contact” link.

If you have written a gawk extension, please see The gawkextlib Project.

Acknowledgments

The initial draft of The GAWK Manual had the following acknowledgments:

Many people need to be thanked for their assistance in producing this manual. Jay Fenlason contributed many ideas and sample programs. Richard Mlynarik and Robert Chassell gave helpful comments on drafts of this manual. The paper A Supplemental Document for AWK by John W. Pierce of the Chemistry Department at UC San Diego, pinpointed several issues relevant both to awk implementation and to this manual, that would otherwise have escaped us.

I would like to acknowledge Richard M. Stallman, for his vision of a better world and for his courage in founding the FSF and starting the GNU Project.

Earlier editions of this Web page had the following acknowledgements:

The following people (in alphabetical order) provided helpful comments on various versions of this book: Rick Adams, Dr. Nelson H.F. Beebe, Karl Berry, Dr. Michael Brennan, Rich Burridge, Claire Cloutier, Diane Close, Scott Deifik, Christopher (“Topher”) Eliot, Jeffrey Friedl, Dr. Darrel Hankerson, Michal Jaegermann, Dr. Richard J. LeBlanc, Michael Lijewski, Pat Rankin, Miriam Robbins, Mary Sheehan, and Chuck Toporek. Robert J. Chassell provided much valuable advice on the use of Texinfo. He also deserves special thanks for convincing me not to title this Web page How to Gawk Politely . Karl Berry helped significantly with the TeX part of Texinfo. I would like to thank Marshall and Elaine Hartholz of Seattle and Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet vacation time in their homes, which allowed me to make significant progress on this Web page and on gawk itself. Phil Hughes of SSC contributed in a very important way by loaning me his laptop GNU/Linux system, not once, but twice, which allowed me to do a lot of work while away from home. David Trueman deserves special credit; he has done a yeoman job of evolving gawk so that it performs well and without bugs. Although he is no longer involved with gawk , working with him on this project was a significant pleasure. The intrepid members of the GNITS mailing list, and most notably Ulrich Drepper, provided invaluable help and feedback for the design of the internationalization features. Chuck Toporek, Mary Sheehan, and Claire Cloutier of O’Reilly & Associates contributed significant editorial help for this Web page for the 3.1 release of gawk .

Dr. Nelson Beebe, Andreas Buening, Dr. Manuel Collado, Antonio Colombo, Stephen Davies, Scott Deifik, Akim Demaille, Daniel Richard G., Juan Manuel Guerrero, Darrel Hankerson, Michal Jaegermann, Jürgen Kahrs, Stepan Kasal, John Malmberg, Chet Ramey, Pat Rankin, Andrew Schorr, Corinna Vinschen, and Eli Zaretskii (in alphabetical order) make up the current gawk “crack portability team.” Without their hard work and help, gawk would not be nearly the robust, portable program it is today. It has been and continues to be a pleasure working with this team of fine people.

Notable code and documentation contributions were made by a number of people. See section Major Contributors to gawk for the full list.

Thanks to Michael Brennan for the Forewords.

Thanks to Patrice Dumas for the new makeinfo program. Thanks to Karl Berry for his past work on Texinfo, and to Gavin Smith, who continues to work to improve the Texinfo markup language.

Robert P.J. Day, Michael Brennan, and Brian Kernighan kindly acted as reviewers for the 2015 edition of this Web page. Their feedback helped improve the final work.

I would also like to thank Brian Kernighan for his invaluable assistance during the testing and debugging of gawk , and for his ongoing help and advice in clarifying numerous points about the language. We could not have done nearly as good a job on either gawk or its documentation without his help.

Brian is in a class by himself as a programmer and technical author. I have to thank him (yet again) for his ongoing friendship and for being a role model to me for over 30 years! Having him as a reviewer is an exciting privilege. It has also been extremely humbling ...

I must thank my wonderful wife, Miriam, for her patience through the many versions of this project, for her proofreading, and for sharing me with the computer. I would like to thank my parents for their love, and for the grace with which they raised and educated me. Finally, I also must acknowledge my gratitude to G-d, for the many opportunities He has sent my way, as well as for the gifts He has given me with which to take advantage of those opportunities.

Arnold Robbins

Nof Ayalon

Israel

March, 2020

Part I:

The awk Language

1 Getting Started with awk

The basic function of awk is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, awk performs specified actions on that line. awk continues to process input lines in this way until it reaches the end of the input files.

Programs in awk are different from programs in most other languages, because awk programs are data driven (i.e., you describe the data you want to work with and then what to do when you find it). Most other languages are procedural; you have to describe, in great detail, every step the program should take. When working with procedural languages, it is usually much harder to clearly describe the data your program will process. For this reason, awk programs are often refreshingly easy to read and write.

When you run awk, you specify an awk program that tells awk what to do. The program consists of a series of rules (it may also contain function definitions, an advanced feature that we will ignore for now; see section User-Defined Functions). Each rule specifies one pattern to search for and one action to perform upon finding the pattern.

Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in braces to separate it from the pattern. Newlines usually separate rules. Therefore, an awk program looks like this:

pattern { action }
pattern { action }
…
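As a small illustration of the rule syntax above, the following sketch runs a two-rule program over three made-up input lines (the fruit names and counts are invented for this example). The first rule uses a regexp pattern and the second a comparison on the second field; both kinds of pattern are covered properly later, so treat this only as a preview of the pattern { action } shape:

```shell
# Sample input is hypothetical; printf supplies it on standard input.
result=$(printf 'apple 3\nbanana 12\ncherry 7\n' |
awk '
/an/    { print "matched /an/:", $1 }
$2 > 5  { print "large count:", $1, $2 }
')
printf '%s\n' "$result"
```

The ‘banana’ line satisfies both patterns, so both actions run for it. Each input line is tested against every rule, in the order the rules appear.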

1.1 How to Run awk Programs

There are several ways to run an awk program. If the program is short, it is easiest to include it in the command that runs awk, like this:

awk 'program' input-file1 input-file2 …

When the program is long, it is usually more convenient to put it in a file and run it with a command like this:

awk -f program-file input-file1 input-file2 …

This section discusses both mechanisms, along with several variations of each.

• One-shot: Running a short throwaway awk program.
• Read Terminal: Using no input files (input from the keyboard instead).
• Long: Putting permanent awk programs in files.
• Executable Scripts: Making self-contained awk programs.
• Comments: Adding documentation to gawk programs.
• Quoting: More discussion of shell quoting issues.

1.1.1 One-Shot Throwaway awk Programs

Once you are familiar with awk, you will often type in simple programs the moment you want to use them. Then you can write the program as the first argument of the awk command, like this:

awk 'program' input-file1 input-file2 …

where program consists of a series of patterns and actions, as described earlier.

This command format instructs the shell, or command interpreter, to start awk and use the program to process records in the input file(s). There are single quotes around program so the shell won’t interpret any awk characters as special shell characters. The quotes also cause the shell to treat all of program as a single argument for awk, and allow program to be more than one line long.

This format is also useful for running short or medium-sized awk programs from shell scripts, because it avoids the need for a separate file for the awk program. A self-contained shell script is more reliable because there are no other files to misplace.
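To make the point about shell scripts concrete, here is a minimal sketch of a script that carries its awk program inline, spread over several lines inside one pair of single quotes. The function name sum_second_column and the sample data are hypothetical, chosen only for this illustration; the END rule it uses is described later in the manual:

```shell
# Hypothetical helper: add up the second field of every input line.
# The awk program is one single-quoted argument spanning several lines.
sum_second_column() {
    awk '
        { total += $2 }
        END { print "total:", total }
    ' "$@"
}

# With no file arguments, awk reads standard input.
result=$(printf 'a 1\nb 2\nc 3\n' | sum_second_column)
printf '%s\n' "$result"
```

Because the program text lives in the script itself, there is no separate awk file that must be kept alongside it.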

Later in this chapter, in Some Simple Examples, we’ll see examples of several short, self-contained programs.

1.1.2 Running awk Without Input Files

You can also run awk without any input files. If you type the following command line:

awk 'program'

awk applies the program to the standard input, which usually means whatever you type on the keyboard. This continues until you indicate end-of-file by typing Ctrl-d. (On non-POSIX operating systems, the end-of-file character may be different.)

As an example, the following program prints a friendly piece of advice (from Douglas Adams’s The Hitchhiker’s Guide to the Galaxy), to keep you from worrying about the complexities of computer programming:

$ awk 'BEGIN { print "Don\47t Panic!" }'
-| Don't Panic!

awk executes statements associated with BEGIN before reading any input. If there are no other statements in your program, as is the case here, awk just stops, instead of trying to read input it doesn’t know how to process. The ‘\47’ is a magic way (explained later) of getting a single quote into the program, without having to engage in ugly shell quoting tricks.

NOTE: If you use Bash as your shell, you should execute the command ‘set +H’ before running this program interactively, to disable the C shell-style command history, which treats ‘!’ as a special character. We recommend putting this command into your personal startup file.
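For comparison, the sketch below runs the ‘\47’ form next to one common shell quoting trick of the kind the text alludes to. The second form is standard POSIX shell quoting rather than anything specific to awk: it closes the single-quoted string, adds an escaped quote, and reopens the string. The variable names are only for illustration:

```shell
# Octal escape: awk itself turns \47 into a single quote.
with_escape=$(awk 'BEGIN { print "Don\47t Panic!" }')

# Shell trick: '\'' ends the quoted string, inserts a literal quote,
# and starts a new quoted string; awk sees an ordinary apostrophe.
with_shell_trick=$(awk 'BEGIN { print "Don'\''t Panic!" }')

printf '%s\n%s\n' "$with_escape" "$with_shell_trick"
```

Both commands print the same line; the escape form is simply easier to read and to type correctly.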

This next simple awk program emulates the cat utility; it copies whatever you type on the keyboard to its standard output (why this works is explained shortly):

$ awk '{ print }'
Now is the time for all good men
-| Now is the time for all good men
to come to the aid of their country.
-| to come to the aid of their country.
Four score and seven years ago, ...
-| Four score and seven years ago, ...
What, me worry?
-| What, me worry?
Ctrl-d
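Standard input does not have to be the keyboard. As a brief sketch, the same program can be fed through a pipe, in which case end-of-file arrives when the writing command finishes and no Ctrl-d is needed. The two input lines here are supplied by printf purely for illustration:

```shell
# awk has no input-file arguments, so it reads what the pipe delivers.
result=$(printf 'Now is the time\nfor all good men\n' | awk '{ print }')
printf '%s\n' "$result"
```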

1.1.3 Running Long Programs

Sometimes awk programs are very long. In these cases, it is more convenient to put the program into a separate file. In order to tell awk to use that file for its program, you type:

awk -f source-file input-file1 input-file2 …

The -f instructs the awk utility to get the awk program from the file source-file (see section Command-Line Options). Any file name can be used for source-file. For example, you could put the program:

BEGIN { print "Don't Panic!" }

into the file advice. Then this command:

awk -f advice

does the same thing as this one:

awk 'BEGIN { print "Don\47t Panic!" }'

This was explained earlier (see section Running awk Without Input Files). Note that you don’t usually need single quotes around the file name that you specify with -f, because most file names don’t contain any of the shell’s special characters. Notice that in advice, the awk program did not have single quotes around it. The quotes are only needed for programs that are provided on the awk command line. (Also, placing the program in a file allows us to use a literal single quote in the program text, instead of the magic ‘\47’.)

If you want to clearly identify an awk program file as such, you can add the extension .awk to the file name. This doesn’t affect the execution of the awk program but it does make “housekeeping” easier.
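The following sketch walks through the whole sequence: it writes the one-line program to a file named advice.awk and then runs it with -f. The temporary directory created with mktemp -d is an assumption made to keep the example self-contained (mktemp is widely available but not required by POSIX); any writable location works equally well:

```shell
dir=$(mktemp -d)                 # scratch directory, an assumption of this sketch

# The quoted 'EOF' stops the shell from altering the program text,
# so the literal single quote needs no special treatment.
cat > "$dir/advice.awk" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

result=$(awk -f "$dir/advice.awk")
printf '%s\n' "$result"

rm -r "$dir"
```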

1.1.4 Executable awk Programs

Once you have learned awk , you may want to write self-contained awk scripts, using the ‘ #! ’ script mechanism. You can do this on many systems.8 For example, you could update the file advice to look like this:

#! /bin/awk -f BEGIN { print "Don't Panic!" }

After making this file executable (with the chmod utility), simply type ‘ advice ’ at the shell and the system arranges to run awk as if you had typed ‘ awk -f advice ’:

$ chmod +x advice
$ ./advice
-| Don't Panic!

Self-contained awk scripts are useful when you want to write a program that users can invoke without their having to know that the program is written in awk .

Understanding ‘ #! ’

awk is an interpreted language. This means that the awk utility reads your program and then processes your data according to the instructions in your program. (This is different from a compiled language such as C, where your program is first compiled into machine code that is executed directly by your system’s processor.) The awk utility is thus termed an interpreter. Many modern languages are interpreted.

The line beginning with ‘ #! ’ lists the full file name of an interpreter to run and a single optional initial command-line argument to pass to that interpreter. The operating system then runs the interpreter with the given argument and the full argument list of the executed program. The first argument in the list is the full file name of the awk program. The rest of the argument list contains either options to awk , or data files, or both. (Note that on many systems awk is found in /usr/bin instead of in /bin .)

Some systems limit the length of the interpreter name to 32 characters. Often, this can be dealt with by using a symbolic link.

You should not put more than one argument on the ‘ #! ’ line after the path to awk . It does not work. The operating system treats the rest of the line as a single argument and passes it to awk . Doing this leads to confusing behavior, most likely a usage diagnostic of some sort from awk .

Finally, the value of ARGV[0] (see section Predefined Variables) varies depending upon your operating system. Some systems put ‘ awk ’ there, some put the full pathname of awk (such as /bin/awk ), and some put the name of your script (‘ advice ’). (d.c.) Don’t rely on the value of ARGV[0] to provide your script name.
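Because the location of awk differs between systems, one way to experiment with the ‘ #! ’ mechanism is to generate the first line from whatever awk is found on the search path. The following is only a sketch: it assumes that awk is on your PATH, that your system supports ‘ #! ’, and that files in the scratch directory may be executed. The names workdir, awk_path, and shebang_output are illustrative.

```shell
workdir=$(mktemp -d)
awk_path=$(command -v awk)    # for example /usr/bin/awk or /bin/awk

# First line: the interpreter path plus a single argument, -f.
printf '#! %s -f\n' "$awk_path" > "$workdir/advice"
# The rest of the file is the awk program itself.
cat >> "$workdir/advice" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

chmod +x "$workdir/advice"
shebang_output=$("$workdir/advice")
printf '%s\n' "$shebang_output"
rm -rf "$workdir"
```

The script deliberately avoids ARGV[0] , in keeping with the advice above.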

1.1.5 Comments in awk Programs

A comment is some text that is included in a program for the sake of human readers; it is not really an executable part of the program. Comments can explain what the program does and how it works. Nearly all programming languages have provisions for comments, as programs are typically hard to understand without them.

In the awk language, a comment starts with the number sign character (‘ # ’) and continues to the end of the line. The ‘ # ’ does not have to be the first character on the line. The awk language ignores the rest of a line following a number sign. For example, we could have put the following into advice :

# This program prints a nice, friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN { print "Don't Panic!" }

You can put comment lines into keyboard-composed throwaway awk programs, but this usually isn’t very useful; the purpose of a comment is to help you or another person understand the program when reading it at a later time.

CAUTION: As mentioned in One-Shot Throwaway awk Programs, you can enclose short to medium-sized programs in single quotes, in order to keep your shell scripts self-contained. When doing so, don’t put an apostrophe (i.e., a single quote) into a comment (or anywhere else in your program). The shell interprets the quote as the closing quote for the entire program. As a result, usually the shell prints a message about mismatched quotes, and if awk actually runs, it will probably print strange messages about syntax errors. For example, look at the following:

$ awk 'BEGIN { print "hello" } # let's be cute'
>

The shell sees that the first two quotes match, and that a new quoted object begins at the end of the command line. It therefore prompts with the secondary prompt, waiting for more input. With Unix awk , closing the quoted string produces this result:

$ awk '{ print "hello" } # let's be cute'
> '
error→ awk: can't open file be
error→ source line number 1

Putting a backslash before the single quote in ‘ let's ’ wouldn’t help, because backslashes are not special inside single quotes. The next subsection describes the shell’s quoting rules.

1.1.6 Shell Quoting Issues


For short to medium-length awk programs, it is most convenient to enter the program on the awk command line. This is best done by enclosing the entire program in single quotes. This is true whether you are entering the program interactively at the shell prompt, or writing it as part of a larger shell script:

awk ' program text ' input-file1 input-file2 …

Once you are working with the shell, it is helpful to have a basic knowledge of shell quoting rules. The following rules apply only to POSIX-compliant, Bourne-style shells (such as Bash, the GNU Bourne-Again Shell). If you use the C shell, you’re on your own.

Before diving into the rules, we introduce a concept that appears throughout this Web page, which is that of the null, or empty, string.

The null string is character data that has no value. In other words, it is empty. It is written in awk programs like this: "" . In the shell, it can be written using single or double quotes: "" or '' . Although the null string has no characters in it, it does exist. For example, consider this command:

$ echo ""

Here, the echo utility receives a single argument, even though that argument has no characters in it. In the rest of this Web page, we use the terms null string and empty string interchangeably. Now, on to the quoting rules:

Quoted items can be concatenated with nonquoted items as well as with other quoted items. The shell turns everything into one argument for the command.

Preceding any single character with a backslash (‘ \ ’) quotes that character. The shell removes the backslash and passes the quoted character on to the command.

Single quotes protect everything between the opening and closing quotes. The shell does no interpretation of the quoted text, passing it on verbatim to the command. It is impossible to embed a single quote inside single-quoted text. Refer back to Comments in awk Programs for an example of what happens if you try.

Double quotes protect most things between the opening and closing quotes. The shell does at least variable and command substitution on the quoted text. Different shells may do additional kinds of processing on double-quoted text. Because certain characters within double-quoted text are processed by the shell, they must be escaped within the text. Of note are the characters ‘ $ ’, ‘ ` ’, ‘ \ ’, and ‘ " ’, all of which must be preceded by a backslash within double-quoted text if they are to be passed on literally to the program. (The leading backslash is stripped first.) Thus, the example seen previously in Running awk Without Input Files:

awk 'BEGIN { print "Don\47t Panic!" }'

could instead be written this way:

$ awk "BEGIN { print \"Don't Panic!\" }"
-| Don't Panic!

Note that the single quote is not special within double quotes.

Null strings are removed when they occur as part of a non-null command-line argument, while explicit null objects are kept. For example, to specify that the field separator FS should be set to the null string, use:

awk -F "" ' program ' files # correct

Don’t use this:

awk -F"" ' program ' files # wrong!

In the second case, awk attempts to use the text of the program as the value of FS , and the first file name as the text of the program! This results in syntax errors at best, and confusing behavior at worst.
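The rule about null strings can be observed directly by counting the arguments that the shell hands to a command. The sketch below uses a small helper function, count_args , which is a hypothetical name introduced only for this illustration.

```shell
# Hypothetical helper: report how many arguments the shell actually passes.
count_args() { echo "$#"; }

explicit_null=$(count_args "")       # one argument that has no characters in it
joined_null=$(count_args -F"")       # the "" disappears; only -F remains
separate_null=$(count_args -F "")    # two arguments: -F and the null string

echo "$explicit_null $joined_null $separate_null"
```

The second case is the reason that ‘ -F"" ’ does not do what you might expect: awk never sees the null string at all.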

Mixing single and double quotes is difficult. You have to resort to shell quoting tricks, like this:

$ awk 'BEGIN { print "Here is a single quote <'"'"'>" }'
-| Here is a single quote <'>

This program consists of three concatenated quoted strings. The first and the third are single-quoted, and the second is double-quoted.

This can be “simplified” to:

$ awk 'BEGIN { print "Here is a single quote <'\''>" }'
-| Here is a single quote <'>

Judge for yourself which of these two is the more readable.

Another option is to use double quotes, escaping the embedded, awk -level double quotes:

$ awk "BEGIN { print \"Here is a single quote <'>\" }"
-| Here is a single quote <'>

This option is also painful, because double quotes, backslashes, and dollar signs are very common in more advanced awk programs.

A third option is to use the octal escape sequence equivalents (see section Escape Sequences) for the single- and double-quote characters, like so:

$ awk 'BEGIN { print "Here is a single quote <\47>" }'
-| Here is a single quote <'>
$ awk 'BEGIN { print "Here is a double quote <\42>" }'
-| Here is a double quote <">

This works nicely, but you should comment clearly what the escape sequences mean.

A fourth option is to use command-line variable assignment, like this:

$ awk -v sq="'" 'BEGIN { print "Here is a single quote <" sq ">" }'
-| Here is a single quote <'>

(Here, the two string constants and the value of sq are concatenated into a single string that is printed by print .)

If you really need both single and double quotes in your awk program, it is probably best to move it into a separate file, where the shell won’t be part of the picture and you can say what you mean.
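To compare the quoting techniques from this section side by side, the following sketch runs each variant and checks that they all produce the same text. The variable names ( expected , out1 through out5 ) are assumptions made for this illustration; the awk programs themselves are the ones shown above.

```shell
expected="Here is a single quote <'>"

# 1. Three concatenated quoted strings (single, double, single)
out1=$(awk 'BEGIN { print "Here is a single quote <'"'"'>" }')
# 2. A backslash-escaped quote between two single-quoted pieces
out2=$(awk 'BEGIN { print "Here is a single quote <'\''>" }')
# 3. Double quotes around the program, escaping the awk-level double quotes
out3=$(awk "BEGIN { print \"Here is a single quote <'>\" }")
# 4. The octal escape sequence for the single-quote character
out4=$(awk 'BEGIN { print "Here is a single quote <\47>" }')
# 5. Command-line variable assignment
out5=$(awk -v sq="'" 'BEGIN { print "Here is a single quote <" sq ">" }')

for out in "$out1" "$out2" "$out3" "$out4" "$out5"; do
    [ "$out" = "$expected" ] && echo "same" || echo "different: $out"
done
```

Which variant reads best is a matter of taste; the point is only that the shell delivers identical program text to awk in every case except the last two, where awk itself supplies the quote.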

1.1.6.1 Quoting in MS-Windows Batch Files

Although this Web page generally only worries about POSIX systems and the POSIX shell, the following issue arises often enough for many users that it is worth addressing.

The “shells” on Microsoft Windows systems use the double-quote character for quoting, and make it difficult or impossible to include an escaped double-quote character in a command-line script. The following example, courtesy of Jeroen Brink, shows how to escape the double quotes from this one-liner script that prints all lines in a file surrounded by double quotes:

{ print "\"" $0 "\"" }

In an MS-Windows command-line the one-liner script above may be passed as follows:

gawk "{ print \"\042\" $0 \"\042\" }" file

In this example the ‘ \042 ’ is the octal code for a double-quote; gawk converts it into a real double-quote for output by the print statement.

In MS-Windows escaping double-quotes is a little tricky because you use backslashes to escape double-quotes, but backslashes themselves are not escaped in the usual way; indeed they are either duplicated or not, depending upon whether there is a subsequent double-quote. The MS-Windows rule for double-quoting a string is the following:

For each double quote in the original string, let N be the number of backslash(es) before it, N might be zero. Replace these N backslash(es) by 2*N+1 backslash(es).

Let N be the number of backslash(es) tailing the original string, N might be zero. Replace these N backslash(es) by 2*N backslash(es).

Surround the resulting string by double-quotes.

So to double-quote the one-liner script ‘ { print "\"" $0 "\"" } ’ from the previous example you would do it this way:

gawk "{ print \"\\\"\" $0 \"\\\"\" }" file

However, the use of ‘ \042 ’ instead of ‘ \\\" ’ is also possible and easier to read, because backslashes that are not followed by a double-quote don’t need duplication.
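The three-step rule can be mechanized. The following sketch, written as an awk program wrapped in a shell function, applies the rule to each input line. The names win_quote and winquote are hypothetical, and the code covers only the rule as stated above; it makes no attempt to handle any other characters that a particular Windows command interpreter may treat specially.

```shell
# Hypothetical helper: apply the MS-Windows double-quoting rule to each line.
win_quote() {
    awk '
    function winquote(s,    out, n, k, i, c) {
        out = "\""                      # opening double quote
        n = 0                           # backslashes seen since the last other character
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (c == "\\") { n++; continue }
            # Before a double quote, N backslashes become 2*N+1;
            # before any other character they are copied unchanged.
            k = (c == "\"") ? 2 * n + 1 : n
            while (k-- > 0) out = out "\\"
            out = out c
            n = 0
        }
        # Backslashes at the very end of the string are doubled.
        k = 2 * n
        while (k-- > 0) out = out "\\"
        return out "\""                 # closing double quote
    }
    { print winquote($0) }'
}

quoted=$(printf '%s\n' '{ print "\"" $0 "\"" }' | win_quote)
printf '%s\n' "$quoted"
```

For the one-liner from the earlier example, the result is the same quoted string that appears in the gawk command shown above.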

1.2 Data files for the Examples

Many of the examples in this Web page take their input from two sample data files. The first, mail-list , represents a list of people’s names together with their email addresses and information about those people. The second data file, called inventory-shipped , contains information about monthly shipments. In both files, each line is considered to be one record.

In mail-list , each record contains the name of a person, his/her phone number, his/her email address, and a code for his/her relationship with the author of the list. The columns are aligned using spaces. An ‘ A ’ in the last column means that the person is an acquaintance. An ‘ F ’ in the last column means that the person is a friend. An ‘ R ’ means that the person is a relative:

Amelia       555-5553     amelia.zodiacusque@gmail.com    F
Anthony      555-3412     anthony.asserturo@hotmail.com   A
Becky        555-7685     becky.algebrarum@gmail.com      A
Bill         555-1675     bill.drowning@hotmail.com       A
Broderick    555-0542     broderick.aliquotiens@yahoo.com R
Camilla      555-2912     camilla.infusarum@skynet.be     R
Fabius       555-1234     fabius.undevicesimus@ucb.edu    F
Julie        555-6699     julie.perscrutabor@skeeve.com   F
Martin       555-6480     martin.codicibus@hotmail.com    A
Samuel       555-3430     samuel.lanceolis@shu.edu        A
Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R

The data file inventory-shipped represents information about shipments during the year. Each record contains the month, the number of green crates shipped, the number of red boxes shipped, the number of orange bags shipped, and the number of blue packages shipped, respectively. There are 16 entries, covering the 12 months of last year and the first four months of the current year. An empty line separates the data for the two years:

Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514

The sample files are included in the gawk distribution, in the directory awklib/eg/data .

1.3 Some Simple Examples

The following command runs a simple awk program that searches the input file mail-list for the character string ‘ li ’ (a grouping of characters is usually called a string; the term string is based on similar usage in English, such as “a string of pearls” or “a string of cars in a train”):

awk '/li/ { print $0 }' mail-list

When lines containing ‘ li ’ are found, they are printed because ‘ print $0 ’ means print the current line. (Just ‘ print ’ by itself means the same thing, so we could have written that instead.)

You will notice that slashes (‘ / ’) surround the string ‘ li ’ in the awk program. The slashes indicate that ‘ li ’ is the pattern to search for. This type of pattern is called a regular expression, which is covered in more detail later (see section Regular Expressions). The pattern is allowed to match parts of words. There are single quotes around the awk program so that the shell won’t interpret any of it as special shell characters.

Here is what this program prints:

$ awk '/li/ { print $0 }' mail-list
-| Amelia       555-5553     amelia.zodiacusque@gmail.com    F
-| Broderick    555-0542     broderick.aliquotiens@yahoo.com R
-| Julie        555-6699     julie.perscrutabor@skeeve.com   F
-| Samuel       555-3430     samuel.lanceolis@shu.edu        A

In an awk rule, either the pattern or the action can be omitted, but not both. If the pattern is omitted, then the action is performed for every input line. If the action is omitted, the default action is to print all lines that match the pattern.

Thus, we could leave out the action (the print statement and the braces) in the previous example and the result would be the same: awk prints all lines matching the pattern ‘ li ’. By comparison, omitting the print statement but retaining the braces makes an empty action that does nothing (i.e., no lines are printed).
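The effect of omitting the pattern or the action can be checked with a few lines of the sample data. This is only a sketch: it copies three records from mail-list into a scratch file, and the variable names ( explicit , no_action , empty_action , no_pattern ) are assumptions made for illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/mail-list" <<'EOF'
Amelia       555-5553     amelia.zodiacusque@gmail.com    F
Anthony      555-3412     anthony.asserturo@hotmail.com   A
Julie        555-6699     julie.perscrutabor@skeeve.com   F
EOF

explicit=$(awk '/li/ { print $0 }' "$workdir/mail-list")   # pattern and action
no_action=$(awk '/li/' "$workdir/mail-list")               # default action prints the line
empty_action=$(awk '/li/ { }' "$workdir/mail-list")        # empty action prints nothing
no_pattern=$(awk '{ print $1 }' "$workdir/mail-list")      # action runs for every line

[ "$explicit" = "$no_action" ] && echo "omitting the action changes nothing"
[ -z "$empty_action" ] && echo "an empty action prints nothing"
printf '%s\n' "$no_pattern"
rm -rf "$workdir"
```

Only the records for Amelia and Julie contain ‘ li ’, so the first two programs each print those two lines, while the last program prints the first field of all three records.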

Many practical awk programs are just a line or two long. Following is a collection of useful, short programs to get you started. Some of these programs contain constructs that haven’t been covered yet. (The description of the program will give you a good idea of what is going on, but you’ll need to read the rest of the Web page to become an awk expert!) Most of the examples use a data file named data . This is just a placeholder; if you use these programs yourself, substitute your own file names for data . For future reference, note that there is often more than one way to do things in awk . At some point, you may want to look back at these examples and see if you can come up with different ways to do the same things shown here:

Print every line that is longer than 80 characters:

awk 'length($0) > 80' data

The sole rule has a relational expression as its pattern and has no action, so it uses the default action, printing the record.

Print the length of the longest input line:

awk '{ if (length($0) > max) max = length($0) }
     END { print max }' data

The code associated with END executes after all input has been read; it’s the other side of the coin to BEGIN .

Print the length of the longest line in data :

expand data | awk '{ if (x < length($0)) x = length($0) }
                   END { print "maximum line length is " x }'

This example differs slightly from the previous one: the input is processed by the expand utility to change TABs into spaces, so the widths compared are actually the right-margin columns, as opposed to the number of input characters on each line.

Print every line that has at least one field:

awk 'NF > 0' data

This is an easy way to delete blank lines from a file (or rather, to create a new file similar to the old file but from which the blank lines have been removed).

Print seven random numbers from 0 to 100, inclusive:

awk 'BEGIN { for (i = 1; i <= 7; i++)
                 print int(101 * rand()) }'

Print the total number of bytes used by files :

ls -l files | awk '{ x += $5 }
                   END { print "total bytes: " x }'

Print the total number of kilobytes used by files :

ls -l files | awk '{ x += $5 }
                   END { print "total K-bytes:", x / 1024 }'

Print a sorted list of the login names of all users:

awk -F: '{ print $1 }' /etc/passwd | sort

Count the lines in a file:

awk 'END { print NR }' data

Print the even-numbered lines in the data file:

awk 'NR % 2 == 0' data

If you used the expression ‘ NR % 2 == 1 ’ instead, the program would print the odd-numbered lines.
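To try a few of these one-liners without touching your own files, you can run them against a small generated input. The sketch below is an illustration only; the four-line scratch file and the variable names ( nonblank , line_count , even_lines , longest ) are assumptions.

```shell
workdir=$(mktemp -d)
# Four lines: "alpha", "beta gamma", an empty line, and "delta".
printf 'alpha\nbeta gamma\n\ndelta\n' > "$workdir/data"

nonblank=$(awk 'NF > 0' "$workdir/data")             # drop the blank line
line_count=$(awk 'END { print NR }' "$workdir/data") # count all lines
even_lines=$(awk 'NR % 2 == 0' "$workdir/data")      # lines 2 and 4
longest=$(awk '{ if (length($0) > max) max = length($0) }
               END { print max }' "$workdir/data")   # length of the longest line

printf 'lines: %s, longest: %s\n' "$line_count" "$longest"
printf '%s\n' "$nonblank"
rm -rf "$workdir"
```

Here the file has four lines, the longest of which (‘ beta gamma ’) is ten characters long.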

1.4 An Example with Two Rules

The awk utility reads the input files one line at a time. For each line, awk tries the patterns of each rule. If several patterns match, then several actions execute in the order in which they appear in the awk program. If no patterns match, then no actions run.

After processing all the rules that match the line (and perhaps there are none), awk reads the next line. (However, see section The next Statement and also see section The nextfile Statement.) This continues until the program reaches the end of the file. For example, the following awk program contains two rules:

/12/ { print $0 }
/21/ { print $0 }

The first rule has the string ‘ 12 ’ as the pattern and ‘ print $0 ’ as the action. The second rule has the string ‘ 21 ’ as the pattern and also has ‘ print $0 ’ as the action. Each rule’s action is enclosed in its own pair of braces.

This program prints every line that contains the string ‘ 12 ’ or the string ‘ 21 ’. If a line contains both strings, it is printed twice, once by each rule.

This is what happens if we run this program on our two sample data files, mail-list and inventory-shipped :

$ awk '/12/ { print $0 }
>      /21/ { print $0 }' mail-list inventory-shipped
-| Anthony      555-3412     anthony.asserturo@hotmail.com   A
-| Camilla      555-2912     camilla.infusarum@skynet.be     R
-| Fabius       555-1234     fabius.undevicesimus@ucb.edu    F
-| Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R
-| Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R
-| Jan  21  36  64 620
-| Apr  21  70  74 514

Note how the line beginning with ‘ Jean-Paul ’ in mail-list was printed twice, once for each rule.
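To see which rule is responsible for each line of output, you can make the two actions print a label. The following sketch uses four made-up input lines and the labels ‘ rule 1: ’ and ‘ rule 2: ’; none of these appear in the sample data files, and they are assumptions made purely for illustration.

```shell
two_rule_output=$(printf '%s\n' 'only 12 here' 'only 21 here' \
                                'both 12 and 21' 'neither' |
    awk '/12/ { print "rule 1:", $0 }
         /21/ { print "rule 2:", $0 }')

printf '%s\n' "$two_rule_output"
```

The line containing both strings is printed twice, once by each rule and in the order in which the rules appear in the program; the line matching neither pattern is not printed at all.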

1.5 A More Complex Example

Now that we’ve mastered some simple tasks, let’s look at what typical awk programs do. This example shows how awk can be used to summarize, select, and rearrange the output of another utility. It uses features that haven’t been covered yet, so don’t worry if you don’t understand all the details:

ls -l | awk '$6 == "Nov" { sum += $5 } END { print sum }'

This command prints the total number of bytes in all the files in the current directory that were last modified in November (of any year). The ‘ ls -l ’ part of this example is a system command that gives you a listing of the files in a directory, including each file’s size and the date the file was last modified. Its output looks like this:

-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awkgram.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c

The first field contains read-write permissions, the second field contains the number of links to the file, and the third field identifies the file’s owner. The fourth field identifies the file’s group. The fifth field contains the file’s size in bytes. The sixth, seventh, and eighth fields contain the month, day, and time, respectively, that the file was last modified. Finally, the ninth field contains the file name.

The ‘ $6 == "Nov" ’ in our awk program is an expression that tests whether the sixth field of the output from ‘ ls -l ’ matches the string ‘ Nov ’. Each time a line has the string ‘ Nov ’ for its sixth field, awk performs the action ‘ sum += $5 ’. This adds the fifth field (the file’s size) to the variable sum . As a result, when awk has finished reading all the input lines, sum is the total of the sizes of the files whose lines matched the pattern. (This works because awk variables are automatically initialized to zero.)

After the last line of output from ls has been processed, the END rule executes and prints the value of sum . In this example, the value of sum is 80600.
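Because the output of ‘ ls -l ’ depends on your own directory, one way to reproduce the total quoted here is to feed the sample listing itself to the program. In this sketch the listing is stored in a shell variable; the names listing and nov_total are assumptions made for illustration.

```shell
# The sample listing from this section, stored verbatim.
listing='-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awkgram.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c'

# Add up field 5 (the size) for records whose field 6 (the month) is Nov.
nov_total=$(printf '%s\n' "$listing" |
    awk '$6 == "Nov" { sum += $5 } END { print sum }')

echo "$nov_total"
```

The five November files have sizes 1933, 10809, 22414, 37455, and 7989 bytes, which add up to 80600.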

These more advanced awk techniques are covered in later sections (see section Actions). Before you can move on to more advanced awk programming, you have to know how awk interprets your input and displays your output. By manipulating fields and using print statements, you can produce some very useful and impressive-looking reports.

1.6 awk Statements Versus Lines

Most often, each line in an awk program is a separate statement or separate rule, like this:

awk '/12/ { print $0 }
     /21/ { print $0 }' mail-list inventory-shipped

However, gawk ignores newlines after any of the following symbols and keywords:

, { ? : || && do else

A newline at any other point is considered the end of the statement.9

If you would like to split a single statement into two lines at a point where a newline would terminate it, you can continue it by ending the first line with a backslash character (‘ \ ’). The backslash must be the final character on the line in order to be recognized as a continuation character. A backslash followed by a newline is allowed anywhere in the statement, even in the middle of a string or regular expression. For example:

awk '/This regular expression is too long, so continue it\
 on the next line/ { print $1 }'

We have generally not used backslash continuation in our sample programs. gawk places no limit on the length of a line, so backslash continuation is never strictly necessary; it just makes programs more readable. For this same reason, as well as for clarity, we have kept most statements short in the programs presented throughout the Web page.

Backslash continuation is most useful when your awk program is in a separate source file instead of entered from the command line. You should also note that many awk implementations are more particular about where you may use backslash continuation. For example, they may not allow you to split a string constant using backslash continuation. Thus, for maximum portability of your awk programs, it is best not to split your lines in the middle of a regular expression or a string.

CAUTION: Backslash continuation does not work as described with the C shell. It works for awk programs in files and for one-shot programs, provided you are using a POSIX-compliant shell, such as the Unix Bourne shell or Bash. But the C shell behaves differently! There you must use two backslashes in a row, followed by a newline. Note also that when using the C shell, every newline in your awk program must be escaped with a backslash. To illustrate:

% awk 'BEGIN { \
?   print \\
?       "hello, world" \
? }'
-| hello, world

Here, the ‘ % ’ and ‘ ? ’ are the C shell’s primary and secondary prompts, analogous to the standard shell’s ‘ $ ’ and ‘ > ’.

Compare the previous example to how it is done with a POSIX-compliant shell:

$ awk 'BEGIN {
>   print \
>       "hello, world"
> }'
-| hello, world

awk is a line-oriented language. Each rule’s action has to begin on the same line as the pattern. To have the pattern and action on separate lines, you must use backslash continuation; there is no other option.

Another thing to keep in mind is that backslash continuation and comments do not mix. As soon as awk sees the ‘ # ’ that starts a comment, it ignores everything on the rest of the line. For example:

$ gawk 'BEGIN { print "dont panic" # a friendly \
>                                    BEGIN rule
> }'
error→ gawk: cmd. line:2:                BEGIN rule
error→ gawk: cmd. line:2:                ^ syntax error

In this case, it looks like the backslash would continue the comment onto the next line. However, the backslash-newline combination is never even noticed because it is “hidden” inside the comment. Thus, the BEGIN is noted as a syntax error.

When awk statements within one rule are short, you might want to put more than one of them on a line. This is accomplished by separating the statements with a semicolon (‘ ; ’). This also applies to the rules themselves. Thus, the program shown at the start of this section could also be written this way:

/12/ { print $0 } ; /21/ { print $0 }

NOTE: The requirement that states that rules on the same line must be separated with a semicolon was not in the original awk language; it was added for consistency with the treatment of statements within an action.
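The three ways of laying out statements discussed in this section can be tried in a POSIX-compliant shell as follows. This is a sketch with made-up input; the variable names ( continued , same_line , backslash ) and the sample values are assumptions.

```shell
# A newline directly after && does not end the pattern.
continued=$(printf '5\n15\n25\n' | awk '$1 > 10 &&
    $1 < 20 { print "teen:", $1 }')

# Semicolons separate statements within an action, and rules on one line.
same_line=$(printf 'a b\n' | awk '{ n = NF; print n } ; END { print "done" }')

# Backslash continuation where a newline would otherwise end the statement.
backslash=$(awk 'BEGIN { print \
    "hello, world" }')

printf '%s\n' "$continued" "$same_line" "$backslash"
```

None of the continued lines splits a string or a regular expression, which keeps the programs portable to awk implementations that are stricter about backslash continuation.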

1.7 Other Features of awk

The awk language provides a number of predefined, or built-in, variables that your programs can use to get information from awk . There are other variables your program can set as well to control how awk processes your data.

In addition, awk provides a number of built-in functions for doing common computational and string-related operations. gawk provides built-in functions for working with timestamps, performing bit manipulation, translating strings at runtime (internationalization), determining the type of a variable, and sorting arrays.

As we develop our presentation of the awk language, we will introduce most of the variables and many of the functions. They are described systematically in Predefined Variables and in Built-in Functions.

1.8 When to Use awk

Now that you’ve seen some of what awk can do, you might wonder how awk could be useful for you. By using utility programs, advanced patterns, field separators, arithmetic statements, and other selection criteria, you can produce much more complex output. The awk language is very useful for producing reports from large amounts of raw data, such as summarizing information from the output of other utility programs like ls . (See section A More Complex Example.)

Programs written with awk are usually much smaller than they would be in other languages. This makes awk programs easy to compose and use. Often, awk programs can be quickly composed at your keyboard, used once, and thrown away. Because awk programs are interpreted, you can avoid the (usually lengthy) compilation part of the typical edit-compile-test-debug cycle of software development.

Complex programs have been written in awk , including a complete retargetable assembler for eight-bit microprocessors (see section Glossary, for more information), and a microcode assembler for a special-purpose Prolog computer. The original awk ’s capabilities were strained by tasks of such complexity, but modern versions are more capable.

If you find yourself writing awk scripts of more than, say, a few hundred lines, you might consider using a different programming language. The shell is good at string and pattern matching; in addition, it allows powerful use of the system utilities. Python offers a nice balance between high-level ease of programming and access to system facilities.10

1.9 Summary

Programs in awk consist of pattern–action pairs.

An action without a pattern always runs. The default action for a pattern without one is ‘ { print $0 } ’.

Use either ‘ awk ' program ' files ’ or ‘ awk -f program-file files ’ to run awk .

You may use the special ‘ #! ’ header line to create awk programs that are directly executable.

Comments in awk programs start with ‘ # ’ and continue to the end of the same line.

Be aware of quoting issues when writing awk programs as part of a larger shell script (or MS-Windows batch file).

You may use backslash continuation to continue a source line. Lines are automatically continued after a comma, open brace, question mark, colon, ‘ || ’, ‘ && ’, do , and else .

2 Running awk and gawk

This chapter covers how to run awk , both POSIX-standard and gawk -specific command-line options, and what awk and gawk do with nonoption arguments. It then proceeds to cover how gawk searches for source files, reading standard input along with other files, gawk ’s environment variables, gawk ’s exit status, using include files, and obsolete and undocumented options and/or features.

Many of the options and features described here are discussed in more detail later in the Web page; feel free to skip over things in this chapter that don’t interest you right now.

2.1 Invoking awk

There are two ways to run awk —with an explicit program or with one or more program files. Here are templates for both of them; items enclosed in […] in these templates are optional:

awk [ options ] -f progfile [ -- ] file …
awk [ options ] [ -- ] ' program ' file …

In addition to traditional one-letter POSIX-style options, gawk also supports GNU long options.

It is possible to invoke awk with an empty program:

awk '' datafile1 datafile2

Doing so makes little sense, though; awk exits silently when given an empty program. (d.c.) If --lint has been specified on the command line, gawk issues a warning that the program is empty.

2.2 Command-Line Options

Options begin with a dash and consist of a single character. GNU-style long options consist of two dashes and a keyword. The keyword can be abbreviated, as long as the abbreviation allows the option to be uniquely identified. If the option takes an argument, either the keyword is immediately followed by an equals sign (‘ = ’) and the argument’s value, or the keyword and the argument’s value are separated by whitespace (spaces or TABs). If a particular option with a value is given more than once, it is the last value that counts.

Each long option for gawk has a corresponding POSIX-style short option. The long and short options are interchangeable in all contexts. The following list describes options mandated by the POSIX standard:

-F fs
--field-separator fs
Set the FS variable to fs (see section Specifying How Fields Are Separated).

-f source-file
--file source-file
Read the awk program source from source-file instead of in the first nonoption argument. This option may be given multiple times; the awk program consists of the concatenation of the contents of each specified source-file . Files named with -f are treated as if they had ‘ @namespace "awk" ’ at their beginning. See section Changing The Namespace, for more information on this advanced feature.

-v var=val
--assign var=val
Set the variable var to the value val before execution of the program begins. Such variable values are available inside the BEGIN rule (see section Other Command-Line Arguments). The -v option can only set one variable, but it can be used more than once, setting another variable each time, like this: ‘ awk -v foo=1 -v bar=2 … ’.

CAUTION: Using -v to set the values of the built-in variables may lead to surprising results. awk will reset the values of those variables as it needs to, possibly ignoring any initial value you may have given.

-W gawk-opt
Provide an implementation-specific option. This is the POSIX convention for providing implementation-specific options. These options also have corresponding GNU-style long options. Note that the long options may be abbreviated, as long as the abbreviations remain unique. The full list of gawk -specific options is provided next.

--
Signal the end of the command-line options. The following arguments are not treated as options even if they begin with ‘ - ’. This interpretation of -- follows the POSIX argument parsing conventions. This is useful if you have file names that start with ‘ - ’, or in shell scripts, if you have file names that will be specified by the user that could start with ‘ - ’. It is also useful for passing options on to the awk program; see Processing Command-Line Options.
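The POSIX options can be combined freely. The sketch below exercises -F , -v , -f , and -- on a small scratch file whose name deliberately begins with ‘ - ’. The file names ( -users , first.awk ), the variables ( prefix , col ), and the sample records are all assumptions made for this illustration.

```shell
workdir=$(mktemp -d)
printf 'root:x:0\ndaemon:x:1\n' > "$workdir/-users"
printf '{ print $1; exit }\n' > "$workdir/first.awk"

# -F sets FS; each -v sets one variable before the program starts.
names=$(awk -F: -v prefix="user=" -v col=1 '{ print prefix $col }' "$workdir/-users")

# Variables set with -v are already available inside the BEGIN rule.
in_begin=$(awk -v foo=1 -v bar=2 'BEGIN { print foo + bar }')

# With -f, the next nonoption argument is a data file; -- keeps the
# file name "-users" from being mistaken for an option.
first=$(cd "$workdir" && awk -F: -f first.awk -- -users)

printf '%s\n' "$names" "$in_begin" "$first"
rm -rf "$workdir"
```

Without the -- , awk would try to interpret ‘ -users ’ as a group of options rather than as a file name.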

The following list describes gawk -specific options:

As long as program text has been supplied, any other options are flagged as invalid with a warning message but are otherwise ignored.

In compatibility mode, as a special case, if the value of fs supplied to the -F option is ‘ t ’, then FS is set to the TAB character ( "\t" ). This is true only for --traditional and not for --posix (see section Specifying How Fields Are Separated).

The -f option may be used more than once on the command line. If it is, awk reads its program source from all of the named files, as if they had been concatenated together into one big file. This is useful for creating libraries of awk functions. These functions can be written once and then retrieved from a standard place, instead of having to be included in each individual program. The -i option is similar in this regard. (As mentioned in Function Definition Syntax, function names must be unique.)
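The library idea can be sketched with two throwaway files. The file names lib.awk and main.awk and the function larger() are hypothetical, chosen only for this example; the temporary directory keeps the sketch self-contained.

```shell
workdir=$(mktemp -d)

# lib.awk: a tiny "library" holding one reusable function.
cat > "$workdir/lib.awk" <<'EOF'
function larger(a, b) { return (a + 0 > b + 0) ? a : b }
EOF

# main.awk: the program proper, which calls into the library.
cat > "$workdir/main.awk" <<'EOF'
{ print larger($1, $2) }
EOF

# awk treats the two files as one concatenated program.
lib_out=$(printf '3 7\n10 2\n' | awk -f "$workdir/lib.awk" -f "$workdir/main.awk")
printf '%s\n' "$lib_out"    # prints 7, then 10

rm -rf "$workdir"
```

In real use the library file would live in a fixed, shared location rather than a temporary directory.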

With standard awk , library functions can still be used, even if the program is entered at the keyboard, by specifying ‘ -f /dev/tty ’. After typing your program, type Ctrl-d (the end-of-file character) to terminate it. (You may also use ‘ -f - ’ to read program source from the standard input, but then you will not be able to also use the standard input as a source of data.)

Because it is clumsy using the standard awk mechanisms to mix source file and command-line awk programs, gawk provides the -e option. This does not require you to preempt the standard input for your source code, and it allows you to easily mix command-line and library source code (see section The AWKPATH Environment Variable). As with -f , the -e and -i options may also be used multiple times on the command line.

If no -f option (or -e option for gawk ) is specified, then awk uses the first nonoption command-line argument as the text of the program source code. Arguments on the command line that follow the program text are entered into the ARGV array; awk does not continue to parse the command line looking for options.

If the environment variable POSIXLY_CORRECT exists, then gawk behaves in strict POSIX mode, exactly as if you had supplied --posix . Many GNU programs look for this environment variable to suppress extensions that conflict with POSIX, but gawk behaves differently: it suppresses all extensions, even those that do not conflict with POSIX, and behaves in strict POSIX mode. If --lint is supplied on the command line and gawk turns on POSIX mode because of POSIXLY_CORRECT , then it issues a warning message indicating that POSIX mode is in effect. You would typically set this variable in your shell’s startup file. For a Bourne-compatible shell (such as Bash), you would add these lines to the .profile file in your home directory:

POSIXLY_CORRECT=true
export POSIXLY_CORRECT

For a C shell-compatible shell, you would add this line to the .login file in your home directory:

setenv POSIXLY_CORRECT true

Having POSIXLY_CORRECT set is not recommended for daily use, but it is good for testing the portability of your programs to other environments.

2.3 Other Command-Line Arguments

Any additional arguments on the command line are normally treated as input files to be processed in the order specified. However, an argument that has the form var=value assigns the value value to the variable var; it does not specify a file at all. (See Assigning Variables on the Command Line.) In the following example, count=1 is a variable assignment, not a file name:

awk -f program.awk file1 count=1 file2

As a side point, should you really need to have awk process a file named count=1 (or any file whose name looks like a variable assignment), precede the file name with ‘ ./ ’, like so:

awk -f program.awk file1 ./count=1 file2

All the command-line arguments are made available to your awk program in the ARGV array (see section Predefined Variables). Command-line options and the program text (if present) are omitted from ARGV . All other arguments, including variable assignments, are included. As each element of ARGV is processed, gawk sets ARGIND to the index in ARGV of the current element. ( gawk makes the full command line, including program text and options, available in PROCINFO["argv"] ; see section Built-in Variables That Convey Information.)
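A short sketch makes the contents of ARGV visible. The operands file1, count=1, and file2 are placeholders; because the program consists only of a BEGIN rule, awk never tries to open them. ARGV[0] is skipped because its value depends on how awk was invoked.

```shell
# Options (-v debug=1) and the program text are omitted from ARGV;
# file names and variable assignments are both included.
awk -v debug=1 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' file1 count=1 file2
# Expected output:
#   1 file1
#   2 count=1
#   3 file2
```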

Changing ARGC and ARGV in your awk program lets you control how awk processes the input files; this is described in more detail in Using ARGC and ARGV .

The distinction between file name arguments and variable-assignment arguments is made when awk is about to open the next input file. At that point in execution, it checks the file name to see whether it is really a variable assignment; if so, awk sets the variable instead of reading a file.

Therefore, the variables actually receive the given values after all previously specified files have been read. In particular, the values of variables assigned in this fashion are not available inside a BEGIN rule (see section The BEGIN and END Special Patterns), because such rules are run before awk begins scanning the argument list.
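The timing difference can be observed directly. This sketch uses a temporary one-line data file and a made-up variable n; it contrasts an assignment given as a command-line argument with the same assignment given through the -v option.

```shell
datafile=$(mktemp)
printf 'one line\n' > "$datafile"

# Assignment as an operand: processed only when awk reaches it in the
# argument list, which is after the BEGIN rule has already run.
late=$(awk 'BEGIN { print "BEGIN sees [" n "]" } { print "rule sees [" n "]" }' n=5 "$datafile")

# Assignment with -v: performed before the program starts.
early=$(awk -v n=5 'BEGIN { print "BEGIN sees [" n "]" } { print "rule sees [" n "]" }' "$datafile")

printf '%s\n' "$late" "$early"
# Expected output:
#   BEGIN sees []
#   rule sees [5]
#   BEGIN sees [5]
#   rule sees [5]

rm -f "$datafile"
```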

The variable values given on the command line are processed for escape sequences (see section Escape Sequences). (d.c.)

In some very early implementations of awk , when a variable assignment occurred before any file names, the assignment would happen before the BEGIN rule was executed. awk ’s behavior was thus inconsistent; some command-line assignments were available inside the BEGIN rule, while others were not. Unfortunately, some applications came to depend upon this “feature.” When awk was changed to be more consistent, the -v option was added to accommodate applications that depended upon the old behavior.

The variable assignment feature is most useful for assigning to variables such as RS , OFS , and ORS , which control input and output formats, before scanning the data files. It is also useful for controlling state if multiple passes are needed over a data file. For example:

awk 'pass == 1  { pass 1 stuff }
     pass == 2  { pass 2 stuff }' pass=1 mydata pass=2 mydata
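For a concrete instance of the two-pass pattern, here is a minimal sketch that sums a column on the first pass and reports each value's share of the total on the second. The data values and the temporary file are invented for the example.

```shell
datafile=$(mktemp)
printf '10\n30\n60\n' > "$datafile"

# The same file is named twice; pass=1 and pass=2 are assigned just
# before the file that follows each of them is opened.
shares=$(awk 'pass == 1 { total += $1 }
     pass == 2 { printf "%d is %d%% of %d\n", $1, 100 * $1 / total, total }' \
    pass=1 "$datafile" pass=2 "$datafile")
printf '%s\n' "$shares"
# Expected output:
#   10 is 10% of 100
#   30 is 30% of 100
#   60 is 60% of 100

rm -f "$datafile"
```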

Given the variable assignment feature, the -F option for setting the value of FS is not strictly necessary. It remains for historical compatibility.

2.4 Naming Standard Input

Often, you may wish to read standard input together with other files. For example, you may wish to read one file, read standard input coming from a pipe, and then read another file.

The way to name the standard input, with all versions of awk , is to use a single, standalone minus sign or dash, ‘ - ’. For example:

some_command | awk -f myprog.awk file1 - file2

Here, awk first reads file1 , then it reads the output of some_command , and finally it reads file2 .
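The reading order can be checked with a small, self-contained sketch. The two files are temporary stand-ins for file1 and file2, and printf plays the role of some_command.

```shell
workdir=$(mktemp -d)
printf 'first\n' > "$workdir/file1"
printf 'last\n'  > "$workdir/file2"

# The standalone "-" marks the point at which standard input is read.
order=$(printf 'from the pipe\n' |
    awk '{ print NR ": " $0 }' "$workdir/file1" - "$workdir/file2")
printf '%s\n' "$order"
# Expected output:
#   1: first
#   2: from the pipe
#   3: last

rm -rf "$workdir"
```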

You may also use "-" to name standard input when reading files with getline (see section Using getline from a File). And, you can even use "-" with the -f option to read program source code from standard input (see section Command-Line Options).

In addition, gawk allows you to specify the special file name /dev/stdin , both on the command line and with getline . Some other versions of awk also support this, but it is not standard. (Some operating systems provide a /dev/stdin file in the filesystem; however, gawk always processes this file name itself.)

2.5 The Environment Variables gawk Uses

A number of environment variables influence how gawk behaves.

• AWKPATH Variable: Searching directories for awk programs.
• AWKLIBPATH Variable: Searching directories for awk shared libraries.
• Other Environment Variables: The environment variables.

2.5.1 The AWKPATH Environment Variable

In most awk implementations, you must supply a precise pathname for each program file, unless the file is in the current directory. But with gawk , if the file name supplied to the -f or -i options does not contain a directory separator ‘ / ’, then gawk searches a list of directories (called the search path) one by one, looking for a file with the specified name.

The search path is a string consisting of directory names separated by colons. gawk gets its search path from the AWKPATH environment variable. If that variable does not exist, or if it has an empty value, gawk uses a default path (described shortly).

The search path feature is particularly helpful for building libraries of useful awk functions. The library files can be placed in a standard directory in the default path and then specified on the command line with a short file name. Otherwise, you would have to type the full file name for each file.

By using the -i or -f options, your command-line awk programs can use facilities in awk library files (see section A Library of awk Functions). Path searching is not done if gawk is in compatibility mode. This is true for both --traditional and --posix . See section Command-Line Options.

If the source code file is not found after the initial search, the path is searched again after adding the suffix ‘ .awk ’ to the file name.

gawk ’s path search mechanism is similar to the shell’s. (See The Bourne-Again SHell manual .) It treats a null entry in the path as indicating the current directory. (A null entry is indicated by starting or ending the path with a colon or by placing two colons next to each other [‘ :: ’].)

NOTE: To include the current directory in the path, either place . as an entry in the path or write a null entry in the path. Different past versions of gawk would also look explicitly in the current directory, either before or after the path search. As of version 4.1.2, this no longer happens; if you wish to look in the current directory, you must include . either as a separate entry or as a null entry in the search path.

The default value for AWKPATH is ‘.:/usr/local/share/awk’. Since . is included at the beginning, gawk searches first in the current directory and then in /usr/local/share/awk. In practice, this means that you will rarely need to change the value of AWKPATH.

See section Shell Startup Files, for information on functions that help to manipulate the AWKPATH variable.

gawk places the value of the search path that it used into ENVIRON["AWKPATH"] . This provides access to the actual search path value from within an awk program.

Although you can change ENVIRON["AWKPATH"] within your awk program, this has no effect on the running program’s behavior. This makes sense: the AWKPATH environment variable is used to find the program source files. Once your program is running, all the files have been found, and gawk no longer needs to use AWKPATH .

2.5.2 The AWKLIBPATH Environment Variable

The AWKLIBPATH environment variable is similar to the AWKPATH variable, but it is used to search for loadable extensions (stored as system shared libraries) specified with the -l option rather than for source files. If the extension is not found, the path is searched again after adding the appropriate shared library suffix for the platform. For example, on GNU/Linux systems, the suffix ‘ .so ’ is used. The search path specified is also used for extensions loaded via the @load keyword (see section Loading Dynamic Extensions into Your Program).

If AWKLIBPATH does not exist in the environment, or if it has an empty value, gawk uses a default path; this is typically ‘/usr/local/lib/gawk’, although it can vary depending upon how gawk was built.

See section Shell Startup Files, for information on functions that help to manipulate the AWKLIBPATH variable.

gawk places the value of the search path that it used into ENVIRON["AWKLIBPATH"] . This provides access to the actual search path value from within an awk program.

Although you can change ENVIRON["AWKLIBPATH"] within your awk program, this has no effect on the running program’s behavior. This makes sense: the AWKLIBPATH environment variable is used to find any requested extensions, and they are loaded before the program starts to run. Once your program is running, all the extensions have been found, and gawk no longer needs to use AWKLIBPATH .

2.5.3 Other Environment Variables

A number of other environment variables affect gawk ’s behavior, but they are more specialized. Those in the following list are meant to be used by regular users:

GAWK_MSEC_SLEEP
    Specifies the interval between connection retries, in milliseconds. On systems that do not support the usleep() system call, the value is rounded up to an integral number of seconds.

GAWK_READ_TIMEOUT
    Specifies the time, in milliseconds, for gawk to wait for input before returning with an error. See section Reading Input with a Timeout.

GAWK_SOCK_RETRIES
    Controls the number of times gawk attempts to retry a two-way TCP/IP (socket) connection before giving up. See section Using gawk for Network Programming. Note that when nonfatal I/O is enabled (see section Enabling Nonfatal Output), gawk only tries to open a TCP/IP socket once.

POSIXLY_CORRECT
    Causes gawk to switch to POSIX-compatibility mode, disabling all traditional and GNU extensions. See section Command-Line Options.

The environment variables in the following list are meant for use by the gawk developers for testing and tuning. They are subject to change. The variables are:

AWKBUFSIZE
    This variable only affects gawk on POSIX-compliant systems. With a value of ‘exact’, gawk uses the size of each input file as the size of the memory buffer to allocate for I/O. Otherwise, the value should be a number, and gawk uses that number as the size of the buffer to allocate. (When this variable is not set, gawk uses the smaller of the file’s size and the “default” blocksize, which is usually the filesystem’s I/O blocksize.)

AWK_HASH
    If this variable exists with a value of ‘gst’, gawk switches to using the hash function from GNU Smalltalk for managing arrays. This function may be marginally faster than the standard function.

AWKREADFUNC
    If this variable exists, gawk switches to reading source files one line at a time, instead of reading in blocks. This exists for debugging problems on filesystems on non-POSIX operating systems where I/O is performed in records, not in blocks.

GAWK_MSG_SRC
    If this variable exists, gawk includes the file name and line number within the gawk source code from which warning and/or fatal messages are generated. Its purpose is to help isolate the source of a message, as there are multiple places that produce the same warning or error message.

GAWK_LOCALE_DIR
    Specifies the location of compiled message object files for gawk itself. This is passed to the bindtextdomain() function when gawk starts up.

GAWK_NO_DFA
    If this variable exists, gawk does not use the DFA regexp matcher for “does it match” kinds of tests. This can cause gawk to be slower. Its purpose is to help isolate differences between the two regexp matchers that gawk uses internally. (There aren’t supposed to be differences, but occasionally theory and practice don’t coordinate with each other.)

GAWK_STACKSIZE
    This specifies the amount by which gawk should grow its internal evaluation stack, when needed.

INT_CHAIN_MAX
    This specifies the intended maximum number of items gawk will maintain on a hash chain for managing arrays indexed by integers.

STR_CHAIN_MAX
    This specifies the intended maximum number of items gawk will maintain on a hash chain for managing arrays indexed by strings.

TIDYMEM
    If this variable exists, gawk uses the mtrace() library calls from the GNU C library to help track down possible memory leaks.

2.6 gawk ’s Exit Status

If the exit statement is used with a value (see section The exit Statement), then gawk exits with the numeric value given to it.

Otherwise, if there were no problems during execution, gawk exits with the value of the C constant EXIT_SUCCESS . This is usually zero.

If an error occurs, gawk exits with the value of the C constant EXIT_FAILURE . This is usually one.

If gawk exits because of a fatal error, the exit status is two. On non-POSIX systems, this value may be mapped to EXIT_FAILURE .
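The first two cases are easy to observe from the shell. This is a minimal sketch; the value 3 is arbitrary, and the `|| status=$?` idiom is only there so that the script keeps going when run under `set -e`.

```shell
# An explicit value passed to the exit statement becomes the exit status.
status=0
awk 'BEGIN { exit 3 }' || status=$?
echo "exit 3 gave status $status"        # exit 3 gave status 3

# A program that finishes without problems reports success (usually zero).
status=0
awk 'BEGIN { x = 1 }' || status=$?
echo "normal run gave status $status"    # normal run gave status 0
```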

2.7 Including Other Files into Your Program

This section describes a feature that is specific to gawk .

The @include keyword can be used to read external awk source files. This gives you the ability to split large awk source files into smaller, more manageable pieces, and also lets you reuse common awk code from various awk scripts. In other words, you can group together awk functions used to carry out specific tasks into external files. These files can be used just like function libraries, using the @include keyword in conjunction with the AWKPATH environment variable. Note that source files may also be included using the -i option.

Let’s see an example. We’ll start with two (trivial) awk scripts, namely test1 and test2 . Here is the test1 script:

BEGIN { print "This is script test1." }

and here is test2 :

@include "test1"
BEGIN {
    print "This is script test2."
}

Running gawk with test2 produces the following result:

$ gawk -f test2
-| This is script test1.
-| This is script test2.

gawk runs the test2 script, which includes test1 using the @include keyword. So, to include external awk source files, you just use @include followed by the name of the file to be included, enclosed in double quotes.

NOTE: Keep in mind that this is a language construct and the file name cannot be a string variable, but rather just a literal string constant in double quotes.

The files to be included may be nested; e.g., given a third script, namely test3 :

@include "test2"
BEGIN {
    print "This is script test3."
}

Running gawk with the test3 script produces the following results:

$ gawk -f test3
-| This is script test1.
-| This is script test2.
-| This is script test3.

The file name can, of course, be a pathname. For example:

@include "../io_funcs"

and:

@include "/usr/awklib/network"

are both valid. The AWKPATH environment variable can be of great value when using @include . The same rules for the use of the AWKPATH variable in command-line file searches (see section The AWKPATH Environment Variable) apply to @include also.

This is very helpful in constructing gawk function libraries. If you have a large script with useful, general-purpose awk functions, you can break it down into library files and put those files in a special directory. You can then include those “libraries,” either by using the full pathnames of the files, or by setting the AWKPATH environment variable accordingly and then using @include with just the file part of the full pathname. Of course, you can keep library files in more than one directory; the more complex the working environment is, the more directories you may need to organize the files to be included.

Given the ability to specify multiple -f options, the @include mechanism is not strictly necessary. However, the @include keyword can help you in constructing self-contained gawk programs, thus reducing the need for writing complex and tedious command lines. In particular, @include is very useful for writing CGI scripts to be run from web pages.

The rules for finding a source file described in The AWKPATH Environment Variable also apply to files loaded with @include .

Finally, files included with @include are treated as if they had ‘ @namespace "awk" ’ at their beginning. See section Changing The Namespace, for more information.
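The library arrangement described in this section can be sketched as a small shell function. Everything named here is hypothetical: the function include_demo, the library file mathlib.awk, and its twice() function exist only for this illustration, and the sketch assumes gawk is installed and not running in compatibility mode.

```shell
# Hypothetical helper: builds a throwaway library directory, then runs a
# gawk program that pulls the library in by short name via AWKPATH.
include_demo() {
    libdir=$(mktemp -d) || return 1
    printf 'function twice(n) { return 2 * n }\n' > "$libdir/mathlib.awk"

    # "mathlib" has no directory separator, so gawk searches AWKPATH,
    # retrying with the ".awk" suffix when the bare name is not found.
    AWKPATH="$libdir" gawk '@include "mathlib"
BEGIN { print twice(21) }'

    rm -rf "$libdir"
}
```

Calling include_demo should print 42. In practice the library directory would be a permanent location listed in AWKPATH in your shell startup file, rather than a temporary directory.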

2.8 Loading Dynamic Extensions into Your Program

This section describes a feature that is specific to gawk .

The @load keyword can be used to read external awk extensions (stored as system shared libraries). This allows you to link in compiled code that may offer superior performance and/or give you access to extended capabilities not supported by the awk language. The AWKLIBPATH variable is used to search for the extension. Using @load is completely equivalent to using the -l command-line option.

If the extension is not initially found in AWKLIBPATH , another search is conducted after appending the platform’s default shared library suffix to the file name. For example, on GNU/Linux systems, the suffix ‘ .so ’ is used:

$ gawk '@load "ordchr"; BEGIN {print chr(65)}'
-| A

This is equivalent to the following example:

$ gawk -lordchr 'BEGIN {print chr(65)}'
-| A

For command-line usage, the -l option is more convenient, but @load is useful for embedding inside an awk source file that requires access to an extension.

Writing Extensions for gawk , describes how to write extensions (in C or C++) that can be loaded with either @load or the -l option. It also describes the ordchr extension.

2.9 Obsolete Options and/or Features

This section describes features and/or command-line options from previous releases of gawk that either are not available in the current version or are still supported but deprecated (meaning that they will not be in the next release).

The process-related special files /dev/pid , /dev/ppid , /dev/pgrpid , and /dev/user were deprecated in gawk 3.1, but still worked. As of version 4.0, they are no longer interpreted specially by gawk . (Use PROCINFO instead; see Built-in Variables That Convey Information.)

2.10 Undocumented Options and Features

Use the Source, Luke!

— Obi-Wan

This section intentionally left blank.

2.11 Summary

• gawk parses arguments on the command line, left to right, to determine if they should be treated as options or as non-option arguments.

• gawk recognizes several options which control its operation, as described in Command-Line Options. All options begin with ‘-’.

• Any argument that is not recognized as an option is treated as a non-option argument, even if it begins with ‘-’. However, when an option itself requires an argument, and the option is separated from that argument on the command line by at least one space, the space is ignored, and the argument is considered to be related to the option. Thus, in the invocation, ‘gawk -F x’, the ‘x’ is treated as belonging to the -F option, not as a separate non-option argument.

• Once gawk finds a non-option argument, it stops looking for options. Therefore, all following arguments are also non-option arguments, even if they resemble recognized options.

• If no -e or -f options are present, gawk expects the program text to be in the first non-option argument.

• All non-option arguments, except program text provided in the first non-option argument, are placed in ARGV as explained in Using ARGC and ARGV, and are processed as described in Other Command-Line Arguments. Adjusting ARGC and ARGV affects how awk processes input.

• The three standard options for all versions of awk are -f, -F, and -v. gawk supplies these and many others, as well as corresponding GNU-style long options.

• Nonoption command-line arguments are usually treated as file names, unless they have the form ‘var=value’, in which case they are taken as variable assignments to be performed at that point in processing the input.

• You can use a single minus sign (‘-’) to refer to standard input on the command line. gawk also lets you use the special file name /dev/stdin.

• gawk pays attention to a number of environment variables. AWKPATH, AWKLIBPATH, and POSIXLY_CORRECT are the most important ones.

• gawk’s exit status conveys information to the program that invoked it. Use the exit statement from within an awk program to set the exit status.

• gawk allows you to include other awk source files into your program using the @include statement and/or the -i and -f command-line options.

• gawk allows you to load additional functions written in C or C++ using the @load statement and/or the -l option. (This advanced feature is described later, in Writing Extensions for gawk.)

3 Regular Expressions

A regular expression, or regexp, is a way of describing a set of strings. Because regular expressions are such a fundamental part of awk programming, their format and use deserve a separate chapter.

A regular expression enclosed in slashes (‘/’) is an awk pattern that matches every input record whose text belongs to that set. The simplest regular expression is a sequence of letters, numbers, or both. Such a regexp matches any string that contains that sequence. Thus, the regexp ‘foo’ matches any string containing ‘foo’, and the pattern /foo/ matches any input record containing the three adjacent characters ‘foo’ anywhere in the record. Other kinds of regexps let you specify more complicated classes of strings.

Initially, the examples in this chapter are simple. As we explain more about how regular expressions work, we present more complicated instances.

3.1 How to Use Regular Expressions

A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. (Normally, it only needs to match some part of the text in order to succeed.) For example, the following prints the second field of each record where the string ‘ li ’ appears anywhere in the record:

$ awk '/li/ { print $2 }' mail-list
-| 555-5553
-| 555-0542
-| 555-6699
-| 555-3430

Regular expressions can also be used in matching expressions. These expressions allow you to specify the string to match against; it need not be the entire current input record. The two operators ‘ ~ ’ and ‘ !~ ’ perform regular expression comparisons. Expressions using these operators can be used as patterns, or in if , while , for , and do statements. (See section Control Statements in Actions.) For example, the following is true if the expression exp (taken as a string) matches regexp :

exp ~ /regexp/

This example matches, or selects, all input records with the uppercase letter ‘ J ’ somewhere in the first field:

$ awk '$1 ~ /J/' inventory-shipped
-| Jan 13 25 15 115
-| Jun 31 42 75 492
-| Jul 24 34 67 436
-| Jan 21 36 64 620

So does this:

awk '{ if ($1 ~ /J/) print }' inventory-shipped

This next example is true if the expression exp (taken as a character string) does not match regexp :

exp !~ /regexp/

The following example matches, or selects, all input records whose first field does not contain the uppercase letter ‘ J ’:

$ awk '$1 !~ /J/' inventory-shipped
-| Feb 15 32 24 226
-| Mar 15 24 34 228
-| Apr 31 52 63 420
-| May 16 34 29 208
…

When a regexp is enclosed in slashes, such as /foo/ , we call it a regexp constant, much like 5.27 is a numeric constant and "foo" is a string constant.
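The two matching operators can also be tried without any data file. In this sketch the three input records are typed inline; they are a made-up sample in the style of inventory-shipped, not the file itself.

```shell
# "~" selects records whose first field matches the regexp constant /J/;
# "!~" selects the records whose first field does not match it.
printf 'Jan 13\nFeb 15\nJun 31\n' |
    awk '$1 ~ /J/  { print "match:", $0 }
         $1 !~ /J/ { print "no match:", $0 }'
# Expected output:
#   match: Jan 13
#   no match: Feb 15
#   match: Jun 31
```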

3.2 Escape Sequences

Some characters cannot be included literally in string constants ( "foo" ) or regexp constants ( /foo/ ). Instead, they should be represented with escape sequences, which are character sequences beginning with a backslash (‘ \ ’). One use of an escape sequence is to include a double-quote character in a string constant. Because a plain double quote ends the string, you must use ‘ \" ’ to represent an actual double-quote character as a part of the string. For example:

$ awk 'BEGIN { print "He said \"hi!\" to her." }'
-| He said "hi!" to her.

The backslash character itself is another character that cannot be included normally; you must write ‘ \\ ’ to put one backslash in the string or regexp. Thus, the string whose contents are the two characters ‘ " ’ and ‘ \ ’ must be written "\"\\" .

Other escape sequences represent unprintable characters such as TAB or newline. There is nothing to stop you from entering most unprintable characters directly in a string constant or regexp constant, but they may look ugly.

The following list presents all the escape sequences used in awk and what they represent. Unless noted otherwise, all these escape sequences apply to both string constants and regexp constants:

\\
    A literal backslash, ‘\’.

\a
    The “alert” character, Ctrl-g, ASCII code 7 (BEL). (This often makes some sort of audible noise.)

\b
    Backspace, Ctrl-h, ASCII code 8 (BS).

\f
    Formfeed, Ctrl-l, ASCII code 12 (FF).

\n
    Newline, Ctrl-j, ASCII code 10 (LF).

\r
    Carriage return, Ctrl-m, ASCII code 13 (CR).

\t
    Horizontal TAB, Ctrl-i, ASCII code 9 (HT).

\v
    Vertical TAB, Ctrl-k, ASCII code 11 (VT).

\nnn
    The octal value nnn, where nnn stands for 1 to 3 digits between ‘0’ and ‘7’. For example, the code for the ASCII ESC (escape) character is ‘\033’.

\xhh…
    The hexadecimal value hh, where hh stands for a sequence of hexadecimal digits (‘0’–‘9’, and either ‘A’–‘F’ or ‘a’–‘f’). A maximum of two digits are allowed after the ‘\x’. Any further hexadecimal digits are treated as simple letters or numbers. (c.e.) (The ‘\x’ escape sequence is not allowed in POSIX awk.)

    CAUTION: In ISO C, the escape sequence continues until the first nonhexadecimal digit is seen. For many years, gawk would continue incorporating hexadecimal digits into the value until a non-hexadecimal digit or the end of the string was encountered. However, using more than two hexadecimal digits produced undefined results. As of version 4.2, only two digits are processed.

\/
    A literal slash (should be used for regexp constants only). This sequence is used when you want to write a regexp constant that contains a slash (such as /.*:\/home\/[[:alnum:]]+:.*/; the ‘[[:alnum:]]’ notation is discussed in Using Bracket Expressions). Because the regexp is delimited by slashes, you need to escape any slash that is part of the pattern, in order to tell awk to keep processing the rest of the regexp.

\"
    A literal double quote (should be used for string constants only). This sequence is used when you want to write a string constant that contains a double quote (such as "He said \"hi!\" to her."). Because the string is delimited by double quotes, you need to escape any quote that is part of the string, in order to tell awk to keep processing the rest of the string.
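A few of these sequences can be exercised directly from the shell. The strings below are arbitrary examples; only escapes that every awk accepts are used, so the nonportable ‘\x’ form is left out.

```shell
awk 'BEGIN { print "\"\\" }'            # prints the two characters "\
awk 'BEGIN { print "name\tvalue" }'     # a TAB separates the two words
awk 'BEGIN { print "\101\102\103" }'    # octal 101, 102, 103: prints ABC
```

The awk programs are enclosed in single quotes so that the shell passes the backslashes through to awk unchanged.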

In gawk , a number of additional two-character sequences that begin with a backslash have special meaning in regexps. See section gawk -Specific Regexp Operators.

In a regexp, a backslash before any character that is not in the previous list and not listed in gawk -Specific Regexp Operators means that the next character should be taken literally, even if it would normally be a regexp operator. For example, /a\+b/ matches the three characters ‘ a+b ’.
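The effect of the backslash can be seen by running an escaped and an unescaped ‘+’ side by side. The two input lines are invented for the comparison.

```shell
# /a\+b/ wants a literal plus sign; /^a+b$/ uses "+" as the
# one-or-more operator, so it matches a run of a's followed by b.
printf 'a+b\naab\n' |
    awk '/a\+b/  { print "literal plus: " $0 }
         /^a+b$/ { print "one or more a: " $0 }'
# Expected output:
#   literal plus: a+b
#   one or more a: aab
```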

For complete portability, do not use a backslash before any character not shown in the previous list or that is not an operator.

Backslash Before Regular Characters

If you place a backslash in a string constant before something that is not one of the characters previously listed, POSIX awk purposely leaves what happens as undefined. There are two choices:

Strip the backslash out
    This is what BWK awk and gawk both do. For example, "a\qc" is the same as "aqc". (Because this is such an easy bug both to introduce and to miss, gawk warns you about it.) Consider ‘FS = "[ \t]+\|[ \t]+"’ to use vertical bars surrounded by whitespace as the field separator. There should be two backslashes in the string: ‘FS = "[ \t]+\\|[ \t]+"’.

Leave the backslash alone
    Some other awk implementations do this. In such implementations, typing "a\qc" is the same as typing "a\\qc".
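The doubled backslash in the field-separator string can be checked with a one-line input. The sample record is made up for the sketch.

```shell
# Two backslashes in the string constant leave one for the regexp, so the
# vertical bar is matched literally instead of acting as alternation.
printf 'left  |  right\n' |
    awk 'BEGIN { FS = "[ \t]+\\|[ \t]+" } { print $2 }'
# Expected output:
#   right
```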

To summarize:

The escape sequences in the preceding list are always processed first, for both string constants and regexp constants. This happens very early, as soon as awk reads your program.

gawk processes both regexp constants and dynamic regexps (see section Using Dynamic Regexps), for the special operators listed in gawk-Specific Regexp Operators.

A backslash before any other character means to treat that character literally.
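One practical consequence of this ordering is that a dynamic regexp written as a string constant passes through string escape processing before it is used as a regexp. The following sketch, with invented input and variable names, matches a literal period both ways.

```shell
# Regexp constant: a single backslash reaches the regexp matcher.
const=$(printf 'a.b\n' | awk '{ gsub(/\./, "-"); print }')

# Dynamic regexp given as a string: string processing consumes one
# level of backslashes first, so two are written.
dynamic=$(printf 'a.b\n' | awk '{ gsub("\\.", "-"); print }')

printf '%s\n%s\n' "$const" "$dynamic"
```

Both commands replace the period and print ‘a-b’.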

Escape Sequences for Metacharacters

Suppose you use an octal or hexadecimal escape to represent a regexp metacharacter. (See Regular Expression Operators.) Does awk treat the character as a literal character or as a regexp operator? Historically, such characters were taken literally. (d.c.) However, the POSIX standard indicates that they should be treated as real metacharacters, which is what gawk does. In compatibility mode (see section Command-Line Options), gawk treats the characters represented by octal and hexadecimal escape sequences literally when used in regexp constants. Thus, /a\52b/ is equivalent to /a\*b/.
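Because the treatment of such escapes varies among awk implementations, versions, and modes, it can be useful to probe the awk at hand rather than assume. The following is a hedged sketch: the input line, the labels ‘operator’ and ‘literal’, and the variable name are all invented for the illustration, and no particular result is claimed.

```shell
# "aab" matches /a*b/ (the "*" acting as an operator) but does not
# contain the literal text "a*b", so the outcome shows which
# interpretation this awk applies to the octal escape \52.
probe=$(printf 'aab\n' |
        awk '{ if ($0 ~ /a\52b/) print "operator"; else print "literal" }')

printf 'octal-escaped "*" is treated as: %s\n' "$probe"
```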

3.3 Regular Expression Operators

You can combine regular expressions with special characters, called regular expression operators or metacharacters, to increase the power and versatility of regular expressions.

• Regexp Operator Details: The actual details.
• Interval Expressions: Notes on interval expressions.

3.3.1 Regexp Operators in awk

The escape sequences described earlier in Escape Sequences are valid inside a regexp. They are introduced by a ‘ \ ’ and are recognized and converted into corresponding real characters as the very first step in processing regexps.
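For instance, ‘\t’ inside a regexp constant stands for a real TAB character by the time matching takes place. The sketch below uses two made-up input lines, one with a TAB between the letters and one with a space; the variable name is hypothetical.

```shell
# Only the first input line contains a TAB between "a" and "b".
tabbed=$(printf 'a\tb\na b\n' | awk '/a\tb/ { print NR }')

printf 'matching record number: %s\n' "$tabbed"
```

Only record 1 is reported, because the second line contains a space rather than a TAB.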

Here is a list of metacharacters. All characters that are not escape sequences and that are not listed here stand for themselves:

\ This suppresses the special meaning of a character when matching. For example, ‘\$’ matches the character ‘$’.

^ This matches the beginning of a string. ‘^@chapter’ matches ‘@chapter’ at the beginning of a string,