Manpage for TXR

Sep 01, 2020

1 NAME

TXR - Programming Language (Version 243)

2 SYNOPSIS

txr [ options ] [ script-file [ data-files ... ]]

3 DESCRIPTION

TXR is a general-purpose, multi-paradigm programming language. It comprises two languages integrated into a single tool: a text scanning and extraction language referred to as the TXR Pattern Language (sometimes just "TXR"), and a general-purpose dialect of Lisp called TXR Lisp.

TXR can be used for everything from "one liner" data transformation tasks at the command line, to data scanning and extracting scripts, to full application development in a wide range of areas.

A script written in the TXR Pattern Language, also referred to in this document as a query, specifies a pattern which matches one or more sources of inputs, such as text files. Patterns can consist of large chunks of multi-line free-form text, which is matched literally against material in the input sources. Free variables occurring in the pattern (denoted by the @ symbol) are bound to the pieces of text occurring in the corresponding positions. Patterns can be arbitrarily complex, and can be broken down into named pattern functions, which may be mutually recursive.

In addition to embedded variables which implicitly match text, the TXR pattern language supports a number of directives, for matching text using regular expressions, for continuing a match in another file, for searching through a file for the place where an entire sub-query matches, for collecting lists, and for combining sub-queries using logical conjunction, disjunction and negation, and numerous others.

Patterns can contain actions which transform data and generate output. These actions can be embedded anywhere within the pattern matching logic. A common structure for small TXR scripts is to perform a complete matching session at the top of the script, and then deal with processing and reporting at the bottom.

The TXR Lisp language can be used from within TXR scripts as an embedded language, or completely stand-alone. It supports functional, imperative and object-oriented programming, and provides numerous data types such as symbols, strings, vectors, hash tables with weak reference support, lazy lists, and arbitrary-precision ("bignum") integers. It has an expressive foreign function interface (FFI) for calling into libraries and other software components that support C-language-style calls.

TXR Lisp source files as well as individual functions can be optionally compiled for execution on a virtual machine that is built into TXR. Compiled files execute and load faster, and resist reverse-engineering. Stand-alone application delivery is possible.

TXR is free software offered under the two-clause BSD license which places almost no restrictions on redistribution, and allows every conceivable use, of the whole software or any constituent part, royalty-free, free of charge, and free of any restrictions.

4 ARGUMENTS AND OPTIONS

If TXR is given no arguments, it will enter into an interactive mode. See the INTERACTIVE LISTENER section for a description of this mode. When TXR enters interactive mode this way, it prints a one-line banner announcing the program name and version, and one line of help text instructing the user how to exit.

Options which don't take an argument may be combined together. The -v and -q options are mutually exclusive. Of these two, the one which occurs in the rightmost position in the argument list dominates. The -c and -f options are also mutually exclusive; if both are specified, it is a fatal error.

-D var=value
Bind the variable var to the value value prior to processing the query. The name is in scope over the entire query, so that all occurrences of the variable are substituted and match the equivalent text. If the value contains commas, these are interpreted as separators, which give rise to a list value. For instance -D var=a,b,c creates a list of the strings "a", "b" and "c". (See Collect Directive below.) List variables provide a multiple match. That is to say, if a list variable occurs in a query, a successful match occurs if any of its values matches the text. If more than one value matches the text, the first one is taken.

-D var
Binds the variable var to an empty string value prior to processing the query.

-q
Quiet operation during matching. Certain error messages are not reported on the standard error device (but if the situations occur, they still fail the query). This option does not suppress error generation during the parsing of the query, only during its execution.

-i
If this option is present, then TXR will enter into an interactive interpretation mode after processing all options, and the input query if one is present. See the INTERACTIVE LISTENER section for a description of this mode.

-d --debugger
Invoke the interactive TXR debugger. See the DEBUGGER section. Implies --backtrace.

--backtrace
Turns on the establishment of backtrace frames for function calls so that a backtrace can be produced when an unhandled exception occurs, and in other situations. Backtraces are helpful in identifying the causes of errors, but require extra stack space and slow down execution.

-n --noninteractive
This option affects behavior related to TXR's *stdin* stream. It also has another, unrelated effect on the behavior of the interactive listener; see below. Normally, if this stream is connected to a terminal device, it is automatically marked as having the real-time property when TXR starts up (see the functions stream-set-prop and real-time-stream-p).
The -n option suppresses this behavior; the *stdin* stream remains ordinary. The TXR pattern language reads standard input via a lazy list, created by applying the lazy-stream-cons function to the *stdin* stream. If that stream is marked real-time, then the lazy list which is returned by that function has behaviors that are better suited for scanning interactive input. A more detailed explanation is given under the description of this function. If the -n option is in effect and TXR enters into the interactive listener, the listener operates in plain mode. The listener reads buffered lines from the operating system without any character-based editing features or history navigation. In plain mode, no prompts appear and no terminal control escape sequences are generated. The only output is the results of evaluation, related diagnostic messages, and any output generated by the evaluated expressions themselves.

-v
Verbose operation. Detailed logging is enabled.

-b sym=value
This option binds a Lisp global lexical variable (as if by the defparml function) to an object described by Lisp syntax. It requires an argument of the form sym=value, where sym must be, syntactically, a token denoting a bindable symbol, and value is arbitrary TXR Lisp syntax. The sym syntax is converted to the symbol it denotes, which is bound as a global lexical variable, if it is not already a variable. The value syntax is parsed to the Lisp object it denotes. This object is not subject to evaluation; the object itself is stored into the variable binding denoted by sym. Note that if sym already exists as a global variable, then it is simply overwritten. If sym is marked special, then it stays special.

-B
If the query is successful, print the variable bindings as a sequence of assignments in shell syntax that can be eval-ed by a POSIX shell. If the query fails, print the word "false".
Evaluation of this word by the shell has the effect of producing an unsuccessful termination status from the shell's eval command.

-l or --lisp-bindings
This option implies -B. Print the variable bindings in Lisp syntax instead of shell syntax.

-a num
This option implies -B. The decimal integer argument num specifies the maximum number of array dimensions to use for list-valued variable bindings. The default is 1. Additional dimensions are expressed using numeric suffixes in the generated variable names. For instance, consider the three-dimensional list arising out of a triply nested collect: ((("a" "b") ("c" "d")) (("e" "f") ("g" "h"))). Suppose this is bound to a variable V. With -a 1, this will be reported as:

V_0_0[0]="a"
V_0_1[0]="b"
V_1_0[0]="c"
V_1_1[0]="d"
V_0_0[1]="e"
V_0_1[1]="f"
V_1_0[1]="g"
V_1_1[1]="h"

With -a 2, it comes out as:

V_0[0][0]="a"
V_1[0][0]="b"
V_0[0][1]="c"
V_1[0][1]="d"
V_0[1][0]="e"
V_1[1][0]="f"
V_0[1][1]="g"
V_1[1][1]="h"

The leftmost bracketed index is the most major index. That is to say, the dimension order is: NAME_m_m+1_..._n[1][2]...[m-1].

-c query
Specifies the query in the form of a command-line argument. If this option is used, the script-file argument is omitted. The first non-option argument, if there is one, now specifies the first input source rather than a query. Unlike queries read from a file, (non-empty) queries specified as arguments using -c do not have to properly end in a newline. Internally, TXR adds the missing newline before parsing the query. Thus -c "@a" is a valid query which matches a line.

Example: a shell script which uses TXR to read two lines "1" and "2" from standard input, binding them to variables a and b. Standard input is specified as - and the data comes from shell "here document" redirection:

code:

#!/bin/sh
txr -B -c "@a
@b" - <<!
1
2
!

output:

a=1
b=2

The @; comment syntax can be used for better formatting:

txr -B -c "@;
@a
@b"

-f script-file
Specifies the file from which the query is to be read, instead of the script-file argument. This is useful in #! ("hash bang") scripts. (See Hash Bang Support below.)

-e expression
Evaluates a TXR Lisp expression for its side effects, without printing its value. Can be specified more than once. The script-file argument becomes optional if -e is used at least once. If the evaluation of every expression evaluated this way terminates normally, and there is no script-file argument, then TXR terminates with a successful status.

-p expression
Just like -e but prints the value of expression using the prinl function.

-P expression
Like -p but prints using the pprinl function.

-t expression
Like -p but prints using the tprint function.

-C number --compat=number
Requests TXR to behave in a manner that is compatible with the specified version of TXR. This makes a difference in situations when a release of TXR breaks backward compatibility. If some version N+1 deliberately introduces a change which is backward incompatible, then -C N can be used to request the old behavior. The requested value of N can be too low, in which case TXR will complain and exit with an unsuccessful termination status. This indicates that TXR refuses to be compatible with such an old version. Users requiring the behavior of that version will have to install an older version of TXR which supports that behavior, or even that exact version. If the option is specified more than once, the behavior is unspecified. Compatibility can also be requested via the TXR_COMPAT environment variable instead of the -C option. For more information, see the COMPATIBILITY section.

--gc-delta=number
The number argument to this option must be a decimal integer. It represents a megabyte value, the "GC delta": one megabyte is 1048576 bytes. The "GC delta" controls an aspect of the garbage collector behavior.
See the gc-set-delta function for a description.

--debug-autoload
This option turns on debugging, like --debugger, but also requests stepping into the auto-load processing of TXR Lisp library code. Normally, debugging through the evaluations triggered by auto-loading is suppressed. Implies --backtrace.

--debug-expansion
This option turns on debugging, like --debugger, but also requests stepping into the parse-time macro-expansion of TXR Lisp code embedded in TXR queries. Normally, this is suppressed. Implies --backtrace.

--help
Prints a usage summary on standard output, and terminates successfully.

--license
Prints the software license. This depends on the software being installed such that the LICENSE file is in the data directory. Use of TXR implies agreement with the liability disclaimer in the license.

--version
Prints the program version on standard output, and terminates successfully.

--args
The --args option provides a way to encode multiple arguments as a single argument, which is useful on some systems which have limitations in their implementation of the "hash bang" mechanism. For details about its special syntax, see Hash Bang Support below. It is also useful in stand-alone application deployment. See the section STAND-ALONE APPLICATION SUPPORT, in which example uses of --args are shown.

--eargs
The --eargs option (extended --args) is like --args but must be followed by an argument. The argument is removed from the argument list and substituted in place of occurrences of {} among the arguments expanded from the --eargs syntax.

--lisp --compiled
These options influence the treatment of query files which do not have a suffix indicating their type. The --lisp option causes an unsuffixed file to be treated as Lisp source; and --compiled causes it to be treated as a compiled file.
Moreover, if --lisp is specified, and an unsuffixed file does not exist, then TXR will add the ".tl" suffix and try the file again; and --compiled will similarly add the ".tlo" suffix and try opening the file again. In the same situation, if neither --lisp nor --compiled has been specified, TXR will first try adding the ".txr" suffix. If that fails, then the ".tlo" suffix will be tried, and finally ".tl". Note that --lisp and --compiled influence how the argument of the -f option is treated, but only if they precede that option.

--reexec
On platforms which support the POSIX exec family of functions, this option causes TXR to re-execute itself. The re-executed image receives the remaining arguments which follow the --reexec argument. Note: this option is useful for supporting setuid operation in "hash bang" scripts. On some platforms, the interpreter designated by a "hash bang" script runs without altered privilege, even if that interpreter is installed setuid. If the interpreter is executed directly, then setuid applies to it, but not if it is executed via "hash bang". If the --reexec option is used in the interpreter command line of such a script, the interpreter will re-execute itself, thereby gaining the setuid privilege. The re-executed image will then obtain the script name from the arguments which are passed to it and determine whether that script will run setuid. See the section SETUID/SETGID OPERATION.

--gc-debug
This option enables a behavior which stresses the garbage collector with frequent garbage collection requests. The purpose is to make it more likely to reproduce certain kinds of bugs. Use of this option severely degrades the performance of TXR.

--vg-debug
If TXR is built with Valgrind support, then this option is available. It enables code which uses the Valgrind API to integrate with the Valgrind debugger, for more accurate tracking of garbage-collected objects.
For example, objects which have been reclaimed by the garbage collector are marked as inaccessible, and marked as uninitialized when they are allocated again.

--dv-regex
If this option is used, then all regular expressions are treated using the derivative-based back-end, and the NFA-based regex implementation is disabled. Normally, only regular expressions which require the intersection and complement operators are handled using the derivative back-end. This option makes it possible to test that back-end on test cases that it wouldn't normally receive.

--
Signifies the end of the option list.

-
This argument is not interpreted as an option, but treated as a filename argument. After the first such argument, no more options are recognized. Even if another argument looks like an option, it is treated as a name. This special argument - means "read from standard input" instead of a file. The script-file, or any of the data files, may be specified using this option. If two or more files are specified as -, the behavior is system-dependent. It may be possible to indicate EOF from the interactive terminal, and then specify more input which is interpreted as the second file, and so forth.
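The naming scheme of the -a option can be illustrated with a short Python sketch. This is only an illustration of the scheme as described above, not TXR's implementation; the function name is invented for the example.

```python
def shell_bindings(name, value, maxdim=1):
    """Flatten a nested list into shell-style assignments: the first
    maxdim dimensions become bracketed array subscripts (major order),
    and the remaining dimensions become _N suffixes on the name."""
    out = []

    def walk(v, path):
        if isinstance(v, list):
            for i, item in enumerate(v):
                walk(item, path + [i])
        else:
            brackets = "".join("[%d]" % i for i in path[:maxdim])
            suffixes = "".join("_%d" % i for i in path[maxdim:])
            out.append('%s%s%s="%s"' % (name, suffixes, brackets, v))

    walk(value, [])
    return out

v = [[["a", "b"], ["c", "d"]], [["e", "f"], ["g", "h"]]]
print(shell_bindings("V", v, 1)[0])  # → V_0_0[0]="a"
print(shell_bindings("V", v, 2)[0])  # → V_0[0][0]="a"
```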

After the options, the remaining arguments are files. The first file argument specifies the script file, and is mandatory if the -f option has not been specified, and TXR isn't operating in interactive mode or evaluating expressions from the command line via -e or one of the related options. A file argument consisting of a single - means to read the standard input instead of opening a file.

Specifying standard input as a source with an explicit - argument is unnecessary. If no data source arguments are present, then TXR scans standard input by default. This was not true in versions of TXR prior to 171; see the COMPATIBILITY section.

TXR begins by reading the script. In the case of the TXR pattern language, the entire query is scanned and internalized; then, if it is free of syntax errors, it begins executing. (TXR Lisp is processed differently, form by form.) On the other hand, the pattern language reads data files in a lazy manner. A file isn't opened until the query demands material from that file, and then the contents are read on demand, not all at once.

The suffix of the script-file is significant. If the name has no suffix, or if it has a ".txr" suffix, then it is assumed to be in the TXR pattern language. If it has the ".tl" suffix, then it is assumed to be TXR Lisp. The --lisp option changes the treatment of unsuffixed script file names, causing them to be interpreted as TXR Lisp.

If an unsuffixed script file name is specified, and cannot be opened, then TXR will add the ".txr" suffix and try again. If that fails, the ".tlo" suffix is tried, and finally ".tl", in which last case the file is treated as TXR Lisp. If the --lisp option has been specified, then TXR tries only the ".tl" suffix.
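The suffix-resolution order, including the --lisp and --compiled variations described under those options, can be sketched as follows. This is an illustrative Python sketch with an invented function name; exists stands for a predicate such as os.path.exists.

```python
def resolve_script(path, exists, lisp=False, compiled=False):
    """Try the script name as given; if it cannot be found, try the
    suffixes in the documented order: --lisp tries only .tl,
    --compiled tries only .tlo, and otherwise .txr, .tlo, then .tl."""
    if exists(path):
        return path
    if lisp:
        trials = [".tl"]
    elif compiled:
        trials = [".tlo"]
    else:
        trials = [".txr", ".tlo", ".tl"]
    for suffix in trials:
        if exists(path + suffix):
            return path + suffix
    return None

# Example: only "foo.tl" exists in the search set.
print(resolve_script("foo", {"foo.tl"}.__contains__))  # → foo.tl
```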

A TXR Lisp file is processed as if by the load macro: forms from the file are read and evaluated. If the forms do not terminate the TXR process or throw an exception, and there are no syntax errors, then TXR terminates successfully after evaluating the last form. If syntax errors are encountered in a form, then TXR terminates unsuccessfully. TXR Lisp is documented in the section TXR LISP.

If a query file is specified, but no file arguments, it is up to the query to open a file, pipe or standard input via the @(next) directive prior to attempting to make a match. If a query attempts to match text, but has run out of files to process, the match fails.

5 STATUS AND ERROR REPORTING

TXR sends errors and verbose logs to the standard error device. The following paragraphs apply when TXR is run without enabling verbose mode with -v , or the printing of variable bindings with -B or -a .

If the command line arguments are incorrect, TXR issues an error diagnostic and terminates with a failed status.

If the script-file specifies a query, and the query has a malformed syntax, TXR likewise issues error diagnostics and terminates with a failed status.

If the query fails due to a mismatch, TXR terminates with a failed status. No diagnostics are issued.

If the query is well-formed, and matches, then TXR issues no diagnostics, and terminates with a successful status.

In verbose mode (option -v ), TXR issues diagnostics on the standard error device even in situations which are not erroneous.

In bindings-printing mode (options -B or -a) , TXR prints the word false if the query fails, and exits with a failed termination status. If the query succeeds, the variable bindings, if any, are output on standard output.

If the script-file is TXR Lisp, then it is processed form by form. Each top-level Lisp form is evaluated after it is read. If any form is syntactically malformed, TXR issues diagnostics and terminates unsuccessfully. This is somewhat different from how the pattern language is treated: a script in the pattern language is parsed in its entirety before being executed.

6 BASIC TXR SYNTAX

6.1 Comments

A query may contain comments which are delimited by the sequence @; and extend to the end of the line. Whitespace can occur between the @ and ; . A comment which begins on a line swallows that entire line, as well as the newline which terminates it. In essence, the entire comment line disappears. If the comment follows some material in a line, then it does not consume the newline. Thus, the following two queries are equivalent:

1. @a@; comment: match whole line against variable @a
@; this comment disappears entirely
@b

2. @a
@b

The comment after the @a does not consume the newline, but the comment which follows does. Without this intuitive behavior, line comments would give rise to empty lines that must match empty lines in the data, leading to spurious mismatches.
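The comment rule can be modeled in a few lines of Python. This is a sketch of the rule just described, not TXR's parser; the function name is invented.

```python
import re

def strip_comments(query):
    """Remove @; comments: a comment that begins a line swallows the
    whole line including its newline; a comment that follows other
    material removes only the text from @; to the end of the line."""
    kept = []
    for line in query.split("\n"):
        m = re.search(r"@\s*;", line)   # whitespace may occur between @ and ;
        if m is None:
            kept.append(line)
        elif m.start() > 0:
            kept.append(line[:m.start()])
        # else: whole-line comment; the line and its newline disappear
    return "\n".join(kept)

print(strip_comments("@a@; match whole line\n@; gone entirely\n@b"))
# prints "@a" and "@b" on two consecutive lines
```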

Instead of the ; character, the # character can be used. This is an obsolescent feature.

6.2 Hash Bang Support

TXR has several features which support use of the "hash bang" convention for creating apparently stand-alone executable programs.

6.2.1 Basic Hash Bang

The "hash bang" (#!) mechanism allows TXR queries to be turned into standalone executable programs in the POSIX environment. Unlike most interpreters, TXR applies special processing to the #! line, which is described below, in the section Argument Generation with the Null Hack.

Shell session example: create a simple executable program called "hello.txr" and run it. This assumes TXR is installed in /usr/bin .

$ cat > hello.txr
#!/usr/bin/txr
@(bind a "Hey")
@(output)
Hello, world!
@(end)
$ chmod a+x hello.txr
$ ./hello.txr
Hello, world!

When this plain hash bang line is used, TXR receives the name of the script as an argument. Therefore, it is not possible to pass additional options to TXR. For instance, if the above script is invoked like this

$ ./hello.txr -B

the -B option isn't processed by TXR, but treated as an additional argument, just as if txr scriptname -B had been executed directly.

This behavior is useful if the script author does not want to expose the TXR options to the user of the script.

However, the hash bang line can use the -f option:

#!/usr/bin/txr -f

Now, the name of the script is passed as an argument to the -f option, and TXR will look for more options after that, so that the resulting program appears to accept TXR options. Now we can run

$ ./hello.txr -B
Hello, world!
a="Hey"

The -B option is honored.

6.2.2 Argument Generation with --args and --eargs

Some systems allow only one argument in the hash bang line. On such systems, a line which tries to pass an additional option together with -f does not work:

#!/usr/bin/txr -B -f

To support systems like this, TXR supports the special argument --args, as well as an extended version, --eargs. With --args, it is possible to encode multiple arguments into one argument. The --args option must be followed by a separator character, chosen by the programmer. The characters after that are split into multiple arguments on the separator character. The --args option is then removed from the argument list and replaced with these arguments, which are processed in its place.

Example:

#!/usr/bin/txr --args:-B:-f

The above has the same behavior as

#!/usr/bin/txr -B -f

on a system which supports multiple arguments in hash bang. The separator character is the colon, and so the remainder of that argument, -B:-f , is split into the two arguments -B -f .
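The splitting performed by --args can be condensed into a Python sketch (an illustration of the described behavior, not TXR's code; the function name is invented):

```python
def expand_args(argv):
    """Replace each --args<sep><text> argument with the fields
    obtained by splitting <text> on the separator character <sep>."""
    out = []
    for arg in argv:
        if arg.startswith("--args") and len(arg) > len("--args"):
            sep = arg[len("--args")]                 # first char after --args
            out.extend(arg[len("--args") + 1:].split(sep))
        else:
            out.append(arg)
    return out

print(expand_args(["--args:-B:-f", "hello.txr"]))  # → ['-B', '-f', 'hello.txr']
```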

The --eargs mechanism allows an additional flexibility. An --eargs argument must be followed by one more argument.

After --eargs performs the argument splitting in the same manner as --args , any of the arguments which it produces which are the two-character sequence {} are replaced with that following argument. Whether or not the replacement occurs, that following argument is then removed.

Example:

#!/usr/bin/txr --eargs:-B:{}:--foo:42

This has an effect which cannot be replicated in any known implementation of the hash bang mechanism. Suppose that this hash bang line is placed in a script called script.txr . When this script is invoked with arguments, as in:

script.txr a b c

then TXR is invoked similarly to:

/usr/bin/txr --eargs:-B:{}:--foo:42 script.txr a b c

Then, when --eargs processing takes place, firstly the argument sequence

-B {} --foo 42

is produced by splitting into four fields using the : character as the separator. Then, within these four fields, all occurrences of {} are replaced with the following argument script.txr , resulting in:

-B script.txr --foo 42

Furthermore, that script.txr argument is removed from the remaining argument list.

The four arguments are then substituted in place of the original --eargs:-B:{}:--foo:42 syntax.

The resulting TXR invocation is, therefore:

/usr/bin/txr -B script.txr --foo 42 a b c

Thus, --eargs allows some arguments to be encoded into the interpreter script, such that script name is inserted anywhere among them, possibly multiple times. Arguments for the interpreter can be encoded, as well as arguments to be processed by the script.
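The --eargs transformation walked through above can be sketched in Python (illustrative only; the function name is invented):

```python
def expand_eargs(argv):
    """Split an --eargs<sep><text> argument like --args, then replace
    each {} field with the following argument, which is removed."""
    out, i = [], 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("--eargs") and len(arg) > len("--eargs"):
            sep = arg[len("--eargs")]
            fields = arg[len("--eargs") + 1:].split(sep)
            follower = argv[i + 1]      # --eargs must be followed by an argument
            out.extend(follower if f == "{}" else f for f in fields)
            i += 2                      # consume --eargs and the follower
        else:
            out.append(arg)
            i += 1
    return out

print(expand_eargs(["--eargs:-B:{}:--foo:42", "script.txr", "a", "b", "c"]))
# → ['-B', 'script.txr', '--foo', '42', 'a', 'b', 'c']
```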

6.2.3 Argument Generation with the Null Hack

The --args and --eargs mechanisms require the hash bang line to encode the exact path of the TXR executable. A commonly used alternative is to locate the interpreter via the env utility:

#!/usr/bin/env txr

Here, the env utility searches for the txr program in the directories indicated by the PATH variable, which liberates the script from having to encode the exact location where the program is installed. However, if the operating system allows only one argument in the hash bang mechanism, then no arguments can be passed to the program.

To mitigate this problem, TXR supports a special feature in its hash bang support. If the hash bang #! line contains a null byte, then the text after the null byte, up to the end of the line, is split into fields using the space character as a separator, and these fields are inserted into the command line. This manipulation happens during command-line processing, prior to the execution of the file, which happens after command-line processing.

If this processing is applied to a file that is specified using the -f option, then the arguments which arise from the special processing are inserted after that option and its argument. If this processing is applied to the file which is the first non-option argument, then the options are inserted before that argument. However, care is taken not to process that argument a second time.

In either situation, processing of the command-line options continues, and the arguments which are processed next are the ones which were just inserted. This is true even if the options had been inserted as a result of processing the first non-option argument, which would ordinarily signal the termination of option processing.
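The extraction step itself is simple. A Python sketch of the splitting rule described above (illustrative; the function name is invented):

```python
def hashbang_null_args(first_line):
    """Given the first line of a script, return the extra arguments
    encoded after a NUL byte in its hash bang line: the text after
    the NUL, up to the end of the line, split on spaces."""
    if not first_line.startswith("#!"):
        return []
    nul = first_line.find("\0")
    if nul < 0:
        return []
    tail = first_line.rstrip("\n")[nul + 1:]
    return [field for field in tail.split(" ") if field]

print(hashbang_null_args("#!/usr/bin/env txr\0-a 3\n"))  # → ['-a', '3']
```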

In the following examples, it is assumed that the script is named, and invoked, as /home/jenny/foo.txr , and is given arguments --bar abc , and that txr resolves to /usr/bin/txr . The <NUL> code indicates a literal ASCII NUL character (a zero byte).

Basic example:

#!/usr/bin/env txr<NUL>-a 3

Here, env searches for txr , finding it in /usr/bin . Thus, including the executable name, TXR receives this full argument list:

/usr/bin/txr /home/jenny/foo.txr --bar abc

The first non-option argument is the name of the script. TXR opens the script, and notices that it begins with a hash bang line. It consumes the hash bang line and finds the null byte inside it, retrieving the character string after it, which is "-a 3". This is split into the two arguments -a and 3 , which are then inserted into the command line ahead of the script name. The effective command line then becomes:

/usr/bin/txr -a 3 /home/jenny/foo.txr --bar abc

Command line option processing continues, beginning with the -a option. After the option is processed, /home/jenny/foo.txr is encountered again. This time it is not opened a second time; it signals the end of option processing, exactly as it would immediately do if it hadn't triggered the insertion of any arguments.

Advanced example: use env to invoke txr , passing options to the interpreter and to the script:

#!/usr/bin/env txr<NUL>--eargs:-C:175:{}:--debug

This example shows how --eargs can be used in conjunction with the null hack. When txr begins executing, it receives the arguments

/usr/bin/txr /home/jenny/foo.txr

The script file is opened, and the arguments delimited by the null character in the hash bang line are inserted, resulting in the effective command line:

/usr/bin/txr --eargs:-C:175:{}:--debug /home/jenny/foo.txr

Next, --eargs is processed in the ordinary way, transforming the command line into:

/usr/bin/txr -C 175 /home/jenny/foo.txr --debug

The name of the script file is encountered, and signals the end of option processing. Thus txr receives the -C option, instructing it to emulate some behaviors from version 175, and the /home/jenny/foo.txr script receives --debug as its argument: it executes with the *args* list containing one element, the character string "--debug" .

The hash bang null hack feature was introduced in TXR 177. Previous versions ignore the hash bang line, performing no special processing. Where a risk exists that programs which depend on the feature might be executed by an older version of TXR, care must be taken to detect and handle that situation, either by means of the txr-version variable, or else by some logic which infers that the processing of the hash bang line hadn't been performed.

6.2.4 Passing Options to TXR via Hash Bang Null Hack

It is possible to use the Hash Bang Null Hack, such that the resulting executable program recognizes TXR options. This is made possible by a special behavior in the processing of the -f option.

For instance, suppose that the effect of the following familiar hash bang line is required:

#!/path/to/txr -f

However, suppose there is also a requirement to use the env utility to find TXR. Furthermore, the operating system allows only one hash bang argument. Using the Null Hack, this is rewritten as:

#!/usr/bin/env txr<NUL>-f

then if the script is invoked with arguments -i a b c , the command line will ultimately be transformed into:

/path/to/txr -f /path/to/scriptfile -i a b c

which allows TXR to process the -i option, leaving a , b and c as arguments for the script.

However, note that there is a subtle issue with the -f option that has been inserted via the Null Hack: namely, this insertion happens after TXR has opened the script file and read the hash bang line from it. This means that when the inserted -f option is being processed, the script file is already open. A special behavior occurs. The -f option processing notices that the argument to -f is identical to the path name of the script file that TXR has already opened for processing. The -f option and its argument are then skipped.

6.2.5 Hash Bang and Setuid

For setuid operation of "hash bang" scripts, see the description of the --reexec option above, and the section SETUID/SETGID OPERATION.

6.3 Whitespace

Outside of directives, whitespace is significant in TXR queries, and represents a pattern match for whitespace in the input. An extent of text consisting of an undivided mixture of tabs and spaces is a whitespace token.

Whitespace tokens match a precisely identical piece of whitespace in the input, with one exception: a whitespace token consisting of precisely one space has a special meaning. It is equivalent to the regular expression @/[ ]+/ : match an extent of one or more spaces (but not tabs!). Multiple consecutive spaces do not have this meaning.

Thus, the query line "a b" (one space between a and b ) matches "a b" with any number of spaces between the two letters.

For matching a single space, the syntax @\ can be used (backslash-escaped space).

It is more often necessary to match multiple spaces than to match exactly one space, so this rule simplifies many queries and adds inconvenience to only a few.
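The single-space rule can be expressed as a mapping from whitespace tokens to regular expressions. The following Python sketch states the rule as described above (not TXR's matcher; the function name is invented):

```python
import re

def whitespace_token_regex(token):
    """A token of exactly one space matches one or more spaces; any
    other run of tabs and spaces matches itself literally."""
    if token == " ":
        return r"[ ]+"
    return re.escape(token)

# One space in the query matches any number of spaces in the input:
print(bool(re.fullmatch(whitespace_token_regex(" "), "    ")))   # → True
# Two spaces match exactly two spaces, so three spaces fail:
print(bool(re.fullmatch(whitespace_token_regex("  "), "   ")))   # → False
```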

In output clauses, string and character literals and quasiliterals, a space token denotes a space.

6.4 Text

Query material which is not escaped by the special character @ is literal text, which matches input character for character. Text which occurs at the beginning of a line matches the beginning of a line. Text which starts in the middle of a line, other than following a variable, must match exactly at the current position, where the previous match left off. Moreover, if the text is the last element in the line, its match is anchored to the end of the line.

An empty query line matches an empty line in the input. Note that an empty input stream does not contain any lines, and therefore is not matched by an empty line. An empty line in the input is represented by a newline character which is either the first character of the file, or follows a previous newline-terminated line.

Input streams which end without terminating their last line with a newline are tolerated, and are treated as if they had the terminator.

Text which follows a variable has special semantics, described in the section Variables below.

A query may not leave a line of input partially matched. If any portion of a line of input is matched, it must be entirely matched, otherwise a matching failure results. However, a query may leave unmatched lines. Matching only four lines of a ten line file is not a matching failure. The eof directive can be used to explicitly match the end of a file.
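The eof directive mentioned above can be sketched as follows (hypothetical data):

```txr
code:
@first
@(eof)

data:
hello

result:
first="hello"
```

If the data contained a second line, the match would fail at @(eof), whereas without the @(eof) line the extra line would simply be left unmatched.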

In the following example, the query matches the text, even though the text has an extra line.

code:
Four score and seven
years ago our

data:
Four score and seven
years ago our
forefathers

In the following example, the query fails to match the text, because the text has extra material on one line that is not matched:

code:
I can carry nearly eighty gigs
in my head

data:
I can carry nearly eighty gigs of data
in my head

Needless to say, if the text has insufficient material relative to the query, that is a failure also.

To match arbitrary material from the current position to the end of a line, the "match any sequence of characters, including empty" regular expression @/.*/ can be used. Example:

code:
I can carry nearly eighty gigs@/.*/

data:
I can carry nearly eighty gigs of data

In this example, the query matches, since the regular expression matches the string "of data". (See Regular Expressions section below).

Another way to do this is:

code:
I can carry nearly eighty gigs@(skip)

6.5 Special Characters in Text

Control characters may be embedded directly in a query (with the exception of newline characters). An alternative to embedding is to use escape syntax. The following escapes are supported:

@\ newline
A backslash immediately followed by a newline introduces a physical line break without breaking up the logical line. Material following this sequence continues to be interpreted as a continuation of the previous line, so that indentation can be introduced to show the continuation without appearing in the data.

@\ space
A backslash followed by a space encodes a space. This is useful in line continuations when it is necessary for some or all of the leading spaces to be preserved. For instance the two-line sequence

abcd@\
  @\ efg

is equivalent to the line abcd efg . The two spaces before the @\ in the second line are consumed. The spaces after are preserved.

@\a
Alert character (ASCII 7, BEL).

@\b
Backspace (ASCII 8, BS).

@\t
Horizontal tab (ASCII 9, HT).

@\n
Line feed (ASCII 10, LF). Serves as abstract newline on POSIX systems.

@\v
Vertical tab (ASCII 11, VT).

@\f
Form feed (ASCII 12, FF). This character clears the screen on many kinds of terminals, or ejects a page of text from a line printer.

@\r
Carriage return (ASCII 13, CR).

@\e
Escape (ASCII 27, ESC).

@\x hex-digits
A @\x immediately followed by a sequence of hex digits is interpreted as a hexadecimal numeric character code. For instance @\x41 is the ASCII character A. If a semicolon character immediately follows the hex digits, it is consumed, and characters which follow are not considered part of the hex escape even if they are hex digits.

@\ octal-digits
A @\ immediately followed by a sequence of octal digits (0 through 7) is interpreted as an octal character code. For instance @\010 is character 8, same as @\b . If a semicolon character immediately follows the octal digits, it is consumed, and subsequent characters are not treated as part of the octal escape, even if they are octal digits.

Note that if a newline is embedded into a query line with @\n , this does not split the line into two; it is embedded into the line and thus cannot match anything. However, @\n may be useful in the @(cat) directive and in @(output) .

6.6 Character Handling and International Characters

TXR represents text internally using wide characters, which are used to represent Unicode code points. Script source code, as well as all data sources, are assumed to be in the UTF-8 encoding. In TXR and TXR Lisp source, extended characters can be used directly in comments, literal text, string literals, quasiliterals and regular expressions. Extended characters can also be expressed indirectly using hexadecimal or octal escapes. On some platforms, wide characters may be restricted to 16 bits, so that TXR can only work with characters in the BMP (Basic Multilingual Plane) subset of Unicode.

TXR does not use the localization features of the system library; its handling of extended characters is not affected by environment variables like LANG and LC_CTYPE . The program reads and writes only the UTF-8 encoding.

If TXR encounters invalid bytes in the UTF-8 input, what happens depends on the context in which this occurs. In a query, comments are read without regard for encoding, so invalid encoding bytes in comments are not detected. A comment is simply a sequence of bytes terminated by a newline. In lexical elements which represent text, such as string literals, invalid or unexpected encoding bytes are treated as syntax errors. The scanner issues an error message, then discards a byte and resumes scanning. Certain sequences pass through the scanner without triggering an error, namely some UTF-8 overlong sequences. These are caught when the lexeme is subject to UTF-8 decoding, and treated in the same manner as other UTF-8 data, as described in the following paragraph.

Invalid bytes in data are treated as follows. When an invalid byte is encountered in the middle of a multibyte character, or if the input ends in the middle of a multibyte character, or if a character is extracted which is encoded as an overlong form, the UTF-8 decoder returns to the starting byte of the ill-formed multibyte character, and extracts just that byte, mapping it to the Unicode character range U+DC00 through U+DCFF. The decoding resumes afresh at the following byte, expecting that byte to be the start of a UTF-8 code.

Furthermore, because TXR internally uses a null-terminated character representation of strings which easily interoperates with C language interfaces, when a null character is read from a stream, TXR converts it to the code U+DC00. On output, this code converts back to a null byte, as explained in the previous paragraph. By means of this representational trick, TXR can handle textual data containing null bytes.

6.7 Regular Expression Directives

In place of a piece of text (see section Text above), a regular expression directive may be used, which has the following syntax:

@/RE/

where the RE part enclosed in slashes represents regular expression syntax (described in the section Regular Expressions below).

Long regular expressions can be broken into multiple lines using a backslash-newline sequence. Whitespace before the sequence or after the sequence is not significant, so the following two are equivalent:

@/reg \
ular/

@/regular/

There may not be whitespace between the backslash and newline.

Whereas literal text simply represents itself, a regular expression denotes a (potentially infinite) set of texts. The regular expression directive matches the longest piece of text (possibly empty) which belongs to the set denoted by the regular expression. The match is anchored to the current position; thus if the directive is the first element of a line, the match is anchored to the start of a line. If the regular expression directive is the last element of a line, it is anchored to the end of the line also: the regular expression must match the text from the current position to the end of the line.

Even if the regular expression matches the empty string, the match will fail if the input is empty, or has run out of data. For instance suppose the third line of the query is the regular expression @/.*/ , but the input is a file which has only two lines. This will fail: the data has no line for the regular expression to match. A line containing no characters is not the same thing as the absence of a line, even though both abstractions imply an absence of characters.
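The failing case described above can be sketched as follows (hypothetical data):

```txr
code:
first line
@/.*/

data:
first line
```

The match fails: the data has no second line for the regular expression to face, even though @/.*/ can match an empty string.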

Like text which follows a variable, a regular expression directive which follows a variable has special semantics, described in the section Variables below.

6.8 Variables

Much of the query syntax consists of arbitrary text, which matches file data character for character. Embedded within the query may be variables and directives which are introduced by a @ character. Two consecutive @ characters encode a literal @ .

A variable matching or substitution directive is written in one of several ways:



@ sident

@{ bident }

@* sident

@*{ bident }

@{ bident / regex /}

@{ bident ( fun [ arg ... ])}

@{ bident number }

The forms with an * indicate a long match, see Longest Match below. The last three forms with the embedded regexp / regex / or number or function have special semantics; see Positive Match below.

The identifier t cannot be used as a name; it is a reserved symbol which denotes the value true. An attempt to use the variable @t will result in an exception. The symbol nil can be used where a variable name is required syntactically, but it has special semantics, described in a section below.

A sident is a "simple identifier" form which is not delimited by braces.

A sident consists of any combination of one or more letters, numbers, and underscores. It may not look like a number, so that for instance 123 is not a valid sident , but 12A is valid. Case is sensitive, so that FOO is different from foo , which is different from Foo .

The braces around an identifier can be used when material which follows would otherwise be interpreted as being part of the identifier. When a name is enclosed in braces it is a bident .

The following additional characters may be used as part of bident which are not allowed in a sident :

! $ % & * + - < = > ? \ ~

Moreover, most Unicode characters beyond U+007F may appear in a bident , with certain exceptions. A character may not be used if it is any of the Unicode space characters, a member of the high or low surrogate region, a member of any Unicode private use area, or is one of the two characters U+FFFE or U+FFFF.

The rule still holds that a name cannot look like a number so +123 is not a valid bident but these are valid: a->b , *xyz* , foo-bar .

The syntax @FOO_bar introduces the name FOO_bar , whereas @{FOO}_bar means the variable named "FOO" followed by the text "_bar" . There may be whitespace between the @ and the name, or opening brace. Whitespace is also allowed in the interior of the braces. It is not significant.
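A sketch of the brace disambiguation just described (hypothetical data):

```txr
code:
@{FOO}_bar

data:
hello_bar

result:
FOO="hello"
```

With @FOO_bar instead, the whole line would be matched against a single variable named FOO_bar.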

If a variable has no prior binding, then it specifies a match. The match is determined from some current position in the data: the character which immediately follows all that has been matched previously. If a variable occurs at the start of a line, it matches some text at the start of the line. If it occurs at the end of a line, it matches everything from the current position to the end of the line.

6.9 Negative Match

If a variable is one of the plain forms



@ sident

@{ bident }

@* sident

@*{ bident }

then this is a "negative match". The extent of the matched text (the text bound to the variable) is determined by looking at what follows the variable, and ranges from the current position to some position where the following material finds a match. This is why this is called a "negative match": the spanned text which ends up bound to the variable is that in which the match for the trailing material did not occur.

A variable may be followed by a piece of text, a regular expression directive, a function call, a directive, another variable, or nothing (i.e. occurs at the end of a line). These cases are described in detail below.

6.9.1 Variable Followed by Nothing

code:
a b c @FOO

data:
a b c defghijk

result:
FOO="defghijk"

6.9.2 Variable Followed by Text

For the purpose of determining the negative match, text is defined as a sequence of literal text and regular expressions, not divided by a directive. So for instance in

@a:@/foo/bcd e

the variable @a is considered to be followed by ":@/foo/bcd e" .

If a variable is followed by text, then the extent of the negative match is determined by searching for the first occurrence of that text within the line, starting at the current position.

The variable matches everything between the current position and the matching position (not including the matching position). Any whitespace which follows the variable (and is not enclosed inside braces that surround the variable name) is part of the text. For example:

code:
a b @FOO e f

data:
a b c d e f

result:
FOO="c d"

In the above example, the pattern text "a b " matches the data "a b " . So when the @FOO variable is processed, the data being matched is the remaining "c d e f" . The text which follows @FOO is " e f" . This is found within the data "c d e f" at position 3 (counting from 0). So positions 0-2 ("c d") constitute the matching text which is bound to FOO.

6.9.3 Variable Followed by a Function Call or Directive

If the variable is followed by a function call, or a directive, the extent is determined by scanning the text for the first position where a match occurs for the entire remainder of the line. (For a description of functions, see Functions.)

For example:

@foo@(bind a "abc")xyz

Here, foo will match the text from the current position to where "xyz" occurs, even though there is a @(bind) directive. Furthermore, if more material is added after the xyz, it is part of the search. Note the difference between the following two:

@foo@/abc/@(func)

@foo@(func)@/abc/

In the first example, the variable foo matches the text from the current position until the match for the regular expression abc. @(func) is not considered when processing @foo . In the second example, the variable foo matches the text from the current position until the position which matches the function call, followed by a match for the regular expression. The entire sequence @(func)@/abc/ is considered.

6.9.4 Consecutive Variables

However, what if an unbound variable with no modifier is followed by another variable? The behavior depends on the nature of the other variable.

If the other variable is also unbound, and also has no modifier, this is a semantic error which will cause the query to fail. A diagnostic message will be issued, unless operating in quiet mode via -q . The reason is that there is no way to bind two consecutive variables to an extent of text; this is an ambiguous situation, since there is no matching criterion for dividing the text between two variables. (In theory, a repetition of the same variable, like @FOO@FOO , could find a solution by dividing the match extent in half, which would work only in the case when it contains an even number of characters. This behavior seems to have dubious value).

An unbound variable may be followed by one which is bound. The bound variable is effectively replaced by the text which it denotes, and the logic proceeds accordingly.

It is possible for a variable to be bound to a regular expression. If x is an unbound variable and y is bound to a regular expression RE , then @x@y means @x@/RE/ . A variable v can be bound to a regular expression using, for example, @(bind v #/RE/) .
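A sketch of this behavior, assuming the @(bind) directive line consumes no input line (variable names x and y are illustrative):

```txr
code:
@(bind y #/[0-9]+/)
@x@y

data:
abc123

result:
x="abc"
```

Since y is bound to the regular expression #/[0-9]+/ , the line @x@y behaves like @x@/[0-9]+/ , so x is bound to the text skipped before the digits.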

The @* syntax for longest match is available. Example:

code:
@FOO:@BAR@FOO

data:
xyz:defxyz

result:
FOO=xyz, BAR=def

Here, FOO is matched with "xyz" , based on the delimiting around the colon. The colon in the pattern then matches the colon in the data, so that BAR is considered for matching against "defxyz" . BAR is followed by FOO , which is already bound to "xyz" . Thus "xyz" is located in the "defxyz" data following "def" , and so BAR is bound to "def" .

If an unbound variable is followed by a variable which is bound to a list, or nested list, then each character string in the list is tried in turn to produce a match. The first match is taken.

An unbound variable may be followed by another unbound variable which specifies a regular expression or function call match. This is a special case called a "double variable match". What happens is that the text is searched using the regular expression or function. If the search fails, then neither variable is bound: it is a matching failure. If the search succeeds, then the first variable is bound to the text which is skipped by the search. The second variable is bound to the text matched by the regular expression or function. Examples:

code:
@foo@{bar /abc/}

data:
xyz@#abc

result:
foo="xyz@#", bar="abc"

6.9.5 Consecutive Variables Via Directive

Two unbound variables may also be separated by a directive, as in @var1@(all)@var2@(end) . This is treated just like the variable followed by directive. No semantic error is identified, even if both variables are unbound. Here, @var2 matches everything at the current position, and so @var1 ends up bound to the empty string.

Example 1: b matches at position 0 and a binds the empty string:

code:
@a@(all)@b@(end)

data:
abc

result:
a=""
b="abc"

Example 2: *a specifies longest match (see Longest Match below), and so it takes everything:

code:
@*a@(all)@b@(end)

data:
abc

result:
a="abc"
b=""

6.9.6 Longest Match

The @* prefix on a variable requests the longest match: instead of stopping at the leftmost occurrence of the trailing material, the negative match extends to its rightmost occurrence. For example:

code:
a @*{FOO}cd

data:
a b cdcdcdcd

result:
FOO="b cdcdcd"

code:
a @{FOO}cd

data:
a b cdcdcd

result:
FOO="b "

In the former example, the match extends to the rightmost occurrence of "cd" , and so FOO receives "b cdcdcd" . In the latter example, the * syntax isn't used, and so a leftmost match takes place. The extent covers only the "b " , stopping at the first "cd" occurrence.

6.10 Positive Match

There are syntactic variants of variable syntax which have an embedded expression enclosed with the variable in braces:



@{ bident / regex /}

@{ bident ( fun [args ...])}

@{ bident number }

@{ bident bident }

These specify a variable binding that is driven by a positive match derived from a regular expression, function or character count, rather than from trailing material (which is regarded as a "negative" match, since the variable is bound to material which is skipped in order to match the trailing material). In the / regex / form, the match extends over all characters from the current position which match the regular expression regex . (see Regular Expressions section below). In the ( fun [ args ...]) form, the match extends over characters which are matched by the call to the function, if the call succeeds. Thus @{x (y z w)} is just like @(y z w) , except that the region of text skipped over by @(y z w) is also bound to the variable x . See Functions below.

In the number form, the match processes a field of text which consists of the specified number of characters, which must be a non-negative number. If the data line doesn't have that many characters starting at the current position, the match fails. A match for zero characters produces an empty string. The text which is actually bound to the variable is all text within the specified field, but excluding leading and trailing whitespace. If the field contains only spaces, then an empty string is extracted.

This syntax is processed without consideration of what other syntax follows. A positive match may be directly followed by an unbound variable.
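A sketch of the number form over fixed-width data (the field layout is hypothetical):

```txr
code:
@{NAME 8}@{AGE 3}

data:
alice    42

result:
NAME="alice", AGE="42"
```

The NAME field spans the first eight characters, and its trailing spaces are trimmed; the AGE field spans the next three characters, and its leading space is trimmed.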

The @{ bident bident } syntax allows the number or regex modifier to come from a variable. The variable must be bound and contain a non-negative integer or regular expression. For example, @{x y} behaves like @{x 3} if y is bound to the integer 3. It is an error if y is unbound.
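For instance (a sketch, assuming the @(bind) directive line consumes no input line):

```txr
code:
@(bind w 3)
@{x w}@y

data:
abcdef

result:
x="abc", y="def"
```

Here w supplies the field width 3 for the positive match on x, and the remaining text is bound to y.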

6.11 Special Symbols nil and t

Just like in the Common Lisp language, the names nil and t are special.

The nil symbol stands for the empty list object, an object which marks the end of a list, and Boolean false. It is synonymous with the syntax () which may be used interchangeably with nil in most constructs.

In TXR Lisp, nil and t cannot be used as variables. When evaluated, they evaluate to themselves.

In the TXR pattern language, nil can be used in the variable binding syntax, but does not create a binding; it has a special meaning. It allows the variable matching syntax to be used to skip material, in ways similar to the skip directive.
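A sketch of using nil to skip a field without binding it (hypothetical data):

```txr
code:
@nil:@x

data:
unwanted:value

result:
x="value"
```

The material before the colon is matched but discarded; only x is bound.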

The nil symbol is also used as a block name, both in the TXR pattern language and in TXR Lisp. A block named nil is considered to be anonymous.

6.12 Keyword Symbols

Symbols whose names begin with the : character are keyword symbols. These may not be used as variables either; they stand for themselves. Keywords are useful for labeling information and situations.

6.13 Regular Expressions

Regular expressions are a language for specifying sets of character strings. Through the use of pattern matching elements, a regular expression is able to denote an infinite set of texts. TXR contains an original implementation of regular expressions, which supports the following syntax:

.
The period is a "wildcard" that matches any character.

[]
Character class: matches a single character, from the set specified by special syntax written between the square brackets. This supports basic regexp character class syntax. POSIX notation like [:digit:] is not supported. The regex tokens \s , \d and \w are permitted in character classes, but not their complementing counterparts. These tokens simply contribute their characters to the class. The class [a-zA-Z] means match an uppercase or lowercase letter; the class [0-9a-f] means match a digit or a lowercase letter; the class [^0-9] means match a non-digit, and so forth. There are no locale-specific behaviors in TXR regular expressions; [A-Z] denotes an ASCII/Unicode range of characters. The class [\d.] means match a digit or the period character. A ] or - can be used within a character class, but must be escaped with a backslash. A ^ in the first position denotes a complemented class, unless it is escaped by backslash. In any other position, it denotes itself. Two backslashes code for one backslash. So for instance [\[\-] means match a [ or - character, [^^] means match any character other than ^ , and [\^\\] means match either a ^ or a backslash. Regex operators such as * , + and & appearing in a character class represent ordinary characters. The characters - , ] and ^ occurring outside of a character class are ordinary. Unescaped / characters can appear within a character class. The empty character class [] matches no character at all, and its complement [^] matches any character, and is treated as a synonym for the . (period) wildcard operator.

\s , \w and \d
These regex tokens each match a single character. The \s regex token matches a wide variety of ASCII whitespace characters and Unicode spaces. The \w token matches alphabetic word characters; it is equivalent to the character class [A-Za-z_] . The \d token matches a digit, and is equivalent to [0-9] .

\S , \W and \D
These regex tokens are the complemented counterparts of \s , \w and \d . The \S token matches all those characters which \s does not match, \W matches all characters that \w does not match, and \D matches nondigits.

empty
An empty expression is a regular expression. It represents the set of strings consisting of the empty string; i.e. it matches just the empty string. The empty regex can appear alone as a full regular expression (for instance the TXR syntax @// with nothing between the slashes) and can also be passed as a subexpression to operators, though this may require the use of parentheses to make the empty regex explicit. For example, the expression a| means: match either a , or nothing. The forms * and (*) are syntax errors; though not useful, the correct way to match the empty expression zero or more times is the syntax ()* .

nomatch
The nomatch regular expression represents the empty set: it matches no strings at all, not even the empty string. There is no dedicated syntax to directly express nomatch in the regex language. However, the empty character class [] is equivalent to nomatch, and may be considered to be a notation for it. Other representations of nomatch are possible: for instance, the regex ~.* is the complement of the regex that denotes the set of all possible strings, and thus denotes the empty set. A nomatch has uses; for instance, it can be used to temporarily "comment out" regular expressions. The regex ([]abc|xyz) is equivalent to (xyz) , since the []abc branch cannot match anything. Using [] to "block" a subexpression allows you to leave it in place, then enable it later by removing the "block".

(R)
If R is a regular expression, then so is (R) . The contents of parentheses denote one regular expression unit, so that for instance in (RE)* , the * operator applies to the entire parenthesized group. The syntax () is valid and equivalent to the empty regular expression.

R?
Optionally match the preceding regular expression R .

R*
Match the expression R zero or more times. This operator is sometimes called the "Kleene star", or "Kleene closure". The Kleene closure favors the longest match. Roughly speaking, if there are two or more ways in which R1*R2 can match, then that match occurs in which R1* matches the longest possible text.

R+
Match the preceding expression R one or more times. Like R* , this favors the longest possible match: R+ is equivalent to RR* .

R1%R2
Match R1 zero or more times, then match R2 . If this match can occur in more than one way, then it occurs such that R1 is matched the fewest number of times, which is opposite from the behavior of R1*R2 . Repetitions of R1 terminate at the earliest point in the text where a non-empty match for R2 occurs. Because it favors shorter matches, % is termed a non-greedy operator. If R2 is the empty expression, or equivalent to it, then R1%R2 reduces to R1* . So for instance (R%) is equivalent to (R*) , since the missing right operand is interpreted as the empty regex. Note that whereas the expression (R1*R2) is equivalent to (R1*)R2 , the expression (R1%R2) is not equivalent to (R1%)R2 . Also note that A(XY%Z)B is equivalent to AX(Y%Z)B . This is because the precedence of % is higher than that of catenation on its left side; this rule prevents the given syntax from expressing the XY catenation. The expression may be understood as A(X(Y%Z))B , where the inner parentheses clarify how the syntax surrounding the % operator is being parsed, and the outer parentheses are superfluous. The correct way to assert catenation of XY as the left operand of % is A(XY)%ZB . To specify XY as the left operand, and limit the right operand to just Z , the correct syntax is A((XY)%Z)B . By contrast, the expression A(X%YZ)B is not equivalent to A(X%Y)ZB , because the precedence of % is lower than that of catenation on its right side. The operator is effectively "bi-precedential".

~R
Match the opposite of the following expression R ; that is, match exactly those texts that R does not match. This operator is called complement, or logical not.

R1R2
Two consecutive regular expressions denote catenation: the left expression must match, and then the right.

R1|R2
Match either the expression R1 or R2 . This operator is known by a number of names: union, logical or, disjunction, branch, or alternative.

R1&R2
Match both the expression R1 and R2 simultaneously; i.e. the matching text must be one of the texts which are in the intersection of the set of texts matched by R1 and the set matched by R2 . This operator is called intersection, logical and, or conjunction.

Any character which is not a regular expression operator, a backslash escape, or the slash delimiter, denotes one-position match of that character itself.

Any of the special characters, including the delimiting / , and the backslash, can be escaped with a backslash to suppress its meaning and denote the character itself.

Furthermore, all of the same escapes as are described in the section Special Characters in Text above are supported - the difference is that in regular expressions, the @ character is not required, so for example a tab is coded as \t rather than @\t . Octal and hex character escapes can be optionally terminated by a semicolon, which is useful if the following characters are octal or hex digits not intended to be part of the escape.

Only the above escapes are supported. Unlike in some other regular expression implementations, if a backslash appears before a character which isn't a regex special character or one of the supported escape sequences, it is an error. This wasn't true of historic versions of TXR. See the COMPATIBILITY section.

Precedence table, highest to lowest:

Operators            Class         Associativity
(R)  []              primary
R?  R+  R*  R%...    postfix       left-to-right
R1R2                 catenation    left-to-right
~R  ...%R            unary         right-to-left
R1&R2                intersection  left-to-right
R1|R2                union         left-to-right



The % operator is like a postfix operator with respect to its left operand, but like a unary operator with respect to its right operand. Thus a~b%c~d is a(~(b%(c(~d)))) , demonstrating right-to-left associativity, where all of b% may be regarded as a unary operator being applied to c~d . Similarly, a?*+%b means (((a?)*)+)%b , where the trailing %b behaves like a postfix operator.

In TXR, regular expression matches do not span multiple lines. The regex language has no feature for multi-line matching. However, the @(freeform) directive allows the remaining portion of the input to be treated as one string in which line terminators appear as explicit characters. Regular expressions may freely match through this sequence.

It's possible for a regular expression to match an empty string. For instance, if the next input character is z , facing the regular expression /a?/ , there is a zero-character match: the regular expression's state machine can reach an acceptance state without consuming any characters. Examples:

code:
@A@/a?/@/.*/

data:
zzzzz

result:
A=""

code:
@{A /a?/}@B

data:
zzzzz

result:
A="", B="zzzzz"

code:
@*A@/a?/

data:
zzzzz

result:
A="zzzzz"

In the first example, variable @A is followed by a regular expression which can match an empty string. The expression faces the letter z at position 0 in the data line. A zero-character match occurs there, therefore the variable A takes on the empty string. The @/.*/ regular expression then consumes the line.

Similarly, in the second example, the /a?/ regular expression faces a z , and thus yields an empty string which is bound to A . Variable @B consumes the entire line.

The third example requests the longest match for the variable binding. Thus, a search takes place for the rightmost position where the regular expression matches. The regular expression matches anywhere, including the empty string after the last character, which is the rightmost place. Thus variable A fetches the entire line.

For additional information about the advanced regular expression operators, see NOTES ON EXOTIC REGULAR EXPRESSIONS below.

6.14 Compound Expressions

If the @ escape character is followed by an open parenthesis or square bracket, this is taken to be the start of a TXR Lisp compound expression.

The TXR language has the unusual property that its syntactic elements, so-called directives, are Lisp compound expressions. These expressions not only enclose syntax; expressions which begin with certain symbols behave de facto as tokens in a phrase-structure grammar. For instance, the expression @(collect) begins a block which must be terminated by the expression @(end) , otherwise there is a syntax error. The collect expression can contain arguments which modify the behavior of the construct, for instance @(collect :gap 0 :vars (a b)) . In some ways, this situation might be compared to the HTML language, in which an element such as <a> must be terminated by </a> and can have attributes such as <a href="...">.
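The block structure described here can be sketched with a hypothetical collect over key/value lines:

```txr
@(collect)
@key=@val
@(end)
```

Against input lines such as a=1 and b=2 , the collect gathers list bindings for key and val ; omitting the @(end) would be a syntax error, just as an unterminated element is in HTML.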

Compound expressions contain subexpressions: other compound expressions, or literal objects of various kinds. Among these are: symbols, numbers, string literals, character literals, quasiliterals and regular expressions. These are described in the following sections. Additional kinds of literal objects exist, which are discussed in the TXR LISP section of the manual.

Some examples of compound expressions are:

(banana)
(a b c (d e f))
( a (b (c d) (e ) ))
("apple" #\b #\space 3)
(a #/[a-z]*/ b)
(_ `@file.txt`)

Symbols occurring in a compound expression follow a slightly more permissive lexical syntax than the bident in the syntax @{ bident } introduced earlier. The / (slash) character may be part of an identifier, or even constitute an entire identifier. In fact, a symbol inside a directive is a lident. This is described in the Symbol Tokens section under TXR LISP. A symbol must not be a number; tokens that look like numbers are treated as numbers and not symbols.

6.15 Character Literals

Character literals are introduced by the #\ syntax, which is either followed by a character name, the letter x followed by hex digits, the letter o followed by octal digits, or a single character. Valid character names are:

nul linefeed return alarm newline esc backspace vtab space tab page pnul

For instance #\esc denotes the escape character.

This convention for character literals is similar to that of the Scheme language. Note that #\linefeed and #\newline are the same character. The #\pnul character is specific to TXR and denotes the U+DC00 code in Unicode; the name stands for "pseudo-null", which is related to its special function. For more information about this, see the section "Character Handling and International Characters".
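To illustrate the conventions described above, here are a few character literals, with the denoted character indicated in a comment (a brief sketch):

```
#\a      ;; the letter a (single-character form)
#\space  ;; the space character, by name
#\x41    ;; U+0041, the letter A, via hexadecimal digits
#\o101   ;; also U+0041, via octal digits
```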

6.16 String Literals

String literals are delimited by double quotes. A double quote within a string literal is encoded using \" and a backslash is encoded as \\. Backslash escapes like \n and \t are recognized, as are hexadecimal escapes like \xFF or \xabc and octal escapes like \123. Ambiguity between an escape and subsequent text can be resolved by using a trailing semicolon delimiter: "\xabc;d" is a string consisting of the character U+0ABC followed by "d". The semicolon delimiter disappears. To write a literal semicolon immediately after a hex or octal escape, write two semicolons, the first of which will be interpreted as a delimiter. Thus, "\x21;;" represents "!;".

If the line ends in the middle of a literal, it is an error, unless the last character is a backslash. This backslash is a special escape which does not denote a character; rather, it indicates that the string literal continues on the next line. The backslash is deleted, along with whitespace which immediately precedes it, as well as leading whitespace in the following line. The escape sequence "\ " (backslash space) can be used to encode a significant space.

Example:

"foo \ bar" "foo \ \ bar" "foo\ \ bar"

The first string literal is the string "foobar" . The second two are "foo bar" .

6.17 Word List Literals

A word list literal (WLL) provides a convenient way to write a list of strings when such a list can be given as whitespace-delimited words.

There are two flavors of the WLL: the regular WLL which begins with #" (hash, double-quote) and the splicing list literal which begins with #*" (hash, star, double-quote).

Both types are terminated by a double quote, which may be escaped as \" in order to include it as a character. All the escaping conventions used in string literals can be used in word literals.

Unlike in string literals, whitespace (tabs and spaces) is not significant in word literals: it separates words. Whitespace may be escaped with a backslash in order to include it as a literal character.

Just like in string literals, an unescaped newline character is not allowed. A newline preceded by a backslash is permitted. Such an escaped newline, together with any leading and trailing unescaped whitespace, is removed and replaced with a single space.

Example:

#"abc def ghi" --> notates ("abc" "def" "ghi") #"abc def \ ghi" --> notates ("abc" "def" "ghi") #"abc\ def ghi" --> notates ("abc def" "ghi") #"abc\ def\ \ \ ghi" --> notates ("abc def " " ghi")

A splicing word literal differs from a word literal in that it does not produce a list of string literals, but rather it produces a sequence of string literals that is merged into the surrounding syntax. Thus, the following two notations are equivalent:

(1 2 3 #*"abc def" 4 5 #"abc def") (1 2 3 "abc" "def" 4 5 ("abc" "def"))

The regular WLL produced a single list object, but the splicing WLL expanded into multiple string literal objects.

6.18 String Quasiliterals

Quasiliterals are similar to string literals, except that they may contain variable references denoted by the usual @ syntax. The quasiliteral represents a string formed by substituting the values of those variables into the literal template. If a is bound to "apple" and b to "banana" , the quasiliteral `one @a and two @{b}s` represents the string "one apple and two bananas" . A backquote escaped by a backslash represents itself. Unlike in directive syntax, two consecutive @ characters do not code for a literal @ , but cause a syntax error. The reason for this is that compounding of the @ syntax is meaningful. Instead, there is a \@ escape for encoding a literal @ character. Quasiliterals support the full output variable syntax. Expressions within variable substitutions follow the evaluation rules of TXR Lisp. This hasn't always been the case: see the COMPATIBILITY section.

Quasiliterals can be split into multiple lines in the same way as ordinary string literals.
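As a brief sketch of the rules above, assuming that a is bound to "apple":

```
`one @a`    ;; the string "one apple"
`\@a`       ;; the string "@a": a literal @ via the \@ escape
`@(+ 2 2)`  ;; the string "4": a Lisp expression substitution
```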

6.19 Quasiword List Literals

The quasiword list literals (QLL-s) are to quasiliterals what WLL-s are to ordinary literals. (See the above section Word List Literals.)

A QLL combines the convenience of the WLL with the power of quasistrings.

Just as in the case of WLL-s, there are two flavors of the QLL: the regular QLL which begins with #` (hash, backquote) and the splicing QLL which begins with #*` (hash, star, backquote).

Both types are terminated by a backquote, which may be escaped as \` in order to include it as a character. All the escaping conventions used in quasiliterals can be used in QLL.

Unlike in quasiliterals, whitespace (tabs and spaces) is not significant in QLL: it separates words. Whitespace may be escaped with a backslash in order to include it as a literal character.

A newline is not permitted unless escaped. An escaped newline works exactly the same way as it does in word list literals (WLL-s).

Note that the delimiting into words is done before the variable substitution. If the variable a contains spaces, then #`@a` nevertheless expands into a list of one item: the string derived from a .

Examples:

#`abc @a ghi`     --> notates (`abc` `@a` `ghi`)

#`abc @d@e@f \
ghi`              --> notates (`abc` `@d@e@f` `ghi`)

#`@a\ @b @c`      --> notates (`@a @b` `@c`)

A splicing QLL differs from an ordinary QLL in that it does not produce a list of quasiliterals, but rather it produces a sequence of quasiliterals that is merged into the surrounding syntax.

6.20 Numbers

TXR supports integers and floating-point numbers.

An integer constant is made up of digits 0 through 9 , optionally preceded by a + or - sign.

Examples:

123 -34 +0 -0 +234483527304983792384729384723234

An integer constant can also be specified in hexadecimal using the prefix #x followed by an optional sign, followed by hexadecimal digits: 0 through 9 and the upper or lower case letters A through F :

#xFF   ;; 255
#x-ABC ;; -2748

Similarly, octal numbers are supported with the prefix #o followed by octal digits:

#o777 ;; 511

and binary numbers can be written with a #b prefix:

#b1110 ;; 14

Note that the #b prefix is also used for buffer literals.

A floating-point constant is marked by the inclusion of a decimal point, the exponential "e notation", or both. It is an optional sign, followed by a mantissa consisting of digits, a decimal point, more digits, and then an optional exponential notation consisting of the letter e or E, an optional + or - sign, and then digits indicating the exponent value. In the mantissa, the digits are not optional. At least one digit must either precede the decimal point or follow it. That is to say, a decimal point by itself is not a floating-point constant.

Examples:

.123 123. 1E-3 20E40 .9E1 9.E19 -.5 +3E+3 1.E5

Examples which are not floating-point constant tokens:

.     ;; dot token, not a number
123E  ;; the symbol 123E
1.0E- ;; syntax error: invalid floating point constant
1.0E  ;; syntax error: invalid floating point constant
1.E   ;; syntax error: invalid floating point literal
.e    ;; syntax error: dot token followed by symbol

In TXR there is a special "dotdot" token consisting of two consecutive periods. An integer constant followed immediately by dotdot is recognized as such; it is not treated as a floating constant followed by a dot. That is to say, 123.. does not mean 123. . (the floating-point value 123.0 followed by the dot token); it means 123 .. (the integer 123 followed by the .. token).
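For instance, in TXR Lisp the dotdot token forms range expressions, so an integer can directly abut it (a sketch):

```
@(do (prinl ["abcdef" 1..3]))  ;; 1..3 lexes as 1, .., 3: a range, selecting "bc"
```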

Dialect note: unlike in Common Lisp, 123. is not an integer, but the floating-point number 123.0 .

6.21 Comments

Comments of the form @; were introduced earlier. Inside compound expressions, another convention for comments exists: Lisp comments, which are introduced by the ; (semicolon) character and span to the end of the line.

Example:

@(foo ; this is a comment
      bar ; this is another comment
      )

This is equivalent to @(foo bar) .

7 DIRECTIVES

When a TXR Lisp compound expression occurs in TXR, preceded by @, it is a directive.

Directives which are based on certain symbols are, additionally, involved in a phrase-structure syntax which uses Lisp expressions as if they were tokens.

For instance, the directive

@(collect)

not only denotes a compound expression with the collect symbol in its head position, but it also introduces a syntactic phrase which requires a matching @(end) directive. In other words, @(collect) is not only an expression, but serves as a kind of token in a higher-level phrase structure grammar.

Effectively, collect is a reserved symbol in the TXR language. A TXR program cannot use this symbol as the name of a pattern function, due to its role in the syntax. The symbol has no reserved role in TXR Lisp.

Usually if this type of directive occurs alone in a line, not preceded or followed by other material, it is involved in a "vertical" (or line oriented) syntax.

If such a directive is embedded in a line (has preceding or trailing material) then it is in a horizontal syntactic and semantic context (character-oriented).

There is an exception: the definition of a horizontal function looks like this:

@(define name (arg))body material@(end)

Yet, this is considered one vertical item, which means that it does not match a line of data. (This is necessary because all horizontal syntax matches something within a line of data, which is undesirable for definitions.)

Many directives exhibit both horizontal and vertical syntax, with different but closely related semantics. A few are vertical only, and some are horizontal only.

A summary of the available directives follows:

@(eof)
Explicitly match the end of file. Fails if unmatched data remains in the input stream.

@(eol)
Explicitly match the end of line. Fails if the current position is not the end of a line. Also fails if no data remains (there is no current line).

@(next)
Continue matching in another file or other data source.

@(block)
Groups together a sequence of directives into a logical name block, which can be explicitly terminated from within using the @(accept) and @(fail) directives. Blocks are described in the section Blocks below.

@(skip)
Treat the remaining query as a subquery unit, and search the lines (or characters) of the input file until that subquery matches somewhere. A skip is also an anonymous block.

@(trailer)
Treat the remaining query or subquery as a match for a trailing context. That is to say, if the remainder matches, the data position is not advanced.

@(freeform)
Treat the remainder of the input as one big string, and apply the following query line to that string. The newline characters (or custom separators) appear explicitly in that string.

@(fuzz)
The fuzz directive, inspired by the patch utility, specifies a partial match for some lines.

@(line) and @(chr)
These directives match a variable or expression against the current line number or character position.

@(name)
Match a variable against the name of the current data source.

@(data)
Match a variable against the remaining data (lazy list of strings).

@(some)
Multiple clauses are each applied to the same input. Succeeds if at least one of the clauses matches the input. The bindings established by earlier successful clauses are visible to the later clauses.

@(all)
Multiple clauses are applied to the same input. Succeeds if and only if each one of the clauses matches. The clauses are applied in sequence, and evaluation stops on the first failure. The bindings established by earlier successful clauses are visible to the later clauses.

@(none)
Multiple clauses are applied to the same input. Succeeds if and only if none of them match. The clauses are applied in sequence, and evaluation stops on the first success. No bindings are ever produced by this construct.

@(maybe)
Multiple clauses are applied to the same input. No failure occurs if none of them match. The bindings established by earlier successful clauses are visible to the later clauses.

@(cases)
Multiple clauses are applied to the same input. Evaluation stops on the first successful clause.

@(require)
The require directive is similar to the do directive in that it evaluates one or more TXR Lisp expressions. If the result of the rightmost expression is nil, then require triggers a match failure. See the TXR LISP section far below.

@(if), @(elif), and @(else)
The if directive with optional elif and else clauses allows one of multiple bodies of pattern-matching directives to be conditionally selected by testing the values of Lisp expressions. It is also available inside @(output) for conditionally selecting output clauses.

@(choose)
Multiple clauses are applied to the same input. The one whose effect persists is the one which maximizes or minimizes the length of a particular variable.

@(empty)
The @(empty) directive matches the empty string. It is useful in certain situations, such as expressing an empty match in a directive that doesn't accept an empty clause. The @(empty) syntax has another meaning in @(output) clauses, in conjunction with @(repeat).

@(define name ( args ...))
Introduces a function. Functions are described in the Functions section below.

@(call expr args *)
Performs function indirection. Evaluates expr, which must produce a symbol that names a pattern function. Then that pattern function is invoked.

@(gather)
Searches text for matches for multiple clauses which may occur in arbitrary order. For convenience, lines of the first clause are treated as separate clauses.

@(collect)
Search the data for multiple matches of a clause. Collect the bindings in the clause into lists, which are output as array variables. The @(collect) directive is line-oriented. It works with a multi-line pattern and scans line by line. A similar directive called @(coll) works within one line. A collect is an anonymous block.

@(and)
Separator of clauses for @(some), @(all), @(none), @(maybe) and @(cases). Equivalent to @(or). The choice is stylistic.

@(or)
Separator of clauses for @(some), @(all), @(none), @(maybe) and @(cases). Equivalent to @(and). The choice is stylistic.

@(end)
Required terminator for @(some), @(all), @(none), @(maybe), @(cases), @(if), @(collect), @(coll), @(output), @(repeat), @(rep), @(try), @(block) and @(define).

@(fail)
Terminate the processing of a block, as if it were a failed match. Blocks are described in the section Blocks below.

@(accept)
Terminate the processing of a block, as if it were a successful match. What bindings emerge may depend on the kind of block: collect has special semantics. Blocks are described in the section Blocks below.

@(try)
Indicates the start of a try block, which is related to exception handling, described in the Exceptions section below.

@(catch) and @(finally)
Special clauses within @(try). See Exceptions below.

@(defex) and @(throw)
Define custom exception types; throw an exception. See Exceptions below.

@(assert)
The assert directive requires the following material to match, otherwise it throws an exception. It is useful for catching mistakes or omissions in parts of a query that are sure-fire matches.

@(flatten)
Normalizes a set of specified variables to one-dimensional lists. Those variables which have scalar value are reduced to lists of that value. Those which are lists of lists (to an arbitrary level of nesting) are converted to flat lists of their leaf values.

@(merge)
Binds a new variable which is the result of merging two or more other variables. Merging has somewhat complicated semantics.

@(cat)
Decimates a list (any number of dimensions) to a string, by catenating its constituent strings, with an optional separator string between all of the values.

@(bind)
Binds one or more variables against a value using a structural pattern match. A limited form of unification takes place which can cause a match to fail.

@(set)
Destructively assigns one or more existing variables using a structural pattern, using syntax similar to bind. Assignment to unbound variables triggers an error.

@(rebind)
Evaluates an expression in the current binding environment, and then creates new bindings for the variables in the structural pattern. Useful for temporarily overriding variable values in a scope.

@(forget)
Removes variable bindings.

@(local)
Synonym of @(forget).

@(output)
A directive which encloses an output clause in the query. An output section does not match text, but produces text. The directives above are not understood in an output clause.

@(repeat)
A directive understood within an @(output) section, for repeating multi-line text, with successive substitutions pulled from lists. The directive @(rep) produces iteration over lists horizontally within one line. These directives have a different meaning in matching clauses, providing a shorthand notation for @(collect :vars nil) and @(coll :vars nil), respectively.

@(deffilter)
The deffilter directive is used for defining named filters, which are useful for filtering variable substitutions in output blocks. Filters are useful when data must be translated between different representations that have different special characters or other syntax, requiring escaping or similar treatment. Note that it is also possible to use a function as a filter. See Function Filters below. Named filters are stored in the hash table held in the Lisp special variable *filters*.

@(filter)
The filter directive passes one or more variables through a given filter or chain of filters, updating them with the filtered values.

@(load) and @(include)
The load and include directives allow TXR programs to be modularized. They bring in code from a file, in two different ways.

@(do)
The do directive is used to evaluate TXR Lisp expressions, discarding their result values. See the TXR LISP section far below.

@(mdo)
The mdo (macro do) directive evaluates TXR Lisp expressions immediately, during the parsing of the TXR syntax in which it occurs.

@(in-package)
The in-package directive is used to switch to a different symbol package. It mirrors the TXR Lisp macro of the same name.

7.2 Subexpression Evaluation

Some directives contain subexpressions which are evaluated. Two distinct styles of evaluations occur in TXR: bind expressions and Lisp expressions. Which semantics applies to an expression depends on the syntactic context in which it occurs: which position in which directive.

The evaluation of TXR Lisp expressions is described in the TXR LISP section of the manual.

Bind expressions are so named because they occur in the @(bind) directive. TXR pattern function invocations also treat argument expressions as bind expressions.

The @(rebind) , @(set) , @(merge) , and @(deffilter) directives also use bind expression evaluation. Bind expression evaluation also occurs in the argument position of the :tlist keyword in the @(next) directive.

Unlike Lisp expressions, bind expressions do not support operators. If a bind expression is a nested list structure, it is a template denoting that structure. Any symbol in any position of that structure is interpreted as a variable. When the bind expression is evaluated, those corresponding positions in the template are replaced by the values of the variables.

Anywhere where a variable can appear in a bind expression's nested list structure, a Lisp expression can appear preceded by the @ character. That Lisp expression is evaluated and its value is substituted into the bind expression's template.

Moreover, a Lisp expression preceded by @ can be used as an entire bind expression. The value of that Lisp expression is then taken as the bind expression value.

Any object in a bind expression which is not a nested list structure containing Lisp expressions or variables denotes itself literally.

Examples:

In the following examples, the variables a and b are assumed to have the string values "foo" and "bar", respectively. The -> notation indicates the value of each expression.

a -> "foo"
(a b) -> ("foo" "bar")
((a) ((b) b)) -> (("foo") (("bar") "bar"))
(list a b) -> error: unbound variable list
@(list a b) -> ("foo" "bar")   ;; Lisp expression
(a @[b 1..:]) -> ("foo" "ar")  ;; Lisp eval of [b 1..:]
(a @(+ 2 2)) -> ("foo" 4)      ;; Lisp eval of (+ 2 2)
#(a b) -> #(a b)               ;; Vector literal, not list.
[a b] -> error: unbound variable dwim

The last example above, [a b], is a notation equivalent to (dwim a b) and so fails similarly to the example involving list.

7.3 Input Scanning and Data Manipulation

7.3.1 The next directive

The next directive indicates that the remaining directives in the current block are to be applied against a new input source.

It can only occur by itself as the only element in a query line, and takes various arguments, according to these possibilities:



@(next)

@(next source )

@(next source :nothrow)

@(next :args)

@(next :env)

@(next :list lisp-expr )

@(next :tlist bind-expr )

@(next :string lisp-expr )

@(next :var var )

@(next nil)

The lone @(next) without arguments specifies that subsequent directives will match inside the next file in the argument list which was passed to TXR on the command line.

If source is given, it must be a TXR Lisp expression which denotes an input source. Its value may be a string or an input stream. For instance, if variable A contains the text "data" , then @(next A) means switch to the file called "data" , and @(next `@A.txt`) means to switch to the file "data.txt" . The directive @(next (open-command `git log`)) switches to the input stream connected to the output of the git log command.

If the input source cannot be opened for whatever reason, TXR throws an exception (see Exceptions below). An unhandled exception will terminate the program. Often, such a drastic measure is inconvenient; if @(next) is invoked with the :nothrow keyword, then if the input source cannot be opened, the situation is treated as a simple match failure.
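A sketch combining these points (the file name here is hypothetical):

```
@(next "optional.txt" :nothrow)
@first_line
```

If optional.txt cannot be opened, this is treated as a simple match failure rather than an exception.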

The variant @(next :args) means that the remaining command line arguments are to be treated as a data source. For this purpose, each argument is considered to be a line of text. The argument list does include that argument which specifies the file that is currently being processed or was most recently processed. As the arguments are matched, they are consumed. This means that if a @(next) directive without arguments is executed in the scope of @(next :args) , it opens the file named by the first unconsumed argument.

To process arguments, and then continue with the original file and argument list, wrap the argument processing in a @(block) . When the block terminates, the input source and argument list are restored to what they were before the block.

The variant @(next :env) means that the list of process environment variables is treated as a source of data. It looks like a text file stream consisting of lines of the form "name=value" . If this feature is not available on a given platform, an exception is thrown.
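For example, a query along these lines scans the environment for the HOME variable (a sketch; it assumes HOME is set):

```
@(next :env)
@(skip)
HOME=@home
```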

The syntax @(next :list lisp-expr ) treats TXR Lisp expression lisp-expr as a source of text. The value of lisp-expr is flattened to a simple list in a way similar to the @(flatten) directive. The resulting list is treated as if it were the lines of a text file: each element of the list must be a string, which represents a line. If the strings happen to contain embedded newline characters, they are a visible constituent of the line, and do not act as line separators.
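A sketch of :list with a directly constructed list:

```
@(next :list (list "alpha" "beta"))
@first
@second
```

This binds first to "alpha" and second to "beta".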

The syntax @(next :tlist bind-expr ) is similar to @(next :list ...) except that bind-expr is not a TXR Lisp expression, but a TXR bind expression.

The syntax @(next :var var ) requires var to be a previously bound variable. The value of the variable is retrieved and treated like a list, in the same manner as under @(next :list ...) . Note that @(next :var x) is not always the same as @(next :tlist x) , because :var x strictly requires x to be a TXR variable, whereas the x in :tlist x is an expression which can potentially refer to a Lisp variable.

The syntax @(next :string lisp-expr ) treats expression lisp-expr as a source of text. The value of the expression must be a string. Newlines in the string are interpreted as line terminators.

A string which is not terminated by a newline is tolerated, so that:

@(next :string "abc")
@a

binds a to "abc". Likewise, this is also the case with input files and other streams whose last line is not terminated by a newline.

However, watch out for empty strings, which are analogous to a correctly formed empty file which contains no lines:

@(next :string "")
@a

This will not bind a to ""; it is a matching failure. The behavior of :list is different. The query

@(next :list "")
@a

binds a to "". The reason is that under :list the string "" is flattened to the list ("") which is not an empty input stream, but a stream consisting of one empty line.

The @(next nil) variant indicates that the following subquery is applied to empty data, and the list of data sources from the command line is considered empty. This directive is useful in front of TXR code which doesn't process data sources from the command line, but takes command line arguments. The @(next nil) incantation absolutely prevents TXR from trying to open the first command line argument as a data source.

Note that the @(next) directive only redirects the source of input over the scope of the subquery in which that directive appears. For example, the following query looks for the line starting with "xyz" at the top of the file "foo.txt", within a some directive. After the @(end) which terminates the @(some), the "abc" is matched in the previous input stream which was in effect before the @(next) directive:

@(some)
@(next "foo.txt")
xyz@suffix
@(end)
abc

However, if the @(some) subquery successfully matched "xyz@suffix" within the file foo.txt, there is now a binding for the suffix variable, which is visible to the remainder of the entire query. The variable bindings survive beyond the clause, but the data stream does not.

7.3.2 The skip directive

The skip directive considers the remainder of the query as a search pattern. The remainder is no longer required to strictly match at the current line in the current input stream. Rather, the current stream is searched, starting with the current line, for the first line where the entire remainder of the query will successfully match. If no such line is found, the skip directive fails. If a matching position is found, the remainder of the query is processed from that point.

The remainder of the query can itself contain skip directives. Each such directive performs a recursive subsearch.

Skip comes in vertical and horizontal flavors. For instance, skip and match the last line:

@(skip)
@last_line
@(eof)

Skip and match the last character of the line:

@(skip)@{last_char 1}@(eol)

The skip directive has two optional arguments, which are evaluated as TXR Lisp expressions. If the first argument evaluates to an integer, its value limits the range of lines scanned for a match. Judicious use of this feature can improve the performance of queries.

Example: scan until "size: @SIZE" matches, which must happen within the next 15 lines:

@(skip 15)
size: @SIZE

Without the range limitation, skip will keep searching until it consumes the entire input source. In a horizontal skip, the range-limiting numeric argument is expressed in characters, so that

abc@(skip 5)def

means: there must be a match for "abc" at the start of the line, and then within the next five characters, there must be a match for "def".

Sometimes a skip is nested within a collect , or following another skip. For instance, consider:

@(collect)
begin @BEG_SYMBOL
@(skip)
end @BEG_SYMBOL
@(end)

The above collect iterates over the entire input. But, potentially, so does the embedded skip. Suppose that "begin x" is matched, but the data has no matching "end x". The skip will search in vain all the way to the end of the data, and then the collect will try another iteration back at the beginning, just one line down from the original starting point. If it is a reasonable expectation that an end x occurs within 15 lines of a "begin x", this can be specified instead:

@(collect)
begin @BEG_SYMBOL
@(skip 15)
end @BEG_SYMBOL
@(end)

If the symbol nil is used in place of a number, it means to scan an unlimited range of lines; thus, @(skip nil) is equivalent to @(skip) .

If the symbol :greedy is used, it changes the semantics of the skip to longest-match semantics. For instance, match the last three space-separated tokens of the line:

@(skip :greedy) @a @b @c

Without :greedy, the variable @c can match multiple tokens, and end up with spaces in it, because nothing follows @c and so it matches from any position which follows a space to the end of the line. Also note the space in front of @a. Without this space, @a will get an empty string.

A line-oriented example of greedy skip: match the last line without using @(eof):

@(skip :greedy)
@last_line

There may be a second numeric argument. This specifies a minimum number of lines to skip before looking for a match. For instance, skip 15 lines and then search indefinitely for begin ... :

@(skip nil 15)
begin @BEG_SYMBOL

The two arguments may be used together. For instance, the following matches if, and only if, the 15th line of input starts with begin :

@(skip 1 15)
begin @BEG_SYMBOL

Essentially, @(skip 1 n ) means "hard skip by n lines". @(skip 1 0) is the same as @(skip 1) , which is a noop, because it means: "the remainder of the query must match starting on the next line", or, more briefly, "skip exactly zero lines", which is the behavior if the skip directive is omitted altogether.

Here is one trick for grabbing the fourth line from the bottom of the input:

@(skip)
@fourth_from_bottom
@(skip 1 3)
@(eof)

Or using greedy skip:

@(skip :greedy)
@fourth_from_bottom
@(skip 1 3)

Nongreedy skip with the @(eof) has a slight advantage because the greedy skip will keep scanning even though it has found the correct match, then backtrack to the last good match once it runs out of data. The regular skip with explicit @(eof) will stop when the @(eof) matches.

7.3.3 Reducing Backtracking with Blocks

skip can consume considerable CPU time when multiple skips are nested. Consider:

@(skip)
A
@(skip)
B
@(skip)
C

This is actually nesting: the second and third skips occur within the body of the first one, and thus this creates nested iteration. TXR is searching for the combination of skips which match the pattern of lines A, B and C, with backtracking behavior. The outermost skip marches through the data until it finds A, followed by a pattern match for the second skip. The second skip iterates within to find B, followed by the third skip, and the third skip iterates to find C. If there is only one line A, and one B, then this is reasonably fast. But suppose there are many lines matching A and B, giving rise to a large number of combinations of skips which match A and B, and yet do not find a match for C, triggering backtracking. The nested stepping which tries the combinations of A and B can give rise to a considerable running time.

One way to deal with the problem is to unravel the nesting with the help of blocks. For example:

@(block)
@ (skip)
A
@(end)
@(block)
@ (skip)
B
@(end)
@(skip)
C

Now the scope of each skip is just the remainder of the block in which it occurs. The first skip finds A, and then the block ends. Control passes to the next block, and backtracking will not take place to a block which has completed (unless all these blocks are enclosed in some larger construct which backtracks, causing the blocks to be re-executed).

This rewrite is not equivalent, and cannot be used for instance in backreferencing situations such as:

@;
@; Find three lines anywhere in the input which are identical.
@;
@(skip)
@line
@(skip)
@line
@(skip)
@line

This example depends on the nested search-within-search semantics.

7.3.4 The trailer directive

The trailer directive introduces a trailing portion of a query or subquery which matches input material normally, but in the event of a successful match, does not advance the current position. This can be used, for instance, to cause @(collect) to match partially overlapping regions.

Trailer can be used in vertical context:

@(trailer)

directives

...

or horizontal:

@(trailer) directives ...

A vertical trailer prevents the vertical input position from advancing as it is matched by directives , whereas a horizontal trailer prevents the horizontal position from advancing. In other words, trailer performs matching without consuming the input, providing a look-ahead mechanism.

Example:

@(collect)
@line
@(trailer)
@(skip)
@line
@(end)

This script collects each line which has a duplicate somewhere later in the input. Without the @(trailer) directive, this does not work properly for inputs like:

111
222
111
222

Without @(trailer), the first duplicate pair constitutes a match which spans over the 222. After that pair is found, the matching continues after the second 111.

With the @(trailer) directive in place, the collect body, on each iteration, only consumes the lines matched prior to @(trailer) .

7.3.5 The freeform directive

The freeform directive provides a useful alternative to TXR's line-oriented matching discipline. The freeform directive treats all remaining input from the current input source as one big line. The query line which immediately follows freeform is applied to that line.

The syntax variations are:

@(freeform)
... query line ..
@(freeform number)
... query line ..
@(freeform string)
... query line ..
@(freeform number string)
... query line ..

where number and string denote TXR Lisp expressions which evaluate to an integer or string value, respectively.

If number and string are both present, they may be given in either order.

If the number argument is given, its value limits the range of lines which are combined together. For instance @(freeform 5) means to consider only the next five lines to be one big line. Without this argument, freeform is "bottomless". It can match the entire file, which creates the risk of allocating a large amount of memory.

If the string argument is given, it specifies a custom line terminator. The default terminator is "\n". The terminator does not have to be one character long.

Freeform does not convert the entire remainder of the input into one big line all at once, but does so in a dynamic, lazy fashion, which takes place as the data is accessed. So at any time, only some prefix of the data exists as a flat line in which newlines are replaced by the terminator string, and the remainder of the data still remains as a list of lines.
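As an illustrative sketch (the input and variable names here are invented), a custom terminator can join physical lines so that a single query line matches across them. Suppose the input consists of the two lines hello and world; a space terminator presents them to the query as the single line hello world:

@(freeform " ")
@first @second

Here @first should bind "hello", since it matches up to the space, and @second takes the material which follows.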

After th