This is the first post in a three-part series.

Part 1:
- Useful methods on the String class
- Introduction to Regular Expressions
- The Select-String cmdlet

Part 2:
- The -split operator
- The -match operator
- The switch statement
- The Regex class

Part 3:
- A real world, complete and slightly bigger, example of a switch-based parser



A task that appears regularly in my workflow is text parsing. It may be about getting a token from a single line of text or about turning the text output of native tools into structured objects so I can leverage the power of PowerShell.

I always strive to create structure as early as I can in the pipeline, so that later on I can reason about the content as properties on objects instead of as text at some offset in a string. This also helps with sorting, since the properties can have their correct type, so that numbers, dates etc. are sorted as such and not as text.
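As a small sketch of why typed properties matter for sorting, the snippet below turns a few lines of text into objects and sorts them. The sample data and the property names Name and Size are made up for illustration.

```powershell
# Hypothetical sample data: the text output of some tool, one "name size" pair per line.
$lines = 'alpha 10', 'beta 9', 'gamma 100'

# Create structure early: one object per line, with Size converted to an integer.
$items = foreach ($line in $lines) {
    $name, $size = $line.Split(' ')
    [pscustomobject]@{
        Name = $name
        Size = [int]$size
    }
}

# Sorting the typed property orders the values numerically: 9, 10, 100.
$sortedSizes = ($items | Sort-Object Size).Size

# Sorting the same values as text orders them character by character: 10, 100, 9.
$sortedAsText = '10', '9', '100' | Sort-Object
```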

There are a number of options available to a PowerShell user, and I’m giving an overview here of the most common ones.

This is not a text about how to create a high performance parser for a language with a structured EBNF grammar. There are better tools available for that, for example ANTLR.

.NET methods on the String class

Any treatment of string parsing in PowerShell would be incomplete if it didn’t mention the methods on the string class. There are a few methods that I’m using more often than others when parsing strings:

| Name | Description |
| --- | --- |
| `Substring(int startIndex)` | Retrieves a substring from this instance. The substring starts at a specified character position and continues to the end of the string. |
| `Substring(int startIndex, int length)` | Retrieves a substring from this instance. The substring starts at a specified character position and has a specified length. |
| `IndexOf(string value)` | Reports the zero-based index of the first occurrence of the specified string in this instance. |
| `IndexOf(string value, int startIndex)` | Reports the zero-based index of the first occurrence of the specified string in this instance. The search starts at a specified character position. |
| `LastIndexOf(string value)` | Reports the zero-based index of the last occurrence of the specified string in this instance. Often used together with `Substring`. |
| `LastIndexOf(string value, int startIndex)` | Reports the zero-based index position of the last occurrence of a specified string within this instance. The search starts at a specified character position and proceeds backward toward the beginning of the string. |

This is a minor subset of the available methods. It may be well worth your time to read up on the String class since it is so fundamental in PowerShell. Docs are found here.
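A minimal sketch of how these methods combine in practice. The input strings are hypothetical; only the method calls come from the table above.

```powershell
# Hypothetical input: a "key=value" line and a file path.
$line = 'name=PowerShell'
$path = 'C:\temp\logs\app.log'

# IndexOf finds the separator; the two Substring overloads split around it.
$i     = $line.IndexOf('=')
$key   = $line.Substring(0, $i)      # 'name'
$value = $line.Substring($i + 1)     # 'PowerShell'

# LastIndexOf together with Substring gets the text after the last backslash.
$file = $path.Substring($path.LastIndexOf('\') + 1)   # 'app.log'
```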

As an example, this can be useful when we have a very large amount of comma-separated input with 15 columns and we are only interested in the third column from the end. If we were to use the -split ',' operator, we would create 15 new strings and an array for each line. On the other hand, using LastIndexOf on the input string a few times and then Substring to get the value of interest is faster and results in just one new string.

    function parseThirdFromEnd([string]$line){
        $i = $line.LastIndexOf(",")             # get the last separator
        $i = $line.LastIndexOf(",", $i - 1)     # get the second to last separator, also the end of the column we are interested in
        $j = $line.LastIndexOf(",", $i - 1)     # get the separator before the column we want
        $j++                                    # move forward past the separator
        $line.Substring($j,$i-$j)               # get the text of the column we are looking for
    }

In this sample, I ignore that IndexOf and LastIndexOf return -1 if they cannot find the text to search for. From experience, I also know that it is easy to mess up the index arithmetic. So while using these methods can improve performance, it is also more error prone and a lot more to type. I would only resort to this when I know the input data is very large and performance is an issue. So this is not a recommendation, or a starting point, but something to resort to.
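To show what handling the -1 case could look like, here is one possible guarded variant. The function name parseThirdFromEndSafe and the choice to return $null for lines with fewer than three columns are assumptions made for this sketch, not part of the original sample.

```powershell
# A sketch of the same lookup with the -1 checks added.
# Returns $null when the line has fewer than three columns.
function parseThirdFromEndSafe([string]$line) {
    $i = $line.LastIndexOf(',')              # last separator
    if ($i -lt 1) { return $null }           # no separator, or nothing before it
    $i = $line.LastIndexOf(',', $i - 1)      # second to last separator
    if ($i -lt 0) { return $null }           # only one separator in the line
    # Separator before the column we want; -1 means the column starts the line.
    $j = if ($i -eq 0) { -1 } else { $line.LastIndexOf(',', $i - 1) }
    $j++                                     # move past the separator (or to index 0)
    $line.Substring($j, $i - $j)
}

parseThirdFromEndSafe 'a,b,c,d,e'            # c
```

The extra checks also illustrate the trade-off: the guarded version is noticeably longer than the original for the same result on well-formed input.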

On rare occasions, I write the whole parser in C#. An example of this is in a module wrapping the Perforce version control system, where the command line tool can output Python dictionaries. It is a binary format, and the use case was complicated enough that I was more comfortable with a compiler-checked implementation language.

Regular Expressions

Almost all of the parsing options in PowerShell make use of regular expressions, so I will start with a short intro of some regular expression concepts that are used later in these posts.

Regular expressions are very useful to know when writing simple parsers since they allow us to express patterns of interest and to capture text that matches those patterns.

It is a very rich language, but you can get quite a long way by learning a few key parts. I’ve found regular-expressions.info to be a good online resource for more information. It is not written directly for the .NET regex implementation, but most of the information is valid across the different implementations.

| Regex | Description |
| --- | --- |
| `*` | Zero or more of the preceding character. `a*` matches the empty string, `a`, `aa`, etc., but not `b`. |
| `+` | One or more of the preceding character. `a+` matches `a`, `aa`, etc., but not the empty string or `b`. |
| `.` | Matches any character (except a newline, by default). |
| `[ax1]` | Any of `a`, `x`, `1`. |
| `[a-d]` | Any of `a`, `b`, `c`, `d`. |
| `\w` | Matches a word character: a-z, A-Z, 0-9, and the `_` (underscore) character. It also matches variants of the characters, such as accented letters. |
| `\W` | The inversion of `\w`. Matches any non-word character. |
| `\s` | Matches white space. |
| `\S` | The inversion of `\s`. Matches any non-whitespace character. |
| `\d` | Matches digits. |
| `\D` | The inversion of `\d`. Matches non-digits. |
| `\b` | Matches a word boundary, that is, the position between a word character and a non-word character, such as a space, or the start or end of the text. |
| `\B` | The inversion of `\b`. `er\B` matches the `er` in `verb` but not the `er` in `never`. |
| `^` | The beginning of a line (by default, of the whole input string). |
| `$` | The end of a line (by default, of the whole input string). |
| `(<expr>)` | Capture groups. |
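A few of these constructs can be tried out directly at the prompt. The sketch below uses the -match operator, which is covered in more detail in Part 2. The sample strings are made up, and most patterns are anchored with ^ and $ so that the whole string has to match, not just a part of it.

```powershell
# -match returns $true when the pattern is found anywhere in the input string.
'aaa'       -match '^a+$'       # True:  one or more a, and nothing else
'b'         -match '^a*$'       # False: b is not zero or more a
'x'         -match '^[ax1]$'    # True:  x is one of a, x, 1
'build_42'  -match '^\w+$'      # True:  letters, underscore and digits are word characters
'two words' -match '^\S+$'      # False: the string contains a whitespace character
'verb'      -match 'er\B'       # True:  er is followed by another word character
'never'     -match 'er\B'       # False: er sits at the end of the word
```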

Combining these, we can create a pattern like below to match a text like:

| Text | Pattern |
| --- | --- |
| `" 42,Answer"` | `^\s+\d+,.+` |
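A quick way to check the pattern is to test it against the text with the -match operator. This is a minimal sketch; the variable names are just for illustration.

```powershell
$text    = ' 42,Answer'
$pattern = '^\s+\d+,.+'

$text -match $pattern           # True:  leading whitespace, digits, a comma, then more text
'42,Answer' -match $pattern     # False: \s+ requires at least one leading whitespace character
```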

The above pattern can be written like this using the x (ignore pattern whitespace) modifier.

Starting the regex with (?x) makes the regex engine ignore whitespace in the pattern (whitespace that should be matched has to be specified explicitly, with \s) and also enables the comment character #.

    (?x)   # this regex ignores whitespace in the pattern. Makes it possible to document a regex with comments.
    ^      # the start of the line
    \s+    # one or more whitespace characters
    \d+    # one or more digits
    ,      # a comma
    .+     # one or more characters of any kind
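In PowerShell, one convenient way to write such a multi-line pattern is a single-quoted here-string, which keeps the line breaks and does not expand $ or backslashes. The sketch below assumes the same sample text as before.

```powershell
$pattern = @'
(?x)   # ignore whitespace in the pattern and allow comments
^      # the start of the line
\s+    # one or more whitespace characters
\d+    # one or more digits
,      # a comma
.+     # one or more characters of any kind
'@

' 42,Answer' -match $pattern    # True, same as the compact ^\s+\d+,.+
```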

By using capture groups, we make it possible to refer back to specific parts of a matched expression.
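As a minimal sketch, adding parentheses to the earlier pattern captures the number and the text separately. After a successful -match, PowerShell exposes the groups through the automatic $Matches variable; the variable names $number and $text are only for illustration.

```powershell
# Parentheses around \d+ and .+ turn them into capture groups 1 and 2.
if (' 42,Answer' -match '^\s+(\d+),(.+)') {
    $number = [int]$Matches[1]   # 42
    $text   = $Matches[2]        # 'Answer'
}
```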