Introduction

A few days ago, in a discussion in Facebook’s PowerShell group, I realized that many people don’t know how to tokenize a script or use the Abstract Syntax Tree (AST) to better understand their PowerShell scripts.

I think the reason for this unawareness is mostly that it doesn’t affect your ability to write scripts, but you have to agree that it helps to understand how the PowerShell engine interprets your script at the lexical level, hence this quick blog post.

What is Tokenization/Lexical Analysis?

Lexical analysis is the process of converting a sequence of characters into a sequence of tokens, that is, units with an assigned and thus identified meaning. The PowerShell engine performs all of these operations for you under the hood; the following picture shows what a PowerShell script looks like when it is tokenized.
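As a minimal sketch of what this looks like in practice, the built-in PSParser API can tokenize a script string for you; the sample script below is just an illustrative assumption:

```powershell
# An arbitrary one-line script to tokenize.
$script = '$name = "World"; Write-Output "Hello $name"'

# PSParser.Tokenize returns one PSToken per lexical unit,
# collecting any syntax errors into $errors.
$errors = $null
$tokens = [System.Management.Automation.PSParser]::Tokenize($script, [ref]$errors)

# Each token carries its text, its type, and its position in the source.
$tokens | Select-Object Content, Type, StartLine, StartColumn
```

Running this shows the script broken into units such as Variable, Operator, String, and Command tokens, each with the line and column where it starts.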

First, your script is tokenized into meaningful lexical units, and then, after each token is parsed, converted into a tree called a parse tree. It looks something like the image below and contains all the tokens of the script.

The parse tree helps the execution engine understand how to execute the script, that is, the order or sequence in which it should evaluate the expressions in the script.

What is an Abstract Syntax Tree?

An abstract syntax tree is a parse tree without every detail that appears in the real syntax; things like ‘(’ (parentheses) and ‘{}’ (braces) are omitted. Sometimes the parser creates the AST directly from the grammar of the language, and sometimes it converts the tokens into a parse tree and then converts that into an AST.

The following is an image of an abstract syntax tree.

In PowerShell, it looks like this:
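To see the AST yourself, the PowerShell 3.0+ parser API can build one directly from a script string. A minimal sketch, using an assumed sample function for illustration:

```powershell
# An arbitrary script containing one function definition.
$script = 'function Get-Greeting { param($Name) "Hello $Name" }'

# Parser.ParseInput returns the root ScriptBlockAst and also hands
# back the raw tokens and any parse errors by reference.
$tokens = $null
$errors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput(
    $script, [ref]$tokens, [ref]$errors)

# FindAll walks the whole tree ($true = recurse into nested blocks);
# here it returns every function definition node.
$functions = $ast.FindAll({ param($node)
        $node -is [System.Management.Automation.Language.FunctionDefinitionAst] }, $true)

$functions | Select-Object Name
```

The root node is a ScriptBlockAst, and every statement and expression in the script hangs off it as a child node, which is what the tree pictures above represent.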

The overall process of converting a source string into a parse/abstract syntax tree, and then to execution and output, looks somewhat like the following image.

Tokenization and Making an Abstract Syntax Tree in PowerShell

Use cases

Forensics: To understand how the PowerShell engine perceives each lexical unit, for example, to see how the Foreach statement and the ForEach-Object cmdlet differ when tokenized, even when the latter is used through its alias ‘Foreach’. In the example below, when the parser tokenizes the script, you can see the difference in token type even though we used the same word ‘Foreach’ on both lines.

Finding comments, functions, or variables in a script: The following is a link to one of my old blog posts where I used tokenization to extract comments from a PowerShell script.
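The foreach comparison above can be sketched as follows; the two-line sample script is an assumption made for illustration:

```powershell
# The same word 'foreach' used two ways: as a language keyword in a
# foreach statement, and as the alias for ForEach-Object in a pipeline.
$script = @'
foreach ($i in 1..3) { $i }
1..3 | foreach { $_ }
'@

$errors = $null
$tokens = [System.Management.Automation.PSParser]::Tokenize($script, [ref]$errors)

# Pull out just the 'foreach' tokens and compare their types.
$tokens | Where-Object { $_.Content -eq 'foreach' } |
    Select-Object Content, Type, StartLine
```

The token on line 1 comes back with type Keyword, while the one on line 2 comes back with type Command, which is exactly the difference the parser sees even though the source text is identical. Filtering the same token list on a type of Comment is how the comment-extraction trick from the linked post works.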

Hope you’ll find this article useful. Thanks for reading!

Follow @singhprateik