Parse shell one-liners with pyparsing



For one of my projects I needed a parser that turns shell one-liners into an AST. I tried PLY, pyPEG and a few others, and settled on pyparsing. It’s actively maintained, works without magic and is easy to use.

Ideally I wanted to parse something like:

    LANG=en_US.utf-8 git diff | wc -l >> diffs

To something like:

    (= LANG en_US.utf-8) (>> (| (git diff) (wc -l)) (diffs))

Let’s start with a simple shell command, which is just a sequence of space-separated tokens:

    import pyparsing as pp

    token = pp.Word(pp.alphanums + '_-.')
    command = pp.OneOrMore(token)

    command.parseString('git branch --help')
    # ['git', 'branch', '--help']
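One detail that is easy to miss here: by default parseString stops quietly at the first character the grammar cannot match. The sketch below repeats the definitions above and passes parseAll=True to turn the unparsed remainder into an error. The quoted argument is only an illustrative input that this token definition does not cover.

```python
import pyparsing as pp

token = pp.Word(pp.alphanums + '_-.')
command = pp.OneOrMore(token)

# Without parseAll, parsing stops at the opening quote and the rest
# of the line is silently dropped.
partial = command.parseString("echo 'hello world'").asList()
print(partial)  # ['echo']

# With parseAll=True the leftover input raises ParseException instead.
try:
    command.parseString("echo 'hello world'", parseAll=True)
    strict_ok = True
except pp.ParseException as error:
    strict_ok = False
    print('parse error:', error)
```

Using parseAll=True while developing a grammar makes unsupported input visible early.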

That was easy. Parsing environment variables is just as simple: a single variable is token=token, and a list of them is separated by spaces:

    env = pp.Group(token + '=' + token)

    env.parseString('A=B')
    # [['A', '=', 'B']]

    env_list = pp.OneOrMore(env)

    env_list.parseString('VAR=test X=1')
    # [['VAR', '=', 'test'], ['X', '=', '1']]

Now we can easily combine the command with environment variables. Keep in mind that environment variables are optional:

    command_with_env = pp.Optional(pp.Group(env_list)) + pp.Group(command)

    command_with_env.parseString('LOCALE=en_US.utf-8 git diff')
    # [[['LOCALE', '=', 'en_US.utf-8']], ['git', 'diff']]
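To see what "optional" means in practice, here is a small sketch that repeats the definitions above and tries two edge cases: a command with no variables at all, and a line with a variable but no command. The inputs are my own examples, not from the original project.

```python
import pyparsing as pp

token = pp.Word(pp.alphanums + '_-.')
command = pp.OneOrMore(token)
env = pp.Group(token + '=' + token)
env_list = pp.OneOrMore(env)
command_with_env = pp.Optional(pp.Group(env_list)) + pp.Group(command)

# No variables: the Optional part matches nothing and is simply absent.
no_env = command_with_env.parseString('git diff').asList()
print(no_env)  # [['git', 'diff']]

# Variables only: the grammar still requires at least one command token,
# so a bare assignment is rejected.
try:
    command_with_env.parseString('LANG=C')
    env_only_ok = True
except pp.ParseException:
    env_only_ok = False
print(env_only_ok)  # False
```

So only the environment part is optional; a bare assignment such as LANG=C would need its own rule.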

Now we need to add support for pipes, redirects and logical operators. We don’t need to know what they do here, so we’ll treat them simply as separators between commands:

    separators = ['1>>', '2>>', '>>', '1>', '2>', '>', '<',
                  '||', '|', '&&', '&', ';']
    separator = pp.oneOf(separators)
    command_with_separator = pp.OneOrMore(
        pp.Group(command) + pp.Optional(separator))

    command_with_separator.parseString('git diff | wc -l >> out.txt')
    # [['git', 'diff'], '|', ['wc', '-l'], '>>', ['out.txt']]
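One thing worth checking with this grammar is how numbered redirects such as 2> are split. Because a token may consist of digits, the 2 is consumed as part of the command before the separator gets a chance to match. The sketch below repeats the definitions above and then, as an assumption of mine rather than something from the original code, adds a negative lookahead (~separator) to the token. The names strict_token, strict_command and strict_line are made up for the illustration.

```python
import pyparsing as pp

separators = ['1>>', '2>>', '>>', '1>', '2>', '>', '<',
              '||', '|', '&&', '&', ';']
separator = pp.oneOf(separators)

# Grammar as in the text: the digit is swallowed by the command.
token = pp.Word(pp.alphanums + '_-.')
command = pp.OneOrMore(token)
line = pp.OneOrMore(pp.Group(command) + pp.Optional(separator))
loose = line.parseString('make 2> err.log').asList()
print(loose)  # [['make', '2'], '>', ['err.log']]

# Hypothetical variant: refuse to start a token where a separator starts.
strict_token = ~separator + pp.Word(pp.alphanums + '_-.')
strict_command = pp.OneOrMore(strict_token)
strict_line = pp.OneOrMore(pp.Group(strict_command) + pp.Optional(separator))
strict = strict_line.parseString('make 2> err.log').asList()
print(strict)  # [['make'], '2>', ['err.log']]
```

Whether the lookahead is worth it depends on the input you expect. For one-liners without numbered redirects the original definition behaves the same.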

And now we can merge environment variables, commands and separators:

    one_liner = pp.Optional(pp.Group(env_list)) + pp.Group(command_with_separator)

    one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')
    # [[['LANG', '=', 'C'], ['DEBUG', '=', 'true']],
    #  [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]]

The result is hard to process, so we need to structure it by naming its parts:

    one_liner = (
        pp.Optional(env_list).setResultsName('env')
        + pp.Group(command_with_separator).setResultsName('command')
    )

    result = one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')
    print('env:', result.env, '\ncommand:', result.command)
    # env: [['LANG', '=', 'C'], ['DEBUG', '=', 'true']]
    # command: [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]

We still don’t have an AST, though, just a bunch of grouped tokens. So now we need to transform them into a proper AST:

    def prepare_command(command):
        """We don't need to work with pyparsing internal data structures,
        so we just convert them to list.
        """
        for part in command:
            if isinstance(part, str):
                yield part
            else:
                yield list(part)


    def separator_position(command):
        """Find last separator position."""
        for n, part in enumerate(command[::-1]):
            if part in separators:
                return len(command) - n - 1


    def command_to_ast(command):
        """Recursively transform command to AST."""
        n = separator_position(command)
        if n is None:
            return tuple(command[0])
        else:
            return (command[n],
                    command_to_ast(command[:n]),
                    command_to_ast(command[n + 1:]))


    def to_ast(parsed):
        if parsed.env:
            for env in parsed.env:
                yield ('=', env[0], env[2])
        command = list(prepare_command(parsed.command))
        yield command_to_ast(command)


    list(to_ast(result))
    # [('=', 'LANG', 'C'),
    #  ('=', 'DEBUG', 'true'),
    #  ('>>', ('|', ('git', 'branch'),
    #               ('wc', '-l')),
    #         ('out.txt',))]
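Splitting on the last separator is what makes chains nest to the left. To make that visible without pyparsing, here is a minimal standalone sketch of the same recursion on plain lists. The names SEPARATORS, last_separator and build are stand-ins I chose for this illustration, and the input is a hand-written token list.

```python
SEPARATORS = ['1>>', '2>>', '>>', '1>', '2>', '>', '<',
              '||', '|', '&&', '&', ';']


def last_separator(parts):
    """Return the index of the right-most separator, or None."""
    for n in range(len(parts) - 1, -1, -1):
        if isinstance(parts[n], str) and parts[n] in SEPARATORS:
            return n
    return None


def build(parts):
    """Split on the last separator and recurse into both sides."""
    n = last_separator(parts)
    if n is None:
        # No separator left: a single command, stored as a tuple.
        return tuple(parts[0])
    return (parts[n], build(parts[:n]), build(parts[n + 1:]))


chain = [['cat', 'log'], '|', ['sort'], '|', ['uniq', '-c']]
ast = build(chain)
print(ast)
# ('|', ('|', ('cat', 'log'), ('sort',)), ('uniq', '-c'))
```

The right-most pipe ends up at the root, so the tree reads as "(cat log | sort) | uniq -c". Note that all separators are treated with equal precedence in this scheme.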

It works. The last part is a bit of glue that makes it easier to use:

    def parse(command):
        result = one_liner.parseString(command)
        ast = to_ast(result)
        return list(ast)


    parse('LANG=en_US.utf-8 git diff | wc -l >> diffs')
    # [('=', 'LANG', 'en_US.utf-8'),
    #  ('>>', ('|', ('git', 'diff'), ('wc', '-l')), ('diffs',))]
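Since the AST is nothing but nested tuples of strings, it is easy to walk. As a small sketch, the hypothetical helper below (to_sexpr is my name, not part of the original code) renders the result in the parenthesized notation used at the top of the post. The ast value is copied from the output shown above.

```python
def to_sexpr(node):
    """Render a nested tuple AST as a parenthesized expression."""
    if isinstance(node, str):
        return node
    return '(' + ' '.join(to_sexpr(child) for child in node) + ')'


# AST as returned by parse() for the example one-liner.
ast = [
    ('=', 'LANG', 'en_US.utf-8'),
    ('>>', ('|', ('git', 'diff'), ('wc', '-l')), ('diffs',)),
]

rendered = ' '.join(to_sexpr(node) for node in ast)
print(rendered)
# (= LANG en_US.utf-8) (>> (| (git diff) (wc -l)) (diffs))
```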

It can’t parse every one-liner, though. For example, it doesn’t support nested commands like:

    echo $(git branch)
    echo `git branch`

But it’s enough for my task, and support for the missing features can be added easily.
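As a rough illustration of how such a feature might be added, here is a minimal sketch of $( ... ) substitution using pyparsing's Forward to make the grammar recursive. It is not part of the original parser. The names substitution and nested_command are hypothetical, and backticks, quoting and separators inside the substitution are left out.

```python
import pyparsing as pp

token = pp.Word(pp.alphanums + '_-.')

# Forward declares a rule that can refer to itself before it is defined.
nested_command = pp.Forward()
substitution = pp.Group(
    pp.Suppress('$(') + nested_command + pp.Suppress(')'))
nested_command <<= pp.OneOrMore(substitution | token)

flat = nested_command.parseString(
    'echo $(git branch)', parseAll=True).asList()
print(flat)  # ['echo', ['git', 'branch']]

deep = nested_command.parseString(
    'echo $(basename $(pwd))', parseAll=True).asList()
print(deep)  # ['echo', ['basename', ['pwd']]]
```

Each substitution becomes a nested list, so a later pass could turn it into its own AST node in the same way commands are handled.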

Gist with source code.