ParseKit Framework

Tokenization

The API for tokenization is provided by the PKTokenizer class. Cocoa developers will be familiar with the NSScanner class in the Foundation framework, which provides a similar service. However, the PKTokenizer class is simpler and more powerful for many use cases.

Example usage:

    NSString *s = @"\"It's 123 blast-off!\", she said, // watch out!\n"
                  @"and <= 3.5 'ticks' later /* wince */, it's blast-off!";
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];

    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;

    while ((tok = [t nextToken]) != eof) {
        NSLog(@" (%@)", tok);
    }

outputs:

("It's 123 blast-off!") (,) (she) (said) (,) (and) (<=) (3.5) ('ticks') (later) (,) (it's) (blast-off) (!)

Each token produced is an object of class PKToken. PKTokens have a tokenType (Word, Symbol, Number, QuotedString, etc.) and both a stringValue and a floatValue.

More information about a token can be easily discovered using the -debugDescription method instead of the default -description. Replace the line containing NSLog above with this line:

NSLog(@"%@", [tok debugDescription]);

and each token’s type will be printed as well:

<Quoted String «"It's 123 blast-off!"»> <Symbol «,»> <Word «she»> <Word «said»> <Symbol «,»> <Word «and»> <Symbol «<=»> <Number «3.5»> <Quoted String «'ticks'»> <Word «later»> <Symbol «,»> <Word «it's»> <Word «blast-off»> <Symbol «!»>
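The tokenType, stringValue and floatValue properties described above can also be read directly while iterating. The following is a minimal sketch rather than a definitive recipe: the helper name SumOfNumbers is hypothetical, the umbrella header name is an assumption, and the isNumber convenience accessor is assumed to be available on PKToken (comparing tokenType against the Number type would serve the same purpose).

```objc
#import <Foundation/Foundation.h>
#import <ParseKit/ParseKit.h> // assumption: the framework's umbrella header

// Hypothetical helper: adds up the floatValue of every Number token in s
// and ignores Word, Symbol and Quoted String tokens.
static double SumOfNumbers(NSString *s) {
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    double sum = 0.0;

    while ((tok = [t nextToken]) != eof) {
        if (tok.isNumber) { // assumption: convenience accessor on PKToken
            sum += tok.floatValue;
        }
    }
    return sum;
}
```

A call such as SumOfNumbers(@"2 apples and 3.5 pears") would be expected to visit five tokens and add only the two Number tokens.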

As you can see from the output, PKTokenizer is configured by default to properly group characters into tokens, including:

single- and double-quoted string tokens

common multiple-character symbols (<=)

apostrophes, dashes and other symbol chars that should not signal the start of a new Symbol token, but rather be included in the current Word or Number token (it's, blast-off, 3.5)

silently ignoring C- and C++-style comments

silently ignoring whitespace

The PKTokenizer class is very flexible, and all of those features are configurable. PKTokenizer may be configured to:

recognize more (or fewer) multi-char symbols. ex:
    [t.symbolState add:@"!="];
allows != to be recognized as a single Symbol token rather than two adjacent Symbol tokens

add new internal symbol chars to be included in the current Word token, OR make internal symbols like apostrophe and dash signal a new Symbol token rather than being part of the current Word token. ex:
    [t.wordState setWordChars:YES from:'_' to:'_'];
allows Word tokens to contain internal underscores
    [t.wordState setWordChars:NO from:'-' to:'-'];
disallows Word tokens from containing internal dashes

change which chars signal the start of a token of any given type. ex:
    [t setTokenizerState:t.wordState from:'_' to:'_'];
allows Word tokens to start with underscore
    [t setTokenizerState:t.quoteState from:'*' to:'*'];
allows Quoted String tokens to start with an asterisk, effectively making * a new quote symbol (like " or ')

turn off recognition of single-line “slash-slash” (//) comments. ex:
    [t setTokenizerState:t.symbolState from:'/' to:'/'];
slash chars now produce individual Symbol tokens rather than causing the tokenizer to strip text until the next newline char or begin stripping for a multi-line comment if appropriate (/*)

turn on recognition of “hash” (#) single-line comments. ex:
    [t setTokenizerState:t.commentState from:'#' to:'#'];
    [t.commentState addSingleLineStartSymbol:@"#"];

turn on recognition of “XML/HTML” (<!-- -->) multi-line comments. ex:
    [t setTokenizerState:t.commentState from:'<' to:'<'];
    [t.commentState addMultiLineStartSymbol:@"<!--" endSymbol:@"-->"];

report (rather than silently consume) Comment tokens. ex:
    t.commentState.reportsCommentTokens = YES; // default is NO

report (rather than silently consume) Whitespace tokens. ex:
    t.whitespaceState.reportsWhitespaceTokens = YES; // default is NO

turn on recognition of any characters (say, digits) as whitespace to be silently ignored. ex:
    [t setTokenizerState:t.whitespaceState from:'0' to:'9'];
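Several of the options listed above can be combined on a single tokenizer before the first call to nextToken. The following is a minimal sketch under stated assumptions: the helper name LogConfiguredTokens and the sample input are hypothetical, and the umbrella header name is assumed. The configuration calls themselves are the ones shown in the list above.

```objc
#import <Foundation/Foundation.h>
#import <ParseKit/ParseKit.h> // assumption: the framework's umbrella header

// Hypothetical helper: tokenizes a line that uses a "!=" operator and a
// "#" single-line comment, then logs each token with its type.
static void LogConfiguredTokens(void) {
    NSString *s = @"a != b # trailing note";
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];

    // Treat "!=" as one Symbol token instead of "!" followed by "=".
    [t.symbolState add:@"!="];

    // Route '#' to the comment state and register it as a
    // single-line comment start symbol.
    [t setTokenizerState:t.commentState from:'#' to:'#'];
    [t.commentState addSingleLineStartSymbol:@"#"];

    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    while ((tok = [t nextToken]) != eof) {
        NSLog(@"%@", [tok debugDescription]);
    }
}
```

Because reportsCommentTokens is left at its default of NO, the text after # should be consumed silently rather than reported as a token.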

Parsing

ParseKit also includes a collection of token parser subclasses (of the abstract PKParser class), including collection parsers such as PKAlternation, PKSequence, and PKRepetition as well as terminal parsers including PKWord, PKNum, PKSymbol, PKQuotedString, etc. Also included are parser subclasses which work on individual chars, such as PKChar, PKDigit, and PKSpecificChar. These char parsers are useful for things like RegEx parsing. Generally speaking though, the token parsers will be more useful and interesting.

The parser classes represent a Composite pattern. Programs can build a composite parser in Objective-C (rather than in a separate language, as with lex and yacc) from a collection of terminal parsers composed into alternations, sequences, and repetitions to represent an infinite number of languages.

Parsers built from ParseKit are non-deterministic, recursive descent parsers, which basically means they trade some performance for ease of user programming and simplicity of implementation.

Here is an example of how one might build a parser for a simple voice-search command language (note: ParseKit does not include any kind of speech recognition technology). The language consists of:

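search google for? <search-term>

Here the word for is optional and the search term is a quoted string, as in: search google 'iphone'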

    ...
    [self parseString:@"search google 'iphone'"];
    ...

    - (void)parseString:(NSString *)s {
        PKSequence *parser = [PKSequence sequence];

        // the literal words "search" and "google" must match but are discarded
        [parser add:[[PKLiteral literalWithString:@"search"] discard]];
        [parser add:[[PKLiteral literalWithString:@"google"] discard]];

        // "for" is optional: match either nothing or the literal "for"
        PKAlternation *optionalFor = [PKAlternation alternation];
        [optionalFor add:[PKEmpty empty]];
        [optionalFor add:[PKLiteral literalWithString:@"for"]];

        [parser add:[optionalFor discard]];

        // the search term is a quoted string; call back when it is matched
        PKParser *searchTerm = [PKQuotedString quotedString];
        [searchTerm setAssembler:self selector:@selector(workOnSearchTermAssembly:)];
        [parser add:searchTerm];

        PKAssembly *result = [parser bestMatchFor:[PKTokenAssembly assemblyWithString:s]];

        NSLog(@" %@", result);

        // output:
        // ['iphone']search/google/'iphone'^
    }

    ...

    - (void)workOnSearchTermAssembly:(PKAssembly *)a {
        PKToken *t = [a pop]; // a QuotedString token with a stringValue of 'iphone'
        [self doGoogleSearchForTerm:t.stringValue];
    }
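As a further illustration of composing parsers, the fixed word google in the example above could be swapped for an alternation over several engine names. This is a sketch only: the helper name MatchSearchCommand and the engine name "bing" are hypothetical, the umbrella header name is assumed, and the nil check assumes that bestMatchFor: returns nil when the input does not match. Every ParseKit call used here already appears in the example above.

```objc
#import <Foundation/Foundation.h>
#import <ParseKit/ParseKit.h> // assumption: the framework's umbrella header

// Hypothetical helper: matches  search (google | bing) <quoted-string>
// and returns the resulting assembly.
static PKAssembly *MatchSearchCommand(NSString *s) {
    // Alternation: exactly one of the engine names must match.
    PKAlternation *engine = [PKAlternation alternation];
    [engine add:[PKLiteral literalWithString:@"google"]];
    [engine add:[PKLiteral literalWithString:@"bing"]];

    // Sequence: every subparser must match, in order.
    PKSequence *parser = [PKSequence sequence];
    [parser add:[PKLiteral literalWithString:@"search"]];
    [parser add:engine];
    [parser add:[PKQuotedString quotedString]];

    PKAssembly *result = [parser bestMatchFor:[PKTokenAssembly assemblyWithString:s]];
    if (!result) { // assumption: nil signals that nothing matched
        NSLog(@"no match for: %@", s);
    }
    return result;
}
```

Since none of the terminals are discarded in this variant, all three matched tokens would be expected to remain on the assembly's stack.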