Parsing OFX with Erlang and leex

Companion code for this post available on Github

As part of a push to make keeping track of my finances easier without surrendering banking credentials to popular money-management tools, I have been working on a project that lets me track my income and expenditures in a database, with a simple companion app for adding transactions and visualizing the data. One of the major hurdles was that I had to manually enter each and every transaction, not just categorize it. To solve this problem, we can tap into the data used by apps like Quicken to manage your banking information: OFX. Here, we’ll go over what OFX is, how to get data from your bank in OFX format, and how to lex and parse that data to make it useful.

If you want to skip ahead, the full library is on Github.

History of OFX

OFX is a product of collaboration between Microsoft and Intuit (of Quicken fame) in the late 90s. The initial versions were built on top of SGML, a precursor of XML. As used in OFX, SGML leaf values have no closing tags - a fact that makes modern XML parsers unsuitable for translating it into a document we can work with in code. Banks that offer an ‘Online banking with Quicken’ feature will usually do so through an endpoint that speaks OFX, and access to this API usually costs around $10 per month, depending on the bank.

Fetching OFX from an institution

As a prerequisite for parsing OFX data, we need to acquire some. OFX data transfer takes place over a single API endpoint: you send an OFX document containing one or more stanzas and receive another document with a response. As an example, here’s a request to fetch account information from my Chase account (certain information redacted, of course):

<OFX>
  <SIGNONMSGSRQV1>
    <SONRQ>
      <DTCLIENT>20170326192550
      <USERID>my_bank_username
      <USERPASS>my_bank_password
      <LANGUAGE>ENG
      <FI>
        <ORG>B1
        <FID>10898
      </FI>
      <APPID>QWIN
      <APPVER>2200
      <CLIENTUID>9a7b311e-b6d0-4836-ac07-22d144cfc836
    </SONRQ>
  </SIGNONMSGSRQV1>
  <SIGNUPMSGSRQV1>
    <ACCTINFOTRNRQ>
      <TRNUID>3a23ee72-dcea-4c42-a509-05f6700545a9
      <CLTCOOKIE>1
      <ACCTINFORQ>
        <DTACCTUP>19691231
      </ACCTINFORQ>
    </ACCTINFOTRNRQ>
  </SIGNUPMSGSRQV1>
</OFX>

As you can see, there are two stanzas in this request. The first, SIGNONMSGSRQV1, is common to all requests you will make to the server. It identifies who you are (USERID, USERPASS) and which bank you want to talk to - Chase in this case, represented by its ORG and FID identifiers. Values for your bank can be found online; GnuCash has a good list here. This section also identifies the application we are using to talk to the bank - in this case, I am ‘Quicken’ version 2200. Some banks will refuse to talk to you unless you tell them that yes, you are definitely Quicken.

The second stanza is our actual request for information: an account information transaction request (ACCTINFOTRNRQ) with a unique transaction UUID, containing an account information request section stating that we last checked for account info the day before the UNIX epoch, and so should be assumed to know nothing.
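Building a request body like the one above is mostly string assembly. As a rough sketch (the helper names and the hard-coded Chase ORG/FID are my own illustration, not the companion library's API), the signon stanza could be assembled as an iolist:

```erlang
%% Hypothetical sketch: assemble the SIGNONMSGSRQV1 stanza as an iolist.
%% The tag/2 helper, the fixed DTCLIENT, and the Chase ORG/FID values are
%% illustrative assumptions, not part of the library described in this post.
signon_stanza(UserId, Password) ->
    ["<SIGNONMSGSRQV1><SONRQ>",
     tag("DTCLIENT", "20170326192550"),
     tag("USERID", UserId),
     tag("USERPASS", Password),
     tag("LANGUAGE", "ENG"),
     "<FI>", tag("ORG", "B1"), tag("FID", "10898"), "</FI>",
     tag("APPID", "QWIN"),
     tag("APPVER", "2200"),
     "</SONRQ></SIGNONMSGSRQV1>"].

%% OFX 1.x leaf elements take a value and no closing tag.
tag(Name, Value) ->
    ["<", Name, ">", Value].
```

Because it is an iolist, the result can be flattened (or handed straight to httpc) when the full request body is put together.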

To send this request to our bank, we attach our generic OFX header and then send the request off, being sure to set the content type and Connection: close, which seems to be necessary for some banks.

ofx_request(Url, Body) ->
    Headers = [{"user-agent", "InetClntApp/3.0"},
               {"connection", "close"},
               {"accept", "*/*, application/x-ofx"}],
    ContentType = "application/x-ofx",
    {ok, {_, _, Resp}} = httpc:request(
        post,
        {Url, Headers, ContentType, lists:flatten(Body)},
        [],
        [{body_format, binary}]),
    {ok, Resp}.
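The ‘generic OFX header’ mentioned above is a block of key:value lines prepended to the SGML body. A minimal sketch, assuming OFX version 1.03 with no encryption (a fresh UUID in NEWFILEUID is more polite, but NONE is accepted by many banks):

```erlang
%% A minimal OFX 1.x request header, assuming VERSION:103 and SECURITY:NONE.
%% NEWFILEUID would normally carry a fresh UUID; "NONE" is a simplification.
%% The blank line at the end separates the header from the <OFX> body.
ofx_header() ->
    "OFXHEADER:100\r\n"
    "DATA:OFXSGML\r\n"
    "VERSION:103\r\n"
    "SECURITY:NONE\r\n"
    "ENCODING:USASCII\r\n"
    "CHARSET:1252\r\n"
    "COMPRESSION:NONE\r\n"
    "OLDFILEUID:NONE\r\n"
    "NEWFILEUID:NONE\r\n"
    "\r\n".
```

The body passed to ofx_request/2 would then be [ofx_header(), RequestDocument].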

Lexing OFX

Hopefully, the bank will respond to our information request with a nice blob of SGML (indented for readability):

OFXHEADER:100
DATA:OFXSGML
VERSION:103
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:6841bcd0-bc42-11e6-9ef8-f30ad20f18fe

<OFX>
  <SIGNONMSGSRSV1>
    <SONRS>
      <STATUS>
        <CODE>0
        <SEVERITY>INFO
        <MESSAGE>SUCCESS
      </STATUS>
      <DTSERVER>20170319173139.086[-4:EDT]
      <LANGUAGE>ENG
      <FI>
        <ORG>B1
        <FID>10898
      </FI>
    </SONRS>
  </SIGNONMSGSRSV1>
  <SIGNUPMSGSRSV1>
    <ACCTINFOTRNRS>
      <TRNUID>20161207002420.000
      <STATUS>
        <CODE>0
        <SEVERITY>INFO
      </STATUS>
      <CLTCOOKIE>1
      <ACCTINFORS>
        <DTACCTUP>20170319173139.379[-4:EDT]
        <ACCTINFO>
          <DESC>CREDIT CARD
          <CCACCTINFO>
            <CCACCTFROM>
              <ACCTID>my_credit_card_number
            </CCACCTFROM>
            <SUPTXDL>Y
            <XFERSRC>N
            <XFERDEST>N
            <SVCSTATUS>ACTIVE
          </CCACCTINFO>
        </ACCTINFO>
      </ACCTINFORS>
    </ACCTINFOTRNRS>
  </SIGNUPMSGSRSV1>
</OFX>

Excellent. Now let’s define a representation for this data that we can work with more easily in Erlang. There are two distinct node types: a node may either have a value and no close tag, or some children and a close tag. Let’s represent them as two different records, as such:

-record(ofx_leaf, {name :: nonempty_string(),
                   value :: nonempty_string()}).
-record(ofx_node, {name :: nonempty_string(),
                   children :: [#ofx_node{} | #ofx_leaf{}]}).

Enter Leex

Leex is a lexer generator: a tool for taking our blob of OFX text and turning it into a list of meaningful tokens. In order to do so, we need to specify a couple of rules first. Leex input files have three sections: Definitions, Rules and Erlang code.

The definitions section defines named character classes and patterns (essentially regular-expression macros) that can then be used to build up rules. For example, U can be defined as [A-Z], a shorthand for all uppercase letters. L can then be all lowercase letters ([a-z]), and the two can be combined to refer to all letters as ALPHA = ({U}|{L}).

Once we have a set of definitions for character groups, we can write the rules section. This is where the requisites for tokenisation are defined - for example, we want to emit a token every time we see an opening tag, and we want that token to include the name of the tag. On the left hand side of the rule, we write the match expression - in this case <({TAGCHAR})+>, for one or more characters from the set of allowable tag characters, bounded by angle brackets. On the right hand side, we specify what the lexer should do when it encounters something that matches this pattern. In this case we want to emit a token, so we write {token, {opentag, lists:sublist(TokenChars, 2, TokenLen-2)}}. This emits a token that is a 2-tuple of the atom opentag and a substring of the matched string with the enclosing ‘<>’ removed. So, for example, if the lexer encountered the tag <OFX> it would emit the token {opentag, "OFX"}.
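To see the substringing at work, the right-hand-side expression can be tried directly in an Erlang shell, with TokenChars and TokenLen bound the way leex would bind them for the match <OFX>:

```erlang
1> TokenChars = "<OFX>", TokenLen = length(TokenChars).
5
2> {opentag, lists:sublist(TokenChars, 2, TokenLen - 2)}.
{opentag,"OFX"}
```

lists:sublist/3 takes the start position (1-based) and a length, so starting at position 2 and taking TokenLen - 2 characters drops exactly the leading ‘<’ and trailing ‘>’.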

The final section allows for the definition of plain Erlang functions that can then be used on the right hand side of rules. For example, we could take the substringing code from the match rule we just defined and move it into a convenience function in the code section.
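Such a helper might look like this (a hypothetical sketch; the finished leex file below keeps the substringing inline instead):

```erlang
%% Hypothetical helper that could live under the "Erlang code." section.
%% Strips the "<" (or "</") prefix and the ">" suffix from a matched tag;
%% Skip is the 1-based position where the tag name starts (2 for open
%% tags, 3 for close tags).
tag_name(TokenChars, TokenLen, Skip) ->
    lists:sublist(TokenChars, Skip, TokenLen - Skip).
```

The opentag rule's action would then shrink to {token, {opentag, tag_name(TokenChars, TokenLen, 2)}}, and the closetag rule would pass 3 instead.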

Once we are finished writing our rules, we end up with a leex file that looks like this:

Definitions.

U = [A-Z]
L = [a-z]
D = [0-9]
SAFESYM = [_\-.:+]
SYM = [_\-.:/*+\[\]']
WHITESPACE = [\s\t\n\r]
ALPHA = ({U}|{L})
ALNUM = ({ALPHA}|{D})
ALSYM = ({ALNUM}|{SYM}|{WHITESPACE})
TAGCHAR = ({ALNUM}|{SAFESYM})

Rules.

<({TAGCHAR})+> : {token, {opentag, lists:sublist(TokenChars, 2, TokenLen - 2)}}.
</({TAGCHAR})+> : {token, {closetag, lists:sublist(TokenChars, 3, TokenLen - 3)}}.
{WHITESPACE}+ : skip_token.
{ALSYM}+ : {token, {string, string:strip(TokenChars)}}.

Erlang code.

We emit three kinds of tokens: opentag when a tag is opened, closetag when a tag is closed, and string when we encounter a string literal (a tag value). With just these three types, we can build a parser that turns this list of tokens into a document tree.
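Assuming the rules above are saved as ofx_lexer.xrl (the module name is my choice here, not necessarily the library's), we can generate, compile and exercise the lexer from the shell:

```erlang
1> leex:file("ofx_lexer.xrl").
{ok,"ofx_lexer.erl"}
2> c(ofx_lexer).
{ok,ofx_lexer}
3> {ok, Tokens, _EndLine} = ofx_lexer:string("<STATUS><CODE>0</STATUS>"), Tokens.
[{opentag,"STATUS"},{opentag,"CODE"},{string,"0"},{closetag,"STATUS"}]
```

Note that nothing in the token stream says whether "STATUS" was a leaf or a parent; recovering that structure is the parser's job, below.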

Parsing the tokens

As stated when we built our records, we only have two cases to deal with here: leaf nodes, which will always look like [{opentag, Tag}, {string, Value}, ...], and parent nodes, which will look like [{opentag, Tag}, ...tag_children..., {closetag, Tag}, ...]. This means that leaf nodes can be parsed easily by matching on the head of the token list, and the more complex case of a parent node can be handled by a secondary function that accumulates child nodes until it encounters the matching close tag. The implementation is as follows:

% Parses a list of tags into an OFX data tree.
% Will error out if there are tokens that cannot be parsed as part of the tree.
parse(Tags) ->
    {Tree, Unparsed} = parse_node(Tags),
    [] = Unparsed,
    Tree.

% Parse a single OFX node from tokens.
% Returns the node, and any unused tokens.
parse_node([{opentag, Tag}|[{string, Value}|Tags]]) ->
    {#ofx_leaf{name = Tag, value = Value}, Tags};
parse_node([{opentag, Tag}|Tags]) ->
    {Children, Tags2} = parse_node_list(Tag, Tags),
    {#ofx_node{name = Tag, children = Children}, Tags2}.

% Convenience method for parse_node_list/3.
parse_node_list(EndTag, Tags) ->
    parse_node_list(EndTag, Tags, []).

% Parses a list of child nodes. Stops parsing when a {closetag, } tuple is found
% with a name matching the EndTag.
parse_node_list(_EndTag, [], Nodes) ->
    Nodes;
parse_node_list(EndTag, [Tag|Tags], Nodes) ->
    {Node, Tags2} = parse_node([Tag|Tags]),
    case hd(Tags2) of
        {closetag, EndTag} ->
            {[Node|Nodes], tl(Tags2)};
        _ ->
            parse_node_list(EndTag, Tags2, [Node|Nodes])
    end.

This isn’t robust against malformed SGML, and it needs as many stack frames as the tree is deep, but it gets the job done. If we now take our list of lexed tokens from the previous step and run them through the parser, we should get a workable tree of records, like so:

1> {ok, Client} = ofx:new_client("username", "password", "B1", "10898",
                                 "https://ofx.chase.com").
{ok,<0.415.0>}
2> ofx_client:list_accounts(Client).
{ofx_node,"OFX",
    [{ofx_node,"SIGNUPMSGSRSV1",
         [{ofx_node,"ACCTINFOTRNRS",
              [{ofx_node,"ACCTINFORS",
                   [{ofx_node,"ACCTINFO",
                        [{ofx_node,"CCACCTINFO",
                             [{ofx_leaf,"SVCSTATUS","ACTIVE"},
                              {ofx_leaf,"XFERDEST","N"},
                              {ofx_leaf,"XFERSRC",[...]},
                              {ofx_leaf,[...],...},
                              {ofx_node,...}]},
                         {ofx_leaf,"DESC","CREDIT CARD"}]},
                    {ofx_leaf,"DTACCTUP","20170328125651.415[-4:EDT]"}]},
               {ofx_leaf,"CLTCOOKIE","1"},
               {ofx_node,"STATUS",
                   [{ofx_leaf,"SEVERITY","INFO"},{ofx_leaf,"CODE","0"}]},
               {ofx_leaf,"TRNUID","12a14b40-2154-43ab-b986-67200cab0ec3"}]}]},
 [...]

Using the functions exposed in the ofx_tree module, we can then easily parse out the information we might care about:
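For instance, accessors in the spirit of ofx_tree might look like the following sketch (hypothetical - the real module's API may differ), operating on the records defined earlier:

```erlang
%% Hypothetical accessors over the ofx_node/ofx_leaf records from earlier;
%% the real ofx_tree API may differ. get_child/2 returns the first child
%% node with the given name, get_leaf/2 the value of the first matching leaf.
get_child(Name, #ofx_node{children = Children}) ->
    case [C || C = #ofx_node{name = N} <- Children, N =:= Name] of
        [Node | _] -> Node;
        [] -> undefined
    end.

get_leaf(Name, #ofx_node{children = Children}) ->
    case [V || #ofx_leaf{name = N, value = V} <- Children, N =:= Name] of
        [Value | _] -> Value;
        [] -> undefined
    end.
```

With helpers like these, chaining get_child/2 calls down through SIGNUPMSGSRSV1 to CCACCTINFO and then calling get_leaf("SVCSTATUS", Node) would pull "ACTIVE" out of the tree above.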