

When I started working on Baseball Odds, I knew I was going to have to worry about performance - the data set I have for the win probability has right around 15000 records. So I thought it would be neat to compare different file formats and how long each one takes to read in. Each record has the inning number (with top or bottom), how many outs there are, which runners are on base, the score difference, the number of times that situation occurred, and the number of times the current team won. Here's a brief description of each format and some sample code:





Text:

This was actually the format I already had the data in, as it matched Phil Birnbaum's data file format. A sample line looks like this:

    "H",1,0,1,0,81186,47975

and there are 15000 lines in the file. The code to parse this looks something like this:

    // USE_AWAIT reads each line asynchronously; CONFIGURE_AWAIT adds ConfigureAwait(false)
    const bool USE_AWAIT = false;
    const bool CONFIGURE_AWAIT = false;
    var resource = System.Windows.Application.GetResourceStream(
        new Uri(@"Data\winProbs.txt", UriKind.Relative));
    using (StreamReader sr = new StreamReader(resource.Stream))
    {
        string line;
        if (USE_AWAIT)
        {
            if (CONFIGURE_AWAIT)
            {
                line = await sr.ReadLineAsync().ConfigureAwait(false);
            }
            else
            {
                line = await sr.ReadLineAsync();
            }
        }
        else
        {
            line = sr.ReadLine();
        }
        while (line != null)
        {
            // Fields: home/visitor, inning, outs, baserunners, run difference, situations, wins
            var parts = line.Split(',');
            bool isHome = (parts[0] == "\"H\"");
            _fullData.Add(
                new Tuple<bool, byte, byte, byte, sbyte>(
                    isHome,
                    byte.Parse(parts[1]), byte.Parse(parts[2]),
                    byte.Parse(parts[3]), sbyte.Parse(parts[4])),
                new Tuple<UInt32, UInt32>(
                    UInt32.Parse(parts[5]), UInt32.Parse(parts[6])));
            // Read the next line the same way as the first
            if (USE_AWAIT)
            {
                if (CONFIGURE_AWAIT)
                {
                    line = await sr.ReadLineAsync().ConfigureAwait(false);
                }
                else
                {
                    line = await sr.ReadLineAsync();
                }
            }
            else
            {
                line = sr.ReadLine();
            }
        }
    }



(What are USE_AWAIT and CONFIGURE_AWAIT all about? See the results below...)





JSON:



To avoid having to write my own parsing code, I decided to write the data in a JSON format and use Json.NET to parse it. One line of the data file looks like this:

{isHome:1,inning:1,outs:0,baserunners:1,runDiff:0,numSituations:81186,numWins:47975}

This is admittedly a bit verbose, and it makes the file over a megabyte. The parsing code is simple, though:

    var resource = System.Windows.Application.GetResourceStream(
        new Uri(@"Data\winProbs.json", UriKind.Relative));
    using (StreamReader sr = new StreamReader(resource.Stream))
    {
        // Read the whole file, then parse it as one JSON array
        string allDataString = await sr.ReadToEndAsync();
        JArray allDataArray = JArray.Parse(allDataString);
        for (int i = 0; i < allDataArray.Count; ++i)
        {
            JObject dataObj = (JObject)(allDataArray[i]);
            _fullData.Add(
                new Tuple<bool, byte, byte, byte, sbyte>(
                    (int)dataObj["isHome"] == 1,
                    (byte)dataObj["inning"],
                    (byte)dataObj["outs"],
                    (byte)dataObj["baserunners"],
                    (sbyte)dataObj["runDiff"]),
                new Tuple<UInt32, UInt32>(
                    (UInt32)dataObj["numSituations"],
                    (UInt32)dataObj["numWins"]));
        }
    }



After I posted this, Martin Suchan pointed out that using JsonConvert might be faster, and even wrote some code to try it out.
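For reference, here is a minimal sketch of what the JsonConvert approach could look like. This is not Martin's code: the WinProbRecord and WinProbJson names are hypothetical, and it assumes the file is a single JSON array of objects shaped like the sample line above. isHome is kept as an int here because the file stores it as 1 or 0.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical class; each property maps to one key in the JSON sample above.
public class WinProbRecord
{
    [JsonProperty("isHome")] public int IsHome { get; set; }          // 1 = home, 0 = visitor
    [JsonProperty("inning")] public byte Inning { get; set; }
    [JsonProperty("outs")] public byte Outs { get; set; }
    [JsonProperty("baserunners")] public byte Baserunners { get; set; }
    [JsonProperty("runDiff")] public sbyte RunDiff { get; set; }
    [JsonProperty("numSituations")] public uint NumSituations { get; set; }
    [JsonProperty("numWins")] public uint NumWins { get; set; }
}

public static class WinProbJson
{
    // Deserializes the whole JSON array in one call instead of
    // walking a JArray and casting each field by hand.
    public static List<WinProbRecord> ParseAll(string json)
    {
        return JsonConvert.DeserializeObject<List<WinProbRecord>>(json);
    }
}
```

The string passed to ParseAll would be the same allDataString read with ReadToEndAsync in the code above.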



Binary:



To try to get the file to be as small as possible (which I suspected correlated with parsing time), I converted the file to a custom binary format. Here's my textual description of the format:

    UInt32 = total num records
    UInt32 = num of records that have UInt32 for num situations (these come first)
    each record is:
        UInt8 = high bit = visitor=0, home=1
                rest is inning (1-26)
        UInt8 = high 2 bits = num outs (0-2)
                rest is baserunners (1-8)
        Int8 = score diff (-26 to 27)
        UInt32/UInt16 = num situations
        UInt16 = num of wins
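As a rough illustration of this layout, here is a minimal sketch of how records could be written out with BinaryWriter. It is not the actual conversion app: the WinProbRow and WinProbBinaryWriter names are hypothetical, and it assumes a record gets a UInt32 situation count only when the value doesn't fit in a UInt16.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical record type: one unpacked row of the data set.
public class WinProbRow
{
    public bool IsHome;
    public byte Inning;         // 1-26
    public byte Outs;           // 0-2
    public byte Baserunners;    // 1-8
    public sbyte ScoreDiff;     // -26 to 27
    public uint NumSituations;
    public ushort NumWins;
}

public static class WinProbBinaryWriter
{
    // Writes the two header counts, then every record in the layout described above.
    // Assumption: "wide" records are those whose num situations exceeds UInt16 range.
    public static void Write(Stream output, IEnumerable<WinProbRow> rows)
    {
        List<WinProbRow> all = rows.ToList();
        List<WinProbRow> wide = all.Where(r => r.NumSituations > ushort.MaxValue).ToList();
        List<WinProbRow> narrow = all.Where(r => r.NumSituations <= ushort.MaxValue).ToList();

        var bw = new BinaryWriter(output);   // not disposed, so the caller keeps the stream open
        bw.Write((uint)all.Count);           // total num records
        bw.Write((uint)wide.Count);          // num records with a UInt32 situation count
        foreach (WinProbRow r in wide.Concat(narrow))
        {
            bw.Write((byte)((r.IsHome ? 0x80 : 0x00) | r.Inning));   // high bit = home
            bw.Write((byte)((r.Outs << 6) | r.Baserunners));         // high 2 bits = outs
            bw.Write(r.ScoreDiff);
            if (r.NumSituations > ushort.MaxValue)
            {
                bw.Write(r.NumSituations);           // 4 bytes
            }
            else
            {
                bw.Write((ushort)r.NumSituations);   // 2 bytes
            }
            bw.Write(r.NumWins);
        }
        bw.Flush();
    }
}
```

With this layout a wide record takes 9 bytes and a narrow one takes 7, after an 8-byte header.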

To format the file this way, I had to write a Windows 8 app that read in the text file and wrote out the binary version using a BinaryWriter with the Write(Byte), etc. methods. Here's the parsing code:

    var resource = System.Windows.Application.GetResourceStream(
        new Uri(@"Data\winProbs.bin", UriKind.Relative));
    using (var br = new System.IO.BinaryReader(resource.Stream))
    {
        UInt32 totalRecords = br.ReadUInt32();
        UInt32 recordsWithUInt32 = br.ReadUInt32();
        for (UInt32 i = 0; i < totalRecords; ++i)
        {
            byte inning = br.ReadByte();
            byte outsRunners = br.ReadByte();
            sbyte scoreDiff = br.ReadSByte();
            // The first recordsWithUInt32 records store num situations as a UInt32
            UInt32 numSituations = (i < recordsWithUInt32) ? br.ReadUInt32() : br.ReadUInt16();
            UInt16 numWins = br.ReadUInt16();
            _compressedData.Add(
                new Tuple<byte, byte, sbyte>(inning, outsRunners, scoreDiff),
                new Tuple<uint, ushort>(numSituations, numWins));
        }
    }
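The parsing code keeps the first two bytes of each record packed, so they still need to be unpacked when a record is looked up. Here is a minimal sketch of that step based on the format description above. WinProbDecoder is a hypothetical helper, not code from the app, and WinProbability assumes the win probability is simply wins divided by situations.

```csharp
// Hypothetical helper for unpacking the bit fields described in the format above.
public static class WinProbDecoder
{
    // First byte, high bit: visitor = 0, home = 1.
    public static bool IsHome(byte inningByte) { return (inningByte & 0x80) != 0; }

    // First byte, low 7 bits: inning.
    public static int Inning(byte inningByte) { return inningByte & 0x7F; }

    // Second byte, high 2 bits: number of outs.
    public static int Outs(byte outsRunners) { return outsRunners >> 6; }

    // Second byte, low 6 bits: baserunners value.
    public static int Baserunners(byte outsRunners) { return outsRunners & 0x3F; }

    // Assumption: win probability = wins / situations (0 when there is no data).
    public static double WinProbability(uint numSituations, ushort numWins)
    {
        return numSituations == 0 ? 0.0 : (double)numWins / numSituations;
    }
}
```

For the sample record from the text file (home team, inning 1, 0 outs, baserunners value 1), the two packed bytes would be 0x81 and 0x01.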





Results:



Without further ado, here are the file sizes and how long the files took to read and parse (running on my Lumia 1020):

    Type                                            File size   Time to parse
    Text (USE_AWAIT=true, CONFIGURE_AWAIT=false)    278KB       4.8 secs
    Text (USE_AWAIT=true, CONFIGURE_AWAIT=true)     278KB       0.4 secs
    Text (USE_AWAIT=false)                          278KB       0.4 secs
    JSON (parsing one at a time)                    1200KB      3.2 secs
    JSON (using JsonConvert)                        1200KB      1.3 secs
    Binary                                          103KB       0.15 secs



A few observations:

Apparently there is some overhead involved in awaiting 15000 calls, as this increased the time to parse the text file from 0.4 secs to 4.8 secs! Not hugely surprising, but something to keep in mind - if a call is going to be very short and you're going to be doing it many times, try not awaiting it. If you do want to use await, you can call ConfigureAwait(false) so the continuation isn't forced back into its original context - this seems to almost entirely eliminate the overhead. For more information, I'd recommend the article Best Practices in Asynchronous Programming.


JSON is very convenient, but apart from the text case with un-configured awaits, its parsing time was by far the longest. I'll keep this in mind for my other apps - it might be worth investing in a different format even if parsing it is more of a hassle. However, using JsonConvert cut down on the time significantly (from 3.2 secs to 1.3 secs). I had always avoided doing that because of the pain of declaring a class, but I'll definitely do it in the future!


Parsing a simple text file can be quite fast (keeping the above await caveat in mind).


Binary was the clear winner here in both file size and parsing time. However, the code to write the file (not shown here) took much more work than either of the other two formats, including the time I spent analyzing the range of values of each entry and figuring out how tightly I could pack them. And if I need to update this data and the range of an entry suddenly expands so it doesn't fit in a byte any more (for example), it's a huge hassle to rewrite both the writing and the parsing code. Binary formats are not a free lunch!
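On the await observation above: one way to avoid awaiting thousands of tiny calls is to await a single read of the whole stream and then split the lines synchronously. This is only a sketch of that idea and was not one of the configurations timed for this post; TextLoader and ReadAllLinesAsync are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class TextLoader
{
    // Awaits once for the whole stream, then walks the lines without any further awaits.
    public static async Task<List<string>> ReadAllLinesAsync(Stream stream)
    {
        var lines = new List<string>();
        using (var sr = new StreamReader(stream))
        {
            string allText = await sr.ReadToEndAsync().ConfigureAwait(false);
            using (var lineReader = new StringReader(allText))
            {
                string line;
                while ((line = lineReader.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
        }
        return lines;
    }
}
```

Each returned line could then go through the same Split(',') parsing as in the text format code. The tradeoff is that the whole file sits in memory as one string for a moment, which should be fine for a 278KB file.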



Since I had already done all the work, I went with the binary format, and Baseball Odds starts up lickety-split!



--



See all my Windows Phone development posts.



I'm planning on writing more posts about Windows Phone development - what would you like to hear about? Reply here, on twitter at @gregstoll, or by email at greg@gregstoll.com.