[Haskell-cafe] Parallel parsing & multicore

Hi, I've put together a parser that works like this: - the full input is read into a strict ByteString; - such string is split into a number of lines; - each line is fed (independently) to a parser based on Parsec (version 3). Running the serial version of the parser (compiled with -O2 optimizations) on a rather large file, I get the following: $ ./Parser +RTS -s 2,175,464,100 bytes allocated in the heap 49,293,632 bytes copied during GC 5,297,560 bytes maximum residency (6 sample(s)) 1,406,800 bytes maximum slop 16 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 4140 collections, 0 parallel, 0.39s, 0.42s elapsed Generation 1: 6 collections, 0 parallel, 0.05s, 0.07s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 1.93s ( 1.99s elapsed) GC time 0.44s ( 0.48s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.37s ( 2.47s elapsed) %GC time 18.7% (19.5% elapsed) Alloc rate 1,127,336,508 bytes per MUT second Productivity 81.2% of total user, 78.1% of total elapsed The profile report: COST CENTRE MODULE %time %alloc >>=_a6OH Text.Parsec.Prim 20.6 13.0 tokenPrimEx Text.Parsec.Prim 16.5 20.8 fmap_a6Qj Text.Parsec.Prim 15.5 3.6 parserBind Text.Parsec.Prim 12.4 9.3 mplus_a6KS Text.Parsec.Prim 10.3 10.6 notFollowedBy Text.Parsec.Combinator 3.1 4.4 [...] shows that the biggest part of the execution time is spent in the parser (not in the read & line split operations, as expected). Given that each line is parsed independently, I used the following code: map parse lines `using` parListChunk 250 rnf to spark a separate computation for each group of 250 lines. I made the datatype used by the parser an instance of NFData, so that the "rnf" strategy should force a complete evaluation. The threaded version running on 2 cores is moderately faster than the serial one: $ ./Parser +RTS -s -N2 2,377,165,256 bytes allocated in the heap 36,320,944 bytes copied during GC 6,020,720 bytes maximum residency (6 sample(s)) 6,933,928 bytes maximum slop 21 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 2410 collections, 0 parallel, 0.33s, 0.34s elapsed Generation 1: 6 collections, 4 parallel, 0.06s, 0.05s elapsed Parallel GC work balance: 1.83 (2314641 / 1265968, ideal 2) Task 0 (worker) : MUT time: 2.43s ( 1.19s elapsed) GC time: 0.02s ( 0.02s elapsed) Task 1 (worker) : MUT time: 2.15s ( 1.19s elapsed) GC time: 0.29s ( 0.30s elapsed) Task 2 (worker) : MUT time: 2.37s ( 1.19s elapsed) GC time: 0.07s ( 0.08s elapsed) Task 3 (worker) : MUT time: 2.45s ( 1.19s elapsed) GC time: 0.00s ( 0.00s elapsed) INIT time 0.00s ( 0.00s elapsed) MUT time 2.06s ( 1.19s elapsed) GC time 0.39s ( 0.39s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.45s ( 1.58s elapsed) %GC time 15.7% (24.9% elapsed) Alloc rate 1,151,990,234 bytes per MUT second Productivity 84.2% of total user, 130.2% of total elapsed The speedup is smaller than what I was expecting given that each unit of work (250 input lines) is completely independent from the others. Changing the size of each work unit did not help; garbage collection times are small enough that increasing the minimum heap size did not produce any speedup either. Is there anything else I can do to understand why the parallel map does not provide a significant speedup? Thanks, AB