
In my last post about HXT, I had gotten stuck on a performance problem in HXT that rendered it unusable. Since then, I’ve exchanged a number of emails with Uwe Schmidt, the maintainer. He found where the exponential blowup was happening in the regex engine and fixed it. With that fix, my spider ran a bit longer, but eventually failed after hitting the per-process limit on open file descriptors. I tried adding strictA in a couple of places in the code, but it did not resolve the resource leak. Uwe believes this is a bug in Network.HTTP, and suggested the a_use_curl option, which spawns an external curl program to do the fetching. While it sucks to spawn hundreds of processes for this task, it did fix the resource leak.
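For reference, enabling the curl-based fetcher is just another entry in the attribute list passed to readDocument. This is only a sketch of what my read options look like, assuming the string-pair option style of the HXT version I'm using:

```haskell
import Text.XML.HXT.Arrow

-- Read a page for link checking: parse as HTML, don't validate,
-- and fetch via an external curl process instead of Network.HTTP.
readPage :: String -> IOStateArrow s b XmlTree
readPage url =
    readDocument [ (a_parse_html, v_1)  -- treat input as (possibly sloppy) HTML
                 , (a_validate,   v_0)  -- no DTD validation
                 , (a_use_curl,   v_1)  -- spawn curl; works around the fd leak
                 ] url
```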

With those problems out of the way, I was able to focus on issues in my own program, like trying to validate JPG images as XML or trying to fetch mailto: links. I’m now reasonably happy with the program, which you can see in the HXT/Practical section of the Haskell wiki.
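The fix for both of those was to filter links before fetching them. The helper below is not from the actual program, just a sketch of the kind of predicate I mean: skip non-HTTP schemes like mailto:, and skip URLs whose extension marks them as binary content that shouldn't go anywhere near an XML parser.

```haskell
import Data.Char (toLower)
import Data.List (isPrefixOf, isSuffixOf)

-- Decide whether a discovered link should be fetched and parsed.
-- Hypothetical helper: skips mailto:/javascript: links and obvious
-- binary files (images, PDFs) that are not XML or HTML.
isCheckable :: String -> Bool
isCheckable uri =
       not (any (`isPrefixOf` lower) ["mailto:", "javascript:"])
    && not (any (`isSuffixOf` lower) [".jpg", ".jpeg", ".png", ".gif", ".pdf"])
  where
    lower = map toLower uri
```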

The major area where this could still be improved is parallelization. Verifying about 700 pages and links on my site takes 45 minutes, during which the program is doing actual work for only about 8; the rest is spent waiting on the network. It would definitely be a good exercise to learn more about the concurrency capabilities of Haskell, although the hidden system state in HXT makes me nervous about whether it’ll work at all. I’d probably want to do a couple of simpler exercises in concurrent programming first, before attempting to parallelize this one.
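If I do get there, the shape of the solution is probably a bounded worker pool: fork a thread per URL, but use a semaphore so only a handful of fetches run at once. Here is a minimal sketch using only base's Control.Concurrent; mapConcurrentlyBounded is a name I made up, and the real per-URL work (the HXT check) would replace the function argument:

```haskell
import Control.Concurrent
import Control.Concurrent.QSem
import Control.Exception (bracket_)

-- Run 'f' over every element of the list concurrently, with at most
-- 'n' actions in flight at a time, and return the results in order.
-- Note: a sketch only; an exception in 'f' would leave its MVar empty.
mapConcurrentlyBounded :: Int -> (a -> IO b) -> [a] -> IO [b]
mapConcurrentlyBounded n f xs = do
    sem  <- newQSem n
    vars <- mapM (\x -> do
                v <- newEmptyMVar
                _ <- forkIO $
                       bracket_ (waitQSem sem) (signalQSem sem)
                                (f x >>= putMVar v)
                return v) xs
    mapM takeMVar vars            -- block until every worker finishes
```

With something like this, the 45 minutes of mostly idle waiting should collapse toward the 8 minutes of real work, network bandwidth permitting.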