Escape the scripter mentality if reliability matters

I'd like to present a problem in order to trigger the gut instinct of those reading it. Given the problem statement, and reading it linearly, how might you handle it?

Let's say I want you to get the fifth word of the fourth paragraph of the third column of the second page of the first edition of the local paper in a town to be determined. It's going to be a color, and we have a little deal with that paper to get our data plugged in every day. You can get the feed from their web site.

At this point, you're probably thinking of something like this: wget or curl, then either dump it in a temp file, or just go straight on and pipe it into some processing tools. Maybe you go for grep, cut, sed and friends. Perhaps you break out perl or python.

The point is simple enough: I asked you to get one piece of data, one time. You should probably put something simple together that'll run once and that's it. It's not worth the trouble to go much beyond that. It doesn't matter how much baling wire or duct tape it takes, since once I get my answer (orange, blue, lavender, mauve, teal?), I'm done.
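For the one-shot case, the whole thing really can be a handful of lines. Here's a quick Python sketch of that mindset (the page text is a stand-in for whatever the paper's feed actually returns; a real run would fetch it with curl or urllib first):

```python
# One-off extraction: fifth word of the fourth paragraph.
# The input here is a made-up stand-in for the paper's feed.
page = """First paragraph, which we skip.

Second paragraph, also skipped.

Third paragraph, still not it.

The answer is simply mauve today, apparently."""

paragraphs = page.split("\n\n")
word = paragraphs[3].split()[4]   # fourth paragraph, fifth word
print(word)                      # -> mauve
```

No error handling, no validation, nothing. For a one-shot run, that's fine.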

Okay, so, now that you're comfortable with that, I'm going to change it up to make a point.

Now I want you to be able to do this reliably every day for the next two years. I need this data on a regular delivery schedule and it can't rely on some human being there to constantly fine-tune things.

At this point, the gears in your head should start turning. Whatever you come up with, I hope it doesn't resemble the mass of duct-taped gunk you thought of before. This is a different problem with a completely different set of requirements, even though the very core is unchanged! I still need "5w, 4p, 3c, 2p, 1e", but now it has to hold up to all kinds of craziness that a year might bring.

If you think your shell script abomination is going to work flawlessly for 730 days (or 731 if we're talking about a leap year!), you have far more confidence than I do. You have to go beyond that and actually think about all of the corner cases.

What if there's no first edition that day? Maybe the town has a massive earthquake and they fail to put out a paper. Or perhaps they do manage to put one out, but it's the front side of a single sheet of paper. How about some more failure modes? That second page may exist, but perhaps it doesn't have a third column, or a fourth paragraph, or maybe that fourth paragraph doesn't have five words!

Short paragraphs happen!

Okay, what if all of those things do exist, but then the word you find isn't actually a color? Maybe it's "twenty-seven" or "127.0.0.1" or "0xF00FC7C8" or anything else. You might have found some data, but it's completely useless to me.

If handling this gracefully matters to you, a simple hack will not cut it. You're going to have to invest the time to do it right up front. If you don't, you run the risk of having to spend at least as much time after the fact cleaning up a mess and making excuses for your own disaster.
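What does "doing it right up front" look like? A rough sketch in Python, where every assumption the quick hack made is now checked explicitly (the color list and the paragraph-splitting convention are invented for illustration):

```python
# Defensive version: every level of "5w, 4p, 3c, 2p, 1e" that the
# quick hack assumed is checked, and a bogus result is rejected
# instead of being passed downstream. The color set is a made-up
# example of validating that the data means what it should.
KNOWN_COLORS = {"orange", "blue", "lavender", "mauve", "teal"}

def fifth_word_of_fourth_paragraph(page_text):
    """Return the color if everything checks out, else None."""
    if not page_text:
        return None                     # no edition at all that day
    paragraphs = [p for p in page_text.split("\n\n") if p.strip()]
    if len(paragraphs) < 4:
        return None                     # no fourth paragraph today
    words = paragraphs[3].split()
    if len(words) < 5:
        return None                     # short paragraphs happen!
    word = words[4].strip(".,!?").lower()
    if word not in KNOWN_COLORS:
        return None                     # "127.0.0.1" is not a color
    return word
```

A real version would also log *which* check failed and alert somebody, instead of silently returning nothing.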

By now you're probably wondering where this rant came from. The other night, I spotted a big chunk of code which was put forth as a demonstration of someone's abilities. It was a collection of shell scripts which had a ton of glue.

Script #1 was to be run on a distant host. It ran a data capture tool, piped the output through things like grep and sed to filter some of the details, and then sent it into a small network tool. Think of "hose" and "faucet", if you are familiar with them.

Script #2 then used the companion network tool to call over to script #1's machine. It then sent the whole mess through another bunch of grep and sed pipelines to filter it some more, and piped it into script #3.

Script #3 was a loop to read stdin and break it into chunks. Each chunk was then filtered through things like tail, awk, cat and sort, and then wound up in a series of temporary files.

Finally, script #4 contained a call out to some data processing tool which would build a graphical representation of the input. It actually created a FIFO, started that tool listening to one end, then ran its own loop and wrote to the other end. This ran until you aborted it.

The results look impressive enough. You can start this suite up on your test environment and start getting numbers back. They'll come back over the network and will show up on a local display. If that's all you want to do, and you want to do it right now, you're done! Yay!

The problem is what happens when this thing outlives its welcome. What if you need this day after day? Or what happens if it's supposed to run continuously? We're no longer talking about a series of things you can just start in a bunch of xterms or even screen sessions. It needs to graduate to a higher level of service.

At that point, it is no longer appropriate to handle this problem that way. For one thing, all of this parsing has to go right out the window. The data in question is already in a nice form on the source machine. Out there, values like 123 are in fact a single byte with the value of 123! They haven't been turned into a human-readable stream of characters which happens to include "foo bar 123 blah blah" somewhere.

What happens if this thing was written some time before October 1st? It might be parsing "3/14 12:17 foo bar 123 blah blah". It'll work fine... until October 10th rolls around, and then you have "10/10 12:17 foo bar 123 blah blah", and suddenly your data is offset by one character!
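Here's the fixed-column bug in miniature: a slice tuned against a September-era log line silently grabs garbage once the month needs two digits (the log lines are the made-up examples from above):

```python
# A slice that worked fine for months quietly breaks when the
# date field gets one character wider.
before = "3/14 12:17 foo bar 123 blah blah"
after  = "10/10 12:17 foo bar 123 blah blah"

print(before[19:22])   # -> "123", as intended
print(after[19:22])    # -> " 12", off by one, silently wrong
```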

"Oh, split on spaces and use fields instead of using raw columns" you might say. To that, I just say this: why even let it become ASCII in the first place if you're going to operate on it as data? Nicely-formatted lines of text are for humans. Let computers talk to each other in something that's not going to be (as) ambiguous, already.

Incidentally, there is also a small amount of software written in the last three months of any year which will break in the next couple of hours as December 31st rolls over to January 1st.

None of this should matter to you! If you are having to worry about better ways to parse data which is coming from a machine, then you have already lost. Instead of building better parsers, focus on finding a way to transport the data in its native form. Everyone will be happier for it.

This gets into a whole thing I call "scripter mentality". It seems like some people would rather call (say) tcpdump and parse the results instead of writing their own little program which uses libpcap. Calling tcpdump means you have to do the pipe, fork, dup2, exec, parse thing. Using libpcap means you just have to deal with a stream of data arriving that you'd have to chew on anyway.
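A libpcap example needs capture privileges to actually run, so here's the same mentality gap in miniature, with line counting standing in for packet capture (purely an analogy: wc-plus-parsing in place of tcpdump-plus-parsing, counting in the program in place of libpcap):

```python
import os
import subprocess
import tempfile

sample = "one\ntwo\nthree\n"
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write(sample)
    path = f.name

# Scripter mentality: spawn a tool, capture its text, parse it back
# into the number it started out as.
out = subprocess.run(["wc", "-l", path], capture_output=True, text=True)
parsed = int(out.stdout.split()[0])

# The other way: the data never becomes someone else's ASCII at all.
with open(path) as f:
    direct = sum(1 for _ in f)

os.unlink(path)
print(parsed, direct)   # -> 3 3
```

The first version has a child process, a pipe, and a parser that can rot; the second is just your own loop over the data you were going to handle anyway.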

I've read The Unix Philosophy. I even agree with most of it. But there is a time and a place for elaborate | series | of | pipelines, and building long-lived reliable server operations which can handle errors well without human intervention is not one of them.

I might be persuaded to look the other way on text-based communications if you can control both ends strictly and you aren't a bozo who will create an ambiguous grammar. Otherwise, stick to whatever kind of encoding floats your boat.

Or you can just sit there and work on parsers for the rest of your career. I know what I'd rather be doing.