Plain Text and XML

A Conversation with Andy Hunt and Dave Thomas, Part X

by Bill Venners

May 5, 2003




Summary

Pragmatic Programmers Andy Hunt and Dave Thomas talk with Bill Venners about the value of storing persistent data in plain text and the ways they feel XML is being misused.

Andy Hunt and Dave Thomas are the Pragmatic Programmers, recognized internationally as experts in the development of high-quality software. Their best-selling book of software best practices, The Pragmatic Programmer: From Journeyman to Master (Addison-Wesley, 1999), is filled with practical advice on a wide range of software development issues. They also authored Programming Ruby: A Pragmatic Programmer's Guide (Addison-Wesley, 2000), and helped to write the now famous Agile Manifesto.

In this interview, which has been published in ten weekly installments, Andy Hunt and Dave Thomas discuss many aspects of software development.

Why Use Plain Text?

Bill Venners: In your book, The Pragmatic Programmer, you write, "We believe the best format for storing knowledge persistently is plain text." Why? What are the advantages? What are the costs?

Dave Thomas: Does it ever happen to you that someone sends you a Microsoft Word file?

Bill Venners: It happens all the time.

Dave Thomas: A Word file that you can't open?

Bill Venners: No, because I have Microsoft Word. One of the main reasons I have Word and Excel on my Macintosh is because people send me Word and Excel files all the time and I need to be able to open them.

Dave Thomas: Well that's funny, because I have Word on my Macintosh. I have the very latest Word, and yesterday I received a Word document that it won't open.

Andy Hunt: This problem also happens between Word 97 and later versions for Windows, not just between, say, Word 97 and the Macintosh version of Word.

Dave Thomas: The problem is, once we store data in a non-transparent, inaccessible format, then we need code to read it, and that code disappears. Code is disappearing all the time. You probably can't go to a store and ask for a copy of Word 1, or whatever the first version of Word was called. So we are losing vast quantities of information, because we can no longer read the files.

One of the reasons we advocate using plain text is so information doesn't get lost when the program goes away. Even though a program has gone away, you can still extract information from a plain text document. You may not be able to make the information look like the original program would, but you can get the information out. The process is made even easier if the format of the plain text file is self-describing, such that you have metadata inside the file that you can use to extract out the actual semantic meaning of the data in the file. XML is not a particularly good way to do this, but it's currently the plain text transmission medium du jour.

Another reason for using plain text is it allows you to write individual chunks of code that cooperate with each other. One of the classic examples of this is the Unix toolset: a set of small sharp tools that you can join together. You join them by feeding the plain text output of one into the plain text input of the next. There's no concept of trying to make sure the word count program outputs things in a format that's compatible with the next tool in the chain. It's just plain text to plain text, and that's a very powerful way to do it.
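The pipeline idea Dave describes can be sketched in Python (the two filter functions are invented purely for illustration): each filter reads and writes only lines of plain text, so they compose without sharing any format beyond that.

```python
def word_freq(text: str) -> str:
    """Filter 1: plain text in, one 'word count' line out per word."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return "\n".join(f"{word} {n}" for word, n in counts.items())

def top_n(text: str, n: int) -> str:
    """Filter 2: knows nothing about filter 1's internals; it just
    parses whitespace-separated lines of plain text."""
    rows = [line.rsplit(" ", 1) for line in text.splitlines()]
    rows.sort(key=lambda r: int(r[1]), reverse=True)
    return "\n".join(f"{word} {n}" for word, n in rows[:n])

# Join the two filters by feeding plain text output into plain text input.
print(top_n(word_freq("the cat sat on the mat the end"), 1))
```

Either filter could be replaced by a shell tool such as sort or uniq without touching the other, because the only contract between them is plain text.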

Andy Hunt: Virtually any program that's going to operate on text of some sort can operate on plain text as the lowest common denominator. Very often you get into a state where you want to work with some program, but its properties file has gotten corrupted such that the program won't even come up to let you change the property. If that file is in some binary format that needs the program itself to fix it, you're hosed. You've catch-22ed yourself right out of existence. If it's in a plain text format, you can go in with any generic tool—a text editor, whatever you like to use to deal with plain text—and fix the problem. So in terms of emergency recovery, or changes in the field, plain text is helpful. It provides another level of insurance.

Dave Thomas: Earlier in the interview (See Resources), I was talking about putting abstractions into code, specifics into metadata. We will be handing the programs we're writing today down to the next generation of programmers, and the ones after that. They will have to deal with this mess we've left behind. If we give them a load of gibberish consisting of binary data, they're going to have a harder time understanding it. If we give them nice plain text or XML files, it will be a lot easier to understand. Plain text will obviously require less mental energy to figure out.

Readable versus Understandable

Bill Venners: What is the distinction between human-readable and human-understandable data, and why is that distinction important?

Dave Thomas: I can give you a 128-bit cipher key as ASCII, and you can read it, but it may not make sense to you.

Andy Hunt: So it is readable, but not understandable.

Dave Thomas: I can give you the works of Shakespeare as a list of words sorted alphabetically. You could read it, but you couldn't make much sense of it.

Andy Hunt: The advantage of human understandable plain text is, suppose for historic reasons you've got a control file lying around, but there is no software still around that can understand it or do anything meaningful with it. You as a human may be able to read that file and understand enough to figure out whatever you're trying to extract from it. Or, suppose you've got some printouts from way back when sitting in a warehouse. You need to get some ancient piece of account information or figure out an algorithm from an old Cobol program. If you have printouts and nothing left that can possibly even read them, you can still read them yourself and extract some information.

Dave Thomas: Cobol provides a good example, I think. The Cobol fixed-length record has data in columns. You can print it out and actually see the columns of data lined up. One step better than that is CSV, comma-separated values, because in CSV you can put in a header that tells you what's in each column.

Bill Venners: So the CSV header, which basically lists comma-separated column names, is an example of self-describing data.

Dave Thomas: It's a very simple example of self-describing data. And you can import CSV into just about any program.

Bill Venners: So the advantage of self-describing data is that in the absence of the manual, in situations where I have to look at some data and figure it out, the metadata will help. The metadata isn't the whole manual. Maybe it's just words like "Customer ID," which can help me figure out what the columns are about.

Andy Hunt: Metadata helps us express our intent. In this case suppose the original programs are missing, and all you've got is this CSV file. The column names give you a hint. You might be able to figure this out from staring at raw data in the Cobol dump: "Well those look like customer IDs, or maybe those are zip codes, I'm not real sure." Metadata gives you that one level of added security, OK, the people who wrote this originally called this "Customer ID." Now I've got a hint to go on. I know what they meant by that. I can understand this. It all boils down to communication. You are communicating intent.
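Python's standard csv module shows how far that one header line goes (the column names and values here are hypothetical): a generic reader recovers named fields from the file alone, with no trace of the program that wrote it.

```python
import csv
import io

# A CSV file whose header row carries the metadata.
data = """customer_id,zip_code,balance
12345,75001,100.00
67890,02134,250.50
"""

# Because the header names each column, a generic tool can attach
# meaning to every field without the original program.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["customer_id"])  # the header, not the position, identifies the field
```

A reader staring at the raw numbers alone would have to guess whether 02134 is a zip code or an account number; the header answers that in one word.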

Parsing with Partial Knowledge

Bill Venners: There's something that's been bothering me about the way people have been raving about XML. One of the big claims is that because XML data is self-describing, with data wrapped by tags like <customerid>12345</customerid>, clients can figure things out even for documents that don't strictly adhere to their schema and specification. I hear claims that XML is more flexible, because providers of documents can be sloppy and just add new pieces of data here and there. Clients can just ignore tags they don't recognize and find data even if it is in the wrong place according to the schema. The Java class file is not XML, but, like XML, it is a data structure and file format. There is a detailed specification for the Java class file that describes all the data and semantics, and also clearly defines the way in which class files can be extended. Providers and consumers of Java class files adhere strictly to the specification. This approach of strict compliance with a specification and schema makes more sense to me. I like what you have said about self-describing data, but I'm concerned about the leap that some XML enthusiasts seem to make that because the data is self-describing, the way in which a particular schema can evolve doesn't have to be clearly specified or followed, because they assume clients will just ignore anything they don't understand.

You write in your book, "You can parse a plain text file with only partial knowledge of its format." How often do we lose the format specification, or is this more about not needing to "read the manual"—the specification—because the data is more user-friendly?

Dave Thomas: Oh no, it's not so you don't have to read the manual. It's that, if all you have is a pile of data, I'm sure you'd much rather have something in there that gives you some hints to the semantics, as well as just the data itself.

Andy Hunt: We mean using partial knowledge of the format in a forensic sense. You want to go back and dig out account numbers. If the data is tagged such that you can see which pieces of data are account numbers, it becomes a much easier job than just having to dig through a bunch of numbers.
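That forensic use can be sketched with Python's standard XML parser (the document and the customerid tag name are invented for the example): with no schema and unknown surrounding structure, we can still pull out the one kind of tag we recognize and ignore the rest.

```python
import xml.etree.ElementTree as ET

# A fragment whose schema is long gone; only some tag names are recognizable.
doc = """<ledger>
  <entry><customerid>12345</customerid><blob>opaque stuff</blob></entry>
  <entry><customerid>67890</customerid><extra>ignored</extra></entry>
</ledger>"""

root = ET.fromstring(doc)
# iter() walks the whole tree, so the tags we recognize are found
# no matter where the provider put them.
ids = [e.text for e in root.iter("customerid")]
print(ids)
```

The same extraction against an undocumented binary dump would mean reverse engineering record layouts first; here the tag names do that work.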

Bill Venners: So the metadata makes the data itself more programmer-friendly. I don't have to go to the manual. It's like there's a miniature, really terse manual in the data itself.

Dave Thomas: Yes, and I think you're also assuming there's a manual.

Bill Venners: Well, that's part of what I'm asking. How often is there no manual?

Dave Thomas: Most of the time there is no manual. If I give you a Word 1 file, where's the manual? If I ship you the output of my stock controller system, where's the manual? If I'm gone, if my program's gone, what are you going to do with that file? There are terabytes of data sitting around in an unusable state, because the software that reads them is gone. Yes, you could probably sit there and reverse engineer it, but it would be a whole lot easier to reverse engineer it if it were self-describing.

Misuses of XML

Dave Thomas: Now, can I just have a little rant?

Bill Venners: Sure.

Dave Thomas: XML sucks.

Bill Venners: Why?

Dave Thomas: XML sucks because it's being used wrongly. It is being used by people who view it as an encapsulation of semantics and data, and it's not. XML is purely a way of structuring files, and as such, really doesn't add much to the overall picture. XML came from a document preparation tradition: first there was GML, a document preparation system, then SGML, a document preparation system, then HTML, a document preparation system, and now XML. All were designed as ways humans could structure documents. Now we've gotten to the point where XML has become so obscure and so complex to write that it can no longer be written by people. If you talk to people at Sun about their libraries that generate XML, they say humans cannot read this. It's not designed for human consumption. Yet we're carrying around all the baggage that's in there, because it's designed for humans to read. So XML is a remarkably inefficient encoding system. It's a remarkably difficult-to-use encoding system, considering what it does. And yet it's become the lingua franca for talking between applications, and that strikes me as crazy.

Andy Hunt: It's sort of become the worst of both worlds.

Bill Venners: Actually, that was one of my last questions I was going to ask: Do you consider XML plain text? Could you elaborate on what you said about how people view XML?

Dave Thomas: People think, "Once I've got my data in XML that's all I've got to do. I've now got self-describing data," but the reality is they don't. They're just assuming that the tags that are in there somehow give people all the information they need to be able to deal with the data. Now, for some things there are standards: RSS and RDF, for example, give you very simple ways of describing web page content. But a random XML file, especially a machine-generated one, can be as obscure as binary data.

Bill Venners: Yeah, I find Ant build files, which are XML, very hard to read.

Andy Hunt: Ant is actually a really good example, because in that case you're using XML as a user-specified input language, which is really inappropriate in that context. I'd much rather have something...

Bill Venners: A context-free grammar, something that's more readable.

Andy Hunt: Yeah, a genuine grammar. I want to be able to type something simple and easy for me. I don't care if it's easy for the tool to parse, that's the tool's problem. I want it to be easy for me to write. And in cases like that, it's really the case of the programmer saying, "Oh look, here's an XML parser. I can just take XML files. That's easier." So one programmer in one context puts a burden on the other 100,000 programmers trying to use it.

Bill Venners: Well, I think some people may be more comfortable reading XML than others. What I've found is that I can usually read XML files just fine if they are small and simple. For example, a couple years ago I pulled certain metadata about each Artima.com web page into external files that I keep separate from the raw HTML files. The metadata file has information like title, subtitle, publication date, author, and so on. As part of my build system, I wrote a "page pumper" program that takes one metadata file and one raw HTML file as input and generates the pretty HTML file you see on the web as output. When I want to make global changes to the look and feel of Artima.com, I just change page pumper and do a build all.

In the old days, what I would have done to create that metadata file was whip up a quick context-free grammar with tools like Lex and YACC, and use that for the grammar of the metadata file. But given that XML was all the rage back then, I wanted as a consultant to get some experience with XML, so I used XML for the metadata file. And XML has worked just fine in that situation. I can easily edit the web page metadata files by hand and easily read them, even though they are XML, because they are small and simple. But I've often been frustrated staring at even moderate-sized Ant build files trying to decipher them, and staring at the Ant documentation trying to figure out how to do something that I think should be simple and obvious.
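Bill's actual page pumper isn't shown, but a minimal sketch of the idea, with invented tag names and an invented template, might look like this in Python: small, hand-editable XML metadata merged with raw HTML to produce the published page.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-page metadata file, small and simple enough to edit by hand.
metadata = """<page>
  <title>Plain Text and XML</title>
  <author>Bill Venners</author>
</page>"""

raw_html = "<p>Article body goes here.</p>"

# Site-wide look and feel lives in one template; changing it and
# rebuilding updates every page.
TEMPLATE = """<html><head><title>{title}</title></head>
<body><h1>{title}</h1><address>{author}</address>
{body}</body></html>"""

meta = ET.fromstring(metadata)
page = TEMPLATE.format(
    title=meta.findtext("title"),
    author=meta.findtext("author"),
    body=raw_html,
)
print(page)
```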

Dave Thomas: If you're talking about using XML in certain domains, it's fine. XSLT, for example, lets you do some really fun things with XML. When we had our book online, for example, we went from LaTeX, to XML, and then to the output format, simply because XSLT gave us some really powerful ways for manipulating the document's content. XML is useful in appropriate contexts, but it is being grossly abused in most of the ways it is being used today.

Next Week

Come back Monday, May 12 for Part I of a conversation with Elliotte Rusty Harold. If you'd like to receive a brief weekly email announcing new articles at Artima.com, please subscribe to the Artima Newsletter.

Talk Back!

Have an opinion about plain text, self-describing data, or the ways XML is being used? Discuss this article in the News & Ideas Forum topic, Plain Text and XML.

Resources

Dave Thomas talks about putting abstractions into code, details into metadata in Part IV of this interview, Abstraction and Detail:

http://www.artima.com/intv/metadata.html

Andy Hunt and Dave Thomas are authors of The Pragmatic Programmer, which is available on Amazon.com at:

http://www.amazon.com/exec/obidos/ASIN/020161622X/



The Pragmatic Programmer's home page is here:

http://www.pragmaticprogrammer.com/