Charles G. pointed out in an email discussion recently:

Lisp still doesn't seem like the right language for doing text manipulation, and nothing I've seen from the Emacs libraries is making me think any differently. It sure beats the hell out of Java though. Maybe someday someone will write Emacs using Ruby as the embedded interpreter...

These are all great points. I know exactly how he feels. I know soooo exactly how Charles feels that I decided to write a blog instead of an email reply. Because all the things he's brought up are real, bona-fide problems.

Lisp for Text Processing

Let's start by considering the basic problem: how good is Lisp for text processing? That turns out to be a complicated question.

When we think of "text processing", most of us usually think immediately of regular expressions. Unless "we" are C++ programmers, in which case "we" like to write 2500-line clones of the Unix 'grep' utility whenever "we" need to do a text search — or so it would seem, from the candidates I've interviewed in the past few months. But I think it's safe to say that most programmers equate text processing with regular expressions.

Regexps are obviously quite useful. If you aren't extremely proficient with regular expressions right now, then you should drop everything and go become proficient with them. I bet I use regular expressions on 350 days out of every year: in my editor, on the command-line, in my code — anywhere that using one would save me time or make my code clearer. Oh, how it hurts to think about all the so-called "programmers" out there who don't know how to use regexps. Argh. Let's just drop it.

However, I read somewhere that Lispers have always been a bit skeptical of regular expressions, because regexps are actually a bit weak compared to the generalized processing you can do on tree structures. Lisp folks ask: why are you storing your data as text in the first place? (As opposed to storing it as Lisp.)

I dunno about you, but the first response that comes to my mind is: "duh, what about logs?" I remember thinking: gosh, what a bunch of losers; the Lisp folks don't even know that the logs of virtually all systems are one-line entries that are most easily parsed (by far) using regexps.

Yeah, what about logs. Those dumb Lispers. Jerks. Losers. I certainly had them there.

Then I noticed, no more than three weeks ago, that in Java 1.5, my java.util.logging output had quietly turned into XML. D'oh!!! Regexps suck for processing XML. If you don't know why, please don't tell me that you don't know, or I will hate you. Better to keep silent.

Why, then, are logs switching to XML output? Well, er, ah, because XML offers more powerful and generalized text-processing capabilities than one-line log entries. I suppose. I actually haven't quite gotten used to the new XML output format, but I'm giving it a go and trying to learn to like it. It's quite verbose, which in some cases is good, and in others, maybe not so good.

Case in point: Java stack traces in the logs have each individual stack frame entry wrapped in its own XML element. The stack traces are already long, but this makes them sort of crazy. Well, you be the judge. Would you rather have your log entries look like this:

Feb 21, 2005 6:57:39 PM java.util.logging.LogManager$RootLogger log
SEVERE: A very very bad thing has happened!
java.lang.Exception
        at logtest.main(logtest.java:24)



Or like this:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE log SYSTEM "logger.dtd">
<log>
<record>
  <date>2005-02-21T18:57:39</date>
  <millis>1109041059800</millis>
  <sequence>1</sequence>
  <logger></logger>
  <level>SEVERE</level>
  <class>java.util.logging.LogManager$RootLogger</class>
  <method>log</method>
  <thread>10</thread>
  <message>A very very bad thing has happened!</message>
  <exception>
    <message>java.lang.Exception</message>
    <frame>
      <class>logtest</class>
      <method>main</method>
      <line>30</line>
    </frame>
  </exception>
</record>
</log>

I guess it kinda depends. If your log only has a few entries, or you're just doing quick-and-dirty searches, regular expressions might be sufficient. But with lots of entries, XML (even though it's five times as verbose) becomes a really powerful tool.
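For the quick-and-dirty case, a minimal Python sketch of regexp-based parsing might look like the following. It assumes the two-line header format of the plain-text entry shown above; the field names (when, cls, method, level, message) and the parse_entry helper are made up for illustration.

```python
import re

# The two header lines of the plain-text entry shown above.
ENTRY = """Feb 21, 2005 6:57:39 PM java.util.logging.LogManager$RootLogger log
SEVERE: A very very bad thing has happened!"""

# Line 1: timestamp, logger class, method name.
HEADER = re.compile(
    r"^(?P<when>\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<cls>\S+) (?P<method>\S+)$"
)
# Line 2: level and free-form message.
BODY = re.compile(r"^(?P<level>[A-Z]+): (?P<message>.*)$")


def parse_entry(text):
    """Return a dict of fields, or None if the text does not match."""
    lines = text.splitlines()
    if len(lines) < 2:
        return None
    head = HEADER.match(lines[0])
    body = BODY.match(lines[1])
    if not (head and body):
        return None
    return {**head.groupdict(), **body.groupdict()}


entry = parse_entry(ENTRY)
print(entry["level"], entry["cls"], entry["message"])
```

This works fine for the header, but notice that it says nothing about the stack trace that follows; handling a variable number of frame lines is where the auxiliary scripting starts to pile up.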

For instance, you can do XPath expressions on XML — they're sort of like regular expressions, but they understand the XML tree structure: something that no regexp, no matter how fancy, will ever be able to do. With a one-line XPath expression, you could (for instance) select all of the log entries that have a stack trace containing a particular Java class (or a set of classes). Trying to do that reliably with regexps will take you time, patience, and a lot of auxiliary scripting. With XPath it's a snap.
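To make that concrete, here is a rough sketch using Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. The second record and the records_with_class helper are invented for the example. A full XPath 1.0 engine should be able to do the same job with a single expression along the lines of //record[exception/frame/class='logtest'].

```python
import xml.etree.ElementTree as ET

# A trimmed version of the XML log above, plus a second, hypothetical record
# with no stack trace, so the query has something to filter out.
LOG_XML = """<log>
  <record>
    <sequence>1</sequence>
    <level>SEVERE</level>
    <message>A very very bad thing has happened!</message>
    <exception>
      <message>java.lang.Exception</message>
      <frame><class>logtest</class><method>main</method><line>30</line></frame>
    </exception>
  </record>
  <record>
    <sequence>2</sequence>
    <level>INFO</level>
    <message>Nothing to see here</message>
  </record>
</log>"""


def records_with_class(root, class_name):
    """Return <record> elements whose stack trace mentions class_name."""
    # The predicate [class='...'] keeps only frames whose <class> child
    # has exactly that text.
    path = f"exception/frame[class='{class_name}']"
    return [rec for rec in root.iter("record") if rec.find(path) is not None]


root = ET.fromstring(LOG_XML)
hits = records_with_class(root, "logtest")
print([rec.findtext("sequence") for rec in hits])
```

The point is that the query talks about records, exceptions, and frames, not about character positions, so it does not care how the entry happens to be wrapped across lines.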

(Incidentally, if you're not already extremely proficient with XPath, I suggest you drop everything and go become proficient with it. Path expressions are becoming quite popular, and XPath is leading the pack. They're very powerful. If you don't know how to use XPath, you will wind up reinventing it badly in your XML-processing code.)

XML data also lets you use XSLT transforms (or XQuery, if you're hardcore and perhaps slightly crazy), or you can simply use your favorite SAX or DOM parser in your favorite language, and quickly do all sorts of remarkable things that would be extremely clunky using regular expressions. Clunky, nothing — you'd actually just be writing your own ad-hoc XML parser in each script. You just don't want to go there.

So XML is pretty nice. And that sort of validates what the Lisp people were saying all along, which is that you want even your "simple" text data to be tree-structured. In Lisp, the equivalent log output might look very similar to the XML:

(log
  '(record
    (date "2005-02-21T18:57:39")
    (millis 1109041059800)
    (sequence 1)
    (logger nil)
    (level 'SEVERE)
    (class "java.util.logging.LogManager$RootLogger")
    (method 'log)
    (thread 10)
    (message "A very very bad thing has happened!")
    (exception
      (message "java.lang.Exception")
      (frame
        (class "logtest")
        (method 'main)
        (line 30)))))



Well... similar, except for being ten times cleaner and easier to read. It still has all the same metadata that the XML gives you, and you can still process it using tools that are just as powerful, if not more so.

You could even trivially convert it to XML and use XSLT, if you were silly enough. But Lisp is directly executable, so you could simply make the tag names functions that automatically transform themselves. It'd be a lot easier than using XSLT, and less than a tenth the size.
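Here is a rough sketch of the tag-names-as-functions idea. It is written in Python rather than Lisp, so it needs its own tiny s-expression reader (in Lisp the reader comes for free), and every name in it (read, transform, TRANSFORMERS, the line override) is hypothetical. It handles only the subset of syntax used in the log example above: quote marks are simply dropped, and atoms stay as strings.

```python
import re
from xml.sax.saxutils import escape

LOG_SEXP = """
(record (level 'SEVERE)
        (message "A very very bad thing has happened!")
        (exception (message "java.lang.Exception")
                   (frame (class "logtest") (method 'main) (line 30))))
"""

# One token: a paren, a double-quoted string, or a bare atom.
# Leading whitespace and quote marks are skipped.
TOKEN = re.compile(r"""[\s']*(?:([()])|"([^"]*)"|([^\s()'"]+))""")


def read(text):
    """Parse one s-expression into nested lists; atoms stay as strings."""
    stack = [[]]
    for paren, string, atom in TOKEN.findall(text):
        if paren == "(":
            stack.append([])
        elif paren == ")":
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(atom or string)
    return stack[0][0]


def element(tag, children):
    """Default transformer: render the node as an XML element."""
    return f"<{tag}>{''.join(children)}</{tag}>"


def line(tag, children):
    """Hypothetical override: render the line number as an attribute."""
    return f'<line number="{children[0]}"/>'


# Tag name -> function. Any tag without an entry uses the default.
TRANSFORMERS = {"line": line}


def transform(node):
    """Evaluate a node by calling the function registered for its tag."""
    if isinstance(node, str):
        return escape(node)
    tag, *args = node
    handler = TRANSFORMERS.get(tag, element)
    return handler(tag, [transform(arg) for arg in args])


print(transform(read(LOG_SEXP)))
```

In a Lisp, the dispatch table and the reader both disappear: defining a function named frame or line is the whole job, and evaluating the log entry runs the transformation.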

And for your XPath queries, well, there are mature Common Lisp packages that support them directly on both XML and Lisp data. The same is true for Scheme.

I don't care how fancy your language is (C++, Ruby, Python, Java, Perl, whatever): even if it supports doing XPath queries on the syntax trees for source code in that language, which is unlikely, I doubt very much that you'd want to do it. Have you ever looked at the ANTLR or JavaCC grammar for Java or C++? And the grammars for Python and Ruby are almost as complex. A query language can't hide that kind of complexity. It will always be more work to process the source code programmatically in syntactically complex languages.

The Text-Processing Conundrum

So everyone in the world except for the Lisp community is pretty much stuck with the same fundamental text-processing problem, which I'll summarize:

You want to be able to store and process text data. Doing this effectively requires your data to be tree-structured. Regexps don't cut it for any data or processing that's sufficiently complex. Your only real option these days, for most languages, is to use XML. It has all the best tools and the widest support for your language. XML processing, which is supposed to be easy, starts to become arbitrarily complex when you start having to use XSLT or XQuery, or roll your own transformations using a SAX or DOM parser in your favorite language. But those are your only options.

In Lisp, your code is data, and your data is code, so you have a third option (aside from regexps or XML) that's not realistically an option in any other language: you store your text data as a lisp program.

If you simply want to scan it visually, well, you can see for yourself in my example above, it's easier on the eyes than XML. It's also more compact, which is easier on the disks, networks, databases, and IDEs.

If you want to query it, you load it in and use Lisp functions, which now include various flavors of path expressions, including XPath, if you like.
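As a stand-in for what such a query might look like, here is a small Python sketch over nested lists that mirror the Lisp structure of the log example. The select and children helpers are hypothetical, and the path syntax is just slash-separated child steps, far less than real XPath.

```python
# The log entry from the Lisp example, trimmed, as nested Python lists.
# The first item of each list is the tag; the rest are its contents.
LOG = ["log",
       ["record",
        ["level", "SEVERE"],
        ["message", "A very very bad thing has happened!"],
        ["exception",
         ["message", "java.lang.Exception"],
         ["frame", ["class", "logtest"], ["method", "main"], ["line", 30]]]]]


def children(node):
    """Sub-lists of a node; bare atoms such as strings are not children."""
    return [child for child in node[1:] if isinstance(child, list)]


def select(node, path):
    """Return every node reached by following the tag names in path."""
    matches = [node]
    for tag in path.split("/"):
        matches = [child
                   for match in matches
                   for child in children(match)
                   if child[0] == tag]
    return matches


print(select(LOG, "record/exception/frame/class"))
print([node[1] for node in select(LOG, "record/exception/frame/line")])
```

Because the data is already a tree, the whole query engine is one loop. There is no parsing step to write at all.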

And if you want to transform it, well, you can write your own transformer, of course, but it's probably easier to make the actual code know how to transform itself. In any case, your transformers will be easier to write, since they have all the benefits of XSLT (i.e. transformers can themselves be auto-generated and auto-transformed, breaking things into nice stages), without all the downsides of XSLT (ugliness, surliness, no fun at parties, etc.)

Beyond Logs

Of course we're not just talking about log data. The situation is even clearer for configuration files. You definitely want them in XML, except it has the same problems, so... Hey, wait a minute — if your configuration file is... Lisp, then it's not really a... configuration file anymore; it's part of your... program? Is that right?

Um, yep. You got it.

The whole nasty "configuration" problem becomes incredibly more convenient in the Lisp world. No more stanza files, apache-config, .properties files, XML configuration files, Makefiles — all those lame, crappy, half-language creatures that you wish were executable, or at least loaded directly into your program without specialized processing. I know, I know — everyone raves about the power of separating your code and your data. That's because they're using languages that simply can't do a good job of representing data as code. But it's what you really want, or all the creepy half-languages wouldn't all evolve towards being Turing-complete, would they?
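You can get a taste of this in any language that can evaluate its own source. Below is a minimal Python sketch in which the "configuration file" is just code; the setting names, the path, and the load_config helper are all hypothetical, and a real setup would read the text from disk.

```python
# Hypothetical configuration "file"; in practice this text would live on disk.
CONFIG_SOURCE = '''
workers = 4
queue_size = workers * 256          # derived values need no special syntax
log_dir = "/var/log/myapp"          # hypothetical path
log_file = f"{log_dir}/server.log"

def retry_delay(attempt):
    """Settings can be functions: exponential backoff capped at 60 seconds."""
    return min(60, 2 ** attempt)
'''


def load_config(source):
    """Execute the config text and return the names it defines."""
    namespace = {}
    exec(source, namespace)
    # exec adds __builtins__ to the namespace; leave it out of the result.
    return {k: v for k, v in namespace.items() if not k.startswith("__")}


config = load_config(CONFIG_SOURCE)
print(config["queue_size"], config["log_file"], config["retry_delay"](3))
```

The obvious catch is that executing a file means trusting it, so this only makes sense for configuration you control.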

In fact, if you insist on code/data separation and you're an advocate of OOP, then you're talking out of both sides of your mouth. If your gut reaction to having log entries know how to transform or process themselves is "woah, that's just wrong", think again: you're imposing a world-view on the problem that's not consistent with your notions of data encapsulation and active objects. This world-view dates back to ancient Unix and pre-Unix days. But if you think about it, there's no reason log entries or config files shouldn't be executable and subclassable. It might be better.

And what about, oh, web pages? Or word-processor documents? Well, you figure it out. Web pages use HTML, which isn't even powerful enough to represent text styles, let alone something like an event handler. So Web pages have CSS, and JavaScript, and all this other hooey. It's become so ugly that people don't really write web pages anymore, not for production stuff. Nowadays people treat the morass of ancient, crufty Web technologies as a sort of assembly language. You write code to assemble your pages piecewise using PHP or XML/XSLT or Perl/Mason or Java/JSP or perhaps all of them in a giant ugly pipeline, which "compiles" down to an unreadable Web page format. Talk about fun!

I can tell you in all honesty: everyone who tries this feels pain. And there are a lot of people in the world doing exactly what I described above. Building production websites == pain. The world is gradually, very slowly, converging towards using a variety of "executable XML" formats (e.g. Ant, Jelly, Cocoon) which... well, they sort of ease the pain, but it's replaced with new pain: the pain of the executable-XML language designers not having a frigging clue what they're doing.

So now Ant has a macro system, and try/catch tags, and if-tags, and it's gradually migrating towards Turing-completeness if it's not there already. But it still has all the same yucky problems it's had from day one: properties that look like variables that you can only set once, and weird inconsistencies in the way the tags work, and of course the fact that it's automatically 10x as verbose as a programming language because it's XML. Don't get me wrong — it's still way better than Make. But that's not a very high bar now, is it?

Let's face it: a Turing-complete Ant (or Jelly, or any pure-XML processing framework) is going to be a monstrosity, because it will take years (if not decades) for them to figure out that Turing-completeness does not equal expressiveness, and they'll have to add lexical scoping, and data types, and a class system, and first-class functions, and and and...

And in the end, it'll still be XML.

The Shaggy Cow

How did I get so far off the original track of text processing? Well, that's the punch line of this shaggy-dog story: it's all text processing! Log files, configuration files, XML data, query strings, mini-languages, programming languages, transformers, web pages, word documents, everything... the vast majority of your programming work involves text processing somehow.

What would you rather do? Learn 16 different languages and frameworks in order to do "simple" log-file and configuration-file processing? Or just buckle down, learn Lisp, and have all of these problems go away forever?

It's a rhetorical question. The answer is patently obvious at this point: Lisp is evil, and you'd damned well better write all your code in C++ and XML and JavaScript and PL*SQL and CSS and XSLT and regular expressions and all those other God-fearing red-blooded manly patriotic all-American languages from now on. No more of this crazy Lisp talk, ya hear?

Welcome to my life. I'm the cow in the Gary Larson comic, the one who looks up, shocked, and says: "Hey, wait a minute! This is grass! We've been eating grass." The other cows stare blankly, munching the grass.

Actually, I do feel like that cow, but I also feel a bit like one of the characters in Salman Rushdie's Midnight's Children. (It's one of the most amazing fictional works ever written, and if you haven't read it, you're missing out.) There's a character who can travel forward and backward through time, so he can of course see the future. The funny thing is: all the other children, even though they know he can see the future, refuse to believe anything he says about it.

Yes, you're probably boiling over with objections to my little discussion above. You think I'm trivializing things, or you think that perhaps I'm overstating the importance of tree-structured data (perhaps you're not an XML fan), or maybe you're simply mad at me for reasons you can't really articulate, other than to say, vaguely, that I appear to have "Paul Graham-itis". I understand how you feel.

Being able to see the future is actually rather unpleasant.

What About Emacs?

Putting all ranting aside for the moment, let's talk about Charles' second concern: wouldn't it be better if Emacs were written in Ruby?

After all, Emacs is designed for manipulating any old kind of text, not just tree-structured text like XML or Lisp. And Charles was right on the mark when he said nothing in the Emacs libraries indicates that Emacs-Lisp is particularly good for plain-vanilla text manipulation. It's missing a lot of features we've become accustomed to. Perl has raised the bar on ordinary/arbitrary string processing.

Although a Ruby-based Emacs would probably be quite nice in some respects, I now think (even liking Ruby as much as I do) a Common Lisp Emacs would be even nicer. I don't want to belabor it, because if you agree with me then you need no convincing, and if you don't, then you probably cannot be convinced in any reasonable amount of time. The summary is that Lisp has intrinsic, un-beatable technical advantages stemming from its s-expression structure, and Common Lisp has 20+ years of maturity that give it far more stability, performance, and interoperability than Ruby or Python will have for a long, long time (if ever.)

Well then, why don't they just convert Emacs to Common Lisp?

That's the rub: Emacs Lisp is even older than Common Lisp, and it's got some unfortunate incompatibilities with Common Lisp (and Scheme even more so) that make porting forward so nontrivial as to be nearing a complete rewrite.

Since Emacs is so ancient, there are millions of lines of fairly well-debugged elisp code out there; it's one of the original and longest-lived open-source applications, so you'd have an absolutely huge task in trying to re-implement all of it. Most people who try this wind up trying to create a "compatibility-mode" for old elisp code. Guile Emacs, JEmacs and a few Common Lisp editors all attempt this, and none yet have succeeded at doing it well.

The other option is to just live with Emacs, since it is still Lisp, and even has a fairly comprehensive set of macros that give a large subset of Common Lisp's functionality. So it's generally easier to hack Emacs to interoperate with your language (or any system, in fact) than it is to try to re-implement Emacs.

None of this would really be a big deal if people could just go in and hack on the Emacs source code and "fix things". For instance, I'd love to add Perl5-compatible regular expressions, and a reader-macro system to allow for raw strings (or at least hack in some syntax to support regexps without having to double-escape everything).
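To show what the double-escaping complaint is about, here is a small Python illustration. Python stands in for elisp here: its ordinary string literals have the same backslash problem, and its raw strings are roughly the kind of syntax I'm wishing for. The pattern itself is just an example I made up to match a Java stack frame line.

```python
import re

# Without raw strings, every regex backslash must itself be escaped
# inside the string literal.
escaped = "\\bat (\\w+)\\.(\\w+)\\((\\w+\\.java):(\\d+)\\)"

# With a raw string literal, the pattern is written once, exactly as the
# regex engine will see it.
raw = r"\bat (\w+)\.(\w+)\((\w+\.java):(\d+)\)"

# Both literals denote the same pattern text.
assert escaped == raw

match = re.search(raw, "at logtest.main(logtest.java:24)")
print(match.groups())
```

Same regex, half the backslashes, and you can actually read the second one.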

But there are several blocking issues. One is that the Emacs folks are notoriously picky about contributions — you have to provide legal paperwork saying the work is your own, that the FSF can use it, etc. It's the basic problem that led to Eric Raymond's famous "The Cathedral and the Bazaar" essay — GNU Emacs is the archetypal Cathedral. So: good luck getting your changes into Emacs. The Lucid folks tried for a while, and ultimately forked the code base to produce XEmacs, which is a famously bad situation.

The difficulty of contributing extends beyond the core binary. If you wanted to contribute, say, a pure-elisp String library (which Emacs could really use), or a collections package, I'm not sure you could pull it off. You'd have to get it by RMS, and it seems fairly daunting. RMS is, well, conservative — to put it mildly. I think he's a superhero, but he doesn't make it easy to contribute to Emacs.

Even if contributing weren't such a hassle, it's not entirely clear to people that Emacs is worth hacking on. It's missing many of the core rendering-engine features that would make it capable of doing, say, a Web Browser. Getting it to the point where it could render PostScript seems an impossible task.

And many newer programmers aren't using Emacs at all; they've been lured away by the siren-song of IDEs like Eclipse, IntelliJ, Visual Studio, and so on. Emacs doesn't have a very pretty face (because of the simple rendering engine I mentioned above), and it certainly doesn't have much marketing. Most programmers these days are quite astonished that anyone would actually still use Emacs. If they realized how much functionality it has, and how powerful its extensibility model is, they'd be ten times as astonished. It's got things that Eclipse may never have, not in a hundred years, and that's no exaggeration. If they tried hard enough, they would eventually wind up rewriting most of it in Lisp, which would be rather ironic, all in all.

The Emacs Problem

So! The situation is best described as a "dilemma". Emacs isn't really advancing, and it appears to be a lot of work to re-implement it in another language. (Not that people like the Guile folks aren't trying, but it's still taking forever). And a bunch of less powerful languages are in vogue; in fact it appears that a language's popularity is very nearly inversely proportional to its power these days. People just don't realize what they're missing.

So yeah. Charles is right. Lisp doesn't seem like the right language for doing text processing, at least if you take the fairly narrow view that text processing means syntax for regexps (and string interpolation, and miscellaneous other Perl-isms), and you're only looking at the libraries bundled with GNU Emacs.

And it does sure still beat the hell out of Java — at least for, uh, creating dynamically-modifiable IDEs. I won't climb too far up that soapbox, lest I get lynched.

And maybe, just maybe, someone will succeed in the gargantuan effort to create a usable replacement for Emacs in some other high-level language. Probably not any time soon, I'd bet.

So for now, I use lots of tools: Perl/Python/Ruby for scripts, Java or C for production apps (perhaps with embedded languages where appropriate!), XML for a lot of my data; I even use Eclipse for some things. And Emacs is a great general-purpose extensible editor and programming environment, especially if you make the effort to master it. But anyone can see that the whole situation (taking into account Web programming, let's not forget) could be a lot better.

This is a hard problem.

(Published Feb 22, 2005)

Comments

Talking of Lisp, XML, and data-is-code-is-data, have you seen this?

Looks like an interesting alternative to SAX/DOM approaches. It uses the reflection-layer of the Common Lisp Object System to transform XML elements to and from Lisp objects on the fly.

I suspect the Lisp hardcore would say 'Pah! Just use sexprs — that's what they were intended for'. But this still looks like an interesting glue layer.

Posted by: Chris N. at February 22, 2005 04:25 PM

And talking of using Common Lisp to rewrite Emacs, have you taken a look at Hemlock, which comes with CMUCL?

It's still not Emacs. Its incompleteness does a fair job of proving your point about how hard it would be to completely rebuild all of the Emacs utilities in CL, but it *is* kinda nifty.

Posted by: Brian W. at February 22, 2005 07:54 PM

We used XML/XPath/XSLT for a project to scrape prices from vendor web sites a couple years ago. It was fun in a, "I can't believe we're making this Elephant dance" kind of way. I can't imagine actually wanting to use that technology for something I wasn't planning to throw away, however.

XSLT isn't a real language (or it wasn't then). Xalan's XSLT extensions allowed us to embed Javascript, which we needed for ... regular expressions! XPath worked great for finding the nodes we wanted, but then we wanted to find specific text in the nodes and do some simple cleanup. So, now we had XSLT with Javascript, being executed by a Java program running Xalan's XML libraries. I just went back and read through the code again, and it's just as ugly as I remembered it being. Besides the fact that it's using Xalan extensions, so it isn't really portable anymore.

I think you've hit on the fundamental flaw. They are trying to build another lisp, but this one is uglier, way more verbose and not the slightest bit symmetric. I mean, they have good intentions: XSLT is just executable XML, but there isn't any way to extend it, short of vendor specific extensions. I really like the idea of having configuration files in Lisp. That would make parsing them almost trivial, although I think you would still want standard "Perl'ish" regular expressions for the verbose bits of data at the leaves of the tree. That would be easy to add in a real language though! Not to mention the ability, as you say, to just define functions with the same names as your nodes. You could even use macros to parameterize them for different sorts of transformations.

Sounds great, when do we start?

Posted by: Charles G. at February 22, 2005 08:50 PM

> Sounds great, when do we start?

Just as sooooon as I find a Lisp that doesn't cause nausea, diarrhea and severe stomach cramps.

Not entirely clear that such a thing exists, but I'm working on it. I keep having to wait for the symptoms to disappear before trying the next one.

Seriously, though: before trying anything with Lisp, I have a big checklist of criteria that it needs to pass. (Or at least provide some sort of "out" so you can fix it yourself, e.g. by bridging to C libraries). Stuff like decent concurrency support, asynchronous I/O, good tools and cross-platform support, etc. etc.

I really wish someone else had done this evaluation already. I only have so much free time for it. But I've found a few useful links; Pascal Costanza's Highly Opinionated Guide to Lisp is one (in no small part because of all the links it has at the bottom). And Cliki is kind of useful.

But I haven't yet found an evaluation of any Lisp's suitability for what we think of as "production" work. Feel free to help! Maybe we need to start a Wiki on it. Call it Blub, though, so nobody gets scared. :-)

Anyway...

> Sounds great, when do we start?

Short answer: not any time soon, I think. Java and C++ are still the king and queen, respectively.

(Note: C++ programmers who read that are thinking "hey, why do we have to be the queen?". That's why. :-)

Posted by: Steve Yegge at February 22, 2005 10:54 PM

Brian: Yeah, Hemlock looks interesting. Actually, so do Guile Emacs and JEmacs.

JEmacs could be really cool if it ever got finished. It's mostly a proof-of-concept, and Per Bothner isn't working actively on it anymore. He wants someone to take over the development on it. The code is clean but complex — Kawa (unlike pretty much all other JVM "scripting" languages) goes to great lengths to ensure static typing so it can be efficiently compiled to bytecode. And Kawa actually supports at least five different languages in its framework. So it'd be a significant undertaking to ramp up on the code and start pushing it forward. Not that I don't think about trying! Having a reasonable Emacs that you can run in the same JVM as the Java app you're developing would be nothing short of incredible.

I'll download Hemlock and actually try it out. All in all, I generally prefer Scheme to Common Lisp, though I wish for all of CL's library functions. I think Per is actually planning on adding namespaces to Kawa, and offering all the Common Lisp stuff to your Scheme code by explicitly prefixing it, e.g.:

(cl:eq foo bar)

is equivalent to the Scheme expression:

(eq? foo bar)

Being able to mix and match Scheme libraries, CL libraries, elisp libraries, and Java APIs together seamlessly (which he's working on) will make Kawa a serious contender in the Lisp-implementations lineup at some point. By the time Arc comes out, he'll be able to support that pretty well too.

I'm stalling, though. I really need to download CMUCL and start hacking with it. All my fun hacking has been elisp or scheme lately.

Posted by: Steve Yegge at February 22, 2005 11:27 PM

Chris: thanks for the pointer to XMLisp! It looks really nifty. Actually, AgentSheets itself looks kind of cool. Backing the link up a level:

http://agentsheets.com/lisp/

I wonder if their product is all written in Lisp? There's no mention of this anywhere on their website — which I suppose is probably great for marketing.

How'd you hear about this? Just a Google search?

Posted by: Steve Yegge at February 22, 2005 11:48 PM

Tree Regular Expressions (in SCSH Scheme)

XML is great stuff if you're working on text with a high content/markup ratio, like written documents. Take a look at the source of this blog page if you want an example. Vast swathes of text with a few P and EM elements. The problems occur when the text/markup ratio approaches zero.

Interview question for people who claim major XML chops: Why does XML have both attributes and elements?

Posted by: Derek U. at February 23, 2005 12:32 AM

I keep on hearing about how great lisp is, and lispers rant on and on about it, but unfortunately all the good lisp examples are badly out of date, or in some horrible state (emacs vs xemacs and the personal politics of RMS), or apparently almost completely unusable according to your previous entries on languages. So what is a programmer to do? If everyone evolves to lisp, the argument is just go straight to lisp, but if the language is so great, why isn't more evangelism being done on it?

Most statements involving lisp tend to be what the SEC would call 'forward looking statements' or what I'll call 'niche statements'. That is, people describe what COULD be or describe what works for a small number of people (why it never grows beyond that is never really talked about).

in the mean time on my Mac, I have the best UI development system ever conceived and it was built using Objective C - it's also shaping up to be one of the most advanced UI programming environments available period (CoreImage - screw 8 bit integers for each RGBA, how about FLOATS for each component?!). In the mean time the best face on lisp is ... emacs?!

I liked your entry on languages, but I was left with the distinct feeling that you don't like any programming language right now, any follow up there?

Posted by: Ryan R. at February 24, 2005 06:18 AM

The fundamental problem appears to be a chicken-and-egg problem, or just a game of chicken.

Nobody wants to use a language unless it's rock-solid: stable, fast, well-documented, well-specified, bug free (at least in complying with its specification), portable, etc. Oh, and you already have to know it, which limits the field a bit. Most people don't want to learn new things; they'd rather build new things, even if what they're building is something that already exists. Easier than learning the old thing.

I personally dislike most languages so much that I would hesitate to use them unless absolutely forced into it. (Even then, it would just be ho-hum, and not all THAT bad. Getting a dumb language to do what you want can be a fun challenge in itself — that's got to be part of why the existing popular languages are so popular.)

The game-of-chicken is that a language can't actually become rock-solid without a big community. The bigger the community, the more solid it gets. Hence, the most popular languages are the most solid: they're popular because they're solid, and they're solid because they're popular. It feeds on itself, reinforcing the desire to use the existing languages, no matter how awkward the languages are at saying things.

I care a lot about how easy or hard it is to say things in a programming language, because I've decided that I hate giant systems. They suck. The gianter a system, the more bugs it'll have, the harder it'll be to learn, the worse its availability will be, and the more people you'll need to hire — to make the system even more giant than it already is.

Higher-level languages provide mechanisms that allow you to say certain common things (such as: "give me all the elements of this vector that have a customer whose name starts with an S", although there are many other examples than just data-structure queries) much more conveniently than you can in C++ or Java. Not that you can't say them in C++ or Java — it just takes longer. C++ sometimes lets you say those things better than Java — but then screws you over with having to hand-manage memory and do all this other hooey, and on the balance, it's worse. Much worse.
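For instance, here is roughly what that query looks like in Python. The Order type and its fields are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Order:
    """Hypothetical element type: an order that carries its customer's name."""
    order_id: int
    customer_name: str


orders = [Order(1, "Smith"), Order(2, "Jones"), Order(3, "Sanchez")]

# "Give me all the elements of this vector that have a customer
# whose name starts with an S", said in one line.
s_orders = [order for order in orders if order.customer_name.startswith("S")]

print([order.order_id for order in s_orders])
```

The query is one line, and it reads almost exactly like the English sentence it implements.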

Over time, if you use a language that doesn't let you say things compactly (in other words, a language that doesn't let your refactoring make your code base smaller), then your system will grow giant, and then lots of problems will happen: longer builds, slower ramp-up, slower innovation, more bugs, lower availability, etc. Many developers (junior and senior) don't realize that it doesn't really have to be this way; it's one of the reasons I write this blog in the first place.

My position at the moment is that I'd like to see teams use the highest-level language that they can just barely tolerate. If everyone switched from C++ to Java, I would be overjoyed, and in no small part because stupid-ass porting projects like RHEL3 would disappear, freeing up engineers to do, you know, engineering, rather than a bunch of frigging porting work. Like, duh. Can't anyone at this company see how much pain C++ is causing us? No, because you have no accountability for porting your systems. Let dev-services do it... as if that'll work. There are half a dozen other reasons that are just as valid as the porting one. C++ is killing us. It's a virus that, once it's entered your company, will expand until it pushes everything else out, and you will become paralyzed. You all know this at some level, but you let it happen anyway.

Java just may be the only language out there today that's higher-level than C++ and still suitable for building a really massive service like CMS. That's because it's rock-solid (it really is, amazingly so nowadays), which stems from its popularity. I trash on J2EE a lot, because I'm squarely in the "Better, Faster, Lighter Java" camp, which is a subset of the "use small, reusable tools, not giant-ass frameworks" camp. That's a long discussion that you can have with Peter D., who agrees and feels even more strongly about it than I do. :-)

But Java in general is a damn good language platform. Could be better, sure, but there's a lot to be said for it. And two or three very promising developments in Java make it even more compelling: Java 1.5 (which adds some really nice features), AspectJ (which adds some very powerful features and now seems mature enough for production, if approached carefully), and the JVM scripting languages, which, if used cautiously, can greatly improve your ability to do things like unit testing, builds, debugging, configuration, scripting, prototyping, and other *auxiliary* work that we all do all the time.

If everyone switched to Java here, I'd be one happy camper. I'd say the same thing about Perl, except that Perl is so excruciatingly detestable (and broken in many ways) that I couldn't bear to recommend it. But I'd secretly still be pretty happy about Perl over C++.

Moving up the language power ladder, Python and Ruby are great for many tasks, and I think we could start using them in moderation. I picked Ruby somewhat arbitrarily, since it seems a little cleaner, but they're both good. Python is more "solid" by the definitions I gave above. Everything else — the ML family, Haskell, Smalltalk, and a bunch of others — none of them seem like they're solid enough for any production work at all here.

Python and Ruby are solid enough to use for auxiliary coding, but I'm not convinced either of them is solid enough (by which, again, I mean mature, stable, fast, etc.) to write (say) OMS or CMS. Java is solid enough, although you'll still have your share of serious engineering headaches that have little to do with the language — distribution, scaling, etc. Unless maybe you use a language like Erlang that makes these problems part of the language, but I remain a bit skeptical there.

So Java. Java is what I'd recommend. And Ruby or Python for anything you can get away with building using them. Nothing higher up the power curve is suitable today...

...except for Lisp. Lisp is the one (possible) exception. It appears pretty solid, and it happens to be very near (or at) the top of the power continuum. I'm holding out hope that some version of Lisp will be solid enough to use here. I'm looking at them. These things take time. What I want could best be described as "Common Scheme". Failing that, Common Lisp seems like the main contender.

Our interviewing process hires C++ programmers; that's what it's been carefully tuned to do (whether we deliberately set out to do that or not), and most of them don't know Lisp, let alone anything between Lisp and C++. The problem with Lisp, of course, is that it's so far up there that most programmers think of it as a complete joke, not even worth mentioning in the same breath as "work that one gets paid for".

And you're right, Emacs is the only application most people associate with Lisp. However, you don't get to see the production code from most companies. There are a lot of companies using Lisp, but it turns out that rather than trumpeting this, they hide the fact very carefully. They view it as a secret strategic advantage, and they don't want anyone (except potential hires) to know they're using it. Also, it doesn't do a thing for your product marketing if you say it's written in Lisp — that's more likely to scare people away than get them to use it. So Lisp actually appears to be a well-kept secret in the industry.

At least it appears to be. I haven't written anything big in it, so I can only speak from a sort of investigative standpoint. It could be a year, or even five years, before I really know. In the meantime, switching to Java is an absolutely outstanding thing for people to do. There are no doubts about it, no mysteries — it's as rock-solid as you need it to be. And it's not even scary.

Posted by: Steve Yegge at February 25, 2005 04:47 AM

"Common Scheme" being an industrial-strength, portable Scheme implementation with widespread support? Already satirized (and by Guy Steele no less):

http://zurich.ai.mit.edu/pipermail/rrrs-authors/1998-May/002343.html

The thread is an argument between the minimalist and maximalist groups of Scheme language designers.

Posted by: Derek U. at February 25, 2005 05:56 PM

In this entry you present us with a false dilemma. Either we write our configuration files in ad-hoc languages that eventually become turing-complete monstrosities, or we write them in s-expressions. Either our logfiles are XML and we are consigned to the world of XSLT/XPath, or they are lisp programs that know how to execute themselves.

I would like to present a third alternative:

logEntry = {
  "date" => "2005-02-21T18:57:39",
  "millis" => 1109041059800,
  "sequence" => 1,
  "logger" => nil,
  "level" => :SEVERE,
  "class" => "java.util.logging.LogManager$RootLogger",
  "method" => "log",
  "thread" => 10,
  "message" => "A very very bad thing has happened!",
  "exception" => {
    "message" => "java.lang.Exception",
    "frame" => {
      "class" => "logtest",
      "method" => :main,
      "line" => 30
    }
  }
}

LISP isn't the only language that has hierarchical data structures.

No, a hash table isn't executable and can't transform itself like a LISP program could, but it's not clear to me why it should. A stack trace is not inherently an executable thing, any more than my license plate number is. No, I can't subclass a hash table, but I can subclass this:

class LogfileEntry
  def initialize(data)
    @data = data
  end
end
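
As a rough sketch of where that could go: the wrapper can grow a reader method, and a subclass can add behavior without changing the data format. The `[]` method, the SevereEntry class, and its `summary` method are all hypothetical names made up for this example.

```ruby
# Same wrapper as above, plus a hypothetical reader method.
class LogfileEntry
  def initialize(data)
    @data = data
  end

  # Look up a top-level field by name (hypothetical helper).
  def [](key)
    @data[key]
  end
end

# Hypothetical subclass: adds behavior while the data stays a plain hash.
class SevereEntry < LogfileEntry
  def summary
    "#{self['level']}: #{self['message']}"
  end
end

entry = SevereEntry.new({
  "level" => :SEVERE,
  "message" => "A very very bad thing has happened!"
})

puts entry.summary
```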



So: yeah, hopefully we can do better than XML/XSLT. But what makes LISP the answer? And what does it really buy you when your data, which by itself is not meaningfully executable, is expressed in a syntax that could be executable?

Posted by: Josh H. at February 25, 2005 06:00 PM

I had never heard the term "s-expression" before. I saw this article on Google though. Seems somewhat relevant.

XML is not S-Expressions

Posted by: Joel H. at February 25, 2005 10:54 PM

The sections "Starting with Syntax" and "Redundancy Is Good" argue that XML is better for document processing, which is true.

The "Family Matters" section argues that, because (for example) XSLT is a domain-specific language for processing XML documents, it's a better choice than a general language for processing s-expressions. But each of the listed technologies is getting expanded and expanded because they don't support "real" programming language concepts:

"Many users have requested the ability to return a conditional value based on a boolean expression. XPath 2.0 MUST provide a conditional expression..."

"As part of the XSLT 1.1 work done on extension functions, a proposal to author XSLT extension functions in XSLT itself was deferred for reconsideration in XSLT 2.0. This would allow the functions in an extension namespace to be implemented in "pure" XSLT, without resulting to external programming languages."

"If a ::marker pseudo-element has its 'content' property set to normal, the following algorithm should be used to generate the computed value of the property."

"It is the ideal time right now for the W3C XPath 2.0 working group to make the decision to provide the necessary support for higher-order functions as part of the standard XPath 2.0 specification. In case this golden opportunity is missed, then generic templates and libraries will be used in the years to come."

(http://fxsl.sourceforge.net/articles/FuncProg/9.html)

Posted by: Derek U. at February 26, 2005 07:16 AM

I think Derek pegged it better than anyone when he commented earlier that XML is better for high content/markup ratio, and Lisp is better when it's mostly markup. I hadn't thought of it this way before, but it immediately rang true.

By way of background, you should be aware that Paul Prescod is a person who is trying to let the world know, in no uncertain terms, that Python people can be even snottier than Lisp people. He's one of the handful of folks that come to mind when I describe the Python community as "frosty".

So of course he's going to try very hard to dissociate XML from s-expressions, because if XML is really s-expressions (and you fail to take Derek's observation into account), then you could easily draw the conclusion that Lisp beats Python for XML processing, and Paul would very much like for nobody to draw that conclusion.

The XSLT-article guy is just whacked out on acid. He seems to think that by zooming out to a satellite's-eye view in his XSLT examples, he will give the impression that XSLT is as compact as Haskell. I can't help but feel, looking at all those tiny colored blobs, that I'm about to fall twenty thousand feet to my death, impaled on angle brackets, and that the colors are the spatters of everyone else who's fallen on them so far — including blue-blooded XSLT aficionados.

Posted by: Steve Yegge at February 28, 2005 11:05 PM

Josh: if you only focus on logfiles and other similarly inert-seeming data clumps, then Ruby's fine, and the solution you suggest seems fine.

The only nitpick I'd make is that defining the log entries as a language-specific hash makes it more difficult to process the entries in some other language, whereas with XML and Lisp, both of them are syntactically relatively straightforward, and you can in fact convert trivially and lexically between the two of them. But it's mostly splitting hairs.
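
Here's a minimal sketch of what that lexical conversion might look like, using nested Ruby arrays to stand in for s-expressions. The array shape, the function names, and the sample entry are all assumptions made up for illustration, and the sketch ignores attributes entirely.

```ruby
# Convert a nested-array "s-expression" into an XML string.
# Shape assumed here: [tag, child, child, ...], where each child is
# either another array (a nested element) or a scalar (text content).
def sexp_to_xml(node)
  return escape_xml(node.to_s) unless node.is_a?(Array)

  tag, *children = node
  "<#{tag}>" + children.map { |child| sexp_to_xml(child) }.join + "</#{tag}>"
end

# Escape the characters that are special in XML text content.
def escape_xml(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

entry = [:record,
         [:level, "SEVERE"],
         [:thread, 10],
         [:message, "A very very bad thing has happened!"]]

xml = sexp_to_xml(entry)
puts xml
```

The whole translation is one recursive function, because both formats are just trees written with different brackets. Going the other direction needs an XML parser, but the mapping is the same.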

For more complex documents, e.g. web pages or word-processor documents, I might be inclined to go with a Lisp dialect, if I were designing it from scratch. But who knows. Ruby's very nice too. If only it had native-code compilers and preemptive multithreading and a macro system and all those other goodies Lisp has...

Posted by: Steve Yegge at February 28, 2005 11:20 PM

p.s. it's important to realize that you can fairly cleanly divide programming languages into "scripting languages" and "programming languages", and I really understand the distinction now. Scripting languages are all a bunch of miserable hacks: Perl, Python, Ruby, Groovy, Tcl, Rexx... you name it. They all start life with no formal grammar or parser, no formal bytecode or native-code generation on the backend, no lexical scoping, no formal semantics, no type system, nothing. And that's where most of them wind up. They may grow and evolve in the right directions, and they all eventually become pleasant to work with, after enough hacks are piled on, but all of them are plagued with fundamental problems.

C, C++, Java, Objective-C, C#, Pascal, Lisp, and Scheme (to name a few) are all REAL languages, in all the senses I mentioned in the previous paragraph.

Notice that in "real" languages, you may or may not have good string-processing, or garbage collection, or OOP constructs, or first-class functions, or anything else. All those concerns are orthogonal to whether the language was built with a compiler framework in mind or not. And there's absolutely no reason someone shouldn't be able to create a "scripting language" that has a solid foundation.

What's the difference? Why do all those formal doo-hickeys matter?

Performance! Compiled languages are fast. Lisp is WAY faster than Ruby, over the long haul. Smokes it.

I care about other stuff besides performance, of course, and compilation gives you other benefits as well. But most of all, it gives me this feeling of security, knowing that the formal syntax and semantics are all well-specified. I think that's what I mean by "solid". (And despite the general ugliness of XML and friends, its formal specification is pretty good.)

Posted by: Steve Yegge at February 28, 2005 11:38 PM