Erlang String Issue 9 September 2008

Posted by Oliver Mason in erlang

I could never really understand people complaining about Erlang’s lack of a string data structure; having them represented as lists seemed fine, especially as there is no problem with representing unicode characters that go beyond 255. The increased size didn’t bother me: people worried about that when unicode first arrived and it didn’t really turn out as a big issue. And I do a lot of processing of large texts.

However, working on a basic YAML library for Erlang (see code repository) I ran into a problem where I least exected to be one: generating a YAML representation of a data structure.

Parsing basic YAML (just lists and mappings) was a piece of cake in the end (ignoring the more, erm, arcane bits of the YAML spec), and writing it seemed to be trivial. However, when it comes to writing out a list the problems start. I cannot differenciate between a list of numbers and a string. Hence, I cannot decide which way to represent it in YAML, as a string or a list of numbers.

This is only a human-factor issue: writing it as a list of numbers and then reading it in again obviously gives identical results, so for serialisation purposes it’s fine. But when editing the file I don’t really fancy looking up the ASCII codes for all the letters I want to change. A purely ‘presentational’ issue, but a tricky one.

Solutions? I could check any list if all the elements are in the range of readable characters, which I assume is what the Erlang VM does when printing something. But that’s a hack, really. Another solution is to go the whole hog and introduce a string type, presumably something like ‘{string, [45,64,…]}’. Then I can easily make the decision how to write it out. And parsing would be easy as well.

Now, that looks like a good idea, but it would interfere with the way most other programs use strings. And the string library wouldn’t work. So I guess I have to write my own.

My current plan is to allow various representations of strings, so that the data structure will actually be ‘{string, TYPE, DATA}’, where TYPE could be ‘list’ (the current way), ‘binary’ (a binary in UTF-8 format), ‘rope’ (a different approach to strings, see description with Java code). Other representations could be added at a later stage. There would be a function estring:to_list(string), which would convert the string into a list for use with the string library, and estring:to_string(list) would create a string from the given list. The default representation would probably be a binary, as that can in turn be used as a component of a rope easily once something gets added to it.

Another alternative would be to use Starling, but I’m not too keen on it as it’s not pure Erlang, using a C/C++ library under the hood.

How do you deal with strings?

UPDATE: Here’s a discussion about the same problem in the context of JSON: http://www.lshift.net/blog/2007/09/13/how-should-json-strings-be-represented-in-erlang