The re module implements regular expressions. Here is the basic rule:

Do NOT use the re module

Usually, there are better ways to handle data than to throw a regular expression at it. Mind you, I utterly love using regexes in editors and I also like using them in lexers. But for most normal data processing, regular expressions are overkill. Also, the regular expression engine in Erlang has been altered so that it can be preempted. This keeps a long-running regex from destroying the VM’s scheduling mechanism.

Typical code will have maybe one or two calls to the re module per 5000 lines of code. If you have more than that, chances are you are trying to program Perl in Erlang. And that is a bad idea.
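To make the rule concrete, here is a minimal sketch of what "a better way" can look like. It assumes a fixed-width date format such as 2014-03-01; the module name date_demo and the function parse_date/1 are made up for illustration. Note that it does no range checking on the month or day.

```erlang
-module(date_demo).
-export([parse_date/1]).

%% Parse a date such as <<"2014-03-01">> with one binary pattern match
%% instead of a regular expression. The format is fixed-width, so the
%% pattern can name each field directly.
parse_date(<<Y:4/binary, $-, M:2/binary, $-, D:2/binary>>) ->
    try
        {ok, {binary_to_integer(Y), binary_to_integer(M), binary_to_integer(D)}}
    catch
        %% binary_to_integer/1 raises badarg on non-numeric fields.
        error:badarg -> {error, badarg}
    end;
parse_date(_Other) ->
    %% Wrong length or wrong separators.
    {error, badarg}.
```

The function head carries the whole "grammar", and the result is structured data rather than a list of captured strings.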

However, the re module is way faster than the older regexp module which was deprecated.

The binary() type

Erlang, like Haskell, saw the problem with a memory-heavy string type. So they both implement a type which is more efficient at handling large amounts of data. In Erlang, this is called a binary and in Haskell it is called a ByteString. Binaries are arrays of binary data. They are immutable, which means they can be shared. Also, references to subparts of a binary can be shared by having the system store 3 values, together called a sub-binary:

A pointer to the original binary

An offset

A size

Note that sub-binaries work a bit like Go’s slice construction in this way. The VM is built such that passing around binaries and sub-binaries is always efficient. The trick is immutability, which allows the system to pass pointers to binaries rather than passing the binary value itself.
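As a small sketch of where sub-binaries show up in practice, consider a made-up wire format: a 2-byte big-endian length followed by that many payload bytes. The module frame_demo and the function split_frame/1 are hypothetical names for this example.

```erlang
-module(frame_demo).
-export([split_frame/1]).

%% Hypothetical framing: 16-bit length, then Len bytes of payload.
%% Payload and Rest are cut out of the input by the pattern match;
%% the payload bytes do not have to be copied for this to work.
split_frame(<<Len:16, Payload:Len/binary, Rest/binary>>) ->
    {ok, Payload, Rest};
split_frame(_Incomplete) ->
    %% Not enough bytes yet for a whole frame.
    more.
```

Calling split_frame/1 again on Rest walks through a buffer frame by frame while only handing around references into the original data.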

Binaries, like slices in Go, also have extra capacity in the sense that in some cases a binary can be appended to efficiently without having to copy the binary data to a new place. The system will automatically extend binaries with extra capacity when they are copied, ensuring efficient append.
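The append case typically looks like the sketch below: an accumulator binary that only ever grows at the end. The module append_demo and its upcase/1 function are invented for the example, and it only handles ASCII letters.

```erlang
-module(append_demo).
-export([upcase/1]).

%% Upper-case the ASCII letters of a binary, building the result by
%% appending to an accumulator. This is the append pattern the
%% runtime is designed to handle without recopying Acc on every step.
upcase(Bin) when is_binary(Bin) ->
    upcase(Bin, <<>>).

upcase(<<C, Rest/binary>>, Acc) when C >= $a, C =< $z ->
    upcase(Rest, <<Acc/binary, (C - 32)>>);
upcase(<<C, Rest/binary>>, Acc) ->
    upcase(Rest, <<Acc/binary, C>>);
upcase(<<>>, Acc) ->
    Acc.
```

The straightforward version is the one to write first; whether a given loop really gets the optimisation is something to check afterwards rather than guess at.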

When programming Erlang, the compiler and VM will automatically generate binaries and sub-binaries for you. Write your code in a straightforward and readable manner first. Then compile your program with

+bin_opt_info

to have the compiler report on which binaries were not optimised in code which is heavily traversed by the program.

Binaries can be pattern matched. This is extremely efficient, but sometimes you can’t write a matching rule, since a pattern essentially always works from the beginning of the binary. A single pattern match can’t “search” in a binary until it hits something which matches. The way to handle this problem is by using the binary module:

binary:split/3 is extremely powerful. It is the binary variant of string:tokens/2, but it returns shared data and so only produces a small amount of garbage. The split parts simply refer into the original binary through sub-binaries. Be very aware of the option “[global]”, which allows you to split the binary into more than two parts.

binary:match/3 is your generic search routine for picking out parts deep inside binary data.

binary:compile_pattern/1 allows you to build some simple compiled patterns, like a weaker (but way faster) regular expression search.

binary:copy/1 forces a copy of a binary. This is useful if you have a 1 megabyte binary and you have found a 45 byte sequence in it, and you only want that sequence. Then you can copy the sequence, which means you don’t hold on to the 1 megabyte binary anymore, in turn allowing the garbage collector to collect it. This is extremely useful if you are cutting input into pieces (with split/3) and storing it at rest for a long time. For instance in ETS.
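A short sketch of the four functions in use. The module binmod_demo and its wrapper names are made up; the calls into the binary module are the standard ones.

```erlang
-module(binmod_demo).
-export([fields/1, find/2, split_any/1, detach/1]).

%% Split a comma-separated line into all of its fields.
%% Without [global] only the first comma would be used.
fields(Line) ->
    binary:split(Line, <<",">>, [global]).

%% Search for Needle anywhere in Subject.
%% Returns {Start, Length} or the atom nomatch.
find(Subject, Needle) ->
    binary:match(Subject, Needle, []).

%% Split on either of two separators via a compiled pattern.
%% In real code, compile the pattern once and reuse it.
split_any(Bin) ->
    Pattern = binary:compile_pattern([<<",">>, <<";">>]),
    binary:split(Bin, Pattern, [global]).

%% Detach a small piece from the (possibly huge) binary it was cut from,
%% so the large one can be garbage collected.
detach(Sub) ->
    binary:copy(Sub).
```

For example, fields(<<"a,b,c">>) gives [<<"a">>,<<"b">>,<<"c">>] and find(<<"hello world">>, <<"world">>) gives {6,5}. Something like detach/1 is what you want to call right before inserting a piece into ETS.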

The iodata() type

There is another quite important data type which I want to describe. It is called iodata() or iolist(). The rule is the following:

A list of integers in the range 0..255 is IOData.

A binary is IOData.

A list of IOData is IOData.

In particular, this means you can form IOData by collecting IOData as lists. This means string concatenation in the language is O(1). Example:

p(IOData) -> ["<p>", IOData, "</p>"].

The p/1 function given here will wrap IOData in a standard paragraph section in HTML, but it will not reallocate any data, nor will it generate any garbage. The sections “<p>” and “</p>” are constants, so the only allocation that will happen will be for the front of the list, two list elements.

Most IO functions in Erlang understand IOData directly. This means you can avoid having to flatten data, and just send the IOData out over a socket or the like. It avoids costly allocations and copies on the IO pipe in your program. Highly recommended!
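Here is a sketch of how this composes. The module iodata_demo and the function page/1 are made up for the example; p/1 is the function from above. The result of page/1 can be handed as-is to functions that accept IOData, such as gen_tcp:send/2 or file:write/2. The flattening in render_for_inspection/1 is only there so you can look at the result.

```erlang
-module(iodata_demo).
-export([page/1, render_for_inspection/1]).

%% Wrap a piece of IOData in a paragraph without copying it.
p(IOData) -> ["<p>", IOData, "</p>"].

%% Build a whole page out of nested IOData. Nothing is flattened here;
%% the paragraphs are just collected into a list.
page(Paragraphs) ->
    ["<html><body>", [p(P) || P <- Paragraphs], "</body></html>"].

%% Only for debugging or tests: flatten the IOData to one binary.
render_for_inspection(IOData) ->
    iolist_to_binary(IOData).
```

Each paragraph may itself be a binary, a string, or a nested mix of the two, and page/1 does not care which.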

A good way to gauge how well thought out a library is: look at how well it handles IOData. If it doesn’t, and requires you to explicitly call iolist_to_binary/1, then chances are the library is not that well written.

Handling unicode() data

Unicode data in Erlang is handled by the unicode module, which can convert between representations of Unicode. My recommendation, however, would be to keep most Unicode data as UTF-8 strings in binaries. You can match on Unicode code points:

-module(z).
-export([len/1]).

len(B) when is_binary(B) ->
    len(B, 0).

len(<<>>, K) -> K;
len(<<_/utf8, Rest/binary>>, K) -> len(Rest, K+1).

This is useful together with the ability to input character strings as UTF-8:

Erlang R16B03-1 (erts-5.10.4) [source-ce3d6e8] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Eshell V5.10.4 (abort with ^G)

...

2> z:len(<<"Rødgrød med fløde"/utf8>>).

17

3>
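For the conversion side, a minimal sketch using the unicode module. The module uni_demo and its two wrapper functions are made-up names; the code points are written as integers so the example does not depend on the source file encoding.

```erlang
-module(uni_demo).
-export([to_utf8/1, to_codepoints/1]).

%% Turn character data (for instance a list of code points) into a
%% UTF-8 encoded binary. On bad input the unicode module returns an
%% {error, ...} or {incomplete, ...} tuple instead of a binary.
to_utf8(Chars) ->
    unicode:characters_to_binary(Chars).

%% Turn a UTF-8 binary back into a list of code points.
to_codepoints(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).
```

The code point 248 is the letter ø. It takes two bytes in UTF-8, so to_utf8([82, 248, 100]) is the 4-byte binary <<82,195,184,100>>.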

In Release 17.0, the default will be UTF-8 in Erlang files. This will probably have some deep effects on Erlang source code, but it will ultimately help getting Unicode into Erlang. Release 18.0 is planned to accept unicode atoms as well, to open up the design space.

Attack Vectors

The basic rule of all Erlang string handling is this:

Never never NEVER work on stringly typed data

When data is “stringly” typed, it means that your data has no structure and everything is represented as strings. This is fairly expensive for a system to work on, since you are limited to using the string-handling code. Parsing is always expensive and this hurts your processing speed. Some languages, like awk or Perl, are built to process this kind of thing. But you would rather not do this processing in Erlang.

The way you avoid stringly typed data is to take the input and transform it as early as possible into structured data inside your code. That way, you can avoid working on the string data, and you only need it in a few places. The more structure you can attach to the data, the better.

The primary problem is when you have to work with a bad data format. Again, the trick is to turn the bad format into something sensible quickly and then process it as sensible data.

Erlang is designed to work with data that has structure. With structure, you can pattern match which is fast. Without structure, you have to rely on standard techniques and this is almost always going to be a pain in the language. So don’t do it. Convert data into a structured format and then proceed by handling the structure with pattern matches.
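A sketch of the idea, assuming a made-up line protocol with the two commands "GET key" and "SET key value". The module cmd_demo, its function names and the list-based store are all hypothetical. The string is parsed exactly once, at the edge, and everything after that works on tuples.

```erlang
-module(cmd_demo).
-export([parse/1, apply_cmd/2]).

%% Edge of the system: turn the raw line into structured data.
parse(<<"GET ", Key/binary>>) when Key =/= <<>> ->
    {ok, {get, Key}};
parse(<<"SET ", Rest/binary>>) ->
    %% Split at the first space only; the value may contain spaces.
    case binary:split(Rest, <<" ">>) of
        [Key, Value] when Key =/= <<>> -> {ok, {set, Key, Value}};
        _ -> {error, badcmd}
    end;
parse(_Other) ->
    {error, badcmd}.

%% Inside the system: only pattern matching on structure, no strings.
%% The store is a plain list of {Key, Value} tuples to keep this short.
apply_cmd({get, Key}, Store) ->
    {proplists:get_value(Key, Store), Store};
apply_cmd({set, Key, Value}, Store) ->
    {ok, lists:keystore(Key, 1, Store, {Key, Value})}.
```

Bad input is rejected in parse/1, so apply_cmd/2 never has to ask itself whether the data is well-formed.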

Immutability

Erlang takes a stance. All data are immutable. In particular, strings are immutable. Binaries are immutable. There will be an overhead to this stance. If you can’t accept this, you must pick another language. That said, the advantages of immutability far outshine the benefits of mutability.

Erlang is immutable because immutability eliminates a large source of programming errors. After all, the value of an incorrect program is lower than that of a correct one. This stance is highly unlikely to change, since the safety guarantee provided by immutability is part of the Erlang DNA.

When you have control of data

In some situations you control the format of the data you are going to use. This is an excellent opportunity to pick some clever ways of working with data, in particular to enforce structure on data by default. If communicating between Erlang systems, you can use term_to_binary/1 and binary_to_term/2 and just put data at rest in the standard Erlang format. If the foreign system also supports this format, it is an excellent way to interchange data. The encoder/decoder for this format is written in C and it also handles very large terms with grace: the running process will be preempted while producing the binary.
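A minimal round-trip sketch. The module etf_demo and its function names are made up. The compressed option on the encoding side is my own choice for the example; the safe option on the decoding side makes binary_to_term/2 refuse data that would create new atoms, which is what you want when the binary comes from somewhere you don't fully trust.

```erlang
-module(etf_demo).
-export([encode/1, decode/1]).

%% Encode any Erlang term in the standard external term format.
%% [compressed] is optional; it trades some CPU for smaller output.
encode(Term) ->
    term_to_binary(Term, [compressed]).

%% Decode again. With [safe], decoding fails with badarg rather than
%% creating atoms that do not already exist in the VM.
decode(Bin) when is_binary(Bin) ->
    binary_to_term(Bin, [safe]).
```

The structure survives the trip untouched: decode(encode(T)) gives back a term equal to T, with no parsing code written by hand.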

The man page for inet

erl -man inet / setopts\( RET

describes common socket options you can set on a socket. By setting the packet option you can control how the system decodes inbound data. Most interestingly, you can set ASN.1 BER encoding or line-wise encoding.
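A sketch of the line-wise case on a loopback connection. The module packet_demo and the function run/0 are made up, and the 5 second timeouts are arbitrary. The server side switches its socket to {packet, line} with inet:setopts/2, after which each recv hands back one line.

```erlang
-module(packet_demo).
-export([run/0]).

%% Open a loopback connection, send two lines in one write, and let
%% the socket layer cut them apart on the receiving side.
run() ->
    Opts = [binary, {packet, raw}, {active, false}],
    {ok, Listen} = gen_tcp:listen(0, [{ip, {127,0,0,1}} | Opts]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port, Opts),
    {ok, Server} = gen_tcp:accept(Listen, 5000),
    %% Ask the system to decode inbound data line by line.
    ok = inet:setopts(Server, [{packet, line}]),
    %% The payload is IOData: a string and a binary in a list.
    ok = gen_tcp:send(Client, ["hello\n", <<"world\n">>]),
    {ok, Line1} = gen_tcp:recv(Server, 0, 5000),
    {ok, Line2} = gen_tcp:recv(Server, 0, 5000),
    ok = gen_tcp:close(Client),
    ok = gen_tcp:close(Server),
    ok = gen_tcp:close(Listen),
    [Line1, Line2].
```

The framing is done below your code, so the Erlang side never has to buffer and scan for newlines itself.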

If you value speed, you should consider an efficient binary format.

JSON

The ubiquitous format you need to handle today is JSON. I don’t particularly like JSON as a data exchange format, since it is very weak in what types it encodes. I’d much rather have a format like Joe Armstrong’s UBF or the Clojure edn & data.fressian encodings. But JSON it is. There are two good JSON libraries for Erlang: jsx and jiffy.

They differ in how they are implemented. The jsx library is implemented in pure Erlang and is the slower of the two. The jiffy library uses a C NIF (Native Implemented Function) to run the encoder and decoder and is about 10 times faster.

Beware: the jiffy library can’t be used to decode large JSON data sets. The decoder is not a well-behaved NIF and as such it can mess up the schedulers if it is used to decode large data strings. If the JSON takes more than 1ms to decode, you should probably avoid using it. In Release 17.0, we get so-called dirty schedulers which can be used to work around this problem.

The other problem with JSON data is the internal Erlang representation. Right now, there is no isomorphic mapping for JSON objects/dictionaries into Erlang. This will be fixed in Release 17.0 as well, since it includes maps, so you can obtain a native mapping of objects/dictionaries into Erlang data. I also have a side project underway to properly handle JSON to Record encoding in Erlang, but this is still ongoing work. And it will take some time before it is fully implemented.

On the other hand, note that JSON will never be a fast interchange format. If you use JSON to move large amounts of data, you are screwed. Plain and simple. Your best bet is then to hope data are sent as many small pieces so you can use jiffy on them. Or wait till 17.0 so you can get jiffy in a dirty-scheduler variant.

Files

You should study the section “PERFORMANCE” of the man page

erl -man file

Note that in order to have fast file I/O you need to open the file in “raw” mode, use binaries, and you can usually also benefit by following some of the advice in the section about performance. The general rule of IOData applies: If you supply IOData, the underlying file driver is able to map this onto a unix pwrite(2) call which is highly efficient.
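A sketch of the raw-mode recipe. The module rawfile_demo and the function write_lines/2 are made up, and the path is whatever the caller passes in. Keep in mind that a file opened in raw mode can only be used by the process that opened it.

```erlang
-module(rawfile_demo).
-export([write_lines/2]).

%% Write a list of lines to Path in raw mode. Each line may be any
%% IOData; the whole batch is handed to the file driver as one nested
%% list, without flattening it first.
write_lines(Path, Lines) ->
    {ok, Fd} = file:open(Path, [raw, binary, write]),
    try
        file:write(Fd, [[Line, $\n] || Line <- Lines])
    after
        %% Always release the descriptor, also if the write fails.
        file:close(Fd)
    end.
```

Calling write_lines(Path, [<<"a">>, "b"]) returns ok and leaves a file with the two lines in it.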

Not opening in raw mode does have its benefits though, because you can then get the IO subsystem to work. This subsystem allows you to open a file on a foreign file system (on another node) and then operate on that file. If you don’t need the high speed, this is desirable in many situations, should your system span multiple nodes.

Closing off

It is a myth that Erlang strings are slow. You will have to think a bit more about what you do in order for the system to speed up. But chances are that string processing won’t be your limit. It is much more likely that your bottleneck will have to do with a lock, or the wrong structure of processes, than with slow strings.