Fixing Unix/Linux/POSIX Filenames:

Control Characters (such as Newline), Leading Dashes, and Other Problems

David A. Wheeler

Seek freedom and become captive of your desires, seek discipline and find your liberty. — Frank Herbert, Dune



“Negative freedom is freedom from constraint, that is, permission to do things; Positive freedom is empowerment, that is, ability to do things... Negative and positive freedoms, it might seem, are two different descriptions of the same thing. No! Life is not so simple. There is reason to think that constraints (prohibitions, if you like) can actually help people to do things better. Constraints can enhance ability...” — Angus Sibley, “Two Kinds of Freedom”



“...filesystem people should aim to make “badly written” code “just work” unless people are really really unlucky. Because like it or not, that’s what 99% of all code is... Crying that it’s an application bug is like crying over the speed of light: you should deal with *reality*, not what you wish reality was.” — Linus Torvalds, on a slightly different topic (but I like the sentiment)



Years ago I thought the lack of restrictions were a sign of simple and clean design to be held up as a badge of honor compared to more limited operating systems. Now that I am responsible for production shell scripts I am a firm supporter of your view that filenames should be UTF-8 with no control characters. Other troublesome filenames you pointed out such as those with leading and trailing spaces and leading hyphens should probably be prohibited too. — Doug Quale, email dated 2016-10-04

Traditionally, Unix/Linux/POSIX pathnames and filenames can be almost any sequence of bytes. A pathname lets you select a particular file, and may include zero or more “/” characters. Each pathname component (separated by “/”) is a filename; filenames cannot contain “/”. Neither filenames nor pathnames can contain the ASCII NUL character (\0), because that is the terminator.
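To make these traditional rules concrete, here is a short Python3 demonstration (my own, not from the original text) showing that a POSIX filesystem happily accepts filenames containing a newline, an escape character, a leading dash, or a leading space:

```python
import os
import tempfile

# Perfectly "legal" but troublesome POSIX filenames.
troublesome = ['-n', 'a\nb', 'esc\x1b[2Jape', ' leading-space']

with tempfile.TemporaryDirectory() as d:
    for name in troublesome:
        # open()/close() is enough to create the file; every name succeeds.
        open(os.path.join(d, name), 'w').close()
    created = sorted(os.listdir(d))

# Every one of these names was accepted verbatim by the filesystem.
assert created == sorted(troublesome)
```

Only NUL and “/” would have been rejected; everything else, including control characters, goes straight through.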

This lack of limitations is flexible, but it also creates a legion of unnecessary problems. In particular, this lack of limitations makes it unnecessarily difficult to write correct programs (enabling many security flaws). It also makes it impossible to consistently and accurately display filenames, causes portability problems, and confuses users.

This article will try to convince you that adding some tiny limitations on legal Unix/Linux/POSIX filenames would be an improvement. Many programs already presume these limitations, the POSIX standard already permits such limitations, and many Unix/Linux filesystems already embed such limitations — so it’d be better to make these (reasonable) assumptions true in the first place. This article will discuss, in particular, the three biggest problems: control characters in filenames (including newline, tab, and escape), leading dashes in filenames, and the lack of a standard character encoding scheme (instead of using UTF-8). These three problems impact programs written in any language on a Unix/Linux/POSIX system. There are other problems, of course. Spaces in filenames can cause problems; it’s probably hopeless to ban them outright, but resolving some of the other issues will simplify handling spaces in filenames. For example, when using a Bourne shell, you can use an IFS trick (using IFS="`printf '\n\t'`" to split only on newline and tab) to eliminate some problems with spaces. Similarly, special metacharacters in filenames cause some problems; I suspect few if any metacharacters could be forbidden on all POSIX systems, but it’d be great if administrators could locally configure systems so that they could prevent or escape such filenames when they want to. I then discuss some other tricks that can help.

After limiting filenames slightly, creating completely-correct programs is much easier, and some vulnerabilities in existing programs disappear. This article then notes some others’ opinions; I knew that some people wouldn’t agree with me, but I’m heartened that many do agree that something should be done. Finally, I briefly discuss some methods for solving this long-term; these include forbidding creation of such names (hiding them if they already exist on the underlying filesystem), implementing escaping mechanisms, or changing how tools work so that these are no longer problems (e.g., when globbing/scanning, have the libraries prefix “./” to any filename beginning with “-”). Solving this is not easy, and I suspect that several solutions will be needed. In fact, this paper became long over time because I kept finding new problems that needed explaining (new “worms under the rocks”). If I’ve convinced you that this needs improving, I’d like your help in figuring out how to best do it!

Filename problems affect programs written in any programming language. However, they can be especially tricky to deal with when using Bourne shells (including bash and dash). If you just want to write shell programs that can handle filenames correctly, you should see the short companion article Filenames and Pathnames in Shell: How to do it correctly.

Imagine that you don’t know Unix/Linux/POSIX (I presume you really do), and that you’re trying to do some simple tasks. For our purposes we will primarily show simple scripts on the command line (using a Bourne shell) for these tasks. However, many of the underlying problems affect any program, as we'll show by demonstrating the same problems in Python3.

Leading dash disaster

For example, let’s try to print out the contents of all files in the current directory, putting the contents into a file in the parent directory:

cat * > ../collection # WRONG

The list doesn’t include “hidden” files (filenames beginning with “.”), but often that’s what you want anyway, so that’s not unreasonable. The problem with this approach is that although this usually works, filenames could begin with “-” (e.g., “-n”). So if there’s a file named “-n”, and you’re using GNU cat, all of a sudden your output will be numbered! Oops; that means on every command we have to disable option processing.

Some earlier readers thought that this was a shell-specific problem, even though I repeatedly said otherwise. Their “solution” was to use another language like Python... except the problem doesn't go away. Let's write the same thing in Python3:

#!/usr/bin/env python3   # WRONG
import subprocess, os
subprocess.run(['cat'] + os.listdir('.'),
               stdout=open('../collection', 'w'))

Exactly the same problem happens in Python3 and in any other language - if there is a filename beginning with - , the receiving program will typically see that as an option flag (not a file) and mishandle it. Notice that this invocation of subprocess.run does not use a shell (there are options like shell=True that would do that, but we aren't using any of them). So the illusion that “this is just a shell problem” is proven false. It's true that you would not normally run cat from within Python, but it's also rare to run cat from a shell. Instead, cat is here as a trivial demo showing that safely invoking other programs is harder than it should be. Programs written in any language often do need to invoke other programs... and here we see the danger of doing so.

The “obvious” way to resolve this problem is to litter command invocations with “--” before the filename(s). You will find many people recommending this. But it turns out that solution doesn’t really work, because not all commands support “--” (ugh!). For example, the widely-used “echo” command is not required to support “--”. What’s worse, echo does support at least one dash option, so we need to escape leading-dash values somehow. POSIX recommends that you use printf(1) instead of echo(1), but some old systems do not include printf(1). Many other programs that handle options do not understand “--” either, so this is not a robust solution.

In my opinion, a much better solution is to prefix globs like this with “./”. In other words, you should do this instead:

cat ./* > ../collection # CORRECT

Prefixing relative globs with “./” always solves the “leading dash” problem, but it sure isn’t obvious. In fact, many shell books and guides completely omit this information, or don’t explain it until far later in the book (which many people never read). Even people who know this will occasionally forget to do it. After all, people tend to do things the “easy way” that seems to work, resulting in millions of programs that have subtle bugs (which sometimes lead to exploits). Complaining that people must rewrite all of their programs to use a non-obvious (and ugly) construct is unrealistic. Most people who write cat * do not intend for the filenames to be used as command options (as noted in The Unix-Haters Handbook, page 27).
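The same “./” prefix fixes the earlier Python3 version, too. Here is a minimal sketch (the cat_all helper is my own invention, and it assumes the directory holds only regular files):

```python
import os
import subprocess

def cat_all(directory):
    """Concatenate every file in `directory`, defusing leading dashes
    by prefixing each relative name with './' before invoking cat."""
    names = sorted(os.listdir(directory))
    if not names:
        return b''   # avoid running a bare 'cat' that would read stdin
    paths = ['./' + name for name in names]   # './-n' is a file, not an option
    result = subprocess.run(['cat'] + paths, cwd=directory,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

A file literally named -n is now read as data; without the prefix, GNU cat would have interpreted it as its “number the output lines” option.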

List of files disaster (newlines allowed)

In many cases globbing isn’t what we want. We probably don’t want the “cat *” command to examine directories, and glob patterns like “*” won’t recursively descend into subdirectories either. It is often the case that we want to handle a large collection of files spread across directories, and we may want to record information about those files (such as their names) for processing later.

The primary tool for walking POSIX filesystems in shell is the “find” command, and many languages have a built-in library to recursively walk directories. In theory, we could just replace the “*” with something that computes the list of such file names (which will also include the hidden files):

cat `find . -type f` > ../collection # WRONG

This construct doesn’t fail because of leading dashes; find always prefixes filenames with the starting directory, so all of the filenames in this example will start with “./”. This construct does have trouble with scale — if the list is really long, you risk an “argument list too long” error, and even if it works, the system has to build up a complete list all at once (which is slow and resource-consuming if the list is long). Even if the list of files is short, this construct has many other problems. One problem (among several!) is that if filenames can contain spaces, their names will be split (file “a b” will be incorrectly parsed as two files, “a” and “b”).

Okay, so let’s use a “for” loop, which is better at scaling up to large sets of files and complicated processing of the results. When using shell you need to use set -f to deal with filenames containing glob characters (like asterisk), but you can do that. Problem is, the “obvious” for loop won’t work either, for the same reason; it breaks up filenames that contain spaces, newlines or tabs:

( set -f ; for file in `find . -type f` ; do   # WRONG
    cat "$file"
  done ) > ../collection

( find . -type f |                             # WRONG
  while read filename ; do cat "$filename" ; done ) > ../collection

Now at this point, some of you may suggest using xargs, like this:

( find . -type f | xargs cat ) > ../collection # WRONG, WAY WRONG

Yet this is wrong on many levels. If you try to use xargs, and limit yourself to the POSIX standard, xargs is painful to use. By default, xargs’ input is parsed, so space characters (as well as newlines) separate arguments, and the backslash, apostrophe, and double-quote characters are used for quoting. According to the POSIX standard, underscore may have a special meaning (it will stop processing) if you omit the -E option, too! So even though this “simple” use of xargs works on some filenames, it fails on many characters that are allowed in filenames. The xargs quoting convention isn’t even consistent with the shell’s. Using xargs while limiting yourself to the POSIX standard is an exercise in pain if you are trying to create actually-correct programs, because it requires substitutions to work around xargs quoting.

So let’s “fix” handling filenames with spaces by combining find (which can output filenames a line at a time) with a “while” loop (using read -r and IFS), a “for” loop, xargs with quoting and -E, or xargs using a non-standard GNU extension “-d” (the extension makes xargs more useful):

# WRONG:
( find . -type f |
  while IFS="" read -r filename ; do cat "$filename" ; done ) > ../collection
# OR WRONG:
IFS="`printf '\n'`" # Split filenames only on newline, not space or tab
( for filename in `find . -type f` ; do
    cat "$filename"
  done ) > ../collection
# OR WRONG, yet portable; space/backslash/apostrophe/quotes ok in filenames:
( find . -type f | sed -e 's/[^[:alnum:]]/\\&/g' | xargs -E "" cat ) > ../collection
# OR WRONG _and_ NON-STANDARD (uses a GNU extension):
( find . -type f | xargs -d "\n" cat ) > ../collection

Whups, all four of these don’t work correctly either. All of these create a list of filenames, with each filename terminated by a newline (just like the previous version of “while”). But filenames can include newlines!

Handling filenames with all possible characters (including newlines) is often hard to do portably. You can use find...-exec...{} , which is portable, but this gets ugly fast if the command being executed is nontrivial. It can also be slow, because this has to start a new process for every file, and the new process cannot trivially set a variable that can be used afterwards (the variable value disappears when the process goes away). POSIX has more recently extended find so that find -exec ... {} + (plus-at-end) creates sets of filenames that are passed to other programs (similar to how xargs works); this is faster, but it still creates new processes, making tracking-while-processing very inconvenient. I believe that some versions of find have not yet implemented this more recent addition, which is another negative to using it (but it is standard so I expect that problem to go away over time). In any case, both of these forms get ugly fast if what you’re exec-ing is nontrivial:

# These are CORRECT but have many downsides:
( find . -type f -exec cat {} \; ) > ../collection
# OR
( find . -type f -exec cat {} + ) > ../collection

Is this a problem just for shell? Not at all. Other languages do have libraries for safely walking directory structures, and typically they handle this correctly... but that is not the only situation. It's quite common to want to make a list of files that are stored somewhere, typically in a file, for reuse later. This is commonly done by storing a list of filenames where each name is terminated by a newline. Why? Because lots of tools easily handle that format, and it is the "obvious" thing to do. For example, here's how you might do this (incorrectly) in Python3:

#!/usr/bin/env python3   # WRONG
with open('filelist.txt') as fl:
    for filename in fl:
        ...  # do something with filename, e.g., open it

If you use GNU find and GNU xargs, you can use non-standard extensions to separate filenames with \0 instead:

# CORRECT but nonstandard:
( find . -type f -print0 | xargs -0 cat ) > ../collection
# OR, also correct but nonstandard:
find . -print0 |
while IFS="" read -r -d "" file ; do
  ... # Use "$file" not $file everywhere.
done

But using \0 as a filename separator requires that you use non-standard (non-portable) extensions in shell, this convention is supported by only a few tools, and the option names to use this convention (when available) are jarringly inconsistent (perl has -0, while GNU tools have sort -z, find -print0, xargs -0, and grep either -Z or --null). This format is also difficult to view and modify (in part because so few tools support it), compared to the line-at-a-time format that is widely supported. You can’t even pass such null-separated lists back to the shell via command substitution; cat `find . -print0` and similar “for” loops don’t work. Even the POSIX standard’s version of “read” can’t use \0 as the separator (POSIX’s read has the -r option, but not bash’s -d option), so they’re really awkward to use.

The problem hits other languages, too. Many applications, regardless of their implementation language, store information using one filename per line (with an unencoded filename) because so many tools support that format. The only problem is that this is wrong when newlines can occur in filenames.
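In Python3, the NUL-separator idea gives a correct variant of the earlier file-list sketch. This works because NUL is the one byte that can never appear in a POSIX filename (the helper names here are mine, not a standard API):

```python
import os

def save_filelist(filenames, listpath):
    """Write filenames to listpath, NUL-terminated, preserving any bytes."""
    with open(listpath, 'wb') as f:
        for name in filenames:
            # os.fsencode preserves arbitrary filename bytes on POSIX.
            f.write(os.fsencode(name) + b'\0')

def load_filelist(listpath):
    """Read back a NUL-separated file list; newlines in names survive."""
    with open(listpath, 'rb') as f:
        data = f.read()
    # The trailing separator yields one empty chunk; drop empties.
    return [os.fsdecode(chunk) for chunk in data.split(b'\0') if chunk]
```

Round-tripping a list containing names with spaces, tabs, newlines, and leading dashes returns exactly the original names.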

This is silly; processing lines of text files is well-supported, and filenames are an extremely common data value, but you can’t easily combine these constructs?

Displaying filenames disaster (control characters allowed and no encoding is enforced)

Oh, and don’t display filenames. Filenames could contain control characters that control the terminal (and X-windows), causing nasty side-effects on display. Displaying filenames can even cause a security vulnerability — and who expects printing a filename to be a vulnerability?!? In addition, you have no way of knowing for certain what the filename’s character encoding is, so if you got a filename from someone else who uses non-ASCII characters, you’re likely to end up with garbage mojibake.

Again, this is not just a shell issue. Merely displaying filenames in any language can be dangerous, and there is no guarantee that the encoding of the filename is the same as the encoding used by standard output. So this is an example of an incorrect and potentially dangerous Python3 program:

#!/usr/bin/env python3   # WRONG - control characters and encoding issue
import os
for filename in os.listdir('.'):
    print(filename)
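One defensive approach (a sketch of my own, not a standard library facility) is to escape non-printable characters before a filename ever reaches the terminal:

```python
def safe_display(filename):
    """Render a filename for terminal output: replace control and other
    non-printable characters with backslash escapes. (A real implementation
    would also escape the backslash itself to avoid ambiguity.)"""
    out = []
    for ch in filename:
        if ch.isprintable():
            out.append(ch)
        elif ord(ch) < 256:
            out.append('\\x%02x' % ord(ch))
        else:
            out.append('\\u%04x' % ord(ch))
    return ''.join(out)
```

With this, a malicious name like 'evil\x1b[2Jname' prints as the harmless text evil\x1b[2Jname instead of sending a clear-screen control sequence to the terminal.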

Avoiding problems is too hard

Ugh — lots of annoying problems, caused not because we don’t have enough flexibility, but because we have too much. Many documents describe the complicated mechanisms that can be used to deal with this problem, such as BashFAQ’s discussion on handling newlines in filenames. Many of the suggestions posted on the web are wrong, for example, many people recommend the incorrect while read line as the correct solution. In fact, I found that the BashFAQ’s 2009-03-29 entry didn’t walk files correctly either (one of their examples used for file in *.mp3; do mv "$file" ... , but this fails if a filename begins with “-”; yes, I fixed it). If the “obvious” approaches to common tasks don’t work correctly, and require complicated mechanisms instead, I think there is a problem.

In a well-designed system, simple things should be simple, and the “obvious easy” way to do simple common tasks should be the correct way. I call this goal “no sharp edges” — to use an analogy, if you’re designing a wrench, don’t put razor blades on the handles. Typical Unix/Linux filesystems fail this test — they do have sharp edges. Because it’s hard to do things the “right” way, many Unix/Linux programs simply assume that “filenames are reasonable”, even though the system doesn’t guarantee that this is true. This leads to programs with occasional errors that are sometimes hard to solve.

In some cases, these errors can even be security vulnerabilities. My “Secure Programming for Linux and Unix HOWTO” has a section dedicated to vulnerabilities caused by filenames. Similarly, CERT’s “Secure Coding” item MSC09-C (Character Encoding — Use Subset of ASCII for Safety) specifically discusses the vulnerabilities due to filenames. The Common Weakness Enumeration (CWE) includes 3 weaknesses related to this (CWE 78, CWE 73, and CWE 116), all of which are in the 2009 CWE/SANS Top 25 Most Dangerous Programming Errors. Vulnerabilities CVE-2011-1155 (logrotate) and CVE-2013-7085 (uscan in devscripts, which allowed remote attackers to delete arbitrary files via a whitespace character in a filename) are a few examples of the many vulnerabilities that can be triggered by malicious filenames.

These types of vulnerabilities occasionally get rediscovered, too. For example, Leon Juranic released in 2014 an essay titled Back to the Future: Unix Wildcards Gone Wild, which demonstrates some of the problems that can be caused because filenames can begin with a hyphen (which are then expanded by wildcards). I am really glad that Juranic is making more people aware of the problem! However, this is not new information; these types of vulnerabilities have been known for decades. Kucan comments on this, noting that this particular vulnerability can be countered by always beginning wildcards with “./”. This is true, and for many years I have recommended prefixing globs with “./”. I still recommend it as part of a solution that works today. However, we’ve been trying to teach people to do this for decades, and the teaching is not working. People do things the easy way, even if it creates vulnerabilities.

It would be better if the system actually did guarantee that filenames were reasonable; then already-written programs would be correct. For example, if you could guarantee that filenames don’t include control characters and don’t start with “-”, the following script patterns would always work correctly:

#!/bin/sh
# CORRECT if files can't contain control chars and can't start with "-":
set -eu # Always put this in Bourne shell scripts
IFS="`printf '\n\t'`" # Always put this in Bourne shell scripts
# This presumes filenames can't include control characters:
for file in `find .` ; do
  ... command "$file" ...
done
# This presumes filenames can't begin with "-":
for file in * ; do
  ... command "$file" ...
done
# You can print filenames if they're always UTF-8 & can't inc. control chars

I comment on a number of problems that filenames cause the Bourne shell, specifically, because anything that causes problems with Bourne shell scripts interferes with use of Unix/Linux systems. The Bourne shell is not going away; it is built into POSIX, it is directly used by nearly every Unix-like system for starting it up, and most GNU/Linux users use Bourne shells for interactive command line use. What’s more, the leading contender, the C shell (csh), is loathed by many (for an explanation, see “Csh Programming Considered Harmful” by Tom Christiansen). Now, it’s true that some issues are innate to the Bourne shell, and cannot be fixed by limiting filenames. The Bourne shell is actually a nice programming language for what it was designed for, but as noted by Bourne himself, its design requirements led to compromises that can sometimes be irksome. In particular, Bourne shell scripts will still need to double-quote variable references in most cases, even if filenames are limited to more reasonable values. For those who don’t know, when using a variable value, you usually need to write "$file" and not $file in Bourne shells (due to quirks in the language that make it easy to use interactively). You don’t need to double-quote values in certain cases (e.g., if they can only contain letters and digits), but those are special cases. Since variables can store information other than filenames, many Bourne shell programmers get into the habit of adding double-quotes around all variables anyway unless they want a special effect, and that effectively resolves the issue. But as shown above, that’s not the only issue; it can be difficult to handle all filenames correctly in the Bourne shell even when you use double-quotes correctly.

Filename problems tend to happen in any language; they are not specific to any particular language. For example, if a filename begins with “-”, and another command is invoked with that filename as its parameter, that command will see an option flag... no matter what computer languages are being used. Similarly, it’s more awkward to pass lists of filenames between programs in different languages when newlines can be part of the filename. Practically every language gracefully handles line-at-a-time processing; it’d be nice to be able to easily use that with filenames.

The problem of awkward filenames is so bad that there are programs like detox and Glindra that try to fix “bad” filenames. The POSIX standard includes pathchk; this lets you determine that a filename is bad. But the real problem is that bad filenames were allowed in the first place and aren’t prevented or escaped by the system — cleaning them up later is a second-best approach.

Programs assume bad filenames won’t happen

Lots of programs presume “bad” filenames can’t happen, and fail to handle them. For example, many programs fail to handle filenames with newlines in them, because it’s harder to write programs that handle such filenames correctly. In several cases, developers have specifically stated that there’s no point in supporting such filenames.

There are a few programs that do try to handle all cases. According to user proski, “One of the reasons git replaced many shell scripts with C code was support for weird file names. C is better at handling them. In absence of such issues, many commands would have remained shell scripts, which are easier to improve”. But such exceptions prove the rule — many developers would not be willing to re-write working programs, in a different language, just to handle bad filenames.

Failure to handle “bad” filenames can lead to mysterious failures and even security problems... but only if they can happen at all. If “bad” filenames can’t occur, the problems they cause go away too!

Standards permit the exclusion of bad filenames

The POSIX standard defines what a “portable filename” is; this definition implies that many filenames are not portable and thus do not need to be supported by POSIX systems. For all the details, see the Austin Common Standards Revision Group web page. To oversimplify, the POSIX.1-2008 specification is simultaneously released as both The Open Group’s Base Specifications Issue 7 and IEEE Std 1003.1(TM)-2008. I’ll emphasize the Open Group’s version, since it is available at no charge via the Internet (good job!!). Its “base definitions” document section 4.7 (“Filename Portability”) says:

For a filename to be portable across implementations conforming to POSIX.1-2008, it shall consist only of the portable filename character set as defined in Portable Filename Character Set. Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments.

I then examined the Portable Filename Character Set, defined in 3.276 (“Portable Filename Character Set”); this turns out to be just A-Z, a-z, 0-9, <period>, <underscore>, and <hyphen> (aka the dash character). So it’s perfectly okay for a POSIX system to reject a non-portable filename due to it having “odd” characters or a leading hyphen.

In fact, the POSIX.1-2008 spec includes a standard shell program called “pathchk”, which can be used to determine if a proposed pathname (filename) is portable. Its “-p” option writes a diagnostic if the pathname is too long (more than {_POSIX_PATH_MAX} bytes or contains any component longer than {_POSIX_NAME_MAX} bytes), or contains any character that is not in the portable filename character set. Its “-P” option writes a diagnostic if the pathname is empty or contains a component beginning with a hyphen. GNU, and many others, include pathchk. (My thanks to Ralph Corderoy for reminding me of pathchk.) So not only does the POSIX standard note that some filenames aren’t portable... it even specifically includes tools to help identify bad filenames (such as ones that include control characters or have a leading hyphen in a component).
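The portable-filename rule is easy to state in code. This Python3 sketch (my own helper, approximating the checks pathchk -P applies to a single filename component) accepts only the portable character set and rejects a leading hyphen:

```python
import re

# Portable Filename Character Set (POSIX.1-2008, 3.276): A-Z a-z 0-9 . _ -
# plus the 4.7 rule that a portable filename must not begin with a hyphen.
_PORTABLE = re.compile(r'\A[A-Za-z0-9._][A-Za-z0-9._-]*\Z')

def is_portable_filename(name):
    """Return True if `name` is a portable POSIX filename component."""
    return bool(_PORTABLE.match(name))
```

Names containing spaces, control characters, or a leading dash all fail this check, as does the empty string.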

Operating Systems already forbid bad filenames in certain cases

Indeed, existing POSIX systems already reject some filenames. A common reason is that many POSIX systems mount local or remote filesystems that have additional rules, e.g., for Microsoft Windows. Wikipedia’s entry on Filenames reports on these rules in more detail. For example, the Microsoft Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F) in filenames, so any such filenames can’t be shared with Windows users, and they’re not supposed to be stored on their filesystems. I wrote some code and found that the Linux msdos module (which supports one of the Windows filesystems) already rejects some “bad” filenames, returning the EINVAL error instead.

The Plan 9 operating system was developed by many Unix luminaries; its filenames can only contain printable characters (that is, any character outside hexadecimal 00-1F and 80-9F) and cannot include either slash or blank (per intro(5)). Tom Duff, explaining why Plan 9 restricts filenames, noted that filenames with spaces are a pain for many reasons; in particular, they mess up scripts. Duff said, “When I was working on the plan 9 shell, I did a survey of all the file names on all the unix machines that I could conveniently look at, and discovered, unsurprisingly, that characters other than letters, digits, underscore, minus, plus and dot were so little used that forbidding them would not impact any important use of the system. Obviously people stick to those characters to avoid colliding with the shell’s syntax characters. I suggested (or at least considered) formalizing the restriction, specifically to make file names easier to find by programs like awk. Probably rob took the more liberal road of forbidding del, space and controls, the first because it is particularly hard to type, and the rest because, as Russ noted, they confound the usual line- and field-breaking rules.”

So some application developers already assume that filenames aren’t “unreasonable”, the existing standard (POSIX) already permits operating systems to reject certain kinds of filenames, and existing POSIX and POSIX-like systems already reject certain filenames in some circumstances. In that case, what kinds of limitations could we add to filenames that would help users and software developers?

First: Why the heck are the ASCII control characters (byte values 1 through 31, as well as 127) permitted in filenames? The point of filenames is to create human-readable names for collections of information, but since these characters aren’t readable, the whole point of having filenames is lost. There’s no advantage to keeping these as legal characters, and the problems are legion: they can’t be reasonably displayed, many are troublesome to enter (especially in GUIs!), and they cause nothing but nasty side-effects. They also cause portability problems, since filesystems for Microsoft Windows can’t contain bytes 1 through 31 anyway.

One of the nastiest permitted control characters is the newline character. Many programs work a line-at-a-time, with a filename as the content or part of the content; this is great, except it fails when a newline can be in the filename. Many programs simply ignore the problem, and presume that there are no newlines in filenames. But this creates a subtle bug, possibly even a vulnerability — it’d be better to make the no-newline assumption true in the first place! I know of no program that legitimately requires the ability to insert newlines in a filename. Indeed, it’s not hard to find comments like “ban newlines in filenames”. GNU’s “find” and “xargs” make it possible to work around this by inserting byte 0 between each filename... but few other programs support this convention (even “ls” normally doesn’t, and most shells cannot do word-splitting on \0). Using byte 0 as the separator is a pain to use anyway; who wants to read the intermediate output of this? Even if the only character that is forbidden is newline, that would still help. For example, if newlines can’t happen in filenames, you can use a standard (POSIX) feature of xargs (which disables various quoting problems of xargs by escaping each character with a backslash) (lwn forgot the -E option, which I have added):

find . -type f | sed -e 's/./\\&/g' | xargs -E "" somecommand

The “tab” character is another control character that makes no sense; if tabs are never in filenames, then it’s a great character to use as a “column separator” for multi-column data output — especially since many programs already use this convention. But the tab character isn’t safe to use (easily) if it can be part of a filename.

Some control characters, particularly the escape (ESC) character, can cause all sorts of display problems, including security problems. Terminals (like xterm, gnome-terminal, the Linux console, etc.) implement control sequences. Most software developers don’t understand that merely displaying filenames can cause security problems if they can contain control characters. The GNU ls program tries to protect users from this effect by default (see the -N option), but many people display filenames without getting filtered by ls — and the problem returns. H. D. Moore’s “Terminal Emulator Security Issues” (2003) summarizes some of the security issues; modern terminal emulators try to disable the most dangerous ones, but they can still cause trouble. A filename with embedded control characters can (when displayed) cause function keys to be renamed, set X atoms, change displays in misleading ways, and so on. To counter this, some programs (such as find and ls) modify control characters before displaying them — making it even harder to correctly handle files with such names.
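A defensive filter is straightforward; the sketch below (my own example, not what ls itself does internally) replaces every ASCII control byte with “?” before display, using only POSIX tr:

```shell
# An attacker-controlled "filename" embedding an ESC sequence and a BEL:
name=$(printf 'evil\033]0;pwned\007name')

# Replace the control bytes 1-31 and 127 with '?' before displaying:
safe=$(printf '%s' "$name" | LC_ALL=C tr '\001-\037\177' '?')

printf '%s\n' "$safe"    # prints: evil?]0;pwned?name
```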

In any case, filenames with control characters aren’t portable. POSIX.1-2008 doesn’t include control characters in the “portable filename character set”, implying that such filenames aren’t portable per the POSIX standard. Wikipedia’s entry on Filenames notes that the Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F), so any such filenames can’t be shared with Windows users, and they’re not supposed to be stored on their filesystems.

A few people noted that they used the filesystem as a keystore, and found it handy to use filenames as arbitrary-value keys. That’s fine, but filesystems already impose naming limitations; you can’t use \0 in them, and you can’t use ‘/’ as a key value in the same way, even on a traditional Unix filesystem. And as noted above, many filesystems impose more restrictions anyway. So even people who use the filesystem as a keystore, with arbitrary key values, must do some kind of encoding of filenames. Since you have to encode anyway, you can use an encoding that is easier to work with and less likely to cause subtle problems... like one that forbids control characters. Many programs, like git, use the filesystem as a keystore yet do not require control characters in filenames.
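For instance, a keystore layer could hex-encode arbitrary key bytes into safe filenames. This is a hypothetical scheme of my own (any reversible encoding would do), using only POSIX od and tr:

```shell
# A key containing a slash and a newline -- unusable directly as a filename:
key=$(printf 'user/alice\nrole')

# Hex-encode it; the result is pure [0-9a-f], legal on any filesystem:
fname=$(printf '%s' "$key" | od -An -tx1 | tr -d ' \n')
```

The original key can be recovered by decoding the hex pairs, so nothing is lost by the restriction.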

In contrast, if control characters are forbidden when created and/or escaped when returned, you can safely use control characters like TAB and NEWLINE as filename separators, and the security risks of displaying unfiltered control characters in filenames goes away. As noted above, software developers make these assumptions anyway; it’d be great if it was safe to do so.

The “leading dash” (aka leading hyphen) problem is an ancient problem in Unix/Linux/POSIX. This is another example of the general problem that there’s interaction between overly-flexible filenames with other system components (particularly option flags and shell scripts).

The Unix-haters handbook page 27 (PDF page 67) notes problems these decisions cause: “By convention, programs accept their options as their first argument, usually preceded by a dash... Finally, Unix filenames can contain most characters, including nonprinting ones. This is flaw #3. These architectural choices interact badly. The shell lists files alphabetically when expanding “*” [and] the dash (-) comes first in the lexicographic caste system. Therefore, filenames that begin with a dash (-) appear first when “*” is used. These filenames become options to the invoked program, yielding unpredictable, surprising, and dangerous behavior... [e.g., “rm *” will expand filenames beginning with dash, and use those as options to rm]... We’ve known several people who have made a typo while renaming a file that resulted in a filename that began with a dash: “% mv file1 -file2” Now just try to name it back... Doesn’t it seem a little crazy that a filename beginning with a hyphen, especially when that dash is the result of a wildcard match, is treated as an option list?” Indeed, people repeatedly ask how to ignore leading dashes in filenames — yes, you can prepend “./”, but why do you need to know this at all?

Similarly, in 1991 Larry Wall (of perl fame) stated: “Just don’t create a file called -rf. :-)” in a discussion about the difficulties in handling filenames well.

The list of problems that “leading dash filenames” creates is seemingly endless. You can’t safely run “cat *”, because there might be a file with a leading dash; if there’s a file named “-n”, then suddenly all the output is numbered if you use GNU cat. Not all programs support the “--” convention, so you can’t simply say “precede all command lists with --”, and in any case, people forget to do this in real life. Even the POSIX folks, who are experts, make mistakes due to leading dashes; bug 192 identifies a case where examples in POSIX failed to operate correctly when filenames begin with dash.

You could prefix the name or glob with “./”, e.g., “ cat ./* ”. Prefixing the filename is a good solution, but people often don’t know or forget to do this. The result: many programs break (or are vulnerable) when filenames have components beginning with dash. Users of “find” get this prefixing essentially for free, but then they get troubled by newlines, tabs, and spaces in filenames (as discussed elsewhere).
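A minimal sketch of the difference:

```shell
dir=$(mktemp -d)
printf 'hello\n' > "$dir/-n"   # a file literally named "-n"

cd "$dir"
# Unsafe: "cat *" would expand to "cat -n", which cat parses as an option.
# Safe: "./*" expands to "./-n", which cannot be mistaken for an option.
out=$(cat ./*)                 # out is "hello"
cd - >/dev/null
rm -rf "$dir"
```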

POSIX.1-2008’s “base definitions” document section 4.7 (“Filename Portability”) specifically says “Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments”. So filenames with leading hyphens are already specifically identified as non-portable in the POSIX standard.

There’s no reason that a filesystem must permit filenames to begin with a dash. If such filenames were forbidden, then writing safe shell scripts would be much simpler — if a parameter begins with a “-”, then it’s an option and there is no other possibility.
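Argument classification would then collapse to a single pattern test; a sketch:

```shell
# With leading-dash filenames forbidden, this classification is reliable;
# today, a file named "-rf" would be misclassified as an option.
classify() {
  case "$1" in
    -*) echo "option" ;;
    *)  echo "filename" ;;
  esac
}

classify -rf         # prints: option
classify notes.txt   # prints: filename
```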

If the filesystem must include filenames with leading dashes, one alternative would be to modify underlying tools and libraries so that whenever globbing or directory scanning is done, “./” is prepended to any filename beginning with “-”. This would be done by glob(3), scandir(3), readdir(3), and shells that implement globbing themselves. Then, “cat *” would become “cat ./-n” if “-n” was in the directory. This would be a silent change that would quietly cause bad code to work correctly. There are reasons to be wary of these kinds of hacks, but if these kinds of filenames must exist, it would at least reduce their trouble. I will say more about solutions later in this paper. Since POSIX says that filename components with leading dashes (hyphens) are not portable, you can say that this is all part of a special non-portable extension... and thus meets the POSIX specification.

With today’s march towards globalization, computers must support the sharing of information using many different languages. Given that, it’s crazy that there’s no standard encoding for filenames across all Unix/Linux/POSIX systems. At the beginnings of Unix, everyone assumed that filenames could only be English text, but that hasn’t been true for a long time. Yet because you can’t know the character encoding of a given filename, in theory you can’t display filenames at all today. Why? Because you don’t know how to translate the bytes of a filename into displayable characters (!). This is true for GUIs, and even for the command line. Yet you must be able to display filenames, so you need to make some determination... and sometimes it will be wrong.

The traditional POSIX approach is to use environment variables that declare the filename character encoding (such as LC_ALL, LC_CTYPE, LC_COLLATE, and LANG). But as soon as you start working with other people (say, by receiving a tarball or sharing a filesystem), the single-environment-variable approach fails, because it assumes that the entire filesystem uses the same encoding (as specified in the environment variable); once there’s file sharing, different parts of the filesystem can use different encoding systems. Should you interpret the bytes in a filename as ISO-8859-1? One of the other ISO-8859-* encodings? KOI8-* (for Cyrillic)? EUC-JP or Shift-JIS (both popular in Japan)? In short, this is too flexible! Since people routinely share information around the world, this incompatibility is awful. The Austin Group even had a discussion about this in 2009. This failure to standardize the encoding leads to confusion, which can lead to mistakes and even vulnerabilities.

Yet this flexibility is actually not flexible enough, because the current filesystem requirements don’t permit arbitrary encodings. If you want to store arbitrary international text, you need to use Unicode/ISO-10646. But two of the common encodings of Unicode/ISO-10646 (UTF-16 and UTF-32) must be able to store byte 0; since you can’t use byte 0 in a filename, they don’t work at all. The filesystem is also not flexible in another way: there’s no mechanism to find out what encoding is used on a given filesystem. If one person uses ISO-8859-1 for a given filename, there’s no obvious way for anyone else to find out what encoding they used. In theory, you could store the encoding system with the filename, and then use multiple system calls to find out what encoding was used for each name... but really, who needs that kind of complexity?!?

If you want to store arbitrary language characters in filenames using today’s Unix/Linux/POSIX filesystem, the only widely-used answer that “simply works” for all languages is UTF-8. Wikipedia’s UTF-8 entry and Markus Kuhn’s UTF-8 and Unicode FAQ have more information about UTF-8. UTF-8 was developed by Unix luminaries Ken Thompson and Rob Pike, specifically to support arbitrary language characters on Unix-like systems, and it’s widely acknowledged to have a great design.

When filenames are sent to and from the kernel using UTF-8, then all languages are supported, and there are no encoding interoperability problems. Any other approach would require nonstandard additions like storing some sort of “character encoding” value with the filesystem, which would then require user programs to examine and use this encoding value. And they won’t. Users and software developers don’t need more complexity — they want less. If people simply agreed that “all filenames will be sent in/out of the kernel in UTF-8 format”, then all programs would work correctly. In particular, programs could simply retrieve a filename and print it, knowing that the filename is in UTF-8. (Other encodings like UTF-7 and punycode do exist, but these are designed for cases where you can’t have byte values above 127, which is not true for Unix/Linux/POSIX filesystems; that is why people do not use them for filesystems.) Plan 9 already did this, and showed that you could do this on a POSIX-like system. The IETF specifically mandates that all protocol text must support UTF-8, while all other encodings are optional.

Another advantage of UTF-8 filenames is that they are very robust. The chance of a random 4-byte sequence being valid UTF-8, yet not pure ASCII, is only 0.026% — and the chance drops even further as more bytes are added. Thus, systems that use UTF-8 filenames will almost certainly detect when someone tries to import non-ASCII filenames that use the “wrong” encoding — eliminating filename mojibake.

UTF-8 is already supported by practically everything. Some filesystems store filenames in other formats, but at least on Linux, all of them have mount options to translate in/out of UTF-8 for userspace. In fact, some filesystems require a specific encoding on-disk for filenames, but to do this correctly, the kernel has to know which encoding is being used for the data sent in and out (e.g., with iocharset). But not all filesystems can do this conversion, and how do you find out which options are used where?!? Again, the simple answer is “use UTF-8 everywhere”.

There’s also another reason to use UTF-8 in filenames: Normalization. Some symbols have more than one Unicode representation (e.g., a character might be followed by accent 1 then accent 2, or by accent 2 then accent 1). They’d look the same, but they would be considered different when compared byte-for-byte, and there’s more than one normalization system (Programs written for Linux normally use NFC, as recommended by the W3C, but Darwin and MacOS X normally use NFD). If you have a filename in a non-Unicode encoding, then it’s ambiguous how you “should” translate these to Unicode, making simple questions like “is this file already there” tricky. But if you store the name as UTF-8 encoded Unicode, then there’s no trouble; you can just use the filename using whatever normalization convention was used when the file was created (presuming that the on-disk representation also uses some Unicode encoding).
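The ambiguity is visible at the byte level; here “é” encoded as precomposed NFC (U+00E9) versus decomposed NFD (U+0065 plus combining U+0301) yields different UTF-8 byte strings, which a byte-for-byte filename comparison treats as different names:

```shell
nfc=$(printf 'caf\303\251')      # "café" with é as U+00E9 (NFC, 2 bytes)
nfd=$(printf 'cafe\314\201')     # "café" with e + U+0301 (NFD, 3 bytes)

# Both render identically, yet as filenames they would be distinct:
if [ "$nfc" = "$nfd" ]; then echo same; else echo different; fi   # different
```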

To be fair, what I’m proposing here doesn’t solve some other Unicode issues. Many characters in Unicode look identical to each other, and in many cases there’s more than one way to represent a given character. But these problems already exist, and they don’t go away if the status quo continues. If we at least agreed that the userspace filename API was always in UTF-8, we’d have won at least half the battle.

Andrew Tridgell, Samba’s lead developer, has identified yet another reason to use UTF-8 — case handling. Efficiently implementing Windows’ filesystem semantics, where uppercase and lowercase are considered identical, requires that you be able to know what is “uppercase” and what is “lowercase”. This is only practical if you know what the filename encoding is in the first place. (Granted, total upper and lower case handling is in theory locale-specific, but there are ways to address that sensibly that handle the cases people care about... and that’s outside the scope of this article.) Again, a single character encoding system for all filenames, from the application point of view, is almost required to make this efficient.

User “epa” on LWN notes that Python 3 “got tripped up by filenames that are not valid UTF-8”. Python 3 moved to a very clean system where there are “string” types that handle internationalized text and “bytes” that contain arbitrary data. You would think that filenames would be string types, but currently POSIX filenames are really just binary blobs! Python 3’s “what’s new” discusses what they had to do in trying to paper this over, but as epa says, this situation interferes with implementing filenames “as Unicode strings [to] cleanly allow international characters”. Eventually, Python 3.1 implemented the more-complicated PEP 383 proposal, specifically to address the problem that some nominally “character” interfaces (like filenames) don’t actually provide just characters. In PEP 383, on POSIX systems, “Python currently applies the locale’s encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions... To convert non-decodable bytes, a new error handler “surrogateescape” is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables”.

The result is that many applications end up being far more complicated than necessary to deal with the lack of an encoding standard. Python PEP 383 bluntly states that the Unix/Linux/POSIX lack of enforced encoding is a design error: “Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data [and not arbitrary bytes]”. Zooko O’Whielacronx posted some comments on Python PEP 383 relating to the Tahoe project. He commented separately to me that “Tahoe could simplify its design and avoid costly storage of ‘which encoding was allegedly used’ next to *every* filename if we instead required utf-8b for all filenames on Linux.” (Sidebar: Tahoe is an interesting project; Here is Zooko smashing a laptop with an axe as part of his Tahoe presentation.)

Converting existing systems or filesystems to UTF-8 isn’t that painful either. The program “convmv” can do mass conversions of filenames into UTF-8. This program was designed to be “very handy when one wants to switch over from old 8-bit locales to UTF-8 locales”. It’s taken years to get some programs converted to support UTF-8, but nowadays almost all modern POSIX systems support UTF-8.

Again, let’s look at the POSIX.1-2008 spec. Its “Portable Filename Character Set” (defined in 3.276) is only A-Z, a-z, 0-9, <period>, <underscore>, and <hyphen>. Note that this is a very restrictive list; few international speakers would accept this limited list, since it would mean they must only use English filenames. That’s ridiculous; most computer users don’t even know English. So why is this standard so restrictive? That’s because there’s no standard encoding; since you don’t know if a filename is UTF-8 or something else, there’s no way to portably share filenames with non-English characters. If we did agree that UTF-8 encoding is used, the set of portable characters could include all languages. In other words, the lack of a standard creates arbitrary and unreasonable limitations.

Linux distributions are already moving towards storing filenames in UTF-8, for this very reason. Fedora’s packaging guidelines require that “filenames that contain non-ASCII characters must be encoded as UTF-8. Since there’s no way to note which encoding the filename is in, using the same encoding for all filenames is the best way to ensure users can read the filenames properly.” OpenSuSE 9.1 has already switched to using UTF-8 as the default system character set (“lang_LANG.UTF-8”). Ubuntu recommends using UTF-8, saying “A good rule is to choose utf-8 locales”, and provides a UTF-8 migration tool as part of its UTF-8 by default feature.

Filename permissiveness is not just a command-line problem. It’s actually worse for the GUIs, because if filenames can truly be anything, then GUIs have no way to actually display filenames. The major POSIX GUI suites GNOME and KDE have already moved towards UTF-8 as the required filename encoding format:

In a 2003 discussion about GNOME, Michael Meeks noted that “using locale encoded filenames on the disk is a really, really bad idea :-) simply because there is never sufficient information to unwind the encoding (think networking, file sharing, etc.). So — the right way to go is utf-8 everywhere”. He noted that although GNOME has an option G_BROKEN_FILENAMES, it is “only a way to help migration towards that. The issue of course is that the whole Unix world needs fixing to be UTF-8 happy...” KDE has this problem, too. They do their best to deal with it by guessing from the user’s locale, but they also have the option KDE_UTF8_FILENAMES so that UTF-8-everywhere filesystems are easily handled. This note may be of interest too.

The GUI toolkit Qt (the basis of KDE), since Qt 4, has “removed the hacks they had in QString to allow malformed Unicode data in its QString constructor. What this means is that the old trick of just reading a filename from the OS and making a QString out of it is impossible in general since there are filenames which are not valid ASCII, Latin-1, or UTF-8. Qt does provide a way to convert from the ‘local 8-bit’ filename-encoding to and from QString, but this depends on there being one, and only one, defined filename-encoding (unless the application wishes to roll its own conversion). This has effectively caused KDE to mandate users use UTF-8 for filenames if they want them to show up in the file manager, be able to be passed around on DBus interfaces, etc.”

NFSv4 requires that all filenames be exchanged using UTF-8 over the wire. The NFSv4 specification, RFC 3530, says that filenames should be UTF-8 encoded in section 1.4.3: “In a slight departure, file and directory names are encoded with UTF-8 to deal with the basics of internationalization.” The same text is also found in the newer NFS 4.1 RFC (RFC 5661) section 1.7.3. The current Linux NFS client simply passes filenames straight through, without any conversion from the current locale to and from UTF-8. Using non-UTF-8 filenames could be a real problem on a system using a remote NFSv4 system; any NFS server that follows the NFS specification is supposed to reject non-UTF-8 filenames. So if you want to ensure that your files can actually be stored from a Linux client to an NFS server, you must currently use UTF-8 filenames. In other words, although some people think that Linux doesn’t force a particular character encoding on filenames, in practice it already requires UTF-8 encoding for filenames in certain cases.

UTF-8 is a longer-term approach. Systems have to support UTF-8 as well as the many older encodings, giving people time to switch to UTF-8. To use “UTF-8 everywhere”, all tools need to be updated to support UTF-8. Years ago, this was a big problem, but as of 2011 this is essentially a solved problem, and I think the trajectory is very clear for those few trailing systems.

Not all byte sequences are legal UTF-8, and you don’t want to have to figure out how to display them. If the kernel enforces these restrictions, ensuring that only UTF-8 filenames are allowed, then there’s no problem... all the filenames will be legal UTF-8. Markus Kuhn’s utf8_check C function can quickly determine if a sequence is valid UTF-8.
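In shell, the standard iconv utility can do the same job; a sketch, relying on iconv exiting nonzero when it hits an invalid input sequence:

```shell
# Succeeds (exit 0) only if stdin is well-formed UTF-8:
is_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

printf 'caf\303\251' | is_utf8 && echo "valid"     # \303\251 is UTF-8 é
printf 'caf\351'     | is_utf8 || echo "invalid"   # \351 is Latin-1 é
```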

The filesystem should require that filenames meet some standard, not because of some evil need to control people, but simply so that the names can always be displayed correctly at a later time. The lack of standards makes things harder for users, not easier. Yet the filesystem doesn’t force filenames to be UTF-8, so it can easily contain garbage.

We have a good solution that is already in wide use: UTF-8. So let’s use it!

Probably too late for an outright ban on spaces in filenames

It’d be easier and cleaner to write fully-correct shell scripts if filenames couldn’t include any kind of whitespace. There’s no reason anyone needs tab or newline in filenames, as noted above, so that leaves us with the space character.

There are a lot of existing Unix/Linux shell scripts that presume there are no space characters in filenames. Many RPM spec files’ shell scripts make this assumption, for example (this can be enforced in their constrained environment, but not in general). Spaces in filenames are particularly a problem because the default setting of the Bourne shell “IFS” variable (which determines how substitution results are split up) includes space as a delimiter. This means that, by default, invoking “find” via ‘...‘ or $(...) will fail to handle filenames with spaces (they will break single filenames into multiple filenames at the spaces). Any variable use with a space-containing filename will be split or corrupted if the programmer forgets to surround it with double-quotes (unquoted variable uses can also cause trouble if the filename contains newline, tab, “*”, “?”, or “]”, but these are less common than filenames with spaces). Reading filenames using read will also fail (by default) if a filename begins or ends with a space. Many programs, like xargs, also split on spaces by default. The result: Lots of Unix/Linux/POSIX programs don’t work correctly on filenames with spaces.
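The failure is easy to reproduce; a sketch showing default IFS splitting one filename into two words:

```shell
file='My Report.txt'

set -- $file       # unquoted: the default IFS splits at the space
unquoted=$#        # 2 words

set -- "$file"     # quoted: the filename stays intact
quoted=$#          # 1 word
```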

In some dedicated-use systems, you could enforce a “no spaces” rule; this would render some common programming errors harmless, slightly reducing the risk of security vulnerabilities. From a functional viewpoint, other characters like “_” could be used instead of space. As noted above, some operating systems like Plan 9 expressly forbid spaces in filenames, so there is even some precedent for an operating system forbidding spaces in filenames.

Unfortunately, a lot of people do have filenames with embedded spaces (spaces that are not at the beginning or end of a filename), so a “no spaces” rule would be hard to enforce in general. In particular, you essentially cannot handle typical Windows and MacOS filenames without handling filenames with an embedded space, because many filenames from those systems use the space character. So if you exchange files with them (via archives, shared storage, and so on), this is often impractical. Windows’ equivalent of “/usr/bin” is “\Program Files”, and Windows’ historical equivalent of “/home” is “\Documents and Settings”, so you must deal with embedded spaces if you deal directly with Windows’ primary filesystem from a POSIX system. (Windows Vista and later use “\Users” instead of the awful default “\Documents and Settings”, copying the more sensible Unix approach of using short names without spaces, but the problem still remains overall.) (To be fair, Windows has other problems too. Windows internally passes arguments as an unstructured string, making escaping and its complications necessary.)

Banning leading and/or trailing spaces might work

However, there are variations that might be more palatable to many: “no leading spaces” and/or “no trailing spaces”. Such filenames are a lot of trouble, especially filenames with trailing spaces — these often confuse users (especially GUI users).

If leading spaces, trailing spaces, newline, and tab can’t be in filenames, then a Bourne shell construct already in common use actually becomes correct. A “while” loop using read -r file works for filenames if spaces are always between other characters, but by default it subtly fails when filenames have leading or trailing spaces (because space is by default part of the IFS). But if leading spaces, trailing spaces, newline, and tab cannot occur in filenames, the following works all the time with the default value of IFS:

# CORRECT IF filenames can't include leading/trailing space, newline, tab,
# even though IFS is left as its default value
find . -print | while read -r file ; do
  command "$file" ...
done

There are a few arguments that leading spaces should be accepted. barryn informs me that “There is a use for leading spaces: They force files to appear earlier than usual in a lexicographic sort. (For instance, a program might create a menu at run time in lexicographic order based on the contents of a directory, or you may want to force a file to appear near the beginning of a listing.) This is especially common in the Mac world....”. They are even used by some people with Mac OS X.

But it’s hard to argue that trailing spaces are useful. Trailing spaces are worse than leading ones; in many user interfaces, a leading space will at least cause a visible indent, but there’s no indication at all of trailing spaces... leading to rampant confusion. I understand that in Microsoft Windows (or at least some of its key components), the space (and the period) are not allowed as the final character of a filename. So preventing a space as a final character improves portability, and is rather unlikely to be required for interoperability.

If trailing spaces are forbidden, then filenames with only spaces in them become forbidden as well. And that’s a good thing; filenames with only spaces in them are really confusing to users. Years ago my co-workers set up a directory full of filenames with only spaces in them, briefly stumping our Sun representative.

So banning trailing spaces in a component might be a plausible broad rule. It’s not as important as getting rid of newlines in filenames, but it’s worth considering, because it would get rid of some confusion. Banning both leading and trailing spaces is also plausible; doing so would make while read -r correct in Bourne shell scripts.

Interesting alternative: Auto-convert spaces to unbreakable spaces

James K. Lowden proposed an interesting alternative for spaces: “Spaces could be transparently handled (no pun intended) with U+00A0, a non-breaking space, which in fact it is. Really. If the system is presented with a filename containing U+0020, it could just replace it unilaterally with the non-breaking space [Unicode U+00A0, represented in UTF-8 by the hex sequence 0xC2 0xA0]. Permanently, no questions asked.”

This idea is interesting, because by default Bourne shells only break on U+0020, so they would consider the filename as one long unbreakable string. Filenames really aren’t intended to be broken up, so that’s actually a defensible representation. He claims “For most purposes, that will be just fine. GUIs won’t mind. Shells won’t mind; most scripts will be happier.”

He does note that constructs like the following would be affected, since the name actually stored would contain non-breaking spaces rather than the ordinary spaces written in the script:

if [ "$name" = "my nice name" ]

I’m guessing that the conversion would happen when the file is created, so the API would always return filenames with unbreakable spaces. This could cause problems if other systems stored filenames in directories that differed only in their use of non-breaking versus regular spaces, but users would generally think that’s pretty evil in the first place.

I’m not sure how I feel about this one idea, but it’s certainly an interesting approach that’s worth thinking about. One reason I hesitate is that if other things are fixed, the difficulties of handling spaces in filenames diminishes anyway, as I’ll explain next.

One reader of this essay suggested that GUIs should transparently convert spaces to underscores when creating a file, reversing this when displaying a filename. It’s an interesting idea. However, I fear that some evil person will create multiple files in one directory which only differ because one uses spaces and the other uses underscores. That might look okay, but would create opportunity for confusion in the future. Thus, I haven’t recommended this approach.

Having spaces in filenames is no disaster, though, particularly if other problems are fixed.

First, it’s worth noting that many “obvious” shell programs already work correctly, today, even if filenames have spaces and you make no special settings. For example, glob expansions like “ cat ./* ” work correctly, even if some filenames have spaces, because file glob expansion occurs after splitting (more about this in a moment). The POSIX specification specifically requires this, and this is implemented correctly by lots of shells (I’ve checked bash, dash, zsh, ksh, and even busybox’s shell). The find command’s “-exec” option can work with arbitrary filenames (even ones with control characters), though I find that if the exec command gets long, the script starts to get very confusing:

# This is straightforward:
find . -type f -exec somecommand {} \;

# As these get long, I scream (example from "explodingferret"):
find . -type f -exec sh -c 'if true; then somecommand "$1"; fi' -- {} \;

Once newlines and tabs cannot happen in filenames, programs can safely use newlines and tabs as delimiters between filenames. Having safe delimiters makes spaces in filenames much easier to handle. In particular, programs can then safely do what many already do: they can use programs like ‘find’ to create a list of filenames (one per line), and then process the filenames a line at a time.

However, if we stopped here, spaces in filenames would still cause problems for Bourne shell scripts. If you invoke programs like find via command substitution, such as “ for file in `find .` ”, then by default the shell will break up filenames at the spaces — corrupting the results. This is one of the reasons that many shell scripts don’t handle spaces-in-filenames correctly. Yet the “obvious” way to process files is to loop through the results of a command substitution with find! We can make it much easier to write correct shell scripts by using a poorly-documented trick.

Writers of (Bourne-like) shell scripts can use an additional trick to make spaces-in-filenames easier to handle, as long as newlines and tabs can’t be in filenames. The trick: set the “IFS” variable to be just newline and tab.

What is IFS?

IFS (the “input field separator”) is an ancient, very standard, but not well-known capability of Bourne shells. After almost all substitutions, including command substitution ‘...‘ and variable substitution ${...}, the characters in IFS are used to split up any substitution results into multiple values (unless the results are inside double-quotes). Normally, IFS is set to space, tab, and newline — which means that by default, after almost all substitutions, spaces are interpreted as separating the substituted values into different values. This default IFS setting is very bad if file lists are produced through substitutions like command substitution and variable substitution, because filenames with spaces will get split into multiple filenames at the spaces (oops!). And processing filenames is really common.
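Here is a tiny sketch of that splitting behavior (the filename is hypothetical):

```shell
# Hypothetical filename containing a space, held in a variable:
name="my file.txt"
# With the default IFS (space, tab, newline), the unquoted expansion
# is split at the space -- one "filename" becomes two words:
set -- $name
echo "$#"    # 2
```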

Changing the IFS variable to include only newline and tab makes lists of filenames much easier to deal with, because then filenames with spaces are trivially handled. Once you set IFS this way, instead of having to create a “while read...” loop, you can place a ‘...‘ file-listing command in the “usual place” of a file list, and filenames with spaces will then work correctly. And if filenames can’t include tabs and newlines, you can correctly handle all filenames.

A quick clarification, if you’re not familiar with IFS: Even when the space character is removed from IFS, you can still use space in shell scripts as a separator in commands or the ‘in’ part of for loops. IFS only affects the splitting of unquoted values that are substituted by the shell. So you can still do this, even when IFS doesn’t include space:

for name in one two three ; do
  echo "$name"
done

How to change IFS to just newline and tab

I recommend using this portable construct near the beginning of your (Bourne-like) shell scripts:

IFS="`printf '\n\t'`"

If you have a really old system that doesn’t include the POSIX-required printf(1), you could use this instead (my thanks to Ralph Corderoy for pointing out this issue, though I’ve tweaked his solution somewhat):

IFS="`echo nt | tr nt '\012\011'`"
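A quick sanity check (a sketch) that this tr-based construct produces the same two characters as the printf version; note that echo’s trailing newline is harmless here because command substitution removes it:

```shell
# Check that echo|tr yields exactly newline followed by tab:
IFS="`echo nt | tr nt '\012\011'`"
[ "$IFS" = "`printf '\n\t'`" ] && echo "IFS is newline+tab"
```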

It’s quite plausible to imagine that in the future, the standard “prologue” of a shell script would be:

#!/bin/sh
set -eu
IFS="`printf '\n\t'`"

An older version of this paper suggested setting IFS to tab followed by newline. Unfortunately, it can be slightly awkward to set IFS to just tab and newline, in that order, using only standard POSIX shell capabilities. The problem is that when you do command substitution in the shell with ‘...‘ or $(...), trailing newline characters are removed before the result is used (see POSIX shell & utilities, section 2.6.3). Removing trailing newlines is almost always what you want, but not if the last character you wanted is newline. You can also include a newline in a variable by starting a quote and inserting a newline directly, but this is easy to screw up: other white space could be silently inserted there, text-transformation tools might insert \r at the end, and people might “help” by indenting your code and quietly ruining it. There’s also the problem that the POSIX standard’s “echo” is almost featureless, but you can just use “printf” instead. In an older version of this paper I suggested doing:

IFS="`printf '\t\nX'`" ; IFS="${IFS%X}"

However, on LWN.net, Explodingferret pointed out a much better portable approach — just reverse their order. This doesn’t have exactly the same result as my original approach (parameters are now joined by newline instead of tab when they are joined), but I think it’s actually slightly better, and it’s definitely simpler. I thought his actual code was harder to read, so I tweaked it (as shown above) to make it clearer.
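Command substitution’s trailing-newline stripping is easy to see directly (a small sketch):

```shell
# All trailing newlines are removed from a command substitution result,
# so a value set this way can't end in newline:
x="`printf 'abc\n\n'`"
[ "$x" = "abc" ] && echo "trailing newlines stripped"
```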

A slightly more pleasant approach in Bourne-like shells is to use the $'...' extension. This isn’t standard, but it’s widely supported, including by the bash, ksh (korn shell), and zsh shells. In these shells you can just say IFS=$'\n\t' and you’re done. As the korn shell documentation says, the purpose of $'...' is to ‘solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all “...” strings handle ANSI-C escapes, but that would not be backwards compatible.’ It might even be more efficient; some shells might implement ‘printf ...‘ by invoking a separate process, which would have nontrivial overhead (shells can optimize this away, too, since printf is typically a builtin). But this $'...' extension isn’t supported by some Bourne-like shells, including dash (the default /bin/sh in Ubuntu) and the busybox shell, and the portable version isn’t too bad. I’d like to see $'...' added to a future POSIX standard and these other shells, as it’s a widely implemented and useful extension. I think $'...' will be in the next version of the POSIX specification (you can blame me for proposing it).

Writing shell scripts with IFS set to newline and tab

If filenames can’t include newline or tab, and IFS is set to just newline and tab, you can safely do this kind of thing to correctly handle all filenames:

for file in `find . -type f` ; do
  some_command "$file"
  ...
done

This for loop is a better construct for file-at-a-time processing than the while read -r file construct listed earlier. This for loop isn’t in a separate subprocess, so you can set variables inside the loop and have their values persist outside the loop. The for loop has direct, easy access to standard input (the while loop uses standard input for the list of filenames). It’s shorter and easier to understand, and it’s less likely to go wrong (it’s easy to forget the “-r” option to read).
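The variable-persistence difference is easy to demonstrate (a sketch; whether the piped while loop’s changes survive is shell-dependent, but the for loop’s changes always do):

```shell
# A piped `while read` loop often runs in a subshell, so the variable
# change is lost (true in bash and dash; ksh differs):
count=0
printf 'a\nb\n' | while read -r line ; do count=$((count+1)) ; done
echo "count after while: $count"

# A `for` loop runs in the current shell, so the change persists:
count=0
for line in `printf 'a\nb'` ; do count=$((count+1)) ; done
echo "count after for: $count"    # 2
```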

Some people like to build up a sequence of options and filenames in a variable, using the space character as the separator, and then call a program later with all the options and filenames built up. That general approach still works, but if the space character is not in IFS, then you can’t easily use it as a separator. Nor should you — if filenames can contain spaces, then you must not use the space as a separator. The solution is trivial; just use newlines or tabs as the separator instead. The usual shell tricks still apply (for example, if variable x leads with separators, then $x without quotes will cause the variable to get split using IFS and the leading separators will be thrown away). This is easiest to show by example:

# DO NOT DO THIS when the space character is NOT part of IFS:
x="-option1 -option2 filename1"
x="$x filename2"    # Build up $x
run_command $x

# Do this instead:
t=`printf "\t"`     # Newline is tricky to portably set; use tab as separator
x="-option1${t}-option2${t}filename1"
x="$x${t}filename2" # Build up $x.
run_command $x

# Or do this (do NOT give printf a leading dash, that's not portable):
x=`printf "%s\n%s\n%s" "-option1" "-option2" "filename1"`
x=`printf "%s\n%s" "$x" "filename2"` # Build up $x.
run_command $x

Do not use plain “read” in Bourne shells — use “read -r”. This is true regardless of the IFS setting. The problem is that “read”, when it sees a backslash, will merge the line with the next line, unless you undo that with “-r”. Notice that once you remove space from IFS, read stops corrupting filenames with spaces, but you still need to use the -r option with read to correctly handle backslash.
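A small sketch of the difference (the input string here is arbitrary):

```shell
# Plain `read` treats backslash as an escape character and removes it;
# `read -r` reads the line literally:
plain=$(printf 'a\\b\n' | { read line ; printf '%s\n' "$line" ; })
raw=$(printf 'a\\b\n' | { read -r line ; printf '%s\n' "$line" ; })
printf '%s\n' "$plain"   # ab
printf '%s\n' "$raw"     # a\b
```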

Of course, there are times when it’s handy to have IFS set to a different value, including its traditional default value. One solution is straightforward: Set IFS to the value you need, when you need it... that’s what it’s there for. So feel free to do this when appropriate:

#!/bin/sh
set -eu
traditionalIFS="$IFS"
IFS="`printf '\n\t'`"
...
IFS="$traditionalIFS"
# WARNING: You usually want "read -r", not plain "read":
while read -r a b c
do
  echo "a=$a, b=$b, c=$c"
done
IFS="`printf '\n\t'`"

Setting IFS to a value that ends in newline is a little tricky. If you just want to temporarily restore IFS to its default value, just save its original value to use later (as shown above). If you need IFS set to some other value with newline at the end, this kind of sequence does the trick:

IFS="`printf '\t\nX'`"
IFS="${IFS%X}"

Setting IFS to newline and tab is best if programs use newline or tab (not space) as their default data separator. If the data format is under your control, you could change the format to use newline or tab as the separator. It turns out that many programs (like GNU seq) already use these separators anyway, and the POSIX definition of IFS makes this essentially automatic for built-in shell commands (the first character of IFS is used as the separator for variables like $*). Once IFS is reset like this, filenames with spaces become much simpler to handle.
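The “first character of IFS joins "$*"” rule can be seen in a short sketch (the parameter values are hypothetical):

```shell
# "$*" joins the positional parameters using the FIRST character of
# IFS -- here, newline:
IFS="`printf '\n\t'`"
set -- "file one" "file two"
joined="$*"
printf '%s\n' "$joined"
```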

Characters that must be escaped in a shell before they can be used as an ordinary character are termed “shell metacharacters”. If filenames cannot contain some or all shell metacharacters, then some security vulnerabilities due to programming errors would go away.

I doubt all POSIX systems would forbid shell metacharacters, but it’d be nice if administrators could configure specific systems to prevent such filenames on higher-value systems, as sort of a belt-and-suspenders approach to counter errors in important programs. Many systems are dedicated to specific tasks; on such systems, a filename with unusual characters can only occur as part of an attack. To make this possible, software on such systems must not require that filenames have metacharacters, but that’s almost never a problem: Filenames with shell metacharacters are very rare, and these characters aren’t part of the POSIX portable filename character set anyway.

Here I’ll discuss a few options. One option is to just forbid the glob characters (*, ?, and [) — this can eliminate many errors due to forgetting to double-quote a variable reference in the Bourne shell. You could forbid the XML/HTML special characters “<”, “>”, “&”, and “"”, which would eliminate many errors caused by incorrectly escaping filenames. You could forbid the backslash character — this would eliminate a less-common error (forgetting the -r option of Bourne shell read). Finally, you could forbid all or nearly all shell meta-characters, which can eliminate errors due to failing to escape metacharacters where required in many circumstances.

All the Bourne shell programming books tell you that you’re supposed to double-quote all references to variables with filenames, e.g., cat "$file" . Without special filesystem rules, you definitely need to! In fact, correctly-written shell programs must be absolutely infested with double-quotes, since they have to surround almost every variable use. But I find that real people (even smart ones!) make mistakes and sometimes fail to include those quotation marks... leading to nasty bugs.

Although shell programming books don’t note it, you can actually omit the double quotes around variable references containing filenames if (1) IFS contains only newline and tab (not a space, as discussed above), and (2) tab, newline, and the shell globbing metacharacters (namely “*”, “?”, and “[”) can’t be in the filename. (The other shell metacharacters don’t matter, due to the POSIX-specified substitution order of Bourne shells.) This means that cat $file would work correctly in such cases, even if $file contains a space and other shell metacharacters. From a shell programming point of view, it’d be neat if such control and globbing characters could never show up in filenames... then correct shell scripts could be much cleaner (they wouldn’t require all that quoting).
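A minimal sketch of this claim, using a hypothetical name with spaces but no glob characters:

```shell
# With space removed from IFS, and no glob characters in the name,
# even an unquoted expansion leaves the filename intact:
IFS="`printf '\n\t'`"
file="my file with spaces.txt"   # hypothetical filename
set -- $file                     # unquoted on purpose
echo "$#"                        # 1 -- not split at the spaces
```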

I doubt there can be widespread agreement on forbidding all the globbing metacharacters across all Unix-like systems. But if local systems reject or rename such names, then when someone accidentally forgets to quote a variable reference with a filename (it happens all the time), the error cannot actually cause a problem. And that’s a great thing, especially for high-value servers (where you could impose more stringent naming rules). Older versions of this article mistakenly omitted the glob character issues; my thanks to explodingferret for correcting that. Similarly, if you also forbid spaces in filenames, as well as these other characters, then even without changing IFS, scripts which accidentally didn’t double-quote the variables would still work correctly. (Even if glob metacharacters can be in filenames, there are still good reasons to remove the space character from IFS, as noted in the section on spaces in filenames.)

So, by forbidding a few more characters — at least locally on high-value systems — you eliminate a whole class of programming errors that sometimes become security vulnerabilities. You will still need to put double-quotes around variables that contain values other than filenames, so this doesn’t eliminate the general need to surround variables with double-quotes in Bourne-like shells. But by forbidding certain characters in filenames, you decrease the likelihood that a common programming error can turn into an attack; in some cases that’s worth it.

Forbid XML/HTML special characters: <, >, &, "

You could forbid the XML/HTML special characters “<”, “>”, “&”, and “"”, which would eliminate many errors caused by incorrectly escaping filenames for XML/HTML.

This would also get rid of some nasty side-effects for shell and Perl programs. The < and > symbols redirect file writes, for both shell and Perl. This can be especially nasty for Perl, where filenames that begin with < or > can cause side-effects when open()ed — see “man perlopentut” for more information. Indeed, if you use Perl, see “man perlopentut” for other gotchas when opening files in Perl.

Forbid backslash character

You could forbid the backslash character. This would eliminate one error — forgetting the -r option of Bourne shell read .

Of course, you could go further and forbid all (or nearly all) shell metacharacters.

Sometimes it’s useful to write out programs and run them later. For example, shell programs can be flattened into single long strings. Although filenames are supposed to be escaped if they have unusual characters, it’s not at all unusual for a program to fail to escape something correctly. If filenames never had characters that needed to be escaped, there’d be one less operation that could fail.

A useful starting-point list of shell metacharacters is “ *?:[]"<>|(){}&'!\;$ ” (this is Glindra’s “safe” list with ampersand, single-quote, bang, backslash, semicolon, and dollar-sign added). The colon causes trouble with Windows and MacOS systems, and although opening such a filename isn’t a problem on most Unix/Linux systems, the colon causes problems because it’s a directory separator in many directory or file lists (including PATH, bash CDPATH, gcc COMPILER_PATH, and gcc LIBRARY_PATH), and it has a special meaning in a URL/URI. Note that < and > and & and " are on the list; this eliminates many HTML/XML problems! I’d need to go through a complete analysis of all characters for a final list; for security, you want to identify everything that is permissible, and disallow everything else, but its manifestation can be either way as long as you’ve considered all possible cases.

In fact, for portability’s sake, you already don’t want to create filenames with weird characters either. MacOS and Windows XP also forbid certain characters/names. Some MacOS filesystems and interfaces forbid “:” in a name (it’s the directory separator). Microsoft Windows’ Explorer interface won’t let you begin filenames with a space or dot, and Windows also restricts these characters:

\ / : * ? " < > |

In the end, you're safer if filenames are limited to the characters that are never misused. In a system where security is at a premium, I can see configuring it to only permit filenames with characters in the set A-Za-z0-9_-, with the additional rule that it must not begin with a dash. These display everywhere, are unambiguous, and this limitation cuts off many attack avenues.

For more info, see Wikipedia’s entry on Filenames. Windows’ NTFS rules are actually complicated, according to Wikipedia:

Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F) and characters " * : < > ? \ / | . Although NTFS allows each path component (directory or filename) to be 255 characters long and paths up to about 32767 characters long, the Windows kernel only supports paths up to 259 characters long. Additionally, Windows forbids the use of the MS-DOS device names AUX, CLOCK$, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, CON, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, NUL and PRN, as well as these names with any extension (for example, AUX.txt), except when using Long UNC paths (ex. \\.\C:\nul.txt or \\?\D:\aux\con). (In fact, CLOCK$ may be used if an extension is provided.) These restrictions only apply to Windows — Linux, for example, allows use of " * : < > ? \ | even in NTFS. [The source also included “/” in the last list, but Wheeler believes that is incorrect and has removed it.]

Microsoft Windows also makes some terrible mistakes with its filesystem naming; the section on Windows filename problems briefly discusses this.

Beware other assumptions about filenames

Beware of other assumptions about filenames. In particular, filenames that appear different may be considered the same by the operating system, particularly on Mac OS X, Windows, and remote filesystems (e.g., via NFS).

The git developers fixed a critical vulnerability in late 2014 (CVE-2014-9390) due to filenames. GitHub has an interesting post about it. Mercurial had the same problem (they notified the git developers about it). In particular, filenames that appear different are considered the same:

On many systems (including Mac OS X, Windows, and many remote filesystems accessible via NFS), upper and lower case are not considered distinct. So ".Git" or ".GIT" are considered the same as ".git". In addition, the Mac OS X HFS+ file system considers certain Unicode codepoints as ignorable. For example, committing .g\u200cit/config and then pulling it on HFS+ would overwrite .git/config, because U+200C is one of those ignorable codepoints. In general, Mac OS X normalizes filenames, and in many circumstances considers "different" filenames the same.

Thus, filtering based on filenames is tricky and potentially dangerous. This is in addition to the Windows-specific filenames (e.g., NUL) as discussed above.

Microsoft Windows has a whole host of other nasty tricks involving filenames. Normally periods and spaces at the end of a filename are silently stripped, e.g., "hello .. " is the same filename as "hello". You can also add various other selectors, e.g., "file1::$DATA" is the same as "file1", but the stripping does not happen so "file1...::$DATA" is not the same as "file1". Short 8+3 filenames can refer to longer names. There are other issues too, but this is not primarily an essay about Windows filenames; I just thought it important to note.

There are lots of tricks we can use in Bourne-like shells to work correctly, or at least not fail catastrophically, with nasty filenames. We’ve already noted a key approach: Set IFS early in a script to prevent breaking up filenames-with-spaces in the wrong place:

IFS="`printf '\n\t'`"

The problem has been around for a long time, and I can’t possibly catalog all the techniques. Indeed, that’s the problem; we need too many techniques.

I guess I should mention a few other techniques for either handling arbitrary filenames, or filtering out “bad” filenames. I think they’ll show why people often don’t do it “correctly” in the first place. In Bourne shell, you must double-quote variable references for many other kinds of variables anyway, so let’s look beyond that. I will focus on using shell globbing and “find”, since those are where filenames often come from, and the ways of doing it aren’t always obvious. This BashFAQ answer gives some suggestions; indeed, there’s a lot of stuff out there on how to work around these misfeatures.

Globbing

Shell globbing is great when you just want to look at a list of files in a specific directory and ignore its “hidden” files (files beginning with “.”), particularly if you just want ones with a specific extension. Globbing doesn’t let you easily recurse down a tree of files, though; for that, use “find” (below). Problem is, globs happily return filenames that begin with a dash.

When globbing, make sure that your globs cannot return anything beginning with “-”; for example, prefix globs with “./” if they start in the current directory. This eliminates the “leading dash” problem in a simple and clean way. Of course, this only works on POSIX; if you can get Windows filenames of the form C:\Users, you’ll need to consider drive: as well. When you glob using this pattern, you will quietly hide any leading dashes, skip hidden files (as expected), and you can use any filename (even with control characters and other junk):

for file in ./*.jpg ; do
  ...
  command "$file"
done
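A tiny check of the “./” prefix at work, using a scratch directory and hypothetical names (assumes mktemp(1) is available):

```shell
# A file named "-n" globbed as ./* comes back as "./-n", which no
# command will mistake for an option:
dir=`mktemp -d` && cd "$dir"
touch -- '-n' 'normal.txt'
set -- ./*
printf '%s\n' "$@"    # ./-n and ./normal.txt
```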

Making globbing safe for all filenames is actually not complicated — just prefix them with “./”. Problem is, nobody knows (or remembers) to prefix globs with “./”, leading to widespread problems with filenames starting with “-”. If we can’t even get people to do that simple prefixing task, then expecting them to do complicated things with “find” is silly.

Bash has an extension that can limit filenames, GLOBIGNORE, though setting it to completely deal with all these cases (while still being usable) is very tricky. Here’s a GLOBIGNORE pattern so that globs will ignore filenames with control characters or leading dashes, as well as traditional hidden files (names beginning with “.”), yet accept reasonable patterns (including those beginning with “./” and “../” and even multiple “../”):

GLOBIGNORE=`printf '.[!/.]*:..[!/]*:*/.[!/.]*:*/..[!/]*:*[\001-\037\177]*:-*'`

By the way, a special thanks to Eric Wald for this complicated GLOBIGNORE pattern, which resolves the GLOBIGNORE problems I mentioned in earlier versions of this article. With this pattern, if you remember to always prefix globs with “./” or similar (as you should), then you’ll safely get filenames that begin with dash (because they will appear as “./-NAME”). But when you forget to correctly prefix globs (and you will), then leading-dash filenames will be skipped (which isn’t ideal, but it’s generally far safer than silently changing command options). Yes, this GLOBIGNORE pattern is hideously complicated, but that’s my point: Safely traversing filenames is difficult, and it should be easy.

Globbing can’t express UTF-8, so you can’t filter out non-UTF-8 filenames with globbing. Again, you probably need a separate program to filter out those filenames.

Find

How can we use find correctly? Thankfully, “find” always prefixes filenames with its first parameter, so as long as the first parameter doesn’t begin with a dash (it’s often “.”), we don’t have the “leading dash” problem. (If you’re starting from a directory that begins with “-” inside your current directory, you can always prefix its name with “./”).
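A quick sketch of why this works, using a scratch directory and a hypothetical dash-named file (assumes mktemp(1) is available):

```shell
# find prefixes each result with its first argument, so even a
# dash-named file comes out leading-dash-safe:
dir=`mktemp -d` && cd "$dir"
touch -- '-dashfile'
result=`find . -type f`
printf '%s\n' "$result"    # ./-dashfile
```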

It’s worth noting that if you want to handle fully-arbitrary filenames, use “find . ... -exec” when you can; that’s 100% portable, and can handle arbitrarily-awkward filenames. The more-recent POSIX addition to find of -exec ... {} + can help too. So where you can, do this kind of thing:

# This is correct and portable; painful if "command" gets long:
find . ... -exec command {} \;

# This is correct and standard; some systems don't implement this:
find . ... -exec command {} +

When you can’t do that, using find ... -print0 | xargs -0 is the common suggestion; that works, but it requires non-standard extensions (though they are common), the resulting program can get really clumsy if what you want to do with each file isn’t simple, and the results don’t easily feed into shell command substitutions if you plan to pass in \0-separated results.

If you don’t mind using bash extensions, here’s one of the better ways to directly implement a shell loop that takes “find”-created filenames. In short, you use a while loop with ‘read’ and have read delimit only on the \0 (the IFS= setting is needed or filenames containing leading/trailing IFS characters will get corrupted; the -d '' option switches to \0 as the separator, and the -r option disables backslash processing). Here’s a way that at least works in simple cases:

# This handles all filenames, but uses bash-specific extensions:
find . -print0 |
while IFS="" read -r -d "" file ; do
  ...  # Use "$file" not $file everywhere.
done

# This handles all filenames, but uses bash-specific extensions:
while IFS="" read -r -d "" file ; do
  ...  # Use "$file" not $file everywhere.
  # You can set variables, and they'll stay set.
done < <(find . -print0)

We can now loop through all the filenames, and retain any variable values we set, but this construct is hideously ugly and non-portable. Also, this approach means we can’t read the original standard input, which in many programs would be a problem. You can work around that by using other file descriptors, but that causes even more complications, leading to hideous results. Is there any wonder nobody actually does this correctly?!?

Notice that you can’t portably use this construct in “for” loops or as a command substitution, due to limitations in current shells (you can’t portably say “split input on \0”).

Oh, and while carefully using the find command can process filenames with embedded control characters (like newline and escape), what happens afterwards can be “interesting”. In GNU find, if you use -print (directly or implicitly) to a teletype, it will silently change the filenames to prevent some attacks and problems. But once piped, there’s no way to distinguish between filenames-with-newlines and newlines-between-filenames (without additional options like the nonstandard -print0). And those later commands must be careful; merely printing a filename via those later commands is dangerous (since it may have terminal escape codes) and can go badly wrong (because the filename encoding need not match the environment variable settings).

Can you use the ‘find’ command in a portable way so it will filter out bad filenames, and have a simpler life from there on? Yes! If you have to write secure programs on systems with potentially bad filenames, this may be the way to go — by filtering out the bad filenames, you at least prevent your program from getting affected by them. Here’s the simplest portable (POSIX-compliant) approach I’ve found which filters out filenames with embedded ASCII control characters (including newline and tab); that way, newlines can separate filenames, displaying filenames is less dangerous (though we still have character encoding issues), and the results are easy to use in a command substitution (including a Bourne shell “for” loop) and with line-processing filters:

# This is correct and portable; it skips filenames with control chars:
IFS="`printf '\n\t'`" # Remove spaces so spaces-in-filenames still work
controlchars=`printf '*[\001-\037\177]*'`
for file in `find . ! -name "$controlchars"` ; do
  command "$file" ...
done

Unfortunately, UTF-8 can’t really be expressed with traditional globs, because globs can’t express a repetition of particular patterns. The standard find only supports globs, so it can’t do UTF-8 matching by itself. In the long term, I hope “find” grows a simple option to determine if a filename is UTF-8. Full regular expressions are able to represent UTF-8, thankfully. So in the short term, if you want to only accept filenames that are UTF-8, you’ll need to filter the filename list through a regex (rejecting names that fail to meet UTF-8 requirements). (GNU find has “-regex” as an extension, which could do this, but obviously that wouldn’t port to other implementations of find.) Or you could write a small C program that filters them out (along with other bad patterns).
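One hedged shell-level possibility (an assumption, not a recommendation from the text above): round-trip a candidate name through iconv(1), which rejects malformed input; this assumes your iconv supports the UTF-8 charset name:

```shell
# Round-trip through iconv; malformed UTF-8 makes iconv exit nonzero.
is_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
is_utf8 "plain-name.txt" && echo "valid"
is_utf8 "`printf 'bad\377name'`" || echo "invalid"   # 0xFF is never valid UTF-8
```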

Of course, if filenames are clean (at least, can’t have control characters), this becomes far simpler, and that’s the point of this article:

IFS="`printf '\n\t'`" # Remove spaces so spaces-in-filenames will work
...
# This is correct if filenames can't have control characters:
for file in `find .` ; do
  ...
done

# This will fail if scaled to very large lists, but it is correct for
# smaller lists if filenames can't have control characters:
cat `find . -type f`

Why do I need to keep using tricks?

Why do I need to add odd coding mechanisms that say “don’t send me garbage”, and constantly work around the garbage other programs copy to me? There are many conventions out there to try to deal with garbage, but it’s just too easy to write programs that fail to do so. Shouldn’t the system keep out the garbage in the first place?!?

Yes, I need to filter inputs provided by untrusted programs. Fine. But the operating system kernel shouldn’t be one of the untrusted programs I must protect myself against (grin).

Using the techniques discussed above, you can count how many filenames include control characters 1-31 or 127 in the entire system’s filesystem:

badfile=`printf '*[\\001-\\037\\177]*'`
find / -name "$badfile" -exec echo 1 \; | wc -l

For most systems, the answer is “0”. Which means this capability to store weird filenames isn’t really necessary. This “capability” costs a lot of development time, and causes many bugs; yet in return we get no real benefit.

So does limiting filenames, even in small ways, actually make things better? Yes! Let me focus on eliminating control characters (at least newline and tab), probably the worst offenders, and how things like a better IFS setting can improve things in a very public historical complaint about Unix.

The Unix-haters handbook page 167 (PDF page 205) begins Jamie Zawinski’s multi-page description of his frustrated 1992 effort to simply “find all .el files in a directory tree that didn’t have a corresponding .elc file. That should be easy.” After much agony (described over multiple pages), he found that the “perversity of the task had pulled me in, preying on my morbid fascination”. He ended up writing this horror, which is both horribly complicated and still doesn’t correctly handle all filenames:

find . -name '*.el' -print \
| sed 's/^/FOO=/' | \
  sed 's/$/; if [ ! -f \
      ${FOO}c ]; then \
      echo \
      $FOO ; fi/' | sh

Zawinski’s script fails when filenames have spaces, tabs, or newlines. In fact, just about any shell metacharacter in a filename will cause catastrophic effects, because they will be executed (unescaped!) by another shell.

Paul Dunne’s review of the “Unix Hater’s Handbook” (here and here) proposes a solution, but his solution is both wrong and complicated. Dunne’s solution is wrong because it only examines the directories that are the immediate children of the current directory; it fails to examine the current directory and it fails to examine deeper directories. Whups! In addition, his solution is quite complicated; he uses a loop inside another loop to do it, and has to show it in steps (presumably because it’s too complicated to show at once). Dunne’s solution also fails to handle filenames with spaces in them, and it even fails if there are empty directories. Dunne does note those last two weaknesses, to be fair. Dunne doesn’t even show the full, actual code; he only shows a code outline, and you have to fill in the pieces before it would actually run. (If it’s so complicated that you can only show an outline, it’s too complicated.) This is all part of the problem — if it’s too hard to write good examples of easy tasks that do the job correctly, then the system is making it too hard to do the job correctly!

Here’s my alternative; this one is simple, clear, and actually correct:

# This is correct if filenames can't include control characters:
IFS="`printf '\n\t'`"
for file in `find . -name '*.el'` ; do
  if [ ! -f "${file}c" ] ; then
    echo "$file"
  fi
done

This approach (above) just sets IFS to the value it should normally have anyway, followed by a single bog-standard loop over the result of “find”. This alternative is much simpler and clearer than either of the other solutions; it actually handles the entire tree as Zawinski wanted (unlike Dunne’s), and it handles spaces in filenames correctly (as neither of the others do). It also handles empty directories, which Dunne’s doesn’t, and it handles metacharacters in filenames, which Zawinski’s doesn’t. It works on all filenames (including those with spaces), presuming that filenames can’t contain control characters. The find loop presumes that filenames cannot include newline or tab; the later “echo” that prints the filename presumes that the filename cannot contain control characters (since if it did, echoing those control characters might cause a security vulnerability). If we also required that filenames be UTF-8, then we could be certain that the displayed characters would be sensible instead of mojibake. This particular program works even when file components begin with “-”, because “find” prefixes the filenames with “./”, but preventing such filenames is still a good idea for many other programs (the call to echo could fail, and possibly be dangerous, if the filename had been acquired via a glob like *). A variation could use “set -f”, but this one does not need it.

My approach also avoids piping its results to another shell to run, something that Zawinski’s approach does. There’s nothing wrong with having a shell run a program generated by another program (it’s a powerful technique), but if you use this technique, small errors can have catastrophic effects (in Zawinski’s example, a filename with metacharacters could cause disaster). So it’s best to use the “run generated code” approach only when necessary; this is a trivial problem, and such powerful grenade-like techniques should not be necessary for it. Most importantly, it’s easy to generalize this approach to arbitrary file processing.
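For completeness, here is a sketch of the “set -f” variation mentioned above (the filenames used for the demonstration are hypothetical). Disabling pathname expansion protects filenames that happen to contain glob characters such as “*” or “?”, which would otherwise be subject to expansion in the unquoted command substitution; the other assumptions (no newlines or tabs in filenames) stay the same:

```shell
#!/bin/sh
# Sketch of the "set -f" variation: same IFS-based loop, but with
# pathname expansion disabled so glob characters in filenames are safe.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch 'a.el' 'b.el' 'b.elc' 'glob*.el'   # hypothetical test files

IFS="`printf '\n\t'`"
set -f   # disable globbing for the unquoted `find` substitution below
for file in `find . -name '*.el'` ; do
  if [ ! -f "${file}c" ] ; then
    printf '%s\n' "$file"   # prints a.el and glob*.el (no .elc), not b.el
  fi
done
```

Using printf instead of echo also sidesteps the implementation-defined handling of leading dashes and backslashes in echo arguments.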

Adding small limits to filenames makes it much easier to create completely-correct programs.

That’s my point: Adding small limits to filenames makes it much easier to create completely-correct programs. Especially since most software developers act as if these limitations were already being enforced.

Peter Moulder sent me a shorter solution for this particular problem (he accidentally omitted -print, which I added):

# Works on all filenames, but requires a non-standard extension, and there
# are security problems with some versions of find when printing filenames:
find . -name '*.el' \! -exec test -e '{}c' \; -print

However, Moulder’s solution uses an implementation-defined (non-standard) extension; as noted by the Single UNIX specification version 3 section on find, “If a utility_name or argument string contains the two characters “{}”, but not just the two characters “{}”, it is implementation-defined whether find replaces those two characters or uses the string without change”. My thanks to Davide Brini who pointed out that this is implementation-defined, and also suggested this standard-conforming solution instead:

# This is correct for all filename