It’s often said that Ruby is “Perl done right”: that it combines the terseness and text processing power of Perl with inspiration from Smalltalk, Lisp, and CLU, and in doing so creates a language that’s “the best of all possible worlds”.

Regardless of the merit of this idea, it’s certainly true that — when it comes to text processing — anything you can do in Perl you can do in Ruby, thanks mainly to the fact that Ruby steals wholesale many of the best text processing ideas Perl has. And yet lots of Ruby developers aren’t aware of the power Ruby can give you when it comes to writing throwaway one-liners in the shell.

All users of Unix-like operating systems will find themselves in this position eventually: you have to process some output from a process or some files, and you’re just reaching the point where standard tools like grep, head, cut, tr, wc and their brethren are beginning to show their limitations.

You could learn awk. Or you could reach for a powerful tool that you already have in your box: Ruby!

The -e switch

We all know, I’m sure, that you can invoke Ruby from the command line by passing it the filename of a script to run:

$ ruby foo.rb

But did you know you can also pass code as an argument and have Ruby interpret it? Just use the -e flag when invoking Ruby:

$ ruby -e 'puts "Hello world"'
Hello world

Nifty, perhaps. But we can get much niftier.

The -n switch

The -n switch acts as though the code you pass to Ruby was wrapped in the following:

while gets
  # code here
end

In short, this means that the code you pass in the -e argument is executed once for each line in your input. So, imagining that you had a file called file.txt, with the following content:

foo
bar
baz

Then invoking Ruby like so:

$ ruby -ne 'puts $_' file.txt

Will output:

foo
bar
baz

Congratulations! You’ve just implemented cat in Ruby.

But what’s this $_ ?

Throughout these examples, you’ll perhaps have noticed the use of the special global variable $_. When you invoke Ruby this way, it sets $_ to the line currently being processed; so if you wanted to print only the lines that start with “f”, that would be very easy:

$ ruby -ne 'puts $_ if $_ =~ /^f/' file.txt
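To make the implicit loop concrete, here is a rough sketch of what that filter amounts to as an ordinary Ruby script. The method name lines_starting_with_f is made up for illustration, a StringIO stands in for file.txt so the snippet runs on its own, and a named local variable takes the place of $_.

```ruby
require "stringio"

# Hypothetical helper: collects the lines that
# `ruby -ne 'puts $_ if $_ =~ /^f/'` would print.
# `io` is anything that responds to #gets (a File, $stdin, a StringIO).
def lines_starting_with_f(io)
  matches = []
  # gets returns nil at end of input, which ends the loop, just as with -n.
  while (line = io.gets)
    matches << line if line =~ /^f/
  end
  matches
end

# A StringIO stands in for file.txt here.
sample = StringIO.new("foo\nbar\nbaz\n")
puts lines_starting_with_f(sample)
```

With the three-line sample this prints only foo, which is what the one-liner does with the same input.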

Working with standard input

Of course, like cat, this doesn’t work only with files; you can also pipe the output of another process, and use its output as your input.

To use a slightly contrived example, we might want to find the process ID of any instances of top that are running on our system.

We can get a list of all running processes with ps ax. It outputs an enormous amount, but each line is formatted as follows:

49175 s010 Ss 0:00.18 login -fp rob

We have the process ID in the first column, and the process name on the right; so all we need to do is print the first column if the line contains top. Easy:

$ ps ax | ruby -ne 'puts $_.split.first if $_ =~ /top/'
46222

If you wanted to get rid of all the matching processes, you could then pipe that into something like xargs kill, since kill takes its process IDs as arguments and does not read them from standard input. Handy!

(If you’d like to find out more about how you’re able to use the same code to work with both files and standard input, without changing anything, then you can read up on ARGF in Ruby.)
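As a rough sketch of how ARGF fits in, the script below does the same column extraction with an explicit input object. The helper name first_columns and the sample ps-style lines are invented for illustration; in a real script you would pass ARGF itself, as the comment shows.

```ruby
require "stringio"

# Hypothetical helper: returns the first whitespace-separated column of
# every line in `io` that matches `pattern`.
def first_columns(io, pattern)
  io.each_line.select { |line| line =~ pattern }.map { |line| line.split.first }
end

# In a real script you would pass ARGF, which reads from the files named on
# the command line, or from standard input when no files are given:
#
#   puts first_columns(ARGF, /top/)
#
# Here a StringIO with made-up ps-style lines stands in for it.
sample = StringIO.new(<<~PS)
  49175 s010  Ss     0:00.18 login -fp rob
  46222 s011  S+     0:01.02 top
PS
puts first_columns(sample, /top/)
```

Because the helper only needs each_line, the same code works whether its input comes from a file, a pipe, or an in-memory string.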

The -p switch

These solutions are pretty concise already. But what if you feel as though all the puts statements are a bit unnecessary? Well, Ruby has you covered.

The -p switch acts similarly to -n, in that it loops over each of the lines in the input. However, it goes a bit further: after your code has finished, it always prints the value of $_. So, you can imagine it as:

while gets
  # code here
  puts $_
end

It’s really useful, then, for doing transformations on the input. If you wanted to take every line you were given, but replace every instance of the letter e you found with the letter a, you could do:

$ echo "eats, shoots, and leaves" | ruby -pe '$_.gsub!("e", "a")'
aats, shoots, and laavas

Here, we modify the value of $_ , and this modified value is what’s printed to the screen.
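Written out as a plain script, the same transformation might look roughly like the sketch below. The helper name replace_e_with_a is hypothetical, and StringIO objects stand in for standard input and output so that the snippet is self-contained.

```ruby
require "stringio"

# Hypothetical helper mirroring `ruby -pe '$_.gsub!("e", "a")'`: run the
# transformation on each line, then write the (possibly modified) line out.
def replace_e_with_a(input, output)
  while (line = input.gets)
    # The code passed with -e. gsub! modifies the string in place; its
    # return value (nil when nothing changed) is deliberately ignored.
    line.gsub!("e", "a")
    # The implicit output step that -p adds after your code has run.
    output.print(line)
  end
end

out = StringIO.new
replace_e_with_a(StringIO.new("eats, shoots, and leaves\n"), out)
puts out.string
```

The in-place gsub! matters here: with -p it is the value of $_ after your code runs that gets printed, so a non-mutating gsub whose result is thrown away would leave the output unchanged.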

Using BEGIN blocks

Of course, our code here runs in a loop; what if we wanted to run something just once, before our loop starts? We might want to initialise a variable, for example.

In Ruby, we can use BEGIN blocks to do this. They’re an idiom borrowed from awk, and allow us to execute code just once, at the start of the program.

So, to output line numbers from your input, you could do:

$ echo "foo
bar
baz" | ruby -ne 'BEGIN { i = 1 }; puts "#{i} #{$_}"; i += 1'

Here, we initialise i to 1 at the start of the script. The BEGIN block executes only once, so it is ignored on subsequent iterations of the loop; we can then increment i on each line, producing the following output:

1 foo
2 bar
3 baz
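For comparison, here is a sketch of the same line numbering as an ordinary script, where the work of the BEGIN block becomes a plain assignment before the loop. The helper name number_lines is made up, and a StringIO stands in for the piped input.

```ruby
require "stringio"

# Hypothetical helper: what the BEGIN one-liner does, written as a normal loop.
def number_lines(io)
  i = 1            # the BEGIN block: runs once, before any input is read
  numbered = []
  while (line = io.gets)
    numbered << "#{i} #{line}"   # the per-line code
    i += 1
  end
  numbered
end

puts number_lines(StringIO.new("foo\nbar\nbaz\n"))
```

In the one-liner there is nowhere to put code before the implicit loop, which is the gap that BEGIN fills.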

Wrapping up

Of course, all of these examples are fairly contrived; I haven’t done anything that wouldn’t already be possible with tools like grep, pgrep, tr, and so on.

But in reality you have access to the whole world of not just the Ruby standard library but every Ruby Gem too. Just think of the power in Ruby’s String class alone: gsub, scan, ljust and rjust, squeeze. Think of Digest; think of all of the power of Regexp; Ruby’s date and time processing; CSV, Net::HTTP, and Zlib. The possibilities are endless.
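As one small illustration of the String methods mentioned above, the sketch below tidies up a ragged, space-separated line. The helper name tidy and its default column width are assumptions made for the example, not anything from a library.

```ruby
# Hypothetical helper: collapse runs of spaces with squeeze, then pad the
# first column to a fixed width with ljust.
def tidy(line, width = 8)
  first, rest = line.squeeze(" ").strip.split(" ", 2)
  "#{first.to_s.ljust(width)}#{rest}"
end

puts tidy("49175   s010    Ss")
```

The body of such a helper is short enough to paste straight into a -ne one-liner once you know which methods you need.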

It’s worth getting used to the idea that Ruby can be as much a part of your standard pipeline toolchain as any of the usual Unix tools: it suddenly opens up a world of possibilities to do complex processing in a terse and expressive way. Go try it!
