Cat Revisited cat

In a previous article, we had a brief look at the cat command and how it can be used to view or create files. Here, we'll look at it in a text-processing context.

Displaying non-printing characters

Every regular text file contains a few "hidden" characters that are not displayed directly. These non-printing characters are interpreted as formatting instead. For example, the tab character is a single character of its own, and editors render it as horizontal space when the file is opened.

We can display non-printing characters with the following options:

-e Display a $ at the end of each line.
-t Display tab characters as ^I.
-v Display control characters.

Additionally, with GNU cat we can show all non-printing characters at once with the -A shortcut, which is equivalent to -vET.

$ cat sample.xml

# Regular format
<?xml version="1.0" encoding="UTF-8"?>
<note>
	<to>Tove</to>
	<from>Jani</from>
	<heading>Reminder</heading>
	<body>Please do not forget me this weekend!</body>
</note>

$ cat -vet sample.xml

# Display hidden characters
<?xml version="1.0" encoding="UTF-8"?>$
<note>$
^I<to>Tove</to>$
^I<from>Jani</from>$
^I<heading>Reminder</heading>$
^I<body>Please do not forget me this weekend!</body>$
</note>$

As you can see, this command outputs a ^I for each tab and a $ at the end of each line.

unix2dos, dos2unix

So why do non-printing characters matter? In the UNIX world, each line ends with a single line feed. In DOS format, however, each line ends with a carriage return followed by a line feed. This difference is a portability issue that can cause many bugs. To convert from one format to the other, use the unix2dos or dos2unix command.
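To make the difference visible, here is a minimal sketch that builds a small DOS-style file and inspects it with cat -vet. The file names are hypothetical, and tr -d '\r' is used only as a stand-in for dos2unix in case that tool is not installed.

```shell
# Create a scratch directory and a file with DOS (CRLF) line endings.
tmp=$(mktemp -d)
printf 'first\r\nsecond\r\n' > "$tmp/dos.txt"

# cat -vet shows each carriage return as ^M just before the $ marker.
dos_view=$(cat -vet "$tmp/dos.txt")
printf '%s\n' "$dos_view"

# Strip the carriage returns (roughly what dos2unix does for plain text).
tr -d '\r' < "$tmp/dos.txt" > "$tmp/unix.txt"
unix_view=$(cat -vet "$tmp/unix.txt")
printf '%s\n' "$unix_view"

rm -rf "$tmp"
```

If a script or data file behaves strangely after being edited on another platform, checking for ^M this way is a quick first step.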

Viewing line numbers

To output line numbers, use the cat command with the -n option.

$ cat -n sample.xml

1 <?xml version="1.0" encoding="UTF-8"?>
2 <note>
3 	<to>Tove</to>
4 	<from>Jani</from>
5 	<heading>Reminder</heading>
6 	<body>Please do not forget me this weekend!</body>
7 </note>

Suppressing blank lines

If you have a file with multiple blank lines, you can compress the viewing mode so that multiple blank lines appear only once. Use the -s option.

$ cat exampleWithBlankLines.txt

Dear user, Thank you for using me, the Linux Command Line instead of a boring old GUI! This really means a lot to me. I will make sure to stay efficient and easy to use - just as long as you promise to practice on my every day and night. For today's practice, let's see if you can get rid of all this unnecessary white space? Simply compress it with the -s option on the cat command! Best, CLI

$ cat -s exampleWithBlankLines.txt

# Suppress extra lines Dear user, Thank you for using me, the Linux Command Line instead of a boring old GUI! This really means a lot to me. I will make sure to stay efficient and easy to use - just as long as you promise to practice on my every day and night. For today's practice, let's see if you can get rid of all this unnecessary white space? Simply compress it with the -s option on the cat command! Best, CLI

Sorting sort

The sort command can be used to sort contents from standard input or a file. You've probably seen and used it before in the context of pipelines.

There are tons of different ways you can specify sorts, and we'll go over the important ones.

Terminology

The input to sort should be a stream of records separated by a newline character. Each characteristic (or column) is known as a field, and fields are separated by some user-specified character (the delimiter). This is most often a tab, comma, or semicolon.

Default sort settings

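By default, the sort command compares entire lines and orders them alphabetically (according to the current locale), so the output is effectively ordered by the first field.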

$ cat employees.txt
Caelestinus, Joon Directory
Photios, Roland CEO
Eliseus, Meindert Secretary
Pino, Derryl Assistant
Gemini, Klaos Assistant

$ sort employees.txt # Sort first field by alphabetical order
Caelestinus, Joon Directory
Eliseus, Meindert Secretary
Gemini, Klaos Assistant
Photios, Roland CEO
Pino, Derryl Assistant

Sorting multiple files at once

You can sort multiple files at once and have the output be a sorted file.

$ sort names1.txt names2.txt names3.txt # (names1.txt, names2.txt, names3.txt are unsorted files) Avi Bobby Brian Cat Derrick Duke Irvin JB Jen Jim Jizelle John Ragmar Shawn Telly

Merging two sorted files

You can use the -m option to merge presorted input files. The sort command is able to easily perform this because its implementation is based on merge sort.

$ cat sortedMales.txt Blowers, Nigel Gilbertson, Collin Imhoff, Parker Peru, Colton Shisler, Odell Twine, Leonardo $ cat sortedFemales.txt Aurea, Levin Deshazo, Taryn Duong, Arminda Moorhead, Alyse Murrieta, Caroll Yen, Beata $ sort -m sortedMales.txt sortedFemales.txt Aurea, Levin Blowers, Nigel Deshazo, Taryn Duong, Arminda Gilbertson, Collin Imhoff, Parker Moorhead, Alyse Murrieta, Caroll Peru, Colton Shisler, Odell Twine, Leona

Specifying a delimiter

To specify the delimiter, we can use the -t option, followed by the delimiter wrapped in single quotes.

Specifying the field

Furthermore, we can specify the field number with the -k option, followed by the field (column) index. With just a single integer (e.g. -k2), the sort key will begin at field 2 and extend to the end of the line. However, if we use -k2,2 it will only sort based on the second field.
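As a small sketch of -t and -k working together, the following sorts a comma-separated file first by its second field and then numerically by its third. The file name and records are made up for illustration.

```shell
tmp=$(mktemp -d)
# Hypothetical records: last name, first name, age.
printf '%s\n' 'Pino,Derryl,41' 'Gemini,Klaos,29' 'Eliseus,Meindert,35' > "$tmp/staff.csv"

# Sort on the second field only (first name).
by_first=$(sort -t ',' -k2,2 "$tmp/staff.csv")

# Sort on the third field only, comparing numerically.
by_age=$(sort -t ',' -k3,3n "$tmp/staff.csv")

printf '%s\n\n%s\n' "$by_first" "$by_age"
rm -rf "$tmp"
```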

Sorting by sub-fields

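If there is a group of characters within a field that you'd like to sort on, you can address it by adding a dot and a character position to the field number. For example, say you have a date field in the 3rd column formatted as MM-DD-YYYY. It would make sense to order by year first, then month, then day, right? Here, 3.7 means "start at the 7th character of field 3", and the n suffix compares that key numerically.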

$ sort -t ',' -k 3.7n -k 3.1n -k 3.4n dates.txt Tel,Aziz,12-31-1989 Ping,Sarah,09-29-1990 Het,Holm,01-01-1992 Hum,Horry,04-23-1995 Ith,Rebecca,06-12-2001
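A slightly stricter variant, sketched below, also gives each key an explicit end position so that the year, month, and day keys cover exactly their own characters. The sample records are hypothetical.

```shell
# Hypothetical records with an MM-DD-YYYY date in the third field.
dates=$(printf '%s\n' \
  'Ith,Rebecca,06-12-2001' \
  'Ann,Lee,03-15-1992' \
  'Tel,Aziz,12-31-1989' \
  'Het,Holm,01-01-1992')

# Year is characters 7-10, month is 1-2, day is 4-5 of field 3.
by_date=$(printf '%s\n' "$dates" |
  sort -t ',' -k 3.7,3.10n -k 3.1,3.2n -k 3.4,3.5n)

printf '%s\n' "$by_date"
```

The two 1992 entries show the tie-breaking at work: they share a year, so the month key decides their order.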

Ignoring blank space

To ignore any leading blank spaces, use the -b option.

$ sort -t',' -k2,2 -b sortedMales.txt Moorhead, Alyse Duong, Arminda Murrieta, Caroll Gilbertson, Collin Peru, Colton Twine, Leonardo Aurea, Levin Blowers, Nigel Shisler, Odell Imhoff, Parker Deshazo, Taryn

Now our file is sorted by the second field (first name).

Sorting numbers

If you try to sort a file full of numbers, the sorted output may not be what you expect, because sort compares lines character by character and "10" therefore comes before "2". For example, try sorting a list of even numbers from 1 to 10.

$ sort oneToTen.txt 10 2 4 6 8

To get the correct results, we must pass in the -n option, which sorts by numeric value.

$ sort -n oneToTen.txt 2 4 6 8 10

More options

There are plenty more options you can check through the man page. Here are the most frequently used ones.

-b Ignore leading whitespace.
-c Just check that the input is correctly sorted. The exit code will be nonzero if not.
-d Use dictionary order, considering only whitespace and alphanumeric characters.
-f Case-insensitive sort. f is for "folding" each lowercase letter to its corresponding uppercase letter.
-g General numeric order. Compare as floating-point numbers.
-k Define the sort key field. -k 2 starts the key at the second field (aka second column).
-i Ignore non-printable characters.
-o Specify the output file. Default is standard out.
-m Merge already sorted input files.
-n Compare fields by numeric value.
-r Reverse the sorting order.
-R Random sort (not truly random).
-t Specify the character to use as the separator of fields instead of whitespace. -t ';' would separate fields with a semicolon.
-u Keep only the first of each set of records with an equal key. All other repeated records are discarded.
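A few of these options combine naturally. The sketch below, using a hypothetical nums.txt, checks whether a file is sorted with -c and then writes a reversed, de-duplicated numeric sort to a file with -o.

```shell
tmp=$(mktemp -d)
printf '%s\n' 3 1 2 3 > "$tmp/nums.txt"

# -c only checks the order; the exit status is nonzero for unsorted input.
if sort -n -c "$tmp/nums.txt" 2>/dev/null; then
  sorted=yes
else
  sorted=no
fi

# -u drops repeated records, -r reverses, -o names the output file.
sort -n -u -r -o "$tmp/out.txt" "$tmp/nums.txt"
result=$(cat "$tmp/out.txt")

printf 'sorted: %s\n%s\n' "$sorted" "$result"
rm -rf "$tmp"
```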

After sorting, we can use uniq to find characteristics of our file! Let's learn about that next.

Finding Unique or Duplicate elements uniq

The uniq command takes in a sorted file and filters adjacent duplicate lines. By default, it prints one copy of each line; with the proper options, it can be used to omit or report unique or repeated lines.

Use sort first! The uniq command only compares adjacent lines, so on an unsorted list it will report unexpected results. Make sure you sort first!

$ sort test.txt | uniq

Default output

The example file here simply lists the names of a few people. If we sort then call uniq, it will return all names in the file, just once.

$ cat names.txt Bob Chase Jon Clara Bob Theresa Jon Billy Jonathan Clara Jonathan Bob Jonny $ sort names.txt | uniq Billy Bob Chase Clara Jon Jonathan Jonny Theresa

Outputting repeated lines

To output only the lines that are repeated (one copy of each), we pass in the -d option.

$ sort names.txt | uniq -d Bob Clara Jon Jonathan

Outputting unique lines

To output lines that occur just once, pass in the -u option.

$ sort names.txt | uniq -u
Billy
Chase
Jonny
Theresa

With count

With the -c option, we can find how many times each line occurs.

$ sort names.txt | uniq -c 1 Billy 3 Bob 1 Chase 2 Clara 2 Jon 2 Jonathan 1 Jonny 1 Theresa

More options

Here is a list of the most-used options. Check out the man page for more.

-c, --count Precede each output line with the number of times it occurs.
-d, --repeated Output only repeated lines, one copy of each.
-f, --skip-fields=n Skip the first n fields of each line when comparing.
-i, --ignore-case Ignore case during the line comparisons.
-s, --skip-chars=n Skip the leading n characters of each line when comparing.
-u, --unique Opposite of -d: output only lines that are not repeated. (With neither -d nor -u, uniq prints one copy of every line.)
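A common use of these options is a frequency table. The following sketch, built on a made-up list of names, first shows that uniq does nothing useful on unsorted input and then ranks the names by how often they occur.

```shell
# Hypothetical input with non-adjacent duplicates.
names=$(printf '%s\n' Bob Jon Bob Clara Jon Bob)

# Without sorting, no duplicates are adjacent, so all six lines survive.
unsorted_count=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# Sort, count, rank by count (highest first), and trim uniq's left padding.
ranked=$(printf '%s\n' "$names" | sort | uniq -c | sort -rn | sed 's/^ *//')

printf 'lines after uniq alone: %s\n%s\n' "$unsorted_count" "$ranked"
```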

Cutting, Pasting and Joining cut, paste, join

Let's look at how we can use the cut , paste and join operations to edit and format text files.

Cut

Much like the "cutting" most of you are familiar with, the cut command in UNIX takes a section from each line of a file and outputs it to standard out. However, it does not delete any part of the file it extracts text from. cut can also accept multiple files as arguments, and it reads standard input when no file is given.

With the options listed, there are several ways you can specify a cut .

-b Select only the bytes specified. May be a single byte, a set, or a range of bytes, separated by commas.
-c Select only the characters at the specified positions in each line.
-f Extract a set of specified fields.
-d Used with the -f option. Use a specified delimiter rather than the default tab.

You may only use one of the -b, -c or -f options. Each of these options comes with a list that is made up of an integer, a range of integers, or multiple integer ranges separated by commas. A list is defined as follows:

n The n th byte, character or field. Count starts at 1.
n- From the n th byte, character or field forward.
n-m From the n th to the m th byte, character or field (inclusive).
-m From the first to the m th byte, character or field.

$ cat test.txt doh re me fa so 1 2 3 4 5 $ cut -f 3-5 test.txt me fa so 3 4 5 $ cut -f 1,3-4 test.txt doh me fa 1 3 4
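The example above relies on the default tab delimiter. As a sketch of the -d and -c options, the following pulls fields and characters out of a colon-separated record; the record itself is invented for illustration.

```shell
# Hypothetical passwd-style record.
line='jon:x:1001:1001:Jon Doe:/home/jon:/bin/sh'

# Fields 1 and 7, using ':' as the delimiter.
user_shell=$(printf '%s\n' "$line" | cut -d ':' -f 1,7)

# The first three characters of the line.
first3=$(printf '%s\n' "$line" | cut -c 1-3)

printf '%s\n%s\n' "$user_shell" "$first3"
```

Note that cut joins the selected fields with the same delimiter it split on.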

Paste

The paste command is used to merge lines of files together. With this command, you can add one or more columns (or fields) of text to a file. There are two options you should be aware of:

-d Specify the delimiter to be used instead of tabs.
-s Append in serial instead of parallel. (Horizontal pasting instead of vertical.)

$ cat names.txt Billy Bob Chase Jon Jonathan $ cat birthdates.txt 09/21/1992 08/12/1982 05/24/1999 04/23/1974 08/09/2001 $ paste -d ',' names.txt birthdates.txt Billy,09/21/1992 Bob,08/12/1982 Chase,05/24/1999 Jon,04/23/1974 Jonathan,08/09/2001

If the files have an unequal number of lines (5 lines in 5names.txt and only 3 in 3birthdates.txt), the bottom two names will not be matched with anything.

$ paste -d ',' 5names.txt 3birthdates.txt Billy,05/24/1999 Bob,04/23/1974 Chase,08/09/2001 Jon, Jonathan,
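The -s option has no example above, so here is a minimal sketch. It also shows a related idiom, passing - twice so that paste reads alternating lines from standard input. The input values are illustrative.

```shell
# Serial mode: join all lines of one input into a single line.
serial=$(printf '%s\n' Billy Bob Chase | paste -s -d ',' -)

# Two '-' arguments: consecutive input lines become columns of one row.
pairs=$(printf '%s\n' Billy 09/21/1992 Bob 08/12/1982 | paste -d ',' - -)

printf '%s\n%s\n' "$serial" "$pairs"
```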

Join

If you're familiar with relational databases (and don't worry if you're not), the join command should sound very familiar. In short, join takes a common column between two tables, and joins them together based on that attribute.

-t Specify a delimiter.
-1 n Use the n th column of the first file as the join key.
-2 n Use the n th column of the second file as the join key.
-a n Also print the unpairable lines from file n, where n is 1 or 2 (first or second file).

Basic joining

Let's try a join operation as an example.

$ cat birthdates.txt 05/24/1999,4 04/23/1974,2 08/09/2001,5 11/24/1991,3 01/23/1975,1 $ cat names.txt Billy,1 Bob,2 Chase,3 Jon,4 Jonathan,5

First, we must have both lists sorted according to the column we want to join on. names.txt is already sorted, but birthdates.txt is not. Refer to the sorting page to learn how the sort operation works.

$ sort -t ',' -k 2 birthdates.txt > sortedBirthdates.txt $ join -t ',' -1 2 -2 2 names.txt sortedBirthdates.txt 1,Billy,01/23/1975 2,Bob,04/23/1974 3,Chase,11/24/1991 4,Jon,05/24/1999 5,Jonathan,08/09/2001

Right/Left outer join

In some cases, you'll want to join two tables even though some rows have no corresponding row in the other table. Keeping every row of the left (first) table, matched or not, is called a left outer join, and keeping every row of the right (second) table is called a right outer join.

A left outer join would have the option -a1 and a right outer join would have option -a2 .

Full outer join

In a full outer join , both table rows are included, even if they don't have a corresponding row. Expect many null cell values when using this option.

Use both -a1 and -a2 together for a full outer join.
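The three kinds of join can be compared side by side. This is a sketch with two small, already sorted files whose names and contents are hypothetical; both use the first column as the key, which is join's default.

```shell
tmp=$(mktemp -d)
# Key 1 exists only in names.csv, key 4 only in dates.csv.
printf '%s\n' '1,Billy' '2,Bob' '3,Chase' > "$tmp/names.csv"
printf '%s\n' '2,08/12/1982' '3,05/24/1999' '4,04/23/1974' > "$tmp/dates.csv"

# Inner join: only keys present in both files.
inner=$(join -t ',' "$tmp/names.csv" "$tmp/dates.csv")

# Left outer join: also keep unpairable lines from the first file.
left=$(join -t ',' -a 1 "$tmp/names.csv" "$tmp/dates.csv")

# Full outer join: keep unpairable lines from both files.
full=$(join -t ',' -a 1 -a 2 "$tmp/names.csv" "$tmp/dates.csv")

printf '%s\n\n%s\n\n%s\n' "$inner" "$left" "$full"
rm -rf "$tmp"
```

Unpaired rows come out with fewer fields than paired ones, which is where the "null cells" of an outer join show up in plain text.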

Tabs to Spaces & Spaces to Tabs expand, unexpand

There may be times when you have finished writing code but need to change tabs to spaces so that a file displays consistently across different computers. The expand and unexpand commands convert tabs to spaces and spaces to tabs, respectively.

Expand filter

In a pipeline, expand converts tabs to spaces. Using specific options, you can specify a number of parameters.

-i Convert only the tabs at the start of a line; skip the tab conversions after non-blanks.
-t Set the tab stops to be the given number of columns apart. By default, it's set to 8.

$ expand -t 4 sample.xml # Convert every tab into four spaces <?xml version="1.0" encoding="UTF-8"?> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Please do not forget me this weekend!</body> </note>

Unexpand filter

The unexpand filter does the opposite - it converts spaces to tabs.

-a Convert all blanks.
--first-only Convert only leading sequences.
-t Set tabs to be a number of spaces apart. Default is 8.

In this example, assume that each level of indentation in sample.xml is four spaces; with -t 2, each level therefore becomes two tabs.

$ unexpand -t 2 sample.xml # Convert every two spaces to a tab <?xml version="1.0" encoding="UTF-8"?> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Please do not forget me this weekend!</body> </note>
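Because tabs and spaces look identical on screen, a round trip is easier to follow when the whitespace is checked explicitly. The sketch below converts one tab-indented line (a made-up sample) to spaces and back, using the same tab width in both directions.

```shell
# One line indented with a single tab.
indented=$(printf '\t<to>Tove</to>')

# Tab -> four spaces.
spaces=$(printf '%s\n' "$indented" | expand -t 4)

# Four leading spaces -> tab again.
tabs=$(printf '%s\n' "$spaces" | unexpand -t 4)

# cat -vet makes the result visible: ^I marks a tab, $ the line end.
printf '%s\n' "$spaces" | cat -vet
printf '%s\n' "$tabs" | cat -vet
```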

Column Editing and Pretty Printing column, colrm, fold

Pretty printing

The column command is the "pretty-print" of the command line. It formats its input into multiple columns.

-c Format the output for a display that is the given number of character columns wide.
-s Specify the set of characters used to delimit columns for the -t option.
-t Determine the number of columns the input contains and create a table. Columns are delimited by whitespace or by the characters specified with the -s option.
-x Fill columns before filling rows.

Column removal

The colrm command removes a range of columns from its standard input. To process a file, redirect it into the command with <.

$ cat birthdates.txt 05/24/1999,4 04/23/1974,2 08/09/2001,5 11/24/1991,3 01/23/1975,1 $ colrm 3 5 < birthdates.txt # remove from column 3 to 5 05/1999,4 04/1974,2 08/2001,5 11/1991,3 01/1975,1 $ colrm 6 < birthdates.txt # remove from column 6 onward. 05/24 04/23 08/09 11/24 01/23

Files with tabs When working with files containing tabs, be sure to convert to spaces using the expand command first - otherwise, some unexpected behavior may occur.

Max-width with fold

The fold command makes long lines more readable by inserting a line break every n characters.

To set n , use the -w option.

For example, take a look at a Lorem Ipsum text, which has no line breaks.

$ cat lorem.txt Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Curabitur dignissim venenatis pede. Quisque dui dui, ultricies ut, facilisis non, pulvinar non, purus. Duis quis arcu a purus volutpat iaculis. Morbi id dui in diam ornare dictum. Praesent consectetuer vehicula ipsum. Praesent tortor massa, congue et, ornare in, posuere eget, pede.

With the fold command we can break it up into lines with a width value of 45.

$ fold -w 45 lorem.txt Lorem ipsum dolor sit amet, consectetuer adip iscing elit. Curabitur dignissim venenatis pe de. Quisque dui dui, ultricies ut, facilisis non, pulvinar non, purus. Duis quis arcu a pu rus volutpat iaculis. Morbi id dui in diam or nare dictum. Praesent consectetuer vehicula i psum. Praesent tortor massa, congue et, ornar e in, posuere eget, pede.

This way, we are able to set a maximum width for our files for better readability.
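As the output above shows, plain fold cuts words in half. The -s option breaks at the last space that fits instead. Here is a small sketch comparing the two on an illustrative sentence with a width of 18.

```shell
text='The quick brown fox jumped over the lazy dog.'

# Hard wrap: break exactly every 18 characters, even mid-word.
hard=$(printf '%s\n' "$text" | fold -w 18)

# Soft wrap: break after the last blank within each 18 characters.
soft=$(printf '%s\n' "$text" | fold -s -w 18)

printf '%s\n\n%s\n' "$hard" "$soft"
```

With -s, each wrapped line keeps the space it was broken at, so lines may end in a trailing blank.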

Transliterating text tr

The tr command is used as a character-based search and replace function.

Simple Bioinformatics

If you're a bioinformatician, you may need to replace every T in a DNA strand with a U to obtain the corresponding mRNA strand. We can do this in a one-liner with the tr command.

$ echo 'CATCGTAGCTAGTCACTG' | tr T U CAUCGUAGCUAGUCACUG

In biology, each DNA strand has a corresponding strand. G's pair to C's and T's pair to A's. Can we use tr to find the corresponding strand?

$ echo 'CAGTCGTACGTACGT' | tr 'ATGC' 'TACG' GTCAGCATGCATGCA

Lowercase to Uppercase

You can use this function to map lowercase letters to uppercase.

$ echo 'what is up world?' | tr a-z A-Z WHAT IS UP WORLD?

From standard in and to standard out

You may also use redirection to direct an input file and write to some output.

$ cat fox.txt The quick brown fox jumped over the lazy dog. $ tr 'aeiou' 'AEIOU' < fox.txt ThE qUIck brOwn fOx jUmpEd OvEr thE lAzy dOg. $ tr 'aeiou' 'AEIOU' < fox.txt > vowelsCappedFox.txt # Output to vowelsCappedFox.txt

Deleting specific characters

With the -d option, we can delete specific characters from a string.

$ echo 'Removing all vowels' | tr -d 'aeiou' Rmvng ll vwls

Squeezing text

We can eliminate repeated instances of a letter with the squeeze option, -s. This collapses each run of a listed character that occurs in sequence into a single occurrence.

$ echo '1111112222233333344444' | tr -s 123 12344444
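Squeezing is most often applied to whitespace. In the sketch below, which uses invented sample text, runs of spaces are collapsed so that cut can split on a single space, and runs of newlines are collapsed to remove blank lines.

```shell
# Columns padded with a varying number of spaces.
row='Billy    09/21/1992   admin'

# Collapse runs of spaces, then take the second space-separated field.
second=$(printf '%s\n' "$row" | tr -s ' ' | cut -d ' ' -f 2)

# Collapse runs of newlines, which removes empty lines.
compact=$(printf 'a\n\n\nb\n' | tr -s '\n')

printf '%s\n%s\n' "$second" "$compact"
```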

Generating a secret message

We can generate an amateur secret message by mapping each lowercase letter to its mirror image in the alphabet (a becomes z, b becomes y, and so on).

$ cat fox.txt | tr a-z zyxwvutsrqponmlkjihgfedcba Tsv jfrxp yildm ulc qfnkvw levi gsv ozab wlt.

Comparing Text cmp, comm

Comparing bytes with cmp

The cmp command compares files byte by byte and reports the position of the first difference. Use the -b option to also print the bytes that differ.

$ cat file1.txt hello how are you? $ cat file2.txt hello how art thee? $ cmp -b file1.txt file2.txt file1.txt file2.txt differ: byte 13, line 1 is 145 e 164 t

You may skip the first n bytes with the -i option. Or to compare at most n bytes, use the -n option.
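In scripts, cmp is usually used for its exit status rather than its message: 0 means the compared bytes are identical and 1 means they differ. The sketch below assumes GNU cmp for the -n option and recreates the two example lines in hypothetical files.

```shell
tmp=$(mktemp -d)
printf 'hello how are you?\n' > "$tmp/a.txt"
printf 'hello how art thee?\n' > "$tmp/b.txt"

# Whole files: they differ at byte 13.
if cmp "$tmp/a.txt" "$tmp/b.txt" > /dev/null; then
  whole=same
else
  whole=differ
fi

# Only the first 12 bytes ("hello how ar"), which are identical.
if cmp -n 12 "$tmp/a.txt" "$tmp/b.txt" > /dev/null; then
  prefix=same
else
  prefix=differ
fi

printf 'whole: %s, first 12 bytes: %s\n' "$whole" "$prefix"
rm -rf "$tmp"
```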

Comparing lines with comm

The comm command allows you to see the lines common to two files. Simply pass in the names of the two files you'd like to compare. Note that comm expects both files to be sorted.

Let's first take a look at the two files we will use as an example:

$ cat file1.txt

Humphrey: Hey what's up? Jen: I'm fine how are you? Humphrey: I'm find as well. Jen: Great - want to watch a movie some time?

$ cat file2.txt

Humphrey: Top of the mornin' to ya - how art thee? Jen: I'm fine how are you? Humphrey: Thy is fine as one can be. Jen: Great - I'll see you around!

There are three columns of output. The first column is not indented, the second is indented by one tab, and the third by two tabs.

The first column shows the lines that are unique to the first file, while the second shows those unique to the second file. The third column shows the lines that the files have in common.

$ comm file1.txt file2.txt

Humphrey: Hey what's up? Humphrey: Top of the mornin' to ya - how art thee? Jen: I'm fine how are you? Humphrey: I'm find as well. Humphrey: Thy is fine as one can be. Jen: Great - I'll see you around! Jen: Great - want to watch a movie some time?

To suppress any of these columns, use the -n option, where n is the column number (1, 2 or 3). Several columns can be suppressed at once, as in -12.

$ comm -12 file1.txt file2.txt # Print only column 3 - lines that the files have in common

Jen: I'm fine how are you?
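Since comm expects sorted input, a typical workflow sorts both files first and then picks the column of interest. The following is a sketch with two small hypothetical lists.

```shell
tmp=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmp/a.txt"
printf '%s\n' fig kiwi apple > "$tmp/b.txt"

# comm needs sorted input, so sort each file first.
sort "$tmp/a.txt" > "$tmp/a.sorted"
sort "$tmp/b.txt" > "$tmp/b.sorted"

common=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted")  # in both files
only_a=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted")  # only in the first
only_b=$(comm -13 "$tmp/a.sorted" "$tmp/b.sorted")  # only in the second

printf 'common:\n%s\nonly a: %s\nonly b: %s\n' "$common" "$only_a" "$only_b"
rm -rf "$tmp"
```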

Case-insensitive

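Some implementations of comm (for example, the BSD and macOS versions) accept the -i option for a case-insensitive comparison; GNU comm does not provide it.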

Finding Differences diff

A more powerful comparison with diff

We can use the diff command to compare files line by line, with a richer set of output formats than cmp offers.

Simply pass in the two file names as the arguments to the diff command.

$ cat file1.txt

Life goals 1) Graduate High School 2) Go skydiving 3) Help and feed the poor and hungry 4) Run a half marathon 5) Swim at the beach 6) Travel to Ireland 7) Get rock solid abs

$ cat file2.txt

Life bucket list 1) Graduate college 2) Go skydiving 3) Help and feed the poor 4) Run a marathon 5) Swim at the beach 6) Travel the world

$ diff file1.txt file2.txt

1,2c1,2
< Life goals
< 1) Graduate High School
---
> Life bucket list
> 1) Graduate college
4,5c4,5
< 3) Help and feed the poor and hungry
< 4) Run a half marathon
---
> 3) Help and feed the poor
> 4) Run a marathon
7,8c7
< 6) Travel to Ireland
< 7) Get rock solid abs
---
> 6) Travel the world

The output will tell you how to change the first file to get the second file. Lines beginning with a < mean that they're from the first file, while the > mean they're from the second. The --- signifies separation of file1.txt and file2.txt .

There are three letters that signify three types of changes:

a add
c change
d delete

Thus in our example above, the diff command tells us to change lines 1 and 2 of file1.txt to lines 1 and 2 of file2.txt. The same goes for lines 4 and 5 (line 3 requires no change). Then it tells us to change lines 7 and 8 to just line 7 of file2.txt.

Unified mode

With the -u option, we can avoid redundant information. Again, the output will show you how to go from the first argument file to the second.

Here is a list of symbols used in unified mode. A changed line appears as a deleted line followed by an added line.

+ Lines that have been added.
- Lines that have been deleted.

$ diff -u file1.txt file2.txt

--- file1.txt 2015-06-03 22:34:30.000000000 -0700
+++ file2.txt 2015-06-03 22:38:26.000000000 -0700
@@ -1,8 +1,8 @@
-Life goals
-1) Graduate High School
+Life bucket list
+1) Graduate college
 2) Go skydiving
-3) Help and feed the poor and hungry
-4) Run a half marathon
+3) Help and feed the poor
+4) Run a marathon
 5) Swim at the beach
-6) Travel to Ireland
-7) Get rock solid abs
+6) Travel the world
+

Options

These are a few common options used with the diff command. Make sure you check the man page for a complete listing.

-b Ignore changes in the amount of white space.
-B Ignore changes that insert or delete blank lines.
-i Case insensitive.
-q Report only whether the two files differ, without showing the differences.
-s Report when two files are the same.
-w Ignore all white space.
-y View side by side.
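Like cmp, diff sets its exit status to 0 when it finds no differences and 1 when it does, which makes these options useful in scripts. The sketch below compares two hypothetical one-line files that differ only in spacing.

```shell
tmp=$(mktemp -d)
printf 'Go skydiving\n' > "$tmp/a.txt"
printf 'Go   skydiving\n' > "$tmp/b.txt"

# Strict comparison: the extra spaces count as a difference.
if diff -q "$tmp/a.txt" "$tmp/b.txt" > /dev/null; then
  strict=same
else
  strict=differ
fi

# With -w, white space is ignored and the files compare as equal.
if diff -w "$tmp/a.txt" "$tmp/b.txt" > /dev/null; then
  loose=same
else
  loose=differ
fi

printf 'strict: %s, ignoring white space: %s\n' "$strict" "$loose"
rm -rf "$tmp"
```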

Patching from a diff file patch

The diff command is used by software developers to check for differences in source code. The output of a diff command can be used to patch files.

Patches are used to convert one version of a file to another. When source code needs updating, patch files are sent instead of the entire source code, as this saves bandwidth and download time.

The command to apply a patch is patch .

Applying a patch

To apply a patch, first run the diff command in unified mode (using the -u option).

$ cat file1.txt Life goals 1) Graduate High School 2) Go skydiving 3) Help and feed the poor and hungry 4) Run a half marathon 5) Swim at the beach 6) Travel to Ireland 7) Get rock solid abs

$ cat file2.txt Life bucket list 1) Graduate college 2) Go skydiving 3) Help and feed the poor 4) Run a marathon 5) Swim at the beach 6) Travel the world

$ diff -u file1.txt file2.txt > update.diff
$ patch < update.diff
patching file file1.txt

$ cat file1.txt # Now it's updated to file2.txt! Life bucket list 1) Graduate college 2) Go skydiving 3) Help and feed the poor 4) Run a marathon 5) Swim at the beach 6) Travel the world

Reversing a patch

To reverse a patch, use the -R option.

$ patch -R < update.diff patching file file1.txt $ cat file1.txt # Back to original file Life goals 1) Graduate High School 2) Go skydiving 3) Help and feed the poor and hungry 4) Run a half marathon 5) Swim at the beach 6) Travel to Ireland 7) Get rock solid abs

Make sure that no changes were made to the updated file1.txt, as this will throw off the line numbers specified in the diff file.

Spell checking and Dictionary lookup aspell, lookup

Spell checking

Spell checking on the command line is as interactive as it is easy. Use the aspell command, which is an interactive spell checker.

$ cat > woodchuck.txt How mch wood wuld a woodchck chuk if a wodchuck coud chuc wod?

Now try correcting the woodchuck.txt file with the aspell check command

$ aspell check woodchuck.txt

Your terminal should turn into an interactive program such as the one below:

How mch wood wuld a woodchck chuk if a wodchuck coud chuc wod?

1) Mach     6) ch
2) Mich     7) MC
3) mach     8) MCI
4) much     9) mph
5) Ch       0) och
i) Ignore   I) Ignore all
r) Replace  R) Replace all
a) Add      l) Add Lower
b) Abort    x) Exit

To replace the word with a suggestion, use the number keys 0-9. Otherwise, press the letter for the action you want.

HTML files

You may often need to run a spell check on HTML pages. To have aspell ignore HTML tags, use the -H option.

Dictionary Lookup

If you're playing Scrabble with a friend and need to prove that a word exists, you can do so right on the command line! Use the command lookup (on most Linux and BSD systems, the standard utility for this is named look).

$ lookup quetzal quetzal

If the word exists, it will echo it back to standard out, along with other words with the same beginning characters.