Friday, November 20, 2009

DATA MANIPULATION

SkyHi @ Friday, November 20, 2009
How Can I Sort Linux Files?
The sort command sorts a file according to fields--the individual pieces of data on each line. By default, sort assumes that the fields are just words separated by blanks, but you can specify an alternative field delimiter if you want (such as commas or colons). Output from sort is printed to the screen, unless you redirect it to a file.
If you had a file like the one shown here containing information on people who contributed to your presidential reelection campaign, for example, you might want to sort it by last name, donation amount, or location. (Using a text editor, enter the three lines below into a file and save it with donors.data as the file name.)

Bay Ching 500000 China
Jack Arta 250000 Indonesia
Cruella Lumper 725000 Malaysia

Let's take this sample donors file and sort it by last name. The following shows the command to sort the file on the second field (the last name) and the output from the command:

sort +1 -2 donors.data
Jack Arta 250000 Indonesia
Bay Ching 500000 China
Cruella Lumper 725000 Malaysia
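
On systems whose sort no longer accepts the +m -n notation, the same sort can be written with the -k option. The sketch below is one way to try it, assuming a POSIX-style sort and a scratch directory created with mktemp; LC_ALL=C is an addition here so the ordering does not depend on your locale. It also sorts on the donation amount as a number, largest first, since that order differs from the last-name order.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1

# Recreate the sample file from the text.
cat > donors.data <<'EOF'
Bay Ching 500000 China
Jack Arta 250000 Indonesia
Cruella Lumper 725000 Malaysia
EOF

# -k2,2: the key starts and ends at field 2 (last name).
by_name=$(LC_ALL=C sort -k2,2 donors.data)

# -k3,3nr: field 3 only, compared as a number (n), largest first (r).
by_amount=$(LC_ALL=C sort -k3,3nr donors.data)

printf '%s\n' "$by_name" "---" "$by_amount"
```

The first listing should match the output shown above; the second should start with Cruella Lumper and end with Jack Arta.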

The syntax of the sort command is pretty strange, but if you study the following examples, you should be able to adapt one of them for your own use. The general form of the sort command is

sort [flags] [sort keys] [file name]

The most common flags are as follows:

-f Treat lowercase letters as uppercase when comparing (so "Bill" and "bill" are treated the same). The lines themselves are printed unchanged.
-r Sort in reverse order (so "Z" starts the list instead of "A").
-n Sort a column in numerical order (so 9 comes before 10).
-tx Use x as the field delimiter (replace x with a comma or other character).
-u Suppress all but one line in each set of lines with equal sort fields (so if you sort on a field containing last names, only one "Smith" will appear even if there are several).
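
Here is a quick way to see a few of these flags in action. The two small files (numbers.txt and names.txt) are made up for this sketch and are not part of the lesson's sample data; LC_ALL=C is added so the results do not depend on your locale.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data, invented for this example.
printf '%s\n' 10 9 100 > numbers.txt
printf '%s\n' bill Adam Bill adam > names.txt

# No flags: lines are compared character by character, so "10" and "100" come before "9".
as_text=$(LC_ALL=C sort numbers.txt)

# -n: compare by numeric value instead.
as_numbers=$(LC_ALL=C sort -n numbers.txt)

# -n -r: numeric order, reversed.
reversed=$(LC_ALL=C sort -n -r numbers.txt)

# -f -u: ignore case when comparing, then keep one line per set of equal lines.
unique_names=$(LC_ALL=C sort -f -u names.txt)

printf '%s\n' "$as_text" "---" "$as_numbers" "---" "$reversed" "---" "$unique_names"
```

The last listing should contain only two names, one spelling of Adam and one of Bill.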

Specify the sort keys like this:

+m Start at the first character of the m+1th field.
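-n End at the last character of the nth field (if -n is omitted, the key runs to the end of the line).

This +m -n notation is the historical form. POSIX versions of sort write the same key as -k m+1,n (for example, +1 -2 becomes -k2,2), and some current systems accept only the -k form.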
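
To see what the end position actually changes, here is a small sketch using the -k spelling of the keys (-k2,2 corresponds to +1 -2, and -k2 to +1 alone). The file pairs.txt is a made-up two-line example, and LC_ALL=C is added so the result does not depend on your locale.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical two-line file: both lines have the same second field.
printf '%s\n' 'a x 2' 'b x 1' > pairs.txt

# Key is field 2 only. The keys tie, so sort falls back to comparing the whole lines.
field_only=$(LC_ALL=C sort -k2,2 pairs.txt)

# Key runs from field 2 to the end of the line, so the third field breaks the tie.
to_end=$(LC_ALL=C sort -k2 pairs.txt)

printf '%s\n' "$field_only" "---" "$to_end"
```

With the end position given, "a x 2" comes first; without it, "b x 1" comes first.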

Looks weird, huh? Let's look at a few more examples with the sample company.data file shown here, and you'll get the hang of it. (Each line of the file contains four fields: first name, last name, serial number, and department name.)

Jan Itorre 406378 Sales
Jim Nasium 031762 Marketing
Mel Ancholie 636496 Research
Ed Jucacion 396082 Sales

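To sort the file on the third field (serial number) in reverse order and save the results in sorted.data, use this command (nothing is printed to the screen; the lines after the command show what sorted.data then contains):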

sort -r +2 -3 company.data > sorted.data
Mel Ancholie 636496 Research
Jan Itorre 406378 Sales
Ed Jucacion 396082 Sales
Jim Nasium 031762 Marketing
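
If your sort only understands the -k form, the same reverse sort might look like the sketch below. It assumes a scratch directory from mktemp, and LC_ALL=C is added so the ordering does not depend on your locale.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1

# Recreate the sample file from the text.
cat > company.data <<'EOF'
Jan Itorre 406378 Sales
Jim Nasium 031762 Marketing
Mel Ancholie 636496 Research
Ed Jucacion 396082 Sales
EOF

# -r reverses the order; -k3,3 limits the key to field 3 (serial number).
# Because of the redirect, the result goes into sorted.data instead of the screen.
LC_ALL=C sort -r -k3,3 company.data > sorted.data

# Show what was saved.
cat sorted.data
```

The serial numbers are compared as text here, which works because they all have the same number of digits. For numbers of different lengths, adding the -n flag would compare them by value instead.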

Now let's look at a situation where the fields are separated by colons instead of spaces. In this case, we will use the -t: flag to tell the sort command how to find the fields on each line. Let's start with this file:

Itorre, Jan:406378:Sales
Nasium, Jim:031762:Marketing
Ancholie, Mel:636496:Research
Jucacion, Ed:396082:Sales

To sort the file on the second field (serial number), use this command:

sort -t: +1 -2 company.data
Nasium, Jim:031762:Marketing
Jucacion, Ed:396082:Sales
Itorre, Jan:406378:Sales
Ancholie, Mel:636496:Research
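
The -t flag works the same way with the -k form of the sort key. A minimal sketch, assuming a POSIX-style sort and a scratch directory from mktemp (LC_ALL=C is added so the ordering does not depend on your locale):

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1

# Recreate the colon-separated sample file from the text.
cat > company.data <<'EOF'
Itorre, Jan:406378:Sales
Nasium, Jim:031762:Marketing
Ancholie, Mel:636496:Research
Jucacion, Ed:396082:Sales
EOF

# -t: makes the colon the field delimiter, so "Itorre, Jan" counts as one field.
# -k2,2 then sorts on the serial number alone.
by_serial=$(LC_ALL=C sort -t: -k2,2 company.data)

printf '%s\n' "$by_serial"
```

The listing should match the output shown above, starting with Nasium and ending with Ancholie.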

To sort the file on the third field (department name) and suppress the duplicates, use this command:

sort -t: -u +2 company.data
Nasium, Jim:031762:Marketing
Ancholie, Mel:636496:Research
Itorre, Jan:406378:Sales

Note that the line for Ed Jucacion did not print, because he's in Sales, and we asked the command (with the -u flag) to suppress lines that were the same in the sort field.
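
Here is the same idea written with the -k form of the key, as a sketch for systems that do not accept +2. It assumes a scratch directory from mktemp, LC_ALL=C is added so the ordering does not depend on your locale, and cut is used only to pull out the department names for a quick check.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1

# Recreate the colon-separated sample file from the text.
cat > company.data <<'EOF'
Itorre, Jan:406378:Sales
Nasium, Jim:031762:Marketing
Ancholie, Mel:636496:Research
Jucacion, Ed:396082:Sales
EOF

# -k3,3 sorts on the department; -u keeps one line per department.
one_per_dept=$(LC_ALL=C sort -t: -u -k3,3 company.data)

# List just the departments that survived.
departments=$(printf '%s\n' "$one_per_dept" | cut -d: -f3)

printf '%s\n' "$one_per_dept" "---" "$departments"
```

Four lines go in and three come out, one each for Marketing, Research, and Sales.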

There are lots of fancy (and a few obscure) things you can do with the sort command. If you need to do any sorting that's not quite as straightforward as these examples, try the man sort command for more information.

