Sorting with "-k" on Unix and Linux

The “sort” command on Solaris has a “-k” switch for sorting by a particular field. For example, “sort -k 2” will sort by the second field on each line of input. Parts of fields can be further specified with “-k n.m“, says the man page.

For example, “sort -k 2.3” should sort by the second field, starting with the third character in that field. But the man page isn’t the clearest, and getting the “-k x.y” notation to work is tricky. Tricky until you realize it never works you also supply the “-b” argument. Same on Linux.

Take the following example:

jim@pluto # cat 1
0   2 Que  12/08/12  16:51
988 1 Act  10/29/12  03:51
49  6 Wri  07/30/12  20:01
49  3 Wri  07/25/12  20:01
988 8 Wri  08/09/12  06:45
988 5 Wri  09/23/11  06:45
988 7 Wri  02/09/11  06:45

We want to sort by the date in the 4th column (American format – mm/dd/yy). The following command should do it. Those three “-k” arguments should sort by the year, then the month, then the day, respectively, putting the lines in overall date order. But no, the output is jumbled:

jim@pluto # cat 1 | sort -k 4.7,4.8 -k 4.1,4.2 -k 4.4,4.5
49  6 Wri  07/30/12  20:01
988 5 Wri  09/23/11  06:45
49  3 Wri  07/25/12  20:01
0   2 Que  12/08/12  16:51
988 7 Wri  02/09/11  06:45
988 8 Wri  08/09/12  06:45
988 1 Act  10/29/12  03:51

Add the “-b” flag, and it all works as expected:

jim@pluto # cat 1 | sort -b -k 4.7,4.8 -k 4.1,4.2 -k 4.4,4.5
988 7 Wri  02/09/11  06:45
988 5 Wri  09/23/11  06:45
49  3 Wri  07/25/12  20:01
49  6 Wri  07/30/12  20:01
988 8 Wri  08/09/12  06:45
988 1 Act  10/29/12  03:51
0   2 Que  12/08/12  16:51

All in date order: from the 9th of Feb 2011, down to 8th December 2012. Those -k switches mean: sort by characters 7 to 8 of the 4th field (the year), then by characters 1 to 2 (the month) and finally by characters 4 to 5 (the day).

The Solaris man page says that -b “ignores leading blank characters“. On Linux it adds “If neither -t nor -b is in effect, characters in a field are counted from the beginning of the preceding whitespace“. None of which really explains what is happening on the command line. Whatever. Just remember to always use “-b” with “-k“, and it works. Useful for sorting by dates, times, serial numbers, custom fields, whatever.

6 thoughts on “Sorting with "-k" on Unix and Linux

  1. Thank you Mo, article corrected.

    Even with the incorrect “-k 4.3,4.4”, it still appeared to sort dates in the right order, at least with the file above. Picking some new, simpler data shows the difference though:

    bash-4.2$ cat 2
    49  3 Wri  07/31/12  20:01
    49  6 Wri  07/30/12  20:01
    bash-4.2$ cat 2 | sort -b -k 4.7,4.8 -k 4.1,4.2 -k 4.4,4.5   # Correct sort, by Mo
    49  6 Wri  07/30/12  20:01
    49  3 Wri  07/31/12  20:01
    bash-4.2$ cat 2 | sort -b -k 4.7,4.8 -k 4.1,4.2 -k 4.3,4.4  # Incorrect sort
    49  3 Wri  07/31/12  20:01
    49  6 Wri  07/30/12  20:01
    

    As expected, the incorrect sort key gets it wrong. But, if the “3” in column 2 is changed to a “6”, making it the same for both lines, things change…

    bash-4.2$ cat 2
    49  6 Wri  07/31/12  20:01
    49  6 Wri  07/30/12  20:01
    bash-4.2$ cat 2 | sort -b -k 4.7,4.8 -k 4.1,4.2 -k 4.4,4.5  # Correct sort again
    49  6 Wri  07/30/12  20:01
    49  6 Wri  07/31/12  20:01
    bash-4.2$ cat 2 | sort -b -k 4.7,4.8 -k 4.1,4.2 -k 4.3,4.4  # Incorrect sort
    49  6 Wri  07/30/12  20:01
    49  6 Wri  07/31/12  20:01
    

    This time, even the duff sort key got it right. In fact, things are still right even if you remove all of the keys except for “year”:

    bash-4.2$ cat 2 | sort -b -k 4.7,4.8
    49  6 Wri  07/30/12  20:01
    49  6 Wri  07/31/12  20:01
    

    …which is not expected. Who knows what sort is doing here. It should leave the original order untouched. Instead it seems to be sorting by year, finding nothing to do, then it does some kind of fall back and runs a sort on the whole line ? Beats me.

    Jim

  2. I’m so glad that you published this article. I just spent 2 hours trying to get it to work, and sure enough, the addition of “-b” fixed it.

    I had discovered that “-t” (“-t SEP : use SEParator instead of non- to whitespace transition”) would allow me to work around it without realizing why. Note that it’s the non- to whitespace transition, not whitespace to non-whitespace. So, it makes some sense now that it’s keeping the fields internally with all the whitespace, and in fact, that makes perfect sense due to how I remember it sorting when the text is right-justified.

    However, the documentation should call out that -b should probably be used when using -k.

  3. Thanks Taed, glad it helped. The sort command on unix/Linux was always a bit of a pain when it came to sorting by different fields, but it is something that is quite useful, especially in scripts. I was trying to sort something for a couple of hours like you, then pretty much came on the “-b” thing by chance. They should probably have it turned on by default.

  4. I have the exact same issue with the exact same question. However my date fields are not all formatted the same way. Some 1/8/2010 and some 11/25/2016, for example. How do I do this in this situation?

    • Hi Justin. The best approach is probably to fix those fields before doing the sort. Here is the same example as in the article, but with some leading zeros removed:

      $ cat tempfile
      0   2 Que  12/8/12  16:51
      988 1 Act  10/29/12  03:51
      49  6 Wri  7/30/12  20:01
      49  3 Wri  7/25/12  20:01
      988 8 Wri  8/9/12  06:45
      988 5 Wri  09/23/11  06:45
      988 7 Wri  02/09/11  06:45

      The following “sed” command will insert leading zeros, giving all dates the same format:

      $ cat tempfile | sed -r 's_ ([0-9])/_ 0\1/_g ; s_/([0-9])/_/0\1/_g'
      0   2 Que  12/08/12  16:51
      988 1 Act  10/29/12  03:51
      49  6 Wri  07/30/12  20:01
      49  3 Wri  07/25/12  20:01
      988 8 Wri  08/09/12  06:45
      988 5 Wri  09/23/11  06:45
      988 7 Wri  02/09/11  06:45

      Now just pipe that into the “sort”, as above:

      $ cat tempfile | sed -r 's_ ([0-9])/_ 0\1/_g ; s_/([0-9])/_/0\1/_g' |  sort -b -k 4.7,4.8 -k 4.1,4.2 -k 4.4,4.5
      988 7 Wri  02/09/11  06:45
      988 5 Wri  09/23/11  06:45
      49  3 Wri  07/25/12  20:01
      49  6 Wri  07/30/12  20:01
      988 8 Wri  08/09/12  06:45
      988 1 Act  10/29/12  03:51
      0   2 Que  12/08/12  16:51

      Now the dates are sorted correctly. NB the first part of the sed puts a leading zero into the month, when it is missing, and the second part does the same for the day.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.