Don't Forget Pipe
When we need to mangle or massage some data on a Linux/UNIX server or desktop, we usually reach for applications like Google Sheets or other fully fledged data processing software, forgetting about pipes.
A basic description of the pipe (|) is that it allows the output of one command to be the input of another. What this lets you do is filter data in steps as you chain commands.
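For example, a long directory listing can be piped straight into head (covered below) so that only its first ten lines reach the screen …
ls -l | head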
Basic commands
Basic commands like grep, cut, sort, uniq and head that come with every Linux/UNIX installation can take you a long way. Reading the man pages of these commands will give you creative ideas for how to use them. A few quick starters are:
grep – file pattern searcher
Some flags:
-c, --count : Only a count of selected lines is written to standard output.
-i, --ignore-case : Perform case insensitive matching.
-v, --invert-match : Selected lines are those not matching any of the specified patterns.
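For instance, counting how many lines of a log file mention "timeout", whatever its letter case, could look like this (server.log here is just a made-up example file) …
grep -c -i timeout server.log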
cut – cut out selected portions of each line of a file
Some flags:
-d delim : Use delim as the field delimiter character instead of the tab character.
-f list : The list specifies fields, separated in the input by the field delimiter character.
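For instance, /etc/passwd uses ':' as its field separator, so listing just the usernames (the first field) could look like …
cut -d ':' -f 1 /etc/passwd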
sort – sort or merge records (lines) of text and binary files
Some flags:
-f, --ignore-case : Case-independent sorting.
-r, --reverse : Sort in reverse order.
-b, --ignore-leading-blanks : Ignore leading blank characters when comparing lines.
-h, --human-numeric-sort : Sort by numerical value, but take into account the SI suffix, if present.
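A classic pairing is ranking the sizes of everything in the current directory with du and putting the biggest first (assuming your sort is recent enough to support -h, as modern GNU and BSD versions are) …
du -sh * | sort -rh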
uniq – report or filter out repeated lines in a file.
Some options:
-c : Precede each output line with the count of the number of times the line occurred in the input.
-i : Case insensitive comparison of lines.
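Note that uniq only collapses adjacent repeated lines, so it is almost always fed pre-sorted input; counting how often each line appears in a made-up names.txt could look like …
sort names.txt | uniq -c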
head – display first lines of a file
Some options:
-n count : Displays the first count lines
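For instance, peeking at just the first 5 lines of a made-up big_export.csv would be …
head -n 5 big_export.csv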
This approach is particularly simple when the data that needs massaging has some repeated or unique patterns. A csv file is a good example. Let’s use this csv file (generic_call_data.csv) containing 350 rows of some generic failed call data. The columns are: the date and time of the call (datetime), the SIP Call-ID (callid), the SIP To header field of the call (to) and the response code of the failed call (res_code). An example row is:
2019-12-17 00:03:38,0HODFHNEIG26699@192.168.86.100,+9994234818 <sip:+9994234818@192.168.86.200>;tag=gh567hkn,404
Filtering
Using the sample data, if we needed to count how many times calls to the number +9994234804 ended with an error, we could just use grep with the -c flag …
grep -c 9994234804 generic_call_data.csv
Result:
8
No pipe needed. How about if we needed to find out all the different error codes? For that we would have to focus on the last (4th) column, and here cut comes in handy. To get the 4th column or field we set the -f flag to 4; knowing that we are using a csv file and that the column separator is ',', we set that as our delimiter (-d) …
cut -d ',' -f 4 generic_call_data.csv
We have now extracted the 4th column but we still have 350 rows. Our next step is to sort the result (pun intended), and here we will need our first pipe …
cut -d ',' -f 4 generic_call_data.csv | sort
Now we have a sorted list of the 4th column. What is left is just to get the unique (pun intended) values …
cut -d ',' -f 4 generic_call_data.csv | sort | uniq
Result:
404
408
481
486
500
res_code
If we want to know the count of each error, we just add the -c flag …
cut -d ',' -f 4 generic_call_data.csv | sort | uniq -c
Result:
309 404
5 408
1 481
25 486
9 500
1 res_code
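The stray res_code line is simply the csv header row being counted along with the data. If it gets in the way, one option is to skip the first line with tail before the rest of the chain (tail -n +2 starts printing from the second line) …
tail -n +2 generic_call_data.csv | cut -d ',' -f 4 | sort | uniq -c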
Now using the first example, let’s find the different error codes and their count for calls to +9994234804 …
grep 9994234804 generic_call_data.csv | cut -d ',' -f 4 | sort | uniq -c
Result:
6 404
1 408
1 500
Finally, let’s find the top 10 numbers that had the most errors. Looking at the data file, the numbers are in the 3rd column, and cut could be used to extract it. Looking at an example data field
+9994234818 <sip:+9994234818@192.168.86.200>;tag=gh567hkn
we see that there is a space between the number and the SIP URI of the number, so cut could be used again (cut -d ' ' -f 1). After that, all that would be left is sorting the numbers and then finding and counting the unique values …
cut -d ',' -f 3 generic_call_data.csv | cut -d ' ' -f 1 | sort | uniq -c
Result:
17 +9993234050
6 +9993234100
2 +9993234101
1 +9993234104
27 +9993234109
1 +9993234116
...
...
We now need to sort the number counts from the biggest to the smallest (sort -r works here; sort -rn, which compares the counts strictly as numbers, is a slightly more robust choice) and then select the top 10 (head).
cut -d ',' -f 3 generic_call_data.csv | cut -d ' ' -f 1 | sort | uniq -c | sort -r | head
Result:
34 +9998234240
27 +9993234109
17 +9993234050
16 +9993234163
11 +9993234159
9 +9995234886
9 +9994234834
8 +9998234245
8 +9994234804
7 +9994234802
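As one last variation (output not shown here), the same chain can first be narrowed to a single response code. Assuming the file has plain Unix line endings, the ten numbers with the most 404 failures could be found with …
grep ',404$' generic_call_data.csv | cut -d ',' -f 3 | cut -d ' ' -f 1 | sort | uniq -c | sort -rn | head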
So the next time you need to quickly analyse a csv, log or any other file with (preferably) some repeated patterns or words, I challenge you to resist the temptation to reach for a spreadsheet application and instead fire up the terminal and get piping 😎