Linux text processing tools – Part 2
This is my attempt to get familiar with the various text processing tools in Linux. I am also noting down several one-liners that I found elsewhere.
It's a continuation of my earlier post, Linux text processing tools – Part 1.
This is NOT written as a tutorial. For details on usage and options check man pages.
sort – sort lines of text files
Sorting is done based on one or more sort keys extracted from each line of input.
If no sort keys are specified, the entire line is taken as key.
Some of the options for sort are:
-f (ignore case), -b (ignore leading blanks), -n (sort numerically), -r (reverse), -u (unique), -t x (where x is the field delimiter), -k (specify sort keys), -o (output file)
[tony@localhost tmp]$ sort List.txt
bbb
ccc
CCC
ddd
[tony@localhost tmp]$ sort -n age.txt
08mike
18john
23tony
60jose
[tony@localhost tmp]$ sort -bn
19
9
06
14
#press CTRL+D to stop inputting
06
9
14
19
[tony@localhost tmp]$ sort -c list.txt
sort: list.txt:2: disorder: aaa #check whether sorted
[tony@localhost tmp]$ sort -m list.txt list2.txt #merge sorted files, each file must be sorted previously.
[tony@localhost tmp]$ sort list.txt -o list.txt #sorts and saves in the same file
[tony@localhost tmp]$ sort -R list.txt -o list.txt #shuffle a list of lines
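To see how the ordering options change the result, here is a small sketch on made-up numbers (the data and variable names are only for illustration). LC_ALL=C is set so the plain sort compares bytes regardless of the current locale.

```shell
# Made-up input: four numbers, one of them repeated.
nums='10
9
2
9'

# Plain sort compares character by character, so "10" comes before "2".
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort)
# -n compares by numeric value instead.
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)
# -u additionally drops lines that compare equal.
unique=$(printf '%s\n' "$nums" | LC_ALL=C sort -nu)

printf 'lexical: %s\n' "$(printf '%s' "$lexical" | tr '\n' ' ')"   # 10 2 9 9
printf 'numeric: %s\n' "$(printf '%s' "$numeric" | tr '\n' ' ')"   # 2 9 9 10
printf 'unique:  %s\n' "$(printf '%s' "$unique" | tr '\n' ' ')"    # 2 9 10
```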
Sort keys (or fields) are specified using the -k option with the syntax
-k m[,n] (start at field m, end at field n including it, or the end of the line if n is omitted).
By default, fields are separated by blanks. A different separator can be specified using -t x.
Note that ordering options such as b, n and r can also be appended to a key, in which case they apply to that key only.
[tony@localhost tmp]$ sort -k 3b
#sort using key from 3rd field to end of line.
#Ignore any blanks preceding 3rd field.
[tony@localhost tmp]$ sort -k2n,2
#this will sort numerically based on the 2nd field.
[tony@localhost tmp]$ sort -t : -k 2,2n -k 5.3,5.4
#Sort numerically on the second field and then sort alphabetically on the third
#and fourth characters of field five to break tie. Use `:' as the field delimiter.
[tony@localhost tmp]$ sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 IP.txt
#sort a list of IPv4 addresses.
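The key syntax is easier to follow on concrete data. The sketch below uses made-up name:age:city records (not taken from any real file) and sorts them numerically on the second field, breaking ties on the first.

```shell
# Made-up sample records in name:age:city form (illustrative data only).
records='tony:23:kochi
mike:8:delhi
john:18:pune
jose:60:goa
anna:18:agra'

# -t : sets the field delimiter.
# -k 2,2n sorts numerically on field 2 only.
# -k 1,1 breaks ties alphabetically on field 1.
# LC_ALL=C keeps the ordering independent of the current locale.
by_age=$(printf '%s\n' "$records" | LC_ALL=C sort -t : -k 2,2n -k 1,1)
printf '%s\n' "$by_age"
```

Writing -k 2,2n rather than -k 2n matters here: without the end field, the key would run from field 2 to the end of the line.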
Note: if you sometimes see a strange sorting order (see the example below), it is usually because the locale is set to a non-POSIX one. You can prefix sort (or any command) with 'LC_ALL=C' to get the POSIX order.
[tony@localhost tmp]$ sort list
1 1
111
212
2 2
[tony@localhost tmp]$ LC_ALL=C sort list
1 1
111
2 2
212
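Another visible effect of the POSIX order is how upper and lower case are treated. A minimal sketch with made-up words; only the C locale is shown, since which other locales are installed varies from system to system.

```shell
# Illustrative input mixing upper and lower case.
words='banana
Cherry
apple'

# In the C/POSIX locale lines are compared byte by byte, so every
# uppercase ASCII letter sorts before every lowercase one.
c_order=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_order"
```

Here Cherry comes first because 'C' has a lower byte value than 'a' and 'b'. Many non-POSIX locales would put apple first instead.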
——————————–
shuf – shuffle lines
[tony@localhost tmp]$ sort LIST | shuf
333 #shuf shuffles our sorted LIST
212
444
111
[tony@localhost tmp]$ shuf -i 1-4
2 #generates a shuffled list of numbers
3
4
1
[tony@localhost tmp]$ shuf -e clubs hearts diamonds spades
clubs
diamonds
hearts
spades
[tony@localhost tmp]$ shuf LIST -o LIST #in place save
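Two more things that are handy to know about shuf, sketched here under the assumption that GNU shuf is installed: -n (not covered above) limits the output to that many lines, and shuffling only reorders lines, it never adds or drops any.

```shell
# Pick 3 distinct numbers at random from the range 1-10.
picked=$(shuf -i 1-10 -n 3)
printf '%s\n' "$picked"

# Shuffling only reorders lines: sorting the shuffled output
# gives back the original sorted input.
restored=$(printf '%s\n' a b c d | shuf | LC_ALL=C sort)
printf '%s\n' "$restored"
```

The picked numbers differ from run to run, so no fixed output is shown for them.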
——————————–
uniq – Uniquify files
uniq discards adjacent duplicate lines. Non-adjacent duplicate lines are not discarded. To discard those as well, sort the file first or use the sort -u command.
[tony@localhost tmp]$ cat list
aaa
bbb
bbb
ccc
bbb
[tony@localhost tmp]$ uniq list
aaa
bbb
ccc
bbb
[tony@localhost tmp]$ uniq -c list
1 aaa #prefixes repetition count
2 bbb
1 ccc
1 bbb
[tony@localhost tmp]$ uniq -d list
bbb #only repeated lines
[tony@localhost tmp]$ uniq -i list2
ccc #ignores case when comparing
Fields or characters can be skipped before comparison by using the -f N and -s N options.
[tony@localhost tmp]$ cat > list2
aa bb
ad bb
ad cc
ad dd
[tony@localhost tmp]$ uniq -f 1 list2
aa bb #we skipped first field from comparison check
ad cc
ad dd
[tony@localhost tmp]$ uniq -s 4 list2
aa bb #we skipped the first 4 characters
ad cc
ad dd
[tony@localhost tmp]$ uniq -s 2 -w 1 list2
aa bb #-w option specifies number of characters to compare after
# any characters and fields have been skipped
[tony@localhost tmp]$ sort file | uniq -c | sort -n
#displays the unique lines along with the number of times they occur
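The sort | uniq -c | sort pipeline is worth trying on a small input. This is a sketch on made-up words; the final sort uses -rn rather than -n so the most frequent line comes first, and awk is only there to strip the padding uniq -c puts in front of each count.

```shell
# Made-up input with repeated, non-adjacent words.
words='pear
apple
pear
fig
apple
pear'

# sort groups identical lines so that uniq -c can count them;
# sort -rn then orders the counts from highest to lowest.
counts=$(printf '%s\n' "$words" \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -rn \
    | awk '{print $1, $2}')
printf '%s\n' "$counts"
```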
———————————–
comm – compare two sorted files line by line
comm outputs three columns. The first two columns contain lines unique to the first and second file, respectively. The last column contains lines common to both.
The columns are separated by tabs. Both files must be sorted beforehand.
The options -1, -2 and -3 can be specified to remove the corresponding column from the output.
[tony@localhost tmp]$ cat > list
aaa
bbb
# press CTRL+D to stop inputting
[tony@localhost tmp]$ cat > list2
bbb
bbb
ccc
[tony@localhost tmp]$ comm list list2
aaa
		bbb #bbb is common to both
	bbb #this bbb is only in list2
	ccc
[tony@localhost tmp]$ comm -32 list list2
aaa #only the lines unique in first file.
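Suppressing columns turns comm into a simple set tool. A minimal sketch, assuming two already sorted files; the file names a.txt and b.txt and their contents are made up for illustration.

```shell
# Work in a throwaway directory; file names are illustrative.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmpdir/a.txt"
printf '%s\n' banana cherry date  > "$tmpdir/b.txt"

# -12 hides columns 1 and 2, leaving the lines common to both files.
common=$(LC_ALL=C comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")
# -23 leaves only the lines unique to the first file.
only_a=$(LC_ALL=C comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")
# -13 leaves only the lines unique to the second file.
only_b=$(LC_ALL=C comm -13 "$tmpdir/a.txt" "$tmpdir/b.txt")

printf '%s\n' "common:" "$common" "only in a:" "$only_a" "only in b:" "$only_b"
rm -r "$tmpdir"
```

When a column is suppressed, its tab is dropped too, so the remaining lines come out unindented.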
————————————–
head – Output the first part (10 lines) of files
head supports the options -n K and -c K, where K is the number of lines or bytes to be printed.
[tony@localhost ~]$ head -n5 .bash_history #prints 1st 5 lines
su -
lspci
/etc/init.d/NetworkManager restart
/etc/init.d/NetworkManager status
tail -f /var/log/messages
[tony@localhost ~]$ head -c10 .bash_history #1st 10 bytes
su -
lspci
[tony@localhost ~]$ head -n-5 .bash_history #prints all but last 5 lines
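The same three forms can be tried without touching a real file. A small sketch on made-up input; the negative line count is a GNU extension, so the last command assumes GNU head.

```shell
# Five short lines as stand-in input.
input=$(printf '%s\n' one two three four five)

# First two lines.
first_two=$(printf '%s\n' "$input" | head -n 2)
# First five bytes: "one", a newline, then "t".
first_bytes=$(printf '%s\n' "$input" | head -c 5)
# Everything except the last two lines (GNU head only).
all_but_last_two=$(printf '%s\n' "$input" | head -n -2)

printf '%s\n' "$first_two" "--" "$first_bytes" "--" "$all_but_last_two"
```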
—————————————
tail – Output the last part (10 lines) of files
tail supports -n K and -c K as in the case of head.
tail -f keeps running and periodically (every 1 second by default) checks for new data appended to the end of the file.
[root@localhost ~]# tail -f /var/log/messages #follows system messages as they arrive, useful when isolating a problem
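tail also accepts a count with a leading + (not covered above), which means "start at line K" instead of "the last K lines". A minimal sketch on made-up CSV text, where this is used to drop a header row.

```shell
# Made-up CSV text with a one-line header.
csv='name,age
john,20
tom,19
amy,31'

# Last two lines.
last_two=$(printf '%s\n' "$csv" | tail -n 2)
# -n +2 starts output AT line 2, which skips the header.
no_header=$(printf '%s\n' "$csv" | tail -n +2)
# Last three bytes: "31" plus the final newline.
last_bytes=$(printf '%s\n' "$csv" | tail -c 3)

printf '%s\n' "$last_two" "--" "$no_header" "--" "$last_bytes"
```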
—————————————
split – Split a file into pieces
Usage: split [OPTION] [INPUT [PREFIX]]
Split is generally used to split a file based on lines (-l LINES) or bytes (-b SIZE)
By default the input is split into pieces of 1000 lines each. The output files are named by appending aa, ab, ac, etc. to PREFIX (if omitted, x is used).
[tony@localhost tmp]$ cat > file
aaa
bbb
ggg
ccc
kkk
# Press CTRL+D to stop inputting
[tony@localhost tmp]$ split -l2 file #split into files with 2 lines
[tony@localhost tmp]$ head xa? # viewing each of them
==> xaa <==
aaa
bbb

==> xab <==
ggg
ccc

==> xac <==
kkk
[tony@localhost tmp]$ split -b100KB thriller.ogg part # splitting music file by size 100KB
[tony@localhost tmp]$ ls -l parta?
-rw-rw-r--. 1 tony tony 100000 Oct 12 14:47 partaa
-rw-rw-r--. 1 tony tony 100000 Oct 12 14:47 partab
-rw-rw-r--. 1 tony tony 9305 Oct 12 14:47 partac
[tony@localhost tmp]$ cat parta? > thriller2.ogg # Joining them back
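After joining pieces back together it is reassuring to check that nothing was lost. A sketch of that round trip in a throwaway directory; the chunk_ prefix and the file names are made up, and cmp is used for the comparison.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' aaa bbb ggg ccc kkk > "$tmpdir/file"

# Split into 2-line pieces named chunk_aa, chunk_ab, chunk_ac.
split -l 2 "$tmpdir/file" "$tmpdir/chunk_"
pieces=$(ls "$tmpdir" | grep -c '^chunk_')

# The shell expands the glob in sorted order, so cat
# concatenates the pieces in their original order.
cat "$tmpdir"/chunk_* > "$tmpdir/rejoined"

# cmp -s exits 0 only if both files are byte-for-byte identical.
if cmp -s "$tmpdir/file" "$tmpdir/rejoined"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
printf 'pieces=%s roundtrip=%s\n' "$pieces" "$roundtrip"
rm -r "$tmpdir"
```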
—————————————
paste – merge lines of files
paste joins files horizontally: each output line consists of the sequentially corresponding lines of each file specified, separated by tabs.
[tony@localhost tmp]$ cat > Name
john
tom
# Press CTRL+D to stop inputting
[tony@localhost tmp]$ cat > Age
20
19
[tony@localhost tmp]$ paste Name Age
john	20
tom	19
The -s option pastes the lines of one file at a time rather than one line from each file.
[tony@localhost tmp]$ paste -s Name Age
john	tom
20	19
A delimiter list can be specified using -d. The delimiters are used in turn, one between each pair of columns.
[tony@localhost tmp]$ paste -d ':-' Name Age Name
john:20-john
tom:19-tom
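paste also reads standard input when a file name is given as -, which makes for two handy one-liners. A small sketch on made-up input:

```shell
# Join all input lines into one comma-separated line.
# "-" tells paste to read standard input.
csv_line=$(printf '%s\n' john tom amy | paste -s -d , -)
printf '%s\n' "$csv_line"

# Giving "-" twice makes paste take two consecutive input lines
# per output line, folding one column into two.
pairs=$(printf '%s\n' john 20 tom 19 | paste -d : - -)
printf '%s\n' "$pairs"
```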
—————————————
join – join lines of two files on a common field
Usage: join [OPTION] FILE1 FILE2
Join outputs a line for each pair of input lines with identical join fields. Each output line consists of the join field, the remaining fields from FILE1, then the remaining fields from FILE2.
The default join field is the first field, delimited by whitespace. Leading blanks on the line are ignored.
Both files have to be sorted on the join field.
[tony@localhost tmp]$ cat > j1
a ac
b bc
c cd
# Press CTRL+D to stop inputting
[tony@localhost tmp]$ cat > j2
a acc
b bcc
c cdd
[tony@localhost tmp]$ join j1 j2
a ac acc
b bc bcc
c cd cdd
[tony@localhost tmp]$ cat >> j2
c dcccc
[tony@localhost tmp]$ join j1 j2
a ac acc
b bc bcc
c cd cdd
c cd dcccc
join supports options like -1 N (join field in the first file), -2 N (join field in the second file), -i (ignore case), -t x (field separator for both input and output), -v N (print only the unpairable lines of file N, where N is 1 or 2).
[tony@localhost tmp]$ join -1 2 -2 1 -t '-' file1 file2
#joins on the 2nd field of the first file and the 1st field of the second file,
#using - as the field separator for both input and output
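To see -t and -v in action, here is a minimal sketch with two made-up colon-separated files (the names and contents are only for illustration), both already sorted on the first field.

```shell
tmpdir=$(mktemp -d)
# Illustrative input files, sorted on field 1.
printf '%s\n' 'a:apple' 'b:banana' 'c:cherry' > "$tmpdir/names"
printf '%s\n' 'a:red' 'c:dark red' 'd:brown'  > "$tmpdir/colours"

# Join on field 1 of both files; ":" is the separator for input and output.
joined=$(LC_ALL=C join -t : "$tmpdir/names" "$tmpdir/colours")
# -v 1 prints only the lines of the first file that found no partner.
unpaired=$(LC_ALL=C join -t : -v 1 "$tmpdir/names" "$tmpdir/colours")

printf '%s\n' "$joined" "--" "$unpaired"
rm -r "$tmpdir"
```

Because the separator is ":", the space inside "dark red" is kept as part of the field rather than splitting it.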
————————————
