Linux text processing tools – Part 2

This is my attempt to get familiar with the various text processing tools in Linux. I am also noting down several one-liners that I found elsewhere.
It's a continuation of my earlier post, Linux text processing tools – Part 1.

This is NOT written as a tutorial. For details on usage and options check man pages.

sort – sort lines of text files

Sorting is done based on one or more sort keys extracted from each line of input.
If no sort keys are specified, the entire line is taken as key.

Some of the options for sort are:
-f (ignore case), -b (ignore leading blanks), -n (sort numerically), -r (reverse), -u (unique), -t x (where x is the field delimiter), -k (specify sort keys), -o (output file)

[tony@localhost tmp]$ sort List.txt 
bbb
ccc
CCC
ddd
[tony@localhost tmp]$ sort -n age.txt
08mike
18john
23tony
60jose
[tony@localhost tmp]$ sort -bn
19
 9
06
  14   #press CTRL+D to stop inputting
06
 9
  14
19
[tony@localhost tmp]$ sort -c list.txt
sort: list.txt:2: disorder: aaa   #check whether sorted
[tony@localhost tmp]$ sort -m list.txt list2.txt 
   #merge sorted files, each file must be sorted previously.

[tony@localhost tmp]$ sort list.txt -o list.txt     #sorts and saves in the same file
[tony@localhost tmp]$ sort -R list.txt -o list.txt    #shuffle a list of lines (identical lines stay grouped together; see shuf for a true shuffle)

Sort keys (or fields) are specified using the -k option with the syntax
-k m[,n] (start at field m, end at field n inclusive, or at the end of the line if n is omitted).
By default, blanks are used as the field separator. A different separator can be specified using -t x.

Note that several of the ordering options (such as b, n, r and f) can also be attached to an individual key.

[tony@localhost tmp]$ sort -k 3b   
   #sort using key from 3rd field to end of line. #Ignore any blanks preceding 3rd field.
[tony@localhost tmp]$ sort -k2n,2   
   #this will sort numerically based on the 2nd field.

[tony@localhost tmp]$ sort -t : -k 2,2n -k 5.3,5.4 
   #Sort numerically on the second field and then sort alphabetically on the third 
   #and fourth characters of field five to break tie. Use `:' as the field delimiter.

[tony@localhost tmp]$ sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 IP.txt
   #sort a list of IPv4 addresses.
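
As a small illustration of per-key options, the sketch below sorts a handful of made-up addresses, attaching n to each key instead of using a global -n. The sample addresses and the variable names are assumptions for this example only.

```shell
# Hypothetical sample addresses, one per line.
ips=$(printf '%s\n' 10.0.0.5 192.168.1.10 10.0.0.12 9.255.0.1)

# Use '.' as the delimiter and compare each of the four fields numerically.
sorted_ips=$(printf '%s\n' "$ips" | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n)

printf '%s\n' "$sorted_ips"
```

With a plain sort, 10.0.0.12 would come before 10.0.0.5, because the lines are then compared character by character.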

Note: if you sometimes see a strange sorting order (see the example below), it is usually because the system locale is set to a non-POSIX locale. You can prefix sort (or any command) with 'LC_ALL=C' to get the POSIX order.

[tony@localhost tmp]$ sort list
1 1
111
212
2 2
[tony@localhost tmp]$ LC_ALL=C sort list
1 1
111
2 2
212
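
The same effect can be reproduced without a file. This is a sketch with made-up input; what a default-locale run prints depends on the system's locale settings, so only the C-locale result is captured here.

```shell
# In the C locale, lines are compared byte by byte:
# space < digits < uppercase letters < lowercase letters.
c_sorted=$(printf '%s\n' '2 2' 212 b A '1 1' 111 a B | LC_ALL=C sort)

printf '%s\n' "$c_sorted"
```

Setting the variable on the command line like this affects only that one sort invocation, not the rest of the shell session.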

 

——————————–

shuf – shuffle lines

[tony@localhost tmp]$ sort LIST | shuf
333   #shuf shuffles our sorted LIST
212
444
111
[tony@localhost tmp]$ shuf -i 1-4
2   #generates a shuffled list of numbers
3
4
1
[tony@localhost tmp]$ shuf -e clubs hearts diamonds spades
clubs
diamonds
hearts
spades
[tony@localhost tmp]$ shuf LIST -o LIST   #in place save
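
shuf can also pick a random sample with -n. A minimal sketch, reusing the card suits from the example above; the variable names are only for illustration. The order is random, so no fixed output is shown for the sample.

```shell
# Pick 2 random entries out of the 4 given on the command line.
sample=$(shuf -n 2 -e clubs hearts diamonds spades)
printf '%s\n' "$sample"

# A shuffled range still contains every number exactly once:
# sorting it numerically gives back 1..10.
restored=$(shuf -i 1-10 | sort -n)
printf '%s\n' "$restored"
```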

 

——————————–

uniq – Uniquify files

uniq discards adjacent duplicate lines. Non-adjacent duplicates are not discarded; to remove those as well, sort the file first or use sort -u.

[tony@localhost tmp]$ cat list
aaa
bbb
bbb
ccc
bbb
[tony@localhost tmp]$ uniq list
aaa
bbb
ccc
bbb
[tony@localhost tmp]$ uniq -c list
      1 aaa   #prefixes repetition count
      2 bbb
      1 ccc
      1 bbb
[tony@localhost tmp]$ uniq -d list
bbb   #only repeated lines
[tony@localhost tmp]$ uniq -i list2
ccc   #ignores case when comparing

Fields or characters can be skipped before comparison by using the -f N and -s N options.

[tony@localhost tmp]$ cat > list2
aa bb
ad bb
ad cc
ad dd
[tony@localhost tmp]$ uniq -f 1 list2
aa bb   #we skipped first field from comparison check
ad cc
ad dd
[tony@localhost tmp]$ uniq -s 4 list2
aa bb   #we skipped the first 4 characters
ad cc
ad dd
[tony@localhost tmp]$ uniq -s 2 -w 1 list2
aa bb   #-w option specifies number of characters to compare after
   # any characters and fields have been skipped

[tony@localhost tmp]$ sort file | uniq -c | sort -n
#displays the unique lines along with the number of times they occur
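
Reversing the final sort puts the most frequent lines first. A small sketch with made-up input; the fruit names are placeholders and the variable name is only for illustration.

```shell
# Count how often each line occurs, most frequent first.
counts=$(printf '%s\n' apple banana apple cherry banana apple |
    sort | uniq -c | sort -rn)

printf '%s\n' "$counts"
```

uniq -c pads the counts with leading spaces, which sort -n skips when comparing.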

 

———————————–

comm – Compare two sorted files line by line

comm outputs three columns: the first two contain the lines unique to the first and second file respectively, and the third contains the lines common to both.
The columns are separated by tabs. Both files must already be sorted.

The options -1, -2 and -3 suppress the corresponding column in the output.

[tony@localhost tmp]$ cat > list
aaa   
bbb   # press CTRL+D to stop inputting
[tony@localhost tmp]$ cat > list2
bbb
bbb
ccc
[tony@localhost tmp]$ comm list list2
aaa
		bbb   #bbb is common to both
	bbb           #this bbb is only in list2
	ccc
[tony@localhost tmp]$ comm -32 list list2
aaa   #only the lines unique in first file.
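
Combining the column options gives simple set operations on sorted files. A sketch under the assumption of two small made-up files in a scratch directory; the file and variable names are hypothetical.

```shell
# Work in a scratch directory; file names and contents are made up.
workdir=$(mktemp -d)
printf '%s\n' aaa bbb ddd > "$workdir/first"
printf '%s\n' bbb ccc ddd > "$workdir/second"

# Suppress columns 1 and 2: lines present in both files.
in_both=$(comm -12 "$workdir/first" "$workdir/second")
# Suppress columns 2 and 3: lines found only in the first file.
only_first=$(comm -23 "$workdir/first" "$workdir/second")
# Suppress columns 1 and 3: lines found only in the second file.
only_second=$(comm -13 "$workdir/first" "$workdir/second")

printf 'in both:\n%s\n' "$in_both"
printf 'only in first:\n%s\n' "$only_first"
printf 'only in second:\n%s\n' "$only_second"

rm -r "$workdir"
```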

 

————————————–

head – Output the first part (10 lines) of files

head supports the options -nK and -cK, where K is the number of lines or bytes to be printed.

[tony@localhost ~]$ head -n5 .bash_history #prints 1st 5 lines
su -
lspci
/etc/init.d/NetworkManager restart
/etc/init.d/NetworkManager status
tail -f /var/log/messages 
[tony@localhost ~]$ head -c10 .bash_history   #1st 10 bytes
su -
lspci
[tony@localhost ~]$ head -n-5 .bash_history #prints all but last 5 lines

 

—————————————

tail – Output the last part(10 lines) of files

tail supports -nK and -cK, as head does.
tail -f keeps the file open and prints new data as it is appended to the end of the file (when polling, the default interval is about 1 second).

[root@localhost ~]# tail -f /var/log/messages
   # is used to scan system messages especially when isolating a problem
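
head and tail are often combined to cut a range of lines out of a file. A short sketch using seq to generate numbered input (seq and the variable names are only for illustration); tail -n +K, which starts the output at line K, is shown as well.

```shell
# Lines 4-5 of a 10-line input: take the first 5 lines, then the last 2 of those.
middle=$(seq 1 10 | head -n 5 | tail -n 2)
printf '%s\n' "$middle"

# Everything from line 3 onward.
from_third=$(seq 1 5 | tail -n +3)
printf '%s\n' "$from_third"
```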

 

—————————————

split – Split a file into pieces

Usage: split [OPTION] [INPUT [PREFIX]]

split is generally used to split a file by lines (-l LINES) or by bytes (-b SIZE).
By default, files are split every 1000 lines. The output files are named by appending aa, ab, ac, etc. to PREFIX (x is used if PREFIX is omitted).


[tony@localhost tmp]$ cat > file
aaa
bbb
ggg
ccc
kkk   # Press CTRL+D to stop inputting
[tony@localhost tmp]$ split -l2 file   #split into files with 2 lines
[tony@localhost tmp]$ head xa?   # viewing each of them
==> xaa <==
aaa
bbb

==> xab <==
ggg
ccc

==> xac <==
kkk

[tony@localhost tmp]$ split -b100KB thriller.ogg part 
   # splitting a music file into 100KB pieces (KB means 1000 bytes; K means 1024)
[tony@localhost tmp]$ ls -l parta?
-rw-rw-r--. 1 tony tony 100000 Oct 12 14:47 partaa
-rw-rw-r--. 1 tony tony 100000 Oct 12 14:47 partab
-rw-rw-r--. 1 tony tony   9305 Oct 12 14:47 partac
[tony@localhost tmp]$ cat parta? > thriller2.ogg
   # Joining them back
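
After joining the pieces it is worth checking that nothing was lost. A minimal sketch, assuming a made-up 25-line file in a scratch directory; all file and variable names are hypothetical.

```shell
# Scratch directory and file names are made up for this example.
workdir=$(mktemp -d)
seq 1 25 > "$workdir/numbers.txt"

# Split into pieces of at most 10 lines: piece_aa, piece_ab, piece_ac.
split -l 10 "$workdir/numbers.txt" "$workdir/piece_"

# Count the pieces via the shell's own glob expansion.
set -- "$workdir"/piece_*
pieces=$#

# The shell expands piece_* in sorted order, so cat restores the original order.
cat "$workdir"/piece_* > "$workdir/rejoined.txt"

# cmp -s prints nothing and only sets the exit status: 0 means identical.
if cmp -s "$workdir/numbers.txt" "$workdir/rejoined.txt"; then
    identical=yes
else
    identical=no
fi
printf 'pieces=%s identical=%s\n' "$pieces" "$identical"

rm -r "$workdir"
```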

 

—————————————

paste – merge lines of files

paste joins files horizontally: each output line consists of the corresponding lines of each file, in sequence, separated by tabs.

[tony@localhost tmp]$ cat > Name
john
tom   # Press CTRL+D to stop inputting
[tony@localhost tmp]$ cat > Age
20
19
[tony@localhost tmp]$ paste Name Age
john	20
tom	19

The -s option pastes the lines of one file at a time rather than one line from each file.

[tony@localhost tmp]$ paste -s Name Age
john	tom
20	19

A delimiter list can be specified using -d; the delimiters are used in turn.

[tony@localhost tmp]$ paste -d ':-' Name Age Name
john:20-john
tom:19-tom
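
paste also reads standard input when a file name is given as '-', which allows a couple of handy one-liners. A sketch using seq for sample input; seq and the variable names are only for illustration.

```shell
# Join all input lines into one comma-separated line.
csv=$(seq 1 5 | paste -s -d , -)
printf '%s\n' "$csv"

# Naming standard input three times makes paste take three lines per output line,
# folding the input into three columns.
columns=$(seq 1 6 | paste -d ' ' - - -)
printf '%s\n' "$columns"
```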

 

—————————————

join – join lines of two files on a common field

Usage: join [OPTION] FILE1 FILE2

Join outputs a line for each pair of input lines with identical join fields. Each output line consists of the join field, the remaining fields from FILE1, then the remaining fields from FILE2.
The default join field is the first field, delimited by whitespace; leading blanks on the line are ignored.
Both files have to be sorted on the join field.

[tony@localhost tmp]$ cat > j1
a ac
b bc
c cd   # Press CTRL+D to stop inputting
[tony@localhost tmp]$ cat > j2
a acc
b bcc
c cdd
[tony@localhost tmp]$ join j1 j2
a ac acc
b bc bcc
c cd cdd
[tony@localhost tmp]$ cat >> j2
c dcccc
[tony@localhost tmp]$ join j1 j2
a ac acc
b bc bcc
c cd cdd
c cd dcccc

join supports options such as -1 N (join field in the first file), -2 N (join field in the second file), -i (ignore case), -t x (use x as the field separator for both input and output) and -v N (print only the unpairable lines of file N, where N is 1 or 2).

[tony@localhost tmp]$ join -1 2 -2 1 -t '-' file1 file2
   # this joins on the 2nd field of the first file and the 1st field of the second file, using - as the field separator for input and output
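
The -t and -v options are easier to see with data. A small sketch, assuming two made-up comma-separated files that are already sorted on their first field; the file names, contents and variable names are hypothetical.

```shell
# Scratch directory with two made-up files, both sorted on field 1.
workdir=$(mktemp -d)
printf '%s\n' '1,alice' '2,bob' '3,carol' > "$workdir/names.csv"
printf '%s\n' '1,30' '3,25' > "$workdir/ages.csv"

# Join on the first field, with ',' as the separator for input and output.
joined=$(join -t , "$workdir/names.csv" "$workdir/ages.csv")
printf '%s\n' "$joined"

# -v 1 prints only the lines of the first file that found no partner.
unmatched=$(join -t , -v 1 "$workdir/names.csv" "$workdir/ages.csv")
printf '%s\n' "$unmatched"

rm -r "$workdir"
```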

 

————————————
