How To Remove Duplicate Lines While Maintaining Order in Linux
TL;DR
The most elegant solutions:
# Order not preserved (lines sorted)
sort file.txt | uniq
# Display first occurrence
awk '!v[$0]++' file.txt
# Display last occurrence
tac file.txt | awk '!v[$0]++' | tac
Without Preserving Order
If order doesn't matter, here are two options for removing duplicate lines:
sort file.txt | uniq
sort -u file.txt
uniq only removes adjacent duplicate lines, which is why we sort first.
-u makes sort itself discard duplicate lines, so a separate uniq is unnecessary.
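To see why the sort matters, here is a quick demonstration with an inline input whose duplicates are not adjacent:
printf '111\n222\n111\n' | uniq # all three lines survive: the duplicates aren't adjacent
printf '111\n222\n111\n' | sort | uniq # prints 111 and 222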
Preserving Order
Given the following file.txt:
111
222
222
111
We can either print the first or last occurrences of duplicates:
# First occurrences kept
111
222
# Last occurrences kept
222
111
Print First Occurrence of Duplicates
1. Using cat, sort, cut
cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2-
cat -n prefixes each line with its line number, recording the original order.
sort -uk2 sorts on the line contents (field 2 onward, -k2) and keeps only the first occurrence of each duplicate (-u).
sort -nk1 restores the original order by sorting numerically (-n) on the line numbers in the first field (-k1).
cut -f2- drops the first tab-delimited field, the line number, leaving only the original line.
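Tracing the pipeline on the sample file.txt makes each stage concrete (cat -n's padding trimmed for readability):
cat -n file.txt # 1 111, 2 222, 3 222, 4 111
cat -n file.txt | sort -uk2 # 1 111, 2 222 (first occurrences kept)
cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2- # 111, 222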
Another way to achieve this is to use awk.
2. Using awk
awk '!v[$0]++' file.txt
This command uses a dictionary (a.k.a. map, associative array) v to store each line and its number of occurrences, or frequency, in the file so far.
!v[$0]++ will be run on every line in the file.
$0 holds the value of the current line being processed.
v[$0] looks up how many times the current line has been seen so far.
!v[$0] is true when v[$0] == 0, i.e. when the current line has not been seen before. A true pattern with no action block triggers awk's default action, which is to print the line, so no explicit print statement is needed.
v[$0]++ will increment the frequency of the current line by one.
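The one-liner relies on awk's implicit print; the same logic written out with an explicit action block may be easier to read:
# Equivalent long form: print a line only on its first occurrence
awk '{ if (v[$0] == 0) print $0; v[$0]++ }' file.txt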
Print Last Occurrence of Duplicates
To print the last occurrence of each duplicated line, we can use tac, which prints a file's lines in reverse order.
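A quick illustration:
printf '1\n2\n3\n' | tac # prints 3, 2, 1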
1. Using cat, sort, cut
# Reverse, dedupe keeping first occurrences, then reverse back (no temp files needed)
tac file.txt | cat -n | sort -uk2 | sort -nk1 | cut -f2- | tac
2. Using awk
tac file.txt | awk '!v[$0]++' | tac
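Note that tac is a GNU coreutils tool and may be absent on BSD or macOS (where tail -r is the usual substitute). As a sketch of a tac-free alternative, this two-pass awk program reads the file twice: the first pass records the line number of each line's last occurrence, and the second pass prints a line only at that position.
# Pass 1 (NR == FNR): remember the last line number where each line appears
# Pass 2: print a line only when this is its last recorded occurrence
awk 'NR == FNR { last[$0] = FNR; next } FNR == last[$0]' file.txt file.txt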
Useful Tricks To Know
# Display only unique/duplicate lines
sort file.txt | uniq -u # Unique
sort file.txt | uniq -d # Duplicate
# Prefix each line with its number of occurrences
sort file.txt | uniq -c
# Count occurrences of duplicated lines only
sort file.txt | uniq -dc
# Ignore the first 10 characters when comparing lines
uniq -s 10 file.txt
# Compare at most the first 10 characters of each line
uniq -w 10 file.txt
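As an illustration of -s, suppose a hypothetical app.log whose lines start with a fixed-width "HH:MM:SS " timestamp (9 characters including the trailing space). Sorting by the message first makes equal messages adjacent, and skipping the prefix collapses entries that differ only in their timestamp:
# app.log is a hypothetical file with lines like "12:00:01 ERROR disk full"
# Sort by field 2 onward (the message), then compare while ignoring the timestamp
sort -k2 app.log | uniq -s 9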