How To Remove Duplicate Lines While Maintaining Order in Linux
TL;DR
The most elegant solutions:
# Order not preserved (lines sorted)
sort file.txt | uniq
# Display first occurrence
awk '!v[$0]++' file.txt
# Display last occurrence
tac file.txt | awk '!v[$0]++' | tac
Without Preserving Order
If order doesn’t matter, here are two options for removing duplicate lines:
sort file.txt | uniq
sort -u file.txt
uniq only removes adjacent duplicate lines, which is why we sort first. The -u flag forces unique lines while sorting, so sort -u combines both steps into one command.
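As a quick sanity check, here is a minimal example (the input values are fabricated with printf; note that the output comes back sorted, not in input order):
printf '3\n1\n3\n2\n1\n' | sort -u
# Output:
# 1
# 2
# 3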
Preserving Order
Given the following file.txt:
111
222
222
111
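To follow along, you can create this file with printf (just a shell convenience, not part of the technique itself):
printf '111\n222\n222\n111\n' > file.txt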
We can either print the first or last occurrences of duplicates:
# First    # Last
111        222
222        111
Print First Occurrence of Duplicates
1. Using cat, sort, cut
cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2-
cat -n prepends a line number to each line, recording the original order.
sort -uk2 sorts the lines on the second field (-k2) and keeps only the first occurrence of each duplicate (-u).
sort -nk1 restores the original order by sorting on the line numbers in the first field (-k1), treating them as numbers (-n).
cut -f2- prints only the second field onward, which is the line itself.
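To see why this works, here is each stage of the pipeline run against the sample file.txt above (output shown in comments; this assumes GNU coreutils, where sort -u keeps the first line of each run of equal keys):
cat -n file.txt
#  1  111
#  2  222
#  3  222
#  4  111
cat -n file.txt | sort -uk2
#  1  111
#  2  222
cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2-
# 111
# 222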
Another way to achieve this is to use awk.
2. Using awk
awk '!v[$0]++' file.txt
This command uses a dictionary (a.k.a. map, associative array) v to store each line and its number of occurrences, or frequency, in the file so far.
!v[$0]++ is run on every line in the file.
$0 holds the value of the current line being processed.
v[$0] looks up the number of occurrences of the current line so far.
!v[$0] returns true when v[$0] == 0, i.e. when the current line has not been seen before. This is when the line is printed (the print statement is omitted because printing the line is awk's default action for a true pattern).
v[$0]++ then increments the frequency of the current line by one.
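The same logic written out in long form (an equivalent, more explicit version for illustration):
awk '{ if (v[$0] == 0) print $0; v[$0]++ }' file.txt
Because a true pattern triggers awk's default print action, the terse one-liner behaves identically.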
Print Last Occurrence of Duplicates
To print the last occurrence of each duplicate line, we can use tac, which prints the specified file in reverse line order.
1. Using cat, sort, cut
tac file.txt | cat -n | sort -uk2 | sort -nk1 | cut -f2- | tac
2. Using awk
tac file.txt | awk '!v[$0]++' | tac
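For the sample file.txt, the intermediate and final outputs look like this: reversing the file first turns last occurrences into first occurrences, and the final tac restores the original direction.
tac file.txt | awk '!v[$0]++'
# 111
# 222
tac file.txt | awk '!v[$0]++' | tac
# 222
# 111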
Useful Tricks To Know
# Display only unique/duplicate lines
sort file.txt | uniq -u # Unique
sort file.txt | uniq -d # Duplicate
# Prefix each line with its occurrence count
sort file.txt | uniq -uc # Unique
sort file.txt | uniq -dc # Duplicate
# Skip the first 10 characters when comparing
uniq -s 10 file.txt
# Compare at most the first 10 characters
uniq -w 10 file.txt
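Two small demonstrations of -s and -w (the input lines are fabricated with printf for illustration):
# Skip the first 2 characters, so '1 foo' and '2 foo' compare equal
printf '1 foo\n2 foo\n3 bar\n' | uniq -s 2
# 1 foo
# 3 bar
# Compare only the first 3 characters, so 'abc-1' and 'abc-2' compare equal
printf 'abc-1\nabc-2\nxyz-1\n' | uniq -w 3
# abc-1
# xyz-1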