How To Remove Duplicate Lines While Maintaining Order in Linux


TL;DR

The most elegant solutions:

# Order not preserved (lines sorted)
sort file.txt | uniq
# Display first occurrence
awk '!v[$0]++' file.txt
# Display last occurrence
tac file.txt | awk '!v[$0]++' | tac

Without Preserving Order

If order doesn’t matter, here are two options for removing duplicate lines:

sort file.txt | uniq
sort -u file.txt

uniq only removes adjacent duplicate lines, which is why we sort first.
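A quick demo of why the sort matters, with the sample lines inlined via printf (output as produced by GNU coreutils):

$ printf '111\n222\n222\n111\n' | uniq    # only the adjacent 222s collapse
111
222
111
$ printf '111\n222\n222\n111\n' | sort | uniq
111
222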

-u makes sort output only unique lines, combining the sort and uniq steps into a single command.
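On the same sample input, sort -u produces the same output as sort piped into uniq:

$ printf '111\n222\n222\n111\n' | sort -u
111
222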

Preserving Order

Given the following file.txt:

111
222
222
111

We can print either the first or the last occurrence of each duplicated line:

# First    # Last
111        222
222        111

1. Using cat, sort, cut

cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2-

cat -n prefixes each line with its line number, recording the original order so it can be restored later.

sort -uk2 sorts the lines by the second field onward (-k2) and keeps only the first occurrence of each duplicate (-u).

sort -nk1 restores the original order by sorting on the line numbers in the first field (-k1), treating them as numbers (-n).

cut -f2- prints everything from the second field onward, i.e. the line itself, dropping the line numbers.
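Running each stage on the sample file.txt shows the transformation step by step. cat -n separates the number and the line with a tab; the padding shown here is how GNU coreutils renders it and may vary. Keeping the first occurrence relies on GNU sort, where -u outputs the first of a run of equal-keyed lines and equal keys stay in input order:

$ cat -n file.txt
     1  111
     2  222
     3  222
     4  111
$ cat -n file.txt | sort -uk2
     1  111
     2  222
$ cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2-
111
222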

Another way to achieve this is to use awk.

2. Using awk

awk '!v[$0]++' file.txt

This command uses an associative array (a.k.a. map, dictionary) v to store each line and its number of occurrences, or frequency, seen so far.

!v[$0]++ will be run on every line in the file.

$0 holds the value of the current line being processed.

v[$0] looks up the number of occurrences of the current line so far (0 the first time a line is seen).

!v[$0] returns true when v[$0] == 0, i.e. when the current line has not been seen before. This is when the line is printed: a pattern with no action makes awk execute its default action, which is to print the line.

v[$0]++ increments the frequency of the current line by one. Since ++ is a post-increment, the test above sees the value from before the increment.
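The same logic written out longhand, with the implicit print and the counter update made explicit (this sketch is equivalent to the one-liner above):

awk '{ if (v[$0] == 0) print; v[$0]++ }' file.txt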

To print the last occurrence of each duplicated line instead, we can use tac, which prints a file's lines in reverse order.

1. Using cat, sort, cut

tac file.txt | cat -n | sort -uk2 | sort -nk1 | cut -f2- | tac

2. Using awk

tac file.txt | awk '!v[$0]++' | tac
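Both variants produce the last occurrences from the sample file.txt:

$ tac file.txt | awk '!v[$0]++' | tac
222
111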

Useful Tricks To Know

# Display only non-duplicated/duplicated lines
sort file.txt | uniq -u # Lines appearing exactly once
sort file.txt | uniq -d # Lines appearing more than once
# Display the number of occurrences per line
sort file.txt | uniq -c
sort file.txt | uniq -dc # Duplicated lines only
# Skip the first 10 characters when comparing
uniq -s 10 file.txt
# Compare no more than the first 10 characters
uniq -w 10 file.txt
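For example, on the sample file.txt (the count padding is how GNU uniq renders it; -w is a GNU extension):

$ sort file.txt | uniq -d
111
222
$ sort file.txt | uniq -c
      2 111
      2 222
$ printf 'apple pie\napple tart\nbanana\n' | uniq -w 5    # first 5 chars match, so 'apple tart' is dropped
apple pie
banana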