How To Remove Duplicate Lines While Maintaining Order in Linux
TL;DR
The most elegant solutions:
# Order not preserved (lines sorted)
sort file.txt | uniq
# Display first occurrence
awk '!v[$0]++' file.txt
# Display last occurrence
tac file.txt | awk '!v[$0]++' | tac
Without Preserving Order
If order doesn't matter, here are two options for removing duplicate lines:
sort file.txt | uniq
sort -u file.txt
uniq only removes adjacent duplicate lines, which is why we sort first.
-u makes sort itself discard duplicate lines, so a separate uniq is unnecessary.
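To see why the sort matters, here is a quick demonstration with an inline input whose duplicates are not adjacent:
printf '111\n222\n111\n' | uniq # all three lines survive: the duplicates aren't adjacent
printf '111\n222\n111\n' | sort | uniq # prints 111 and 222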
Preserving Order
Given the following file.txt:
111
222
222
111
We can either print the first or last occurrences of duplicates:
# First occurrences kept
111
222
# Last occurrences kept
222
111
Print First Occurrence of Duplicates
1. Using cat, sort, cut
cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2-
cat -n prefixes each line with its line number, recording the original order.
sort -uk2 sorts on the line contents (field 2 onward, -k2) and keeps only the first occurrence of each duplicate (-u).
sort -nk1 restores the original order by sorting numerically (-n) on the line numbers in the first field (-k1).
cut -f2- drops the first tab-delimited field, the line number, leaving only the original line.
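Tracing the pipeline on the sample file.txt makes each stage concrete (cat -n's padding trimmed for readability):
cat -n file.txt # 1 111, 2 222, 3 222, 4 111
cat -n file.txt | sort -uk2 # 1 111, 2 222 (first occurrences kept)
cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2- # 111, 222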
Another way to achieve this is to use awk.
2. Using awk
awk '!v[$0]++' file.txt
This command uses a dictionary (a.k.a. map, associative array) v to store each line and its number of occurrences, or frequency, in the file so far.
!v[$0]++ will be run on every line in the file.
$0 holds the value of the current line being processed.
v[$0] looks up how many times the current line has been seen so far.
!v[$0] is true when v[$0] == 0, i.e. when the current line has not been seen before. A true pattern with no action block triggers awk's default action, which is to print the line, so no explicit print statement is needed.
v[$0]++ will increment the frequency of the current line by one.
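The one-liner relies on awk's implicit print; the same logic written out with an explicit action block may be easier to read:
# Equivalent long form: print a line only on its first occurrence
awk '{ if (v[$0] == 0) print $0; v[$0]++ }' file.txt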
Print Last Occurrence of Duplicates
To print the last occurrence of each duplicated line, we can use tac, which prints a file's lines in reverse order.
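A quick illustration:
printf '1\n2\n3\n' | tac # prints 3, 2, 1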
1. Using cat, sort, cut
# Reverse, dedupe keeping first occurrences, then reverse back (no temp files needed)
tac file.txt | cat -n | sort -uk2 | sort -nk1 | cut -f2- | tac
2. Using awk
tac file.txt | awk '!v[$0]++' | tac
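Note that tac is a GNU coreutils tool and may be absent on BSD or macOS (where tail -r is the usual substitute). As a sketch of a tac-free alternative, this two-pass awk program reads the file twice: the first pass records the line number of each line's last occurrence, and the second pass prints a line only at that position.
# Pass 1 (NR == FNR): remember the last line number where each line appears
# Pass 2: print a line only when this is its last recorded occurrence
awk 'NR == FNR { last[$0] = FNR; next } FNR == last[$0]' file.txt file.txt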
Useful Tricks To Know
# Display only unique/duplicate lines
sort file.txt | uniq -u # Unique
sort file.txt | uniq -d # Duplicate
# Prefix each line with its number of occurrences
sort file.txt | uniq -c
# Count occurrences of duplicated lines only
sort file.txt | uniq -dc
# Ignore the first 10 characters when comparing lines
uniq -s 10 file.txt
# Compare at most the first 10 characters of each line
uniq -w 10 file.txt
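As an illustration of -s, suppose a hypothetical app.log whose lines start with a fixed-width "HH:MM:SS " timestamp (9 characters including the trailing space). Sorting by the message first makes equal messages adjacent, and skipping the prefix collapses entries that differ only in their timestamp:
# app.log is a hypothetical file with lines like "12:00:01 ERROR disk full"
# Sort by field 2 onward (the message), then compare while ignoring the timestamp
sort -k2 app.log | uniq -s 9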