Edit distances
June 27, 2024
Comparing strings
String distances are a way to measure the similarity between two strings. They are useful in a variety of applications, such as spell checking, plagiarism detection, and DNA sequence alignment.
How do we compare two strings? We want a function that can take two strings and return a number that represents how similar they are. There are many ways to do this, and no one-size-fits-all solution. Here are a few common string distances:
Jaccard similarity
One approach is to treat the strings as sets of characters and compare them using set operations. The Jaccard similarity is the size of the intersection divided by the size of the union of the two sets. This is a simple and fast way to compare strings, but it doesn't take into account the order of the characters, which is usually important.
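As a minimal sketch in Python (the example strings are just for illustration):

    def jaccard_similarity(a: str, b: str) -> float:
        """Jaccard similarity of the character sets of two strings."""
        set_a, set_b = set(a), set(b)
        if not set_a and not set_b:
            return 1.0  # both strings empty: treat as identical
        return len(set_a & set_b) / len(set_a | set_b)

    # "listen" and "silent" are anagrams, so their character sets are identical
    # and the similarity is 1.0 even though the strings differ:
    print(jaccard_similarity("listen", "silent"))  # 1.0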
Hamming distance
One way to simplify the problem is to assume that the strings are the same length. Then one natural way to compare two strings is to count up the number of positions at which the corresponding characters are different. This is called the Hamming distance.
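A sketch in Python, assuming we simply reject strings of unequal length:

    def hamming_distance(a: str, b: str) -> int:
        """Number of positions at which two equal-length strings differ."""
        if len(a) != len(b):
            raise ValueError("Hamming distance requires strings of equal length")
        return sum(x != y for x, y in zip(a, b))

    print(hamming_distance("karolin", "kathrin"))  # 3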
Levenshtein distance
In practice, we often want to compare strings of different lengths. In that case, we need to allow for insertions and deletions. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
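A rough dynamic-programming sketch in Python, keeping only one previous row of the table:

    def levenshtein_distance(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions, and
        substitutions needed to turn a into b."""
        # prev[j] holds the distance between the first i-1 characters of a
        # and the first j characters of b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        return prev[-1]

    print(levenshtein_distance("kitten", "sitting"))  # 3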
Longest common subsequence
We can also compare strings by looking at the longest common subsequence: the longest sequence of characters that appears in the same order (though not necessarily contiguously) in both strings. The longer the common subsequence, the more similar the strings.
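A sketch of the usual dynamic-programming recurrence, again keeping just one previous row (the example strings are arbitrary):

    def lcs_length(a: str, b: str) -> int:
        """Length of the longest common subsequence of a and b."""
        # prev[j] = LCS length of the processed prefix of a and b[:j].
        prev = [0] * (len(b) + 1)
        for ca in a:
            curr = [0]
            for j, cb in enumerate(b, start=1):
                if ca == cb:
                    curr.append(prev[j - 1] + 1)
                else:
                    curr.append(max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]

    # "MZJAWXU" and "XMJYAUZ" share the subsequence "MJAU":
    print(lcs_length("MZJAWXU", "XMJYAUZ"))  # 4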
q-gram distance
Another approach is to break the strings into q-grams (substrings of length q) and compare the sets of q-grams. This is called the q-gram distance.
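One common version (Ukkonen's q-gram distance) compares counts rather than plain sets, summing how often each q-gram's frequency differs between the two strings. A sketch of that version:

    from collections import Counter

    def qgram_distance(a: str, b: str, q: int = 2) -> int:
        """Sum of count differences over all q-grams occurring in either string."""
        grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
        grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
        return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())

    # 2-grams of "night": ni, ig, gh, ht; of "nacht": na, ac, ch, ht.
    # Only "ht" is shared, so 3 + 3 = 6 q-grams are unmatched.
    print(qgram_distance("night", "nacht", q=2))  # 6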
Damerau-Levenshtein distance
We can get fancier by allowing for transpositions of adjacent characters, which are a common kind of spelling mistake. That yields the Damerau-Levenshtein distance.
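The sketch below implements the simpler "optimal string alignment" (restricted) variant, which just adds an adjacent-transposition case to the Levenshtein recurrence:

    def osa_distance(a: str, b: str) -> int:
        """Restricted Damerau-Levenshtein (optimal string alignment) distance:
        Levenshtein edits plus transpositions of adjacent characters."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,           # deletion
                              d[i][j - 1] + 1,           # insertion
                              d[i - 1][j - 1] + cost)    # substitution or match
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[-1][-1]

    # "hte" is one transposition away from "the" (but two Levenshtein edits):
    print(osa_distance("hte", "the"))  # 1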
Parameterized distances
All of these distances assume that substitutions, insertions, and deletions are equally costly. But in practice, we might want to assign different costs to these operations. For example, we might want to say that a substitution is more costly than an insertion or deletion. If we score each operation differently, then we need to generalize our distance metrics.
Needleman-Wunsch
The Needleman-Wunsch algorithm is a generalization of the Levenshtein distance that allows for different costs for insertions, deletions, and substitutions. It is most commonly used in bioinformatics to align DNA sequences.
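A score-only sketch in Python; the match/mismatch/gap values are illustrative defaults, not a standard scoring scheme (real applications typically use substitution matrices):

    def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
        """Global alignment score with configurable match/mismatch/gap scores.
        With match=0, mismatch=-1, gap=-1, the score is minus the Levenshtein distance."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag,
                                  score[i - 1][j] + gap,   # gap in b
                                  score[i][j - 1] + gap)   # gap in a
        return score[-1][-1]

    print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with the default scores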
Smith-Waterman
Needleman-Wunsch is great for global alignment (aligning the entire sequences), but what if we want to find local alignments (aligning only parts of the sequences)? The Smith-Waterman algorithm is a variation of Needleman-Wunsch that finds the best local alignment.
The Needleman-Wunsch and Smith-Waterman algorithms also produce an alignment, which tells you where exactly the two strings match up. This can be useful for identifying similar regions in DNA sequences or finding similar words in a document.
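To make that concrete, here is a rough Smith-Waterman sketch in Python that returns both the best local score and the aligned substrings via a traceback; the scoring values and example strings are illustrative only:

    def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1):
        """Local alignment: best score plus the aligned substrings."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        best, best_pos = 0, (0, 0)
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
                if score[i][j] > best:
                    best, best_pos = score[i][j], (i, j)
        # Trace back from the best cell until the score drops to zero.
        aligned_a, aligned_b = [], []
        i, j = best_pos
        while i > 0 and j > 0 and score[i][j] > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1]); i -= 1; j -= 1
            elif score[i][j] == score[i - 1][j] + gap:
                aligned_a.append(a[i - 1]); aligned_b.append("-"); i -= 1
            else:
                aligned_a.append("-"); aligned_b.append(b[j - 1]); j -= 1
        return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))

    # Only the "ACGT" region matches, and the surrounding characters are ignored:
    print(smith_waterman("xxxACGTyyy", "zzACGTzz"))  # (8, 'ACGT', 'ACGT')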