Edit distances
June 27, 2024
Comparing strings
String distances are a way to measure the similarity between two strings. They are useful in a variety of applications, such as spell checking, plagiarism detection, and DNA sequence alignment.
How do we compare two strings? We want a function that can take two strings and return a number that represents how similar they are. There are many ways to do this, and no one-size-fits-all solution. Here are a few common string distances:
Jaccard similarity
One approach is to treat the strings as sets of characters and compare them using set operations. The Jaccard similarity is the size of the intersection divided by the size of the union of the two sets. This is a simple and fast way to compare strings, but it doesn't take into account the order of the characters, which is usually important.
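As a minimal sketch in Python (the example strings are just for illustration):

    def jaccard_similarity(a: str, b: str) -> float:
        """Jaccard similarity of the character sets of two strings."""
        set_a, set_b = set(a), set(b)
        if not set_a and not set_b:
            return 1.0  # both strings empty: treat as identical
        return len(set_a & set_b) / len(set_a | set_b)

    # "listen" and "silent" are anagrams, so their character sets are identical
    # and the similarity is 1.0 even though the strings differ:
    print(jaccard_similarity("listen", "silent"))  # 1.0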
Hamming distance
One way to simplify the problem is to assume that the strings are the same length. Then one natural way to compare two strings is to count up the number of positions at which the corresponding characters are different. This is called the Hamming distance.
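A sketch in Python, assuming we simply reject strings of unequal length:

    def hamming_distance(a: str, b: str) -> int:
        """Number of positions at which two equal-length strings differ."""
        if len(a) != len(b):
            raise ValueError("Hamming distance requires strings of equal length")
        return sum(x != y for x, y in zip(a, b))

    print(hamming_distance("karolin", "kathrin"))  # 3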
Levenshtein distance
In practice, we often want to compare strings of different lengths. In that case, we need to allow for insertions and deletions. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
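A rough dynamic-programming sketch in Python, keeping only one previous row of the table:

    def levenshtein_distance(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions, and
        substitutions needed to turn a into b."""
        # prev[j] holds the distance between the first i-1 characters of a
        # and the first j characters of b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        return prev[-1]

    print(levenshtein_distance("kitten", "sitting"))  # 3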
Longest common subsequence
We can also compare strings by looking at the longest common subsequence: the longest sequence of characters that appears in the same order (though not necessarily contiguously) in both strings. The longer the common subsequence, the more similar the strings.
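A sketch of the usual dynamic-programming recurrence, again keeping just one previous row (the example strings are arbitrary):

    def lcs_length(a: str, b: str) -> int:
        """Length of the longest common subsequence of a and b."""
        # prev[j] = LCS length of the processed prefix of a and b[:j].
        prev = [0] * (len(b) + 1)
        for ca in a:
            curr = [0]
            for j, cb in enumerate(b, start=1):
                if ca == cb:
                    curr.append(prev[j - 1] + 1)
                else:
                    curr.append(max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]

    # "MZJAWXU" and "XMJYAUZ" share the subsequence "MJAU":
    print(lcs_length("MZJAWXU", "XMJYAUZ"))  # 4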
q-gram distance
Another approach is to break the strings into q-grams (substrings of length q) and compare the sets of q-grams. This is called the q-gram distance.
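One common version (Ukkonen's q-gram distance) compares counts rather than plain sets, summing how often each q-gram's frequency differs between the two strings. A sketch of that version:

    from collections import Counter

    def qgram_distance(a: str, b: str, q: int = 2) -> int:
        """Sum of count differences over all q-grams occurring in either string."""
        grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
        grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
        return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())

    # 2-grams of "night": ni, ig, gh, ht; of "nacht": na, ac, ch, ht.
    # Only "ht" is shared, so 3 + 3 = 6 q-grams are unmatched.
    print(qgram_distance("night", "nacht", q=2))  # 6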
Damerau-Levenshtein distance
We can get fancier by allowing for transpositions of adjacent characters, which are a common kind of spelling mistake. That yields the Damerau-Levenshtein distance.
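The sketch below implements the simpler "optimal string alignment" (restricted) variant, which just adds an adjacent-transposition case to the Levenshtein recurrence:

    def osa_distance(a: str, b: str) -> int:
        """Restricted Damerau-Levenshtein (optimal string alignment) distance:
        Levenshtein edits plus transpositions of adjacent characters."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,           # deletion
                              d[i][j - 1] + 1,           # insertion
                              d[i - 1][j - 1] + cost)    # substitution or match
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[-1][-1]

    # "hte" is one transposition away from "the" (but two Levenshtein edits):
    print(osa_distance("hte", "the"))  # 1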
Parameterized distances
All of these distances assume that substitutions, insertions, and deletions are equally costly. But in practice, we might want to assign different costs to these operations. For example, we might want to say that a substitution is more costly than an insertion or deletion. If we score each operation differently, then we need to generalize our distance metrics.
Needleman-Wunsch
The Needleman-Wunsch algorithm is a generalization of the Levenshtein distance that allows for different costs for insertions, deletions, and substitutions. It is most commonly used in bioinformatics to align DNA sequences.
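A score-only sketch in Python; the match/mismatch/gap values are illustrative defaults, not a standard scoring scheme (real applications typically use substitution matrices):

    def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
        """Global alignment score with configurable match/mismatch/gap scores.
        With match=0, mismatch=-1, gap=-1, the score is minus the Levenshtein distance."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag,
                                  score[i - 1][j] + gap,   # gap in b
                                  score[i][j - 1] + gap)   # gap in a
        return score[-1][-1]

    print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with the default scores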
Smith-Waterman
Needleman-Wunsch is great for global alignment (aligning the entire sequences), but what if we want to find local alignments (aligning only parts of the sequences)? The Smith-Waterman algorithm is a variation of Needleman-Wunsch that finds the best local alignment.
The Needleman-Wunsch and Smith-Waterman algorithms also produce an alignment, which tells you where exactly the two strings match up. This can be useful for identifying similar regions in DNA sequences or finding similar words in a document.
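To make that concrete, here is a rough Smith-Waterman sketch in Python that returns both the best local score and the aligned substrings via a traceback; the scoring values and example strings are illustrative only:

    def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1):
        """Local alignment: best score plus the aligned substrings."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        best, best_pos = 0, (0, 0)
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
                if score[i][j] > best:
                    best, best_pos = score[i][j], (i, j)
        # Trace back from the best cell until the score drops to zero.
        aligned_a, aligned_b = [], []
        i, j = best_pos
        while i > 0 and j > 0 and score[i][j] > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1]); i -= 1; j -= 1
            elif score[i][j] == score[i - 1][j] + gap:
                aligned_a.append(a[i - 1]); aligned_b.append("-"); i -= 1
            else:
                aligned_a.append("-"); aligned_b.append(b[j - 1]); j -= 1
        return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))

    # Only the "ACGT" region matches, and the surrounding characters are ignored:
    print(smith_waterman("xxxACGTyyy", "zzACGTzz"))  # (8, 'ACGT', 'ACGT')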