To implement a function for approximate string matching, we need a reasonable metric to know how similar two strings are. I decided to use Damerau-Levenshtein distance as the metric to achieve the goal.
Damerau-Levenshtein Distance
The Damerau-Levenshtein distance is defined as the minimum number of primitive edit operations needed to transform one string into the other and these operations are substitution, deletion, insertion and the transposition of two adjacent characters.
ABC → AXC Substitution
ABC → AC Deletion
ABC → ABXC Insertion
ABC → ACB Transposition
Here is the Delphi implementation of Damerau-Levenshtein distance algorithm:
function DamerauLevenshteinDistance(const Str1, Str2: String): Integer; function Min(const A, B, C: Integer): Integer; begin Result := A; if B < Result then Result := B; if C < Result then Result := C; end; var LenStr1, LenStr2: Integer; I, J, T, Cost, PrevCost: Integer; pStr1, pStr2, S1, S2: PChar; D: PIntegerArray; begin LenStr1 := Length(Str1); LenStr2 := Length(Str2); // save a bit memory by making the second index points to the shorter string if LenStr1 < LenStr2 then begin T := LenStr1; LenStr1 := LenStr2; LenStr2 := T; pStr1 := PChar(Str2); pStr2 := PChar(Str1); end else begin pStr1 := PChar(Str1); pStr2 := PChar(Str2); end; // bypass leading identical characters while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin Inc(pStr1); Inc(pStr2); Dec(LenStr1); Dec(LenStr2); end; // bypass trailing identical characters while (LenStr2 <> 0) and ((pStr1 + LenStr1 - 1)^ = (pStr2 + LenStr2 - 1)^) do begin Dec(LenStr1); Dec(LenStr2); end; // is the shorter string empty? so, the edit distance is length of the longer one if LenStr2 = 0 then begin Result := LenStr1; Exit; end; // calculate the edit distance GetMem(D, (LenStr2 + 1) * SizeOf(Integer)); for I := 0 to LenStr2 do D[I] := I; S1 := pStr1; for I := 1 to LenStr1 do begin PrevCost := I - 1; Cost := I; S2 := pStr2; for J := 1 to LenStr2 do begin if (S1^ = S2^) or ((I > 1) and (J > 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^)) then Cost := PrevCost else Cost := 1 + Min(Cost, PrevCost, D[J]); PrevCost := D[J]; D[J] := Cost; Inc(S2); end; Inc(S1); end; Result := D[LenStr2]; FreeMem(D); end; |
Because DamerauLevenshteinDistance
function is normally called inside a loop, I had to optimize my implementation. Otherwise, the original algorithm is smaller and more readable.
Similarity Ratio
Now that we have a function to find out how similar two strings are, we can use it to calculate the similarity ratio of the two strings to approximately match them.
function StringSimilarityRatio(const Str1, Str2: String; IgnoreCase: Boolean): Double; var MaxLen: Integer; Distance: Integer; begin Result := 1.0; if Length(Str1) > Length(Str2) then MaxLen := Length(Str1) else MaxLen := Length(Str2); if MaxLen <> 0 then begin if IgnoreCase then Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2)) else Distance := DamerauLevenshteinDistance(Str1, Str2); Result := Result - (Distance / MaxLen); end; end; |
The return value of StringSimilarityRatio
function is a floating-point number between 0 (zero) and 1 (one), where 0 means not similar at all and 1 means equal strings.
UPDATE: The algorithm is updated to the one that is optimized by data man.
Thank for post.
I added
after
and the speed has increased slightly, if the line endings match.
Did you add the lines or replace the old ones?
———————————————-
I added
after
and the speed has increased slightly, if the line endings match.
—————————————————————————-
Thank you for the algorithm. There seems to be a little bug.
When initializing D, I should loop from 0 to T-1 (which will be equal to LenStr2) :
for I := 0 to T-1 do
D[I] := I;
Regards
@Michel: You’re right. Thanks for your attention!