## 09 June 2008

### Edit distance in Scala

How handy.

I was just looking for a way to highlight differences between 2 strings for my Behavior-Driven Development library, specs. And reading the excellent Algorithm Design Manual. Intuitively, I was thinking that there was a better way to show string differences than the one used in jUnit:

`Expected kit<t>en but was kit<ch>en`
`Expected <skate> but was <kite>`The first message shows that <t> has been replaced by <ch>, while the second message says that everything is different! There must be a way to find the minimal set of operations necessary to transform one string to another, right?

The Levenshtein distance

There is indeed. The "Edit" distance, also called "Levenshtein" distance, computes exactly this, the minimal number of insertions, deletions and substitutions required to transform "World" into "Peace" (5, they're very far apart,...).

The computation is usually done by:
1. examining the first 2 or last 2 letters of each string
2. assuming the eventual transformation which may occur for those 2 letters:
• they're equal: "h..." and "h...." --> distance = 0
• the first one has been removed: "ah..." and "h..." --> distance = 1
• the second one has been inserted: "h..." and "ah..." --> distance = 1
• they've been substituted: "ha..." and "ga..." --> distance = 1
• 3. saying that the final distance is the minimum of the distance when choosing one the
• 4 possibilities + the distance resulting from the consequences of that choice

• You can notice that many variations are possible: the distance could be different depending on the letters being added or removed, there could be more allowed operations such as the swapping of 2 letters (distance 1 instead of 2 substitutions of distance 2),...

Then the final result can either be constructed incrementally or recursively. Incrementally, for "skate" and "kite" you will compute costs from any small prefix to another prefix:
• "s" to "k" --> distance 1
• "s" to "ki" --> distance 2

• "sk" to "ki" --> distance 2
• "ska" to "kit" --> distance 3
Then you increase the prefix size and you reuse the distance numbers found for smaller prefixes. The different results can be stored in the following matrix:

s k a t e

k 1 1 2 3 4
i 2 2 2 3 4
t 3 3 3 2 3
e 4 4 4 4 2

Recursively, you do the inverse and you establish that the distance between 2 strings can be computed from knowing the distance between smaller prefixes and you travel the matrix to its upper left corner.

This is a very good example of Dynamic Programming, where the optimal quantity (the distance at [i, j] in the matrix) you're looking for only depends on:
• the optimal quantity for a subset of the data (the distance at [i-1, j-1], [i-1, j], [i, j-1])
• the resulting state (the substrings defined by a position [i-x, j-x] in the matrix)
• not how you got to that resulting state
The EditDistance trait

For the record, I will write here the code I'm using for the implementation, which is constructing the solution incrementally, as opposed to the recursive solution presented on Tony's blog, here.

`trait EditDistance { /** * Class encapsulating the functions related to the edit distance of 2 strings */ case class EditMatrix(s1: String, s2: String) { /* matrix containing the edit distance for any prefix of s1 and s2: matrix(i)(j) = edit distance(s1[0..i], s[0..j])*/ val matrix = new Array[Array[int]](s1.length + 1, s2.length + 1) /* initializing the matrix */ for (i <- 0 to s1.length; j <- 0 to s2.length) { if (i == 0) matrix(i)(j) = j // j insertions else if (j == 0) matrix(i)(j) = i // i suppressions else matrix(i)(j) = min(matrix(i - 1)(j) + 1, // suppression matrix(i - 1)(j - 1) + (if (s1(i - 1) == s2(j - 1)) 0 else 1), // substitution matrix(i)(j - 1) + 1) // insertion } /** @return the edit distance between 2 strings */ def distance = matrix(s1.length)(s2.length) /** prints the edit matrix of 2 strings */ def print = { for (i <- 0 to s1.length) { def row = for (j <- 0 to s2.length) yield matrix(i)(j) println(row.mkString("|")) } this } /** @return a (String, String) displaying the differences between each input strings. The used separators are parenthesis: '(' and ')'*/ def showDistance: (String, String) = showDistance("()") /** * @param sep separators used to hightlight differences. If sep is empty, * then no separator is used. If sep contains one character, it is * taken as the unique separator. If sep contains 2 or more characters, * the first 2 characters are taken as opening separator and closing * separator. * * @return a (String, String) displaying the differences between each * input strings. The used separators are specified by the caller. */ def showDistance(sep: String) = { val (firstSeparator, secondSeparator) = separators(sep) def modify(s: String, c: Char): String = modifyString(s, c.toString) def modifyString(s: String, mod: String): String = { (firstSeparator + mod + secondSeparator + s). removeAll(secondSeparator + firstSeparator) } def findOperations(dist: Int, i: Int, j:Int, s1mod: String, s2mod: String): (String, String) = { if (i == 0 && j == 0) { ("", "") } else if (i == 1 && j == 1) { if (dist == 0) (s1(0) + s1mod, s2(0) + s2mod) else (modify(s1mod, s1(0)), modify(s2mod, s2(0))) } else if (j < 1) (modifyString(s1mod, s1.slice(0, i)), s2mod) else if (i < 1) (s1mod, modifyString(s2mod, s2.slice(0, j))) else { val (suppr, subst, ins) = (matrix(i - 1)(j), matrix(i - 1)(j - 1), matrix(i)(j - 1)) if (suppr < subst) findOperations(suppr, i - 1, j, modify(s1mod, s1(i - 1)), s2mod) else if (ins < subst) findOperations(ins, i, j - 1, s1mod, modify(s2mod, s2(j - 1))) else if (subst < dist) findOperations(subst, i - 1, j - 1, modify(s1mod, s1(i - 1)), modify(s2mod, s2(j - 1))) else findOperations(subst, i - 1, j - 1, s1(i - 1) + s1mod, s2(j - 1) + s2mod) } } findOperations(distance, s1.length, s2.length, "", "") } def min(suppr: Int, subst: Int, ins: =>Int) = { if(suppr < subst) suppr else if (ins < subst) ins else subst } } def editDistance(s1: String, s2: String): Int = EditMatrix(s1, s2).distance def showMatrix(s1: String, s2: String) = EditMatrix(s1, s2).print def showDistance(s1: String, s2: String) = EditMatrix(s1, s2).showDistance def showDistance(s1: String, s2: String, sep: String) = EditMatrix(s1, s2).showDistance(sep) private def separators(s: String) = (firstSeparator(s), secondSeparator(s)) private def firstSeparator(s: String) = if (s.isEmpty) "" else s(0).toString private def secondSeparator(s: String) = { if (s.size < 2) firstSeparator(s) else s(1).toString }}`