Pages

09 June 2008

Edit distance in Scala

How handy.

I was just looking for a way to highlight differences between 2 strings for my Behavior-Driven Development library, specs. And reading the excellent Algorithm Design Manual. Intuitively, I was thinking that there was a better way to show string differences than the one used in jUnit:

Expected kit<t>en but was kit<ch>en
Expected <skate> but was <kite>

The first message shows that <t> has been replaced by <ch>, while the second message says that everything is different! There must be a way to find the minimal set of operations necessary to transform one string to another, right?

The Levenshtein distance

There is indeed. The "Edit" distance, also called "Levenshtein" distance, computes exactly this, the minimal number of insertions, deletions and substitutions required to transform "World" into "Peace" (5, they're very far apart,...).

The computation is usually done by:
  1. examining the first 2 or last 2 letters of each string
  2. assuming the eventual transformation which may occur for those 2 letters:
  • they're equal: "h..." and "h...." --> distance = 0
  • the first one has been removed: "ah..." and "h..." --> distance = 1
  • the second one has been inserted: "h..." and "ah..." --> distance = 1
  • they've been substituted: "ha..." and "ga..." --> distance = 1
  • 3. saying that the final distance is the minimum of the distance when choosing one the
  • 4 possibilities + the distance resulting from the consequences of that choice

  • You can notice that many variations are possible: the distance could be different depending on the letters being added or removed, there could be more allowed operations such as the swapping of 2 letters (distance 1 instead of 2 substitutions of distance 2),...

    Then the final result can either be constructed incrementally or recursively. Incrementally, for "skate" and "kite" you will compute costs from any small prefix to another prefix:
    • "s" to "k" --> distance 1
    • "s" to "ki" --> distance 2

    • "sk" to "ki" --> distance 2
    • "ska" to "kit" --> distance 3
    Then you increase the prefix size and you reuse the distance numbers found for smaller prefixes. The different results can be stored in the following matrix:

    s k a t e

    k 1 1 2 3 4
    i 2 2 2 3 4
    t 3 3 3 2 3
    e 4 4 4 4 2

    Recursively, you do the inverse and you establish that the distance between 2 strings can be computed from knowing the distance between smaller prefixes and you travel the matrix to its upper left corner.

    This is a very good example of Dynamic Programming, where the optimal quantity (the distance at [i, j] in the matrix) you're looking for only depends on:
    • the optimal quantity for a subset of the data (the distance at [i-1, j-1], [i-1, j], [i, j-1])
    • the resulting state (the substrings defined by a position [i-x, j-x] in the matrix)
    • not how you got to that resulting state
    The EditDistance trait

    For the record, I will write here the code I'm using for the implementation, which is constructing the solution incrementally, as opposed to the recursive solution presented on Tony's blog, here.

    The few interesting things to notice about this code are:
    • the algorithm itself ("initializing the matrix"), which is very straightforward
    • the minimum function which is a bit strange because it is not comparing all alternatives. The cost for an insertion is only computed if the cost for a suppression is not better than the cost for a substitution. I would like to work out a formal proof on why this is correct by my intuition tells me that a substitution is generally a "good" operation. It has the same cost as an insertion or a suppression but processes 2 letters at the time. So if we find a better cost using a suppression, it is going to be the best
    • the matrix traversal to retrieve one of the possible paths showing the letters which needs to be transformed for each string: not so fun code with a lot of corner cases
    • the "separators" feature which allows to select the separators of your choice to display the string differences

    trait EditDistance {
    /**
    * Class encapsulating the functions related to the edit distance of 2 strings
    */
    case class EditMatrix(s1: String, s2: String) {
    /* matrix containing the edit distance for any prefix of s1 and s2:
    matrix(i)(j) = edit distance(s1[0..i], s[0..j])*/
    val matrix = new Array[Array[int]](s1.length + 1, s2.length + 1)

    /* initializing the matrix */
    for (i <- 0 to s1.length;
    j <- 0 to s2.length) {
    if (i == 0) matrix(i)(j) = j // j insertions
    else if (j == 0) matrix(i)(j) = i // i suppressions
    else matrix(i)(j) = min(matrix(i - 1)(j) + 1, // suppression
    matrix(i - 1)(j - 1) + (if (s1(i - 1) == s2(j - 1)) 0
    else 1), // substitution
    matrix(i)(j - 1) + 1) // insertion

    }
    /** @return the edit distance between 2 strings */
    def distance = matrix(s1.length)(s2.length)

    /** prints the edit matrix of 2 strings */
    def print = {
    for (i <- 0 to s1.length) {
    def row = for (j <- 0 to s2.length) yield matrix(i)(j)
    println(row.mkString("|"))
    }
    this
    }

    /** @return a (String, String) displaying the differences between each
    input strings. The used separators are parenthesis: '(' and ')'*/
    def showDistance: (String, String) = showDistance("()")

    /**
    * @param sep separators used to hightlight differences. If sep is empty,
    * then no separator is used. If sep contains one character, it is
    * taken as the unique separator. If sep contains 2 or more characters,
    * the first 2 characters are taken as opening separator and closing
    * separator.
    *
    * @return a (String, String) displaying the differences between each
    * input strings. The used separators are specified by the caller.
    */
    def showDistance(sep: String) = {
    val (firstSeparator, secondSeparator) = separators(sep)
    def modify(s: String, c: Char): String = modifyString(s, c.toString)
    def modifyString(s: String, mod: String): String = {
    (firstSeparator + mod + secondSeparator + s).
    removeAll(secondSeparator + firstSeparator)
    }
    def findOperations(dist: Int, i: Int, j:Int, s1mod: String, s2mod: String): (String, String) = {
    if (i == 0 && j == 0) {
    ("", "")
    }
    else if (i == 1 && j == 1) {
    if (dist == 0) (s1(0) + s1mod, s2(0) + s2mod)
    else (modify(s1mod, s1(0)), modify(s2mod, s2(0)))
    }
    else if (j < 1) (modifyString(s1mod, s1.slice(0, i)), s2mod)
    else if (i < 1) (s1mod, modifyString(s2mod, s2.slice(0, j)))
    else {
    val (suppr, subst, ins) = (matrix(i - 1)(j), matrix(i - 1)(j - 1),
    matrix(i)(j - 1))
    if (suppr < subst)
    findOperations(suppr, i - 1, j, modify(s1mod, s1(i - 1)), s2mod)
    else if (ins < subst)
    findOperations(ins, i, j - 1, s1mod, modify(s2mod, s2(j - 1)))
    else if (subst < dist)
    findOperations(subst, i - 1, j - 1, modify(s1mod, s1(i - 1)),
    modify(s2mod, s2(j - 1)))
    else
    findOperations(subst, i - 1, j - 1, s1(i - 1) + s1mod, s2(j - 1) + s2mod)
    }
    }
    findOperations(distance, s1.length, s2.length, "", "")
    }
    def min(suppr: Int, subst: Int, ins: =>Int) = {
    if(suppr < subst) suppr
    else if (ins < subst) ins
    else subst
    }
    }
    def editDistance(s1: String, s2: String): Int = EditMatrix(s1, s2).distance
    def showMatrix(s1: String, s2: String) = EditMatrix(s1, s2).print
    def showDistance(s1: String, s2: String) = EditMatrix(s1, s2).showDistance
    def showDistance(s1: String, s2: String, sep: String) = EditMatrix(s1, s2).showDistance(sep)

    private def separators(s: String) = (firstSeparator(s), secondSeparator(s))
    private def firstSeparator(s: String) = if (s.isEmpty) "" else s(0).toString
    private def secondSeparator(s: String) = {
    if (s.size < 2) firstSeparator(s) else s(1).toString
    }
    }


    Conclusion

    What's the conclusion? Computer science is useful of course, but you only recognize it once you know it!

    No comments: