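In his 1986 paper "An O(ND) Difference Algorithm and Its Variations", published in Algorithmica, Eugene W. Myers describes a basic greedy algorithm for computing diffs. The same paper also extends the algorithm with a "Linear Space Refinement".
Let A and B be two files. The algorithm reads both as input and, taking B as the newer version, produces a Shortest Edit Script (SES) that transforms A into B. An SES contains only two kinds of commands: delete a character from A, and insert a character from B.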
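Finding an SES is equivalent to finding a Longest Common Subsequence (LCS). An LCS is the longest sequence of characters that remains in both files after some characters have been removed from each. Note that this is not the same as the Longest Common Substring, which must be contiguous.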
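The equivalence can be checked numerically: every character outside the LCS is either deleted from A or inserted from B, so an SES has length(A) + length(B) - 2 * length(LCS) commands. The sketch below uses the textbook dynamic-programming LCS, not Myers' algorithm; the names LcsLength, lcsLength and sesLength are made up for this illustration.

```java
// Illustrative only: classic O(m*n) dynamic programming, not the O(ND) algorithm.
final class LcsLength {

    // Length of the longest common subsequence of a and b.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;                 // extend a common subsequence
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]); // skip one character
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    // Number of insert/delete commands in a shortest edit script.
    static int sesLength(String a, String b) {
        return a.length() + b.length() - 2 * lcsLength(a, b);
    }

    public static void main(String[] args) {
        String a = "ABCABBACDAB";
        String b = "CBABACDAA";
        System.out.println("LCS length: " + lcsLength(a, b)); // 7
        System.out.println("SES length: " + sesLength(a, b)); // 6
    }
}
```

For the two strings used in the Java code further down, this gives an LCS of length 7 and an SES of length 6, which matches the D value at which that program reports "found".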
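Two files may have more than one LCS. For example, "ABC" and "ACB" have two, "AB" and "AC", and each corresponds to a different SES. When several SESs exist, the algorithm returns only the first one it finds.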
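The algorithm works on a directed edit graph built from A and B, with A along the X axis and B along the Y axis. Let the lengths of A and B be m and n; each step along an axis corresponds to one character of the respective string. Moving along the X axis means deleting a character from A, and moving along the Y axis means inserting a character from B. Wherever the character on the X axis matches the character on the Y axis, a diagonal edge connects the upper-left point to the lower-right point; it requires no edit, so its length is 0. The goal of the algorithm is to find a shortest path from (0, 0) to (m, n).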
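To make the edit graph concrete, here is a minimal sketch that computes the shortest-path cost from (0, 0) to (m, n) by visiting every grid point, charging 1 for a horizontal or vertical edge and 0 for a diagonal edge. It is a brute-force check of the graph model, not the greedy algorithm itself, and the names EditGraph and shortestPath are assumptions made for this example.

```java
// Illustrative only: exhaustive shortest path over the whole edit graph.
final class EditGraph {

    // Cost of the cheapest path from (0, 0) to (a.length(), b.length()).
    static int shortestPath(String a, String b) {
        int m = a.length();
        int n = b.length();
        int[][] dist = new int[m + 1][n + 1];          // dist[0][0] is 0
        for (int x = 0; x <= m; x++) {
            for (int y = 0; y <= n; y++) {
                if (x == 0 && y == 0) {
                    continue;
                }
                int best = Integer.MAX_VALUE;
                if (x > 0) {
                    best = Math.min(best, dist[x - 1][y] + 1);   // horizontal edge: delete a[x-1]
                }
                if (y > 0) {
                    best = Math.min(best, dist[x][y - 1] + 1);   // vertical edge: insert b[y-1]
                }
                if (x > 0 && y > 0 && a.charAt(x - 1) == b.charAt(y - 1)) {
                    best = Math.min(best, dist[x - 1][y - 1]);   // diagonal edge: free
                }
                dist[x][y] = best;
            }
        }
        return dist[m][n];
    }

    public static void main(String[] args) {
        System.out.println(shortestPath("ABCABBACDAB", "CBABACDAA")); // 6
    }
}
```

This version touches all (m + 1) * (n + 1) points. The greedy algorithm avoids that by tracking only the furthest point reached on each diagonal for each path length.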
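The algorithm uses the following variables during the comparison:
k: a diagonal running from upper left to lower right, numbered k = x - y. The diagonal through (0, 0) is k = 0; the diagonals below and to its left are -1, -2, ... and those above and to its right are 1, 2, ...
d: the path length, that is, the number of non-diagonal (insert or delete) steps on the path
x, y: coordinates in the edit graph
snake: a single edit step together with the run of diagonal moves that follows it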
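The relation between k, x and y is what lets the algorithm store only x for each diagonal: y is always x - k. A small sketch of the diagonal part of a snake, using a hypothetical helper followDiagonal that is not taken from the original code:

```java
// Illustrative only: the "follow the diagonal" part of a snake.
final class SnakeStep {

    // Starting at (x, x - k) on diagonal k, slide along matching characters
    // and return the x coordinate where the run stops.
    // The caller must pass a point inside the graph (x >= 0 and x - k >= 0).
    static int followDiagonal(String a, String b, int x, int k) {
        int y = x - k;                       // y is implied by x and k
        while (x < a.length() && y < b.length() && a.charAt(x) == b.charAt(y)) {
            x++;
            y++;
        }
        return x;
    }

    public static void main(String[] args) {
        String a = "ABCABBACDAB";
        String b = "CBABACDAA";
        // From the mid point (0, 2) on diagonal k = -2 the run ends at (2, 4).
        int xEnd = followDiagonal(a, b, 0, -2);
        System.out.println("(" + xEnd + "," + (xEnd + 2) + ")");
    }
}
```

The same helper started at (6, 4) on diagonal k = 2 slides to (10, 8), which is the longest snake on the solution path for these two strings.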
Source: https://www.codeproject.com/Articles/42279/Investigating-Myers-diff-algorithm-Part-2-of-2
Java code
import java.util.ArrayList;
import java.util.List;

public class MyersDiff {

    public static void main(String[] args) {
        String a = "ABCABBACDAB";
        String b = "CBABACDAA";
        char[] aa = a.toCharArray();
        char[] bb = b.toCharArray();

        int max = aa.length + bb.length;
        // v[k + max] is the furthest x reached on diagonal k. k ranges over
        // [-max, max] and index k + 1 is read at d = 0, so 2 * max + 2 slots
        // are needed (max * 2 overflows for empty inputs).
        int[] v = new int[2 * max + 2];
        List<Snake> snakes = new ArrayList<>();

        for (int d = 0; d <= max; d++) {
            System.out.println("D:" + d);
            for (int k = -d; k <= d; k += 2) {
                System.out.print("k:" + k);

                // down or right?
                boolean down = (k == -d || (k != d && v[k - 1 + max] < v[k + 1 + max]));
                int kPrev = down ? k + 1 : k - 1;

                // start point
                int xStart = v[kPrev + max];
                int yStart = xStart - kPrev;

                // mid point: after the single insert (down) or delete (right)
                int xMid = down ? xStart : xStart + 1;
                int yMid = xMid - k;

                // end point
                int xEnd = xMid;
                int yEnd = yMid;

                // follow diagonal
                while (xEnd < aa.length && yEnd < bb.length && aa[xEnd] == bb[yEnd]) {
                    xEnd++;
                    yEnd++;
                }

                // save end point
                v[k + max] = xEnd;

                // record a snake; newest first, so the list can be walked backwards
                snakes.add(0, new Snake(xStart, yStart, xEnd, yEnd));
                System.out.print(", start:(" + xStart + "," + yStart + "), mid:(" + xMid + "," + yMid
                        + "), end:(" + xEnd + "," + yEnd + ")\n");

                // check for solution
                if (xEnd >= aa.length && yEnd >= bb.length) {
                    System.out.println("found");
                    printSnakes(snakes);
                    return;
                }
            }
        }
    }

    // Walk back from the final snake, following each snake whose end point is
    // the start point of the current one, until (0, 0) is reached.
    private static void printSnakes(List<Snake> snakes) {
        Snake current = snakes.get(0);
        printSnake(current);
        for (int i = 1; i < snakes.size(); i++) {
            Snake tmp = snakes.get(i);
            if (tmp.getxEnd() == current.getxStart() && tmp.getyEnd() == current.getyStart()) {
                current = tmp;
                printSnake(current);
                if (current.getxStart() == 0 && current.getyStart() == 0) {
                    break;
                }
            }
        }
    }

    private static void printSnake(Snake s) {
        System.out.println(String.format("(%2d, %2d)<-(%2d, %2d)",
                s.getxEnd(), s.getyEnd(), s.getxStart(), s.getyStart()));
    }

    public static class Snake {
        private final int xStart;
        private final int yStart;
        private final int xEnd;
        private final int yEnd;

        public Snake(int xStart, int yStart, int xEnd, int yEnd) {
            this.xStart = xStart;
            this.yStart = yStart;
            this.xEnd = xEnd;
            this.yEnd = yEnd;
        }

        public int getxStart() { return xStart; }
        public int getyStart() { return yStart; }
        public int getxEnd() { return xEnd; }
        public int getyEnd() { return yEnd; }
    }
}
Output
D:0
k:0, start:(0,-1), mid:(0,0), end:(0,0)
D:1
k:-1, start:(0,0), mid:(0,1), end:(0,1)
k:1, start:(0,0), mid:(1,0), end:(1,0)
D:2
k:-2, start:(0,1), mid:(0,2), end:(2,4)
k:0, start:(1,0), mid:(1,1), end:(2,2)
k:2, start:(1,0), mid:(2,0), end:(3,1)
D:3
k:-3, start:(2,4), mid:(2,5), end:(3,6)
k:-1, start:(2,4), mid:(3,4), end:(4,5)
k:1, start:(3,1), mid:(3,2), end:(5,4)
k:3, start:(3,1), mid:(4,1), end:(5,2)
D:4
k:-4, start:(3,6), mid:(3,7), end:(4,8)
k:-2, start:(4,5), mid:(4,6), end:(4,6)
k:0, start:(5,4), mid:(5,5), end:(5,5)
k:2, start:(5,4), mid:(6,4), end:(10,8)
k:4, start:(5,2), mid:(6,2), end:(7,3)
D:5
k:-5, start:(4,8), mid:(4,9), end:(4,9)
k:-3, start:(4,8), mid:(5,8), end:(5,8)
k:-1, start:(5,5), mid:(5,6), end:(5,6)
k:1, start:(10,8), mid:(10,9), end:(10,9)
k:3, start:(10,8), mid:(11,8), end:(11,8)
k:5, start:(7,3), mid:(8,3), end:(8,3)
D:6
k:-6, start:(4,9), mid:(4,10), end:(4,10)
k:-4, start:(5,8), mid:(5,9), end:(5,9)
k:-2, start:(5,8), mid:(6,8), end:(7,9)
k:0, start:(10,9), mid:(10,10), end:(10,10)
k:2, start:(11,8), mid:(11,9), end:(11,9)
found
(11,  9)<-(11,  8)
(11,  8)<-(10,  8)
(10,  8)<-( 5,  4)
( 5,  4)<-( 3,  1)
( 3,  1)<-( 1,  0)
( 1,  0)<-( 0,  0)
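The traced path (0,0), (1,0), (3,1), (5,4), (10,8), (11,8), (11,9) can be turned into an actual edit script. The following is one possible sketch, assuming the path is given as points in forward order; the class EditScript, the method fromPath and the "-", "+", " " prefixes are choices made for this example and are not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: converts a chain of snake end points into edit commands.
final class EditScript {

    // points[i] = {x, y}; consecutive points are the start and end of one snake.
    static List<String> fromPath(String a, String b, int[][] points) {
        List<String> script = new ArrayList<>();
        for (int i = 1; i < points.length; i++) {
            int x = points[i - 1][0];
            int y = points[i - 1][1];
            int xEnd = points[i][0];
            int yEnd = points[i][1];
            int dx = xEnd - x;
            int dy = yEnd - y;
            if (dx > dy) {                       // one step right: delete from A
                script.add("-" + a.charAt(x));
                x++;
            } else if (dy > dx) {                // one step down: insert from B
                script.add("+" + b.charAt(y));
                y++;
            }
            while (x < xEnd) {                   // remaining diagonal: unchanged characters
                script.add(" " + a.charAt(x));
                x++;
                y++;
            }
        }
        return script;
    }

    public static void main(String[] args) {
        int[][] path = {{0, 0}, {1, 0}, {3, 1}, {5, 4}, {10, 8}, {11, 8}, {11, 9}};
        for (String line : fromPath("ABCABBACDAB", "CBABACDAA", path)) {
            System.out.println(line);
        }
    }
}
```

Reading only the " " and "+" lines of the result spells out B, and reading only the " " and "-" lines spells out A, with six "+" or "-" lines in total, in agreement with D:6.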