代碼背景:
- 由於Excel本身無法簡單的比較兩個Excel數據的異同,所以用VBA編寫代碼的方式來實現。
- 這里的比較條件是:數據行為單位,假設對應Sheet中沒有重復數據,對應數據行的所有列的數據都相等,即為此行數據相同。
- 這里的兩個Sheet的數據行量級別大約為:50000 * 50000,數據列大約:50,對應Cell中的字符串大約100以內,中英文混合。
- 如何在Excel中調出VBA的編寫工具,請參考如下鏈接: https://jingyan.baidu.com/article/63f236281f17650208ab3d97.html
整體來說,需求非常明確,若是不考慮效率的話,代碼邏輯比較簡單,循環比較即可。
相關代碼:

Sub CompareData() Dim i As Long Dim j As Long Dim fullSheetName As String fullSheetName = "Sheet1" Set fullSheet = Sheets(fullSheetName) Dim fullDataRange As Variant fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value Dim fullSheetRowMax As Long fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count Dim partialSheetName As String partialSheetName = "Sheet2" Set partialSheet = Sheets(partialSheetName) Dim partialDataRange As Variant partialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value Dim partialSheetRowMax As Long partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count Dim columnMax As Integer columnMax = 46 Dim columnMark As Integer columnMark = 48 Dim sameRow As Boolean For i = 1 To fullSheetRowMax For j = 1 To partialSheetRowMax sameRow = True For columnIndex = 1 To columnMax If fullDataRange(i, columnIndex) <> partialDataRange(j, columnIndex) Then sameRow = False Exit For End If Next columnIndex If sameRow Then fullSheet.Cells(i, columnMark) = 1 Exit For End If Next j Next i MsgBox "Successfully!" End Sub
上述代碼實際運行大約用30分鍾完成此數量級的比較,大約1000億次的比較。
當然了我們需要更快的比較方式,那么就需要對大數據進行結構優化,即:將partial的sheet中的數據進行分組,比如每1000個row放到一個組里,然后用一個標志位標記這個組里1000個row是否都有相同的數據,如有都有的話,那么下次再比較的時候就可以跳過這個組,進行下一組的1000個row的循環遍歷。相同數量級,大約2分鍾比較完成。
注:實際數據是按照時間進行抽取出來的,所以partial的sheet數據 大致都在full的sheet的前半部分相同,如果數據無規律,非常混亂,那么還要對每一個row的數據進行結構優化,即:用另外一個標記為進行標記此row是否有相同的數據,判斷的時候先判斷這個標記位】
相關代碼如下:
【注:函數中的一些變量都是HardCode的,要根據具體數據進行修改】

Public Type PartialBasedModule IsAllSame As Boolean SheetDataRange As Variant SameCount As Long End Type Sub CompareData() Dim i As Long Dim j As Long Dim k As Long Dim fullSheetName As String fullSheetName = "Sheet1" Set fullSheet = Sheets(fullSheetName) Dim fullDataRange As Variant fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value Dim fullSheetRowMax As Long fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count Dim partialSheetName As String partialSheetName = "Sheet2" Set partialSheet = Sheets(partialSheetName) Dim PartialDataRange As Variant PartialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value Dim partialSheetRowMax As Long partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count Dim partialSheetPages() As PartialBasedModule partialSheetPages = SeparatePartialSheet(PartialDataRange, partialSheetRowMax) Dim columnMax As Integer columnMax = 46 Dim columnMark As Integer columnMark = 48 Dim sameRow As Boolean For i = 1 To fullSheetRowMax For j = 1 To UBound(partialSheetPages) If partialSheetPages(j).SameCount < 1000 Then For k = 1 To UBound(partialSheetPages(j).SheetDataRange) sameRow = True For ColumnIndex = 1 To columnMax If fullDataRange(i, ColumnIndex) <> partialSheetPages(j).SheetDataRange(k, ColumnIndex) Then sameRow = False Exit For End If Next ColumnIndex If sameRow Then fullSheet.Cells(i, columnMark) = 1 partialSheetPages(j).SameCount = partialSheetPages(j).SameCount + 1 Exit For End If Next k Else sameRow = False End If If sameRow Then Exit For End If Next j Next i MsgBox "Successfully!" End Sub Public Function SeparatePartialSheet(ByRef PartialDataRange As Variant, ByVal rowCount As Long) As PartialBasedModule() Dim eachPageCount As Long eachPageCount = 1000 Dim pageCount As Integer pageCount = Int(rowCount / eachPageCount) + 1 Dim pageIndex As Long Dim pageArr() As PartialBasedModule Dim startIndex As Long Dim endIndex As Long For pageIndex = 1 To pageCount Dim seperatedDataRange(1 To 1000, 1 To 46) As Variant Dim seperatedIndex As Long seperatedIndex = 1 Dim colIndex As Integer If pageIndex < pageCount Then endIndex = pageIndex * eachPageCount Else endIndex = rowCount End If For startIndex = (pageIndex - 1) * eachPageCount + 1 To endIndex For colIndex = 1 To 46 seperatedDataRange(seperatedIndex, colIndex) = PartialDataRange(startIndex, colIndex) Next colIndex seperatedIndex = seperatedIndex + 1 Next startIndex Dim pageData As PartialBasedModule pageData.SheetDataRange = seperatedDataRange pageData.SameCount = 0 pageData.IsAllSame = False ReDim Preserve pageArr(pageIndex) pageArr(pageIndex) = pageData Next pageIndex SeparatePartialSheet = pageArr End Function
給每個Row都加上標記的代碼如下所示:【相同界別的數據,大約1分鍾完成比較】

Public Type RowModule IsSame As Boolean RowData As Variant End Type Public Type PartialBasedModule IsAllSame As Boolean SheetDataRange() As RowModule SameCount As Long End Type Sub CompareData() Dim i As Long Dim j As Long Dim k As Long Dim fullSheetName As String fullSheetName = "Sheet1" Set fullSheet = Sheets(fullSheetName) Dim fullDataRange As Variant fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value Dim fullSheetRowMax As Long fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count Dim partialSheetName As String partialSheetName = "Sheet2" Set partialSheet = Sheets(partialSheetName) Dim PartialDataRange As Variant PartialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value Dim partialSheetRowMax As Long partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count Dim partialSheetPages() As PartialBasedModule partialSheetPages = SeparatePartialSheet(PartialDataRange, partialSheetRowMax) Dim columnMax As Integer columnMax = 46 Dim columnMark As Integer columnMark = 48 Dim sameRow As Boolean For i = 1 To fullSheetRowMax For j = 1 To UBound(partialSheetPages) If partialSheetPages(j).SameCount < 1000 Then For k = 1 To UBound(partialSheetPages(j).SheetDataRange) sameRow = True If partialSheetPages(j).SheetDataRange(k).IsSame Then sameRow = False Else For ColumnIndex = 1 To columnMax If fullDataRange(i, ColumnIndex) <> partialSheetPages(j).SheetDataRange(k).RowData(ColumnIndex) Then sameRow = False Exit For End If Next ColumnIndex If sameRow Then fullSheet.Cells(i, columnMark) = 1 partialSheetPages(j).SheetDataRange(k).IsSame = True partialSheetPages(j).SameCount = partialSheetPages(j).SameCount + 1 Exit For End If End If Next k Else sameRow = False End If If sameRow Then Exit For End If Next j Next i MsgBox "Successfully!" End Sub Public Function SeparatePartialSheet(ByRef PartialDataRange As Variant, ByVal rowCount As Long) As PartialBasedModule() Dim eachPageCount As Long eachPageCount = 1000 Dim pageCount As Integer pageCount = Int(rowCount / eachPageCount) + 1 Dim pageIndex As Long Dim pageArr() As PartialBasedModule Dim startIndex As Long Dim endIndex As Long For pageIndex = 1 To pageCount Dim seperatedDataRange(1 To 1000) As RowModule Dim dataRows(1 To 1000) As Variant Dim seperatedIndex As Long seperatedIndex = 1 Dim colIndex As Integer If pageIndex < pageCount Then endIndex = pageIndex * eachPageCount Else endIndex = rowCount End If For startIndex = (pageIndex - 1) * eachPageCount + 1 To endIndex Dim dataRow(1 To 46) As Variant For colIndex = 1 To 46 dataRow(colIndex) = PartialDataRange(startIndex, colIndex) Next colIndex Dim currentRowModule As RowModule currentRowModule.RowData = dataRow currentRowModule.IsSame = False seperatedDataRange(seperatedIndex) = currentRowModule seperatedIndex = seperatedIndex + 1 Next startIndex Dim pageData As PartialBasedModule pageData.SheetDataRange = seperatedDataRange pageData.SameCount = 0 pageData.IsAllSame = False ReDim Preserve pageArr(pageIndex) pageArr(pageIndex) = pageData Next pageIndex SeparatePartialSheet = pageArr End Function
最終的一個簡單的數據結構如下圖所示: