Optimizing Apache POI Reads of a 1-Million-Row Excel File in Java: split vs. indexOf + substring, Which Is Faster?

This article is a repost. 2014-07-27 21:05. Tags: java, split, subString, optimization, performance, indexOf

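I used the Apache POI event model to implement a streaming Excel reader. The goal was to read and process a 152 MB Excel file with 1 million rows of 46 columns each within 20 s. The first implementation took 260 s, far off the target, so I profiled per-method execution time with jvisualvm.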

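The profile showed that the splitLine and getRowNum methods consumed a large share of the time. Both are very simple. splitLine breaks a string such as "123==hello" into the array {"123","hello"} using String.split; getRowNum extracts the row number "123456" from an Excel cell reference string such as "AB123456". The original implementations: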

private String getRowNum(String cellRef){
    if(cellRef == null || cellRef.isEmpty()){
        return "-1";
    }

    // "AB123456".split("\\D+") yields {"", "123456"}, so the row number is at index 1.
    String[] nums = cellRef.split("\\D+");
    if(nums.length > 1){
        return nums[1];
    }
    return "-1";
}

private String[] splitLine(String line){
    return line.split("==");
}
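
To make it clearer why getRowNum reads nums[1] rather than nums[0], here is a small standalone sketch of what the two split calls return. The class name SplitBehaviorDemo and the helper splitCellRef are made up for illustration and are not part of the original reader.

```java
import java.util.Arrays;

class SplitBehaviorDemo {
    // Hypothetical helper: the same split call the original getRowNum uses.
    static String[] splitCellRef(String cellRef) {
        return cellRef.split("\\D+");
    }

    public static void main(String[] args) {
        // The leading letters "AB" match \D+ at position 0, so the first
        // element is an empty string and the digits land at index 1.
        System.out.println(Arrays.toString(splitCellRef("AB123456")));

        // The literal "==" is also treated as a regex by String.split.
        System.out.println(Arrays.toString("123==hello".split("==")));
    }
}
```

The first element of the cell-reference result is an empty string because the non-digit prefix sits at the very start of the input.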

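That two such simple methods cost so much time left me unsure how to optimize at first. Then it occurred to me: is split actually the fastest option? For string splitting this simple, how would indexOf + substring perform? So I ran the following experiment: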

public static void main(String[] args){
    String str = "AB123456";
    long start = System.currentTimeMillis();
    for(int i = 0 ; i < 10 * 10000 ; i ++){
        String[] lines = str.split("\\D+");
    }
    long end = System.currentTimeMillis();
    System.out.println("split time consumed:" + (end - start) / 1000.0 + "s");
    
    start = System.currentTimeMillis();
    int index = -1;
    for(int i = 0 ; i < 10 * 10000 ; i ++){
        index = -1;
        for(int k = 0 ; k < str.length() ; k ++){
            if(str.charAt(k) >= '0' && str.charAt(k) <= '9'){
                index = k;
                break;
            }
        }
        
        if(index > 0){
            String[] lines = new String[]{str.substring(0, index),str.substring(index)};
        }
    }
    end = System.currentTimeMillis();
    System.out.println("indexof time consumed:" + (end - start) / 1000.0 + "s");
}

Output:
split time consumed:0.104s
indexof time consumed:0.007s
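
The figures above come from a single cold run timed with currentTimeMillis, so part of the gap may be JIT warm-up. As a rough sketch only, the same comparison can be repeated with a few warm-up passes and System.nanoTime. The class and method names below (SplitBenchSketch, viaSplit, viaScan, timeNanos) are made up for illustration, and a harness such as JMH would give more trustworthy numbers.

```java
class SplitBenchSketch {
    // Accumulates results so the JIT cannot discard the measured work.
    static long sink;

    // Row number via String.split, as in the original getRowNum.
    static String viaSplit(String cellRef) {
        String[] parts = cellRef.split("\\D+");
        return parts.length > 1 ? parts[1] : "-1";
    }

    // Row number via a manual scan for the first ASCII digit.
    static String viaScan(String cellRef) {
        for (int k = 0; k < cellRef.length(); k++) {
            char c = cellRef.charAt(k);
            if (c >= '0' && c <= '9') {
                return cellRef.substring(k);
            }
        }
        return "-1";
    }

    // Times `iterations` calls of one variant and returns elapsed nanoseconds.
    static long timeNanos(boolean useSplit, int iterations) {
        String cellRef = "AB123456";
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            String row = useSplit ? viaSplit(cellRef) : viaScan(cellRef);
            sink += row.length();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 100_000;
        // Warm-up passes so both variants are compiled before measuring.
        for (int i = 0; i < 5; i++) {
            timeNanos(true, n);
            timeNanos(false, n);
        }
        System.out.println("split: " + timeNanos(true, n) / 1e6 + " ms");
        System.out.println("scan:  " + timeNanos(false, n) / 1e6 + " ms");
    }
}
```

Absolute timings will vary by JVM and machine, so no expected output is shown here.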

Although split looks much simpler than indexOf + substring, the latter was nearly 15 times faster in this test. I rewrote splitLine and getRowNum along these lines:

private String getRowNum(String cellRef){
    if(cellRef == null){
        return "-1";
    }

    // Scan for the first ASCII digit; everything from there on is the row number.
    for(int k = 0 ; k < cellRef.length() ; k ++){
        char c = cellRef.charAt(k);
        if(c >= '0' && c <= '9'){
            return cellRef.substring(k);
        }
    }

    return "-1";
}

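private String[] splitLine(String line){
    int index = line.indexOf("==");

    // Only the first "==" is used. Unlike String.split, a line with no "=="
    // (or one that starts with "==") yields an empty array.
    if(index > 0){
        return new String[]{line.substring(0, index),line.substring(index + 2)};
    }

    return new String[0];
}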
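
The indexOf version of splitLine is not a drop-in replacement for line.split("==") on every input. The sketch below runs both side by side on a few inputs; the class name SplitLineCheck and the sample strings are made up for illustration, and the method is made static only so the example runs on its own.

```java
import java.util.Arrays;

class SplitLineCheck {
    // Same logic as the indexOf-based splitLine above.
    static String[] splitLine(String line) {
        int index = line.indexOf("==");
        if (index > 0) {
            return new String[]{line.substring(0, index), line.substring(index + 2)};
        }
        return new String[0];
    }

    public static void main(String[] args) {
        String[] inputs = {"123==hello", "a==b==c", "no delimiter", "==leading"};
        for (String in : inputs) {
            System.out.println(in
                    + " -> splitLine " + Arrays.toString(splitLine(in))
                    + ", split " + Arrays.toString(in.split("==")));
        }
    }
}
```

For "123==hello" both approaches agree. For "a==b==c" splitLine stops at the first delimiter and returns {"a", "b==c"}, and for inputs with no delimiter or a leading delimiter it returns an empty array where split would not. That is fine when every line is known to have the form key==value, which appears to be the case in this reader.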

After the optimization, I profiled per-method execution time with jvisualvm again.

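The new profile showed that my own data-processing methods were no longer an obvious bottleneck; Apache POI's zip decompression and file reading now took up the vast majority of the time. Total time dropped from 260 s to 160 s, a clear improvement.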

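indexOf is essentially a brute-force search, whereas split matches with a regular expression internally, so for simple search strings indexOf should be faster. Most calls to split need none of the advanced regex features, so there is no need to reach for split everywhere just for convenience. A reasonable trade-off: for simple splitting, implement it yourself with indexOf; use split only when the splitting is complex enough to really need a regular expression. To see what String.split actually does, here is its source in the JDK: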

    public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0)
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
                    resultSize--;
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

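Although the implementation of split is fairly well optimized, it still does too much work. The fast path applies only to a single-character delimiter (or a backslash-escaped one), so neither "==" nor "\\D+" qualifies, and every call falls through to Pattern.compile(regex).split(this, limit), compiling the regex again each time.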
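
The last line of the listing, Pattern.compile(regex).split(this, limit), points to two possible ways of avoiding the per-call cost in a hot loop. The following is a minimal sketch, not the code used in the original reader: compile the Pattern once and reuse it when a regex is really needed, or split on a literal delimiter with an indexOf loop. The names SplitAlternatives, NON_DIGITS, rowNum and splitLiteral are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitAlternatives {
    // Compiled once and reused, instead of once per String.split call.
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Regex-based row number extraction with a precompiled pattern.
    static String rowNum(String cellRef) {
        String[] parts = NON_DIGITS.split(cellRef);
        return parts.length > 1 ? parts[1] : "-1";
    }

    // Splits on every occurrence of a literal delimiter using indexOf.
    // Unlike String.split with limit 0, trailing empty strings are kept.
    static List<String> splitLiteral(String s, String delim) {
        if (delim.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> out = new ArrayList<>();
        int from = 0;
        int at;
        while ((at = s.indexOf(delim, from)) != -1) {
            out.add(s.substring(from, at));
            from = at + delim.length();
        }
        out.add(s.substring(from)); // remaining segment after the last delimiter
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rowNum("AB123456"));
        System.out.println(splitLiteral("123==hello", "=="));
    }
}
```

Whether either variant is fast enough for a given workload is something to measure; the precompiled pattern still runs the regex matcher, while the indexOf loop avoids regex entirely.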

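Looking back, I have often overused String.split in my code for convenience, which does not hold up under large data volumes. After studying Java for so long I had never thought about this, and I can't help feeling I am still a novice. Languages like Java and C# make their library methods convenient to use, but every user needs to pay attention to the performance cost hidden beneath them.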

(End)
