Several Ways to Read Large Files Quickly

Reposted article, 2017-09-29 14:11. Tags: Java, Scala

Reposted from: http://blog.csdn.net/fengxingzhe001/article/details/67640083

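I used to read text files line by line, and it was painfully slow. After struggling with it for a long time, I finally learned that reading through a byte stream is much faster.

I also ran my own test on an 80 MB file and found that the size of the read block has a very noticeable effect on read speed.

The test code is as follows: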

import java.io.{BufferedInputStream, File, FileInputStream}

def useBufferIStream(): Unit = {
    try {
      val begin = System.currentTimeMillis
      val file = new File("E:\\data\\part-m-00000")
      val fis = new FileInputStream(file)
      val bis = new BufferedInputStream(fis)
      // Size of the block read on each call: 90 MB, larger than the test file
      val buffer = new Array[Byte](1024*1024*90)
      var content = ""
      // read returns the number of bytes read, or -1 at end of file
      var cnt = bis.read(buffer)
      while (cnt != -1) {
        content += new String(buffer, 0, cnt)
        cnt = bis.read(buffer)
      }

      bis.close()
      println("=====BufferIStream===== time: " + (System.currentTimeMillis - begin) + "ms")

    } catch {
      case e: Exception =>
        e.printStackTrace()
        println("error")
    }

  }

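The buffer size in the code (1024*1024*90) is the size of the read block. It is currently set to 90 MB, which is larger than the data being read, and with this setting the read takes only about 0.2 s.

If it is changed to 10 MB (1024*1024*10), the read takes about 10 s, so the difference in speed is very noticeable.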
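
To repeat this comparison, the read can be timed for several block sizes. The following is a minimal sketch in Java, not the code used for the measurements above: the class name BlockSizeTiming and the methods drain and timeMillis are made up for illustration, and it times only the raw read, without building a String, so the numbers it produces will not match the ones quoted here.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the original post.
final class BlockSizeTiming {

    /** Reads the whole stream in blocks of blockSize bytes and returns the total byte count. */
    static long drain(InputStream in, int blockSize) {
        byte[] buffer = new byte[blockSize];
        long total = 0;
        try {
            int cnt;
            // read returns the number of bytes placed in buffer, or -1 at end of stream
            while ((cnt = in.read(buffer)) != -1) {
                total += cnt;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Times one full read of the file with the given block size, in milliseconds. */
    static long timeMillis(Path file, int blockSize) {
        long begin = System.nanoTime();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            drain(in, blockSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return (System.nanoTime() - begin) / 1_000_000;
    }
}
```

Calling timeMillis on the same file with 1024*1024*10 and 1024*1024*90 gives a rough comparison. The operating system caches files after the first read, so it is worth running each size more than once.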

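Some notes on parts of the code:

1. val buffer = new Array[Byte](1024*1024*90) sets how much of the file is read on each call.

2. cnt = bis.read(buffer) reports the size of the block that was read: it is -1 if there is no more data, otherwise the number of bytes read into the block.

3. content holds the text that has been read in.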

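4. The code is written in Scala. Testing showed that the loop

      cnt = bis.read(buffer)
      while (cnt != -1) {
        content += new String(buffer, 0, cnt)
        cnt = bis.read(buffer)
      }

must be written in this form; while((cnt=fis.read(buffer)) != -1) cannot be used. Both forms work in Java, but in Scala the second one raises an error. The cause is that an assignment in Scala evaluates to Unit rather than to the assigned value, so the comparison with -1 never sees the byte count.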
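
For comparison, here is the same read loop in Java, where the assignment inside the while condition is valid. This is only a sketch: the class name ReadLoopSketch and the method readText are invented for illustration, and UTF-8 is an assumption about the file's encoding. It also collects the raw bytes first and decodes them once at the end, so that a multi-byte character split across two blocks is not decoded in halves.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original post.
final class ReadLoopSketch {

    /** Reads the stream block by block and decodes the result as UTF-8 (assumed encoding). */
    static String readText(InputStream in, int blockSize) {
        byte[] buffer = new byte[blockSize];
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            int cnt;
            // In Java the assignment yields the assigned value, so it can be compared with -1
            while ((cnt = in.read(buffer)) != -1) {
                bytes.write(buffer, 0, cnt);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Decode once, after all blocks are joined
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```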

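The code above handles the basic task of reading the data, but when the data is read in blocks it cannot guarantee that every line within a block is complete. I therefore made some changes on top of it.

The code is as follows (it splits an 86 GB file into a number of small files of about 90 MB each and ensures that every line in every small file is complete):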

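import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}

def safeCopy(): Unit = {
    try {
      var count = 0
      var tmp = -1   // length of the complete-line part of the current block
      var tmp2 = 0   // number of bytes carried over from the previous block
      val stmp = new Array[Byte](1024*1024*90)   // incomplete last line of the previous block
      val fis = new FileInputStream("E:\\data\\part-m-00003")
      val bis = new BufferedInputStream(fis)
      val buffer = new Array[Byte](1024*1024*90)
      var size = bis.read(buffer)
      while (size != -1) {
        val fos = new FileOutputStream("E:\\data\\part3\\part_" + "%04d".format(count))
        val bos = new BufferedOutputStream(fos)
        // Start each file with the partial line left over from the previous block
        bos.write(stmp, 0, tmp2)
        tmp = findSize(buffer, size)
        if (tmp > 0) {
          // Write up to and including the last line break, keep the rest for the next file
          bos.write(buffer, 0, tmp)
          tmp2 = size - tmp
          Array.copy(buffer, tmp, stmp, 0, tmp2)
        } else {
          // No line break in this block: write it whole
          bos.write(buffer, 0, size)
          tmp2 = 0
        }
        bos.close()   // flushes and releases the output file

        println(s"finish $count")
        count += 1
        size = bis.read(buffer)
      }
      bis.close()
      if (tmp2 > 0) {
        // The input did not end with a line break: write the remaining partial line
        val bos = new BufferedOutputStream(new FileOutputStream("E:\\data\\part3\\part_" + "%04d".format(count)))
        bos.write(stmp, 0, tmp2)
        bos.close()
      }
      println("success!!")
    } catch {
      case e: Exception =>
        e.printStackTrace()
        println("error!!")
    }
  }

  def findSize(buffer: Array[Byte], size: Int): Int = {
    // Scan backwards for the last CR (13) or LF (10);
    // return the index just past it, or 0 if the block contains no line break
    var i = size - 1
    while (i >= 0 && buffer(i) != 13 && buffer(i) != 10) {
      i -= 1
    }
    i + 1
  }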

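Here, the findSize function finds the length of the complete data in each block. In buffer(i)==10, the value 10 is the byte value of the line feed character (every line of data ends with a line break); 13 is the byte value of the carriage return.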
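
The idea of cutting each block at its last line break and carrying the remainder over to the next block can be tried out in memory before pointing it at a real file. The sketch below is an illustration only, under some assumptions: the class LineAlignedSplit and its methods completeLength and split are made-up names, the pieces are returned as byte arrays instead of being written to files, and a block with no line break at all is merged into the next piece.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original post.
final class LineAlignedSplit {

    /** Index just past the last CR (13) or LF (10) in buffer[0..size), or 0 if there is none. */
    static int completeLength(byte[] buffer, int size) {
        int i = size - 1;
        while (i >= 0 && buffer[i] != 13 && buffer[i] != 10) {
            i--;
        }
        return i + 1;
    }

    /** Splits the stream into pieces that each end on a line break (except possibly the last). */
    static List<byte[]> split(InputStream in, int blockSize) {
        List<byte[]> parts = new ArrayList<>();
        byte[] buffer = new byte[blockSize];
        ByteArrayOutputStream carry = new ByteArrayOutputStream(); // partial line from earlier blocks
        try {
            int size;
            while ((size = in.read(buffer)) != -1) {
                int end = completeLength(buffer, size);
                if (end > 0) {
                    carry.write(buffer, 0, end);          // complete lines of this block
                    parts.add(carry.toByteArray());
                    carry.reset();
                    carry.write(buffer, end, size - end); // remainder goes to the next piece
                } else {
                    carry.write(buffer, 0, size);         // no line break yet: keep accumulating
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (carry.size() > 0) {
            parts.add(carry.toByteArray());               // input did not end with a line break
        }
        return parts;
    }
}
```

Concatenating the returned pieces in order reproduces the input byte for byte, which is a convenient check when adapting the idea to files.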

I later found a newer approach that solves the problem above more easily:

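    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public static void main(String[] args) throws IOException {
        String inputFile = "E:\\data\\part-m-00000";
        String outputFile = "E:\\data\\test01\\a-0";
        BufferedInputStream bis = new BufferedInputStream(new FileInputStream(new File(inputFile)));
        BufferedReader in = new BufferedReader(new InputStreamReader(bis,"utf-8"),60*1024*1024);
        // Write with the same charset that is used for reading, not the platform default
        Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), StandardCharsets.UTF_8));
        int count = 0;   // lines written to the current output file
        int count1 = 0;  // index of the current output file
        long start = System.currentTimeMillis();
        String line;
        // readLine() returns null at end of input; ready() is not a reliable end-of-file test
        while ((line = in.readLine()) != null) {
            count++;
            fw.append(line+"\n");
            if(count == 150000){
                // Current file is full: close it and start the next one
                count1++;
                fw.close();
                fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("E:\\data\\test01\\a-"+count1), StandardCharsets.UTF_8));
                count = 0;
                long end = System.currentTimeMillis();
                System.out.println((end-start));
                start = System.currentTimeMillis();
            }

        }
        in.close();
        fw.close();
    }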

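The speed is about the same as before, and this approach handles the requirement of reading record by record nicely. The size of each output file is controlled by the number of lines written to it.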
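
The line-count approach can also be written as a reusable method. What follows is a sketch under assumptions rather than a drop-in replacement: the class LineCountSplitter, the method split and its parameters are made-up names, the files are assumed to be UTF-8, and output files are opened lazily so that no empty file is left behind when the line count is an exact multiple of linesPerFile.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the original post.
final class LineCountSplitter {

    /**
     * Copies input into files outDir/prefix-0, prefix-1, ... with at most
     * linesPerFile lines each, and returns the number of files written.
     */
    static int split(Path input, Path outDir, String prefix, int linesPerFile) throws IOException {
        int fileIndex = 0;
        int linesInFile = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            // readLine() returns null once the input is exhausted
            while ((line = in.readLine()) != null) {
                if (out == null) {
                    // Open the next output file only when there is a line to put in it
                    out = Files.newBufferedWriter(outDir.resolve(prefix + "-" + fileIndex), StandardCharsets.UTF_8);
                }
                out.write(line);
                out.write('\n');
                linesInFile++;
                if (linesInFile == linesPerFile) {
                    out.close();
                    out = null;
                    fileIndex++;
                    linesInFile = 0;
                }
            }
        } finally {
            if (out != null) {
                // Last file holds fewer than linesPerFile lines
                out.close();
                fileIndex++;
            }
        }
        return fileIndex;
    }
}
```

With the paths from the example above, the call would look like LineCountSplitter.split(Paths.get("E:\\data\\part-m-00000"), Paths.get("E:\\data\\test01"), "a", 150000), which needs an extra import of java.nio.file.Paths.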
