Detecting a CSV file's encoding to fix garbled Chinese when reading CSV

Reposted from the original article. 2019-03-19 15:58. Tags: CSV file encoding / garbled Chinese in CSV

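The most common problem when parsing a CSV file is garbled text. Some readers may say that hard-coding the charset to GBK or GB2312 while parsing is enough to fix garbled Chinese, as follows: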

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import com.google.common.collect.Lists;      // Guava
    import org.apache.commons.lang3.StringUtils; // Apache Commons Lang

    // buildText is the author's own cell-cleaning helper; it is not shown in the article.
    public static List<List<String>> readTxtOrCsvFile(InputStream input) {
        List<List<String>> data = Lists.newArrayList();
        if (input == null) {
            return data;
        }
        // try-with-resources closes the reader and the underlying stream, even on failure
        try (InputStream in = input;
             BufferedReader br = new BufferedReader(new InputStreamReader(in, "GB2312"))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (StringUtils.isNotBlank(line)) {
                    // Naive split: quoted fields that contain commas are not handled
                    List<String> row = new ArrayList<>();
                    for (String cellData : line.split(",")) {
                        row.add(buildText(cellData));
                    }
                    data.add(row);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return data;
    }

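This fixes the garbled text for some users, but what if the file is encoded as UTF-8? Parsed this way, it still comes out garbled. So how can the problem be solved completely? I ran into this myself, thought about it for a long time, searched online for a good while, and eventually arrived at a solution.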
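To make the failure concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded with a hard-coded GB2312 charset. The class name GarbleDemo and the helper decode are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the article's code.
final class GarbleDemo {
    /** Decodes raw bytes with the named charset, much as InputStreamReader would. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character Chinese word for "Chinese (language)"
        String original = "\u4e2d\u6587";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Matching charset: the text survives the round trip
        System.out.println(decode(utf8Bytes, "UTF-8").equals(original));
        // Mismatched charset: the same bytes decode to different characters
        System.out.println(decode(utf8Bytes, "GB2312").equals(original));
    }
}
```

The first comparison holds and the second does not: the bytes on disk are unchanged, and only the charset used to interpret them decides whether the text is readable.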

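The solution is to detect the encoding of the file being parsed and then decode it with that encoding. The encoding can be detected with the following steps: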

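1. Read the first three bytes of the stream into a byte[3] array.
2. Convert each of the three bytes to its hexadecimal string form with Integer.toHexString(b & 0xFF).
3. Compare the resulting string with the UTF-8 header (byte order mark) EFBBBF to tell whether the file is UTF-8. This only recognizes UTF-8 files that were saved with a BOM.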
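The three steps above can be sketched in isolation as follows. This is an illustrative sketch rather than the article's implementation: the class BomSniffer and its method names are assumptions. It also zero-pads each byte, because Integer.toHexString drops leading zeros (0x0A becomes "a"), which would make the concatenated string ambiguous for some headers.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of the article's code.
final class BomSniffer {
    /** Returns true when the first three bytes are EF BB BF, the UTF-8 byte order mark. */
    static boolean hasUtf8Bom(byte[] head, int length) {
        return length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Hex form of the first bytes, zero-padded to two digits per byte. */
    static String toHex(byte[] head, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(String.format("%02x", head[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Peeks at up to three bytes of a mark-capable stream, then rewinds it. */
    static String peekHex(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[3];
        in.mark(3);
        int total = 0;
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n < 0) {
                break; // stream is shorter than three bytes
            }
            total += n;
        }
        in.reset();
        return toHex(head, total);
    }
}
```

Comparing the byte values directly, as hasUtf8Bom does, avoids the string round trip altogether; the hex form is mainly useful for logging what a file actually starts with.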

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import com.google.common.collect.Lists;      // Guava
    import org.apache.commons.lang3.StringUtils; // Apache Commons Lang

    /**
     * Reads a txt or csv file.
     *
     * @param input stream to read; it is closed before this method returns
     * @return the rows of the file, each row being a list of cell values
     */
    public static List<List<String>> readTxtOrCsvFile(InputStream input) {
        List<List<String>> data = Lists.newArrayList();
        if (input == null) {
            return data;
        }
        // BufferedInputStream supports mark/reset, which getCharSet needs to peek at the header.
        // Closing bb also closes the wrapped input stream.
        try (BufferedInputStream bb = new BufferedInputStream(input);
             BufferedReader br = new BufferedReader(new InputStreamReader(bb, getCharSet(bb)))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (StringUtils.isNotBlank(line)) {
                    // Naive split: quoted fields that contain commas are not handled
                    List<String> row = new ArrayList<>();
                    for (String cellData : line.split(",")) {
                        // buildText is the author's own helper; it is not shown in the article
                        row.add(buildText(cellData));
                    }
                    data.add(row);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return data;
    }


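    /**
     * Determines the charset to use for the stream.
     *
     * @param bb stream positioned at the start of the file
     * @return "UTF-8" if the stream starts with the UTF-8 BOM, otherwise "GB2312"
     * @throws IOException if the stream cannot be read or reset
     */
    private static String getCharSet(BufferedInputStream bb) throws IOException {
        byte[] buffer = new byte[3];
        // Reading consumes the stream, so mark the position first and rewind afterwards
        bb.mark(3);
        int total = 0;
        while (total < buffer.length) {
            int n = bb.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break; // file is shorter than three bytes
            }
            total += n;
        }
        boolean utf8Bom = total == 3
                && (buffer[0] & 0xFF) == 0xEF
                && (buffer[1] & 0xFF) == 0xBB
                && (buffer[2] & 0xFF) == 0xBF;
        if (utf8Bom) {
            // Stay positioned after the BOM so it does not end up in the first cell
            return "UTF-8";
        }
        bb.reset();
        // GBK and GB2312 files carry no signature bytes; a prefix such as d5cbba is
        // ordinary text content. Anything without a UTF-8 BOM is read as GB2312.
        return "GB2312";
    }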
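One gap in BOM-based detection is UTF-8 files saved without a BOM, which fall through to the GB2312 default. A possible complement, sketched here under the assumption that the file (or a sample of it that ends on a line boundary) fits in memory, is to test whether the bytes decode as strict UTF-8. The class Utf8Check and its methods are hypothetical and not part of the original solution; in rare cases GB2312 bytes can also happen to form valid UTF-8, so treat the result as a heuristic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the article's code.
final class Utf8Check {
    /** Returns true when the sample decodes as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] sample) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(sample));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Heuristic guess: strict UTF-8 if it validates, otherwise the article's GB2312 default. */
    static String guessCharset(byte[] sample) {
        return isValidUtf8(sample) ? "UTF-8" : "GB2312";
    }
}
```

Pure ASCII content validates as UTF-8, which is harmless because ASCII decodes identically under both charsets. A sample cut off in the middle of a multi-byte character would be rejected, which is why the sample should end on a line boundary.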

With that, the problem is solved for GB2312 files and for UTF-8 files saved with a BOM.
