Detecting a CSV File's Encoding to Fix Garbled Chinese Text When Reading CSV

Reposted article, 2019-03-19 15:58. Tags: CSV file encoding / garbled Chinese in CSV

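The problem we run into most often when parsing CSV files is garbled text. Some readers may say that hard-coding the encoding as GBK or GB2312 while parsing is enough to fix garbled Chinese, like this: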

    public static List<List<String>> readTxtOrCsvFile(InputStream input) {
        List<List<String>> data = Lists.newArrayList();
        if (input == null) {
            return data;
        }
        InputStreamReader read = null;
        BufferedReader br = null;
        try {
            read = new InputStreamReader(input, "GB2312");
            br = new BufferedReader(read);
            String line;
            while ((line = br.readLine()) != null) {
                if (StringUtils.isNotBlank(line)) {
                    List<String> dd = Arrays.asList(line.split(","));
                    List<String> n = new ArrayList<>();
                    for (int i = 0; i < dd.size(); i++) {
                        String cellData = dd.get(i);
                        n.add(buildText(cellData));
                    }
                    data.add(n);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
                if (read != null) {
                    read.close();
                }
                if (input != null) {
                    input.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return data;
    }

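This fixes the garbled text for some users. But what if the file is encoded as UTF-8? Parsed this way, it still comes out garbled. So how can the problem be solved for both cases? I ran into this myself, thought about it for a long time, searched online for quite a while, and ended up with the following solution.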
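To make the failure concrete, here is a minimal sketch of what a hard-coded GB2312 reader does to UTF-8 data. The class name `MojibakeDemo` and the helper `decode` are made up for this illustration and are not part of the original code. The sample text is the header "name,age" in Chinese, written with Unicode escapes so the source compiles under any default encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Decodes raw bytes with the named charset, as InputStreamReader would.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "name,age" in Chinese, as Unicode escapes.
        String original = "\u59d3\u540d,\u5e74\u9f84";
        // The bytes as they would sit in a file saved as UTF-8.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset: the Chinese characters come back as different characters.
        System.out.println(decode(utf8Bytes, "GB2312").equals(original));
        // Matching charset: the text round-trips intact.
        System.out.println(decode(utf8Bytes, "UTF-8").equals(original));
    }
}
```

The bytes in the file never change; only the charset used to interpret them does, which is why the reader has to know the file's encoding before it starts decoding.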

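The solution: detect the encoding of the file yourself, then decode the file with that encoding. The encoding can be detected with the following steps:

1. Read the first three bytes of the stream into a byte[3] array;
2. Convert each of the three bytes to its hexadecimal string form with Integer.toHexString(b & 0xFF), where b is one byte of the array;
3. Compare the resulting string with the UTF-8 file header (the byte order mark, BOM) EFBBBF to tell whether the file is UTF-8. This only recognizes UTF-8 files that were saved with a BOM.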
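The three steps above can be tried in isolation on in-memory data. The following is a small sketch, not the article's implementation: the class `BomHexDemo` and both method names are assumptions introduced here. It uses `%02x` formatting instead of `Integer.toHexString` so that bytes below 0x10 keep their leading zero and the result is always two hex digits per byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class BomHexDemo {

    // Steps 1 and 2: read up to three bytes and render each one as two hex digits.
    static String firstThreeBytesAsHex(InputStream in) {
        byte[] head = new byte[3];
        int n;
        try {
            // A single read may return fewer bytes on some streams;
            // that is acceptable for this in-memory demo.
            n = in.read(head, 0, head.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < n; i++) {
            // & 0xFF maps the signed byte to 0..255 before formatting.
            hex.append(String.format("%02x", head[i] & 0xFF));
        }
        return hex.toString();
    }

    // Step 3: compare against the UTF-8 byte order mark EF BB BF.
    static boolean startsWithUtf8Bom(InputStream in) {
        return "efbbbf".equals(firstThreeBytesAsHex(in));
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        byte[] withoutBom = {'a', ',', 'b'};
        System.out.println(startsWithUtf8Bom(new ByteArrayInputStream(withBom)));    // true
        System.out.println(startsWithUtf8Bom(new ByteArrayInputStream(withoutBom))); // false
    }
}
```

Note that both helpers consume the bytes they inspect, so a real reader has to rewind the stream (or skip only the BOM) before decoding the content.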

    /**
     * Reads a txt or csv file, detecting its charset first.
     *
     * @return the rows of the file, each row as a list of cell values
     */
    public static List<List<String>> readTxtOrCsvFile(InputStream input) {

        List<List<String>> data = Lists.newArrayList();
        if (input == null) {
            return data;
        }
        InputStreamReader read = null;
        BufferedReader br = null;
        BufferedInputStream bb = null;
        try {
            bb = new BufferedInputStream(input);
            read = new InputStreamReader(bb, getCharSet(bb));
            br = new BufferedReader(read);
            String line;
            while ((line = br.readLine()) != null) {
                if (StringUtils.isNotBlank(line)) {
                    List<String> dd = Arrays.asList(line.split(","));
                    List<String> n = new ArrayList<>();
                    for (int i = 0; i < dd.size(); i++) {
                        String cellData = dd.get(i);
                        n.add(buildText(cellData));
                    }
                    data.add(n);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
                if (read != null) {
                    read.close();
                }
                if (bb != null) {
                    bb.close();
                }
                if (input != null) {
                    input.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return data;
    }


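    /**
     * Detects the charset of the stream from its first three bytes.
     *
     * @param bb buffered stream positioned at the start of the file
     * @return "UTF-8" if the stream starts with the BOM EF BB BF, otherwise "GB2312"
     * @throws Exception if the stream cannot be read or reset
     */
    private static String getCharSet(BufferedInputStream bb) throws Exception {

        String charSet = null;
        byte[] buffer = new byte[3];
        // Reading consumes bytes, so mark the position first and restore it afterwards
        bb.mark(buffer.length);
        bb.read(buffer);
        bb.reset();
        String s = Integer.toHexString(buffer[0] & 0xFF) + Integer.toHexString(buffer[1] & 0xFF) + Integer.toHexString(buffer[2] & 0xFF);
        switch (s) {
            // GBK and GB2312 define no signature bytes, so d5cbba is not a format header,
            // only ordinary leading content; it is treated as GB2312, like the default
            case "d5cbba":
                charSet = "GB2312";
                break;
            case "efbbbf":
                charSet = "UTF-8";
                // Skip the BOM: Java's UTF-8 decoder does not strip it, so it would
                // otherwise show up as U+FEFF at the start of the first cell
                bb.skip(buffer.length);
                break;
            default:
                charSet = "GB2312";
                break;
        }

        return charSet;
    }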

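With this change, files that begin with the UTF-8 BOM are decoded as UTF-8 and everything else is decoded as GB2312, which solved the problem in my case. UTF-8 files saved without a BOM are not recognized by this check and still fall back to GB2312.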
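For files that carry no BOM, one possible fallback (not part of the original solution) is to test whether the bytes are well-formed UTF-8 using a strict `CharsetDecoder`. The sketch below is a heuristic under stated assumptions: `StrictUtf8Check`, `isValidUtf8`, and `guessCharset` are hypothetical names, and the check can misjudge short inputs, since plain ASCII is valid in both encodings and some GBK byte sequences can happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {

    // Returns true only if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: UTF-8 when the bytes validate, else the article's GB2312 fallback.
    static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GB2312";
    }

    public static void main(String[] args) {
        // Two Chinese characters encoded as UTF-8, no BOM.
        byte[] utf8NoBom = "\u59d3\u540d".getBytes(StandardCharsets.UTF_8);
        // Two double-byte pairs with the high bit set on every byte, as in GB2312 text;
        // 0xD0 followed by 0xD5 is not a legal UTF-8 sequence.
        byte[] gbLike = {(byte) 0xD0, (byte) 0xD5, (byte) 0xC3, (byte) 0xFB};

        System.out.println(guessCharset(utf8NoBom)); // UTF-8
        System.out.println(guessCharset(gbLike));    // GB2312
    }
}
```

In a reader like the one above, this would mean buffering a prefix of the file (or the whole file, if it is small) before choosing the charset, so it trades some memory for better coverage.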
