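The most common problem when parsing CSV files is garbled text. Some readers may say that setting the encoding to GBK or GB2312 when parsing is enough to fix garbled Chinese, as follows: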
    public static List<List<String>> readTxtOrCsvFile(InputStream input) {
        List<List<String>> data = Lists.newArrayList();
        if (input == null) {
            return data;
        }
        InputStreamReader read = null;
        BufferedReader br = null;
        try {
            read = new InputStreamReader(input, "GB2312");
            br = new BufferedReader(read);
            String line;
            while ((line = br.readLine()) != null) {
                if (StringUtils.isNotBlank(line)) {
                    List<String> dd = Arrays.asList(line.split(","));
                    List<String> n = new ArrayList<>();
                    for (int i = 0; i < dd.size(); i++) {
                        String cellData = dd.get(i);
                        n.add(buildText(cellData));
                    }
                    data.add(n);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
                if (read != null) {
                    read.close();
                }
                if (input != null) {
                    input.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return data;
    }
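This fixes the garbled text for some users, but what if the file is encoded as UTF-8? In that case the parsed result is still garbled. So how can the problem be solved completely? I ran into this myself, thought about it for a long time, searched online for quite a while, and eventually found a solution.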
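To make the failure concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded with a hard-coded GB2312 charset. The class name MojibakeDemo and the helper decode are made up for this illustration, and the sample text is the two characters for "Chinese" written as Unicode escapes; it assumes the JDK in use ships the GB2312 charset, as standard desktop and server JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class MojibakeDemo {

    /** Decodes raw bytes with the given charset, the same way InputStreamReader would. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        Charset gb2312 = Charset.forName("GB2312");

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).equals(original));

        // Decoding the same bytes as GB2312 yields different, garbled characters.
        System.out.println(decode(utf8Bytes, gb2312).equals(original));
    }
}
```

The bytes on disk are identical in both calls; only the charset passed to the decoder differs, which is why the reader has to know the file's encoding before it starts reading.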
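The solution is to detect the encoding of the file yourself and then decode it with that encoding. The encoding can be detected with the following steps: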
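1. Read the first three bytes of the stream into a byte[3] array;
2. Convert each of the three bytes in the array to its hexadecimal string form with Integer.toHexString(b & 0xFF);
3. Compare the resulting string with the UTF-8 file header (the byte order mark, EFBBBF) to tell whether the file is UTF-8.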
    /**
     * Reads a txt or csv file.
     *
     * @param input the stream to read; it is closed before this method returns
     * @return one list of cells for each non-blank line
     */
    public static List<List<String>> readTxtOrCsvFile(InputStream input) {
        List<List<String>> data = Lists.newArrayList();
        if (input == null) {
            return data;
        }
        // try-with-resources closes the reader and the streams it wraps.
        try (BufferedInputStream bb = new BufferedInputStream(input);
             BufferedReader br = new BufferedReader(new InputStreamReader(bb, getCharSet(bb)))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (StringUtils.isNotBlank(line)) {
                    List<String> n = new ArrayList<>();
                    for (String cellData : line.split(",")) {
                        n.add(buildText(cellData));
                    }
                    data.add(n);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return data;
    }

    /**
     * Detects the charset of the stream from its first three bytes.
     *
     * @param bb a stream positioned at the start of the file
     * @return "UTF-8" if the stream starts with the UTF-8 BOM, otherwise "GB2312"
     * @throws IOException if reading from the stream fails
     */
    private static String getCharSet(BufferedInputStream bb) throws IOException {
        byte[] buffer = new byte[3];
        // Reading consumes bytes, so mark the position first and restore it afterwards.
        bb.mark(buffer.length);
        int len = bb.read(buffer, 0, buffer.length);
        String s = "";
        if (len == buffer.length) {
            s = Integer.toHexString(buffer[0] & 0xFF)
                    + Integer.toHexString(buffer[1] & 0xFF)
                    + Integer.toHexString(buffer[2] & 0xFF);
        }
        if ("efbbbf".equals(s)) {
            // Do not reset: leaving the BOM consumed keeps it out of the first cell.
            return "UTF-8";
        }
        // GBK and GB2312 have no signature bytes, so everything else
        // (including UTF-8 saved without a BOM) is parsed as GB2312.
        bb.reset();
        return "GB2312";
    }
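With this, the garbled text is resolved for UTF-8 files that start with a BOM as well as for GBK/GB2312 files.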
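The BOM check cannot recognize UTF-8 files that were saved without a BOM. One possible way to cover that case, sketched below, is to try decoding the bytes strictly as UTF-8 and fall back to GB2312 only when that fails. This is a different technique from the three-byte check above, not the author's method; the class CharsetGuess and its methods are hypothetical names, and the sketch assumes the file is small enough to hold in a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article.
class CharsetGuess {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guesses UTF-8 when the bytes decode cleanly, otherwise falls back to GB2312. */
    static Charset guess(byte[] bytes) {
        // The GB2312 fallback mirrors the default used in the article (an assumption here).
        return isValidUtf8(bytes) ? StandardCharsets.UTF_8 : Charset.forName("GB2312");
    }
}
```

This is a heuristic: pure ASCII content is reported as UTF-8 (which decodes it correctly anyway), and a short GBK byte sequence could in rare cases also happen to be valid UTF-8, so the guess should be treated as a best effort.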