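The most common problem when parsing a CSV file is garbled text. Some readers will say that hard-coding the encoding as GBK or GB2312 while parsing is enough to fix garbled Chinese, like this: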
public static List<List<String>> readTxtOrCsvFile(InputStream input) {
    List<List<String>> data = Lists.newArrayList();
    if (input == null) {
        return data;
    }
    InputStreamReader read = null;
    BufferedReader br = null;
    try {
        read = new InputStreamReader(input, "GB2312");
        br = new BufferedReader(read);
        String line;
        while ((line = br.readLine()) != null) {
            if (StringUtils.isNotBlank(line)) {
                List<String> dd = Arrays.asList(line.split(","));
                List<String> n = new ArrayList<>();
                for (int i = 0; i < dd.size(); i++) {
                    String cellData = dd.get(i);
                    n.add(buildText(cellData));
                }
                data.add(n);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (br != null) {
                br.close();
            }
            if (read != null) {
                read.close();
            }
            if (input != null) {
                input.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return data;
}
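This fixes garbled text for some users, but what if the file is encoded as UTF-8? Then the parsed result is still garbled. How can the problem be solved for good? I ran into this myself, thought about it for a long time, searched online for a while, and ended up with the following solution.
The solution: detect the encoding of the file being parsed, then parse it with that encoding. The encoding can be detected with these steps:
1. Read the first three bytes of the stream into a byte[3] array;
2. Convert each of the three bytes to its hexadecimal string with Integer.toHexString(b & 0xFF) (note that toHexString does not zero-pad, so a byte below 0x10 yields a single digit);
3. Compare the resulting string with the UTF-8 file header EFBBBF (the byte order mark, BOM) to tell whether the file is UTF-8. Only files that actually start with a BOM can be recognised this way.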
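The three steps above can be sketched in isolation on a plain byte array. This is a minimal illustration, not the code used in this post: the class name `BomHexCheck` and both method names are made up for the example, and it uses `String.format("%02x", ...)` instead of `Integer.toHexString` so that bytes below 0x10 keep their leading zero.

```java
final class BomHexCheck {

    /** Formats the first three bytes as six lowercase hex digits (hypothetical helper). */
    static String firstThreeBytesAsHex(byte[] head) {
        if (head == null || head.length < 3) {
            return ""; // not enough bytes to contain a UTF-8 BOM
        }
        StringBuilder sb = new StringBuilder(6);
        for (int i = 0; i < 3; i++) {
            // & 0xFF turns the signed byte into 0..255; %02x pads to two digits
            sb.append(String.format("%02x", head[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Returns true if the data starts with the UTF-8 BOM, EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] head) {
        return "efbbbf".equals(firstThreeBytesAsHex(head));
    }
}
```

For example, `BomHexCheck.startsWithUtf8Bom(bytes)` is true for a file saved as "UTF-8 with BOM" and false for a GBK file, whose first three bytes are simply the start of its first characters.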
/**
 * Reads a txt or csv file.
 *
 * @param input the stream to read; it is closed before this method returns
 * @return the non-blank lines of the file, each split on "," into cells
 */
public static List<List<String>> readTxtOrCsvFile(InputStream input) {
    List<List<String>> data = Lists.newArrayList();
    if (input == null) {
        return data;
    }
    InputStreamReader read = null;
    BufferedReader br = null;
    BufferedInputStream bb = null;
    try {
        // BufferedInputStream supports mark/reset, which getCharSet relies on
        bb = new BufferedInputStream(input);
        read = new InputStreamReader(bb, getCharSet(bb));
        br = new BufferedReader(read);
        String line;
        while ((line = br.readLine()) != null) {
            if (StringUtils.isNotBlank(line)) {
                List<String> dd = Arrays.asList(line.split(","));
                List<String> n = new ArrayList<>();
                for (int i = 0; i < dd.size(); i++) {
                    String cellData = dd.get(i);
                    n.add(buildText(cellData));
                }
                data.add(n);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (br != null) {
                br.close();
            }
            if (read != null) {
                read.close();
            }
            if (bb != null) {
                bb.close();
            }
            if (input != null) {
                input.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return data;
}
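/**
 * Detects the character encoding of the stream.
 *
 * @param bb the buffered stream; it is rewound to its starting position before returning
 * @return "UTF-8" if the stream starts with the UTF-8 BOM, otherwise "GB2312"
 * @throws Exception if reading or resetting the stream fails
 */
private static String getCharSet(BufferedInputStream bb) throws Exception {
    String charSet;
    byte[] buffer = new byte[3];
    // Reading consumes bytes, so mark the position first and reset afterwards.
    bb.mark(buffer.length);
    bb.read(buffer);
    bb.reset();
    // %02x keeps leading zeros, which Integer.toHexString would drop.
    String s = String.format("%02x%02x%02x", buffer[0] & 0xFF, buffer[1] & 0xFF, buffer[2] & 0xFF);
    switch (s) {
        case "efbbbf":
            // The BOM stays in the stream and decodes to U+FEFF at the start
            // of the first cell unless it is stripped later.
            charSet = "UTF-8";
            break;
        default:
            // GBK/GB2312 files carry no signature. A value such as d5cbba is just
            // the first bytes of one particular file's content, so every stream
            // without a UTF-8 BOM is treated as GB2312.
            charSet = "GB2312";
            break;
    }
    return charSet;
}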
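With this change, both UTF-8 files that start with a BOM and GB2312/GBK files parse without garbled text. A UTF-8 file saved without a BOM still falls through to the GB2312 default.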
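BOM sniffing cannot recognise UTF-8 files that were saved without a BOM. One possible way to cover that case, sketched below under assumptions, is to try a strict UTF-8 decode first and fall back to GBK only when the bytes are not valid UTF-8. This is a swapped-in technique, not the method used above: the class `CsvCharsetGuess` and its methods are hypothetical, the whole file is assumed to fit in memory as a `byte[]`, and GBK is assumed to be the only legacy encoding in play (it requires a JDK that ships the extended charsets).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CsvCharsetGuess {

    // Assumption: anything that is not valid UTF-8 is GBK (a superset of GB2312).
    private static final Charset FALLBACK = Charset.forName("GBK");

    /** Returns true if the bytes decode as UTF-8 with no malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes file content, dropping a leading UTF-8 BOM if one is present. */
    static String decode(byte[] bytes) {
        boolean hasBom = bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
        if (hasBom) {
            // Skip the three BOM bytes so U+FEFF does not end up in the first cell
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        if (isValidUtf8(bytes)) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
        return new String(bytes, FALLBACK);
    }
}
```

The check is a heuristic: pure ASCII content is valid in both encodings, so it decodes the same either way, and a short GBK text could in rare cases also happen to be valid UTF-8. For typical Chinese CSV content, GBK byte pairs almost always fail strict UTF-8 decoding, so the fallback is taken.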
