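The problem we run into most often when parsing CSV files is garbled text. Some readers may say: just set the encoding to GBK or GB2312 when parsing and the garbled Chinese goes away, like this: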
public static List<List<String>> readTxtOrCsvFile(InputStream input) {
    List<List<String>> data = Lists.newArrayList();
    if (input == null) {
        return data;
    }
    InputStreamReader read = null;
    BufferedReader br = null;
    try {
        // The charset is hard-coded: every file is decoded as GB2312.
        read = new InputStreamReader(input, "GB2312");
        br = new BufferedReader(read);
        String line;
        while ((line = br.readLine()) != null) {
            // Skip blank lines; split the rest on commas into cells.
            if (StringUtils.isNotBlank(line)) {
                List<String> dd = Arrays.asList(line.split(","));
                List<String> n = new ArrayList<>();
                for (int i = 0; i < dd.size(); i++) {
                    String cellData = dd.get(i);
                    n.add(buildText(cellData));
                }
                data.add(n);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close the readers and the underlying stream.
        try {
            if (br != null) {
                br.close();
            }
            if (read != null) {
                read.close();
            }
            if (input != null) {
                input.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return data;
}
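This fixes the garbled text for some users, but what if the file is encoded as UTF-8? Then the parsed result is still garbled. How can the problem be solved completely? I ran into this myself, thought about it for a long time, searched online for a long while, and eventually found a solution.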
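To see why a hard-coded charset breaks on UTF-8 input, here is a minimal sketch that decodes the same UTF-8 bytes once with the right charset and once with GBK. The class name MojibakeDemo and the helper decodeUtf8BytesAs are illustrative names, not part of the original code, and the sketch assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with the given (possibly wrong) charset. */
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        // Matching charset: the text round-trips unchanged.
        System.out.println(decodeUtf8BytesAs(original, StandardCharsets.UTF_8));
        // Mismatched charset: the same bytes are read as GBK and come out garbled.
        System.out.println(decodeUtf8BytesAs(original, Charset.forName("GBK")));
    }
}
```

The bytes on disk never change; only the charset chosen for decoding decides whether the text survives.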
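The solution: detect the encoding of the file yourself, then parse it with that encoding. The encoding can be detected as follows:
1. Read the first three bytes of the stream into a byte[3] array.
2. Convert each of the three bytes to its hexadecimal string form with Integer.toHexString(b & 0xFF). Note that toHexString does not zero-pad, so a byte below 0x10 produces a single digit.
3. Compare the resulting string with the UTF-8 file header (the byte-order mark, BOM) EFBBBF to tell whether the file is UTF-8. This only recognizes UTF-8 files that were saved with a BOM.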
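The three steps can be tried in isolation on an in-memory byte array. The following is only a sketch: the class BomHexDemo and its method names are made up for illustration, and it pads each byte to two hex digits with String.format so that the comparison string always has six characters.

```java
class BomHexDemo {
    /** Returns the first (up to) three bytes as lowercase hex, two digits per byte. */
    static String firstThreeBytesAsHex(byte[] head) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 3 && i < head.length; i++) {
            // & 0xFF turns the signed byte into an unsigned value in 0..255.
            hex.append(String.format("%02x", head[i] & 0xFF));
        }
        return hex.toString();
    }

    /** True when the bytes start with the UTF-8 byte-order mark EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] head) {
        return "efbbbf".equals(firstThreeBytesAsHex(head));
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        byte[] plain = {'a', ',', 'b'};
        System.out.println(startsWithUtf8Bom(withBom));
        System.out.println(startsWithUtf8Bom(plain));
        // Without padding, a small byte value yields a single hex digit.
        System.out.println(Integer.toHexString(0x0A));
    }
}
```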
/**
 * Reads a txt or csv file.
 *
 * @param input the file content as a stream; may be null
 * @return the parsed rows, one list of cell strings per non-blank line
 */
public static List<List<String>> readTxtOrCsvFile(InputStream input) {
    List<List<String>> data = Lists.newArrayList();
    if (input == null) {
        return data;
    }
    InputStreamReader read = null;
    BufferedReader br = null;
    BufferedInputStream bb = null;
    try {
        // Wrap the input so getCharSet can peek at the first bytes and rewind.
        bb = new BufferedInputStream(input);
        read = new InputStreamReader(bb, getCharSet(bb));
        br = new BufferedReader(read);
        String line;
        while ((line = br.readLine()) != null) {
            // Skip blank lines; split the rest on commas into cells.
            if (StringUtils.isNotBlank(line)) {
                List<String> dd = Arrays.asList(line.split(","));
                List<String> n = new ArrayList<>();
                for (int i = 0; i < dd.size(); i++) {
                    String cellData = dd.get(i);
                    n.add(buildText(cellData));
                }
                data.add(n);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close the readers and the underlying streams.
        try {
            if (br != null) {
                br.close();
            }
            if (read != null) {
                read.close();
            }
            if (bb != null) {
                bb.close();
            }
            if (input != null) {
                input.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return data;
}
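/**
 * Detects the charset of the stream from its first three bytes.
 *
 * @param bb the buffered stream, positioned at the start of the file
 * @return "UTF-8" if the stream starts with the UTF-8 BOM, otherwise "GB2312"
 * @throws Exception if reading from the stream fails
 */
private static String getCharSet(BufferedInputStream bb) throws Exception {
    byte[] buffer = new byte[3];
    // Reading consumes bytes, so mark the start first and rewind afterwards.
    bb.mark(buffer.length);
    int total = 0;
    while (total < buffer.length) {
        // read() may return fewer bytes than requested, so loop until full or end of stream.
        int n = bb.read(buffer, total, buffer.length - total);
        if (n < 0) {
            break;
        }
        total += n;
    }
    // %02x zero-pads each byte; Integer.toHexString would turn 0x0a into "a".
    String s = String.format("%02x%02x%02x", buffer[0] & 0xFF, buffer[1] & 0xFF, buffer[2] & 0xFF);
    if (total == buffer.length && "efbbbf".equals(s)) {
        // Leave the BOM consumed so it does not show up as U+FEFF in the first cell.
        return "UTF-8";
    }
    // GBK and GB2312 files carry no signature bytes (a prefix such as d5cbba is just
    // ordinary file content), so everything without a UTF-8 BOM is read as GB2312.
    bb.reset();
    return "GB2312";
}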
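The BOM check only recognizes UTF-8 files that start with EF BB BF. Many tools write UTF-8 without a BOM, and such files still fall into the GB2312 default. One possible complement, sketched here as an assumption and not as the method described in this post, is to try decoding the content strictly as UTF-8 and fall back to GB2312 only when that fails. The class Utf8Sniffer and its methods are hypothetical names. The sketch needs the bytes in memory, and a sample cut in the middle of a multi-byte character would be rejected.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Sniffer {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Hypothetical helper: "UTF-8" when the content validates, otherwise the GB2312 default. */
    static String guessCharset(byte[] bytes) {
        return isValidUtf8(bytes) ? "UTF-8" : "GB2312";
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        System.out.println(guessCharset(text.getBytes(StandardCharsets.UTF_8)));
        // Assumes the JDK in use provides the GBK charset.
        System.out.println(guessCharset(text.getBytes(Charset.forName("GBK"))));
    }
}
```

Pure ASCII content also validates as UTF-8, which is harmless because ASCII decodes identically under both charsets. It is possible, though uncommon, for GBK-encoded text to happen to form valid UTF-8, so this remains a heuristic.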
With that, the problem was solved for both GB2312/GBK files and UTF-8 files that start with a BOM.
