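I have a UTF-8 encoded text file. I read it into a string with FileReader and then convert the character set: str=new String(str.getBytes(),"UTF-8"); Most of the Chinese text then displays correctly, but some characters still show up as question marks!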
public static List<String> getLines(String fileName) {
    List<String> lines = new ArrayList<String>();
    try {
        // FileReader decodes the file with the platform default charset (GBK here)
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        String line = null;
        while ((line = br.readLine()) != null) {
            // Attempt to undo the GBK decoding, then decode the bytes as UTF-8
            lines.add(new String(line.getBytes("GBK"), "UTF-8"));
        }
        br.close();
    } catch (FileNotFoundException e) {
    } catch (IOException e) {
    }
    return lines;
}
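When the file is read, it is decoded with the OS default character set, which is GBK. So I first encode the string with that same charset, str.getBytes("GBK"), which should restore the byte sequence as it was in the file, and then decode those bytes as UTF-8. In theory the resulting string should be correct.
So why is part of the result still garbled?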
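The problem lies in how FileReader reads the file. FileReader extends InputStreamReader but does not provide the parent's constructors that take a charset argument (at least before Java 11, which added FileReader constructors that accept a Charset), so FileReader can only decode with the system default charset. The UTF-8 -> GBK -> UTF-8 round trip is then lossy: any byte sequence that is not valid GBK is replaced with a substitute character during decoding, so the original bytes cannot be recovered and the original characters cannot be restored.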
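The loss can be reproduced without any file at all. The following is a minimal sketch, not the original program: the class name RoundTripDemo and the method roundTrip are made up for illustration, and it assumes the JDK in use ships the GBK charset. A single Chinese character takes three bytes in UTF-8. Decoded as GBK, the first two bytes form one double-byte character, and the third byte is left as a lead byte with nothing after it, so the decoder substitutes a replacement character. Encoding that replacement back to GBK yields a '?' byte instead of the original one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {

    // Assumption: the GBK charset is available on this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Mimics the faulty pipeline: UTF-8 bytes are decoded as GBK (what FileReader
    // does on a GBK-default system), re-encoded as GBK, then decoded as UTF-8.
    static String roundTrip(String original) {
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(fileBytes, GBK);
        byte[] restored = misread.getBytes(GBK);
        return new String(restored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d"; // one Chinese character, 3 bytes in UTF-8
        byte[] before = original.getBytes(StandardCharsets.UTF_8);
        byte[] after = new String(before, GBK).getBytes(GBK);
        System.out.println("bytes before: " + Arrays.toString(before));
        System.out.println("bytes after:  " + Arrays.toString(after));
        System.out.println("round trip intact: " + original.equals(roundTrip(original)));
    }
}
```

Pure ASCII text survives this round trip because ASCII bytes mean the same thing in both encodings, and many Chinese characters survive too when their bytes happen to pair up into valid GBK sequences. That would explain why only part of the text turns into question marks.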
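With the cause identified, the fix is straightforward: use InputStreamReader instead of FileReader, InputStreamReader isr=new InputStreamReader(new FileInputStream(fileName),"UTF-8"); The file is then decoded as UTF-8 directly, and no charset conversion is needed afterward.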
public static List<String> getLines(String fileName) {
    List<String> lines = new ArrayList<String>();
    try {
        // Decode the file as UTF-8 explicitly instead of using the platform default
        BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), "UTF-8"));
        String line = null;
        while ((line = br.readLine()) != null) {
            lines.add(line); // already decoded correctly, no conversion needed
        }
        br.close();
    } catch (FileNotFoundException e) {
    } catch (IOException e) {
    }
    return lines;
}
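The same idea, naming the charset at the point where bytes become characters, can also be written with the java.nio.file API available since Java 7. This is only a sketch of an alternative, not the method used above: the class name Utf8Lines and the method readUtf8Lines are hypothetical. One difference worth knowing is that Files.readAllLines throws an exception on bytes that are not valid UTF-8, whereas InputStreamReader silently substitutes a replacement character.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class Utf8Lines {

    // Reads every line of the file, decoding it as UTF-8 regardless of the
    // platform default charset. The file is closed before the method returns.
    static List<String> readUtf8Lines(String fileName) throws IOException {
        return Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file to a temporary location, then read it back.
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        Files.write(tmp, "\u4e2d\u6587\nabc\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readUtf8Lines(tmp.toString()));
        Files.delete(tmp);
    }
}
```

Unlike the version above, this sketch lets IOException propagate to the caller rather than swallowing it, so a missing or unreadable file is not mistaken for an empty one.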