Handling Chinese Encodings in Spark

Reposted article; originally published 2016-06-09 15:30, filed under Linux.

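Our logs are GBK-encoded, but Hadoop hard-codes UTF-8 as its text encoding, so the final output came out garbled.

That sent me off to study how Java handles encodings.

There is a ready-made solution online for reading GBK input files in Spark. The code is as follows: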

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.TextInputFormat

// Text holds raw bytes; decode only the first getLength bytes of the backing array as GBK.
val rdd = ctx.hadoopFile(file_list, classOf[TextInputFormat],
            classOf[LongWritable], classOf[Text]).map(
            pair => new String(pair._2.getBytes, 0, pair._2.getLength, "GBK"))

The idea comes from the following helper:

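public static Text transformTextToUTF8(Text text, String encoding) {
    try {
        // getBytes() returns the backing array, which may be longer than the
        // valid data, so only the first getLength() bytes are decoded.
        String value = new String(text.getBytes(), 0, text.getLength(), encoding);
        return new Text(value);
    } catch (UnsupportedEncodingException e) {
        // Fail loudly instead of passing null to new Text(...).
        throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
    }
}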

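But this approach still has a problem.

GBK is a variable-width encoding: ASCII characters take one byte and Chinese characters take two. If a field in the log was truncated at a raw byte boundary, the first byte of a two-byte character can be left dangling. When the file is then decoded as GBK, that dangling byte swallows the \t delimiter that follows it, so splitting on \t yields the wrong number of fields (or, put differently, the fields shift out of position).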
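As a quick check of the byte widths involved, the sketch below measures how many bytes a few strings occupy in GBK. The class and method names (`GbkWidth`, `gbkLength`) are made up for illustration, and it assumes the JDK in use ships the GBK charset, as standard JDK distributions do.

```java
import java.nio.charset.Charset;

class GbkWidth {
    // Hypothetical helper: number of bytes a string occupies once encoded as GBK.
    static int gbkLength(String s) {
        return s.getBytes(Charset.forName("GBK")).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        System.out.println(gbkLength("a"));          // an ASCII letter
        System.out.println(gbkLength("\u4e2d"));     // one Chinese character
        System.out.println(gbkLength("\u4e2d\ta"));  // Chinese character, tab, ASCII letter
    }
}
```

The tab is an ordinary single byte (0x09) in GBK, which is why a byte-oriented view of the line can still find it reliably.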

What we need here is a single-byte decoding, namely ISO-8859-1, which maps every byte to exactly one character. The code is as follows:

// Same pipeline, but decode as ISO-8859-1 so that each byte becomes one char.
val rdd = ctx.hadoopFile(file_list, classOf[TextInputFormat],
            classOf[LongWritable], classOf[Text]).map(
            pair => new String(pair._2.getBytes, 0, pair._2.getLength, "ISO-8859-1"))

But this brings a new problem: the output, which is stored as UTF-8, is garbled and unusable.
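Both effects can be seen in a small standalone sketch: decoding GBK bytes as ISO-8859-1 keeps the tab intact and loses no bytes, but writing the resulting string as UTF-8 produces different bytes from the real text. The class and method names (`Latin1View`, `decodeLatin1`) are hypothetical, and the sample line is invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1View {
    // Hypothetical helper: view raw bytes as an ISO-8859-1 string, one char per byte.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String real = "\u4e2d\ta";  // a Chinese character, a tab, then 'a'
        byte[] gbkBytes = real.getBytes(Charset.forName("GBK"));
        String latin1 = decodeLatin1(gbkBytes);

        // One char per input byte, and the tab still separates two fields.
        System.out.println(latin1.length() == gbkBytes.length);
        System.out.println(latin1.split("\t").length);

        // The round trip back to bytes is lossless ...
        System.out.println(Arrays.equals(latin1.getBytes(StandardCharsets.ISO_8859_1), gbkBytes));
        // ... but encoding the placeholder chars as UTF-8 does not give the real text.
        System.out.println(Arrays.equals(latin1.getBytes(StandardCharsets.UTF_8),
                                         real.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The lossless round trip is the useful property here: the original GBK bytes are still recoverable from the ISO-8859-1 string.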

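Let us look at the problem from a different angle: how do you convert a GBK file to UTF-8 in Java or Scala? There is plenty of ready-made code online. For our scenario, processing the file line by line, sample code looks like this: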

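import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Encoding {
    private static final String kGBKEncoding = "GBK";
    private static final String kUTF8Encoding = "UTF-8";

    // args[0]: GBK input file, args[1]: UTF-8 output file
    public static void main(String[] args) {
        // try-with-resources flushes and closes the writer even on failure.
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(args[1]), kUTF8Encoding))) {
            // Decode the input as GBK; the writer re-encodes every line as UTF-8.
            List<String> lines = Files.readAllLines(Paths.get(args[0]), Charset.forName(kGBKEncoding));
            for (String line : lines) {
                out.append(line).append("\n");
            }
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}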

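The code above gives us a hint: the encoding conversion happens automatically when the file is written, so there is no need to convert each line explicitly.

After some reading, the picture is this: a Java String has its own internal encoding (UTF-16), and a charset only comes into play when a byte stream is decoded into a String or a String is encoded back into bytes.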
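A minimal sketch of that decode-then-encode model, using only `java.lang.String` and `java.nio.charset`. The class and method names (`TranscodeDemo`, `transcode`) are invented for illustration; the sample bytes are the GBK encoding of a single Chinese character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // Hypothetical helper: re-encode raw bytes from one charset to another via a String.
    static byte[] transcode(byte[] raw, Charset from, Charset to) {
        String text = new String(raw, from);  // decode: bytes -> String
        return text.getBytes(to);             // encode: String -> bytes
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0};  // U+4E2D encoded as GBK
        byte[] utf8 = transcode(gbk, Charset.forName("GBK"), StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b);        // hex dump of the UTF-8 bytes
        }
        System.out.println();
    }
}
```

The String in the middle carries no trace of GBK or UTF-8; the charset matters only at the two boundaries.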

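Combining these two observations, we simulate what happens on Spark: open the file as ISO-8859-1 and write it out as UTF-8. It turns out that all we need to do is re-decode each line as a GBK string, that is, take its ISO-8859-1 bytes and decode them as GBK. The resulting file is no longer garbled when opened as UTF-8. The code is as follows: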

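import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Encoding {
    private static final String kISOEncoding = "ISO-8859-1";
    private static final String kGBKEncoding = "GBK";
    private static final String kUTF8Encoding = "UTF-8";

    // args[0]: GBK input file, args[1]: UTF-8 output file
    public static void main(String[] args) {
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(args[1]), kUTF8Encoding))) {
            // Read as ISO-8859-1, the way the Spark job does: one char per byte.
            List<String> lines = Files.readAllLines(Paths.get(args[0]), Charset.forName(kISOEncoding));
            for (String line : lines) {
                // Recover the original bytes and decode them as GBK.
                String gbk_str = new String(line.getBytes(kISOEncoding), kGBKEncoding);
                out.append(gbk_str).append("\n");
            }
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}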
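For the tab-separated log scenario, one possible way to apply the same trick is to split on \t while the line is still in its ISO-8859-1 form and only then re-decode each field as GBK, so that a damaged character in one field cannot reach the delimiter. This is a sketch under that assumption, not code from the original job; the class and method names (`FieldDecoder`, `splitThenDecode`) and the sample line are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FieldDecoder {
    private static final Charset GBK = Charset.forName("GBK");

    // Hypothetical helper: takes a line that was decoded as ISO-8859-1,
    // splits it on tabs first, then re-decodes every field as GBK.
    static String[] splitThenDecode(String latin1Line) {
        String[] fields = latin1Line.split("\t", -1);  // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            // ISO-8859-1 encoding gives back exactly the bytes that were read.
            byte[] raw = fields[i].getBytes(StandardCharsets.ISO_8859_1);
            fields[i] = new String(raw, GBK);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Simulate a GBK log line read as ISO-8859-1: two Chinese characters, tab, "abc".
        byte[] gbkLine = "\u4e2d\u6587\tabc".getBytes(GBK);
        String latin1Line = new String(gbkLine, StandardCharsets.ISO_8859_1);
        for (String field : splitThenDecode(latin1Line)) {
            System.out.println(field);
        }
    }
}
```

Splitting before decoding relies on the fact that the tab byte 0x09 is never part of a well-formed two-byte GBK character, so every tab seen in the ISO-8859-1 view is a real delimiter.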

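Problem solved. It took a full working day to get there, which shows I am still not fluent enough in Java.

I wrote this up in the hope that it is useful to others.

Summary

1. Learn to generalize from one case to others.

2. Learn to use Google; lately I have been depending on it to get by.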
