Writing LZO output from Hadoop and adding an index

This article is a repost. 2016-08-15 16:36 Tags: lzo, java, hadoop

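// Hadoop and hadoop-lzo imports used by the driver below.
// Constants, ETLTldMain, ETLTldMapper and ETLTldReducer are the project's own classes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.compression.lzo.LzoIndexer;
import com.hadoop.compression.lzo.LzopCodec;
import com.hadoop.mapreduce.LzoTextInputFormat;

public static void main(String[] args) throws Exception {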
        Configuration conf = new Configuration();

        conf.set("mapred.job.tracker", Constants.HADOOP_MAIN_IP + Constants.MAO_HAO + Constants.HADOOP_MAIN_PORT);

        if (args.length != 3) {
            System.err.println("Usage: Data Deduplication <in> <out> <reduceNum>");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated Job constructor in Hadoop 2.x.
        Job job = Job.getInstance(conf, "ETLTld Job");
        job.setJarByClass(ETLTldMain.class);

        job.setMapperClass(ETLTldMapper.class);
        job.setReducerClass(ETLTldReducer.class);

        job.setInputFormatClass(LzoTextInputFormat.class);

        job.setNumReduceTasks(Integer.parseInt(args[2]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the job output with the lzop codec so it can be indexed afterwards.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

        int result = job.waitForCompletion(true) ? 0 : 1;
        if (result == 0) {
            // The job succeeded: index the lzo files under the output path.
            LzoIndexer lzoIndexer = new LzoIndexer(conf);
            lzoIndexer.index(new Path(args[1]));
        }
        System.exit(result);

    }

If the lzo file already exists, an index can be added to it as follows:

bin/yarn jar /module/cloudera/parcels/GPLEXTRAS-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.4.0.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/cndns.db/ods_cndns_log/dt=20160803/node=alicn/part-r-00000.lzo
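
Before running the indexer over a directory, it can help to know which lzo files still lack an index. The sketch below is only an illustration: it works on a plain list of file names (for example, copied from a directory listing) and assumes the hadoop-lzo convention that the index for `name.lzo` is stored next to it as `name.lzo.index`. The class and method names are hypothetical and not part of hadoop-lzo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UnindexedLzoFinder {

    /** Returns the .lzo names in the listing that have no sibling "<name>.index" entry. */
    public static List<String> findUnindexed(List<String> fileNames) {
        Set<String> present = new HashSet<>(fileNames);
        List<String> missing = new ArrayList<>();
        for (String name : fileNames) {
            // Assumption: the index file is named "<lzo file name>.index".
            if (name.endsWith(".lzo") && !present.contains(name + ".index")) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up listing of a job output directory.
        List<String> listing = Arrays.asList(
                "part-r-00000.lzo",
                "part-r-00000.lzo.index",
                "part-r-00001.lzo",
                "_SUCCESS");
        System.out.println(findUnindexed(listing)); // prints [part-r-00001.lzo]
    }
}
```

Only the files returned by such a check would then need to be passed to the indexer command shown above.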

By default the lzo format is not splittable. An index file has to be added for an lzo file before multiple map tasks can process that file in parallel.
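
The reason an index helps can be sketched in a few lines. An lzo file is a sequence of compressed blocks, and a reader can only start decompressing at the beginning of a block. The index records where each block starts, so a split boundary can be moved forward to the next block start. The code below is a simplified illustration of that idea and not the hadoop-lzo implementation; the class name, method names, and block offsets are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LzoSplitSketch {

    /** Returns the first block offset >= pos, or fileLength if no block starts at or after pos. */
    public static long alignToBlock(long[] blockOffsets, long pos, long fileLength) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // binarySearch returns -(insertion point) - 1 when pos is absent
        }
        return i < blockOffsets.length ? blockOffsets[i] : fileLength;
    }

    /** Cuts the file into splits of roughly splitSize bytes, each given as {start, end}. */
    public static List<long[]> computeSplits(long[] blockOffsets, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        long start = alignToBlock(blockOffsets, 0, fileLength);
        while (start < fileLength) {
            // Push the tentative end forward to the next block boundary.
            long end = alignToBlock(blockOffsets, start + splitSize, fileLength);
            splits.add(new long[] {start, end});
            start = end;
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] blockOffsets = {0, 100, 250, 400, 520}; // hypothetical, sorted block starts
        for (long[] s : computeSplits(blockOffsets, 600, 200)) {
            System.out.println(s[0] + "-" + s[1]);
        }
        // prints:
        // 0-250
        // 250-520
        // 520-600
    }
}
```

Without the block offsets there is no safe place to cut the file, so the whole file has to go to a single map task.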

[Reference] http://blog.csdn.net/wisgood/article/details/17080361
