Reading from the HDFS File System with Java

This article is a repost. 2017-07-29 14:59 · Java

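Recently I had a requirement: computing user profiles.

The system has roughly 8 million users, and a set of metrics has to be computed for each of them.

The data volume is fairly large. Computing it with Hive is no problem at all, but the results were written to Oracle, and serving the data to the front end from there was painful.

So I switched to a different approach:

  1. Compute with Hive and write the results to HDFS.

  2. Read them back through an API and write them to HBase (the HDFS and HBase versions do not match, so Sqoop cannot import the data directly).

That is where the problem came up.

I needed to write an API that reads the files on HDFS.

Main class: ReadHDFS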

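import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadHDFS {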

    public static void main(String[] args) {

        long startLong = System.currentTimeMillis();
        HDFSReadLog.writeLog("start read file");
        // When two arguments are given, they override the defaults held in Constant.
        if (args.length > 1) {
            Constant.init(args[0], args[1]);
        }
        HDFSReadLog.writeLog(Constant.PATH);

        try {
            getFile(Constant.URI + Constant.PATH);
        } catch (IOException e) {
            e.printStackTrace();
        }

        long endLong = System.currentTimeMillis();
        HDFSReadLog.writeLog("cost " + (endLong - startLong) / 1000 + " seconds");
        HDFSReadLog.writeLog("cost " + (endLong - startLong) / 1000 / 60 + " minutes");
    }

    private static void getFile(String filePath) throws IOException {

        FileSystem fs = FileSystem.get(URI.create(filePath), HDFSConf.getConf());
        Path path = new Path(filePath);
        if (fs.exists(path) && fs.isDirectory(path)) {

            FileStatus[] stats = fs.listStatus(path);
            for (FileStatus file : stats) {
                HDFSReadLog.writeLog("start read : " + file.getPath());
                // FileStatus.getLen() is the real file length; InputStream.available()
                // is only an estimate, so it is not used to size the buffer.
                if (file.getLen() == 0) {
                    HDFSReadLog.writeLog("have no data : " + file.getPath());
                    continue;
                }
                // try-with-resources closes the stream even when reading fails.
                try (FSDataInputStream is = fs.open(file.getPath())) {
                    // Throws ArithmeticException for files too large for one byte array.
                    int sum = Math.toIntExact(file.getLen());
                    HDFSReadLog.writeLog("there are : " + sum + " bytes");
                    byte[] buffer = new byte[sum];
                    // Note: a very large file may exhaust memory. In a local test,
                    // reading a file of a little over 100 MB ran out of memory.
                    is.readFully(0, buffer);
                    String result = Bytes.toString(buffer);
                    // write to HBase
                    WriteHBase.writeHbase(result);

                    HDFSReadLog.writeLog("read : " + file.getPath() + " end");
                } catch (IOException e) {
                    e.printStackTrace();
                    HDFSReadLog.writeLog("read " + file.getPath() + " error");
                    HDFSReadLog.writeLog(e.getMessage());
                }
            }
            HDFSReadLog.writeLog("Read End");
            fs.close();

        }else {
            HDFSReadLog.writeLog(path + " does not exist or is not a directory");
        }

    }
}
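
As the memory note inside the loop says, loading a whole file into one byte array can run out of memory. One possible alternative is to stream the file line by line and hand the lines on in small batches. The following is only a minimal sketch under assumptions: StreamingLineReader, readInBatches, the batch size and the sink callback are hypothetical names that are not part of the original code, and it assumes the Hive output is line-oriented UTF-8 text. Because FSDataInputStream is a java.io.InputStream, the stream returned by fs.open(file.getPath()) could be passed in where the sample uses an in-memory stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of the original project.
public class StreamingLineReader {

    /**
     * Reads the stream line by line and passes the lines to the sink in
     * batches of at most batchSize, so memory use is bounded by one batch
     * instead of the whole file. Returns the total number of lines read.
     */
    public static long readInBatches(InputStream in, int batchSize,
                                     Consumer<List<String>> sink) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);
                    // Start a fresh list so the sink may keep the old one.
                    batch = new ArrayList<>(batchSize);
                }
            }
        }
        // Flush the last, possibly partial, batch.
        if (!batch.isEmpty()) {
            sink.accept(batch);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for fs.open(file.getPath()); any InputStream works here.
        InputStream in = new ByteArrayInputStream(
                "u1\t25\nu2\t31\nu3\t47\n".getBytes(StandardCharsets.UTF_8));
        long lines = readInBatches(in, 2,
                batch -> System.out.println("batch: " + batch));
        System.out.println("lines: " + lines);
    }
}
```

In the loop above, the sink would be the place to call the HBase writer once per batch, provided that writer accepts a list of lines rather than one large string; that is an assumption about code this post does not show.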

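Configuration class: HDFSConf (it feels like it is of little use; once the url and path are set correctly, the files can be read without any extra configuration)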

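import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class HDFSConf {

    public static Configuration conf = null;

    // Lazily builds one shared Configuration; synchronized so that
    // concurrent callers cannot initialize it twice.
    public static synchronized Configuration getConf() {
        if (conf == null) {
            conf = new Configuration();
            String path = Constant.getSysEnv("HADOOP_HOME") + "/etc/hadoop/";
            HDFSReadLog.writeLog("Get hadoop home : " + Constant.getSysEnv("HADOOP_HOME"));
            // hdfs conf
            // addResource(String) looks the name up on the classpath, so each local
            // file is wrapped in a Path to be read from the local filesystem instead.
            conf.addResource(new Path(path + "core-site.xml"));
            conf.addResource(new Path(path + "hdfs-site.xml"));
            conf.addResource(new Path(path + "mapred-site.xml"));
            conf.addResource(new Path(path + "yarn-site.xml"));
        }
        return conf;
    }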

}

Some constants:

  url : hdfs://ip:port

  path : the path on HDFS
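
The Constant class that holds these two values is not shown in this post. A minimal sketch of what it might look like follows. Everything in it is an assumption: the default values are placeholders, the argument order of init (URI first, then path) is a guess, and the trailing and leading slash handling is added only so that Constant.URI + Constant.PATH always forms a valid address.

```java
// Hypothetical reconstruction; the original Constant class is not shown.
public final class Constant {

    // Placeholder defaults, not values from the original project.
    public static String URI = "hdfs://localhost:8020";
    public static String PATH = "/tmp/user_profile";

    private Constant() {
    }

    /** Overrides the defaults, assuming the order (uri, path). */
    public static void init(String uri, String path) {
        // Drop a trailing slash on the URI and make the path absolute,
        // so that URI + PATH contains exactly one slash at the join.
        URI = uri.endsWith("/") ? uri.substring(0, uri.length() - 1) : uri;
        PATH = path.startsWith("/") ? path : "/" + path;
    }

    /** Returns the environment variable, or an empty string when it is unset. */
    public static String getSysEnv(String name) {
        String value = System.getenv(name);
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        init("hdfs://namenode:8020/", "warehouse/user_profile");
        System.out.println(URI + PATH);
    }
}
```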

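Note: the table being read may consist of more than one file, so the code loops over every file in the directory.

See the next post for writing the data to HBase.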
