Install the hdfs package
pip install hdfs
View the HDFS directory tree
[root@hadoop hadoop]# hdfs dfs -ls -R /
drwxr-xr-x   - root supergroup          0 2017-05-18 23:57 /Demo
-rw-r--r--   1 root supergroup       3494 2017-05-18 23:57 /Demo/hadoop-env.sh
drwxr-xr-x   - root supergroup          0 2017-05-18 19:01 /logs
-rw-r--r--   1 root supergroup       2223 2017-05-18 19:01 /logs/anaconda-ks.cfg
-rw-r--r--   1 root supergroup      57162 2017-05-18 18:32 /logs/install.log
Create an HDFS client instance
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__Author__ = 'kongZhaGen'

import hdfs
client = hdfs.Client("http://172.10.236.21:50070")
list: returns the names of the files and directories contained in a remote folder; raises an error if the path does not exist.
hdfs_path: path to the remote folder
status: also return each file's status information
def list(self, hdfs_path, status=False):
  """Return names of files contained in a remote folder.

  :param hdfs_path: Remote path to a directory. If `hdfs_path` doesn't exist
    or points to a normal file, an :class:`HdfsError` will be raised.
  :param status: Also return each file's corresponding FileStatus_.

  """
Example:
print client.list("/", status=False)
Result:
[u'Demo', u'logs']
status: get status information for a file or folder on HDFS
hdfs_path: remote path
strict:
    False: return None if the remote path does not exist
    True: raise an exception if the remote path does not exist
def status(self, hdfs_path, strict=True):
  """Get FileStatus_ for a file or folder on HDFS.

  :param hdfs_path: Remote path.
  :param strict: If `False`, return `None` rather than raise an exception if
    the path doesn't exist.

  .. _FileStatus: FS_
  .. _FS: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#FileStatus

  """
Example:
print client.status(hdfs_path="/Demoo", strict=False)
Result:
None
makedirs: create a directory on HDFS, recursively if necessary
hdfs_path: remote directory path
permission: permission to set on the newly created directory
def makedirs(self, hdfs_path, permission=None):
  """Create a remote directory, recursively if necessary.

  :param hdfs_path: Remote path. Intermediate directories will be created
    appropriately.
  :param permission: Octal permission to set on the newly created directory.
    These permissions will only be set on directories that do not already
    exist.

  This function currently has no return value as WebHDFS doesn't return a
  meaningful flag.

  """
Example:
To create HDFS directories from a remote client via a script, hdfs-site.xml must first be modified:
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
Then restart HDFS:
stop-dfs.sh
start-dfs.sh
Create the directory recursively:
client.makedirs("/data/rar/tmp",permission=755)
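A caveat on the permission argument: WebHDFS treats it as an octal string, so "755" means rwxr-xr-x rather than the decimal number 755. A minimal stdlib check (no cluster needed, the mode constants are an illustration):

```python
import stat

# "755" interpreted as octal, the way WebHDFS reads the permission parameter.
mode = int("755", 8)
print(mode)                              # 493 (decimal value of octal 755)
print(stat.filemode(stat.S_IFDIR | mode))  # 'drwxr-xr-x'
```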
rename: move a file or folder
hdfs_src_path: source path
hdfs_dst_path: destination path. If the path exists and is a directory, the source is moved into it; if it exists and is a file, an exception is raised.
def rename(self, hdfs_src_path, hdfs_dst_path):
  """Move a file or folder.

  :param hdfs_src_path: Source path.
  :param hdfs_dst_path: Destination path. If the path already exists and is
    a directory, the source will be moved into it. If the path exists and is
    a file, or if a parent destination directory is missing, this method
    will raise an :class:`HdfsError`.

  """
Example:
client.rename("/SRC_DATA","/dest_data")
delete: remove a file or directory from HDFS
hdfs_path: path on the HDFS filesystem
recursive: for a non-empty directory, True deletes it recursively; False raises an exception.
def delete(self, hdfs_path, recursive=False):
  """Remove a file or directory from HDFS.

  :param hdfs_path: HDFS path.
  :param recursive: Recursively delete files and directories. By default,
    this method will raise an :class:`HdfsError` if trying to delete a
    non-empty directory.

  This function returns `True` if the deletion was successful and `False` if
  no file or directory previously existed at `hdfs_path`.

  """
Example:
client.delete("/dest_data",recursive=True)
upload: upload a file or directory to the HDFS filesystem. If the target path already exists and is a directory, the file or directory is uploaded into it; otherwise the target path itself is created as the upload destination.
def upload(self, hdfs_path, local_path, overwrite=False, n_threads=1,
    temp_dir=None, chunk_size=2 ** 16, progress=None, cleanup=True, **kwargs):
  """Upload a file or directory to HDFS.

  :param hdfs_path: Target HDFS path. If it already exists and is a
    directory, files will be uploaded inside.
  :param local_path: Local path to file or folder. If a folder, all the
    files inside of it will be uploaded (note that this implies that folders
    empty of files will not be created remotely).
  :param overwrite: Overwrite any existing file or directory.
  :param n_threads: Number of threads to use for parallelization. A value of
    `0` (or negative) uses as many threads as there are files.
  :param temp_dir: Directory under which the files will first be uploaded
    when `overwrite=True` and the final remote path already exists. Once the
    upload successfully completes, it will be swapped in.
  :param chunk_size: Interval in bytes by which the files will be uploaded.
  :param progress: Callback function to track progress, called every
    `chunk_size` bytes. It will be passed two arguments, the path to the
    file being uploaded and the number of bytes transferred so far. On
    completion, it will be called once with `-1` as second argument.
  :param cleanup: Delete any uploaded files if an error occurs during the
    upload.
  :param \*\*kwargs: Keyword arguments forwarded to :meth:`write`.

  On success, this method returns the remote upload path.

  """
Example:
>>> import hdfs
>>> client = hdfs.Client("http://172.10.236.21:50070")
>>> client.upload("/logs","/root/training/jdk-7u75-linux-i586.tar.gz")
'/logs/jdk-7u75-linux-i586.tar.gz'
>>> client.list("/logs")
[u'anaconda-ks.cfg', u'install.log', u'jdk-7u75-linux-i586.tar.gz']
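The progress callback contract described in the docstring (called every chunk_size bytes with the path and bytes sent so far, then once with -1 on completion) can be exercised locally without a cluster. Here simulate_upload is a hypothetical stand-in for upload's internal loop, written only to illustrate the callback signature:

```python
events = []

def progress(path, nbytes):
    # Collect each (path, bytes-transferred) callback invocation.
    events.append((path, nbytes))

def simulate_upload(path, size, chunk_size):
    # Hypothetical driver mimicking how upload would invoke `progress`:
    # once per chunk with the running byte count, then once with -1.
    sent = 0
    while sent < size:
        sent = min(sent + chunk_size, size)
        progress(path, sent)
    progress(path, -1)

simulate_upload("/logs/file.bin", 150000, 2 ** 16)
print(events)  # [('/logs/file.bin', 65536), ('/logs/file.bin', 131072),
               #  ('/logs/file.bin', 150000), ('/logs/file.bin', -1)]
```

With a real client you would pass a function with this same signature as the progress argument of upload.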
content: get summary information about a file or directory on HDFS
print client.content("/logs/install.log")
Result:
{u'spaceConsumed': 57162, u'quota': -1, u'spaceQuota': -1, u'length': 57162, u'directoryCount': 0, u'fileCount': 1}
write: create a file on the HDFS filesystem; the data can be a string, a generator, or a file object.
def write(self, hdfs_path, data=None, overwrite=False, permission=None,
    blocksize=None, replication=None, buffersize=None, append=False,
    encoding=None):
  """Create a file on HDFS.

  :param hdfs_path: Path where to create file. The necessary directories
    will be created appropriately.
  :param data: Contents of file to write. Can be a string, a generator or a
    file object. The last two options will allow streaming upload (i.e.
    without having to load the entire contents into memory). If `None`, this
    method will return a file-like object and should be called using a
    `with` block (see below for examples).
  :param overwrite: Overwrite any existing file or directory.
  :param permission: Octal permission to set on the newly created file.
    Leading zeros may be omitted.
  :param blocksize: Block size of the file.
  :param replication: Number of replications of the file.
  :param buffersize: Size of upload buffer.
  :param append: Append to a file rather than create a new one.
  :param encoding: Encoding used to serialize data written.

  """
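The generator/file-object options matter because they let write stream data instead of holding the whole payload in memory. A sketch of the chunking this enables, using only the stdlib (chunks is a hypothetical helper, not part of the library):

```python
import io

def chunks(fileobj, chunk_size=2 ** 16):
    # Hypothetical helper: yield the file in chunk_size pieces, the kind of
    # generator `write` can consume without loading everything into memory.
    while True:
        block = fileobj.read(chunk_size)
        if not block:
            break
        yield block

src = io.BytesIO(b"x" * 100000)
sizes = [len(c) for c in chunks(src)]
print(sizes)  # [65536, 34464]
```

With a live cluster, such a generator could then be passed directly, e.g. client.write("/data/out.bin", data=chunks(open("big.bin", "rb"))).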