webhdfspy

A Python wrapper library for the Hadoop WebHDFS REST API.

Installation

To install webhdfspy from PyPI:

$ pip install webhdfspy

Python versions

webhdfspy requires Python 3.9+

Usage

>>> import webhdfspy
>>> client = webhdfspy.WebHDFSClient("localhost", 50070, "username")
>>> print(client.listdir('/'))
[]
>>> client.mkdir('/foo')
True
>>> print(client.listdir('/'))
[{'group': 'supergroup', 'permission': '755', ...}]
>>> client.create('/foo/foo.txt', "just put some text here", overwrite=True)
True
>>> print(client.open('/foo/foo.txt'))
just put some text here
>>> client.remove('/foo', recursive=True)
True

Using a context manager:

>>> with webhdfspy.WebHDFSClient("localhost", 50070, "username") as client:
...     client.listdir('/')
[]

HTTPS support:

>>> client = webhdfspy.WebHDFSClient("host", 9871, "user", scheme="https")

API Documentation

class webhdfspy.WebHDFSClient(host: str, port: int, username: str | None = None, logger: Logger | None = None, *, timeout: float = 60.0, scheme: str = 'http')

Client for Hadoop WebHDFS REST API.

Supports context manager protocol for automatic resource cleanup:

with WebHDFSClient("host", 50070, username="user") as client:
    client.listdir("/")

append(path: str, file_data: Any, buffersize: int | None = None) bool

Append data to a file.

Parameters:
  • path – path of the file

  • file_data – data to append

  • buffersize – size of the buffer used to transfer the data

cancel_delegation_token(token: str) bool

Cancel a delegation token.

Parameters:

token – the delegation token

chmod(path: str, permission: str) bool

Set the permissions of a file or directory.

Parameters:
  • path – path of the file/dir

  • permission – permissions in octal (e.g. "755")
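
Local file modes are integers, while chmod and mkdir expect the permission as an octal string. A minimal sketch of the conversion, using only the standard library; mode_to_octal is a hypothetical helper and not part of webhdfspy:

```python
import stat


def mode_to_octal(mode: int) -> str:
    """Return the permission bits of a stat-style mode as an octal string."""
    # stat.S_IMODE drops the file-type bits and keeps only the permission bits.
    return format(stat.S_IMODE(mode), "o")


# 0o100755 is the st_mode of a regular file with rwxr-xr-x permissions.
permission = mode_to_octal(0o100755)
print(permission)
```

The resulting string could then be passed as the permission argument, for example client.chmod('/foo', permission).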

close() None

Close the underlying HTTP session.

copyfromlocal(local_path: str, hdfs_path: str, overwrite: bool | None = None) bool

Copy a file from the local filesystem to HDFS.

Parameters:
  • local_path – path of the local file

  • hdfs_path – HDFS destination path

  • overwrite – whether to overwrite an existing file

create(path: str, file_data: Any, overwrite: bool | None = None) bool

Create a new file in HDFS.

Uses the two-step WebHDFS create protocol (NameNode redirect then DataNode upload).

Parameters:
  • path – the file path to create

  • file_data – the data to write

  • overwrite – whether to overwrite an existing file
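
In the WebHDFS REST API, the first step of a create is a PUT to the NameNode with op=CREATE and no file data; the NameNode answers with a redirect whose Location header names the DataNode, and the second step is a PUT of the data to that location. The sketch below only builds the first-step URL so the protocol is easier to follow. build_create_url is a hypothetical helper, and how webhdfspy assembles its requests internally is an assumption here:

```python
from urllib.parse import quote, urlencode


def build_create_url(host, port, path, username=None, overwrite=None, scheme="http"):
    """Build the NameNode URL for step one of the WebHDFS create protocol."""
    params = {"op": "CREATE"}
    if username is not None:
        # user.name is the WebHDFS query parameter for simple authentication.
        params["user.name"] = username
    if overwrite is not None:
        params["overwrite"] = "true" if overwrite else "false"
    # quote() keeps "/" intact and percent-encodes other unsafe characters.
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


url = build_create_url("localhost", 50070, "/foo/foo.txt", "username", overwrite=True)
print(url)
```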

environ_home() str

Return the home directory of the user.

get_checksum(path: str) dict[str, Any]

Return the checksum of a file.

Parameters:

path – path of the file

Returns:

FileChecksum dict

get_content_summary(path: str) dict[str, Any]

Return the content summary of a directory.

Parameters:

path – path of the directory

Returns:

ContentSummary dict

get_delegation_token(renewer: str) dict[str, Any]

Get a delegation token.

Parameters:

renewer – the user who can renew the token

Returns:

Token dict

listdir(path: str = '/') list[dict[str, Any]]

List all the contents of a directory.

Parameters:

path – path of the directory

Returns:

a list of FileStatus dicts
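
Each FileStatus dict follows the WebHDFS FileStatus schema, which includes a pathSuffix (the entry name) and a type of FILE, DIRECTORY or SYMLINK. A small sketch of separating directories from files; split_entries is a hypothetical helper and the sample data is hand-written for illustration, not real server output:

```python
def split_entries(statuses):
    """Split FileStatus dicts into (directory names, file names)."""
    dirs = [s["pathSuffix"] for s in statuses if s["type"] == "DIRECTORY"]
    files = [s["pathSuffix"] for s in statuses if s["type"] == "FILE"]
    return dirs, files


# Illustrative entries shaped like FileStatus dicts (only a few keys shown).
sample = [
    {"pathSuffix": "foo", "type": "DIRECTORY", "length": 0},
    {"pathSuffix": "bar.txt", "type": "FILE", "length": 23},
]
dirs, files = split_entries(sample)
print(dirs, files)
```

With a live client, the input would be the return value of client.listdir(path).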

mkdir(path: str, permission: str | None = None) bool

Create a directory hierarchy, like mkdir -p.

Parameters:
  • path – the path of the directory

  • permission – dir permissions in octal (e.g. "755")

open(path: str, offset: int | None = None, length: int | None = None, buffersize: int | None = None) str

Open a file and read its contents.

Parameters:
  • path – path of the file

  • offset – starting byte position

  • length – number of bytes to read

  • buffersize – size of the buffer used to transfer the data

Returns:

the file data as text
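
The offset and length parameters make it possible to read a large file piece by piece. The sketch below assumes that status() returns a FileStatus dict with a length key (the file size in bytes, per the WebHDFS schema). read_in_chunks is a hypothetical helper, and FakeClient is an in-memory stand-in used only so the example runs without a cluster:

```python
def read_in_chunks(client, path, chunk_size=1024):
    """Yield the file at `path` piece by piece using offset/length reads."""
    size = client.status(path)["length"]  # file size in bytes
    for offset in range(0, size, chunk_size):
        # The last chunk is shortened so the read never runs past the end.
        yield client.open(path, offset=offset, length=min(chunk_size, size - offset))


class FakeClient:
    """In-memory stand-in for WebHDFSClient, for illustration only."""

    def __init__(self, files):
        self._files = {name: text.encode("ascii") for name, text in files.items()}

    def status(self, path):
        return {"length": len(self._files[path])}

    def open(self, path, offset=None, length=None, buffersize=None):
        data = self._files[path]
        start = offset or 0
        end = len(data) if length is None else start + length
        return data[start:end].decode("ascii")


fake = FakeClient({"/foo/foo.txt": "just put some text here"})
chunks = list(read_in_chunks(fake, "/foo/foo.txt", chunk_size=8))
print(chunks)
```

Because offsets are byte positions while open() returns text, a chunk boundary may fall inside a multi-byte character; the sketch sidesteps this by using ASCII data.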

remove(path: str, recursive: bool = False) bool

Delete a file or directory.

Parameters:
  • path – path of the file or dir to delete

  • recursive – whether to also delete the contents of subdirectories

rename(src: str, dst: str) bool

Rename a file or directory.

Parameters:
  • src – path of the file or dir to rename

  • dst – destination path

renew_delegation_token(token: str) int

Renew a delegation token.

Parameters:

token – the delegation token

Returns:

new expiration time in ms since epoch
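
The returned expiration time is an integer number of milliseconds, which is easier to log or compare once converted to a datetime. A minimal sketch; expiry_to_datetime is a hypothetical helper, not part of webhdfspy:

```python
from datetime import datetime, timezone


def expiry_to_datetime(expiry_ms: int) -> datetime:
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(expiry_ms / 1000, tz=timezone.utc)


# 86_400_000 ms is exactly one day after the epoch.
print(expiry_to_datetime(86_400_000).isoformat())
```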

set_owner(path: str, owner: str | None = None, group: str | None = None) bool

Set the owner and/or group of a file or directory.

Parameters:
  • path – path of the file/dir

  • owner – new owner name

  • group – new group name

set_replication(path: str, replication_factor: int) bool

Set the replication factor of a file.

Parameters:
  • path – path of the file

  • replication_factor – number of replicas (must be greater than 0)

set_times(path: str, modificationtime: int | None = None, accesstime: int | None = None) bool

Set modification and/or access time of a file.

Parameters:
  • path – path of the file

  • modificationtime – modification time in ms since epoch

  • accesstime – access time in ms since epoch
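
Both timestamps are expressed in milliseconds since the epoch, whereas Python's datetime works in seconds. A minimal sketch of the conversion; to_epoch_ms is a hypothetical helper, and treating naive datetimes as UTC is an assumption made for this example:

```python
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> int:
    """Convert a datetime to integer milliseconds since the epoch."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return round(moment.timestamp() * 1000)


modified = to_epoch_ms(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(modified)
```

The result could then be passed on, for example client.set_times('/foo/foo.txt', modificationtime=modified).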

status(path: str) dict[str, Any]

Return the FileStatus of a file or directory.

Parameters:

path – path of the file/dir

Returns:

a FileStatus dictionary

Exceptions

class webhdfspy.WebHDFSException(msg: str)

Base exception for WebHDFS errors.

class webhdfspy.WebHDFSRemoteException(message: str, status_code: int, exception: str = '', java_class_name: str = '')

Exception raised when WebHDFS returns a RemoteException.

class webhdfspy.WebHDFSConnectionError(msg: str, cause: Exception | None = None)

Exception raised when a connection to WebHDFS fails.
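
For context, a WebHDFS error response carries a JSON body with a RemoteException object containing exception, javaClassName and message fields. The sketch below maps such a body onto names that mirror the WebHDFSRemoteException constructor arguments. parse_remote_exception is a hypothetical helper, and whether webhdfspy populates its exception in exactly this way is an assumption:

```python
import json


def parse_remote_exception(body: str, status_code: int) -> dict:
    """Extract the RemoteException fields from a WebHDFS error response body."""
    detail = json.loads(body)["RemoteException"]
    return {
        "message": detail.get("message", ""),
        "status_code": status_code,
        "exception": detail.get("exception", ""),
        "java_class_name": detail.get("javaClassName", ""),
    }


# Body shaped like the error example in the WebHDFS documentation.
body = json.dumps({
    "RemoteException": {
        "exception": "FileNotFoundException",
        "javaClassName": "java.io.FileNotFoundException",
        "message": "File does not exist: /foo/a.patch",
    }
})
fields = parse_remote_exception(body, 404)
print(fields["exception"], fields["status_code"])
```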

WebHDFS documentation

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
