Bulk Ingest

The ingest.sh shell script is used to ingest data onto a PNDA cluster.


To install dependencies and configure the tool for an HttpFS endpoint, run the install command with the address of a node running the HttpFS role, given as a URL of the form http://host:port. The port is typically 14000.

 ingest.sh install http://host:port
 ingest.sh install

Check with your cluster administrator for the correct IP address and port, as they may differ in production deployments.


To upload a file or directory to the cluster, run the upload command with the file or directory name. Use the -f flag to overwrite existing files, and the -t flag followed by a number of threads to upload in parallel.

ingest.sh upload <local_file_or_directory>
ingest.sh upload Readme.txt
ingest.sh upload -f -t 10 /user/data

Once the upload completes, verify that the transferred files are stored in the /user/pnda/PNDA_datasets/bulk/ folder in HDFS.
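One way to check the target folder without logging in to a cluster node is to query HttpFS directly, since it exposes the WebHDFS REST API under /webhdfs/v1. The sketch below only builds the LISTSTATUS URL. The helper name liststatus_url and the user name pnda are assumptions for illustration and are not part of ingest.sh.

```shell
#!/bin/sh
# Build a WebHDFS LISTSTATUS URL for an HttpFS endpoint.
# $1: base URL of the HttpFS node, $2: absolute HDFS path, $3: HDFS user name.
# This helper is hypothetical and not part of ingest.sh.
liststatus_url() {
    base="${1%/}"    # drop a single trailing slash from the base URL, if present
    printf '%s/webhdfs/v1%s?op=LISTSTATUS&user.name=%s\n' "$base" "$2" "$3"
}

# Example (needs network access to the cluster, so it is left commented out):
# curl -s "$(liststatus_url http://host:14000 /user/pnda/PNDA_datasets/bulk pnda)"
```

On a node with the Hadoop client installed, hdfs dfs -ls /user/pnda/PNDA_datasets/bulk/ should give an equivalent listing.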


The tool depends on the hdfs Python package, which is installed with pip. The install command also installs this package and populates its CLI configuration.
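For reference, the hdfs package (HdfsCLI) reads its settings from ~/.hdfscli.cfg by default, organised as a [global] section plus one section per alias. The sketch below writes a minimal file of that shape. The alias name pnda, the user name, and the helper write_hdfscli_config are assumptions, and the file that ingest.sh actually generates may differ.

```shell
#!/bin/sh
# Write a minimal HdfsCLI configuration file.
# $1: HttpFS URL, $2: HDFS user name, $3: output path.
# The alias name "pnda" is hypothetical; ingest.sh may use a different one.
write_hdfscli_config() {
    cat > "$3" <<EOF
[global]
default.alias = pnda

[pnda.alias]
url = $1
user = $2
EOF
}

# Example:
# write_hdfscli_config http://host:14000 pnda "$HOME/.hdfscli.cfg"
```

Inspecting the existing file before overwriting it is advisable, since the install command may already have populated it.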

Known bugs

  • Sometimes files or nested directories are not overwritten when uploading large folders. In that case, rerun the upload command with the -f switch.
  • Appending a trailing slash ('/') to the directory name can have unexpected effects, in some cases even overwriting the directory. To upload a directory, use just its name as the argument.
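To guard against the trailing-slash issue when the path comes from tab completion or another script, the argument can be normalised before it reaches the tool. This is a minimal sketch. The helper strip_trailing_slash is hypothetical and not part of ingest.sh.

```shell
#!/bin/sh
# Remove all trailing slashes from a path, leaving a bare "/" untouched.
strip_trailing_slash() {
    p="$1"
    # Keep stripping while the path is not "/" and still ends in a slash.
    while [ "$p" != "/" ] && [ "${p%/}" != "$p" ]; do
        p="${p%/}"
    done
    printf '%s\n' "$p"
}

# Example (assumes ingest.sh is on PATH):
# ingest.sh upload "$(strip_trailing_slash "local_directory/")"
```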

