hadoop: Cheat Sheet

Updated: July 5, 2024 | Published: October 7, 2020

Learn essential commands for distributed storage, file operations, cluster management, and data processing with this quick reference guide. Perfect for big data engineers and analysts!

Hadoop is an open-source framework for distributed storage and processing of large data sets using the MapReduce programming model. It includes components like HDFS, YARN, and MapReduce.

hadoop

These commands were used on a Hadoop cluster running in Azure to move files between the cluster and blob storage.

Copy folder and its contents from HDI storage account to HDP local storage.

hadoop distcp \
wasb://this@that.blob.core.windows.net/thing/incoming \
hdfs://hadoop-hdfs-cluster:8020/tmp/test/

Copy folder and its contents from HDP local storage to cold storage account.

hadoop distcp \
hdfs://hadoop-hdfs-cluster:8020/tmp/test/incoming \
wasb://this@that.blob.core.windows.net/tmp/test/

Remove folder on cold storage account for next test

hdfs dfs -rm -r wasb://this@that.blob.core.windows.net/tmp/test/incoming/
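
When the cleanup runs before every test, the folder may not exist yet, and a plain -rm -r then reports an error. A minimal sketch of a guarded delete, assuming a hypothetical helper name remove_if_present; it relies on hdfs dfs -test -d, which returns 0 when the path is a directory.

```shell
# remove_if_present is a hypothetical helper, not part of Hadoop.
# It deletes a directory only when `hdfs dfs -test -d` reports that it exists.
remove_if_present() {
  target="$1"
  if hdfs dfs -test -d "$target"; then
    hdfs dfs -rm -r "$target"
  else
    echo "skip: $target not found"
  fi
}

# Example call, using the placeholder account and container from this page:
# remove_if_present wasb://this@that.blob.core.windows.net/tmp/test/incoming/
```

If no message is needed for the missing case, hdfs dfs -rm -r -f should behave similarly, since -f suppresses the error for a path that does not exist.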

Copy folder and its contents from HDI storage account to cold storage account on HDP master.

hadoop distcp \
wasb://this@that.blob.core.windows.net/thing/incoming \
wasb://foo@bar.blob.core.windows.net/tmp/test/
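
When the same copy is repeated, distcp can be asked to transfer only what changed. The sketch below wraps that in two hypothetical helpers, run and sync_dir, with a hypothetical DRY_RUN switch that prints the command instead of executing it. The -update and -m options are standard distcp options; the default of 20 maps is only an illustrative value. Note that with -update, distcp copies the contents of the source directory into the target rather than creating the source directory underneath it, so the target path may need to name the final directory.

```shell
# run is a hypothetical wrapper: with DRY_RUN=1 it prints the command,
# otherwise it executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# sync_dir is a hypothetical helper.
# -update copies only files that are missing or differ at the target.
# -m caps the number of simultaneous map tasks used for the copy.
sync_dir() {
  src="$1"
  dst="$2"
  maps="${3:-20}"
  run hadoop distcp -update -m "$maps" "$src" "$dst"
}

# Example call, using the placeholder accounts from this page:
# DRY_RUN=1
# sync_dir wasb://this@that.blob.core.windows.net/thing/incoming \
#          wasb://foo@bar.blob.core.windows.net/tmp/test/incoming
```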

hdfs

Run filesystem commands

hdfs dfs COMMAND

Show command usage

hdfs dfs

List contents of directory

hdfs dfs -ls /some/path/to/list
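
The listing can also be walked recursively with -R, which is handy for a quick check after a copy. A minimal sketch, assuming a hypothetical helper name count_files; it counts regular files by matching listing lines whose permission string starts with a dash.

```shell
# count_files is a hypothetical helper.
# `hdfs dfs -ls -R` prints one line per entry; file lines start with "-"
# (for example -rw-r--r--) and directory lines start with "d".
count_files() {
  hdfs dfs -ls -R "$1" | grep -c '^-'
}

# Example call:
# count_files /some/path/to/list
```

Adding -h to -ls prints sizes in human-readable units instead of bytes.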

Cat a file

hdfs dfs -cat /path/to/file

Add file to HDFS. Use -f to force overwrite

hdfs dfs -put /some/file.ext /path/in/hdfs/for/file

Add folder and its contents to HDFS

hdfs dfs -put /linux/path /hdfs/path

Get a file (copy it from HDFS to the local filesystem)

hdfs dfs -get /some/path/in/hdfs /path/to/place/file

Remove a file. Use -r for recursive delete

hdfs dfs -rm /some/path/in/hdfs

Make directory. Use -p to create entire path.

hdfs dfs -mkdir /path/to/new/directory
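
Because -mkdir accepts several paths and -p does not fail on directories that already exist, a whole folder layout can be created in one repeatable call. A minimal sketch, assuming a hypothetical helper name make_layout and hypothetical subfolder names (incoming, processed, failed).

```shell
# make_layout is a hypothetical helper; the subfolder names are examples only.
# -p creates missing parents and is a no-op for directories that already exist.
make_layout() {
  base="$1"
  hdfs dfs -mkdir -p "$base/incoming" "$base/processed" "$base/failed"
}

# Example call:
# make_layout /tmp/test
```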

Remove directory.

hdfs dfs -rmdir /path/to/new/directory

Remove directory with content

hdfs dfs -rm -r /path/to/new/directory

Leave safe mode. This was necessary when HDFS was restarted after a config change but failed to leave safe mode and start. Run as the privileged user hdfs.

sudo -u hdfs hdfs dfsadmin -safemode leave
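
Forcing the NameNode out of safe mode is only useful when it is actually in safe mode, so it can help to query the state first. A minimal sketch, assuming a hypothetical helper name leave_safemode_if_on and assuming that the output of -safemode get contains the text "Safe mode is ON" when safe mode is active.

```shell
# leave_safemode_if_on is a hypothetical helper.
# It prints the current state and issues "leave" only when safe mode is ON.
leave_safemode_if_on() {
  status=$(sudo -u hdfs hdfs dfsadmin -safemode get)
  echo "$status"
  case "$status" in
    *"Safe mode is ON"*)
      sudo -u hdfs hdfs dfsadmin -safemode leave
      ;;
    *)
      echo "nothing to do"
      ;;
  esac
}

# Example call:
# leave_safemode_if_on
```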

Check HDFS file consistency and find corrupt, missing, and under-replicated blocks. Run as the privileged user hdfs.

sudo -u hdfs hdfs fsck /
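
For scripted checks, the fsck report can be reduced to a pass or fail result. A minimal sketch, assuming a hypothetical helper name fsck_is_healthy and assuming the report ends with a summary line containing "is HEALTHY" for a clean filesystem.

```shell
# fsck_is_healthy is a hypothetical helper.
# It returns 0 when the fsck report contains "is HEALTHY", non-zero otherwise.
fsck_is_healthy() {
  path="${1:-/}"
  sudo -u hdfs hdfs fsck "$path" | grep -q "is HEALTHY"
}

# Example call:
# if fsck_is_healthy /; then echo "ok"; else echo "needs attention"; fi
```

When the check fails, hdfs fsck / -list-corruptfileblocks lists the affected files, and -files -blocks -locations adds per-file block detail.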

List HDI storage account from HDP master

hdfs dfs -ls wasb://this@that.blob.core.windows.net/

List cold storage account from HDP master

hdfs dfs -ls wasb://this@that.blob.core.windows.net/

Create folder on cold storage account

hdfs dfs -mkdir -p wasb://this@that.blob.core.windows.net/tmp/test

Create folder on local HDFS

hdfs dfs -mkdir /tmp/test