A fully containerized Hadoop, Spark, Hive, Pig, HBase, and Zookeeper environment for quick and efficient Big Data processing.
- 📖 My Story
- 👥 Authors
- ✨ Features
- 🔧 Tech Stack
- 💻 OS Support
- 📋 Prerequisites
- 🚀 Installation Guide
- 📝 Modify the Owner Name
- 🌐 Interact with the Web UI
- 📧 Contact
## 📖 My Story

Setting up a Hadoop cluster manually is frustrating, especially when integrating Spark, Hive, HBase, and other components. My friend and I initially developed HaMu (Hadoop Multi Node) for simple Hadoop cluster deployment using Docker.

Building on that foundation, I extended the project into a full-fledged Big Data stack, adding Spark, Hive, Pig, HBase, and Zookeeper. The goal was an all-in-one, containerized Big Data environment that's easy to spin up, experiment with, and build on, with no manual configuration nightmares.

💡 I hope HadoopSphere helps you quickly set up a Big Data environment for learning and development! 🎉
## 👥 Authors

- @Quoc Huy (extended with Spark, Hive, Pig, HBase, and Zookeeper)
## ✨ Features

- 🚀 Deploy a multi-node Hadoop cluster with an extended Big Data stack (Spark, Hive, Pig, HBase, and Zookeeper) using just one command.
- 🚀 Easily configure the number of slave nodes to match your testing or development needs.
- 🚀 All core services (HDFS, YARN, Spark, Hive, Pig, HBase, Zookeeper) run inside Docker containers.
- 🚀 Access web UIs for monitoring Hadoop, Spark, HBase jobs, and more.
- 🚀 Customize the cluster owner's name.
## 🔧 Tech Stack

- Hadoop Cluster (HDFS, YARN)
- Apache Spark (standalone mode)
- Apache Hive (with Derby metastore)
- Apache Pig
- Apache HBase
- Apache Zookeeper
- Docker (containerized setup)
## 💻 OS Support

Cross-platform compatibility: this project leverages Docker containers, enabling seamless execution across various operating systems, including:

- 🪟 Windows, via WSL2 (Windows Subsystem for Linux 2) or Docker Desktop.
- 🐧 Linux: Ubuntu, CentOS, Debian, and other distributions.
## 📋 Prerequisites

- 🐳 Docker
- 📚 Basic knowledge of Hadoop, Spark, Hive, Pig, HBase, and Zookeeper
## 🚀 Installation Guide

### Step 1: Clone the Repository

```bash
git clone https://github.com/huy-dataguy/HadoopSphere.git
cd HadoopSphere
```
### Step 2: Build the Docker Images

Building the Docker images is required only the first time, or after making changes in the HadoopSphere directory (such as modifying the owner name). Make sure Docker is running before proceeding.

⏳ Note: The first build may take a few minutes, as no cached layers exist yet.
⚠️ If you're using Windows: you might see errors like `required file not found` when running the shell scripts (`.sh`), because Windows uses a different line-ending format. To fix this, convert the script files to Unix format using `dos2unix`:

```bash
dos2unix ./scripts/build-image.sh
dos2unix ./scripts/start-cluster.sh
dos2unix ./scripts/resize-number-slaves.sh
```
```bash
./scripts/build-image.sh
```
### Step 3: Start the Cluster

```bash
./scripts/start-cluster.sh
```

By default, this starts a cluster with 1 master and 2 slaves. The argument sets the total number of nodes; for example, to start a cluster with 1 master and 5 slaves:

```bash
./scripts/start-cluster.sh 6
```
### Step 4: Interact with the Cluster

After Step 3, you will be inside the master container's CLI, where you can interact with the cluster.
💡 Start the HDFS services:

```bash
start-dfs.sh
```
💡 Check the HDFS nodes:

```bash
hdfs dfsadmin -report
```
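Once HDFS reports live DataNodes, you can try a quick smoke test from the master container. This is a minimal sketch; the paths and file names below are illustrative, not part of the project:

```bash
# Quick HDFS smoke test (run inside the master container; paths are illustrative).
hdfs dfs -mkdir -p /demo            # create a directory in HDFS
echo "hello hadoop" > hello.txt     # create a small local file
hdfs dfs -put hello.txt /demo/      # upload it to HDFS
hdfs dfs -cat /demo/hello.txt       # read it back from HDFS
```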
💡 Start the YARN services:

```bash
start-yarn.sh
```
💡 Check the YARN nodes:

```bash
yarn node -list
```
💡 Launch a Spark shell:

```bash
spark-shell
```
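From inside `spark-shell`, a tiny job confirms that Spark is working. This Scala snippet is a minimal sanity check, not part of the project:

```scala
// Inside spark-shell: run a tiny job to confirm Spark works.
val nums = spark.range(1, 101)                          // dataset containing 1..100
println(nums.count())                                    // prints 100
println(nums.selectExpr("sum(id)").first().getLong(0))   // prints 5050
```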
💡 Start Hive (using the embedded Derby metastore):

```bash
hive
```
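Inside the Hive CLI you can create and query a table. A minimal sketch; the table name `demo` is illustrative:

```sql
-- Inside the Hive CLI: create a table, insert rows, query them.
CREATE TABLE IF NOT EXISTS demo (id INT, name STRING);
INSERT INTO demo VALUES (1, 'hadoop'), (2, 'spark');
SELECT name FROM demo WHERE id = 2;
```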
💡 Run Pig in MapReduce mode:

```bash
pig -x mapreduce
```
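From the Grunt shell, a classic word-count script looks like this. The input path is illustrative and must already exist in HDFS:

```pig
-- A tiny word count in Pig Latin (input path is illustrative).
lines   = LOAD '/demo/words.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```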
💡 Start HBase:

```bash
start-hbase.sh
```
💡 Open an HBase shell:

```bash
hbase shell
```
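A few basic commands in the HBase shell; the table and column family names are illustrative:

```
# Inside the HBase shell: create a table, write a cell, read it back.
create 'users', 'info'                      # table 'users' with column family 'info'
put 'users', 'row1', 'info:name', 'huy'     # write one cell
get 'users', 'row1'                         # read the row back
scan 'users'                                # scan the whole table
```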
🔎 Expected output:

- HDFS: if you see live DataNodes, your cluster is running successfully. 🎉
- YARN: if you see live NodeManagers, YARN is running successfully. 🎉
To verify that the system is working correctly after starting the HDFS and YARN services, you can run the test script:

```bash
./scripts/word_count.sh
```

This script runs a sample word count job to ensure that HDFS and YARN are functioning correctly.
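For intuition, the logic such a job computes is the classic word count. Here it is in plain Python; this sketch only illustrates the idea, it is not the project's script:

```python
from collections import Counter

def word_count(text: str) -> dict:
    """Count occurrences of each whitespace-separated word."""
    return dict(Counter(text.split()))

sample = "hello hadoop hello spark"
print(word_count(sample))  # {'hello': 2, 'hadoop': 1, 'spark': 1}
```

A MapReduce word count does the same thing, but splits the input across DataNodes in the map phase and aggregates the per-word counts in the reduce phase.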
Since the system uses Docker volumes for the NameNode, DataNodes, Hive metastore database, and HBase, please ensure that:

- The number of containers remains the same when restarting (e.g., if you started with 5 slaves, restart with 5 slaves).
- If the number of slaves changes, you may face volume inconsistencies.
✅ How to ensure the correct number of containers during a restart:

- Always restart with the same number of nodes:

  ```bash
  ./scripts/start-cluster.sh 6   # if you previously used 6 nodes
  ```

- Do not delete volumes when stopping the cluster; use:

  ```bash
  ./scripts/stop-cluster.sh
  ```

  Avoid `docker compose -f compose-dynamic.yaml down -v`, as it will remove all volume data.

✅ Check existing volumes:

```bash
docker volume ls
```
🎉 If the word count job runs successfully, your system is fully operational!
## 📝 Modify the Owner Name

If you need to change the owner name, run the `rename-owner.py` script and enter your new owner name when prompted:

```bash
python rename-owner.py
```

⏳ Note: If you want to check the current owner name, it is stored in `hamu-config.json`.

📌 There are some limitations: use a name that does not overlap with Hadoop or Docker terminology. For example, avoid names like `hdfs`, `yarn`, `container`, or `docker-compose`.
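As a sketch of the kind of check implied by that limitation, the validator below is hypothetical: the function name and reserved list mirror the examples in this README, not the actual logic of `rename-owner.py`:

```python
# Hypothetical validator: reject owner names that clash with Hadoop/Docker terms.
# The reserved list mirrors the README's examples; the real script may differ.
RESERVED_NAMES = {"hdfs", "yarn", "container", "docker-compose"}

def is_valid_owner_name(name: str) -> bool:
    """Return True if the name is non-empty and not a reserved term."""
    candidate = name.strip().lower()
    return bool(candidate) and candidate not in RESERVED_NAMES

print(is_valid_owner_name("quochuy"))  # True
print(is_valid_owner_name("hdfs"))     # False
```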
## 🌐 Interact with the Web UI

You can access the following web interfaces to monitor and manage your Hadoop cluster:

- YARN ResourceManager UI → http://localhost:9004
  Provides an overview of cluster resource usage, running applications, and job details.
- NameNode UI → http://localhost:9870
  Displays HDFS file system details, block distribution, and overall health status.
- Spark UI → http://localhost:4040
  Tracks Spark jobs, tasks, and execution performance.
- HBase UI → http://localhost:16010
  Shows HBase Master status, region servers, and table metrics.
## 📧 Contact

📧 Email: quochuy.working@gmail.com

💬 Feel free to contribute and improve this project! 🎉