πŸš€ HadoopSphere

A fully containerized Hadoop, Spark, Hive, Pig, HBase, and ZooKeeper environment for quick and efficient Big Data processing.

📜 Table of Contents

  • 📚 My Story
  • 👥 Authors
  • ✨ Features
  • 🔧 Tech Stack
  • 🖥️ OS Support
  • 📌 Prerequisites
  • 🚀 Installation Guide
  • 📌 Important Notes on Volumes & Containers
  • 🔄 Modify the Owner Name
  • 🌐 Interact with the Web UI
  • 📞 Contact

πŸ“š My Story (feel free to skip)

Setting up a Hadoop cluster manually is frustrating, especially when integrating Spark, Hive, HBase, and other components. My friend and I initially developed HaMu (Hadoop Multi Node) for simple Hadoop cluster deployment using Docker.

Building on that foundation, I extended the project into a full-fledged Big Data stack, adding Spark, Hive, Pig, HBase, and ZooKeeper. The goal was to create an all-in-one, containerized Big Data environment that is easy to spin up, experiment with, and build on, with no manual configuration nightmares.

πŸ’‘ I hope HadoopSphere helps you quickly set up a Big Data environment for learning and development! πŸš€


πŸ‘₯ Authors

  • @Quoc Huy (extended the stack with Spark, Hive, Pig, HBase, and ZooKeeper)

✨ Features

πŸ‘‰ Deploy a multi-node Hadoop cluster with an extended Big Data stack - including Spark, Hive, Pig, HBase, and Zookeeper - using just one command.
πŸ‘‰ Easily configure the number of slave nodes to match your testing or development needs.
πŸ‘‰ All core services (HDFS, YARN, Spark, Hive, Pig, HBase, Zookeeper) run smoothly inside Docker containers. πŸ‘‰ Access Web UIs for monitoring Hadoop and Spark, Hbase jobs, etc.
πŸ‘‰ Modify the cluster owner's name.


πŸ”§ Tech Stack

  • Hadoop Cluster (HDFS, YARN)
  • Apache Spark (Standalone Mode)
  • Apache Hive (With Derby Metastore)
  • Apache Pig
  • Apache HBase
  • Apache ZooKeeper
  • Docker (Containerized Setup)

πŸ–₯️ OS Support

Cross-Platform Compatibility: This project leverages Docker containers, enabling seamless execution across various operating systems, including:

  • πŸͺŸ Windows via WSL2 (Windows Subsystem for Linux 2) or Docker Desktop.
  • 🐧 Linux: Ubuntu, CentOS, Debian, and other distributions.

πŸ“Œ Prerequisites

  • 🐳 Docker
  • πŸ—ƒοΈ Basic Knowledge of Hadoop, Spark, Hive, Pig, Hbase, Zookeeper

πŸš€ Installation Guide

Step 1: Clone the Repository

  git clone https://github.com/huy-dataguy/HadoopSphere.git
  cd HadoopSphere

Step 2: Build Docker Images

Building Docker images is required only for the first time or after making changes in the HadoopSphere directory (such as modifying the owner name). Make sure Docker is running before proceeding.

⏳ Note: The first build may take a few minutes because no cached layers exist yet.

⚠️ If you're using Windows: you might see errors like "required file not found" when running shell scripts (.sh), because Windows uses a different line-ending format (CRLF). To fix this, convert the script files to Unix format using dos2unix:

  dos2unix ./scripts/build-image.sh
  dos2unix ./scripts/start-cluster.sh
  dos2unix ./scripts/resize-number-slaves.sh
  ./scripts/build-image.sh
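
If you prefer not to convert each script individually, a one-liner like the following should also work (assuming find and dos2unix are available in your shell):

  # Convert every shell script under ./scripts to Unix line endings
  find ./scripts -type f -name "*.sh" -exec dos2unix {} +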

Step 3: Start the Cluster

  ./scripts/start-cluster.sh

By default, this will start a cluster with 1 master and 2 slaves.

To start a cluster with 1 master and 5 slaves:

  ./scripts/start-cluster.sh 6 
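
To confirm that the containers actually came up, you can list them from another terminal on the host; the exact container names depend on the compose configuration, so treat this as a rough check:

  # Show running containers and their status
  docker ps --format "table {{.Names}}\t{{.Status}}"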

Step 4: Verify the Installation

After Step 3, you will be inside the master container's CLI, where you can interact with the cluster.

πŸ’‘ Start the HDFS services:

  start-dfs.sh

πŸ’‘ Check HDFS Nodes

  hdfs dfsadmin -report
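
Beyond the report, a quick read/write round trip confirms HDFS is usable. The paths and file names below are only examples, not part of the project's scripts:

  # Write a small local file into HDFS and read it back
  echo "hello hadoopsphere" > /tmp/hello.txt
  hdfs dfs -mkdir -p /tmp/hdfs-smoke
  hdfs dfs -put -f /tmp/hello.txt /tmp/hdfs-smoke/
  hdfs dfs -cat /tmp/hdfs-smoke/hello.txt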

πŸ’‘ Start the YARN services:

  start-yarn.sh

πŸ’‘ Check YARN Nodes

  yarn node -list

πŸ’‘ Run Spark Cluster

  spark-shell
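
To go beyond the interactive shell, you can submit the SparkPi example that ships with Spark. The jar path below follows the usual Spark layout and may differ inside this image; depending on how the default master is configured, you may also need to pass an explicit --master:

  # Submit the bundled SparkPi example (adjust the jar path if needed)
  spark-submit --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 10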

πŸ’‘ Run Hive Metastore

  hive
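
For a quick non-interactive check, you can also pass a statement directly to the Hive CLI; the table name here is just an example:

  # Create a throwaway table in the Derby metastore and list tables
  hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT, msg STRING); SHOW TABLES;"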

πŸ’‘ Run a Pig Script

  pig -x mapreduce
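
Instead of the interactive Grunt shell, you can run a small script end to end. This sketch reuses the sample file written to HDFS in the earlier HDFS check; adjust the input path if you used different data:

  # Write a tiny Pig script that counts the lines of a file already in HDFS
  {
    echo "lines = LOAD '/tmp/hdfs-smoke/hello.txt' AS (line:chararray);"
    echo "grpd = GROUP lines ALL;"
    echo "counts = FOREACH grpd GENERATE COUNT(lines);"
    echo "DUMP counts;"
  } > /tmp/smoke.pig
  pig -x mapreduce /tmp/smoke.pig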

πŸ’‘ Start Hbase

  start-hbase.sh

πŸ’‘ Run a Hbase shell

  hbase shell
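
You can also drive the shell non-interactively by piping commands into it, which is handy for a quick sanity check; the table name 'smoke' is just an example:

  # Create, write, read, and drop a throwaway table
  {
    echo "create 'smoke', 'cf'"
    echo "put 'smoke', 'row1', 'cf:msg', 'hello'"
    echo "scan 'smoke'"
    echo "disable 'smoke'"
    echo "drop 'smoke'"
  } | hbase shell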

πŸ“Œ Expected Output:

  • HDFS: If you see live DataNodes, your cluster is running successfully. 🚀

  • YARN: If you see live NodeManagers, YARN is running successfully. 🚀

Step 5: Test the System with Scripts

To verify that the system is working correctly after starting the HDFS and YARN services, you can run the test scripts.

πŸ”Ή Step 1: Run a Word Count Test

  ./scripts/word_count.sh

This script runs a sample Word Count job to ensure that HDFS and YARN are functioning correctly.
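
If you want to see what the script exercises, roughly the same check can be done by hand with the example jar that ships with Hadoop. The jar path below follows the usual Hadoop layout and may differ in this image, and the input/output paths are only examples:

  # Run the bundled MapReduce word count against a small sample file
  echo "hello world hello hadoopsphere" > /tmp/wc-sample.txt
  hdfs dfs -mkdir -p /tmp/wc-input
  hdfs dfs -put -f /tmp/wc-sample.txt /tmp/wc-input/
  hdfs dfs -rm -r -f /tmp/wc-output
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /tmp/wc-input /tmp/wc-output
  hdfs dfs -cat /tmp/wc-output/part-r-00000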


πŸ“Œ Important Notes on Volumes & Containers

Since the system uses Docker volumes for the NameNode, DataNodes, Hive metastore DB, and HBase, please ensure that:

  • The number of containers remains the same when restarting (e.g., if started with 5 slaves, restart with 5 slaves).
  • If the number of slaves changes, you may face volume inconsistencies.

βœ… How to Ensure the Correct Number of Containers During Restart:

  1. Always restart with the same number of containers:

    ./scripts/start-cluster.sh 6  # If you previously used 6 nodes
  2. Do not delete volumes when stopping the cluster; use:

      ./scripts/stop-cluster.sh

Avoid using docker compose -f compose-dynamic.yaml down -v, as it will remove all volume data.

βœ… Check Existing Volumes:

docker volume ls 
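
If the list is long, you can narrow it with a name filter; the actual volume names depend on the Docker Compose project name, so adjust the filter to whatever prefix you see in the full listing:

  # Example: show only volumes whose name contains "hadoop"
  docker volume ls --filter name=hadoop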

πŸš€ If the Word Count job runs successfully, your system is fully operational!


πŸ”„ Modify the Owner Name

If you need to change the owner name, run the rename-owner.py script and enter your new owner name when prompted.

⏳ Note: If you want to check the current owner name, it is stored in hamu-config.json.

πŸ“Œ There are some limitations; you should use a name that is different from words related to the 'Hadoop' or 'Docker' syntax. For example, avoid names like 'hdfs', 'yarn', 'container', or 'docker-compose'.

python rename-owner.py

🌐 Interact with the Web UI

You can access the following web interfaces to monitor and manage your Hadoop cluster:

  • YARN Resource Manager UI β†’ http://localhost:9004
    Provides an overview of cluster resource usage, running applications, and job details.

  • NameNode UI β†’ http://localhost:9870
    Displays HDFS file system details, block distribution, and overall health status.

  • Spark UI β†’ http://localhost:4040
    Track Spark jobs, tasks, and execution performance.

  • HBase UI → http://localhost:16010
    Access HBase Master status, region servers, and table metrics.
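
If a page does not load, a quick reachability check from the host can help distinguish a port-mapping problem from a service problem. This assumes the default port mappings listed above (note that the Spark UI at 4040 only exists while a Spark application is running):

  # Probe the web UIs from the host
  curl -sf http://localhost:9870  > /dev/null && echo "NameNode UI reachable"
  curl -sf http://localhost:9004  > /dev/null && echo "YARN ResourceManager UI reachable"
  curl -sf http://localhost:16010 > /dev/null && echo "HBase Master UI reachable"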


πŸ“ž Contact

πŸ“§ Email: quochuy.working@gmail.com

πŸ’¬ Feel free to contribute and improve this project! πŸš€
