πŸš€ HadoopSphere

A fully containerized Hadoop, Spark, Hive, Pig, HBase, and ZooKeeper environment for quick and efficient Big Data processing.

📜 Table of Contents

  • 📚 My Story
  • 👥 Authors
  • ✨ Features
  • 🔧 Tech Stack
  • 🖥️ OS Support
  • 📌 Prerequisites
  • 🚀 Installation Guide
  • 📌 Important Notes on Volumes & Containers
  • 🔄 Modify the Owner Name
  • 🌐 Interact with the Web UI
  • 📞 Contact

πŸ“š My Story (feel free to skip)

Setting up a Hadoop cluster manually is frustrating, especially when integrating Spark, Hive, HBase, and other components. My friend and I initially developed HaMu (Hadoop Multi Node) for simple Hadoop cluster deployment using Docker.

Building on that foundation, I extended the project into a full-fledged Big Data stack, adding Spark, Hive, Pig, HBase, and ZooKeeper. The goal was to create an all-in-one, containerized Big Data environment that is easy to spin up, experiment with, and build on, with no manual configuration nightmares.

πŸ’‘ I hope HadoopSphere helps you quickly set up a Big Data environment for learning and development! πŸš€


πŸ‘₯ Authors

  • @Quoc Huy (extended the stack with Spark, Hive, Pig, HBase, and ZooKeeper)

✨ Features

πŸ‘‰ Deploy a multi-node Hadoop cluster with an extended Big Data stack - including Spark, Hive, Pig, HBase, and Zookeeper - using just one command.
πŸ‘‰ Easily configure the number of slave nodes to match your testing or development needs.
πŸ‘‰ All core services (HDFS, YARN, Spark, Hive, Pig, HBase, Zookeeper) run smoothly inside Docker containers. πŸ‘‰ Access Web UIs for monitoring Hadoop and Spark, Hbase jobs, etc.
πŸ‘‰ Modify the cluster owner's name.


πŸ”§ Tech Stack

  • Hadoop Cluster (HDFS, YARN)
  • Apache Spark (Standalone Mode)
  • Apache Hive (With Derby Metastore)
  • Apache Pig
  • Apache HBase
  • Apache ZooKeeper
  • Docker (Containerized Setup)

πŸ–₯️ OS Support

Cross-Platform Compatibility: This project leverages Docker containers, enabling seamless execution across various operating systems, including:

  • πŸͺŸ Windows via WSL2 (Windows Subsystem for Linux 2) or Docker Desktop.
  • 🐧 Linux: Ubuntu, CentOS, Debian, and other distributions.

πŸ“Œ Prerequisites

  • 🐳 Docker
  • πŸ—ƒοΈ Basic Knowledge of Hadoop, Spark, Hive, Pig, Hbase, Zookeeper

πŸš€ Installation Guide

Step 1: Clone the Repository

  git clone https://github.com/huy-dataguy/HadoopSphere.git
  cd HadoopSphere

Step 2: Build Docker Images

Building Docker images is required only for the first time or after making changes in the HadoopSphere directory (such as modifying the owner name). Make sure Docker is running before proceeding.

⏳ Note: The first build may take a few minutes because no cached layers exist yet.

⚠️ If you're using Windows: you might see errors like "required file not found" when running shell scripts (.sh), because Windows uses a different line-ending format (CRLF). To fix this, convert the script files to Unix format using dos2unix:

  dos2unix ./scripts/build-image.sh
  dos2unix ./scripts/start-cluster.sh
  dos2unix ./scripts/resize-number-slaves.sh
  ./scripts/build-image.sh
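
If you prefer not to convert each script individually, a one-liner like the following should also work (assuming find and dos2unix are available in your shell):

  # Convert every shell script under ./scripts to Unix line endings
  find ./scripts -type f -name "*.sh" -exec dos2unix {} +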

Step 3: Start the Cluster

  ./scripts/start-cluster.sh

By default, this will start a cluster with 1 master and 2 slaves.

To start a cluster with 1 master and 5 slaves:

  ./scripts/start-cluster.sh 6 
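
To confirm that the containers actually came up, you can list them from another terminal on the host; the exact container names depend on the compose configuration, so treat this as a rough check:

  # Show running containers and their status
  docker ps --format "table {{.Names}}\t{{.Status}}"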

Step 4: Verify the Installation

After Step 3, you will be inside the master container's CLI, where you can interact with the cluster.

πŸ’‘ Start the HDFS services:

  start-dfs.sh

πŸ’‘ Check HDFS Nodes

  hdfs dfsadmin -report
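
Beyond the report, a quick read/write round trip confirms HDFS is usable. The paths and file names below are only examples, not part of the project's scripts:

  # Write a small local file into HDFS and read it back
  echo "hello hadoopsphere" > /tmp/hello.txt
  hdfs dfs -mkdir -p /tmp/hdfs-smoke
  hdfs dfs -put -f /tmp/hello.txt /tmp/hdfs-smoke/
  hdfs dfs -cat /tmp/hdfs-smoke/hello.txt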

πŸ’‘ Start the YARN services:

  start-yarn.sh

πŸ’‘ Check YARN Nodes

  yarn node -list

πŸ’‘ Run Spark Cluster

  spark-shell
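
To go beyond the interactive shell, you can submit the SparkPi example that ships with Spark. The jar path below follows the usual Spark layout and may differ inside this image; depending on how the default master is configured, you may also need to pass an explicit --master:

  # Submit the bundled SparkPi example (adjust the jar path if needed)
  spark-submit --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 10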

πŸ’‘ Run Hive Metastore

  hive
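
For a quick non-interactive check, you can also pass a statement directly to the Hive CLI; the table name here is just an example:

  # Create a throwaway table in the Derby metastore and list tables
  hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT, msg STRING); SHOW TABLES;"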

πŸ’‘ Run a Pig Script

  pig -x mapreduce
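
Instead of the interactive Grunt shell, you can run a small script end to end. This sketch reuses the sample file written to HDFS in the earlier HDFS check; adjust the input path if you used different data:

  # Write a tiny Pig script that counts the lines of a file already in HDFS
  {
    echo "lines = LOAD '/tmp/hdfs-smoke/hello.txt' AS (line:chararray);"
    echo "grpd = GROUP lines ALL;"
    echo "counts = FOREACH grpd GENERATE COUNT(lines);"
    echo "DUMP counts;"
  } > /tmp/smoke.pig
  pig -x mapreduce /tmp/smoke.pig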

πŸ’‘ Start Hbase

  start-hbase.sh

πŸ’‘ Run a Hbase shell

  hbase shell
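
You can also drive the shell non-interactively by piping commands into it, which is handy for a quick sanity check; the table name 'smoke' is just an example:

  # Create, write, read, and drop a throwaway table
  {
    echo "create 'smoke', 'cf'"
    echo "put 'smoke', 'row1', 'cf:msg', 'hello'"
    echo "scan 'smoke'"
    echo "disable 'smoke'"
    echo "drop 'smoke'"
  } | hbase shell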

πŸ“Œ Expected Output:

  • HDFS: If you see live DataNodes, your cluster is running successfully. 🚀

  • YARN: If you see live NodeManagers, YARN is running successfully. 🚀

Step 5: Test the System with Scripts

To verify that the system is working correctly after starting the HDFS and YARN services, you can run the test scripts.

πŸ”Ή Step 1: Run a Word Count Test

  ./scripts/word_count.sh

This script runs a sample Word Count job to ensure that HDFS and YARN are functioning correctly.
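
If you want to see what the script exercises, roughly the same check can be done by hand with the example jar that ships with Hadoop. The jar path below follows the usual Hadoop layout and may differ in this image, and the input/output paths are only examples:

  # Run the bundled MapReduce word count against a small sample file
  echo "hello world hello hadoopsphere" > /tmp/wc-sample.txt
  hdfs dfs -mkdir -p /tmp/wc-input
  hdfs dfs -put -f /tmp/wc-sample.txt /tmp/wc-input/
  hdfs dfs -rm -r -f /tmp/wc-output
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /tmp/wc-input /tmp/wc-output
  hdfs dfs -cat /tmp/wc-output/part-r-00000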


πŸ“Œ Important Notes on Volumes & Containers

Since the system uses Docker volumes for the NameNode, DataNodes, Hive metastore DB, and HBase, please ensure that:

  • The number of containers remains the same when restarting (e.g., if started with 5 slaves, restart with 5 slaves).
  • If the number of slaves changes, you may face volume inconsistencies.

βœ… How to Ensure the Correct Number of Containers During Restart:

  1. Always restart with the same number of containers:

    ./scripts/start-cluster.sh 6  # If you previously used 6 nodes
  2. Do not delete volumes when stopping the cluster; use:

      ./scripts/stop-cluster.sh

Avoid using docker compose -f compose-dynamic.yaml down -v, as it will remove all volume data.

βœ… Check Existing Volumes:

docker volume ls 
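
If the list is long, you can narrow it with a name filter; the actual volume names depend on the Docker Compose project name, so adjust the filter to whatever prefix you see in the full listing:

  # Example: show only volumes whose name contains "hadoop"
  docker volume ls --filter name=hadoop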

πŸš€ If the Word Count job runs successfully, your system is fully operational!


πŸ”„ Modify the Owner Name

If you need to change the owner name, run the rename-owner.py script and enter your new owner name when prompted.

⏳ Note: If you want to check the current owner name, it is stored in hamu-config.json.

πŸ“Œ There are some limitations; you should use a name that is different from words related to the 'Hadoop' or 'Docker' syntax. For example, avoid names like 'hdfs', 'yarn', 'container', or 'docker-compose'.

python rename-owner.py

🌐 Interact with the Web UI

You can access the following web interfaces to monitor and manage your Hadoop cluster:

  • YARN Resource Manager UI β†’ http://localhost:9004
    Provides an overview of cluster resource usage, running applications, and job details.

  • NameNode UI β†’ http://localhost:9870
    Displays HDFS file system details, block distribution, and overall health status.

  • Spark UI β†’ http://localhost:4040
    Track Spark jobs, tasks, and execution performance.

  • HBase UI → http://localhost:16010
    Access HBase Master status, region servers, and table metrics.
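
If a page does not load, a quick reachability check from the host can help distinguish a port-mapping problem from a service problem. This assumes the default port mappings listed above (note that the Spark UI at 4040 only exists while a Spark application is running):

  # Probe the web UIs from the host
  curl -sf http://localhost:9870  > /dev/null && echo "NameNode UI reachable"
  curl -sf http://localhost:9004  > /dev/null && echo "YARN ResourceManager UI reachable"
  curl -sf http://localhost:16010 > /dev/null && echo "HBase Master UI reachable"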


πŸ“ž Contact

πŸ“§ Email: quochuy.working@gmail.com

πŸ’¬ Feel free to contribute and improve this project! πŸš€
