# Spark-on-YARN

This repository contains the configuration and scripts necessary to run Apache Spark on a Hadoop YARN cluster in client mode. The setup allows you to leverage the scalability of YARN for distributed data processing with Spark.
## 🏢 Spark on YARN Architecture (Client Mode)

*(architecture diagram)*

## 🚀 Installation Guide

### Step 1: Clone the Repository

```bash
git clone https://github.com/huy-dataguy/Spark-on-YARN.git
cd Spark-on-YARN
```

### Step 2: Build the Base Image

⏳ Note: The first build may take a few minutes as no cached layers exist.

```bash
docker build -t base -f docker/base.dockerfile .
```
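To confirm the image was built, you can list it by the tag passed to `-t` above:

```bash
docker images base
```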

### Step 3: Build and Start the Cluster

Build the images (needed the first time, or after changing a Dockerfile):

```bash
docker compose -f docker/compose.yaml build
```

Start the containers:

```bash
docker compose -f docker/compose.yaml up -d
```
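A quick way to check that the containers came up (the service names come from `docker/compose.yaml`):

```bash
docker compose -f docker/compose.yaml ps
```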

### Step 4: Verify the Installation

Open a shell inside the master container, for example as shown below.
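A sketch, assuming the master service is named `master` in `docker/compose.yaml` (check the actual name with `docker compose -f docker/compose.yaml ps`):

```bash
docker exec -it master bash
```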

💡 Start the HDFS and YARN services:

```bash
start-dfs.sh
start-yarn.sh
```
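To check that the daemons actually started, a quick sketch (which processes appear on which node depends on this repo's cluster layout):

```bash
# On the master: expect processes such as NameNode and ResourceManager
jps

# Cluster-wide checks
hdfs dfsadmin -report   # live DataNodes
yarn node -list         # registered NodeManagers
```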

### Step 5: Run spark-submit on YARN in Client Mode

Create an HDFS directory to store the Spark logs:

```bash
hdfs dfs -mkdir /spark-logs
```
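You can confirm the directory was created (it presumably matches the Spark event-log location configured elsewhere in this repo):

```bash
hdfs dfs -ls /
```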

Run Spark on YARN:

```bash
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10
```
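The command above relies on the cluster's Spark configuration to pick YARN as the master. If your setup does not define that default, the standard `spark-submit` flags can be passed explicitly; a minimal sketch:

```bash
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10
```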

If the job succeeds, you will see the answer Pi = 3.14159 in the output.


## 🌐 Interact with the Web UI

You can access the following web interfaces to monitor and manage your Hadoop cluster:

- **YARN Resource Manager UI**: http://localhost:9004
  Provides an overview of cluster resource usage, running applications, and job details.

- **NameNode UI**: http://localhost:9870
  Displays HDFS file system details, block distribution, and overall health status.

- **Spark Web UI**: http://localhost:4040
  Provides an interface to monitor running Spark jobs, stages, and tasks. Note: because the application runs in YARN client mode, the Spark UI will automatically redirect to the master node's web UI.
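If you prefer the command line, the same information can be checked from a shell; a sketch assuming the port mappings listed above (the `yarn` command must run where the Hadoop client is configured, e.g. inside the master container):

```bash
# Applications known to YARN (the SparkPi run should appear here)
yarn application -list -appStates ALL

# Reachability checks for the UIs from the host
curl -sI http://localhost:9004 | head -n 1
curl -sI http://localhost:9870 | head -n 1
```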


## 📞 Contact

📧 Email: quochuy.working@gmail.com

💬 Feel free to contribute and improve this project! 🚀
