Hadoop and Spark Training for Developers

Big Data


Big Data technologies are evolving rapidly, and a huge volume of data is generated every day. Because of this growth, more and more organizations are adopting Big Data technologies such as Hadoop, Spark, and Kafka to store and analyze that data. The main objective of this course is to help you understand the complex architectures of these technologies and their components, and to guide you in the right direction.

Course Structure

The program focuses on the ingestion, storage, processing, and analysis of Big Data using the Hadoop, Spark, and Kafka ecosystems, i.e. HDFS, MapReduce, YARN, Spark Core, Spark SQL, HBase, Kafka Core, Kafka Connect, and Kafka Streams.

  • Holistic Overview of Hadoop, Spark and Kafka Ecosystem
  • Distributed Ingestion, Storage and Processing Concepts
  • Which technology/tool to choose when?
  • Architecture and Internals of key projects
  • How to perform data processing and ingestion using Spark and Kafka?

The intended audience for this course:

  • Big Data Engineers
  • Big Data Developers
  • Big Data Architects
  • Integration Engineers

Introduction to Hadoop and Spark Ecosystem
  • Big Data Overview
  • Key Roles in Big Data Project
  • Key Business Use cases
  • Hadoop and Spark Logical Architecture
  • Typical Big Data Project Pipeline
Basic Concepts of HDFS
  • HDFS Overview
  • Physical Architectures of HDFS
  • The Hadoop Distributed File System Hands-on
MapReduce v1/YARN Frameworks, Architectures and MapReduce API
  • Java Basics for understanding, developing, building and deploying MapReduce Programs
  • Logical Architecture of MapReduce
  • Physical Architecture of MRv1 and YARN
  • Compare MRv1 vs. MRv2 on YARN
  • MapReduce API
  • Hands-on Exercise
Working with HBase
  • HBase Overview
  • Physical Architectures of HBase
  • HBase Table Fundamentals
  • Thinking About Table Design
  • HBase Shell
  • HBase Physical Architecture
  • HBase Schema Design
  • HBase API
  • Hive on HBase
  • Hands-on Exercises
Introduction to Spark
  • Spark Overview
  • Detailed discussion on “Why Spark”
  • Quick Recap of MapReduce
  • Spark vs MapReduce
  • Why Python for Spark?
Spark Core Framework and API
  • High level Spark Architecture
  • Role of Executor, Driver, SparkContext etc.
  • Resilient Distributed Datasets
  • Basic operations in Spark Core API i.e. Actions and Transformations
  • Using the Spark REPL for performing interactive data analysis
  • Hands-on Exercises
Delving Deeper Into Spark API
  • Pair RDDs
  • Implementing MapReduce Algorithms using Spark
  • Ways to create Pair RDDs
  • JSON Processing
  • Code Example on JSON Processing
  • XML Processing
  • Joins
  • Playing with Regular Expressions
  • Log File Processing using Regular Expressions
  • Hands-on Exercises
Executing a Spark Application
  • Writing Standalone Spark Application
  • Various commands to execute and configure Spark Applications in various modes
  • Discussion on Application, Job, Stage, Executor, Tasks
  • Interpreting RDD Metadata/Lineage/DAG
  • Controlling degree of Parallelism in Spark Job
  • Physical execution of a Spark application
  • Discussion: How is Spark better than MapReduce?
  • Hands-on Exercises
Advanced Features Of Spark
  • Persistence
  • Location
  • Data Format of Persistence
  • Replication
  • Partitioned By
  • Coalesce
  • Accumulators
  • Broadcasting for optimizing performance of Spark jobs
  • Hands-on Exercises
Spark Streaming
  • Analyzing streaming data using Spark
  • Stateless Streaming
  • Stateful Streaming
  • Quick introduction to Kafka Architecture
  • Role of Zookeeper, Brokers etc.
  • Hands-on Exercises
Spark SQL
  • Introduction
  • Dataframe API
  • Performing ad-hoc query analysis using Spark SQL
  • Working with Hive Partitioning
  • Hands-on Exercises
Iterative Processing Using Spark
  • Introduction to Iterative Processing
  • Checkpointing
  • Checkpointing vs Persist
  • Example of Iterative Processing
  • K Means Clustering
  • Hands-on Exercises
Introduction to Kafka
  • Kafka Overview
  • Salient Features of Kafka
  • Kafka Use cases
  • Comparing Kafka with other Key tools
Kafka Connect and Kafka Streams
  • Integrate Kafka with Spark
  • Kafka Connect
  • Kafka Streams
  • Spark Integration Approaches
  • Integrating Kafka with Spark Streaming
  • Hands-on Exercise
Structured Streaming
  • Structured Streaming Overview
  • How is it better than Spark Streaming?
  • Structured Streaming API
  • Hands-on Exercises

Participants should preferably have prior software development experience, along with basic knowledge of SQL and Unix commands. Knowledge of Python or Scala would be a plus.

Course Information


Duration

4 Days / 5 Days

Mode of Delivery

Instructor-led / Virtual


