Spark Optimization

Learn to work with Apache Spark the right way.


4 Days


Advanced Level

Design and tailor this course as per your team's needs.


This is an advanced-level course on tuning Spark SQL (batch) applications. Participants will learn:

  • Best practices and techniques for working with YARN
  • How to plan resources for a “Spark on YARN” application
  • How Spark executes an application physically on a cluster
  • Spark SQL Execution Plans
  • Spark SQL Best Practices
  • How to optimize Spark SQL code
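As a taste of the resource-planning topic, the back-of-the-envelope arithmetic can be sketched as below. The node sizes, the 5-cores-per-executor rule of thumb, and the ~10% overhead factor are illustrative assumptions, not prescriptions from the course:

```python
# Back-of-the-envelope resource planning for a "Spark on YARN" application.
# All node sizes and tuning constants below are illustrative assumptions.

def plan_executors(node_mem_gb, node_cores, cores_per_executor=5,
                   overhead_fraction=0.10, reserved_mem_gb=1, reserved_cores=1):
    """Estimate executors per node and per-executor heap size.

    A common rule of thumb: ~5 cores per executor for good HDFS
    throughput, and ~10% of executor memory set aside for YARN's
    memory overhead (spark.executor.memoryOverhead).
    """
    usable_cores = node_cores - reserved_cores          # leave cores for OS/daemons
    usable_mem = node_mem_gb - reserved_mem_gb          # leave memory for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor = usable_mem / executors_per_node  # includes overhead
    heap_per_executor = mem_per_executor * (1 - overhead_fraction)
    return executors_per_node, round(heap_per_executor, 1)

# e.g. a 64 GB / 16-core worker node:
print(plan_executors(64, 16))  # -> (3, 18.9)
```

The course covers how these numbers map onto `--executor-memory`, `--executor-cores`, and `--num-executors` when submitting to YARN.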

Hands-on PySpark exercises will be performed in Jupyter notebooks integrated with Spark 2.4.x. The setup will be installed in pseudo-distributed mode on the Cloudera platform.
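The Jupyter integration can be set up along these lines; the `SPARK_HOME` path and exact Spark version are assumptions and will vary with your lab environment:

```shell
# Launch PySpark with Jupyter Notebook as the driver front end.
# SPARK_HOME and the master URL are assumptions for a pseudo-distributed lab.
export SPARK_HOME=/opt/spark-2.4.8
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
"$SPARK_HOME/bin/pyspark" --master yarn --deploy-mode client
```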

Who should attend

  • Data Engineers
  • Software Developers

Course Outline

  • Logical Architecture of the Hadoop and Spark Ecosystem
  • Understanding the Scope of Spark Optimization
  • How Spark Optimization Relates to the Distributed Storage Layer
  • Quick Recap of MapReduce Concepts
  • Logical and Physical Architectures of MapReduce
  • Limitations of MRv1 Architecture
  • About YARN
  • Why YARN
  • Architecture of YARN
  • YARN UI and Commands
  • Internals of YARN
  • YARN Client vs YARN Cluster modes
  • Experience execution of Spark application on YARN
  • Troubleshooting and Debugging Spark applications on YARN
  • Configurations for optimizing Application Performance
  • Setting up and working with YARN Queues
  • How does Spark determine the number of Partitions?
  • Things to Keep in Mind While Choosing the Number of Partitions
  • Small Partitions Problem
  • Diagnosing & Handling Post-Filtering Issues (Data Skew)
  • Repartition vs Coalesce
  • Partitioning Strategies
  • Hash Partitioner
  • Use of Range Partitioner
  • Writing and plugging custom partitioner
  • Data Partitioning
  • Query Optimizer: Catalyst Optimizer
  • Logical Plan
  • Physical Plan
  • Key Operations in Physical plan
  • Partitioning in Spark SQL
  • Customizing Physical Plan 
  • Why are Data Formats important for optimization? (if time permits)
  • Key Data Formats (if time permits)
  • Comparisons – which one to choose when? (if time permits)
  • How does Parquet store data?
  • Working with Parquet 
  • Parquet Internal Structure
  • Parquet Optimizations 
  • Parquet Key Configurations
  • When to Cache?
  • How Caching Helps
  • Caching Strategies
  • How does the Spark plan change when Caching is on?
  • Visualizing Cached Dataset in Spark UI
  • Working with On Heap and Off Heap Caching
  • Checkpointing
  • How is Caching different from Checkpointing?
  • Types of Joins
  • Quick Recap of MapReduce MapSide and Reduce Side Joins
  • Broadcasting
  • Optimizing Sort Merge Join
  • Bucketing
  • Key Configurations for better performance
  • Best Practices for writing Spark SQL code
  • Common Production Issues
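To give a flavor of the partitioning and skew topics in the outline, here is a minimal pure-Python sketch of hash partitioning and the skew a hot key can produce. The records, key names, and partition count are illustrative, not course material:

```python
# Minimal sketch of hash partitioning, the default strategy Spark uses
# to route records to shuffle partitions. Data below is illustrative.

from collections import Counter

def hash_partition(key, num_partitions):
    """Assign a record to a partition by hashing its key (mod N),
    mirroring what a hash partitioner does."""
    return hash(key) % num_partitions

# A skewed dataset: one hot key ("IN") dominates the input.
records = [("IN", 1)] * 90 + [("US", 1)] * 5 + [("UK", 1)] * 5

sizes = Counter(hash_partition(k, 4) for k, _ in records)

# Every "IN" record hashes to the same partition, so one shuffle task
# ends up with most of the work -- the skew that salting, repartitioning,
# or a custom partitioner is meant to address.
max_size = max(sizes.values())
print(max_size >= 90)  # -> True: the hot key dominates a single partition
```

The course goes further: diagnosing this in the Spark UI, and fixing it with repartitioning strategies, range partitioners, and custom partitioners.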

Prerequisites

The participants should have at least a couple of months of experience developing Spark SQL applications. Knowledge of Hive is a plus.

