Spark Optimization

Learn to work with Apache Spark the right way…

Duration

4 Days

Level

Advanced Level

This is an advanced course on tuning Spark SQL (batch) applications. Participants will learn:

  • Best practices and techniques to work with YARN
  • How to plan resources for a “Spark on YARN” application
  • How Spark physically executes on a cluster
  • Spark SQL Execution Plan
  • Spark SQL Best Practices
  • How to optimize Spark SQL code

PySpark hands-on exercises will be performed in Jupyter notebooks integrated with Spark 2.4.x. The setup will be installed in pseudo-distributed mode on the Cloudera platform.
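As a taste of the resource planning topic listed above, here is a minimal sketch of a commonly used rule of thumb for sizing executors on YARN. It is not course material. The function name `plan_executors`, the example node sizes, the 5-cores-per-executor default, and the per-node reservation of 1 core and 1 GB are all illustrative assumptions. The only Spark-specific fact used is that Spark 2.4 defaults the executor memory overhead to the larger of 384 MiB and 10% of the executor memory.

```python
def plan_executors(nodes, cores_per_node, mem_gb_per_node, cores_per_executor=5):
    """Rough executor sizing for a Spark-on-YARN job (rule-of-thumb sketch)."""
    # Assumption: leave 1 core and 1 GB per node for the OS and Hadoop daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node is too small for the requested cores per executor")

    # Assumption: give up one executor slot cluster-wide for the YARN ApplicationMaster.
    num_executors = executors_per_node * nodes - 1

    # Each YARN container must hold the executor heap plus the memory overhead.
    # Dividing by 1.10 leaves room for the default 10% overhead; for heaps
    # below roughly 4 GB the 384 MiB minimum applies instead and this
    # estimate would need adjusting.
    container_mem_gb = usable_mem_gb / executors_per_node
    executor_memory_gb = int(container_mem_gb / 1.10)

    return {
        "num_executors": num_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": executor_memory_gb,
    }


# Hypothetical cluster: 10 worker nodes, each with 16 cores and 64 GB of RAM.
plan = plan_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
print(
    "--num-executors {num_executors} "
    "--executor-cores {executor_cores} "
    "--executor-memory {executor_memory_gb}g".format(**plan)
)
```

The printed values map onto the standard `spark-submit` options of the same names. Treat them as a starting point to validate against the YARN UI, not as a final configuration.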

Audience

  • Data Engineers
  • Software Developers

Course Outline

  • Logical Architecture of Hadoop and Spark Ecosystem
  • Understand the Scope of Spark Optimization
  • How are Spark Optimization and the Distributed Storage Layer related?
  • Quick Recap of MapReduce Concepts
  • Logical and Physical Architectures of MapReduce
  • Limitations of MRv1 Architecture
  • About YARN
  • Why YARN
  • Architecture of YARN
  • YARN UI and Commands
  • Internals of YARN
  • YARN Client vs YARN Cluster modes
  • Experience execution of Spark application on YARN
  • Troubleshooting and Debugging Spark applications on YARN
  • Configurations for optimizing Application Performance
  • Setting up and working with YARN Queues
  • How does Spark determine the number of Partitions?
  • Things to keep in mind when determining the number of Partitions
  • Small Partitions Problem
  • Diagnosing & Handling Post-Filtering Issues (Skewness)
  • Repartition vs Coalesce
  • Partitioning Strategies
  • Hash Partitioner
  • Use of Range Partitioner
  • Writing and plugging in a custom partitioner
  • Data Partitioning
  • Query Optimizer: Catalyst Optimizer
  • Logical Plan
  • Physical Plan
  • Key Operations in the Physical Plan
  • Partitioning in Spark SQL
  • Customizing Physical Plan 
  • Why are Data Formats important for optimization? (if time permits)
  • Key Data Formats (if time permits)
  • Comparisons – which one to choose when? (if time permits)
  • How does Parquet store data?
  • Working with Parquet 
  • Parquet Internal Structure
  • Parquet Optimizations 
  • Parquet Key Configurations
  • When to Cache?
  • How does Caching help?
  • Caching Strategies
  • How does the Spark plan change when Caching is on?
  • Visualizing Cached Dataset in Spark UI
  • Working with On Heap and Off Heap Caching
  • Checkpointing
  • How is Caching different from Checkpointing?
  • Types of Joins
  • Quick Recap of MapReduce Map-Side and Reduce-Side Joins
  • Broadcasting
  • Optimizing Sort Merge Join
  • Bucketing
  • Key Configurations for better performance
  • Best Practices for writing Spark SQL code
  • Common Production Issues

Prerequisites

Participants should have at least a couple of months of experience developing Spark SQL applications. Knowledge of Hive is a plus.