Duration
4 Days
Level
Advanced Level
This is an advanced-level course on tuning Spark SQL (batch) applications. Participants will learn:
- Best practices and techniques to work with YARN
- How to perform resource planning for “Spark on YARN” applications
- How Spark executes physically on a cluster
- Spark SQL Execution Plan
- Spark SQL Best Practices
- How to optimize Spark SQL code
PySpark hands-on exercises will be performed in Jupyter notebooks integrated with Spark 2.4.x, installed in pseudo-distributed mode on the Cloudera platform.
- Data Engineers
- Software Developers
- Logical Architecture of Hadoop and Spark Ecosystem
- Understand the Scope of Spark Optimization
- How Spark Optimization Relates to the Distributed Storage Layer
- Quick Recap of MapReduce Concepts
- Logical and Physical Architectures of MapReduce
- Limitations of MRv1 Architecture
- About YARN
- Why YARN
- Architecture of YARN
- YARN UI and Commands
- Internals of YARN
- YARN Client vs YARN Cluster modes
- Hands-on: Executing a Spark Application on YARN
- Troubleshooting and Debugging Spark applications on YARN
- Configurations for optimizing Application Performance
- Setting up and working with YARN Queues
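The resource-planning topic above usually comes down to simple arithmetic over node capacity. The sketch below is a hypothetical helper illustrating the common rules of thumb (about 5 cores per executor, roughly 10% of executor memory reserved as YARN overhead, one core and 1 GB left for OS daemons, one executor slot left for the Application Master); these are widely used guidelines, not fixed Spark defaults.

```python
# Hypothetical executor-sizing helper for Spark on YARN.
# Constants below are rule-of-thumb guidelines, not Spark defaults.

def plan_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5):
    usable_cores = cores_per_node - 1        # reserve 1 core for OS/daemons
    usable_mem = mem_per_node_gb - 1         # reserve 1 GB for OS/daemons
    execs_per_node = usable_cores // cores_per_executor
    total_executors = nodes * execs_per_node - 1   # 1 slot for the AM/driver
    mem_per_exec = usable_mem / execs_per_node
    heap = mem_per_exec / 1.10               # ~10% goes to YARN memory overhead
    return {
        "num_executors": total_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": int(heap),
    }

plan = plan_executors(nodes=10, cores_per_node=16, mem_per_node_gb=64)
# e.g. 29 executors, 5 cores each, 19 GB heap each
```

The resulting numbers would map onto `--num-executors`, `--executor-cores`, and `--executor-memory` in `spark-submit`.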
- How Spark Determines the Number of Partitions
- Things to Keep in Mind while Determining Partition Count
- Small Partitions Problem
- Diagnosing & Handling Post Filtering Issues (Skewness)
- Repartition vs Coalesce
- Partitioning Strategies
- Hash Partitioner
- Use of Range Partitioner
- Writing and plugging custom partitioner
- Data Partitioning
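The hash-partitioning idea above can be sketched in a few lines of plain Python: a record's partition is its key's hash modulo the partition count, which guarantees all records with the same key land in the same partition (and also explains why a few hot keys cause skew). This is an illustration only, not the Spark API; Spark's `HashPartitioner` applies the same scheme using the key's Java `hashCode`.

```python
# Pure-Python sketch of hash partitioning: partition = hash(key) % n.
# Illustrative only -- Spark's HashPartitioner uses Java hashCode.

def hash_partition(key, num_partitions):
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return partitions

parts = partition_records([("a", 1), ("b", 2), ("a", 3)], num_partitions=4)
# Both ("a", ...) records land in the same partition -- the co-location
# property that shuffles, joins, and aggregations rely on.
```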
- Query Optimizer: Catalyst Optimizer
- Logical Plan
- Physical Plan
- Key Operations in Physical plan
- Partitioning in Spark SQL
- Customizing Physical Plan
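Catalyst optimizes a query by repeatedly applying rewrite rules to a plan tree. The toy example below shows the flavor of one such rule, predicate pushdown, on an invented tuple-based plan representation; real Catalyst operates on `TreeNode` expression trees in Scala, so everything here is a simplified sketch.

```python
# Toy rule-based optimizer in the spirit of Catalyst: a plan is a tree,
# and a rule rewrites Filter-over-Project into Project-over-Filter so
# the filter runs closer to the scan (valid here because the filter's
# column survives the projection). Plan encoding is invented for this
# sketch: (operator, child_or_None, payload).

scan    = ("Scan", None, "events")
project = ("Project", scan, ["user_id", "country"])
plan    = ("Filter", project, "country = 'DE'")

def push_down_filter(node):
    op, child, payload = node
    if op == "Filter" and child and child[0] == "Project":
        proj_op, proj_child, cols = child
        return (proj_op, ("Filter", push_down_filter(proj_child), payload), cols)
    if child:
        return (op, push_down_filter(child), payload)
    return node

optimized = push_down_filter(plan)
# optimized: Project -> Filter -> Scan (filter pushed below the projection)
```

In a real session, `df.explain(True)` shows the analyzed, optimized, and physical plans where such rewrites can be observed.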
- Why are Data Formats important for optimization? (if time permits)
- Key Data Formats (if time permits)
- Comparisons – which one to choose when? (if time permits)
- How Parquet Stores Data
- Working with Parquet
- Parquet Internal Structure
- Parquet Optimizations
- Parquet Key Configurations
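As a taste of the key configurations covered, a few commonly tuned Parquet-related Spark settings are shown below (Spark 2.4-era defaults; verify against your version before relying on them):

```
spark.sql.parquet.filterPushdown      true       # push predicates down to row-group/page filtering
spark.sql.parquet.compression.codec   snappy     # per-column compression codec
spark.sql.parquet.mergeSchema         false      # merging schemas across files is expensive
spark.sql.files.maxPartitionBytes     134217728  # ~128 MB max bytes per read partition
```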
- When to Cache?
- How Caching Helps
- Caching Strategies
- How the Spark Plan Changes when Caching Is Enabled
- Visualizing Cached Dataset in Spark UI
- Working with On Heap and Off Heap Caching
- Checkpointing
- How Caching Differs from Checkpointing
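The core intuition behind "when to cache" is reuse: a dataset that feeds multiple actions should be materialized once instead of being recomputed per action. The pure-Python sketch below mimics that effect with a recompute counter; it is an analogy for `df.cache()`, not Spark code (checkpointing differs in that it also truncates lineage by writing to reliable storage).

```python
# Pure-Python analogy for caching: an expensive, reused result should
# be computed once, not once per action. No Spark API involved.

compute_count = {"n": 0}

def expensive_transform(data):
    compute_count["n"] += 1              # count how often we recompute
    return [x * 2 for x in data]

data = list(range(5))

# Without caching: every "action" triggers a full recomputation
_ = expensive_transform(data)
_ = expensive_transform(data)
uncached_runs = compute_count["n"]       # recomputed twice

# With caching: compute once, then reuse the materialized result
compute_count["n"] = 0
cached = expensive_transform(data)       # analogous to .cache() + first action
_ = cached
_ = cached
cached_runs = compute_count["n"]         # computed once
```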
- Types of Joins
- Quick Recap of MapReduce Map-Side and Reduce-Side Joins
- Broadcasting
- Optimizing Sort Merge Join
- Bucketing
- Key Configurations for better performance
- Best Practices for writing Spark SQL code
- Common Production Issues
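The broadcasting topic above can be illustrated without a cluster: a broadcast (map-side) hash join builds a hash map from the small table and ships it to every partition of the large table, so the large side is never shuffled. The sketch below is plain Python with invented names; in Spark SQL this corresponds to `broadcast(small_df)` and the `BroadcastHashJoin` physical operator.

```python
# Pure-Python sketch of a broadcast hash join: the small table becomes
# a hash map available to every partition of the large table, so no
# shuffle of the large side is needed. Names are illustrative.

def broadcast_hash_join(large_partitions, small_table):
    lookup = dict(small_table)           # built once, "broadcast" to all tasks
    joined = []
    for partition in large_partitions:   # each partition joins locally
        for key, value in partition:
            if key in lookup:            # inner-join semantics
                joined.append((key, value, lookup[key]))
    return joined

large = [[(1, "click"), (2, "view")], [(1, "buy"), (3, "view")]]
small = [(1, "alice"), (2, "bob")]
result = broadcast_hash_join(large, small)
# Unmatched key 3 is dropped; keys 1 and 2 join against the broadcast map.
```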
Participants should have at least a couple of months of experience developing Spark SQL applications. Knowledge of Hive is a plus.