Big Data Technology
Learn to work with Apache Spark the right way…
This is an advanced-level course on tuning Spark SQL (batch) applications. Participants will learn:
- Best practices and techniques to work with YARN
- How to perform resource planning for “Spark on YARN” applications
- How Spark executes physically on a cluster
- Spark SQL Execution Plan
- Spark SQL Best Practices
- How to optimize Spark SQL code?
Hands-on PySpark exercises will be performed in Jupyter notebooks integrated with Spark 2.4.x. The environment is installed in pseudo-distributed mode on the Cloudera platform.
Course Duration: Usually covered in 3 days. If it must be covered in 2 days, the pace will be faster and the class should start at least 1 hour earlier.
The purpose of the course is to provide pragmatic exposure to participants for tuning Spark applications. Apart from hands-on experience, the course focuses on how Spark works under the hood.
The intended audience for this course:
- Bigdata Developers
- Data Engineers
- Integration Engineers
- Data Stewards
Overview and Recap (to bring everyone onto the same page)
- Logical Architecture of Hadoop and Spark Ecosystem
- Understand Scope of Spark Optimization
- How Spark Optimization and the Distributed Storage Layer are related
- Quick Recap of MapReduce Concepts
- Logical and Physical Architectures of MapReduce
- Limitations of MRv1 Architecture
Understanding Spark Execution Environment on Hadoop - YARN
- About YARN
- Why YARN
- Architecture of YARN
- YARN UI and Commands
- Internals of YARN
- YARN Client vs YARN Cluster modes
- Experience execution of Spark application on YARN
- Troubleshooting and Debugging Spark applications on YARN
- Configurations for optimizing Application Performance (see the sketch after this list)
- Setting up and working with YARN Queues
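To make this module concrete, here is a minimal PySpark sketch of creating a SparkSession on YARN with explicit resource and queue settings. The app name, queue name, and sizing values are illustrative assumptions, not recommendations:

```python
# Minimal sketch: a SparkSession on YARN with explicit resource settings.
# Assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at the cluster configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-tuning-demo")                 # hypothetical app name
    .master("yarn")                              # YARN client mode by default
    .config("spark.executor.instances", "4")     # illustrative sizing
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .config("spark.yarn.queue", "analytics")     # hypothetical YARN queue
    .getOrCreate()
)

print(spark.sparkContext.uiWebUrl)  # Spark UI; also linked from the YARN RM UI
spark.stop()
```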
Spark Physical Execution
- Spark Core Plan
- Modes of Execution
- Standalone Mode
- Physical Execution on Cluster
- Narrow vs Wide Dependency
- Spark UI
- Executor Memory Architecture
- Key Properties
- Spark on YARN Detailed Architecture
- Resource Planning (worked example after this list)
- Discussion on Garbage Collection
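As a worked example for resource planning, the arithmetic below shows how a single executor's YARN container request is derived, assuming Spark 2.x defaults where the memory overhead is max(384 MiB, 10% of executor memory); the sizing values are illustrative:

```python
# Worked example: YARN container size for one executor (Spark 2.x defaults).
# spark.executor.memoryOverhead defaults to max(384 MiB, 10% of executor memory).
executor_memory_mb = 4096                               # spark.executor.memory = 4g
overhead_mb = max(384, int(0.10 * executor_memory_mb))  # 409 MiB here
container_mb = executor_memory_mb + overhead_mb
print(container_mb)                                     # 4505 MiB requested from YARN
```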
Dealing with Spark Partitions
- How does Spark determine the number of Partitions?
- Things to keep in mind while choosing the number of Partitions
- Small Partitions Problem
- Diagnosing & Handling Post-Filtering Issues (Skewness)
- Repartition vs Coalesce (see the sketch after this list)
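A minimal sketch contrasting repartition (full shuffle, can increase the partition count) with coalesce (merges existing partitions without a shuffle, so it can only reduce the count); the partition counts are illustrative:

```python
# Minimal sketch: repartition vs coalesce on a toy DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)

print(df.rdd.getNumPartitions())     # whatever the default parallelism gives

wide = df.repartition(200)           # full shuffle into 200 partitions
narrow = df.coalesce(4)              # merges partitions, no shuffle

print(wide.rdd.getNumPartitions())   # 200
print(narrow.rdd.getNumPartitions()) # 4
```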
Understanding and Overriding Shuffle Behaviour using Partitioners
- Partitioning Strategies
- Hash Partitioner
- Use of Range Partitioner
- Writing and Plugging in a Custom Partitioner (see the sketch after this list)
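In PySpark, a custom partitioner for a pair RDD is simply a function from key to partition index passed to partitionBy. A minimal sketch, where the routing rule is a made-up example:

```python
# Minimal sketch: plugging a custom partition function into a pair RDD.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("IN", 1), ("US", 2), ("IN", 3), ("DE", 4)])

def country_partitioner(key):
    # made-up rule: all "IN" records in partition 0, everything else in 1
    return 0 if key == "IN" else 1

by_country = pairs.partitionBy(2, country_partitioner)
print(by_country.glom().collect())  # expect [[('IN', 1), ('IN', 3)], [('US', 2), ('DE', 4)]]
```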
Spark SQL Execution Plan
- Data Partitioning
- Query Optimizer: Catalyst Optimizer (see the sketch after this list)
- Logical Plan
- Physical Plan
- Key Operations in Physical plan
- Partitioning in Spark SQL
- Customizing Physical Plan
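To see Catalyst at work, explain(True) prints the parsed, analyzed, and optimized logical plans along with the physical plan. A minimal sketch on a toy query (the column names are illustrative):

```python
# Minimal sketch: inspecting logical and physical plans for a query.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).select(F.col("id").alias("n"))

query = df.filter(F.col("n") > 10).groupBy((F.col("n") % 3).alias("bucket")).count()

# Exchange nodes in the physical plan mark shuffle boundaries.
query.explain(True)
```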
Working with Parquet
- Why are Data Formats important for optimization? (if time permits)
- Key Data Formats (if time permits)
- Comparisons – which one to choose when? (if time permits)
- How Parquet stores data
- Hands-on with Parquet
- Parquet Internal Structure
- Parquet Optimizations
- Parquet Key Configurations (see the sketch after this list)
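A minimal sketch of writing and reading Parquet with two of the key knobs covered here; the path and values are illustrative:

```python
# Minimal sketch: Parquet round trip with compression and filter pushdown.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

df = spark.range(1000).withColumn("bucket", F.col("id") % 10)
df.write.mode("overwrite").parquet("/tmp/demo_parquet")   # hypothetical path

# Filters on Parquet columns can be pushed to the reader; check the
# PushedFilters entry in the physical plan.
back = spark.read.parquet("/tmp/demo_parquet").filter(F.col("bucket") == 3)
back.explain()
```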
Caching and Checkpointing
- When to Cache?
- How Caching helps?
- Caching Strategies
- How does the Spark plan change when Caching is on?
- Visualizing Cached Dataset in Spark UI
- Working with On-Heap and Off-Heap Caching
- How is Caching different from Checkpointing? (see the sketch after this list)
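A minimal sketch contrasting caching (data kept for reuse, lineage retained) with checkpointing (data written to reliable storage, lineage truncated); the checkpoint directory is a hypothetical path:

```python
# Minimal sketch: caching vs checkpointing a DataFrame.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/demo_checkpoints")  # hypothetical dir

df = spark.range(1000000).selectExpr("id", "id % 7 AS k")

cached = df.persist(StorageLevel.MEMORY_ONLY)  # on-heap cache; see the Storage tab
cached.count()                                 # an action materializes the cache

checked = df.checkpoint()                      # eager by default; lineage is cut here
checked.explain()                              # plan now starts from the checkpoint
```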
Joins and more
- Types of Joins
- Quick Recap of MapReduce Map-Side and Reduce-Side Joins
- Optimizing Sort Merge Join
- Key Configurations for better performance (see the broadcast-join sketch after this list)
- Best Practices for writing Spark SQL code
- Common Production Issues
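A minimal sketch of forcing a broadcast (map-side) join so the large table avoids a shuffle; the table shapes are illustrative:

```python
# Minimal sketch: broadcast join to avoid shuffling the large side.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

large = spark.range(1000000).withColumn("country_id", F.col("id") % 100)
small = spark.range(100).withColumnRenamed("id", "country_id")

joined = large.join(F.broadcast(small), "country_id")
joined.explain()  # expect BroadcastHashJoin rather than SortMergeJoin

# Spark broadcasts automatically below this size threshold (in bytes):
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
```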
Participants should have at least a couple of months of experience developing Spark SQL applications. Knowledge of Hive is a plus.