
Spark Optimization


Categories:
Big Data Technology

Learn to work with Apache Spark the right way.

Course Overview: 

This is an advanced-level course on tuning Spark SQL (batch) applications. Participants will learn:

  • Best practices and techniques for working with YARN
  • How to plan resources for a “Spark on YARN” application
  • How Spark executes physically on a cluster
  • The Spark SQL execution plan
  • Spark SQL best practices
  • How to optimize Spark SQL code

PySpark hands-on exercises are performed in Jupyter notebooks integrated with Spark 2.4.x. This setup is installed in pseudo-distributed mode on the Cloudera platform.

Course Duration: Usually covered in 3 days. If it must be compressed into 2 days, the pace will be faster and the class should start at least an hour earlier.

Purpose:  

The purpose of the course is to give participants pragmatic exposure to tuning Spark applications. Beyond the hands-on work, the course focuses on how Spark works under the hood.

The intended audience for this course:

  • Developers
  • Big Data Developers
  • Data Engineers
  • Integration Engineers
  • Architects
  • Data Stewards

Overview and Recap (to bring everyone on the same page)
  • Logical Architecture of Hadoop and Spark Ecosystem
  • Understand Scope of Spark Optimization
  • How are Spark optimization and the distributed storage layer related?
  • Quick Recap of MapReduce Concepts
  • Logical and Physical Architectures of MapReduce
  • Limitations of MRv1 Architecture
Understanding Spark Execution Environment on Hadoop - YARN
  • About YARN
  • Why YARN
  • Architecture of YARN
  • YARN UI and Commands
  • Internals of YARN
  • YARN Client vs YARN Cluster modes
  • Hands-on execution of a Spark application on YARN
  • Troubleshooting and Debugging Spark applications on YARN
  • Configurations for optimizing Application Performance
  • Setting up and working with YARN Queues
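The YARN-facing knobs covered in this module typically surface as `spark-submit` options. A sketch of a YARN cluster-mode submission follows; the queue name (`analytics`), the script name, and the resource sizes are illustrative placeholders, not recommendations:

```shell
# Illustrative spark-submit for YARN cluster mode.
# Queue name, script name, and resource sizes are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue analytics \
  --num-executors 4 \
  --executor-cores 3 \
  --executor-memory 6g \
  --conf spark.executor.memoryOverhead=1024 \
  my_job.py
```

In client mode (`--deploy-mode client`) the driver runs on the submitting machine, which is usually what a Jupyter-based lab uses; cluster mode places the driver inside a YARN container.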
Spark Physical Execution
  • Spark Core Plan
  • Modes of Execution
  • Standalone Mode 
  • Physical Execution on Cluster
  • Narrow vs Wide Dependency
  • Spark UI
  • Executor Memory Architecture
  • Key Properties
  • Spark on YARN Detailed Architecture
  • Resource Planning
  • Discussion on Garbage Collection
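Resource planning usually starts from simple arithmetic over node capacity. The sketch below works through one commonly taught sizing heuristic; the node sizes, the "reserve one core and 1 GB for the OS and daemons" rule, and the ~5-cores-per-executor cap are assumptions for illustration, not fixed rules:

```python
# Hypothetical worker node: 16 cores, 64 GB RAM.
cores_per_node = 16
mem_per_node_gb = 64

# Reserve capacity for the OS and Hadoop daemons (assumed rule of thumb).
usable_cores = cores_per_node - 1      # 15
usable_mem_gb = mem_per_node_gb - 1    # 63

# A common heuristic caps executor cores at ~5 for healthy HDFS throughput.
cores_per_executor = 5
executors_per_node = usable_cores // cores_per_executor   # 3

# Split usable memory across executors, then carve out ~10% of each
# executor's share as off-heap overhead (spark.executor.memoryOverhead).
mem_per_executor_gb = usable_mem_gb // executors_per_node  # 21
heap_gb = int(mem_per_executor_gb * 0.9)                   # executor heap
overhead_gb = mem_per_executor_gb - heap_gb

print(executors_per_node, heap_gb, overhead_gb)
```

For this hypothetical node the arithmetic yields 3 executors per node, each with an 18 GB heap and 3 GB of memory overhead.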
Dealing with Spark Partitions
  • How does Spark determine the number of Partitions?
  • Things to keep in mind while determining partition count
  • Small Partitions Problem
  • Diagnosing & Handling Post Filtering Issues (Skewness)
  • Repartition vs Coalesce
Understanding and Overriding Shuffle Behaviour using Partitioners
  • Partitioning Strategies
  • Hash Partitioner
  • Use of Range Partitioner
  • Writing and plugging custom partitioner
Spark SQL Execution Plan
    • Data Partitioning
    • Query Optimizer: Catalyst Optimizer
    • Logical Plan
    • Physical Plan
    • Key Operations in Physical plan
    • Partitioning in Spark SQL
    • Customizing Physical Plan
Working with Parquet
  • Why are Data Formats important for optimization? (if time permits)
  • Key Data Formats (if time permits)
  • Comparisons – which one to choose when? (if time permits)
  • How does Parquet store data?
  • Working with Parquet 
  • Parquet Internal Structure
  • Parquet Optimizations 
  • Parquet Key Configurations
Caching and Checkpointing
  • When to Cache?
  • How does caching help?
  • Caching Strategies
  • How does the Spark plan change when Caching is on?
  • Visualizing Cached Dataset in Spark UI
  • Working with on-heap and off-heap caching
  • Checkpointing
  • How is Caching different from Checkpointing?
Joins and more
  • Types of Joins
  • Quick Recap of MapReduce MapSide and Reduce Side Joins
  • Broadcasting
  • Optimizing Sort Merge Join
  • Bucketing
  • Key Configurations for better performance
  • Best Practices for writing Spark SQL code
  • Common Production Issues

The participants should have at least a couple of months of experience developing Spark SQL applications. Knowledge of Hive is a plus.

Course Information

Duration: 3 Days

Mode of Delivery: Instructor-led / Virtual
