Join Expert-led Snowflake Platform Training in Bengaluru on Aug 1 - Enroll now at ₹6249 (with MSS750) + Free $80 SnowPro Voucher!

Apache Spark Development

Develop applications using a unified distributed application framework

Duration: 5 Days

Level: Advanced

This training course teaches you how to solve Big Data problems using the Apache Spark framework. It covers a wide range of Big Data use cases, including ETL, data warehousing (DWH), streaming and machine learning. It also demonstrates how Spark integrates with other well-established Hadoop ecosystem products. You will work through the curriculum via theory lectures, live demonstrations and lab exercises. The course is taught in Python, on Cloudera 7.x with Spark 3.x.

Clustering with Spark

A quick introduction to capabilities of YARN
YARN Client vs YARN Cluster Mode
YARN Multi-tenancy
Running Spark on a YARN cluster
Spark Physical Architecture
Cluster resource requirements
Managing Memory – Driver side & executor side
Managing memory/cores
Best practices
Hands-on Exercise(s) – Spark on YARN – Client vs Cluster modes
Hands-on Exercise(s) – Spark with YARN Queues

Understanding Spark Internals for Performance

Tweaking Degree of Parallelism
Scheduling, jobs, and tasks
Data structures, datasets and data lakes
Shuffle and performance
Understanding data sources and partitions
Handling Data Skew
Data Locality
Hands-on Exercise – Partitions in Spark
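
As a taste of the partitioning and data-skew topics in this module, here is a minimal pure-Python sketch of key salting. It is not Spark API code: the helper names (`partition_of`, `partition_sizes`, `salted_key`), the partition count and the sample keys are all hypothetical, chosen only to show why one hot key overloads a single partition and how a salt suffix lets it spread across several.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count, for illustration only


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Assign a key to a partition using a deterministic hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions: int = NUM_PARTITIONS):
    """Return the number of records that land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salted_key(key: str, record_index: int, salt_buckets: int) -> str:
    """Append a rotating salt so records of one key map to several keys."""
    return f"{key}#{record_index % salt_buckets}"


# Skewed sample data: the key "hot" dominates the dataset.
keys = ["hot"] * 1000 + ["a", "b", "c", "d"] * 10

plain = partition_sizes(keys)
salted = partition_sizes(
    [salted_key(k, i, salt_buckets=8) for i, k in enumerate(keys)]
)

print("largest partition without salting:", max(plain))
print("largest partition with salting:   ", max(salted))
```

Without salting, all 1000 "hot" records necessarily hash to the same partition. With salting they become eight distinct keys that the hash can place independently. In a real join, the other side of the join would also need to be expanded with the same salt values, which this sketch does not show.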

High Performance Spark applications

Performance tuning techniques
Caching
Joins optimization
Partitioning
Bucketing
SQL performance tuning using Spark Plans
High performance caching strategies
Best Practices
Common issues in production
Hands-on Exercise – Partitioning
Hands-on Exercise – Caching
Hands-on Exercise – Joins
Hands-on Exercise – Bucketing

Advanced Spark Structured Streaming

Getting started with Spark Streaming
Evolution of Spark Streaming
Types of Streaming
Advanced Stateful Operations (i.e. window aggregations, watermarking, etc.)
Monitoring
Fault Tolerance
Graceful termination
Configuring Kafka
Performance Tips & best practices
Hands-on Exercise(s) – Integration with Kafka
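
To illustrate the window aggregation and watermarking topics listed above, the following is a simplified pure-Python sketch, not Structured Streaming code. It assumes a tumbling window and a watermark that trails the largest event time seen so far by a fixed delay. The constants, the `windowed_counts` function and the sample events are hypothetical. Spark itself advances the watermark between micro-batches rather than per event, so treat this only as a model of the idea.

```python
from collections import defaultdict

WINDOW_SECONDS = 10   # hypothetical tumbling-window length
WATERMARK_DELAY = 5   # hypothetical allowed lateness, in seconds


def windowed_counts(events):
    """Count events per (window_start, key), dropping data behind the watermark.

    `events` is an iterable of (event_time_seconds, key) in arrival order.
    Returns the counts and the list of events dropped as too late.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        # The watermark trails the largest event time observed so far.
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - WATERMARK_DELAY

        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS

        # A window that ended at or before the watermark is considered closed.
        if window_end <= watermark:
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# Arrival order differs from event-time order: 3 and 8 arrive late.
events = [(1, "a"), (4, "a"), (12, "a"), (3, "a"), (27, "a"), (8, "a")]
counts, dropped = windowed_counts(events)
print(counts)
print(dropped)
```

The event at time 3 arrives late but is still counted, because its window (0 to 10) has not yet fallen behind the watermark. The event at time 8 arrives after the watermark has moved to 22, so it is dropped.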

Dealing with Parquet

Comparisons – which one to choose when?
Parquet Overview
Parquet Internal Structure
Parquet Optimizations
Parquet Key Configurations

Delta Lake

What is Delta Lake?
Why use Delta Lake?
Key Features
Parquet Overview
Parquet vs Delta Lake
Delta Lake Architecture
How does Delta Lake work?
Delta Lake Configuration Parameters
Delta Lake Hands-on using Spark SQL and Streaming – Loading and Storing Data
Operational challenges of large scale processing

Machine Learning and Statistics Fundamentals

What is Machine Learning?
Supervised vs Unsupervised Machine Learning
3Cs of Machine Learning
Statistics – Inferential and Descriptive
Relation between Statistics and Machine Learning
Relation between Data Science and Machine Learning
Key Platforms for Machine Learning
Bootstrapping Jupyter notebook for Machine Learning

Why Spark for Machine Learning?

Problems with Traditional Machine Learning Frameworks
Machine Learning at Scale – Various options
Why Spark?
Why does Spark perform well for iterative Machine Learning algorithms?

Machine Learning Data Pipeline

Data Acquisition from various data sources
Data Cleansing/Processing at Scale
Feature Engineering – Feature Extraction, Scaling etc.
Modelling the problem
Evaluation – Measuring Accuracy
Optimization and Tuning
Deployment

Data Acquisition

Acquiring Structured content from Relational Databases
Acquiring Semi-structured content from Log Files
Acquiring Unstructured content from other key sources like Web
Hands-on Exercises

Spark Machine Learning using MLlib

Spark ML vs Spark MLlib
Data types and key terms
Feature Extraction
Linear Regression using Spark MLlib
Hands-on Exercises

Spark Machine Learning using ML

Spark ML Overview
Transformers and Estimators
Pipelines
Implementing Decision Trees
K-Means Clustering using Spark ML
Hands-on Exercises

Model Evaluation, Optimization and Deployment

Model Evaluation
Optimizing a Model
Deploying a Model
Best Practices

Copyright 2025 © DataCouch