Join us for a FREE hands-on Meetup webinar on Streamlining Machine Learning Pipelines using Vertex AI | Sat, JUL 27 · 7:00 PM IST

Apache Spark Development

Develop applications using a unified distributed application framework


5 Days


Advanced Level

Design and Tailor this course

As per your team needs


This training course teaches you how to solve Big Data problems using the Apache Spark framework. It covers a wide range of Big Data use cases, including ETL, data warehousing (DWH), streaming and machine learning with Spark's ML library. It also demonstrates how Spark integrates with other well-established Hadoop ecosystem products. You will learn the curriculum through theory lectures, live demonstrations and lab exercises. The course is taught in Python, on Cloudera 7.x with Spark 3.x.


This course is designed for:

  • Application developers, DevOps engineers, Architects, QAs, Technical Managers
  • Developers who are experienced with Spark using Scala programming and want to advance their skills and knowledge in Streaming through Spark & Kafka and Spark Optimization.
  • A quick introduction to the capabilities of YARN
  • YARN Client vs YARN Cluster Mode
  • YARN Multi-tenancy
  • Running Spark on a YARN cluster
  • Spark Physical Architecture
  • Cluster resource requirements
  • Managing Memory – Driver side & executor side
  • Managing memory/cores
  • Best practices
  • Hands-on Exercise(s) – Spark on YARN – Client vs Cluster modes
  • Hands-on Exercise(s) – Spark with YARN Queues
  • Tweaking Degree of Parallelism 
  • Scheduling, jobs, and tasks
  • Data structures, data sets and data lakes
  • Shuffle and performance
  • Understanding data sources and partitions
  • Handling Data Skew
  • Data Locality
  • Hands-on Exercise – Partitions in Spark
  • Performance tuning techniques
  • Caching
  • Joins optimization
  • Partitioning 
  • Bucketing
  • SQL performance tuning using Spark Plans
  • High performance caching strategies
  • Best Practices
  • Common issues in production
  • Hands-on Exercise – Partitioning
  • Hands-on Exercise – Caching
  • Hands-on Exercise – Joins
  • Hands-on Exercise – Bucketing
  • Getting started with Spark Streaming
  • Evolution of Spark Streaming
  • Types of Streaming
  • Advanced Stateful Operations (e.g. window aggregations, watermarking)
  • Monitoring
  • Fault Tolerance
  • Graceful termination
  • Configuring Kafka 
  • Performance Tips & best practices
  • Hands-on Exercise(s) – Integration with Kafka
  • Comparisons – which one to choose when?
  • Parquet Overview
  • Parquet Internal Structure
  • Parquet Optimizations 
  • Parquet Key Configurations
  • What is Delta Lake?
  • Why use Delta Lake?
  • Key Features
  • Parquet Overview
  • Parquet vs Delta Lake
  • Delta Lake Architecture
  • How does Delta Lake work?
  • Configuration Params of Delta Lake
  • Delta Lake Hands-on using Spark SQL and Streaming – Loading and Storing Data
  • Operational challenges of large scale processing
  • What is Machine Learning?
  • Supervised vs Unsupervised Machine Learning
  • 3Cs of Machine Learning
  • Statistics – Inferential and Descriptive
  • Relation between Statistics and Machine Learning
  • Relation between Data Science and Machine Learning
  • Key Platforms for Machine Learning
  • Bootstrapping Jupyter notebook for Machine Learning
  • Problems with Traditional Machine Learning Frameworks
  • Machine Learning at Scale – Various options
  • Why Spark?
  • How does Spark perform well for iterative Machine Learning algorithms?
  • Data Acquisition from various data sources
  • Data Cleansing/Processing at Scale
  • Feature Engineering – Feature Extraction, Scaling etc.
  • Modelling the problem
  • Evaluation – Measuring Accuracy
  • Optimization and Tuning
  • Deployment
  • Acquiring Structured content from Relational Databases
  • Acquiring Semi-structured content from Log Files
  • Acquiring Unstructured content from other key sources like Web
  • Hands-on Exercises
  • Spark ML vs Spark MLlib
  • Data types and key terms
  • Feature Extraction
  • Linear Regression using Spark MLlib
  • Hands-on Exercises
  • Spark ML Overview
  • Transformers and Estimators
  • Pipelines
  • Implementing Decision Trees
  • K-Means Clustering using Spark ML
  • Hands-on Exercises
  • Model Evaluation
  • Optimizing a Model
  • Deploying Model
  • Best Practices
  • Basic knowledge of Big Data technologies, as covered in the Big Data Crash Course, or equivalent
  • Basic knowledge of Python 
  • Participants should preferably have basic knowledge of SQL, Scala/Java and Unix commands

