Join us for a FREE hands-on Meetup webinar on Mastering Snowflake Certifications: Associate & SnowPro Core Readiness | Friday, July 18th, 2025 · 6:00 PM IST / 08:30 AM ET

Going beyond Lambda Architecture

Building Real-Time Big Data Pipelines Using Kafka, Spark and Delta Lake

Duration

3 Days

Level

Advanced Level

Design and Tailor this course

As per your team needs

Course Notes:

All labs will be performed on AWS EMR/GCP DataProc in pseudo-distributed mode. Each participant will have their own AWS/GCP account to perform the exercises. We will use Confluent Kafka.

Learn Kafka basics, architecture, and internals; the Kafka ecosystem (for example, Schema Registry); Kafka optimization; KStreams; ksqlDB; and Delta Lake on Spark. The course also covers how to use Spark internals when working with Kafka and streaming, as well as debugging and troubleshooting.

Kafka Architecture

Background about Kappa Architecture
End to End Data Pipelines
Reference Architecture
Distributed Log Structure
Physical Architecture of Kafka
- Partitions
- Topics
- Replicas
- Producers & Consumers
- Brokers
Roles and Responsibilities of various components
Key Terminologies

Kafka Internals

Producer API
Internals of Producer Side
Message Acknowledgement
Batching Messages
Keyed and Non-Keyed Messages
Compression
Batching
Consumer API
Replicas & High Watermarks
Ack
Retention
Rebalancing
Key configuration settings
Optimization Tips
Best Practices
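
As a small illustration of the "Keyed and Non-Keyed Messages" topic, the sketch below shows why keyed messages land on a single partition while non-keyed messages are spread out. It is a minimal sketch, not Kafka's actual partitioner: `zlib.crc32` stands in for the hash Kafka uses, a simple round-robin counter stands in for its handling of non-keyed messages, and `choose_partition` is a hypothetical helper name.

```python
import zlib
from itertools import count


def choose_partition(key, num_partitions, round_robin):
    """Pick a partition index for one message (illustrative only)."""
    if key is None:
        # Non-keyed message: take the next slot in turn (assumed strategy).
        return next(round_robin) % num_partitions
    # Keyed message: a stable hash sends every message with this key
    # to the same partition, which preserves per-key ordering.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


counter = count()
keyed = [choose_partition("order-42", 6, counter) for _ in range(3)]
unkeyed = [choose_partition(None, 6, counter) for _ in range(3)]
print("keyed:", keyed)
print("non-keyed:", unkeyed)
```

The keyed list repeats one partition index, while the non-keyed list walks through partitions in order. Changing the partition count changes where a key lands, which is one reason to plan partition counts up front.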

Data Modeling & Schema Registry

Why Kafka Connect?
Physical Architecture
Key Configurations for Connect workers
Kafka Connect – Connectors
Kafka Connect – Tasks
Kafka Connect – Workers
Hands-on Exercise – Integrating Confluent Kafka with PostgreSQL

Overview
What is ksqlDB?
Why ksqlDB?
ksqlDB Architecture
ksqlDB Limitations
ksqlDB Key Syntax
Hands-on Exercise – Exploring ksqlDB

Streaming Analytics using KStreams

Streaming Introduction
Parallel Processing in KStreams
KStream
KTable
Caching in KTable
DSL API vs Processor API
Joins
Windowing related Concepts
Various types of Windowing
Hands-on Exercise – Working with KStreams
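
To make the windowing topics above concrete, here is a minimal plain-Python sketch of tumbling-window counting. It does not use the Kafka Streams API; it only shows the arithmetic of assigning an event to a fixed-size, non-overlapping window aligned to the epoch. The function names and sample events are hypothetical.

```python
from collections import defaultdict


def window_start(timestamp_ms, size_ms):
    """Start of the tumbling window [start, start + size_ms) holding the event."""
    return timestamp_ms - (timestamp_ms % size_ms)


def count_per_window(events, size_ms):
    """Count events per (key, window start), like a windowed count aggregation."""
    counts = defaultdict(int)
    for key, timestamp_ms in events:
        counts[(key, window_start(timestamp_ms, size_ms))] += 1
    return dict(counts)


# Hypothetical (key, event-time in ms) records.
events = [
    ("user-a", 1_000),
    ("user-a", 59_999),
    ("user-a", 60_000),
    ("user-b", 61_500),
]
result = count_per_window(events, size_ms=60_000)
print(result)
```

With a 60-second window, the first two `user-a` events share the window starting at 0, while the event at exactly 60,000 ms opens the next window because the window end is exclusive.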

Clustering with Spark

A quick introduction to the capabilities of YARN
YARN Client vs YARN Cluster Mode
YARN Multi-tenancy
Running Spark on a YARN cluster
Spark Physical Architecture
Cluster resource requirements
Managing Memory – Driver side & executor side
Managing memory/cores
Best practices
Hands-on Exercise(s) – Spark on YARN – Client vs Cluster modes
Hands-on Exercise(s) – Spark with YARN Queues

Understanding Spark Internals for Performance

Tweaking Degree of Parallelism
Scheduling, jobs, and tasks
Data structures, datasets and data lakes
Shuffle and performance
Understanding data sources and partitions
Handling Data Skew
Data Locality
Hands-on Exercise – Partitions in Spark

High Performance Spark applications

Performance tuning techniques
Caching
Joins optimization
Partitioning
Bucketing
SQL performance tuning using Spark Plans
High performance caching strategies
Best Practices
Common issues in production
Hands-on Exercise – Partitioning
Hands-on Exercise – Caching
Hands-on Exercise – Joins
Hands-on Exercise – Bucketing

Advanced Spark Structured Streaming

Getting started with Spark Streaming
Evolution of Spark Streaming
Types of Streaming
Advanced Stateful Operations (e.g. window aggregations, watermarking)
Checkpointing in S3
Monitoring
Fault Tolerance
Graceful termination
Configuring Kafka
Performance Tips & best practices
Hands-on Exercise(s) – Integration with Kafka
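
The watermarking topic listed above can be sketched without Spark. The simplified model below assumes the watermark for each micro-batch is the maximum event time seen in earlier batches minus an allowed delay, and that events older than the watermark are dropped. Real Structured Streaming applies this per stateful operator with more nuance; `run_batches` and the sample timestamps are hypothetical.

```python
def run_batches(batches, delay):
    """Split event times into accepted and dropped using a simple watermark."""
    max_event_time = None
    accepted, dropped = [], []
    for batch in batches:
        # The watermark is fixed for the batch, based on earlier batches only.
        watermark = None if max_event_time is None else max_event_time - delay
        for event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)   # too late: state may be gone
            else:
                accepted.append(event_time)
        # Advance the maximum observed event time after the batch completes.
        if batch:
            batch_max = max(batch)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
    return accepted, dropped


accepted, dropped = run_batches([[100, 105, 120], [112, 95, 130], [119, 125]], delay=10)
print("accepted:", accepted)
print("dropped:", dropped)
```

A larger delay tolerates more out-of-order data at the cost of keeping state longer, which is the trade-off to weigh when choosing the threshold.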

Dealing with Parquet

Comparisons – which one to choose when?
Parquet Overview
Parquet Internal Structure
Parquet Optimizations
Parquet Key Configurations

What is Delta Lake?
Why use Delta Lake?
Key Features
Parquet Overview
Parquet vs Delta Lake
Delta Lake Architecture
How does Delta Lake work?
Configuration Params of Delta Lake
Delta Lake Hands-on using Spark SQL and Streaming – Loading and Storing Data
Operational challenges of large scale processing
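
For the "How does Delta Lake work?" topic, one core idea is that a table is a set of Parquet data files plus an ordered transaction log of add and remove actions. The plain-Python sketch below replays such a log to find the files that make up a given table version. It is a simplified model, not the Delta Lake API: the log structure, file names, and the `replay` and `snapshot` helpers are hypothetical.

```python
def replay(log):
    """Apply commits in order and return the set of active data files."""
    active = set()
    for commit in log:
        for action, path in commit:
            if action == "add":
                active.add(path)
            elif action == "remove":
                active.discard(path)
    return active


def snapshot(log, version):
    """Files visible at a given version (commit index), as in time travel."""
    return replay(log[: version + 1])


# Hypothetical log: version 0 writes two files, version 1 rewrites one of them.
log = [
    [("add", "part-0000.parquet"), ("add", "part-0001.parquet")],
    [("remove", "part-0000.parquet"), ("add", "part-0002.parquet")],
]
print(sorted(snapshot(log, 0)))
print(sorted(snapshot(log, 1)))
```

Because older files are only marked as removed in the log, earlier versions remain readable until those files are physically cleaned up.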

Stay ahead with DataCouch! Your partner in mastering the latest advancements in AI, Data Science, DevOps, and more.

Copyright 2025 © DataCouch