Advanced Spark

Stream Analytics using Kafka, Spark, Structured Streaming and Delta Lake

Duration

3 Days

Level

Advanced Level

Design and Tailor this course

As per your team's needs

Overview

Learn Kafka basics, Kafka architecture, the Kafka ecosystem (e.g. Schema Registry), Kafka internals and optimization, Kafka Streams (KStreams), ksqlDB, Spark, and Delta Lake, as well as how to use Spark internals when working with Kafka and streaming, and how to debug and troubleshoot streaming applications.

Who Should Attend
  • Software Developer
  • Data Engineer
  • Data Scientist
Course Outline

Kafka Fundamentals & Architecture

  • Background on the Kappa Architecture
  • End-to-End Data Pipelines
  • Reference Architecture
  • Distributed Log Structure
  • Physical Architecture of Kafka
    • Partitions
    • Topics
    • Replicas
    • Producers & Consumers
    • Brokers
  • Roles and Responsibilities of the Various Components
  • Key Terminology (see the sketch after this list)
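
To make the physical-architecture terms concrete, here is a minimal sketch (broker address, topic name, and counts are illustrative assumptions) that creates a topic with an explicit partition count and replication factor via Kafka's AdminClient:

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopic extends App {
  val props = new Properties()
  // Assumed local broker; replace with your bootstrap servers.
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val admin = AdminClient.create(props)
  try {
    // A topic is split into partitions; each partition is replicated across brokers.
    val topic = new NewTopic("orders", /* partitions = */ 6, /* replicationFactor = */ 3.toShort)
    admin.createTopics(java.util.Collections.singletonList(topic)).all().get()
  } finally admin.close()
}
```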

Kafka Producers

  • Producer API (see the sketch after this list)
  • Producer-Side Internals
  • Message Acknowledgement
  • Batching Messages
  • Keyed and Non-Keyed Messages
  • Compression
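
A hedged producer sketch tying these settings together — acknowledgements, batching, compression, and keyed vs. non-keyed records; broker and topic names are assumptions:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ProducerSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.ACKS_CONFIG, "all")          // wait for all in-sync replicas
  props.put(ProducerConfig.LINGER_MS_CONFIG, "20")      // allow batching for up to 20 ms
  props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536")  // 64 KB batches
  props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4")

  val producer = new KafkaProducer[String, String](props)
  // Keyed message: records with the same key land on the same partition.
  producer.send(new ProducerRecord[String, String]("orders", "customer-42", """{"amount": 99.5}"""))
  // Non-keyed message: the partitioner spreads records across partitions.
  producer.send(new ProducerRecord[String, String]("orders", """{"amount": 12.0}"""))
  producer.flush()
  producer.close()
}
```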

Kafka Consumers

  • Consumer API (see the sketch after this list)
  • Replicas & High Watermarks
  • Acknowledgements
  • Retention
  • Rebalancing
  • Key Configuration Settings
  • Optimization Tips
  • Best Practices
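
A minimal consumer sketch, assuming the same illustrative broker and topic, showing a poll loop with manual offset commits and a few of the key configuration settings discussed above:

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object ConsumerSketch extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app")       // group members share partitions
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")  // commit offsets manually
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.Collections.singletonList("orders"))
  try {
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.asScala.foreach(r => println(s"${r.partition}/${r.offset}: ${r.value}"))
      consumer.commitSync()  // acknowledge the processed offsets
    }
  } finally consumer.close()
}
```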

Kafka Connect

  • Why Kafka Connect?
  • Physical Architecture
  • Key Configurations for Connect Workers
  • Kafka Connect – Connectors
  • Kafka Connect – Tasks
  • Kafka Connect – Workers
  • Hands-on Exercise – Integrating Confluent Kafka with PostgreSQL (sketched below)
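
As a sketch of what the PostgreSQL exercise involves (connection details, table, and connector name are assumptions), a Confluent JDBC source connector can be registered by POSTing its JSON config to the Connect REST API:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RegisterConnector extends App {
  // Illustrative JDBC source connector config; host, credentials, and table are assumptions.
  val config =
    """{
      |  "name": "pg-orders-source",
      |  "config": {
      |    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      |    "connection.url": "jdbc:postgresql://localhost:5432/shop",
      |    "connection.user": "postgres",
      |    "connection.password": "secret",
      |    "table.whitelist": "orders",
      |    "mode": "incrementing",
      |    "incrementing.column.name": "id",
      |    "topic.prefix": "pg-",
      |    "tasks.max": "2"
      |  }
      |}""".stripMargin

  val request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(config))
    .build()

  val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode}: ${response.body}")
}
```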

ksqlDB

  • Overview
  • What is ksqlDB?
  • Why ksqlDB?
  • ksqlDB Architecture
  • ksqlDB Limitations
  • ksqlDB Key Syntax
  • Hands-on Exercise – Exploring ksqlDB (see the sketch after this list)
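
A small sketch of ksqlDB's key syntax, submitted through the ksqlDB Java client; host, port, schema, and topic are assumptions, and the exact SQL syntax varies slightly across ksqlDB versions:

```scala
import io.confluent.ksql.api.client.{Client, ClientOptions}

object KsqlDbSketch extends App {
  // Assumes a ksqlDB server on localhost:8088.
  val options = ClientOptions.create().setHost("localhost").setPort(8088)
  val client  = Client.create(options)

  // Declare a stream over an existing Kafka topic.
  client.executeStatement(
    """CREATE STREAM orders_stream (id BIGINT, customer VARCHAR, amount DOUBLE)
      |WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');""".stripMargin
  ).get()

  // A persistent query: continuously materialize per-customer totals.
  client.executeStatement(
    """CREATE TABLE customer_totals AS
      |SELECT customer, SUM(amount) AS total
      |FROM orders_stream GROUP BY customer EMIT CHANGES;""".stripMargin
  ).get()

  client.close()
}
```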

Kafka Streams (KStreams)

  • Introduction to Streaming
  • Parallel Processing in Kafka Streams
  • KStream
  • KTable
  • Caching in KTable
  • DSL API vs. Processor API
  • Joins
  • Windowing Concepts
  • Types of Windows
  • Hands-on Exercise – Working with KStreams (see the sketch after this list)
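
A minimal Kafka Streams sketch combining several of these ideas — a KStream read from a topic, a keyed grouping, and a tumbling-window count materialized as a KTable; the topic and application id are assumptions:

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, Grouped, TimeWindows}

object StreamsSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-windowed-counts")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  // KStream: an unbounded stream of records read from a topic.
  val orders = builder.stream("orders", Consumed.`with`(Serdes.String(), Serdes.String()))

  // A KTable via a windowed aggregation: counts per key over 5-minute tumbling windows.
  orders
    .groupByKey(Grouped.`with`(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count()
    .toStream()
    .foreach((windowedKey, count) => println(s"$windowedKey -> $count"))

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```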

Spark on YARN

  • A Quick Introduction to the Capabilities of YARN
  • YARN Client vs. YARN Cluster Mode
  • YARN Multi-Tenancy
  • Running a Spark Cluster on YARN
  • Spark Physical Architecture
  • Cluster Resource Requirements (see the sizing sketch after this list)
  • Managing Memory – Driver Side & Executor Side
  • Managing Memory/Cores
  • Best Practices
  • Hands-on Exercise(s) – Spark on YARN – Client vs. Cluster Modes
  • Hands-on Exercise(s) – Spark with YARN Queues
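
Resource sizing is normally passed to spark-submit (e.g. --master yarn --deploy-mode cluster), but the same knobs can be sketched in code; all values below are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

object YarnAppSketch {
  def main(args: Array[String]): Unit = {
    // These settings are usually supplied via spark-submit flags;
    // the values here are illustrative assumptions.
    val spark = SparkSession.builder()
      .appName("yarn-sizing-demo")
      .master("yarn")
      .config("spark.executor.instances", "4")  // cluster resource requirements
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")    // executor-side memory
      .config("spark.driver.memory", "4g")      // driver-side memory
      .config("spark.yarn.queue", "analytics")  // multi-tenancy via YARN queues
      .getOrCreate()

    println(spark.sparkContext.uiWebUrl)
    spark.stop()
  }
}
```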

Partitions & Parallelism

  • Tweaking the Degree of Parallelism
  • Scheduling, Jobs, and Tasks
  • Data Structures – Datasets and Data Lakes
  • Shuffle and Performance
  • Understanding Data Sources and Partitions
  • Handling Data Skew
  • Data Locality
  • Hands-on Exercise – Partitions in Spark (see the sketch after this list)
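
A short sketch of the partitioning levers discussed above — shuffle-partition count, repartition() vs. coalesce(), and key-based repartitioning as a first step when investigating skew; paths and numbers are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitions-demo").getOrCreate()

    // Fewer shuffle partitions than the default of 200, for a small cluster (illustrative).
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    val df = spark.read.parquet("s3://my-bucket/events/")  // assumed path
    println(s"input partitions: ${df.rdd.getNumPartitions}")

    // repartition() shuffles to increase parallelism; coalesce() narrows without a shuffle.
    val wide   = df.repartition(128)
    val narrow = wide.coalesce(16)

    // Repartitioning by a column co-locates equal keys — a common first step
    // before diagnosing and salting skewed keys.
    val byKey = df.repartition(df("customer_id"))
    byKey.write.mode("overwrite").parquet("s3://my-bucket/events-by-customer/")
    spark.stop()
  }
}
```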

Spark Performance Tuning

  • Performance Tuning Techniques (see the sketch after this list)
  • Caching
  • Join Optimization
  • Partitioning
  • Bucketing
  • SQL Performance Tuning Using Spark Plans
  • High-Performance Caching Strategies
  • Best Practices
  • Common Issues in Production
  • Hands-on Exercise – Partitioning
  • Hands-on Exercise – Caching
  • Hands-on Exercise – Joins
  • Hands-on Exercise – Bucketing
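
A hedged sketch of three of these techniques together — caching, a broadcast join, and bucketing — with explain() used to inspect the plan; the table paths are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tuning-demo").getOrCreate()

    val orders    = spark.read.parquet("s3://my-bucket/orders/")     // large fact table (assumed)
    val customers = spark.read.parquet("s3://my-bucket/customers/")  // small dimension (assumed)

    // Caching: keep a reused dataset in memory across actions.
    orders.cache()
    orders.count()  // materialize the cache

    // Broadcast join: ship the small side to every executor, avoiding a shuffle.
    val joined = orders.join(broadcast(customers), "customer_id")
    joined.explain()  // inspect the physical plan (a BroadcastHashJoin is expected)

    // Bucketing: pre-shuffle data on the join key at write time so later
    // joins and aggregations on that key can skip the shuffle.
    orders.write
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .mode("overwrite")
      .saveAsTable("orders_bucketed")  // bucketing requires saveAsTable

    spark.stop()
  }
}
```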

Spark Structured Streaming

  • Getting Started with Spark Streaming
  • Evolution of Spark Streaming
  • Types of Streaming
  • Advanced Stateful Operations (e.g. Window Aggregations, Watermarking)
  • Checkpointing in S3
  • Monitoring
  • Fault Tolerance
  • Graceful Termination
  • Configuring Kafka
  • Performance Tips & Best Practices
  • Hands-on Exercise(s) – Integration with Kafka (see the sketch after this list)
  • Comparisons – Which One to Choose, and When?
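
A minimal Structured Streaming sketch for the Kafka integration: a windowed, watermarked aggregation with an S3 checkpoint location. It assumes the spark-sql-kafka-0-10 package is on the classpath; the broker, topic, and paths are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-streaming-demo").getOrCreate()
    import spark.implicits._

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "orders")
      .option("startingOffsets", "latest")
      .load()

    // Stateful windowed aggregation; the watermark bounds state size
    // and tolerates events arriving up to 10 minutes late.
    val counts = raw
      .select($"timestamp", $"value".cast("string").as("value"))
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"value")
      .count()

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "s3://my-bucket/checkpoints/orders/")  // fault tolerance
      .start()

    query.awaitTermination()
  }
}
```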

Parquet

  • Parquet Overview
  • Parquet Internal Structure
  • Parquet Optimizations
  • Parquet Key Configurations (see the sketch after this list)
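
A short sketch of the Parquet configurations in practice — compression codec, predicate pushdown, and a date-partitioned layout; the paths are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()

    // Key Parquet-related settings (values illustrative).
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")  // push predicates to row groups

    val events = spark.read.parquet("s3://my-bucket/raw-events/")  // assumed path

    // Partition the layout by date so readers prune directories,
    // while Parquet's columnar format prunes unread columns.
    events.write
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet("s3://my-bucket/events-parquet/")

    spark.stop()
  }
}
```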

Delta Lake

  • What is Delta Lake?
  • Why Use Delta Lake?
  • Key Features
  • Parquet vs. Delta Lake
  • Delta Lake Architecture
  • How Does Delta Lake Work?
  • Delta Lake Configuration Parameters
  • Hands-on Exercise – Delta Lake with Spark SQL and Streaming – Loading and Storing Data (see the sketch after this list)
  • Operational Challenges of Large-Scale Processing
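
A hedged sketch of the closing hands-on: writing a Delta table, querying it with Spark SQL (including time travel), and using it as both a streaming source and sink. It assumes the io.delta delta-spark package is on the classpath; all paths are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object DeltaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-demo")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Batch: store a table as Delta (Parquet data files plus a transaction log).
    spark.read.parquet("s3://my-bucket/events-parquet/")
      .write.format("delta").mode("overwrite").save("s3://my-bucket/events-delta/")

    // Spark SQL over the same Delta table, including time travel.
    spark.sql("SELECT COUNT(*) FROM delta.`s3://my-bucket/events-delta/` VERSION AS OF 0").show()

    // Streaming: a Delta table can serve as both streaming source and sink.
    spark.readStream.format("delta").load("s3://my-bucket/events-delta/")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "s3://my-bucket/checkpoints/delta/")
      .start("s3://my-bucket/events-delta-copy/")
      .awaitTermination()
  }
}
```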

Lab Environment

All labs are performed on AWS EMR in pseudo-distributed mode. Each participant will have their own AWS account for the exercises. We will be using Confluent Kafka.
