Advanced Data Engineering with Azure, Spark, and Airflow

Master Data Pipelines, Optimization Techniques, and Real-World Implementations

Duration

5 Days

Level

Advanced Level

Design and Tailor this course

As per your team needs

Overview

This professional data engineering training course provides comprehensive, hands-on experience covering Azure data engineering services, Apache Spark, PySpark best practices, Apache Airflow, PostgreSQL, and effective pipeline management. Participants will engage in practical labs, working through real-world use cases to solidify theoretical concepts, optimize data processing workflows, and orchestrate complex data pipelines using Airflow and Kubernetes.

Audience

This course is tailored for:

Data engineers and technical teams
Teams using Azure infrastructure, PySpark, and Apache Airflow
Architects and DevOps engineers

Prerequisites

Participants should ideally have:

Basic knowledge of Python programming
Fundamental understanding of data engineering concepts
Familiarity with SQL and data manipulation
Basic awareness of cloud platforms, preferably Azure

Curriculum

Module 1: Introduction to Data Engineering

Overview of data engineering roles and responsibilities
Common tools and frameworks in data engineering
Data engineering lifecycle: Ingestion, Storage, Processing, Serving

Module 2: Docker & Kubernetes Architecture

Containerization in detail: creating a Dockerfile, building Docker images, pushing them to a registry, and running containers
Kubernetes fundamentals: nodes, pods, containers, ReplicaSets
Workloads: Deployments, StatefulSets, DaemonSets
Understanding persistent volumes
Configuration: ConfigMaps, Secrets, resource requests/limits
Networking: Services
Lab Exercise: In this lab, you will create a simple web application, write a Dockerfile to containerize it, build the Docker image, and push it to a Docker registry like Docker Hub.
Lab Exercise: In this lab, you will deploy a previously containerized application to a local Kubernetes cluster, using a Deployment to manage the pods and a Service to expose the application. You will then inject configuration into the pods using ConfigMaps and Secrets, and add persistent storage using PersistentVolumes and PersistentVolumeClaims.
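
The Deployment and Service pairing from the lab above can be sketched as plain Python dictionaries serialized to JSON, which kubectl accepts alongside YAML. This is a minimal illustration, not the course's lab solution: the application name, image reference, ports, resource values, and the ConfigMap and Secret names are all hypothetical.

```python
import json


def build_manifests(app_name, image, replicas=2, container_port=8080):
    """Return a Deployment and a Service (as dicts) for one containerized app."""
    labels = {"app": app_name}
    container = {
        "name": app_name,
        "image": image,
        "ports": [{"containerPort": container_port}],
        # Non-secret settings come from a ConfigMap, credentials from a Secret.
        # Both objects are assumed to exist already under these hypothetical names.
        "envFrom": [
            {"configMapRef": {"name": f"{app_name}-config"}},
            {"secretRef": {"name": f"{app_name}-secret"}},
        ],
        # Requests guide scheduling; limits cap what the container may use.
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app_name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": app_name},
        "spec": {
            "type": "ClusterIP",
            # The Service finds pods through the same labels the Deployment sets.
            "selector": labels,
            "ports": [{"port": 80, "targetPort": container_port}],
        },
    }
    return deployment, service


deployment, service = build_manifests("demo-web", "docker.io/example/demo-web:1.0")
manifest_json = json.dumps(
    {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}, indent=2
)
print(manifest_json)
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` would create both objects. The point to notice is that the Deployment's selector, the pod template's labels, and the Service's selector all carry the same label set.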

Module 3: Introduction to Azure Data Engineering

Introduction to Azure ecosystem for data engineering
Lab Exercise: Explore basic data engineering workflows, become familiar with common tools, and set up Azure accounts and environments.

Module 4: Azure Kubernetes Service (AKS) & Azure Integration

AKS architecture: control plane vs. node pools
Cluster provisioning, node autoscaling, and upgrades
Integrating with Azure Container Registry (ACR)
Networking: Azure CNI vs. Kubenet, Load Balancer, ingress setup
Security & Identity: Azure AD integration, Managed Identities, RBAC
Monitoring & Logging: Azure Monitor for containers, Log Analytics
Helm charts (if time permits)

Lab Exercise: Provision an AKS cluster, configure ACR integration, deploy a sample containerized application, and set up Azure Monitor and Log Analytics.

Module 5: Azure Blob Storage (Enhanced)

Storage account types and configuration
Blob Containers, blob types, and management
Advanced access tiers & lifecycle management policies
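
As an illustration of access tiers combined with lifecycle management, the sketch below builds a lifecycle policy document in the JSON shape that Azure Storage lifecycle management rules use. The rule name, blob prefix, and day thresholds are hypothetical values chosen for the example, and the helper function is not part of any Azure SDK.

```python
import json


def lifecycle_rule(name, prefix, cool_after_days, archive_after_days, delete_after_days):
    """Build one lifecycle rule: Hot -> Cool -> Archive -> delete, by blob age."""
    if not (cool_after_days < archive_after_days < delete_after_days):
        raise ValueError("expected cool < archive < delete thresholds")

    def after(days):
        # Age is measured from the blob's last modification time.
        return {"daysAfterModificationGreaterThan": days}

    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            # Only block blobs under the given prefix are affected.
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": after(cool_after_days),
                    "tierToArchive": after(archive_after_days),
                    "delete": after(delete_after_days),
                }
            },
        },
    }


policy = {"rules": [lifecycle_rule("age-out-raw-logs", "raw/logs/", 30, 90, 365)]}
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

With these example thresholds, blobs under `raw/logs/` would move to the Cool tier after 30 days without modification, to Archive after 90, and be deleted after 365.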

Module 6: MinIO for Object Storage

Introduction to MinIO and S3-compatible storage concepts
Deploying MinIO on Kubernetes and standalone environments
Bucket and object management, policies, and access credentials
High availability: distributed mode and erasure coding
Lab Exercise: Deploy MinIO in a local minikube cluster, create buckets and access policies, use the mc client for common operations, enable and use object locking and versioning, set lifecycle policies, and add replication and remote buckets.

Module 7: Apache Spark Architecture

Core Spark components: Driver, Executors, Tasks
Deployment modes: Standalone, YARN, Kubernetes, Databricks
Resource allocation: executor memory, cores, dynamic resource allocation
Shuffle and partition mechanics, data locality considerations
Performance tuning: caching, join strategies, broadcast variables
Monitoring Spark applications: Spark UI, metrics, logs
Lab Exercise: Run a Spark workload in multiple deployment modes, capture metrics via Spark UI, and apply tuning strategies to improve performance. (This lab has multiple parts covering different execution plans.)
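
The resource allocation topic above comes down to simple arithmetic: each executor container needs its heap (`spark.executor.memory`) plus off-heap overhead. The sketch below assumes Spark's documented defaults for `spark.executor.memoryOverhead` (10% of executor memory for JVM jobs, with a 384 MiB minimum). The node size and executor settings are hypothetical, and the rounding is an approximation.

```python
MIN_OVERHEAD_MIB = 384  # documented minimum for the executor memory overhead


def container_memory_mib(executor_memory_mib, overhead_factor=0.10):
    """Approximate memory requested per executor container: heap plus overhead."""
    overhead = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead


def executors_per_node(node_memory_mib, node_cores, executor_memory_mib,
                       executor_cores, overhead_factor=0.10):
    """How many executors fit on one node, limited by memory or by cores."""
    by_memory = node_memory_mib // container_memory_mib(executor_memory_mib, overhead_factor)
    by_cores = node_cores // executor_cores
    return min(by_memory, by_cores)


# Hypothetical node: 64 GiB allocatable memory and 16 cores.
print(container_memory_mib(8192))                # 8 GiB heap plus 10% overhead
print(container_memory_mib(2048))                # small heap: the 384 MiB floor applies
print(executors_per_node(65536, 16, 8192, 4))    # cores are the limiting factor here
```

In this example memory would allow seven 8 GiB executors on the node, but four cores per executor caps the count at four, which suggests either fewer cores per executor or more memory per executor. PySpark jobs typically need a larger overhead factor because Python worker processes live outside the JVM heap.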

Module 8: PySpark Best Practices

Persistence strategies: cache(), persist() and storage levels
DataFrame partitioning: repartition(), coalesce(), partitionBy() best practices
Handling data skew: salting and custom partitioners
Error handling: OOM, serialization issues, common pitfalls
Efficient use of built-in functions vs. UDFs
DataFrame/Dataset internals: Catalyst optimizer & Tungsten execution
Best practices for building efficient DataFrame pipelines

Lab Exercise: Optimize a PySpark job by applying caching, repartitioning, and built-in function replacements to resolve performance and memory issues.
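
Salting, listed above as a remedy for data skew, can be demonstrated without a cluster. The simulation below is plain Python rather than PySpark: `partition_of` is a stand-in for a hash partitioner, and the key names, row counts, and bucket counts are made up for the example.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
SALT_BUCKETS = 8


def partition_of(key):
    """Deterministic stand-in for a hash partitioner (not Spark's actual hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many rows land in each partition."""
    return Counter(partition_of(k) for k in keys)


rng = random.Random(42)

# Skewed input: one hot key owns 90% of the rows.
keys = ["customer_hot"] * 9000 + [f"customer_{i}" for i in range(1000)]

# Without salting, every row of the hot key hashes to the same partition.
unsalted = partition_sizes(keys)

# With salting, a random suffix splits each key into SALT_BUCKETS sub-keys.
salted = partition_sizes(f"{k}#{rng.randrange(SALT_BUCKETS)}" for k in keys)

print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:   ", max(salted.values()))
```

In a real PySpark job the salt would typically be an extra column, and the technique has a cost: aggregations need a second pass to combine the per-salt partial results, and joins require the other side to be replicated once per salt value.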

Module 9: Submitting and Tuning Spark on Kubernetes

Spark on Kubernetes: submitting applications, driver/executor pod layout
Tuning Spark clusters on K8s: cores, memory, executor and driver settings
Monitoring Spark on Kubernetes: UI access, metrics collection
Integrating Spark with MinIO

Lab Exercise: Submit a Spark application on Kubernetes (minikube), adjust Spark configurations, and analyze the resulting performance improvements.
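
A submission like the one in this lab might be assembled as follows. The sketch only builds the `spark-submit` argument list and does not run it. The API server address, image name, application path, job name, and MinIO endpoint are hypothetical placeholders; the configuration keys are standard Spark and Hadoop S3A settings. Reading from MinIO would also require the S3A connector on the image and credentials, which are better supplied through Kubernetes Secrets than on the command line.

```python
import shlex


def spark_submit_args(api_server, image, app_path, executors=2,
                      executor_memory="2g", executor_cores=1, minio_endpoint=None):
    """Build a spark-submit command for Kubernetes cluster mode as a list of args."""
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
    }
    if minio_endpoint:
        # S3A settings for an S3-compatible store; MinIO is usually addressed
        # with path-style URLs rather than virtual-host-style bucket names.
        conf["spark.hadoop.fs.s3a.endpoint"] = minio_endpoint
        conf["spark.hadoop.fs.s3a.path.style.access"] = "true"

    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", "etl-job",  # hypothetical job name
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


args = spark_submit_args(
    api_server="https://192.168.49.2:8443",         # hypothetical minikube API server
    image="example/spark-py:latest",                # hypothetical image
    app_path="local:///opt/spark/app/etl.py",       # file baked into the image
    executors=3,
    minio_endpoint="http://minio.default.svc:9000",  # hypothetical Service address
)
print(shlex.join(args))
```

Keeping the tunable values as function parameters makes it easy to rerun the same job with different executor counts or memory sizes and compare the results in the Spark UI.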

Module 10: Apache Airflow Basics

Airflow architecture: Scheduler, Executor, Webserver, Metadata DB
DAG design patterns: tasks, dependencies, XComs, Variables
Core Operators & Hooks (Python, Bash, Azure integrations)
Monitoring, logging, and alerting in Airflow
Lab Exercise: Build a DAG that uses XComs, Variables, and core operators and hooks.

Lab Exercise: Build a DAG to orchestrate an end-to-end Spark ETL workflow, including parameterization and basic error handling.
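
The ideas behind a DAG (tasks, upstream dependencies, and small values passed between tasks) can be shown in plain Python before touching Airflow. The sketch below deliberately uses only the standard library and is not Airflow's API: the `results` dictionary plays the role that XComs play in Airflow, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, upstream):
    """Run callables in dependency order; each one receives the shared results."""
    results = {}  # stands in for XCom: small values handed from task to task
    order = list(TopologicalSorter(upstream).static_order())
    for task_id in order:
        results[task_id] = tasks[task_id](results)
    return order, results


# Each task reads what its upstream tasks produced.
tasks = {
    "extract": lambda r: [3, 1, 2],
    "transform": lambda r: sorted(r["extract"]),
    "load": lambda r: f"loaded {len(r['transform'])} rows",
}

# Map each task to the set of tasks that must finish before it starts.
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order, results = run_pipeline(tasks, upstream)
print(order)
print(results["load"])
```

A scheduler such as Airflow adds what this sketch leaves out: scheduling by time, retries, parallel execution across workers, and persistent state in a metadata database.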

Module 11: Containerized Airflow Deployment

Understand how Apache Airflow runs in containerized environments
Explore Kubernetes-based orchestration for workflow scheduling
Learn basic resource configuration and scaling concepts
Integrate Airflow with AKS and container registries
Get an overview of monitoring and alerting tools for Airflow

Module 12: Advanced Airflow Concepts

Complex scheduling using cron expressions, presets, and execution semantics
Sensors in detail: FileSensor, SqlSensor, ExternalTaskSensor (focus on ExternalTaskSensor)
Introduction to Triggers and Deferrable Operators
Building Dynamic DAGs and leveraging task mapping (with attention to container memory usage)
Performance tuning with executor scaling and resource concurrency limits

Lab Exercise: In this lab, you will combine dynamic DAG generation with sensor-based control flow to orchestrate a parameterized Spark ETL job. You will use conditional branching to choose the type of ETL load, wait for dependencies using deferrable sensors, and execute the Spark job with error handling and retry logic. This lab demonstrates how to build production-grade, intelligent data pipelines in Airflow.
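
Two control-flow ideas from this lab, conditional branching and retry logic, are sketched below in plain Python. This is not Airflow code: in Airflow the same behavior is normally configured through a branching operator and task-level settings such as `retries` and `retry_delay`. The threshold, delays, and the simulated flaky job are hypothetical.

```python
import time


def choose_load_type(rows_changed, total_rows, threshold=0.2):
    """Branch: full reload when a large share of rows changed, else incremental."""
    if total_rows == 0 or rows_changed / total_rows > threshold:
        return "full_load"
    return "incremental_load"


def run_with_retries(task, retries=3, base_delay=0.01, sleep=time.sleep):
    """Call task(); after a failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)


attempts = []


def flaky_spark_job():
    """Simulated job that fails twice with a transient error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "succeeded"


branch = choose_load_type(rows_changed=50, total_rows=1000)
outcome = run_with_retries(flaky_spark_job)
print(branch, outcome, len(attempts))
```

Exponential backoff gives a struggling downstream system progressively more time to recover, while the retry cap ensures a persistent failure still surfaces as an error.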

Module 13: PostgreSQL for Data Engineering

Database design: schemas, tables, indexing strategies
Functions in PostgreSQL
Performance tuning: vacuuming, indexing, query optimization
Partitions in PostgreSQL

Lab Exercise: In this lab, you will work with a large orders table to implement range and list partitioning strategies in PostgreSQL. You will then explore performance tuning techniques by analyzing query execution plans using EXPLAIN ANALYZE, applying VACUUM and AUTOVACUUM, and testing the effectiveness of indexes on both partitioned and non-partitioned tables.
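
The partitioning part of this lab can be prepared with a small generator for PostgreSQL declarative partitioning DDL. The sketch only produces SQL strings and does not connect to a database. The table layout, column names, and region values are hypothetical, and the string formatting is meant for trusted, hard-coded inputs only.

```python
from datetime import date

# Hypothetical parent table, partitioned by range on the order date.
PARENT_DDL = (
    "CREATE TABLE orders (order_id bigint, order_date date NOT NULL, "
    "region text, amount numeric) PARTITION BY RANGE (order_date);"
)


def month_starts(start, months):
    """Yield the first day of `months + 1` consecutive months, beginning at start."""
    year, month = start.year, start.month
    for _ in range(months + 1):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def range_partition_ddl(parent, start, months):
    """One partition per month; the lower bound is inclusive, the upper exclusive."""
    bounds = list(month_starts(start, months))
    return [
        f"CREATE TABLE {parent}_{lo:%Y_%m} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{lo.isoformat()}') TO ('{hi.isoformat()}');"
        for lo, hi in zip(bounds, bounds[1:])
    ]


def list_partition_ddl(parent, values_by_partition):
    """One partition per named group of discrete values (parent is PARTITION BY LIST)."""
    statements = []
    for name, values in values_by_partition.items():
        value_list = ", ".join(f"'{v}'" for v in values)
        statements.append(
            f"CREATE TABLE {parent}_{name} PARTITION OF {parent} "
            f"FOR VALUES IN ({value_list});"
        )
    return statements


range_ddl = range_partition_ddl("orders", date(2024, 11, 1), 3)
list_ddl = list_partition_ddl("orders_by_region", {"emea": ["DE", "FR"], "apac": ["IN", "JP"]})
print(PARENT_DDL)
print("\n".join(range_ddl + list_ddl))
```

Because each upper bound is exclusive and equals the next partition's lower bound, the monthly partitions cover the date range without gaps or overlaps. Running a date-filtered query under `EXPLAIN ANALYZE` afterward should show whether partition pruning takes effect.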

Module 14: Capstone Project

Project:

In teams, design and implement a full data engineering pipeline:

Ingest raw data into Azure Blob Storage and MinIO
Process data with Spark on AKS
Store results in PostgreSQL
Orchestrate workflows with Airflow on AKS

Present solutions, discuss design decisions, and share best practices.
