Advanced Data Engineering with Azure, Spark, and Airflow

Master Data Pipelines, Optimization Techniques, and Real-World Implementations

Duration

5 Days

Level

Advanced Level

Design and Tailor this course

As per your team needs

Overview

This professional data engineering training course provides comprehensive, hands-on experience covering Azure data engineering services, Apache Spark, PySpark best practices, Apache Airflow, PostgreSQL, and effective pipeline management. Participants will engage in practical labs, working through real-world use cases to solidify theoretical concepts, optimize data processing workflows, and orchestrate complex data pipelines using Airflow and Kubernetes.

Audience

This course is tailored for:

  • Data engineers and technical teams
  • Teams using Azure infrastructure, PySpark, and Apache Airflow
  • Architects and DevOps engineers

Prerequisites

Participants should ideally have:

  • Basic knowledge of Python programming
  • Fundamental understanding of data engineering concepts
  • Familiarity with SQL and data manipulation
  • Basic awareness of cloud platforms, preferably Azure

Curriculum

  • Overview of data engineering roles and responsibilities
  • Common tools and frameworks in data engineering
  • Data engineering lifecycle: Ingestion, Storage, Processing, Serving
  • Containerization in detail: writing a Dockerfile, building Docker images, pushing them to a registry, and running containers
  • Kubernetes fundamentals: nodes, pods, containers, ReplicaSets
  • Workloads: Deployments, StatefulSets, DaemonSets
  • Understanding persistent volumes.
  • Configuration: ConfigMaps, Secrets, resource requests/limits
  • Networking: Services
  • Lab Exercise: In this lab, you will create a simple web application, write a Dockerfile to containerize it, build the Docker image, and push it to a Docker registry like Docker Hub. 
  • Lab Exercise: In this lab, you will deploy the previously containerized application to a local Kubernetes cluster, using a Deployment to manage the pods and a Service to expose the application. You will then inject configuration into the pods using ConfigMaps and Secrets, and add persistent storage using PersistentVolumes and PersistentVolumeClaims.
  • Introduction to Azure ecosystem for data engineering
  • Lab Exercise: Explore basic data engineering workflows, become familiar with common tools, and set up Azure accounts and environments.
  • AKS architecture: control plane vs. node pools
  • Cluster provisioning, node autoscaling, and upgrades
  • Integrating with Azure Container Registry (ACR)
  • Networking: Azure CNI vs. Kubenet, Load Balancer, ingress setup
  • Security & Identity: Azure AD integration, Managed Identities, RBAC
  • Monitoring & Logging: Azure Monitor for containers, Log Analytics
  • Helm charts (if time permits)

Lab Exercise: Provision an AKS cluster, configure ACR integration, deploy a sample containerized application, and set up Azure Monitor and Log Analytics.

  • Storage account types and configuration
  • Blob Containers, blob types, and management
  • Advanced access tiers & lifecycle management policies
  • Introduction to MinIO and S3-compatible storage concepts
  • Deploying MinIO on Kubernetes and standalone environments
  • Bucket and object management, policies, and access credentials
  • High availability: distributed mode and erasure coding
  • Lab Exercise: Deploy MinIO in a local minikube cluster, create buckets and access policies, use the mc client for common operations, enable and use object locking and versioning, set lifecycle policies, and add replication and remote buckets.
  • Core Spark components: Driver, Executors, Tasks
  • Deployment modes: Standalone, YARN, Kubernetes, Databricks
  • Resource allocation: executor memory, cores, dynamic resource allocation
  • Shuffle and partition mechanics, data locality considerations
  • Performance tuning: caching, join strategies, broadcast variables
  • Monitoring Spark applications: Spark UI, metrics, logs
  • Lab Exercise: Run a Spark workload in multiple deployment modes, capture metrics via the Spark UI, and apply tuning strategies to improve performance. (This lab has multiple parts, each covering a different execution plan.)
  • Persistence strategies: cache(), persist() and storage levels
  • DataFrame partitioning: repartition(), coalesce(), partitionBy() best practices
  • Handling data skew: salting and custom partitioners
  • Error handling: OOM, serialization issues, common pitfalls
  • Efficient use of built-in functions vs. UDFs
  • DataFrame/Dataset internals: Catalyst optimizer & Tungsten execution
  • Best practices for building efficient DataFrame pipelines

Lab Exercise: Optimize a PySpark job by applying caching, repartitioning, and built-in function replacements to resolve performance and memory issues.
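The salting technique listed above for handling data skew can be illustrated without a cluster. The following is a minimal plain-Python sketch, not PySpark code: toy_partition is a hypothetical, deterministic stand-in for a hash partitioner, and the salt is assigned round-robin (rather than randomly, as is common in practice) so that the result is reproducible.

```python
from collections import Counter
from typing import List


def toy_partition(key: str, num_partitions: int) -> int:
    """Toy, deterministic stand-in for a hash partitioner (an assumption for this sketch)."""
    return sum(key.encode("utf-8")) % num_partitions


def partition_loads(keys: List[str], num_partitions: int) -> Counter:
    """Count how many rows land in each partition."""
    return Counter(toy_partition(k, num_partitions) for k in keys)


def salt_keys(keys: List[str], num_salts: int) -> List[str]:
    """Append a salt suffix so rows sharing one hot key spread over several keys."""
    return [f"{k}#{i % num_salts}" for i, k in enumerate(keys)]


if __name__ == "__main__":
    # Hypothetical skewed dataset: one key accounts for 90% of the rows.
    keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
    before = partition_loads(keys, 8)
    after = partition_loads(salt_keys(keys, 8), 8)
    print("largest partition before salting:", max(before.values()))
    print("largest partition after salting: ", max(after.values()))
```

In a real job the salt has to be undone afterwards: an aggregation typically runs in two stages (per salted key, then per original key), and a join needs the other side replicated once per salt value.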

  • Spark on Kubernetes: submitting applications, driver/executor pod layout
  • Tuning Spark clusters on K8s: cores, memory, executor and driver settings
  • Monitoring Spark on Kubernetes: UI access, metrics collection
  • Integrating Spark with MinIO

Lab Exercise: Submit a Spark application on Kubernetes (minikube), adjust Spark configurations, and analyze the resulting performance improvements.
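As a rough illustration of the settings adjusted in this lab, the sketch below assembles a spark-submit command line for a Kubernetes master. The configuration keys are standard Spark properties; the function name, the default values, and any API server URL, image name, or application path passed in are placeholders assumed for this example rather than values from the course.

```python
from typing import List


def build_spark_submit(
    api_server: str,
    image: str,
    app_path: str,
    executors: int = 2,
    executor_memory: str = "2g",
    executor_cores: int = 1,
    driver_memory: str = "1g",
) -> List[str]:
    """Return the argv list for submitting a Spark application in cluster mode on Kubernetes."""
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
        "spark.driver.memory": driver_memory,
    }
    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # In cluster mode, a local:// path refers to a file inside the container image.
    args.append(app_path)
    return args


if __name__ == "__main__":
    # Placeholder values; substitute the ones from your own cluster.
    cmd = build_spark_submit(
        "https://192.168.49.2:8443",
        "example/spark-py:latest",
        "local:///opt/spark/app/etl_job.py",
    )
    print(" ".join(cmd))
```

Keeping the settings in one dictionary makes it easy to rerun the same job with different executor counts or memory sizes and compare the runs in the Spark UI.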

  • Airflow architecture: Scheduler, Executor, Webserver, Metadata DB
  • DAG design patterns: tasks, dependencies, XComs, Variables
  • Core Operators & Hooks (Python, Bash, Azure integrations)
  • Monitoring, logging, and alerting in Airflow
  • Lab Exercise: Build a DAG that uses XComs, Variables, core operators, and hooks.

Lab Exercise: Build a DAG to orchestrate an end-to-end Spark ETL workflow, including parameterization and basic error handling.

  • Understand how Apache Airflow runs in containerized environments
  • Explore Kubernetes-based orchestration for workflow scheduling
  • Learn basic resource configuration and scaling concepts
  • Integrate Airflow with AKS and container registries
  • Get an overview of monitoring and alerting tools for Airflow
  • Complex scheduling using cron expressions, presets, and execution semantics
  • Sensors in detail: FileSensor, SqlSensor, ExternalTaskSensor (focus on ExternalTaskSensor)
  • Introduction to Triggers and Deferrable Operators
  • Building Dynamic DAGs and leveraging task mapping (with attention to container memory usage)
  • Performance tuning with executor scaling and resource concurrency limits

Lab Exercise: In this lab, you will combine dynamic DAG generation with sensor-based control flow to orchestrate a parameterized Spark ETL job. You will use conditional branching to choose the type of ETL load, wait for dependencies using deferrable sensors, and execute the Spark job with error handling and retry logic. This lab demonstrates how to build production-grade, intelligent data pipelines in Airflow.
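The retry logic mentioned in this lab can be sketched in plain Python. This is not Airflow code (Airflow tasks expose their own retries and retry_delay settings); it only illustrates the underlying idea of retrying a failing step with exponential backoff. The function name and its parameters are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task(); on failure, wait and retry up to `retries` more times.

    The wait doubles after every failed attempt: base_delay, 2*base_delay, ...
    The last exception is re-raised once the retries are used up.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_job() -> str:
        # Hypothetical job that fails twice before succeeding.
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("transient failure")
        return "done"

    print(run_with_retries(flaky_job, base_delay=0.1))
```

Passing the sleep function as a parameter keeps the helper easy to test, because a test can record the requested delays instead of actually waiting.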

  • Database design: schemas, tables, indexing strategies
  • Functions in PostgreSQL
  • Performance tuning: vacuuming, indexing, query optimization
  • Partitions in PostgreSQL

Lab Exercise: In this lab, you will work with a large orders table to implement range and list partitioning strategies in PostgreSQL. You will then explore performance tuning techniques by analyzing query execution plans using EXPLAIN ANALYZE, running VACUUM and tuning autovacuum, and testing the effectiveness of indexes on both partitioned and non-partitioned tables.
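As an illustration of range partitioning, the sketch below generates PostgreSQL declarative-partitioning DDL for monthly partitions of an orders table. The column names (order_id, order_date, region) and the partition naming scheme are assumptions made for this example; the lab's actual schema may differ.

```python
from datetime import date
from typing import List

# Hypothetical parent table, partitioned by range on an assumed order_date column.
PARENT_DDL = (
    "CREATE TABLE orders (order_id bigint, order_date date NOT NULL, region text) "
    "PARTITION BY RANGE (order_date);"
)


def next_month(d: date) -> date:
    """Return the first day of the month following d."""
    return date(d.year + d.month // 12, d.month % 12 + 1, 1)


def monthly_partition_ddl(parent: str, start: date, months: int) -> List[str]:
    """Build one CREATE TABLE ... PARTITION OF statement per month.

    In PostgreSQL range partitions, the FROM bound is inclusive and the TO bound is exclusive.
    """
    statements = []
    lo = date(start.year, start.month, 1)
    for _ in range(months):
        hi = next_month(lo)
        statements.append(
            f"CREATE TABLE {parent}_{lo:%Y_%m} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lo.isoformat()}') TO ('{hi.isoformat()}');"
        )
        lo = hi
    return statements


if __name__ == "__main__":
    print(PARENT_DDL)
    for stmt in monthly_partition_ddl("orders", date(2024, 11, 1), 3):
        print(stmt)
```

A list-partitioned variant would use PARTITION BY LIST on the parent and FOR VALUES IN (...) on each partition instead of the FROM/TO bounds.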

Project:

In teams, design and implement a full data engineering pipeline:

  • Ingest raw data into Azure Blob Storage and MinIO
  • Process data with Spark on AKS
  • Store results in PostgreSQL
  • Orchestrate workflows with Airflow on AKS

Present solutions, discuss design decisions, and share best practices.
