Building Smart Data Pipelines & AI Workflows

Hands-on Journey Through Data Engineering, Analytics, and Machine Learning in Databricks

Duration

4 Days

Level

Intermediate Level

Design and Tailor this course

As per your team needs

Overview

This course is designed for beginner and intermediate users who want to build practical expertise in Databricks for data engineering, machine learning, and GenAI workflows. It covers the evolution of cloud-based data architectures, Delta Lake, Apache Spark, and ML fundamentals, and introduces hands-on implementation of scalable ML pipelines, MLflow deployment, and GenAI applications, all with a strong focus on security and governance.

Audience

Data Analysts & Business Analysts
Software Engineers exploring data engineering & analytics
New Data Scientists & ML Engineers
Database Administrators & IT Professionals

Prerequisites

Participants should have:

Basic understanding of Python and SQL (preferred but not mandatory)
Familiarity with cloud computing concepts (recommended)
No prior experience with Databricks is required

Curriculum

Module 1: Foundational Concepts - Cloud & Modern Data Architecture

Core Concepts: On-Prem vs Cloud, Compute vs Storage, Ephemeral vs Persistent
Cloud Service Models: IaaS, PaaS, SaaS
Overview of Major Cloud Providers (Azure, AWS, GCP)
Introduction to Modern Data Stack: Ingestion → Processing → Storage → Consumption
Key Challenges with the Hadoop Ecosystem:
- Limitations in Scalability, Elasticity, and Multi-tenancy
- Issues with Data Lakes and the Need for Lakehouses
- Shift from Data Locality to Ephemeral Clusters
- Problems with Parquet Leading to the Delta Format
Understanding Data Lakehouse Architecture
Introduction to Delta Lake and its role in the modern data landscape
Positioning Databricks in the Cloud & Data Ecosystem

Module 2: Introduction to Databricks & Cloud Computing

Overview of Databricks & its role in modern data workflows
Introduction to Apache Spark and distributed computing
Understanding Databricks Workspace, Clusters, Notebooks & Jobs
Exploring Delta Lake for reliable data management
Hands-On Exercise:
- Setting up a Databricks workspace and configuring a cluster
- Creating & running a Databricks Notebook
- Exploring Spark UI and Job Execution

Module 3: Working with Data in Databricks

Understanding Databricks File System (DBFS)
Loading data into Databricks: CSV, Parquet, JSON, Delta
Querying data using Spark SQL
Introduction to ETL pipelines using Databricks
Hands-On Exercise:
- Uploading and managing datasets in DBFS
- Running SQL queries on structured data
- Creating a basic ETL pipeline in Databricks

Module 4: Data Processing with Apache Spark

Introduction to Spark DataFrames & Transformations
Using PySpark for data manipulation
Optimizing Spark jobs with caching & partitioning
Understanding DataFrame API vs. RDDs
Hands-On Exercise:
- Performing data transformations with Spark DataFrames
- Writing optimized PySpark queries
- Applying caching & partitioning strategies for performance

Module 5: Data Engineering with Delta Lake

Introduction to Delta Lake and ACID Transactions
Schema evolution and time travel in Delta Lake
Managing large-scale data pipelines with Databricks Workflows
Hands-On Exercise:
- Creating and managing Delta Tables
- Implementing time travel and schema evolution
- Building a simple data pipeline using Delta Lake

Module 6: Building Conventional Machine Learning Models in Databricks

Introduction to Machine Learning in Databricks
Overview of ML Libraries: Spark MLlib vs scikit-learn
ML Workflow in Databricks: Data Preparation → Modeling → Evaluation

Supervised Learning:
- Linear & Logistic Regression
- Decision Trees
Unsupervised Learning:
- Clustering with K-Means
- Dimensionality Reduction Basics (PCA)

Feature Engineering & Pipeline APIs in Spark ML
Model Evaluation Metrics: RMSE, Accuracy, Silhouette Score

Module 7: Deploying & Scaling Databricks Workflows

Introduction to MLflow: Tracking, Registry & Deployment
Logging and Registering ML Models in Databricks
Automating ML Pipelines with Databricks Workflows
Best Practices for Scaling ML in Spark (clusters, caching, batch vs real-time)
Hands-On Exercise:
- Tracking experiments and registering a regression or clustering model with MLflow
- Automating training and inference using a Databricks Workflow
- Optimizing Spark and Delta Lake performance

Module 8: Introduction to Generative AI (GenAI) in Databricks

Introduction to Generative AI (GenAI) and Large Language Models (LLMs)
Core Concepts: LLMs, Embeddings, Tokenization, Prompting
Working with Vector Databases and Embedding Stores
Leveraging Pre-trained Foundation Models in Databricks (OpenAI, Hugging Face, MosaicML)
Use Cases of GenAI in Data & Analytics (Synthetic Data Generation, Smart Querying, AI-powered Search)
Security, Governance & Compliance Considerations in GenAI Applications

Hands-On Exercise:
- Exploring pre-trained GenAI models in Databricks
- Generating synthetic data using LLMs
- Enhancing BI dashboards with AI-driven insights

Module 9: Security, Compliance & Governance in Databricks

Role-based access control (RBAC) & Unity Catalog for governance
Compliance considerations for data engineering & AI workflows
Implementing secure data access policies in Databricks
Hands-On Exercise:
- Setting up RBAC and Unity Catalog in Databricks
- Applying compliance policies for GenAI models
- Implementing secure workflows for enterprise data management
