Building Smart Data Pipelines & AI Workflows
Hands-on Journey Through Data Engineering, Analytics, and Machine Learning in Databricks
Duration
4 Days
Level
Intermediate Level
Design and Tailor this course
As per your team needs
Overview
This course is designed for beginners and intermediate users who want to build practical Databricks skills for data engineering, machine learning, and GenAI workflows. It covers the evolution of cloud-based data architectures, Delta Lake, Apache Spark, and ML fundamentals, then moves on to hands-on implementation of scalable ML pipelines, model tracking and deployment with MLflow, and GenAI applications, with a strong focus on security and governance throughout.
Audience
- Data Analysts & Business Analysts
- Software Engineers exploring data engineering & analytics
- New Data Scientists & ML Engineers
- Database Administrators & IT Professionals
Prerequisites
Participants should ideally have:
- Basic understanding of Python and SQL (preferred but not mandatory)
- Familiarity with cloud computing concepts (recommended)
No prior experience with Databricks is required.
Curriculum
- Core Concepts: On-Prem vs Cloud, Compute vs Storage, Ephemeral vs Persistent
- Cloud Service Models: IaaS, PaaS, SaaS
- Overview of Major Cloud Providers (Azure, AWS, GCP)
- Introduction to Modern Data Stack: Ingestion → Processing → Storage → Consumption
- Key Challenges with the Hadoop Ecosystem:
  - Limitations in Scalability, Elasticity, and Multi-tenancy
  - Issues with Data Lakes and the Need for Lakehouses
  - Shift from Data Locality to Ephemeral Clusters
  - Problems with Parquet Leading to the Delta Format
- Understanding Data Lakehouse Architecture
- Introduction to Delta Lake and its role in the modern data landscape
- Positioning Databricks in the Cloud & Data Ecosystem
- Overview of Databricks & its role in modern data workflows
- Introduction to Apache Spark and distributed computing
- Understanding Databricks Workspace, Clusters, Notebooks & Jobs
- Exploring Delta Lake for reliable data management
- Hands-On Exercise:
  - Setting up a Databricks workspace and configuring a cluster
  - Creating & running a Databricks Notebook
  - Exploring Spark UI and Job Execution
- Understanding Databricks File System (DBFS)
- Loading data into Databricks: CSV, Parquet, JSON, Delta
- Querying data using Spark SQL
- Introduction to ETL pipelines using Databricks
- Hands-On Exercise:
  - Uploading and managing datasets in DBFS
  - Running SQL queries on structured data
  - Creating a basic ETL pipeline in Databricks
- Introduction to Spark DataFrames & Transformations
- Using PySpark for data manipulation
- Optimizing Spark jobs with caching & partitioning
- Understanding DataFrame API vs. RDDs
- Hands-On Exercise:
  - Performing data transformations with Spark DataFrames
  - Writing optimized PySpark queries
  - Applying caching & partitioning strategies for performance
- Introduction to Delta Lake and ACID Transactions
- Schema evolution and time travel in Delta Lake
- Managing large-scale data pipelines with Databricks Workflows
- Hands-On Exercise:
  - Creating and managing Delta Tables
  - Implementing time travel and schema evolution
  - Building a simple data pipeline using Delta Lake
- Introduction to Machine Learning in Databricks
- Overview of ML Libraries: Spark MLlib vs scikit-learn
- ML Workflow in Databricks: Data Preparation → Modeling → Evaluation
- Supervised Learning:
  - Linear & Logistic Regression
  - Decision Trees
- Unsupervised Learning:
  - Clustering with K-Means
  - Dimensionality Reduction Basics (PCA)
- Feature Engineering & Pipeline APIs in Spark ML
- Model Evaluation Metrics: RMSE, Accuracy, Silhouette Score
- Introduction to MLflow: Tracking, Registry & Deployment
- Logging and Registering ML Models in Databricks
- Automating ML Pipelines with Databricks Workflows
- Best Practices for Scaling ML in Spark (clusters, caching, batch vs real-time)
- Hands-On Exercise:
  - Tracking experiments and registering a regression or clustering model with MLflow
  - Automating training and inference using a Databricks Workflow
- Optimizing Spark and Delta Lake performance
- Introduction to Generative AI (GenAI) and Large Language Models (LLMs)
- Core Concepts: LLMs, Embeddings, Tokenization, Prompting
- Working with Vector Databases and Embedding Stores
- Leveraging Pre-trained Foundation Models in Databricks (OpenAI, Hugging Face, MosaicML)
- Use Cases of GenAI in Data & Analytics (Synthetic Data Generation, Smart Querying, AI-powered Search)
- Security, Governance & Compliance Considerations in GenAI Applications
- Hands-On Exercise:
  - Exploring pre-trained GenAI models in Databricks
  - Generating synthetic data using LLMs
  - Enhancing BI dashboards with AI-driven insights
- Role-based access control (RBAC) & Unity Catalog for governance
- Compliance considerations for data engineering & AI workflows
- Implementing secure data access policies in Databricks
- Hands-On Exercise:
  - Setting up RBAC and Unity Catalog in Databricks
  - Applying compliance policies for GenAI models
  - Implementing secure workflows for enterprise data management