Data Engineering Bootcamp – End-to-End Modern Data Platform
Duration
15 Days
Level
Advanced Level
Design and Tailor this course
As per your team needs
Overview
This 15-day Data Engineering Bootcamp covers the full Data Engineering track, from core data fundamentals and architecture principles to distributed systems, streaming, lakehouse implementation, and business intelligence.
Participants will master relational and NoSQL databases, graph databases (Neo4j), event streaming with Kafka, distributed processing with Spark and Databricks, table formats with Iceberg, medallion/lakehouse architecture, governance, orchestration, and Power BI integration.
The program follows a progressive structure (Basic → Intermediate → Advanced) and ends with a capstone project in which participants build a production-grade modern data platform.
Audience
- Aspiring Data Engineers
- Software Engineers transitioning to Data Engineering
- Data Analysts moving to Data Engineering
- Platform & Cloud Engineers
- BI Developers
- Analytics Engineers
Prerequisites
- Basic programming knowledge (Python preferred)
- Basic SQL understanding
- Familiarity with data concepts (tables, files, schemas)
- No formal prerequisites beyond the basics listed above
Curriculum
- What is Data Engineering?
- Role of Data Engineer
- DE vs DA vs DS
- Modern data ecosystem overview
- Batch vs streaming systems
- Types of Data
- Structured vs semi-structured vs unstructured
- OLTP vs OLAP workloads
- Transactional vs analytical systems
- Data Modeling Fundamentals
- ER modeling
- Normalization & denormalization
- Star & snowflake schemas
- Hands-on
- Design ER diagram
- Create normalized schema
- RDBMS Architecture
- ACID properties
- Indexing strategies
- Query execution plans
- Advanced SQL
- Joins & window functions
- CTEs & subqueries
- Performance tuning
- Hands-on
- Build optimized schema
- Analyze execution plans
- Tune slow queries
- NoSQL Landscape
- Key-value
- Document databases
- Column-family stores
- Graph Databases – Neo4j
- Graph modeling principles
- Nodes & relationships
- Cypher query language
- Use cases (fraud detection, recommendation)
- Hands-on
- Design graph model
- Write Cypher queries
- Build recommendation example
- Traditional vs Modern Architectures
- Monolithic vs distributed systems
- Lambda & Kappa architectures
- Event-driven architecture
- Lakehouse Architecture
- Data lake vs warehouse
- Medallion architecture
- Separation of compute & storage
- Hands-on
- Design enterprise data architecture
- Identify bottlenecks & trade-offs
- Kafka Architecture
- Brokers, partitions, replication
- Producers & consumers
- Offsets & consumer groups
- Event Streaming Concepts
- Exactly-once semantics
- Log compaction
- Event-driven systems
- Hands-on
- Set up Kafka
- Create topics
- Produce & consume events
- Schema Registry & Serialization
- Avro / JSON / Protobuf
- Schema evolution
- Kafka Connect
- Source connectors
- Sink connectors
- CDC pipelines
- Hands-on
- Build CDC ingestion pipeline
- Monitor consumer lag
- Spark Architecture
- Driver & executors
- DAG execution
- Lazy evaluation
- RDDs & DataFrames
- Transformations & actions
- Partitioning strategies
- Hands-on
- Build distributed processing job
- Analyze Spark UI
- Catalyst Optimizer
- Logical vs physical plans
- Predicate pushdown
- Performance Optimization
- Shuffle tuning
- Broadcast joins
- Caching strategies
- Hands-on
- Optimize large joins
- Benchmark performance
- Streaming Concepts
- Event time vs processing time
- Watermarking
- Stateful aggregations
- Spark + Kafka Integration
- Stream ingestion
- Exactly-once guarantees
- Hands-on
- Build real-time streaming pipeline
- Handle late-arriving data
- Databricks Architecture
- Workspaces & clusters
- Jobs vs interactive clusters
- Autoscaling
- Medallion Architecture
- Bronze layer
- Silver layer
- Gold layer
- Hands-on
- Implement layered lakehouse
- Build incremental transformations
- Iceberg Internals
- Table metadata
- Snapshots & time travel
- Schema evolution
- Partition evolution
- Iceberg + Spark Integration
- Incremental reads
- Streaming writes
- Compaction
- Hands-on
- Create Iceberg tables
- Perform time travel queries
- Implement CDC ingestion
- ETL vs ELT
- Pipeline design principles
- Idempotency
- Error handling
- Workflow Orchestration
- DAG scheduling
- Dependency management
- Retry strategies
- Hands-on
- Build orchestrated DE pipeline
- Add monitoring & logging
- Data Security
- Encryption at rest & in transit
- Role-based access control
- Governance & Metadata
- Data lineage
- Catalog management
- Audit trails
- Performance & Cost Optimization
- File sizing strategy
- Cluster tuning
- Storage optimization
- Hands-on
- Implement access policies
- Tune pipeline performance
- Data Modeling for BI
- Star schema
- Aggregations
- Connecting Lakehouse to Power BI
- DirectQuery vs Import mode
- Incremental refresh
- Dashboard Design Best Practices
- KPI frameworks
- Visualization optimization
- Hands-on
- Connect Iceberg/Spark data to Power BI
- Build executive dashboard
- Architecture Design
- Event ingestion via Kafka
- Processing via Spark
- Storage via Iceberg
- Lakehouse layering
- BI visualization
- Production Considerations
- Scalability
- Fault tolerance
- Cost optimization
- Governance
- Final Presentation
- Architecture walkthrough
- Trade-off discussions
- Production readiness checklist