Design Systems for Scale and Resilience

Advanced Distributed Systems Design for Modern Cloud Infrastructure

Duration

2 Days

Level

Design and tailor this course to your team's needs

Overview

This 2-day course focuses on practical system design and distributed systems concepts for engineers building backend or platform systems. The training emphasizes designing scalable, resilient, and maintainable systems using modern architectural patterns.

Learning Objectives

  • Architect for High Availability: Master blast-radius isolation and self-healing logic.
  • Scale Data Systems: Implement advanced sharding, tiering, and consistency models.
  • Infrastructure Observability: Debug “Grey Failures” and P99 latency spikes.
  • Hybrid-Cloud Resilience: Design disaster recovery (DR) orchestrators for multi-region environments.

Audience

  • Backend and Platform Engineers (2–5 years of experience)
  • Software developers involved in infrastructure or system design

Prerequisites

Curriculum

  • Verification of the pre-installed distributed environment.
  • Walkthrough of the provided codebase and architectural components.
  • Control plane vs. data plane separation
  • Architectural patterns: Controller nodes and Data Locality
  • RPC overhead and network serialization costs
  • Performance trade-offs of communication protocols at scale
  • Lab: Communication Performance Analysis. Use a provided multi-node environment to measure the latency impact of different serialization formats and identify throughput bottlenecks under heavy concurrent load.
  • Defining and isolating Blast Radius (Node, Rack, Zone)
  • Rack-aware and Zone-aware data placement strategies
  • Consensus fundamentals: Quorum and leader election algorithms
  • Handling “Split-Brain” scenarios and network partitions
  • Lab: Resilience & Election Observation. Observe a live distributed cluster during a simulated network partition. Identify how the system detects failures and triggers automated recovery to maintain data integrity.
  • Consistent Hashing and Virtual Nodes (vnodes)
  • Distributed Hash Tables (DHT) in practice
  • Strategies for re-sharding without system downtime
  • Metadata caching and invalidation patterns
  • Lab: Data Distribution & Sharding Investigation. Analyze how data is distributed across a sharded metadata store. Simulate adding/removing capacity and observe the resulting re-balancing behavior and lookup efficiency.
  • High-cardinality metrics and storage costs
  • Distributed tracing in the storage path
  • Detecting “Grey Failures” (partial failures that bypass health checks)
  • SLO/SLA tracking for infrastructure components
  • Lab: P99 Latency Bottleneck Identification. Use observability dashboards to investigate a “slow” request path in a distributed stack. Pinpoint the specific component causing tail-latency spikes.
  • Write-Ahead Logging (WAL) and Log-Structured Merge (LSM) trees
  • Data path optimization: Understanding I/O bottlenecks
  • Intelligent Tiering: Performance vs. Capacity storage layers
  • Policy-driven data movement strategies
  • Lab: Storage Tiering Policy Design. Configure and validate data migration policies between different storage performance tiers. Observe how the system handles data placement based on access frequency.
  • Synchronous vs. Asynchronous replication trade-offs
  • RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
  • Disaster Recovery (DR) state-machine design
  • Cross-region data synchronization challenges
  • Lab: Disaster Recovery Failover Scenarios. Execute a cross-region failover sequence. Inject simulated WAN conditions to validate how the orchestrator handles increased latency and maintains service availability.
  • Real-world system design walkthrough: Deconstructing global platforms
  • Case Study: Global traffic routing and stateless microservices at scale
  • Chaos Engineering in practice: Real-world failure injection lessons
  • Graceful degradation and resilient fallback patterns
  • Lab: Architectural Masterclass (Capstone). Group-based design session to architect a highly resilient management plane for a large-scale infrastructure scenario.
  • Zero-downtime rolling upgrades in stateful clusters
  • Version compatibility and schema evolution during upgrades
  • Incident Response: Automated vs. Manual intervention
  • Continuous improvement: Closing the feedback loop with post-mortems
