Optimizing Data Lakehouses with Starburst

Leverage your knowledge of the Starburst query engine, with a focus on best practices for optimizing data access

Duration

3 days (8 hours each day)

Level

Intermediate Level

This course comprises instructor-led discussions, demonstrations, and hands-on exercises designed to build a working knowledge of the Starburst query engine. Participants will gain a more thorough awareness of Starburst architecture, focusing on best practices for data-lake-based schemas, including table formats and partitioning, file formats and sizes, and other optimization techniques.

Upon completion of this course, you will be able to:
● Use Starburst as a single point of access for multiple data sources and federate queries across them
● Evaluate and describe how queries are executed within a Starburst cluster
● Use Hive and Iceberg table formats; construct, populate, query, and modify partitioned tables
● Employ file size/format/hierarchy strategies to improve query performance
● Understand the role of the cost-based optimizer and read query plans to confirm that optimizations occur as expected and to identify possible issues
● Create role-based access control policies for table operations
● Build a data engineering pipeline with Starburst Galaxy

This course is designed for data engineers, data architects, and experienced data analysts and data scientists.

  • Overview
  • Architecture
  • Web UI
  • Connectors & catalogs
  • Client tools integrations
  • Foundations and use case
  • Limit Data Exchanges
  • File format options
  • Small files problem
  • Partitioning & bucketing
  • Moving beyond Hive
  • Compare/contrast alternatives
  • Explore Delta Lake
  • Creating tables
  • Insert, update & delete
  • CDC with merge
  • Schema & partition evolution
  • Snapshots & compaction
  • Divide & conquer
  • Beyond single-stage queries
  • Benefits of statistics
  • Query plan analysis
  • Configuration options
  • Role-based access control
  • Definition & differentiation
  • Reference architecture
  • Set up your student account in Starburst Galaxy
  • Execute queries in Starburst Galaxy
  • Exploring federated queries
  • Create schema & tables
  • Investigate tables using Hive’s special columns
  • Baseline CSV table performance
  • Create tables with multiple file formats
  • Improve performance: A) by leveraging ORC file format, B) by consolidating small files, C) with partitioning
  • Exploring the Data Lake Table Format
  • Create and populate Iceberg tables
  • Explore partitions with Iceberg
  • Explore snapshots with Iceberg
  • Utilizing Iceberg’s MERGE statement
  • Exercise advanced features of Iceberg
  • The EXPLAIN command
  • The EXPLAIN ANALYZE command
  • Explore the impact of statistics on query plans
  • Creating and validating RBAC policies with Starburst Galaxy
  • Construct a pipeline with insert only transactions
  • Construct a pipeline with the MERGE statement

Intermediate experience with SQL is assumed.
