Optimizing Data Lakehouses with Starburst
Duration
3 Days (8 hrs each day)
Level
Intermediate
This course comprises instructor-led discussions, demonstrations, and hands-on exercises designed to build a working knowledge of the Starburst query engine. Participants gain a thorough understanding of Starburst architecture, with a focus on best practices for data lake schemas, including table formats and partitioning, file formats and sizes, and other optimization techniques.
Upon completion of this course, you will be able to:
● Use Starburst as a single point of access for multiple data sources and federate queries across them
● Evaluate and describe how queries are executed within a Starburst cluster
● Use Hive and Iceberg table formats; construct, populate, query, and modify partitioned tables
● Employ file size/format/hierarchy strategies to improve query performance
● Understand the role of the cost-based optimizer and read query plans to confirm that expected optimizations occur and to identify potential issues
● Create role-based access control policies for table operations
● Build a data engineering pipeline with Starburst Galaxy
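As a sketch of the federation objective above, a single Starburst query can join data living in different systems. The catalog, schema, and table names here are purely illustrative:

```sql
-- Hypothetical catalogs: "mysqldb" (a MySQL source) and "lake" (object storage).
-- Starburst federates the join across both sources in one SQL statement.
SELECT c.customer_name, SUM(o.total_amount) AS lifetime_spend
FROM mysqldb.crm.customers AS c
JOIN lake.sales.orders AS o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_spend DESC
LIMIT 10;
```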
This course is designed for data engineers, data architects, and experienced data analysts and data scientists.
- Overview
- Architecture
- Web UI
- Connectors & catalogs
- Client tools integrations
- Foundations and use case
- Limit data exchanges
- File format options
- Small files problem
- Partitioning & bucketing
- Moving beyond Hive
- Compare/contrast alternatives
- Explore Delta Lake
- Creating tables
- Insert, update & delete
- CDC with merge
- Schema & partition evolution
- Snapshots & compaction
- Divide & conquer
- Beyond single-stage queries
- Benefits of statistics
- Query plan analysis
- Configuration options
- Role-based access control
- Definition & differentiation
- Reference architecture
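Several of the Iceberg topics above (creating partitioned tables, CDC with merge) follow a general shape like the following in Trino/Starburst SQL. The catalog, schema, and column names are illustrative, not part of the course materials:

```sql
-- Create an Iceberg table with hidden partitioning on the day of order_ts.
CREATE TABLE lake.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    status      VARCHAR,
    order_ts    TIMESTAMP(6)
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(order_ts)']
);

-- CDC-style upsert from a staging table of changes using MERGE.
MERGE INTO lake.sales.orders AS t
USING lake.staging.order_changes AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN
  UPDATE SET status = s.status
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, status, order_ts)
  VALUES (s.order_id, s.customer_id, s.status, s.order_ts);
```

Because Iceberg uses hidden partitioning, queries filter on `order_ts` directly; there is no separate partition column to manage.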
- Set up your student account in Starburst Galaxy
- Execute queries in Starburst Galaxy
- Exploring federated queries
- Create schema & tables
- Investigate tables using Hive’s special columns
- Baseline CSV table performance
- Create tables with multiple file formats
- Improve performance by (a) leveraging the ORC file format, (b) consolidating small files, and (c) partitioning
- Exploring the Data Lake Table Format
- Create and populate Iceberg tables
- Explore partitions with Iceberg
- Explore snapshots with Iceberg
- Utilizing Iceberg’s MERGE statement
- Exercise advanced features of Iceberg
- The EXPLAIN command
- The EXPLAIN ANALYZE command
- Explore the impact of statistics on query plans
- Creating and validating RBAC policies with Starburst Galaxy
- Construct a pipeline with insert only transactions
- Construct a pipeline with the MERGE statement
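The query-plan labs revolve around the EXPLAIN family of commands. Verifying that an optimization such as partition pruning actually occurs follows a pattern like this (the table name is illustrative):

```sql
-- Logical plan only: check that the partition predicate is pushed down
-- into the table scan rather than applied as a post-scan filter.
EXPLAIN
SELECT count(*)
FROM lake.sales.orders
WHERE order_ts >= TIMESTAMP '2024-01-01 00:00:00';

-- Execute the query and report per-stage runtime statistics
-- (rows read, CPU time, splits), confirming that only the
-- matching partitions were scanned.
EXPLAIN ANALYZE
SELECT count(*)
FROM lake.sales.orders
WHERE order_ts >= TIMESTAMP '2024-01-01 00:00:00';
```

EXPLAIN is safe to run on expensive queries since it only plans them; EXPLAIN ANALYZE actually executes the query to gather its statistics.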
Intermediate experience with SQL is assumed.