Optimizing Data Lakehouses with Starburst
Duration
3 Days (8 hrs each day)
Level
Intermediate
This course comprises instructor-led discussions, demonstrations, and hands-on exercises designed to build a working knowledge of the Starburst query engine. Participants gain a thorough understanding of Starburst architecture, with a focus on best practices for data lake schemas, including table formats and partitioning, file formats and sizes, and other optimization techniques.
Upon completion of this course, you will be able to:
● Use Starburst as a single point of access for multiple data sources and federate queries across them
● Evaluate and describe how queries are executed within a Starburst cluster
● Use Hive and Iceberg table formats; construct, populate, query, and modify partitioned tables
● Employ file size/format/hierarchy strategies to improve query performance
● Understand the role of the cost-based optimizer and read query plans to confirm that expected optimizations occur and to identify potential issues
● Create role-based access control policies for table operations
● Build a data engineering pipeline with Starburst Galaxy
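As a sketch of the federation objective above, a single Starburst query can join data living in different systems. The catalog, schema, and table names here are purely illustrative:

```sql
-- Hypothetical catalogs: "mysqldb" (a MySQL source) and "lake" (object storage).
-- Starburst federates the join across both sources in one SQL statement.
SELECT c.customer_name, SUM(o.total_amount) AS lifetime_spend
FROM mysqldb.crm.customers AS c
JOIN lake.sales.orders AS o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_spend DESC
LIMIT 10;
```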
This course is designed for data engineers, data architects, and experienced data analysts and data scientists.
- Overview
- Architecture
- Web UI
- Connectors & catalogs
- Client tools integrations
- Foundations and use case
- Limit data exchanges
- File format options
- Small files problem
- Partitioning & bucketing
- Moving beyond Hive
- Compare/contrast alternatives
- Explore Delta Lake
- Creating tables
- Insert, update & delete
- CDC with merge
- Schema & partition evolution
- Snapshots & compaction
- Divide & conquer
- Beyond single-stage queries
- Benefits of statistics
- Query plan analysis
- Configuration options
- Role-based access control
- Definition & differentiation
- Reference architecture
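Several of the Iceberg topics above (creating partitioned tables, CDC with merge) follow a general shape like the following in Trino/Starburst SQL. The catalog, schema, and column names are illustrative, not part of the course materials:

```sql
-- Create an Iceberg table with hidden partitioning on the day of order_ts.
CREATE TABLE lake.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    status      VARCHAR,
    order_ts    TIMESTAMP(6)
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(order_ts)']
);

-- CDC-style upsert from a staging table of changes using MERGE.
MERGE INTO lake.sales.orders AS t
USING lake.staging.order_changes AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN
  UPDATE SET status = s.status
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, status, order_ts)
  VALUES (s.order_id, s.customer_id, s.status, s.order_ts);
```

Because Iceberg uses hidden partitioning, queries filter on `order_ts` directly; there is no separate partition column to manage.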
- Set up your student account in Starburst Galaxy
- Execute queries in Starburst Galaxy
- Exploring federated queries
- Create schema & tables
- Investigate tables using Hive’s special columns
- Baseline CSV table performance
- Create tables with multiple file formats
- Improve performance by (a) leveraging the ORC file format, (b) consolidating small files, and (c) partitioning
- Exploring the Data Lake Table Format
- Create and populate Iceberg tables
- Explore partitions with Iceberg
- Explore snapshots with Iceberg
- Utilizing Iceberg’s MERGE statement
- Exercise advanced features of Iceberg
- The EXPLAIN command
- The EXPLAIN ANALYZE command
- Explore the impact of statistics on query plans
- Creating and validating RBAC policies with Starburst Galaxy
- Construct a pipeline with insert only transactions
- Construct a pipeline with the MERGE statement
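The query-plan labs revolve around the EXPLAIN family of commands. Verifying that an optimization such as partition pruning actually occurs follows a pattern like this (the table name is illustrative):

```sql
-- Logical plan only: check that the partition predicate is pushed down
-- into the table scan rather than applied as a post-scan filter.
EXPLAIN
SELECT count(*)
FROM lake.sales.orders
WHERE order_ts >= TIMESTAMP '2024-01-01 00:00:00';

-- Execute the query and report per-stage runtime statistics
-- (rows read, CPU time, splits), confirming that only the
-- matching partitions were scanned.
EXPLAIN ANALYZE
SELECT count(*)
FROM lake.sales.orders
WHERE order_ts >= TIMESTAMP '2024-01-01 00:00:00';
```

EXPLAIN is safe to run on expensive queries since it only plans them; EXPLAIN ANALYZE actually executes the query to gather its statistics.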
Intermediate experience with SQL is assumed.