Pragmatic Apache Spark
How to use Apache Spark to transform and analyze your big data at lightning-fast speeds
Duration
2 Days
Level
Intermediate
Overview
This course will teach you how to use Apache Spark to transform and analyze your big data at lightning-fast speeds.
Data engineering teams are eager to learn Apache Spark and its best practices. Because your staff already has some Python experience, this course uses Python for all Spark programming.
During this instructor-led, hands-on course, participants will:
- Gain a holistic overview of the Hadoop and Spark ecosystem
- Learn which technology or tool to choose, and when
- Explore the architecture and internals of key projects
- Perform data processing and ingestion using Spark
Audience
This course is designed for application developers, DevOps engineers, architects, QA engineers, and technical managers.
Course Outline
- Big Data Overview
  - Key roles in a big data project
  - Key business use cases
  - Hadoop and Spark logical architecture
  - A typical big data project pipeline
- HDFS Overview
  - Physical architecture of HDFS
  - Hands-on with the Hadoop Distributed File System (see the sketch below)
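As a taste of how HDFS shows up in the Spark exercises, here is a minimal PySpark sketch of reading a file from HDFS. The namenode host, port, and file path are illustrative placeholders, not values from the course materials.

```python
# A minimal sketch of reading a text file from HDFS with PySpark.
# The namenode host/port and the path are hypothetical placeholders.
from pyspark import SparkContext

sc = SparkContext("local[*]", "hdfs-read")

# HDFS paths use the hdfs:// scheme; bare paths resolve against fs.defaultFS.
lines = sc.textFile("hdfs://namenode:8020/data/sample.txt")
print(lines.take(5))  # action: fetch the first five lines

sc.stop()
```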
- Spark Core
  - High-level Spark architecture
  - Roles of the driver, executors, SparkContext, and related components
  - Resilient Distributed Datasets (RDDs)
  - Basic Spark Core API operations: transformations and actions
  - Using the Spark REPL for interactive data analysis
  - Hands-on exercises (see the sketch below)
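To preview the hands-on material, here is a minimal sketch of the distinction between lazy transformations and eager actions; the sample data is made up.

```python
# A minimal sketch of RDD transformations vs. actions in Spark Core.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-basics")

# Transformations are lazy: nothing executes until an action is called.
lines = sc.parallelize(["spark is fast", "spark is general", "hadoop is batch"])
words = lines.flatMap(lambda line: line.split())    # transformation
spark_words = words.filter(lambda w: w == "spark")  # transformation

# Actions trigger the actual computation.
print(words.count())          # action: 9
print(spark_words.collect())  # action: ['spark', 'spark']

sc.stop()
```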
- Pair RDDs
  - Implementing MapReduce algorithms using Spark
  - Ways to create pair RDDs
  - Hands-on exercises (see the sketch below)
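The classic MapReduce example, word count, maps directly onto pair RDDs. A minimal sketch with made-up input:

```python
# A minimal word-count sketch: the classic MapReduce algorithm on pair RDDs.
from pyspark import SparkContext

sc = SparkContext("local[*]", "pair-rdds")

lines = sc.parallelize(["to be or not to be"])

# Map phase: an RDD of (key, value) tuples is a pair RDD.
pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

# Reduce phase: reduceByKey merges values per key, with map-side combining.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())  # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]

sc.stop()
```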
- Writing Standalone Spark Applications
  - Commands to execute and configure Spark applications in various modes
  - Applications, jobs, stages, executors, and tasks
  - Physical execution of a Spark application
  - Discussion: how Spark improves on MapReduce
  - Hands-on exercises (see the sketch below)
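Below is a minimal sketch of a standalone PySpark application; the file name and submit command line are illustrative. Each action becomes a job, shuffles split jobs into stages, and stages run as parallel tasks on the executors.

```python
# wordcount.py -- a minimal standalone application sketch (name is illustrative).
# Submit with, e.g.:  spark-submit --master local[*] wordcount.py <input-path>
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("StandaloneWordCount").getOrCreate()
    sc = spark.sparkContext

    # reduceByKey forces a shuffle, so this job runs as two stages.
    counts = (sc.textFile(sys.argv[1])
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    for word, n in counts.take(10):  # the action that triggers the job
        print(word, n)

    spark.stop()
```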
- Spark SQL
  - Spark SQL overview and a tour of the documentation
  - Loading and processing data with DataFrames
  - The DataFrame API and common DataFrame operations
  - Saving DataFrames to file systems
  - The DataFrame internals that make it fast: the Catalyst optimizer and Tungsten
  - Hands-on exercises
  - Processing JSON data (see the sketch below)
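A minimal DataFrame sketch tying these topics together: load JSON, transform, and save. The input file and its schema (age and city fields) are assumptions for illustration.

```python
# A minimal DataFrame sketch: load JSON, transform, and save.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# Spark infers the schema from the JSON records.
df = spark.read.json("people.json")  # hypothetical input file

# DataFrame operations are compiled through the Catalyst optimizer
# before Tungsten executes them.
adults = (df.filter(F.col("age") >= 18)
            .groupBy("city")
            .agg(F.count("*").alias("n"), F.avg("age").alias("avg_age")))

adults.show()

# Save back out; Parquet is a common columnar default.
adults.write.mode("overwrite").parquet("adults_by_city.parquet")

spark.stop()
```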
- Binary Data Processing
  - Why binary data formats?
  - Comparison of common data formats
  - Working with Parquet
  - Working with Avro
  - Hands-on exercises (see the sketch below)
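A minimal sketch of writing and reading the two formats, with made-up data. Note that Avro support ships as the external spark-avro package rather than in Spark core.

```python
# A minimal sketch of working with Parquet and Avro.
# Avro requires the external spark-avro package, e.g.
#   --packages org.apache.spark:spark-avro_2.12:<spark-version>
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("binary-formats").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Parquet: columnar, splittable, compressed -- suited to analytics.
df.write.mode("overwrite").parquet("users.parquet")
spark.read.parquet("users.parquet").show()

# Avro: row-oriented and schema-evolution friendly -- suited to ingestion.
df.write.mode("overwrite").format("avro").save("users.avro")
spark.read.format("avro").load("users.avro").show()

spark.stop()
```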
- Distributed Persistence
  - Caching and storage levels (see the sketch below)
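A minimal caching sketch: persist a dataset that is reused across actions so it is computed only once. The dataset here is synthetic.

```python
# A minimal sketch of caching and persistence in Spark.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching").getOrCreate()

df = spark.range(1_000_000)  # synthetic data for illustration

# cache() keeps partitions in memory once the first action computes them.
df.cache()
print(df.count())  # first action: computes and materializes the cache
print(df.count())  # second action: served from the cached partitions

# persist() exposes finer-grained storage levels, e.g. spilling to disk.
rdd = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())

spark.stop()
```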
Prerequisites
Participants should preferably have prior software development experience, along with basic knowledge of SQL and Unix commands. Knowledge of Python or Scala is a plus.