Pragmatic Apache Spark

How to use Apache Spark to transform and analyze your big data at lightning-fast speeds

Duration

2 Days

Level

Intermediate

Design and Tailor This Course

As per your team's needs

Overview

This course will teach you how to use Apache Spark to transform and analyze your big data at lightning-fast speeds.

It is aimed at data engineering teams that are eager to learn Apache Spark and its best practices. Since team members typically have some prior experience with Python, the course uses Python (PySpark) for all of its Spark programming.

During this instructor-led, hands-on course, participants will:

  • Gain a holistic overview of the Hadoop and Spark ecosystem
  • Learn which technology or tool to choose for a given problem
  • Understand the architecture and internals of the key projects
  • Perform data processing and ingestion using Spark
Audience

This course is designed for application developers, DevOps engineers, architects, QA engineers, and technical managers.

Course Outline
  • Big Data Overview
  • Key Roles in Big Data Project
  • Key Business Use Cases
  • Hadoop and Spark Logical Architecture
  • Typical Big Data Project Pipeline
  • HDFS Overview
  • Physical Architectures of HDFS 
  • The Hadoop Distributed File System Hands-on
  • High-level Spark Architecture
  • Roles of the Driver, Executors, SparkContext, etc.
  • Resilient Distributed Datasets
  • Basic operations in the Spark Core API: actions and transformations
  • Using the Spark REPL for interactive data analysis
  • Hands-on Exercises (see the illustrative sketch below)
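
As a flavour of the hands-on exercises in this part of the course, here is a minimal PySpark sketch of RDD transformations and actions. The local master setting and the toy data are illustrative assumptions, not course material.

    # Minimal RDD example; in the PySpark REPL the SparkContext `sc` already
    # exists, so the explicit creation below is only needed in a script.
    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="rdd-basics")

    numbers = sc.parallelize(range(1, 11))           # RDD from a Python range

    # Transformations are lazy: nothing runs until an action is called.
    evens   = numbers.filter(lambda n: n % 2 == 0)   # keep even numbers
    squares = evens.map(lambda n: n * n)             # square each element

    # Actions trigger the actual computation.
    print(squares.collect())                   # [4, 16, 36, 64, 100]
    print(squares.reduce(lambda a, b: a + b))  # 220

    sc.stop()
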
  • Pair RDDs
  • Implementing MapReduce Algorithms using Spark
  • Ways to create Pair RDDs
  • Hands-on Exercises (see the illustrative sketch below)
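
To illustrate the Pair RDD material, a minimal word count in the classic MapReduce style; the input path is a placeholder assumption.

    # Word count with a Pair RDD -- the canonical MapReduce-style example.
    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="pair-rdd-wordcount")

    lines = sc.textFile("input.txt")                  # placeholder input path

    counts = (lines
              .flatMap(lambda line: line.split())     # one element per word
              .map(lambda word: (word, 1))            # build (key, value) pairs
              .reduceByKey(lambda a, b: a + b))       # sum the counts per key

    for word, count in counts.take(10):               # action: fetch a sample
        print(word, count)

    sc.stop()
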
  • Writing a Standalone Spark Application
  • Commands to execute and configure Spark applications in different modes
  • Discussion of Applications, Jobs, Stages, Executors, and Tasks
  • Physical execution of a Spark application
  • Discussion: why Spark outperforms MapReduce
  • Hands-on Exercises (see the illustrative sketch below)
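
As a sketch of the standalone-application module, a minimal self-contained PySpark script together with a spark-submit invocation; the file names, paths, and master URL are illustrative assumptions.

    # wordcount_app.py -- a minimal standalone Spark application (illustrative).
    # Submit it with, for example:
    #   spark-submit --master "local[*]" wordcount_app.py
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("wordcount-app").getOrCreate()
        sc = spark.sparkContext

        words = sc.textFile("input.txt").flatMap(lambda line: line.split())
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

        # Saving the result triggers a job, which Spark breaks into stages
        # and tasks that run on the executors.
        counts.saveAsTextFile("wordcount-output")

        spark.stop()

Changing the --master value (for example, to a YARN or standalone cluster URL) is what takes the same script from local mode to a cluster.
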
  • Spark SQL Overview
  • Spark SQL Documentation
  • Loading data into DataFrames and processing it
  • The DataFrame API
  • DataFrame Operations
  • Saving DataFrames to file systems
  • DataFrame internals that make it fast: the Catalyst optimizer and Tungsten
  • Hands-on Exercises (see the illustrative sketch below)
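
A minimal DataFrame sketch of the loading, transformation, and saving steps covered in the Spark SQL module; the column names and file paths are placeholder assumptions.

    # Loading, transforming and saving data with the DataFrame API (illustrative).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

    # Load a CSV file into a DataFrame, letting Spark infer the schema.
    orders = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("orders.csv"))                     # placeholder path/columns

    # These operations build a logical plan that the Catalyst optimizer and
    # the Tungsten execution engine turn into an efficient physical plan.
    totals = (orders
              .filter(F.col("status") == "COMPLETE")
              .groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount")))

    totals.show(5)
    totals.write.mode("overwrite").csv("totals_by_customer")

    spark.stop()
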
  • Processing JSON data
  • Binary Data Processing
  • Why Binary Data Processing?
  • Comparison of various data formats
  • Working with Parquet
  • Working with Avro
  • Hands-on Exercises (see the illustrative sketch below)
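
An illustrative sketch of the binary formats discussed above. Parquet support is built into Spark; the Avro lines assume the external spark-avro package has been added (for example via --packages). All paths are placeholders.

    # Working with Parquet and Avro (illustrative sketch).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("binary-formats").getOrCreate()

    # Start from line-delimited JSON, as in the first topic of this module.
    events = spark.read.json("events.json")           # placeholder path

    # Parquet: columnar, compressed, and stores its schema with the data.
    events.write.mode("overwrite").parquet("events_parquet")
    parquet_df = spark.read.parquet("events_parquet")

    # Avro: row-oriented and schema-evolution friendly; needs the spark-avro
    # package, e.g. --packages org.apache.spark:spark-avro_2.12:<spark-version>
    events.write.mode("overwrite").format("avro").save("events_avro")
    avro_df = spark.read.format("avro").load("events_avro")

    parquet_df.printSchema()
    avro_df.printSchema()

    spark.stop()
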
  • Distributed Persistence
  • Caching (see the illustrative sketch below)
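
Finally, a minimal sketch of caching and distributed persistence; the data set and the choice of storage level are illustrative assumptions.

    # Caching and distributed persistence (illustrative sketch).
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("caching-demo").getOrCreate()
    sc = spark.sparkContext

    logs = sc.textFile("access.log")                   # placeholder path
    errors = logs.filter(lambda line: "ERROR" in line)

    # Persist the filtered RDD so repeated actions reuse the computed
    # partitions instead of re-reading and re-filtering the source file.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    print(errors.count())        # first action materialises the cached data
    print(errors.take(5))        # later actions read from the cache

    # DataFrames offer the same facility via cache()/persist()/unpersist().
    errors.unpersist()
    spark.stop()
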
Prerequisites

Participants should preferably have prior software development experience, along with basic knowledge of SQL and Unix commands. Knowledge of Python or Scala is a plus.
