Introduction to Spark
Duration
3 Days
Level
Basic Level
This course can be designed and tailored to your team's needs.
The Introduction to Apache Spark training course teaches the skills needed to work with Apache Spark, an open-source data-processing engine in the Hadoop ecosystem, optimized for speed and advanced analytics.
The course begins by examining how to use Spark as an alternative to traditional MapReduce processing. Next, it explores how Spark supports streamed data processing and iterative algorithms. The course concludes with a lesson on how Spark enables jobs to run faster than traditional Hadoop MapReduce.
After this course, you will be able to:
○ Describe how Apache Spark, YARN, and Hadoop fit together
○ Understand Spark internals and architecture
○ Work with DataFrames & Spark SQL
○ Implement an application using key Spark concepts
○ Write and run a Spark application on a cluster
○ Understand Spark Streaming basics
This course is designed for application developers, DevOps engineers, and architects.
- What is Apache Spark?
- Spark versus MapReduce
- Using the Spark Shell
- Why HDFS?
- HDFS Architecture
- Using HDFS
- What is YARN?
- How does Spark run with YARN?
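As a sketch of how an application is pointed at a YARN cluster: the property names below are real Spark configuration keys, but the values (executor count, memory, cores) are illustrative only and would be sized to the cluster:

```python
# Illustrative YARN settings; keys are real Spark properties,
# values are placeholders.
yarn_conf = {
    "spark.master": "yarn",
    "spark.submit.deployMode": "cluster",  # driver runs inside the cluster
    "spark.executor.instances": "4",
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
}

# Equivalent spark-submit invocation:
#   spark-submit --master yarn --deploy-mode cluster \
#       --num-executors 4 --executor-memory 4g --executor-cores 2 app.py
```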
- Understand Transformations & Actions
- Spark Partitions
- Drivers & Executors
- RDDs vs DataFrames/Datasets
- Working with different file formats
- Working with the DataFrame API
- Introducing Spark SQL
- SparkContext
- Spark Properties
- Building and Running a Spark Application
- Logging
- Running Spark on Cluster
- Spark Web UI walkthrough
- What are jobs, stages & tasks?
- Understanding execution plan
- Caching
- Aggregations
- Joins
- Streaming Overview
- Sliding Window Operations
- Basic Spark Streaming Applications
Participants should ideally have prior software development experience, along with basic knowledge of SQL and Unix commands. Knowledge of Python or Scala is a plus.