Hadoop and Spark Training

Develop applications using a unified distributed application framework

Duration

4 Days

Level

Advanced Level

Design and Tailor this course

As per your team's needs


Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, Spark ML for machine learning, and Spark Streaming for live data stream processing. With Spark running on Apache Hadoop YARN, developers can create applications to derive actionable insights within a single, shared dataset in Hadoop. The participants will also learn how to work with HDFS, YARN, Sqoop, Hive, Data Formats and HBase.

This training course will teach you how to solve Big Data problems using the Apache Spark framework. The training covers a wide range of Big Data use cases such as ETL, data warehousing (DWH) and streaming. It also demonstrates how Spark integrates with other well-established Hadoop ecosystem products. You will learn the course curriculum through theory lectures, live demonstrations and lab exercises. This course is taught in the Scala programming language on Cloudera 7.1.4 with Spark 3.x.


The intended audience for this course:

  • Software Developer
  • Data Engineer
  • Data Scientist
  • Big Data Overview
  • Key Roles in Big Data Project
  • Key Business Use cases
  • Hadoop Architecture
  • Typical Big Data Project Pipeline
  • Lambda Architecture
  • Kappa Architecture
  • Delta Architecture
  • HDFS Overview
  • Why HDFS?
  • HDFS High-level architecture 
  • HDFS Commands 
  • Challenges with HDFS –
    • Small File Problem
    • In-place updates not supported
    • and more
  • Hadoop 2 vs Hadoop 3
  • Hands-on Exercise(s)
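
HDFS commands are normally issued with the `hdfs dfs` CLI; since the course code is written in Scala, the sketch below drives the equivalent operations through Hadoop's `FileSystem` API. The paths and file names are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up fs.defaultFS from the core-site.xml on the classpath
val fs = FileSystem.get(new Configuration())

// Equivalent of `hdfs dfs -mkdir -p /user/training/demo`
fs.mkdirs(new Path("/user/training/demo"))

// Equivalent of `hdfs dfs -put orders.csv /user/training/demo/`
fs.copyFromLocalFile(new Path("orders.csv"), new Path("/user/training/demo/"))

// Equivalent of `hdfs dfs -ls /user/training/demo`
fs.listStatus(new Path("/user/training/demo")).foreach(s => println(s.getPath))
```
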
  • Apache Sqoop Overview
  • MapReduce as a Concept – Logical and Physical Architecture
  • Sqoop Capabilities
  • Ways to import data through Sqoop
  • Sqoop Commands
  • Hands-on Exercise(s)
  • What Is Hive?
  • Hive Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive vs. Pig
  • Hive Use Cases
  • Interacting with Hive
  • Hive Databases and Tables
  • Basic HiveQL Syntax
  • Data Types
  • Joining Data Sets
  • Common Built-in Functions
  • Hands-on Exercise(s)
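
A minimal HiveQL sketch for the topics above. It is shown via `spark.sql` from spark-shell (where `spark` is pre-defined) because the course examples are in Scala, but the same statement runs unchanged in Beeline; `customers` and `orders` are hypothetical tables.

```scala
// Projection, filter, join and a couple of common built-in functions
spark.sql("""
  SELECT c.name,
         upper(c.city)           AS city,
         round(sum(o.amount), 2) AS total_spend
  FROM   customers c
  JOIN   orders    o ON c.id = o.customer_id
  WHERE  o.order_date >= '2023-01-01'
  GROUP  BY c.name, c.city
""").show()
```
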
  • Why Binary Data Processing?
  • Comparison of various data formats
  • Which one to choose when?
  • Structure of these formats
  • Which configurations can be changed to get the best performance
    • ORC
    • Parquet
    • Avro
    • Comparison of different data formats
  • Utilities to work with these formats
  • Creating Databases and Hive-Managed Tables
  • Loading Data into Hive
  • Altering Databases and Tables
  • Self-Managed Tables
  • Storing Query Results
  • Hands-on Exercise(s)
  • Understanding Query Performance
  • Partitioning
  • Bucketing
  • Indexing Data
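
A sketch of how partitioning and bucketing appear in Hive DDL, again issued through `spark.sql`; the `sales` table and its columns are hypothetical.

```scala
// Partitioning prunes whole directories at query time;
// bucketing helps joins and sampling on the bucketed column.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    order_id     BIGINT,
    customer_id  BIGINT,
    amount       DOUBLE
  )
  PARTITIONED BY (country STRING)
  CLUSTERED BY (customer_id) INTO 16 BUCKETS
  STORED AS ORC
""")

// Only the country=IN partition directory is scanned (partition pruning)
spark.sql("SELECT count(*) FROM sales WHERE country = 'IN'").show()
```
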
  • Functional Programming
  • Passing Function as arguments
  • Anonymous Functions
  • Why Scala?
  • Basic Syntax
  • User-Defined Functions within Scala 
  • Hands-on Exercise(s)
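
A short sketch of the functional-programming ideas listed above: functions as values, higher-order functions and anonymous functions.

```scala
// Functions are values: they can be named, stored and passed around
val double: Int => Int = x => x * 2

// A higher-order function: takes another function as an argument
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

println(applyTwice(double, 5))      // 20
println(applyTwice(_ + 3, 5))       // 11, using an anonymous function

// The same style drives the Spark API
println(List(1, 2, 3).map(_ * 2))   // List(2, 4, 6)
```
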
  • What is Apache Spark? The story of the evolution from Hadoop
  • Why Spark?
  • Advantages of Spark over Hadoop MapReduce
  • Logical Architecture of Spark
  • Programming Languages in Spark 
  • Functional Programming with Spark
  • Lambda architecture for enterprise data and analytics services 
  • Data sources for Spark application
  • Hands-on Exercise(s)
  • Which language to choose when? 
  • Why not Java?
  • Spark on Scala
  • Scala vs Python vs Java vs R
  • RDD – Basics
  • Spark Context and Spark Session
  • RDD operations – reading from files, transforming, and persisting results (see the sketch after this module)
  • Leveraging in-memory processing
  • Pair RDD operations
  • Working with semi-structured data formats using regex, JSON and XML libraries
  • MapReduce Operations
  • Joins
  • Hands-on Exercise(s)
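
A minimal sketch of the RDD operations above, as typed into spark-shell (where `sc` is pre-defined); the input path and column layout are hypothetical.

```scala
// Read a (hypothetical) CSV of orders from HDFS into an RDD of lines
val lines = sc.textFile("hdfs:///user/training/orders.csv")

// Transformations are lazy; nothing runs until an action is invoked
val pairs = lines
  .map(_.split(","))                   // e.g. order_id,customer_id,amount
  .map(f => (f(1), f(2).toDouble))     // pair RDD keyed by customer_id

val totals = pairs.reduceByKey(_ + _)  // per-key aggregation (a shuffle)

totals.take(5).foreach(println)        // action: triggers the computation
totals.saveAsTextFile("hdfs:///user/training/customer_totals")
```
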
  • Anatomy of Spark jobs on YARN, Standalone and Mesos 
  • RDD partitions 
  • Partitioning
  • Repartitioning, coalesce etc.
  • Spark terminology: narrow and wide operations, shuffle, DAG, stages, and tasks
  • Job metrics 
  • Fault Tolerance
  • Hands-on Exercise(s)
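
A small sketch contrasting narrow and wide operations and the repartitioning calls listed above; run it in spark-shell and inspect the resulting stages and tasks in the Spark UI.

```scala
val rdd = sc.parallelize(1 to 1000000, 8)   // 8 initial partitions
println(rdd.getNumPartitions)               // 8

// map is a narrow operation: no shuffle, partition count unchanged
val squared = rdd.map(x => x.toLong * x)

// repartition is a wide operation: full shuffle, new stage in the DAG
val wider = squared.repartition(16)

// coalesce shrinks the partition count without a shuffle
val narrower = wider.coalesce(4)

println(narrower.count())                   // action: triggers the job
```
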
  • Overview
  • Spark on YARN
  • YARN client vs YARN cluster modes
  • Deployment modes – YARN, Standalone, Mesos 
  • The Spark Standalone Web UI
  • Hands-on Exercise(s)
  • Spark Applications vs. Spark Shell
  • Developing a Spark Application
  • Modifying Spark Key Configurations
  • Hands-on Exercise(s)
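
A sketch of a self-contained Spark application, as opposed to spark-shell, with a couple of commonly tuned configurations; the values shown are illustrative, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object MyFirstApp {
  def main(args: Array[String]): Unit = {
    // Configuration can also come from spark-defaults.conf or spark-submit --conf
    val spark = SparkSession.builder()
      .appName("MyFirstApp")
      .config("spark.sql.shuffle.partitions", "64")   // default is 200
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    spark.range(1, 1000).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```
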
  • Caching
  • Distributed Persistence
  • Storage Levels
  • Broadcast variables with RDDs
  • Accumulators with RDDs
  • Hands-on Exercise(s)
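
A sketch of persistence, a broadcast variable and an accumulator, assuming spark-shell; the input file and its layout (country_code,customer_id,amount) are hypothetical.

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///user/training/sales.csv")
lines.persist(StorageLevel.MEMORY_AND_DISK)   // cache, spilling to disk if needed

// Broadcast: ship a small lookup table to every executor once
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

// Accumulator: executors add to it, only the driver reads the final value
val badRecords = sc.longAccumulator("badRecords")

val perCountry = lines.flatMap { line =>
  val f = line.split(",")
  if (f.length == 3) Some((countryNames.value.getOrElse(f(0), "Unknown"), f(2).toDouble))
  else { badRecords.add(1); None }
}.reduceByKey(_ + _)

perCountry.collect().foreach(println)
println(s"Bad records: ${badRecords.value}")
```
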
  • Spark SQL Overview
  • Spark SQL Documentation
  • Sub-modules of Spark SQL
  • Difference between RDD and DataFrame
  • Loading and processing data into DataFrames
  • Dataset API
  • DataFrame API
  • DataFrame Operations
  • Saving DataFrames to file systems
  • DataFrame internals that make it fast – Catalyst Optimizer and Tungsten
  • Hands-on Exercise(s)
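
A minimal DataFrame sketch for the module above, assuming spark-shell; the CSV path and columns are hypothetical.

```scala
import org.apache.spark.sql.functions._

val orders = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/training/orders.csv")

// These operations are compiled into an optimized plan by Catalyst/Tungsten
val topCustomers = orders
  .filter(col("amount") > 100)
  .groupBy("customer_id")
  .agg(sum("amount").alias("total_spend"))
  .orderBy(desc("total_spend"))

topCustomers.show(10)
topCustomers.write.mode("overwrite").parquet("hdfs:///user/training/top_customers")
```
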
  • Processing JSON data
  • Binary Data Processing
  • Working with Parquet
  • Working with ORC
  • Working with Avro
  • Hands-on Exercise(s)
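
A sketch of reading and writing the formats above. Parquet, ORC and JSON support is built in; Avro needs the spark-avro package (matching your Spark and Scala versions) on the classpath. All paths are placeholders.

```scala
val events = spark.read.json("hdfs:///user/training/events.json")   // JSON lines
val orders = spark.read.option("header", "true").csv("hdfs:///user/training/orders.csv")

orders.write.mode("overwrite").parquet("hdfs:///user/training/orders_parquet")
orders.write.mode("overwrite").orc("hdfs:///user/training/orders_orc")
orders.write.mode("overwrite").format("avro").save("hdfs:///user/training/orders_avro")

// The schema travels with the files for all three binary formats
spark.read.parquet("hdfs:///user/training/orders_parquet").printSchema()
spark.read.orc("hdfs:///user/training/orders_orc").printSchema()
spark.read.format("avro").load("hdfs:///user/training/orders_avro").printSchema()
```
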
  • Hive Context vs Spark SQL Context 
  • Integrating Spark SQL with Hive
  • Working with Hive Tables 
  • Working with JDBC data source 
  • Data formats – text formats such as CSV, JSON and XML; binary formats such as Parquet and ORC
  • UDFs in Spark DataFrames
  • Spark SQL as a JDBC service and its benefits and limitations
  • Analytical queries in Spark
  • Hands-on Exercise(s)
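
A sketch of Hive integration, a UDF and a JDBC read. Table names, the JDBC URL and credentials are placeholders, and the relevant JDBC driver must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveIntegration")
  .enableHiveSupport()          // share the Hive metastore with Spark SQL
  .getOrCreate()

// Query an existing Hive table
spark.sql("SELECT country, count(*) AS orders FROM sales GROUP BY country").show()

// Register a UDF and call it from SQL
spark.udf.register("mask_email", (e: String) => {
  val Array(user, domain) = e.split("@", 2)
  user.take(1) + "***@" + domain
})
spark.sql("SELECT mask_email('alice@example.com')").show()

// Read a table over JDBC
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/shop")
  .option("dbtable", "customers")
  .option("user", "training").option("password", "training")
  .load()
```
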
  • What is the Catalog API?
  • Catalog API Demo
  • Table, View
  • Key considerations for Performance Tuning 
  • Hands-on Exercise(s)
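
A short Catalog API sketch, assuming spark-shell; `default.sales` is a hypothetical table.

```scala
spark.catalog.listDatabases().show(truncate = false)
spark.catalog.listTables("default").show(truncate = false)
spark.catalog.listColumns("default", "sales").show()

// Temporary views appear in the catalog too
spark.range(10).createOrReplaceTempView("numbers")
println(spark.catalog.tableExists("numbers"))   // true

// Refresh cached metadata when files under an external table change
spark.catalog.refreshTable("default.sales")
```
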
  • Capturing job metrics using Spark History Server 
  • Benefits of shared variables – Accumulator, Broadcast
  • Types of spark caching and their use cases
  • Understanding Query Performance
  • Partitioning
  • Bucketing
  • Indexing Data
  • Understanding Catalyst Optimizer
  • Operators in the Spark plan, e.g. Exchange, HashAggregate, Filter, Pruning
  • Exploring and Interpreting Spark SQL Plan
  • Partitions
  • Partition Pruning
  • Coalesce
  • Broadcast Join
  • Hands-on Exercise(s)
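
A sketch tying a few of the tuning topics together: a broadcast join, partition pruning and `coalesce` on write. The `sales` and `countries` tables are hypothetical.

```scala
import org.apache.spark.sql.functions.broadcast

val sales     = spark.table("sales")       // large fact table, partitioned by country
val countries = spark.table("countries")   // small dimension table

// Broadcasting the small side avoids shuffling the large table
val joined = sales.join(broadcast(countries), Seq("country"))

// Look for BroadcastHashJoin, Exchange and PartitionFilters in the plan
joined.filter("country = 'IN'").explain(true)

// Fewer, larger output files for a heavily filtered result
joined.filter("country = 'IN'").coalesce(4)
  .write.mode("overwrite").parquet("hdfs:///user/training/in_sales")
```
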
  • Features Overview
  • Dynamic Partition Pruning
  • Adaptive Query Execution
  • SQL Join Hints
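
A sketch of the Spark 3 switches and a SQL join hint; these features are on by default in recent 3.x releases, and the tables in the query are hypothetical.

```scala
// Adaptive Query Execution and Dynamic Partition Pruning switches
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

// SQL join hint forcing a broadcast of the smaller side
spark.sql("""
  SELECT /*+ BROADCAST(c) */ s.order_id, c.country_name
  FROM   sales s JOIN countries c ON s.country = c.country
""").explain()
```
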
  • Spark Streaming Overview
  • Streaming concepts and Terminologies
  • Architecture of streaming application 
  • Streaming Context – initialization, configuration, characteristics 
  • First Streaming Application
  • Stateless Operations
  • Stateful Operations
  • Output Operations
  • Integrating with Ingestion Frameworks
  • DStream operations (see the sketch after this module)
  • Window operations – batch interval, window length, sliding interval
  • Fault Tolerance using checkpointing and replication
  • Partitioning behavior of DStreams
  • Hands-on Exercise(s)
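
A minimal DStream sketch for the module above, assuming spark-shell and a socket source fed by `nc -lk 9999` on the same host; the checkpoint path and port are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))   // 5s batch interval
ssc.checkpoint("hdfs:///user/training/checkpoints")              // for stateful/window ops

val lines = ssc.socketTextStream("localhost", 9999)

// Stateless per-batch word count
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

// Window length 30s, sliding interval 10s (both multiples of the batch interval)
val windowed = counts.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowed.print()    // output operation
ssc.start()
ssc.awaitTermination()
```
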
  • Structured Streaming Overview
  • Programming model for Structured streaming
  • Transformations and Actions
  • Output Modes
  • Structured Streaming sources and sinks
  • Using Event Time in Streaming
  • Using Processing Time
  • Handling late data
  • Exploring Watermarks
  • Time-Based Window Aggregations
  • Record De-duplication
  • Triggers
  • Fault Tolerance
  • Continuous processing
  • Hands-on Exercise(s)
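
A Structured Streaming sketch with event time, a watermark and a window, assuming a socket source emitting lines of `timestamp,word`; the host, port and field layout are hypothetical.

```scala
import org.apache.spark.sql.functions._

val raw = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999)
  .load()

val events = raw
  .withColumn("ts",   split(col("value"), ",").getItem(0).cast("timestamp"))
  .withColumn("word", split(col("value"), ",").getItem(1))

// 10-minute event-time windows; data arriving more than 15 minutes late is dropped
val counts = events
  .withWatermark("ts", "15 minutes")
  .groupBy(window(col("ts"), "10 minutes"), col("word"))
  .count()

val query = counts.writeStream
  .outputMode("update")        // other modes: append, complete
  .format("console")
  .start()

query.awaitTermination()
```
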
  • What is Delta Lake?
  • Why use Delta Lake?
  • Key Features
  • Parquet Overview
  • Parquet vs. Delta Lake
  • Delta Lake Architecture
  • How does Delta Lake work?
  • Configuration Params of Delta Lake
  • Delta Lake Hands-on using Spark SQL and Streaming – Loading and Storing Data
  • Operational challenges of large-scale processing
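
A Delta Lake sketch covering a write, an ACID delete, table history and time travel. It assumes the delta-core package matching your Spark and Scala versions is on the classpath and the Delta SQL extensions are configured; the table path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession
import io.delta.tables.DeltaTable

val spark = SparkSession.builder()
  .appName("DeltaLakeDemo")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val path = "hdfs:///user/training/delta/sales"

// A Delta table is Parquet files plus a transaction log (_delta_log)
spark.range(0, 100).toDF("id").write.format("delta").mode("overwrite").save(path)

// ACID delete, which plain Parquet does not support
val table = DeltaTable.forPath(spark, path)
table.delete("id < 10")
table.history().select("version", "operation").show()

// Time travel: read an earlier version of the table
println(spark.read.format("delta").option("versionAsOf", 0).load(path).count())
```
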

The following are the prerequisites for this course.

  • Basic knowledge of Databases and SQL
  • Basic knowledge of Linux commands
  • Basic knowledge of Programming
  • Knowledge of Scala is helpful
