Hadoop and Spark Training

Develop applications using a unified distributed application framework

Duration

4 Days

Level

Advanced Level

Design and Tailor this course

As per your team's needs


Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, Spark ML for machine learning, and Spark Streaming for live data stream processing. With Spark running on Apache Hadoop YARN, developers can create applications to derive actionable insights within a single, shared dataset in Hadoop. The participants will also learn how to work with HDFS, YARN, Sqoop, Hive, Data Formats and HBase.

This training course will teach you how to solve Big Data problems using the Apache Spark framework. The training covers a wide range of Big Data use cases such as ETL, data warehousing (DWH) and streaming. It also demonstrates how Spark integrates with other well-established Hadoop ecosystem products. You will learn the course curriculum through theory lectures, live demonstrations and lab exercises. This course is taught in the Scala programming language on Cloudera 7.1.4 with Spark 3.x.


The intended audience for this course:

  • Software Developer
  • Data Engineer
  • Data Scientist
  • Big Data Overview
  • Key Roles in Big Data Project
  • Key Business Use cases
  • Hadoop Architecture
  • Typical Big Data Project Pipeline
  • Lambda Architecture
  • Kappa Architecture
  • Delta Architecture
  • HDFS Overview
  • Why HDFS?
  • HDFS High-level architecture 
  • HDFS Commands 
  • Challenges with HDFS –
    • Small File Problem
    • In-place updates not supported
    • and more
  • Hadoop 2 vs Hadoop 3
  • Hands-on Exercise(s)
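
HDFS commands are normally issued with the `hdfs dfs` CLI; since the course code is written in Scala, the sketch below drives the equivalent operations through Hadoop's `FileSystem` API. The paths and file names are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up fs.defaultFS from the core-site.xml on the classpath
val fs = FileSystem.get(new Configuration())

// Equivalent of `hdfs dfs -mkdir -p /user/training/demo`
fs.mkdirs(new Path("/user/training/demo"))

// Equivalent of `hdfs dfs -put orders.csv /user/training/demo/`
fs.copyFromLocalFile(new Path("orders.csv"), new Path("/user/training/demo/"))

// Equivalent of `hdfs dfs -ls /user/training/demo`
fs.listStatus(new Path("/user/training/demo")).foreach(s => println(s.getPath))
```
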
  • Apache Sqoop Overview
  • MapReduce as a Concept – Logical and Physical Architecture
  • Sqoop Capabilities
  • Ways to import data through Sqoop
  • Sqoop Commands
  • Hands-on Exercise(s)
  • What Is Hive?
  • Hive Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive vs. Pig
  • Hive Use Cases
  • Interacting with Hive
  • Hive Databases and Tables
  • Basic HiveQL Syntax
  • Data Types
  • Joining Data Sets
  • Common Built-in Functions
  • Hands-on Exercise(s)
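
A minimal HiveQL sketch for the topics above. It is shown via `spark.sql` from spark-shell (where `spark` is pre-defined) because the course examples are in Scala, but the same statement runs unchanged in Beeline; `customers` and `orders` are hypothetical tables.

```scala
// Projection, filter, join and a couple of common built-in functions
spark.sql("""
  SELECT c.name,
         upper(c.city)           AS city,
         round(sum(o.amount), 2) AS total_spend
  FROM   customers c
  JOIN   orders    o ON c.id = o.customer_id
  WHERE  o.order_date >= '2023-01-01'
  GROUP  BY c.name, c.city
""").show()
```
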
  • Why Binary Data Processing?
  • Comparison of various data formats
  • Which one to choose when?
  • Structure of these formats
  • Which configurations can be changed to get the best performance
    • ORC
    • Parquet
    • Avro
    • Comparison of different data formats
  • Utilities to work with these formats
  • Creating Databases and Hive-Managed Tables
  • Loading Data into Hive
  • Altering Databases and Tables
  • Self-Managed Tables
  • Storing Query Results
  • Hands-on Exercise(s)
  • Understanding Query Performance
  • Partitioning
  • Bucketing
  • Indexing Data
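
A sketch of how partitioning and bucketing appear in Hive DDL, again issued through `spark.sql`; the `sales` table and its columns are hypothetical.

```scala
// Partitioning prunes whole directories at query time;
// bucketing helps joins and sampling on the bucketed column.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    order_id     BIGINT,
    customer_id  BIGINT,
    amount       DOUBLE
  )
  PARTITIONED BY (country STRING)
  CLUSTERED BY (customer_id) INTO 16 BUCKETS
  STORED AS ORC
""")

// Only the country=IN partition directory is scanned (partition pruning)
spark.sql("SELECT count(*) FROM sales WHERE country = 'IN'").show()
```
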
  • Functional Programming
  • Passing Function as arguments
  • Anonymous Functions
  • Why Scala?
  • Basic Syntax
  • User-Defined Functions within Scala 
  • Hands-on Exercise(s)
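
A short sketch of the functional-programming ideas listed above: functions as values, higher-order functions and anonymous functions.

```scala
// Functions are values: they can be named, stored and passed around
val double: Int => Int = x => x * 2

// A higher-order function: takes another function as an argument
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

println(applyTwice(double, 5))      // 20
println(applyTwice(_ + 3, 5))       // 11, using an anonymous function

// The same style drives the Spark API
println(List(1, 2, 3).map(_ * 2))   // List(2, 4, 6)
```
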
  • What is Apache Spark? The story of the evolution from Hadoop
  • Why Spark?
  • Advantages of Spark over Hadoop MapReduce
  • Logical Architecture of Spark
  • Programming Languages in Spark 
  • Functional Programming with Spark
  • Lambda architecture for enterprise data and analytics services 
  • Data sources for Spark application
  • Hands-on Exercise(s)
  • Which language to choose when? 
  • Why not Java?
  • Spark on Scala
  • Scala vs Python vs Java vs R
  • RDD – Basics
  • Spark Context and Spark Session
  • RDD operations – reading from files, transforming, and persisting results (see the sketch after this module)
  • Leveraging in-memory processing
  • Pair RDD operations
  • Working with semi-structured data formats using regex, JSON and XML libraries
  • MapReduce Operations
  • Joins
  • Hands-on Exercise(s)
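
A minimal sketch of the RDD operations above, as typed into spark-shell (where `sc` is pre-defined); the input path and column layout are hypothetical.

```scala
// Read a (hypothetical) CSV of orders from HDFS into an RDD of lines
val lines = sc.textFile("hdfs:///user/training/orders.csv")

// Transformations are lazy; nothing runs until an action is invoked
val pairs = lines
  .map(_.split(","))                   // e.g. order_id,customer_id,amount
  .map(f => (f(1), f(2).toDouble))     // pair RDD keyed by customer_id

val totals = pairs.reduceByKey(_ + _)  // per-key aggregation (a shuffle)

totals.take(5).foreach(println)        // action: triggers the computation
totals.saveAsTextFile("hdfs:///user/training/customer_totals")
```
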
  • Anatomy of Spark jobs on YARN, Standalone and Mesos 
  • RDD partitions 
  • Partitioning
  • Repartitioning, coalesce etc.
  • Spark terminology: narrow and wide operations, shuffle, DAG, stages, and tasks
  • Job metrics 
  • Fault Tolerance
  • Hands-on Exercise(s)
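
A small sketch contrasting narrow and wide operations and the repartitioning calls listed above; run it in spark-shell and inspect the resulting stages and tasks in the Spark UI.

```scala
val rdd = sc.parallelize(1 to 1000000, 8)   // 8 initial partitions
println(rdd.getNumPartitions)               // 8

// map is a narrow operation: no shuffle, partition count unchanged
val squared = rdd.map(x => x.toLong * x)

// repartition is a wide operation: full shuffle, new stage in the DAG
val wider = squared.repartition(16)

// coalesce shrinks the partition count without a shuffle
val narrower = wider.coalesce(4)

println(narrower.count())                   // action: triggers the job
```
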
  • Overview
  • Spark on YARN
  • YARN client vs YARN cluster modes
  • Deployment modes – YARN, Standalone, Mesos 
  • The Spark Standalone Web UI
  • Hands-on Exercise(s)
  • Spark Applications vs. Spark Shell
  • Developing a Spark Application
  • Modifying Spark Key Configurations
  • Hands-on Exercise(s)
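
A sketch of a self-contained Spark application, as opposed to spark-shell, with a couple of commonly tuned configurations; the values shown are illustrative, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object MyFirstApp {
  def main(args: Array[String]): Unit = {
    // Configuration can also come from spark-defaults.conf or spark-submit --conf
    val spark = SparkSession.builder()
      .appName("MyFirstApp")
      .config("spark.sql.shuffle.partitions", "64")   // default is 200
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    spark.range(1, 1000).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```
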
  • Caching
  • Distributed Persistence
  • Storage Levels
  • Broadcast variables with RDDs
  • Accumulators with RDDs
  • Hands-on Exercise(s)
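
A sketch of persistence, a broadcast variable and an accumulator, assuming spark-shell; the input file and its layout (country_code,customer_id,amount) are hypothetical.

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///user/training/sales.csv")
lines.persist(StorageLevel.MEMORY_AND_DISK)   // cache, spilling to disk if needed

// Broadcast: ship a small lookup table to every executor once
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

// Accumulator: executors add to it, only the driver reads the final value
val badRecords = sc.longAccumulator("badRecords")

val perCountry = lines.flatMap { line =>
  val f = line.split(",")
  if (f.length == 3) Some((countryNames.value.getOrElse(f(0), "Unknown"), f(2).toDouble))
  else { badRecords.add(1); None }
}.reduceByKey(_ + _)

perCountry.collect().foreach(println)
println(s"Bad records: ${badRecords.value}")
```
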
  • Spark SQL Overview
  • Spark SQL Documentation
  • Sub-modules of Spark SQL
  • Difference between RDD and DataFrame
  • Loading and processing data into DataFrames
  • Dataset API
  • DataFrame API
  • DataFrame Operations
  • Saving DataFrames to file systems
  • DataFrame internals that make it fast – Catalyst Optimizer and Tungsten
  • Hands-on Exercise(s)
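
A minimal DataFrame sketch for the module above, assuming spark-shell; the CSV path and columns are hypothetical.

```scala
import org.apache.spark.sql.functions._

val orders = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/training/orders.csv")

// These operations are compiled into an optimized plan by Catalyst/Tungsten
val topCustomers = orders
  .filter(col("amount") > 100)
  .groupBy("customer_id")
  .agg(sum("amount").alias("total_spend"))
  .orderBy(desc("total_spend"))

topCustomers.show(10)
topCustomers.write.mode("overwrite").parquet("hdfs:///user/training/top_customers")
```
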
  • Processing JSON data
  • Binary Data Processing
  • Working with Parquet
  • Working with ORC
  • Working with Avro
  • Hands-on Exercise(s)
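
A sketch of reading and writing the formats above. Parquet, ORC and JSON support is built in; Avro needs the spark-avro package (matching your Spark and Scala versions) on the classpath. All paths are placeholders.

```scala
val events = spark.read.json("hdfs:///user/training/events.json")   // JSON lines
val orders = spark.read.option("header", "true").csv("hdfs:///user/training/orders.csv")

orders.write.mode("overwrite").parquet("hdfs:///user/training/orders_parquet")
orders.write.mode("overwrite").orc("hdfs:///user/training/orders_orc")
orders.write.mode("overwrite").format("avro").save("hdfs:///user/training/orders_avro")

// The schema travels with the files for all three binary formats
spark.read.parquet("hdfs:///user/training/orders_parquet").printSchema()
spark.read.orc("hdfs:///user/training/orders_orc").printSchema()
spark.read.format("avro").load("hdfs:///user/training/orders_avro").printSchema()
```
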
  • Hive Context vs Spark SQL Context 
  • Integrating Spark SQL with Hive
  • Working with Hive Tables 
  • Working with JDBC data source 
  • Data formats – text formats such as CSV, JSON and XML; binary formats such as Parquet and ORC
  • UDFs in Spark DataFrames
  • Spark SQL as a JDBC service and its benefits and limitations
  • Analytical queries in Spark
  • Hands-on Exercise(s)
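
A sketch of Hive integration, a UDF and a JDBC read. Table names, the JDBC URL and credentials are placeholders, and the relevant JDBC driver must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveIntegration")
  .enableHiveSupport()          // share the Hive metastore with Spark SQL
  .getOrCreate()

// Query an existing Hive table
spark.sql("SELECT country, count(*) AS orders FROM sales GROUP BY country").show()

// Register a UDF and call it from SQL
spark.udf.register("mask_email", (e: String) => {
  val Array(user, domain) = e.split("@", 2)
  user.take(1) + "***@" + domain
})
spark.sql("SELECT mask_email('alice@example.com')").show()

// Read a table over JDBC
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/shop")
  .option("dbtable", "customers")
  .option("user", "training").option("password", "training")
  .load()
```
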
  • What is the Catalog API?
  • Catalog API Demo
  • Table, View
  • Key considerations for Performance Tuning 
  • Hands-on Exercise(s)
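
A short Catalog API sketch, assuming spark-shell; `default.sales` is a hypothetical table.

```scala
spark.catalog.listDatabases().show(truncate = false)
spark.catalog.listTables("default").show(truncate = false)
spark.catalog.listColumns("default", "sales").show()

// Temporary views appear in the catalog too
spark.range(10).createOrReplaceTempView("numbers")
println(spark.catalog.tableExists("numbers"))   // true

// Refresh cached metadata when files under an external table change
spark.catalog.refreshTable("default.sales")
```
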
  • Capturing job metrics using Spark History Server 
  • Benefits of shared variables – Accumulator, Broadcast
  • Types of spark caching and their use cases
  • Understanding Query Performance
  • Partitioning
  • Bucketing
  • Indexing Data
  • Understanding Catalyst Optimizer
  • Operators in the Spark plan, e.g. Exchange, HashAggregate, Filter, Pruning
  • Exploring and Interpreting Spark SQL Plan
  • Partitions
  • Partition Pruning
  • Coalesce
  • Broadcast Join
  • Hands-on Exercise(s)
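
A sketch tying a few of the tuning topics together: a broadcast join, partition pruning and `coalesce` on write. The `sales` and `countries` tables are hypothetical.

```scala
import org.apache.spark.sql.functions.broadcast

val sales     = spark.table("sales")       // large fact table, partitioned by country
val countries = spark.table("countries")   // small dimension table

// Broadcasting the small side avoids shuffling the large table
val joined = sales.join(broadcast(countries), Seq("country"))

// Look for BroadcastHashJoin, Exchange and PartitionFilters in the plan
joined.filter("country = 'IN'").explain(true)

// Fewer, larger output files for a heavily filtered result
joined.filter("country = 'IN'").coalesce(4)
  .write.mode("overwrite").parquet("hdfs:///user/training/in_sales")
```
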
  • Features Overview
  • Dynamic Partition Pruning
  • Adaptive Query Execution
  • SQL Join Hints
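
A sketch of the Spark 3 switches and a SQL join hint; these features are on by default in recent 3.x releases, and the tables in the query are hypothetical.

```scala
// Adaptive Query Execution and Dynamic Partition Pruning switches
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

// SQL join hint forcing a broadcast of the smaller side
spark.sql("""
  SELECT /*+ BROADCAST(c) */ s.order_id, c.country_name
  FROM   sales s JOIN countries c ON s.country = c.country
""").explain()
```
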
  • Spark Streaming Overview
  • Streaming concepts and Terminologies
  • Architecture of streaming application 
  • Streaming Context – initialization, configuration, characteristics 
  • First Streaming Application
  • Stateless Operations
  • Stateful Operations
  • Output Operations
  • Integrating with Ingestion Frameworks
  • DStream operations (see the sketch after this module)
  • Window operations – batch interval, window length, sliding interval
  • Fault Tolerance using checkpointing and replication
  • Partitioning behavior of DStreams
  • Hands-on Exercise(s)
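
A minimal DStream sketch for the module above, assuming spark-shell and a socket source fed by `nc -lk 9999` on the same host; the checkpoint path and port are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))   // 5s batch interval
ssc.checkpoint("hdfs:///user/training/checkpoints")              // for stateful/window ops

val lines = ssc.socketTextStream("localhost", 9999)

// Stateless per-batch word count
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

// Window length 30s, sliding interval 10s (both multiples of the batch interval)
val windowed = counts.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowed.print()    // output operation
ssc.start()
ssc.awaitTermination()
```
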
  • Structured Streaming Overview
  • Programming model for Structured streaming
  • Transformations and Actions
  • Output Modes
  • Structured Streaming sources and sinks
  • Using Event Time in Streaming
  • Using Processing Time
  • Handling late data
  • Exploring Watermarks
  • Time-Based Window Aggregations
  • Record De-duplication
  • Triggers
  • Fault Tolerance
  • Continuous processing
  • Hands-on Exercise(s)
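
A Structured Streaming sketch with event time, a watermark and a window, assuming a socket source emitting lines of `timestamp,word`; the host, port and field layout are hypothetical.

```scala
import org.apache.spark.sql.functions._

val raw = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999)
  .load()

val events = raw
  .withColumn("ts",   split(col("value"), ",").getItem(0).cast("timestamp"))
  .withColumn("word", split(col("value"), ",").getItem(1))

// 10-minute event-time windows; data arriving more than 15 minutes late is dropped
val counts = events
  .withWatermark("ts", "15 minutes")
  .groupBy(window(col("ts"), "10 minutes"), col("word"))
  .count()

val query = counts.writeStream
  .outputMode("update")        // other modes: append, complete
  .format("console")
  .start()

query.awaitTermination()
```
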
  • What is Delta Lake?
  • Why use Delta Lake?
  • Key Features
  • Parquet Overview
  • Parquet vs. Delta Lake
  • Delta Lake Architecture
  • How does Delta Lake work?
  • Configuration Params of Delta Lake
  • Delta Lake Hands-on using Spark SQL and Streaming – Loading and Storing Data
  • Operational challenges of large-scale processing
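
A Delta Lake sketch covering a write, an ACID delete, table history and time travel. It assumes the delta-core package matching your Spark and Scala versions is on the classpath and the Delta SQL extensions are configured; the table path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession
import io.delta.tables.DeltaTable

val spark = SparkSession.builder()
  .appName("DeltaLakeDemo")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val path = "hdfs:///user/training/delta/sales"

// A Delta table is Parquet files plus a transaction log (_delta_log)
spark.range(0, 100).toDF("id").write.format("delta").mode("overwrite").save(path)

// ACID delete, which plain Parquet does not support
val table = DeltaTable.forPath(spark, path)
table.delete("id < 10")
table.history().select("version", "operation").show()

// Time travel: read an earlier version of the table
println(spark.read.format("delta").option("versionAsOf", 0).load(path).count())
```
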

The following are the prerequisites for this course.

  • Basic knowledge of Databases and SQL
  • Basic knowledge of Linux commands
  • Basic knowledge of Programming
  • Knowledge of Scala is helpful
