Managing Big Data using Hadoop and Spark

Learn ingestion, storage, processing, and analysis of Big Data using the Hadoop and Spark ecosystem

Duration

4 Days

Level

Intermediate

Design and Tailor This Course

As per your team's needs

Overview

The program focuses on the ingestion, storage, processing, and analysis of Big Data using the Hadoop and Spark ecosystem: HDFS, MapReduce, YARN, Sqoop, Flume, Hive, Spark Core, Pig, Impala, HBase, and Kafka.

  • Holistic Overview of the Hadoop and Spark Ecosystem
  • Distributed Storage and Processing Concepts
  • Which Technology/Tool to Choose When
  • Architecture and Internals of Key Projects
  • How to Perform Data Processing and Analysis Using Spark, Pig, and Hive

Audience

The intended audience for this course:

  • Application Developers
  • DevOps Engineers
  • Architects
  • System Engineers
  • Technical Managers

Course Outline

Big Data Overview

  • Key Roles in a Big Data Project
  • Key Business Use Cases
  • Hadoop and Spark Logical Architecture
  • Typical Big Data Project Pipeline

HDFS

  • HDFS Overview
  • Physical Architecture of HDFS
  • Hands-on Exercise: The Hadoop Distributed File System (see the sketch below)
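
The hands-on revolves around the HDFS shell. Below is a minimal sketch, driven from Python via subprocess for consistency with the later examples; the paths and file names are hypothetical.

    import subprocess

    # Each call shells out to the `hdfs dfs` command-line client.
    def hdfs(*args):
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/user/training/input")            # create a directory
    hdfs("-put", "local_data.txt", "/user/training/input")  # copy a local file into HDFS
    hdfs("-ls", "/user/training/input")                     # list directory contents
    hdfs("-cat", "/user/training/input/local_data.txt")     # print a file's contents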

MapReduce and YARN

  • Logical Architecture of MapReduce
  • Logical Architecture of YARN
  • High-Level Architecture of MRv1 and YARN
  • MRv1 vs. MRv2 on YARN
  • Hands-on Exercise (see the sketch below)
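
To make the MapReduce data flow concrete, here is a word-count sketch in the style of Hadoop Streaming, which lets map and reduce tasks run as ordinary Python scripts reading stdin and writing stdout. The file names and streaming jar path are assumptions that vary by distribution.

    # mapper.py: emit one "word<TAB>1" line per input token.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py: the framework delivers input sorted by key, so equal
    # words arrive adjacent and can be summed in a single pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

    # Submitted to YARN via Hadoop Streaming, e.g.:
    # hadoop jar hadoop-streaming.jar \
    #   -input /user/training/input -output /user/training/wordcount \
    #   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py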

Sqoop

  • Sqoop Basics
  • Sqoop Internals
  • Sqoop 1 vs. Sqoop 2
  • Key Sqoop Commands
  • Hands-on Exercise (see the sketch below)
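
A sketch of a typical Sqoop import, wrapped in Python for consistency with the other examples; the JDBC URL, credentials, and table are hypothetical.

    import subprocess

    # Import a MySQL table into HDFS with four parallel map tasks.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/retail",   # hypothetical source database
        "--username", "training", "--password", "training",
        "--table", "orders",                         # table to import
        "--target-dir", "/user/training/orders",     # HDFS destination
        "--num-mappers", "4",                        # degree of parallelism
    ], check=True)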

Flume

  • Flume Overview
  • Physical Architecture of Flume
  • Sources, Sinks, and Channels
  • Building a Data Pipeline Using Flume
  • Hands-on Exercise (see the sketch below)
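
A Flume pipeline is declared in a properties file that wires a source, a channel, and a sink together. Below is a minimal sketch, written out from Python; the agent and component names (a1, r1, c1, k1), the port, and the paths are illustrative.

    # netcat source -> in-memory channel -> HDFS sink
    conf = """
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    a1.channels.c1.type = memory

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /user/training/flume/events

    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    """
    with open("netcat-hdfs.conf", "w") as f:
        f.write(conf)
    # Start the agent with: flume-ng agent --name a1 --conf-file netcat-hdfs.conf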

Hive

  • Hive Overview and Use Cases
  • How Hive Differs from Relational Databases
  • Basic Hive Syntax
  • External and Managed Tables
  • Key Built-in Functions in Hive
  • Hive vs. HiveServer2
  • Hands-on Exercise
  • Partitioning: Static and Dynamic
  • Hive UDFs
  • Hands-on Exercises (see the sketch below)
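
To give a feel for the HiveQL covered here, a short sketch run through a Hive-enabled SparkSession (the course labs may use Beeline or the Hive shell instead); the table, schema, and paths are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

    # External table: dropping it removes only metadata, never the data files.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS orders (order_id INT, amount DOUBLE)
        PARTITIONED BY (order_date STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/user/training/orders'
    """)

    # Static partitioning: the partition value is spelled out explicitly.
    spark.sql("ALTER TABLE orders ADD IF NOT EXISTS PARTITION (order_date='2024-01-01')")

    # A built-in aggregate function over the partitioned table.
    spark.sql("SELECT order_date, SUM(amount) FROM orders GROUP BY order_date").show()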

HBase

  • HBase Overview
  • HBase Physical Architecture
  • HBase Table Fundamentals
  • Thinking About Table Design
  • The HBase Shell
  • HBase Schema Design
  • The HBase API
  • Hive on HBase
  • Hands-on Exercises (see the sketch below)
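
A sketch of basic HBase reads and writes using the happybase client, which talks to HBase through its Thrift gateway; the host, table, and column family are hypothetical.

    import happybase

    connection = happybase.Connection("hbase-host")   # Thrift gateway host
    table = connection.table("user_profiles")

    # HBase stores uninterpreted bytes, hence the b'' literals;
    # columns are addressed as b"family:qualifier".
    table.put(b"user1", {b"info:name": b"Asha", b"info:city": b"Pune"})
    print(table.row(b"user1"))

    # Scans walk a contiguous range of row keys.
    for key, data in table.scan(row_start=b"user0", row_stop=b"user9"):
        print(key, data)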

Pig

  • What Is Pig?
  • Pig's Features
  • Pig Use Cases
  • Interacting with Pig
  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly Used Functions
  • Hands-on Exercise: Using Pig for ETL Processing (see the sketch below)
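
A sketch of the kind of Pig Latin ETL script this exercise works toward, written out and launched from Python; the input path, schema, and output path are illustrative.

    import subprocess

    # Load CSV data, filter it, sort it, and store the result back to HDFS.
    script = """
    orders = LOAD '/user/training/orders' USING PigStorage(',')
             AS (order_id:int, order_date:chararray, amount:double);
    paid = FILTER orders BY amount > 0.0;
    by_amount = ORDER paid BY amount DESC;
    STORE by_amount INTO '/user/training/orders_clean' USING PigStorage(',');
    """
    with open("etl.pig", "w") as f:
        f.write(script)
    subprocess.run(["pig", "-f", "etl.pig"], check=True)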

Impala

  • What Is Impala?
  • How Impala Differs from Hive and Pig
  • How Impala Differs from Relational Databases
  • Limitations and Future Directions
  • Using the Impala Shell
  • Basic Syntax
  • Data Types
  • Filtering, Sorting, and Limiting Results
  • Joining and Grouping Data
  • Improving Impala Performance
  • Hands-on Exercise: Interactive Analysis with Impala (see the sketch below)
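
A sketch of interactive analysis through impyla, a Python DB-API client for Impala (the labs may use impala-shell instead); the host, tables, and columns are hypothetical, and 21050 is Impala's default HiveServer2-protocol port.

    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()

    # Filtering, joining, grouping, sorting, and limiting in one query.
    cur.execute("""
        SELECT c.city, COUNT(*) AS order_count, SUM(o.amount) AS total
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id
        WHERE o.amount > 100
        GROUP BY c.city
        ORDER BY total DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)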

Spark Basics

  • What Is Spark?
  • Why Spark?
  • Data Abstraction: RDDs
  • Logical Architecture of Spark
  • Programming Languages in Spark
  • Functional Programming with Spark
  • Hands-on Exercise (see the sketch below)
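
A minimal PySpark sketch of the RDD abstraction and the functional style: transformations are lazy, and actions trigger execution. The local master and sample data are for illustration only.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    nums = sc.parallelize(range(1, 11))       # build an RDD from a collection

    # map/filter are lazy transformations; nothing runs yet.
    squares_of_evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

    # collect/reduce are actions; they trigger the computation.
    print(squares_of_evens.collect())         # [4, 16, 36, 64, 100]
    print(nums.reduce(lambda a, b: a + b))    # 55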

Working with RDDs and Spark on YARN

  • Key RDD API Operations
  • Pair RDDs
  • MapReduce-Style Operations
  • Joining the Sqoop and Flume Data
  • Spark on YARN
  • YARN Client vs. YARN Cluster Mode
  • Hands-on Exercise (see the sketch below)
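
A sketch of a pair-RDD aggregation and join. The two in-memory datasets stand in for the Sqoop-imported and Flume-collected data; a real lab would load them from HDFS and submit the script to YARN.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "pair-rdd-demo")

    # (customer_id, amount) pairs, standing in for the Sqoop-imported table.
    orders = sc.parallelize([(1, 250.0), (1, 40.0), (2, 99.5)])
    # (customer_id, page) pairs, standing in for the Flume-collected events.
    clicks = sc.parallelize([(1, "home"), (2, "cart")])

    # reduceByKey: the classic MapReduce-style aggregation on a pair RDD.
    spend = orders.reduceByKey(lambda a, b: a + b)    # (1, 290.0), (2, 99.5)

    # join matches pairs by key across the two RDDs.
    print(spend.join(clicks).collect())   # e.g. [(1, (290.0, 'home')), (2, (99.5, 'cart'))]

    # On a cluster the same script would be submitted with, e.g.:
    # spark-submit --master yarn --deploy-mode cluster pair_rdd_demo.py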

Kafka

  • Kafka Overview
  • Kafka Architecture
  • Kafka Producer and Consumer APIs
  • Flume vs. Kafka
  • Hands-on Exercise (see the sketch below)
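
A producer/consumer round trip sketched with the kafka-python client; the broker address and topic name are illustrative.

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"page_view,user1")   # messages are raw bytes
    producer.flush()                              # block until delivery

    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",    # read the topic from the beginning
        consumer_timeout_ms=5000,        # stop iterating after 5s of silence
    )
    for msg in consumer:
        print(msg.topic, msg.partition, msg.offset, msg.value)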

Prerequisites

Participants should preferably have prior software development experience, along with a basic knowledge of SQL and Unix commands. Knowledge of Python or Scala is a plus.
