Hadoop and Spark Training for Analysts
Categories:
Big Data
The program focuses on processing Big Data using HDFS, HBase, Impala, Hive, Kudu, and Spark SQL. The points below provide a high-level overview of the course:
- Learn how to work with HDFS
- Understand the purpose of HBase and use HBase with other ecosystem projects
- Create tables, insert, read and delete data from HBase
- Get an all-round understanding of Kudu and its role in the Hadoop ecosystem
- Perform Interactive analysis using Hive and Impala
- Understand the role of Spark
- Use Spark SQL for analyzing data at scale
This program is designed for the following roles:
- Big Data Analysts
- Big Data Engineers
- Big Data Scientists
- Big Data Developers
Introduction to Hadoop and Spark Ecosystem
- Big Data Overview
- Key Roles in Big Data Project
- Key Business Use Cases
- Hadoop and Spark Logical Architecture
- Typical Big Data Project Pipeline
Basic Concepts of HDFS
- HDFS Overview
- Why HDFS?
- High-level architecture
- HDFS Commands
- Working with HUE
- Hands-on Exercise: The Hadoop Distributed File System
MapReduce v1/YARN Frameworks and Architectures
- Logical Architecture of MapReduce
- High-level Architecture of MRv1 and YARN
- Compare MRv1 vs. MRv2 on YARN
- Hands-on Exercise
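The module above contrasts MRv1 and YARN at the framework level, but the programming model is the same map-shuffle-reduce pipeline in both. As a conceptual sketch only (plain Python, no Hadoop involved, invented input lines), the three phases of a word count look like this:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(intermediate)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["the"], counts["fox"])  # 3 2
```

In a real job the mapper and reducer run on different nodes and the shuffle moves data across the network; here all three phases run in one process purely to show the data flow.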
HBase Introduction
- What Is HBase?
- Why Use HBase?
- Strengths of HBase
- HBase in Production
- Weaknesses of HBase
- Comparison of HBase with other products
- HBase vs. RDBMS
HBase Tables
- HBase Concepts
- HBase Table Fundamentals
- Thinking About Table Design
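One way to internalize the table-design topics above is to model HBase's logical layout directly. The toy sketch below (plain Python, invented row keys) treats a table as a map from row key to `family:qualifier` cells; real HBase additionally keeps rows sorted by key and versions every cell with a timestamp, which is omitted here:

```python
# Toy model of HBase's logical layout: row key -> {"family:qualifier": value}.
# Real HBase also sorts rows by key and timestamps each cell version.
table = {}

def put(row, family, qualifier, value):
    """Write one cell; creating the row implicitly, as HBase does."""
    table.setdefault(row, {})[f"{family}:{qualifier}"] = value

def get(row, family, qualifier):
    """Read one cell, or None if the row or column is absent."""
    return table.get(row, {}).get(f"{family}:{qualifier}")

def delete_row(row):
    """Remove an entire row and all of its cells."""
    table.pop(row, None)

put("user#1001", "info", "name", "Ada")
put("user#1001", "info", "city", "London")   # second column, same row
put("user#1002", "info", "name", "Linus")

print(get("user#1001", "info", "city"))  # London
delete_row("user#1002")
print(get("user#1002", "info", "name"))  # None
```

Note how columns are not declared up front: any `family:qualifier` can be written to any row, which is the schema flexibility the table-design module builds on.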
The HBase Shell
- Creating Tables with the HBase Shell
- Working with Tables
- Working with Table Data
Introduction to Impala and Kudu
- What is Impala?
- How Impala Differs from Hive and Pig
- How Impala Differs from Relational Databases
- Limitations and Future Directions
- Using the Impala Shell
- Why Kudu?
- Kudu Architecture
- Kudu Use Cases
- Comparing Kudu with other frameworks
- Impala with Kudu
Using Hive and Impala with HBase
- Using Hive and Impala with HBase
Introduction to Hive
- What Is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive vs. Pig
- Hive Use Cases
- Interacting with Hive
Relational Data Analysis with Hive
- Hive Databases and Tables
- Basic HiveQL Syntax
- Data Types
- Joining Data Sets
- Common Built-in Functions
- Hands-on Exercise: Running Hive Queries on the Shell, Scripts, and Hue
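HiveQL's core SELECT / JOIN / GROUP BY shape follows standard SQL closely. As a rough, runnable stand-in (Python's built-in sqlite3 rather than a Hive warehouse; the tables, columns, and rows are invented for illustration), the same query shape looks like this:

```python
import sqlite3

# Invented example tables; in Hive these would be warehouse tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO orders VALUES (1, 20.0), (1, 5.0), (2, 7.5);
""")

# The same join-and-aggregate shape you would write in HiveQL.
rows = conn.execute("""
    SELECT c.name, SUM(o.total) AS spent
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY spent DESC
""").fetchall()
print(rows)  # [('Ada', 25.0), ('Linus', 7.5)]
```

The difference in Hive is not the syntax but the execution: the same statement compiles to distributed jobs over HDFS data instead of running against a local file.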
Hive Data Management
- Hive Data Formats
- Creating Databases and Hive-Managed Tables
- Loading Data into Hive
- Altering Databases and Tables
- Self-Managed Tables
- Simplifying Queries with Views
- Storing Query Results
- Controlling Access to Data
- Hands-on Exercise: Data Management with Hive
Hive Optimization
- Understanding Query Performance
- Controlling Job Execution Plan
- Partitioning
- Bucketing
- Indexing Data
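Of the techniques above, bucketing is the easiest to miss conceptually: rows are routed to a fixed number of files by hashing the bucket column, so equal keys always land in the same file. That is the idea behind Hive's `CLUSTERED BY ... INTO N BUCKETS`. A toy sketch in plain Python (crc32 stands in for Hive's own hash function; the keys are invented):

```python
import zlib

NUM_BUCKETS = 4  # cf. CLUSTERED BY (user_id) INTO 4 BUCKETS

def bucket_for(key: str) -> int:
    # crc32 is a deterministic stand-in for Hive's hash function
    # (Python's built-in hash() is salted per process, so avoid it here).
    return zlib.crc32(key.encode()) % NUM_BUCKETS

buckets = {i: [] for i in range(NUM_BUCKETS)}
for user_id in ["u1", "u2", "u3", "u1", "u2"]:
    buckets[bucket_for(user_id)].append(user_id)

# Every occurrence of a given key sits in exactly one bucket, which is
# what lets the engine prune buckets and do bucketed map-side joins.
print({b: rows for b, rows in buckets.items() if rows})
```

Because the mapping from key to bucket is deterministic, a query filtering or joining on the bucket column only needs to read the matching bucket files.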
Extending Hive
- SerDes
- Data Transformation with Custom Scripts
- User-Defined Functions
- Parameterized Queries
- Hands-on Exercise: Data Transformation with Hive
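Hive UDFs are normally written in Java and registered before use. To illustrate just the concept (a scalar function you register with the engine and then call from SQL), the sketch below uses the analogous facility in Python's built-in sqlite3; the `initials` function and the table are invented for illustration:

```python
import sqlite3

def initials(name: str) -> str:
    """Custom scalar function: 'ada lovelace' -> 'A.L.'"""
    return ".".join(part[0].upper() for part in name.split()) + "."

conn = sqlite3.connect(":memory:")
# Register the function so SQL can call it by name, one argument.
conn.create_function("initials", 1, initials)

conn.execute("CREATE TABLE people (name TEXT)")
conn.execute("INSERT INTO people VALUES ('ada lovelace'), ('grace hopper')")
rows = conn.execute("SELECT initials(name) FROM people").fetchall()
print(rows)  # [('A.L.',), ('G.H.',)]
```

The Hive workflow is the same shape: implement the function, register it (`CREATE FUNCTION ...`), then use it inline in queries like any built-in.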
Spark Overview
- What is Spark?
- Why Spark?
- Data Abstraction – RDD
- Logical Architecture of Spark
- Programming Languages in Spark
- Functional Programming with Spark
- Hands-on Exercise
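The functional-programming style above, chaining transformations and then triggering work with an action, can be previewed without a cluster. Python generators are lazy in the same spirit as RDD transformations: in this sketch (plain Python, not Spark) nothing executes until `reduce`, the stand-in for the action, pulls data through the chain:

```python
from functools import reduce

data = range(1, 11)

# Lazy "transformations": nothing is computed yet.
mapped = (x * x for x in data)                 # cf. rdd.map(lambda x: x * x)
filtered = (x for x in mapped if x % 2 == 0)   # cf. rdd.filter(...)

# The "action" forces evaluation of the whole chain.
total = reduce(lambda a, b: a + b, filtered)   # cf. rdd.reduce(...)

print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

Spark adds what this sketch cannot show: the transformations are recorded as a lineage graph and executed in parallel across partitions, with recomputation from lineage on failure.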
Developing Spark Standalone Applications
- Spark Applications vs. Spark Shell
- Developing a Spark Application
- Hands-on Exercise
Spark SQL
- Introduction
- Why Spark SQL?
- Working with Zeppelin Notebook
- Role of Catalyst Optimizer
- Hive and Spark Integration
- DataFrame API
- Dataset API
- Joins
- Performing ad-hoc query analysis using Spark SQL
- Hands-on: DataFrame API
- Hands-on: Using Avro and Parquet with the DataFrame API
- Hands-on: Integrating Hive with Spark SQL
- Hands-on: Hive Partitioning Using Spark
- Hands-on: Snappy Compression
- Hands-on: SQL functions
Advanced Features of Spark
- Persistence
- Hands-on: Persistence
- Coalesce
- Accumulators
- Broadcasting
- Hands-on: Broadcasting
- Other optimization techniques
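Broadcasting is worth a concrete picture: rather than shuffling both sides of a join across the network, Spark ships the small table to every task, and each task joins its own partition by local lookup. A miniature version of that idea in plain Python, with invented names and data:

```python
# Broadcast-join idea in miniature (not Spark itself): the small side is
# copied to every task; the big side stays where it is, partitioned.
small_table = {1: "Ada", 2: "Linus"}       # "broadcast" lookup table
large_partitions = [                        # big side, two partitions
    [(1, 20.0), (2, 7.5)],
    [(1, 5.0), (3, 9.9)],                   # key 3 has no match
]

def join_partition(partition, lookup):
    """Each task joins its partition against the broadcast copy locally."""
    return [(lookup[uid], amount) for uid, amount in partition if uid in lookup]

joined = [row
          for part in large_partitions
          for row in join_partition(part, small_table)]
print(joined)  # [('Ada', 20.0), ('Linus', 7.5), ('Ada', 5.0)]
```

The saving is that no partition of the large side ever moves; only the small table is replicated, which is why broadcasting pays off exactly when one side is small.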
This is hands-on training, so it is advisable to have some basic knowledge of Hadoop, such as Hive queries and HDFS commands.