Hadoop and Spark Training

Develop applications using a unified, distributed application framework

Duration

4 Days

Level

Advanced Level

Design and Tailor this course

As per your team's needs

Overview

Spark is a fast, general-purpose cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, Spark ML for machine learning, and Spark Streaming for live data stream processing. With Spark running on Apache Hadoop YARN, developers can create applications that derive actionable insights from a single, shared dataset in Hadoop.

This training course will teach you how to solve Big Data problems using the Apache Spark framework. The training covers a wide range of Big Data use cases such as ETL, data warehousing (DWH), and streaming, and demonstrates how Spark integrates with other well-established Hadoop ecosystem products. You will learn the curriculum through theory lectures, live demonstrations, and lab exercises. The course is taught in Python, on Cloudera 7.1.4 with Spark 3.x.

Audience
  • Software Developer
  • Data Engineer
  • Data Scientist 
Course Outline
  • Big Data Overview
  • Key Roles in Big Data Project
  • Key Business Use cases
  • Hadoop Architecture
  • Typical Big Data Project Pipeline
  • Lambda Architecture
  • Kappa Architecture
  • HDFS Overview
  • Why HDFS?
  • HDFS High-level architecture 
  • HDFS Commands 
  • Hadoop 2 vs Hadoop 3
  • Hands-on Exercise
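
To make the HDFS commands concrete, here is a minimal sketch that drives the hdfs dfs CLI from Python; the paths and file names are hypothetical placeholders, not the course lab data.

    import subprocess

    def hdfs(*args):
        # Run an "hdfs dfs" subcommand and return its standard output.
        result = subprocess.run(["hdfs", "dfs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    hdfs("-mkdir", "-p", "/user/student/data")            # create a directory tree
    hdfs("-put", "orders.csv", "/user/student/data/")     # upload a local file
    print(hdfs("-ls", "/user/student/data"))              # list directory contents
    print(hdfs("-cat", "/user/student/data/orders.csv"))  # print a file's contents
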
  • Logical Architecture of MapReduce
  • Logical Architecture of YARN
  • High-level Architecture of MRv1 and YARN 
  • MRv1 vs. MRv2 on YARN
  • Hands-on Exercise
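
As a concrete illustration of the MapReduce programming model, here is a minimal word count in the Hadoop Streaming style; the script name and invocation are hypothetical, and Hadoop Streaming sorts the mapper output by key before it reaches the reducer.

    # wordcount.py (hypothetical name)
    # Mapper:  python wordcount.py map    < input.txt
    # Reducer: python wordcount.py reduce < sorted_map_output.txt
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(word + "\t1")                  # emit (word, 1)

    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            word, value = line.rstrip("\n").split("\t")
            if word != current and current is not None:
                print(current + "\t" + str(count))   # flush the finished key
                count = 0
            current = word
            count += int(value)
        if current is not None:
            print(current + "\t" + str(count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()
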
  • Apache Sqoop Overview
  • MapReduce as a Concept – Logical and Physical Architecture
  • Sqoop Capabilities
  • Ways to import data through Sqoop
  • Sqoop Commands
  • Hands-on Exercise
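
A representative Sqoop import, launched from Python for consistency with the rest of the course material; the JDBC URL, credentials, and table names are hypothetical placeholders.

    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/retail_db",  # source database (hypothetical)
        "--username", "student",
        "--password-file", "/user/student/.sqoop_pw",       # password kept in a file, not inline
        "--table", "orders",                                # table to import
        "--target-dir", "/user/student/orders",             # HDFS destination
        "--num-mappers", "4",                               # parallel map-only import tasks
    ], check=True)
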
  • What Is Hive?
  • Hive Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive vs. Pig
  • Hive Use Cases
  • Interacting with Hive
  • Hive Databases and Tables
  • Basic HiveQL Syntax
  • Data Types
  • Joining Data Sets
  • Common Built-in Functions
  • Hands-on Exercise
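
Since the course is taught in Python, basic HiveQL can be exercised through a Hive-enabled SparkSession; a sketch with hypothetical database and table names, where the customers table is assumed to already exist.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("hiveql-basics")
             .enableHiveSupport().getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS sales")
    spark.sql("""CREATE TABLE IF NOT EXISTS sales.orders
                 (order_id INT, customer_id INT, amount DOUBLE)""")

    # A join plus a common built-in aggregate function
    spark.sql("""SELECT c.name, SUM(o.amount) AS total
                 FROM sales.orders o
                 JOIN sales.customers c ON o.customer_id = c.customer_id
                 GROUP BY c.name""").show()
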
  • Hive Data Formats
  • Creating Databases and Hive-Managed Tables
  • Loading Data into Hive
  • Altering Databases and Tables
  • Self-Managed Tables
  • Storing Query Results
  • Hands-on Exercise
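
A sketch of the data-management operations above, using the same hypothetical sales database; note that dropping a self-managed (external) table removes only its metadata, not the underlying files.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("hive-data-mgmt")
             .enableHiveSupport().getOrCreate())

    # Self-managed (external) table: Hive tracks metadata, the data stays in place
    spark.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS sales.raw_orders
                 (order_id INT, amount DOUBLE)
                 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                 LOCATION '/user/student/raw_orders'""")

    # Move staged files into a Hive-managed table
    spark.sql("""LOAD DATA INPATH '/user/student/staging/orders.csv'
                 INTO TABLE sales.orders""")

    # Store query results as a new table (CTAS)
    spark.sql("""CREATE TABLE IF NOT EXISTS sales.big_orders AS
                 SELECT * FROM sales.orders WHERE amount > 1000""")
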
  • Understanding Query Performance
  • Partitioning
  • Bucketing
  • Indexing Data
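
Partitioning and bucketing in a nutshell: partitions map to directories that queries can skip entirely, while buckets hash rows into a fixed number of files. A sketch with hypothetical names:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("hive-performance")
             .enableHiveSupport().getOrCreate())

    # One directory per order_date value; matching queries scan only that directory
    spark.sql("""CREATE TABLE IF NOT EXISTS sales.orders_by_day
                 (order_id INT, amount DOUBLE)
                 PARTITIONED BY (order_date STRING)""")

    # Rows hashed on customer_id into 16 files, which helps joins and sampling
    spark.sql("""CREATE TABLE IF NOT EXISTS sales.orders_bucketed
                 (order_id INT, customer_id INT, amount DOUBLE)
                 CLUSTERED BY (customer_id) INTO 16 BUCKETS""")

    spark.sql("""SELECT SUM(amount) FROM sales.orders_by_day
                 WHERE order_date = '2024-01-01'""").show()   # partition pruning
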
  • What Is HBase?
  • Why Use HBase?
  • Strengths of HBase
  • HBase in Production
  • Weaknesses of HBase
  • Comparison of HBase with other products
  • HBase vs. RDBMS
  • HBase Concepts
  • HBase Table Fundamentals
  • Thinking About Table Design
  • Creating Tables with the HBase Shell
  • Working with Tables
  • Working with Table Data
  • Hands-on Exercise
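
The labs use the HBase shell; purely as an illustration, the same operations look like this through happybase, a Python Thrift client for HBase (the table and column names are hypothetical).

    import happybase

    conn = happybase.Connection("localhost")      # assumes a running HBase Thrift server
    conn.create_table("users", {"info": dict()})  # one column family named "info"
    table = conn.table("users")

    # Cells are addressed by (row key, column family:qualifier)
    table.put(b"row1", {b"info:name": b"Asha", b"info:city": b"Pune"})

    print(table.row(b"row1"))                     # point lookup by row key
    for key, data in table.scan(row_prefix=b"row"):
        print(key, data)                          # ordered range scan
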
  • Functional Programming
  • Passing Functions as Arguments
  • Anonymous Functions
  • Why Scala?
  • Basic Syntax
  • User-Defined Functions within Scala 
  • Hands-on Exercise
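
The functional ideas this module introduces in Scala (higher-order functions, anonymous functions) carry over directly to Python, the language the course is taught in; a minimal sketch:

    # A higher-order function: it takes another function as an argument
    def apply_twice(f, x):
        return f(f(x))

    double = lambda n: n * 2          # an anonymous function bound to a name
    print(apply_twice(double, 5))     # prints 20

    # Passing functions as arguments, the way Spark transformations do
    print(list(map(lambda n: n + 1, [1, 2, 3])))           # [2, 3, 4]
    print(list(filter(lambda n: n % 2 == 0, [1, 2, 3])))   # [2]
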
  • What Is Apache Spark? – The Story of Its Evolution from Hadoop
  • Why Spark?
  • Advantages of Spark over Hadoop MapReduce
  • Logical Architecture of Spark
  • Programming Languages in Spark 
  • Functional Programming with Spark
  • Lambda architecture for enterprise data and analytics services 
  • Data sources for Spark application
  • Hands-on Exercise
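
A first complete Spark program, the classic word count, run through a SparkSession; the input path is a hypothetical placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("first-spark-app").getOrCreate()
    sc = spark.sparkContext

    counts = (sc.textFile("hdfs:///user/student/data/words.txt")
                .flatMap(lambda line: line.split())     # one record per word
                .map(lambda word: (word, 1))            # pair up for counting
                .reduceByKey(lambda a, b: a + b))       # sum counts per word

    print(counts.take(10))                              # the action triggers execution
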
  • Which language to choose when? 
  • Why not Java?
  • Spark on Scala
  • Scala vs Python vs Java vs R
  • RDD – Basics
  • Spark Context and Spark Session
  • RDD Operations – Reading from Files, Transforming, and Persisting Results
  • Leveraging In-Memory Processing
  • Pair RDD Operations
  • Working with Semi-Structured Data Formats using Regex, JSON, and XML Libraries
  • MapReduce Operations
  • Joins
  • Hands-on Exercise
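
A sketch covering the core RDD workflow in this module: reading a file, transforming it into a pair RDD, joining, and saving; the file contents and paths are hypothetical.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("rdd-basics").getOrCreate().sparkContext

    # Assume input lines like "1,250.0" (order_id,amount)
    amounts = (sc.textFile("/user/student/orders.txt")
                 .map(lambda line: line.split(","))
                 .map(lambda f: (int(f[0]), float(f[1]))))   # pair RDD (id, amount)

    totals = amounts.reduceByKey(lambda a, b: a + b)     # MapReduce-style aggregation
    names = sc.parallelize([(1, "Asha"), (2, "Ravi")])   # small lookup pair RDD
    joined = totals.join(names)                          # (id, (total, name))

    joined.saveAsTextFile("/user/student/order_totals")  # persist results to HDFS
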
  • Anatomy of Spark jobs on YARN, Standalone and Mesos 
  • RDD partitions 
  • Spark Terminology: Narrow and Wide Operations, Shuffles, DAGs, Stages, and Tasks
  • Job metrics 
  • Fault Tolerance
  • Hands-on Exercise
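
Partitions and shuffle boundaries can be inspected directly from the shell; a small sketch:

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("spark-internals").getOrCreate().sparkContext

    rdd = sc.parallelize(range(1_000_000), numSlices=8)
    print(rdd.getNumPartitions())                   # 8 partitions

    narrow = rdd.map(lambda n: (n % 10, n))         # narrow: no data movement
    wide = narrow.reduceByKey(lambda a, b: a + b)   # wide: forces a shuffle

    # The lineage (DAG) shows the stage boundary the shuffle introduces
    print(wide.toDebugString().decode())
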
  • Overview
  • Spark on YARN
  • YARN client vs YARN cluster modes
  • Deployment modes – YARN, Standalone, Mesos 
  • The Spark Standalone Web UI
  • Hands-on Exercise
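
Cluster manager and deploy mode are normally chosen with spark-submit flags; as a sketch, the equivalent session-level settings in Python look like this (cluster deploy mode specifically must still be requested through spark-submit):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("yarn-demo")
             .master("yarn")                               # run executors on YARN
             .config("spark.submit.deployMode", "client")  # driver stays on the gateway host
             .getOrCreate())
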
  • Spark Applications vs. Spark Shell
  • Developing a Spark Application
  • Modifying Spark Key Configurations
  • Hands-on Exercise
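
The skeleton of a stand-alone application, as opposed to interactive shell work; the file name and input path are hypothetical, and the script would be launched with spark-submit.

    # app.py (hypothetical name); run with: spark-submit app.py
    from pyspark.sql import SparkSession

    def main():
        spark = (SparkSession.builder
                 .appName("my-etl-app")
                 .config("spark.executor.memory", "2g")   # one of the key configurations
                 .getOrCreate())
        df = spark.read.csv("/user/student/data", header=True, inferSchema=True)
        df.groupBy("category").count().show()
        spark.stop()                                      # release cluster resources

    if __name__ == "__main__":
        main()
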
  • Caching
  • Distributed Persistence
  • Storage Levels
  • Broadcast Variables with RDDs
  • Accumulators with RDDs
  • Hands-on Exercise
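
Caching, broadcast variables, and accumulators together in one sketch; the lookup data and input path are hypothetical.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("shared-vars").getOrCreate().sparkContext

    rdd = sc.textFile("/user/student/orders.txt")
    rdd.persist(StorageLevel.MEMORY_AND_DISK)       # spill to disk if memory is tight

    names = sc.broadcast({1: "Asha", 2: "Ravi"})    # read-only copy shipped to executors
    bad_lines = sc.accumulator(0)                   # counter written from executors

    def parse(line):
        try:
            order_id, amount = line.split(",")
            return (names.value.get(int(order_id), "unknown"), float(amount))
        except ValueError:
            bad_lines.add(1)                        # count unparseable records
            return ("invalid", 0.0)

    totals = rdd.map(parse).reduceByKey(lambda a, b: a + b).collect()
    print(totals, "bad lines:", bad_lines.value)
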
  • Spark SQL Overview
  • Spark SQL Documentation
  • Sub-modules of Spark SQL
  • Differences between RDDs and DataFrames
  • Loading and Processing Data into DataFrames
  • DataFrame API
  • DataFrame Operations
  • Saving DataFrames to File Systems
  • DataFrame Internals that Make It Fast – the Catalyst Optimizer and Tungsten
  • Hands-on Exercise
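
Loading data into a DataFrame, transforming it, and peeking at the Catalyst-optimized plan; the paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

    df = (spark.read.option("header", True).option("inferSchema", True)
          .csv("/user/student/orders.csv"))

    result = (df.filter(df.amount > 100)          # DataFrame operations compile to a plan
                .groupBy("customer_id")
                .sum("amount"))

    result.explain()                              # the physical plan produced by Catalyst
    result.write.mode("overwrite").parquet("/user/student/order_summary")
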
  • Processing JSON data
  • Binary Data Processing
  • Why Binary Data Processing?
  • Comparison of Various Data Formats
  • Working with Parquet
  • Working with Avro
  • Hands-on Exercise
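
Moving the same data between text and binary formats; writing Avro assumes the spark-avro package is on the classpath, and all paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-formats").getOrCreate()

    events = spark.read.json("/user/student/events.json")   # text, self-describing

    # Parquet: columnar binary format, good for analytical scans
    events.write.mode("overwrite").parquet("/user/student/events_parquet")

    # Avro: row-oriented binary format, good for record-at-a-time pipelines
    events.write.mode("overwrite").format("avro").save("/user/student/events_avro")

    # The schema travels with the data in both binary formats
    spark.read.parquet("/user/student/events_parquet").printSchema()
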
  • Hive Context vs Spark SQL Context 
  • Integrating Spark SQL with Hive
  • Working with Hive Tables 
  • Working with JDBC data source 
  • Data Formats – Text Formats such as CSV, JSON, and XML; Binary Formats such as Parquet and ORC
  • UDFs in Spark DataFrames
  • Spark SQL as a JDBC Service – Benefits and Limitations
  • Analytical queries in Spark
  • Hands-on Exercise
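
Hive tables, a JDBC source, and a Python UDF combined in one session; the connection details, table names, and threshold are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder.appName("spark-sql-integration")
             .enableHiveSupport().getOrCreate())

    hive_df = spark.table("sales.orders")          # an existing Hive table

    jdbc_df = (spark.read.format("jdbc")           # pull a dimension table over JDBC
               .option("url", "jdbc:mysql://dbhost:3306/retail_db")
               .option("dbtable", "customers")
               .option("user", "student").option("password", "secret")
               .load())

    tier = udf(lambda amount: "high" if amount > 1000 else "low", StringType())
    (hive_df.join(jdbc_df, "customer_id")
            .withColumn("tier", tier(hive_df.amount))
            .show())
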
Prerequisites

Following are the prerequisites for the course. 

  • Programming knowledge in Python is required
  • Basic knowledge of Big Data use cases
  • Basic knowledge of databases, OLAP/OLTP use cases, and SQL
  • Knowledge of the Java stack (JVM) is helpful
