Advanced Course on Hadoop, Spark and Kafka for Non-Developers

Categories:
Big Data

This is an advanced training course on key Big Data projects: YARN, Hive, HBase, Spark Core, Spark SQL, Spark Streaming, Kafka Core, Kafka Connect, Kafka Streams, NiFi, Druid and Apache Atlas. During the course, participants will learn the Scala programming language for implementing Spark programs. The course uses HDP 3.x along with Zeppelin notebooks for performing interactive data exploration through Spark.

This intensive training course combines lectures with hands-on labs that help participants build theoretical knowledge and gain practical experience with the above open source projects.

Topics

  • Overview
  • YARN Overview, Architecture and Concepts
  • Intermediate to Advanced Hive
  • In-depth HBase
  • Druid
  • Scala for Spark
  • Spark Core
  • Spark SQL
  • Spark Streaming
  • Kafka Core
  • Kafka Connect
  • Kafka Streams
  • NiFi
  • Atlas

The intended audience for this course:

  • Architects
  • Analysts
  • Team Leads
Overview
  • Course Overview
  • Lab Environment Walkthrough
  • Hadoop Ecosystem Overview
  • Key Big Data Architectures
  • What’s new in Hadoop 3?
Distributed OS - YARN
  • Role of Distributed OS
  • Responsibilities of YARN
  • YARN Architecture
  • YARN Concepts
  • YARN Timeline Service
  • Hands-on: Experience running an application on YARN
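
As a taste of the YARN hands-on, here is a minimal sketch of a Spark application suitable for submission to YARN. The object name, app name and jar name are invented for illustration; note that the master is supplied at submit time rather than hard-coded.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal Spark application; the master is deliberately not hard-coded,
    // so spark-submit can target YARN (or any other cluster manager).
    object YarnDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("yarn-demo")
        val sc = new SparkContext(conf)
        // Trivial distributed computation: sum 1..1000 across the executors.
        println(sc.parallelize(1 to 1000).sum())
        sc.stop()
      }
    }

    // Submitted to YARN with something like:
    //   spark-submit --master yarn --deploy-mode cluster --class YarnDemo yarn-demo.jar
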
Hive
  • Hands-on: Hive exercise to bring everyone onto the same page
  • Hive LLAP Overview
  • Why Hive LLAP?
  • Hive LLAP Architecture
  • Data Formats
  • HDFS file layout using ORC and Snappy
  • Hive with various Data Formats
  • Working with compressed data
  • Converting data from one format to another
  • Partitioning
  • Joining
  • Bucketing
  • Indexing
  • De-Duplication
  • Processing Semi-structured data
  • Extending Hive using User Defined Functions
  • Hands-on: Advanced exercises
  • Summary
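
To give a flavour of the partitioning and data-format topics, here is a hedged sketch of creating an ORC-backed, partitioned Hive table and loading it with dynamic partitioning. The table and column names are invented, and the statements are issued through Spark's Hive integration (the spark session available in Zeppelin):

    // Allow dynamic-partition inserts without naming a static partition.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Partitioned table stored as ORC with Snappy compression.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_orc (
        order_id BIGINT,
        amount   DOUBLE
      )
      PARTITIONED BY (order_date STRING)
      STORED AS ORC
      TBLPROPERTIES ('orc.compress' = 'SNAPPY')
    """)

    // Hive routes each row to its partition based on order_date.
    spark.sql("""
      INSERT INTO TABLE sales_orc PARTITION (order_date)
      SELECT order_id, amount, order_date FROM sales_staging
    """)
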
In-depth HBase
  • Why HBase?
  • Limitations of HBase
  • Physical Architecture
  • Schema Design
  • Performance Optimization Techniques
  • Hands-on Exercises
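
Row key design is the heart of the HBase material. Below is a hedged sketch of one common technique, salting a monotonically increasing key to avoid region hot-spotting; the table name, column family and bucket count are invented:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("events"))

    // Prefix the key with a salt byte so sequential IDs spread across regions.
    def saltedKey(eventId: Long, buckets: Int = 16): Array[Byte] =
      Bytes.add(Array((eventId % buckets).toByte), Bytes.toBytes(eventId))

    val put = new Put(saltedKey(42L))
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."))
    table.put(put)

    table.close()
    connection.close()
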
Druid
  • Druid Introduction
  • Why Druid?
  • Physical Architecture
  • Working with Druid
  • Performance Optimization Techniques
  • Hands-on Exercises
Apache Spark Overview
  • Apache Spark – The Unified Platform
  • The Spark Platform
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop’s MapReduce
  • Spark vs MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Spark Streaming (Micro-batching)
  • Continuous Applications
  • Spark SQL
  • Example of Spark SQL
  • Spark Machine Learning Library
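
The "Spark vs MapReduce" comparison is easiest to see in code. Here is the classic word count as a hedged sketch, a single chain of transformations where MapReduce would need explicit map and reduce jobs; the input path is invented and sc is the shell's SparkContext:

    val counts = sc.textFile("hdfs:///data/docs")   // RDD of lines
      .flatMap(_.split("\\s+"))                     // lines -> words
      .map(word => (word, 1))                       // pair RDD
      .reduceByKey(_ + _)                           // shuffle + aggregate

    counts.take(10).foreach(println)                // action triggers execution
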
Interactive Data Exploration
  • The Spark Shell
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and SQL Context (SQLContext)
  • The Shell Spark Context
  • Loading Files
  • Saving Files
  • Basic Spark ETL Operations
  • Summary
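
A hedged sketch of the kind of shell session this module walks through, loading a file, applying a basic ETL filter and saving the result; the paths and log format are invented:

    val raw = sc.textFile("hdfs:///data/access.log")

    // Keep only error lines and project out the first (timestamp) field.
    val errors = raw.filter(_.contains("ERROR"))
                    .map(_.split("\\s+")(0))

    errors.saveAsTextFile("hdfs:///out/error-timestamps")
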
Spark RDD
  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Custom RDDs
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • Checkpointing RDDs
  • Local Checkpointing
  • Parallelized Collections
  • More on parallelize() Method
  • The Pair RDD
  • Where do I use Pair RDDs?
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • RDD Persistence
  • The Tachyon Storage
  • Summary
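
Two of the ideas above, lazy lineage and caching, fit in a few lines. A hedged sketch (the data is synthetic):

    val base = sc.parallelize(1 to 1000000)
    val squares = base.filter(_ % 2 == 0)      // transformations are lazy:
      .map(x => x.toLong * x)                  // they only extend the lineage

    squares.cache()                 // materialize in memory on first action
    println(squares.count())        // first action computes and caches
    println(squares.sum())          // second action is served from the cache

    println(squares.toDebugString)  // prints the RDD lineage graph
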
Shared Variables in Spark
  • Shared Variables in Spark
  • Broadcast Variables
  • Creating and Using Broadcast Variables
  • Example of Using Broadcast Variables
  • Accumulators
  • Creating and Using Accumulators
  • Example of Using Accumulators
  • Custom Accumulators
  • Summary
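
A hedged sketch combining both kinds of shared variable, a broadcast lookup table read on the executors and an accumulator counting bad records back on the driver; the data and names are invented:

    val lookup = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))
    val badRecords = sc.longAccumulator("badRecords")

    val resolved = sc.parallelize(Seq("US", "DE", "??")).map { code =>
      lookup.value.getOrElse(code, { badRecords.add(1); "unknown" })
    }

    resolved.collect().foreach(println)          // action runs the job
    println(s"Bad records: ${badRecords.value}") // read on the driver
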
Parallel Data Processing with Spark
  • Running Spark on a Cluster
  • Spark Stand-alone Option
  • The High-Level Execution Flow in Stand-alone Spark Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The “Big Picture”
  • Summary
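
Partitioning and stage boundaries can be inspected directly from the shell. A hedged sketch (the path is invented):

    val rdd = sc.textFile("hdfs:///data/big.csv")
    println(rdd.getNumPartitions)       // one task per partition per stage

    val wide = rdd.repartition(16)      // full shuffle into 16 partitions
    val narrow = wide.coalesce(4)       // shrink without a full shuffle

    // reduceByKey introduces a shuffle, i.e. a stage boundary: everything
    // before it pipelines into one stage, everything after into another.
    val counts = narrow.map(line => (line.length % 10, 1)).reduceByKey(_ + _)
    println(counts.toDebugString)       // shows the stage/shuffle structure
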
Working with Spark SQL
  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • Spark SQL is No Longer an Experimental Developer API!
  • What is a DataFrame?
  • The SQLContext Object
  • The SQLContext API
  • Changes Between Spark SQL 1.3 and 1.4
  • Example of Spark SQL (Scala Example)
  • Example of Working with a JSON File
  • Example of Working with a Parquet File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance & Scalability of Spark SQL
  • Summary
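
A hedged sketch of the DataFrame workflow covered here, reading JSON, querying it with SQL and writing Parquet. It uses the Spark 2.x SparkSession (spark) that HDP 3.x provides in the shell and in Zeppelin; the path and columns are invented:

    val people = spark.read.json("hdfs:///data/people.json")
    people.printSchema()                       // schema inferred from the JSON

    people.createOrReplaceTempView("people")   // expose to SQL
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    // Round-trip through Parquet, a common columnar interchange format.
    adults.write.mode("overwrite").parquet("hdfs:///out/adults.parquet")
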
Enterprise Kafka
  • Kafka Overview
  • Kafka vs Other Messaging Systems
  • Kafka Terms
  • Kafka Use Cases
  • Introduction to the Confluent Platform
  • Why Confluent Kafka?
  • Kafka+
  • Key Products
  • Configuration Files
  • Using Kafka Command Line Client Tools
  • Summary
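
The command line client tools have a programmatic counterpart. Below is a hedged sketch of a minimal producer using the standard Kafka clients library; the broker address and topic name are invented:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord("events", "key-1", "hello kafka"))
    producer.flush()
    producer.close()
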
Integrating Kafka with Other Systems
  • Introduction to Kafka Integration
  • Kafka Connect
  • Kafka Connect (Contd.)
  • Running Kafka Connect
  • Key Configurations for Connect Workers
  • Kafka Connect API
  • Kafka Connect Example – File Source
  • Kafka Connect Example – File Sink
  • Kafka Connector Example – MySQL to Elasticsearch
  • Kafka Connector Example – MySQL to Elasticsearch (Contd.)
  • Write the data to Elasticsearch
  • Building Custom Connectors
  • Kafka Connect – Connectors
  • Kafka Connect – Tasks
  • Kafka Connect – Workers
  • Kafka Connect – Workers (Contd.)
  • Kafka Connect – Converters and Connect’s data model
  • Kafka Connect – Offset management
  • Alternatives to Kafka Connect
  • Alternatives to Kafka Connect (Contd.)
  • Introduction to Hadoop
  • Hadoop Components
  • Integrating Hadoop with Kafka
  • Hadoop Consumers
  • Hadoop Consumers – Produce Topic
  • Hadoop Consumers – Fetch Generated Topic
  • Summary
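
Connect itself is driven by configuration rather than code. A hedged sketch of a standalone file-source connector configuration in the shape of the stock quickstart (the file and topic names are invented); it is passed to the connect-standalone tool together with a worker properties file:

    # file-source.properties: tail a file into a Kafka topic
    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=/tmp/input.txt
    topic=file-lines
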
Schema Management in Kafka
  • Introduction to Schema Registry
  • Quick introduction to Data Formats
  • An Introduction to Avro
  • Avro Schemas
  • Using the Schema Registry
  • Summary
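
A hedged sketch of what an Avro schema looks like and how it is parsed programmatically; the record and field names are invented:

    import org.apache.avro.Schema

    val schemaJson =
      """{
        |  "type": "record",
        |  "name": "Customer",
        |  "namespace": "com.example",
        |  "fields": [
        |    {"name": "id",    "type": "long"},
        |    {"name": "email", "type": ["null", "string"], "default": null}
        |  ]
        |}""".stripMargin

    val schema = new Schema.Parser().parse(schemaJson)
    println(schema.getFullName)   // com.example.Customer
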
Kafka Stream Processing
  • Why Kafka Streams?
  • Kafka Streams Fundamentals
  • Investigating a Kafka Streams Application
  • KSQL for Apache Kafka
  • Writing KSQL Queries
  • Summary
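
A hedged sketch of the canonical Kafka Streams word count, written in the Scala DSL that ships with Kafka 2.x (kafka-streams-scala); the topic names are invented:

    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilder()
    builder.stream[String, String]("lines-in")
      .flatMapValues(_.toLowerCase.split("\\W+"))   // line -> words
      .groupBy((_, word) => word)                   // re-key by word (shuffle)
      .count()                                      // stateful count per word
      .toStream
      .to("word-counts")                            // stream of updated counts

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
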
Spark Streaming
  • What is Spark Streaming?
  • Spark Streaming as Micro-batching
  • Use Cases
  • Some “Competition”
  • Spark Streaming Features
  • How It Works
  • Basic Data Stream Sources
  • Advanced Data Stream Sources
  • The DStream Object
  • DStream – RDD Diagram
  • The Operational DStream API
  • DStream Output Operations
  • StreamingContext Object
  • TCP Text Streams Example (in Scala)
  • Accessing the Underlying RDDs
  • The Sliding Window Concept
  • The Sliding Window Diagram
  • The Window Operations
  • A Windowed Computation Example (Scala)
  • Analyzing streaming data using Spark
  • Stateless Streaming
  • Stateful Streaming
  • Structured Streaming
  • Summary
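
The DStream API in a nutshell: a hedged sketch of the classic word count over a TCP text source (the host and port are invented; sc is the shell's SparkContext):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))  // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()           // output operation, runs once per batch

    ssc.start()              // start receiving and processing
    ssc.awaitTermination()   // block until the stream is stopped
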
NiFi
  • Introduction to NiFi
  • Capabilities of NiFi
  • Architecture of NiFi
  • Components in NiFi
  • Working with NiFi
  • Summary
Data Governance using Atlas
  • Overview of Atlas
  • Capabilities of Atlas
  • Metadata Management using Atlas
  • Data Governance using Atlas
  • End-to-end Data Lineage
  • Summary
Lab Exercises
  • Lab 1. Learning the Lab Environment
  • Lab 2. Running application on YARN
  • Lab 3. Hive Hands-on Exercise
  • Lab 4. Table Partitioning in Hive
  • Lab 5. HBase Row Key Design Exercise
  • Lab 6. Working with Druid
  • Lab 7. The Spark Shell
  • Lab 8. Spark ETL and HDFS Interface
  • Lab 9. Using Broadcast Variables
  • Lab 10. Using Accumulators
  • Lab 11. Common Map / Reduce Programs in Spark
  • Lab 12. Spark SQL
  • Lab 13. Kafka (multiple exercises)
  • Lab 14. Spark Streaming: Part 1
  • Lab 15. Spark Streaming: Part 2
  • Lab 16. Integrating Kafka and Spark Streaming
  • Lab 17. Building Dataflow Pipelines using NiFi
  • Lab 18. Working with Atlas

Participants should have general knowledge of programming and SQL, as well as experience working in Unix environments (e.g. running shell commands). Participants should also be familiar with HDFS and Hive basics.

Course Information

Duration

7 Days

Mode of Delivery

Instructor-led / Virtual
