Big Data Ingestion and Processing
Ingest, Store, Process and Analyze data using Real-time Big Data Pipelines
Duration
4 Days
Level
Intermediate Level
Design and Tailor this course
As per your team needs
Big Data ingestion involves connecting to various data sources, extracting the data, and detecting changed data. In other words, data ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. In a Big Data processing system, the ingested data is then processed and classified as it flows through the rest of the pipeline.
- The program is focused on Data Ingestion and Processing using Sqoop, Flume, Kafka and Spark Streaming. This program covers:
- Flume and Sqoop Fundamentals
- Architectures of Flume, Sqoop
- Kafka Fundamentals, Architecture, API, Kafka Connect, Kafka Streams, Spark Micro-batch processing and Structured Streaming Processing
- Hands-on exercises related to the Kafka APIs will be in Java/Scala; Scala will be used for the Spark-related exercises
The intended audience for this course:
- Application Developers
- DevOps Engineers
- Architects
- System Engineers
- Technical Managers
- Data Ingestion Overview
- Key Ingestion Frameworks
- Key Business Use cases
- Typical Big Data Project Pipeline
- Sqoop Basics
- Sqoop Internals
- Sqoop 1 vs Sqoop 2
- Key Sqoop Commands
- Hands-on Exercise
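As a brief illustration of the Sqoop commands covered above, the sketch below imports a relational table into HDFS. The connection string, credentials, table name and target directory are placeholders for the lab environment.

```
# Hypothetical example: import the "orders" table from MySQL into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```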
- Flume Overview
- Physical Architectures of Flume
- Source, Sink and Channel
- Building Data Pipeline using Flume
- Hands-on Exercise
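The source, channel and sink topics above map directly onto a Flume agent definition. Below is a minimal configuration sketch of one such pipeline; the agent name, log path and HDFS directory are assumptions for illustration.

```
# agent1 tails an application log and lands the events in HDFS
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type     = exec
agent1.sources.src1.command  = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 1000

agent1.sinks.sink1.type      = hdfs
agent1.sinks.sink1.hdfs.path = /data/flume/events
agent1.sinks.sink1.channel   = ch1
```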
- Kafka Overview
- Salient Features of Kafka
- Kafka Use cases
- Comparing Kafka with other Key tools
- Logical Architecture of Kafka
- Physical Architecture of Kafka
- Partitions
- Topics
- Replicas
- Producers & Consumers
- Brokers
- Roles and Responsibilities of various components
- Replication mechanism
- Message Delivery Semantics
- Key Terminologies
- Key configuration settings of Brokers, Producers, Consumers etc.
- Hands-on exercises
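To make the topic, partition and replica concepts above concrete, here is a sketch using the standard kafka-topics.sh tool. The topic name, counts and broker address are assumptions; older Kafka releases use --zookeeper instead of --bootstrap-server.

```
# Create a topic with 3 partitions, each replicated on 2 brokers
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --topic orders --partitions 3 --replication-factor 2

# Show partition leaders and in-sync replicas (ISR)
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic orders
```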
- Role of Zookeeper
- Zookeeper Basic Operations
- Apache Kafka – Zookeeper Role
- End to End Data Pipeline using Kafka
- Kafka Connect
- Integrate Kafka with Spark
- Hands-on Exercises
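As a small illustration of Kafka Connect from the topics above, the sketch below is a standalone file-source connector that streams lines of a file into a topic. The connector name, file path and topic are assumptions for the exercise environment.

```
# my-file-source.properties (run with bin/connect-standalone.sh)
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/ingest-demo.txt
topic=connect-demo
```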
- Overview
- Producer API
- Sync Producers
- Async Producers
- Message Acknowledgement
- Batching Messages
- Keyed and Non-Keyed Messages
- Compression
- Batching
- Consumer API
- Hands-on Exercises
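A minimal Scala sketch of the Producer API topics listed above (asynchronous send, acknowledgements, batching and compression). The topic name, broker address and payload are assumptions for the exercise environment.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SimpleProducer extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.ACKS_CONFIG, "all")                 // wait for all in-sync replicas
  props.put(ProducerConfig.LINGER_MS_CONFIG, "10")             // allow small batches to form
  props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")  // compress each batch

  val producer = new KafkaProducer[String, String](props)
  // Keyed message: records with the same key always land on the same partition
  val record = new ProducerRecord[String, String]("orders", "order-42", """{"id":42,"amount":99.5}""")
  producer.send(record)   // asynchronous send; returns a Future with the record metadata
  producer.flush()
  producer.close()
}
```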
- What is Spark?
- Why Spark?
- Data Abstraction – RDD
- Logical Architecture of Spark
- Programming Languages in Spark
- Functional Programming with Spark
- Hands-on Exercise
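A short Scala sketch of the RDD abstraction and functional style referred to above; the input strings and the local master are assumptions for a standalone run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountRDD extends App {
  val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // Build an RDD from a local collection and apply functional transformations
  val lines = sc.parallelize(Seq("big data ingestion", "data processing with spark"))
  val counts = lines
    .flatMap(_.split("\\s+"))       // transformation: split lines into words
    .map(word => (word, 1))         // transformation: pair each word with 1
    .reduceByKey(_ + _)             // transformation: sum the counts per word

  counts.collect().foreach(println) // action: triggers the computation
  sc.stop()
}
```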
- Introduction
- Dataframe API
- Performing ad-hoc query analysis using Spark SQL
- Hands-on Exercises
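A small Scala sketch of the DataFrame API and ad-hoc SQL analysis mentioned above; the in-memory dataset stands in for ingested data and is purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

object AdhocQuery extends App {
  val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // A tiny in-memory DataFrame standing in for an ingested dataset
  val orders = Seq((1, "books", 25.0), (2, "games", 60.0), (3, "books", 15.0))
    .toDF("id", "category", "amount")

  // DataFrame API
  orders.groupBy("category").sum("amount").show()

  // Equivalent ad-hoc SQL over a temporary view
  orders.createOrReplaceTempView("orders")
  spark.sql("SELECT category, SUM(amount) AS total FROM orders GROUP BY category").show()

  spark.stop()
}
```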
- Analyzing streaming data using Spark
- Stateless Streaming
- Hands-on: Stateless Streaming
- Stateful Streaming
- Hands-on: Stateful Streaming
- Structured Streaming
- Hands-on exercises
- Hands-on: Integrating Kafka with Spark Streaming
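A sketch of integrating Kafka with Spark Structured Streaming, as in the hands-on above. It assumes the spark-sql-kafka-0-10 package is on the classpath; the topic name and broker address are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToConsole extends App {
  val spark = SparkSession.builder()
    .appName("kafka-structured-streaming-demo")
    .master("local[*]")              // local run for the exercise environment
    .getOrCreate()
  import spark.implicits._

  // Read the "orders" topic as an unbounded DataFrame (key/value arrive as binary)
  val raw = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "latest")
    .load()

  // Cast the payload to strings and count records per key (stateful aggregation)
  val counts = raw.selectExpr("CAST(key AS STRING) AS k", "CAST(value AS STRING) AS v")
    .groupBy($"k")
    .count()

  val query = counts.writeStream
    .outputMode("complete")          // emit the full result table on each trigger
    .format("console")
    .start()

  query.awaitTermination()
}
```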
- Overview
- What is Kafka Streams
- Why Kafka Streams
- Kafka Streams Architecture
- Hands-on Exercise
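A minimal sketch of a Kafka Streams topology for the topics above, written in Scala against the Java Streams API; the application id and topic names are assumptions.

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{KStream, ValueMapper}

object UppercaseStream extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo")   // also the consumer group id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  val lines: KStream[String, String] = builder.stream[String, String]("input-events")

  // Transform each record value and write the result to an output topic
  lines.mapValues(new ValueMapper[String, String] {
    override def apply(value: String): String = value.toUpperCase
  }).to("output-events")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```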
Participants should preferably have prior software development experience along with basic knowledge of SQL and Unix commands. Knowledge of Python/Scala would be a plus.