Big Data Ingestion and Processing

Ingest, store, process, and analyze data using real-time Big Data pipelines

Duration

4 Days

Level

Intermediate

Design and Tailor this course

As per your team's needs

Overview

Big Data ingestion involves connecting to various data sources, extracting the data, and detecting changed data; in other words, data ingestion means taking data from multiple sources and landing it where it can be accessed and processed. A Big Data processing system then processes the ingested data and organizes the data flow for the downstream layers.

The program is focused on data ingestion and processing using Sqoop, Flume, Kafka, and Spark Streaming. It covers:

  • Flume and Sqoop fundamentals
  • Architectures of Flume and Sqoop
  • Kafka fundamentals, architecture, APIs, Kafka Connect, and Kafka Streams
  • Spark micro-batch processing and Structured Streaming
  • Hands-on exercises: the Kafka API exercises are in Java/Scala, and Scala is used for the Spark-related exercises

The intended audience for this course:

  • Application Developers
  • DevOps Engineers
  • Architects
  • System Engineers
  • Technical Managers
Course Outline
  • Data Ingestion Overview
  • Key Ingestion Frameworks
  • Key Business Use cases
  • Typical Big Data Project Pipeline
  • Sqoop Basics
  • Sqoop Internals
  • Sqoop 1 vs Sqoop 2
  • Key Sqoop Commands
  • Hands-on Exercise
  • Flume Overview
  • Physical Architectures of Flume
  • Source, Sink and Channel
  • Building Data Pipeline using Flume
  • Hands-on Exercise
  • Kafka Overview
  • Salient Features of Kafka
  • Kafka Use cases
  • Comparing Kafka with other key tools
  • Logical Architecture of Kafka
  • Physical Architecture of Kafka
    • Partitions
    • Topics
    • Replicas
    • Producers & Consumers
    • Brokers
  • Roles and Responsibilities of various components
  • Replication mechanism
  • Message Delivery Semantics
  • Key Terminologies
  • Key configuration settings of Brokers, Producers, Consumers, etc.
  • Hands-on exercises
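To make the broker, topic, partition, and replica terminology above concrete, here is a minimal Scala sketch (an illustration, not part of the official courseware) that creates a topic through Kafka's AdminClient API. The broker address, topic name, and partition/replica counts are placeholder assumptions.

  import java.util.{Collections, Properties}
  import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

  object CreateTopicSketch {
    def main(args: Array[String]): Unit = {
      val props = new Properties()
      props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker

      val admin = AdminClient.create(props)
      try {
        // 3 partitions, replication factor 2: each partition gets one leader and one follower replica
        val topic = new NewTopic("events", 3, 2.toShort)
        admin.createTopics(Collections.singleton(topic)).all().get()
        println("Topic created")
      } finally {
        admin.close()
      }
    }
  }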
  • Role of Zookeeper
  • Zookeeper Basic Operations
  • Apache Kafka – Zookeeper Role
  • End to End Data Pipeline using Kafka
  • Kafka Connect
  • Integrate Kafka with Spark
  • Hands-on Exercises
  • Overview
  • Producer API
    • Sync Producers
    • Async Producers
    • Message Acknowledgement
    • Batching Messages
    • Keyed and Non-Keyed Messages
    • Compression
    • Batching
  • Consumer API
  • Hands-on Exercises
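As an illustration of the Producer API topics above (synchronous sends, acknowledgements, keyed messages, compression), the sketch below uses the standard Kafka Java client from Scala. The broker address, topic, key, and value are placeholders.

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
  import org.apache.kafka.common.serialization.StringSerializer

  object SimpleProducerSketch {
    def main(args: Array[String]): Unit = {
      val props = new Properties()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
      props.put(ProducerConfig.ACKS_CONFIG, "all")                 // wait for full acknowledgement
      props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip")    // compress batched messages

      val producer = new KafkaProducer[String, String](props)
      try {
        // Keyed message: records with the same key always go to the same partition
        val record = new ProducerRecord[String, String]("events", "user-42", "clicked checkout")
        val metadata = producer.send(record).get()                 // .get() makes the send synchronous
        println(s"Written to partition ${metadata.partition()} at offset ${metadata.offset()}")
      } finally {
        producer.close()
      }
    }
  }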
  • What is Spark?
  • Why Spark?
  • Data Abstraction – RDD
  • Logical Architecture of Spark
  • Programming Languages in Spark
  • Functional Programming with Spark
  • Hands-on Exercise
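The Spark fundamentals above can be sketched with a short Scala example of the RDD abstraction and functional style; the input path is a placeholder.

  import org.apache.spark.sql.SparkSession

  object RddBasicsSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("rdd-basics").getOrCreate()
      val sc = spark.sparkContext

      val lines = sc.textFile("hdfs:///data/events.txt")   // RDD[String], placeholder path
      val counts = lines
        .flatMap(_.split("\\s+"))                          // transformation: split lines into words
        .map(word => (word, 1))                            // transformation: pair each word with a count
        .reduceByKey(_ + _)                                // transformation: aggregate counts per word
      counts.take(10).foreach(println)                     // action: triggers the actual computation

      spark.stop()
    }
  }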
  • Introduction
  • Dataframe API
  • Performing ad-hoc query analysis using Spark SQL
  • Hands-on Exercises
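A minimal Scala sketch of the kind of ad-hoc query analysis covered above, first with the DataFrame API and then with an equivalent SQL statement; the input file and column names are assumptions for illustration.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  object SparkSqlSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("spark-sql-demo").getOrCreate()

      val orders = spark.read.json("hdfs:///data/orders.json")   // schema inferred from JSON

      // DataFrame API: revenue per country, highest first
      orders.groupBy("country")
            .agg(sum("amount").as("revenue"))
            .orderBy(desc("revenue"))
            .show()

      // The same ad-hoc query expressed in SQL over a temporary view
      orders.createOrReplaceTempView("orders")
      spark.sql(
        "SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country ORDER BY revenue DESC"
      ).show()

      spark.stop()
    }
  }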
  • Analyzing streaming data using Spark
  • Stateless Streaming
  • Hands-on: Stateless Streaming
  • Stateful Streaming
  • Hands-on: Stateful Streaming
  • Structured Streaming
  • Hands-on exercises
  • Hands-on: Integrating Kafka with Spark Streaming
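The Kafka-to-Spark integration above can be sketched with Structured Streaming as follows. The broker address and topic name are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  object KafkaStructuredStreamingSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("kafka-structured-streaming").getOrCreate()
      import spark.implicits._

      // Read records from a Kafka topic as an unbounded DataFrame
      val events = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")

      // Stateful aggregation: running word counts over everything seen so far
      val counts = events
        .select(explode(split($"value", "\\s+")).as("word"))
        .groupBy("word")
        .count()

      val query = counts.writeStream
        .outputMode("complete")     // emit the full updated result on each trigger
        .format("console")
        .start()

      query.awaitTermination()
    }
  }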
  • Overview
  • What is Kafka Streams?
  • Why Kafka Streams?
  • Kafka Streams Architecture
  • Hands-on Exercise
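A minimal Kafka Streams sketch in Scala (written against the Java Streams API) showing a stateless topology that reads from one topic, transforms each value, and writes to another; the application id and topic names are placeholders.

  import java.util.Properties
  import org.apache.kafka.common.serialization.Serdes
  import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
  import org.apache.kafka.streams.kstream.{KStream, ValueMapper}

  object UppercaseStreamSketch {
    def main(args: Array[String]): Unit = {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo")
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
      props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
      props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

      val builder = new StreamsBuilder()
      val events: KStream[String, String] = builder.stream[String, String]("events")

      // Stateless transformation: uppercase every record value and forward it to an output topic
      events
        .mapValues(new ValueMapper[String, String] {
          override def apply(value: String): String = value.toUpperCase
        })
        .to("events-uppercase")

      val streams = new KafkaStreams(builder.build(), props)
      streams.start()
      sys.addShutdownHook(streams.close())   // close cleanly when the JVM shuts down
    }
  }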
Prerequisites

Participants should preferably have prior software development experience, along with basic knowledge of SQL and Unix commands. Knowledge of Python/Scala would be a plus.
