Join us for a FREE hands-on Meetup webinar on Mastering Snowflake Certifications: Associate & SnowPro Core Readiness | Friday, July 18th, 2025 · 6:00 PM IST/ 08:30 AM ET

Big Data Analytics Project Based Training

Learn how to build an end-to-end Big Data Analytics data pipeline from scratch.

Duration: 5 Days

Level: Intermediate

Design and tailor this course as per your team's needs

Project Background

“DataCouch Power Ltd.” (a fictitious energy utility company) has installed smart meters at customers’ homes and premises across London, UK. These meters emit energy consumption readings at short intervals. The readings are sent to the utility company so that it can bill each customer for the electrical energy consumed.

To optimize operating costs and provide a better customer experience, the utility company wants to detect overuse of sanctioned load and perform proactive maintenance, such as identifying faulty meters and planning service outages, so that the impact on customers is minimal.

Apart from the above use cases, an energy disaggregation service is available from a third party. It can analyze the smart meter logs and give deeper insights into the appliances and other settings that consume the energy. Such disaggregated data can be used to cross-sell information such as customer lifestyle, customer profile, appliances in use, and their usage patterns to third parties.

Project Objectives

Analyze batch data and provide insights into energy usage patterns over the last couple of years.
Identify faulty meters that require attention. This data should be captured in streaming mode.
Find the average energy consumption of each consumer in every 10-second time frame.
Segregate meters that are consuming energy over and above their sanctioned load.
Display a dashboard for the business to perform cube analytics on the data.
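
As a rough illustration of two of these objectives (the 10-second average and the sanctioned-load check), here is a minimal sketch in plain Scala collections rather than Spark. The `Reading` record, its field names, the kW unit, and the `MeterWindows` helper are all hypothetical, since the actual smart meter log schema is not described here. The course itself covers these objectives with Spark.

```scala
// Hypothetical record shape: the real smart meter log schema is not specified.
final case class Reading(meterId: String, epochSeconds: Long, kw: Double)

object MeterWindows {
  // Length of each tumbling window, in seconds.
  val WindowSeconds = 10L

  // Average consumption per (meter, window start), using non-overlapping
  // 10-second windows aligned to multiples of WindowSeconds.
  def averagePerWindow(readings: Seq[Reading]): Map[(String, Long), Double] =
    readings
      .groupBy(r => (r.meterId, r.epochSeconds / WindowSeconds * WindowSeconds))
      .map { case (key, rs) => key -> rs.map(_.kw).sum / rs.size }

  // Meters whose average in at least one window exceeds their sanctioned load.
  // Meters without a known sanctioned load are ignored (an assumption).
  def overSanctionedLoad(
      averages: Map[(String, Long), Double],
      sanctionedKw: Map[String, Double]
  ): Set[String] =
    averages.collect {
      case ((meterId, _), avg) if sanctionedKw.get(meterId).exists(avg > _) => meterId
    }.toSet
}
```

In a Spark version the same grouping would typically be expressed as a windowed aggregation over a stream, but the logic per window stays the same: group by meter and window start, average, then compare against the sanctioned load.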

This project gives participants exposure to building an end-to-end Big Data Analytics data pipeline from scratch.

This intensive training course combines lectures and hands-on labs that help participants learn the theory and gain practical experience of Spark Core, SQL, and Streaming. Hands-on exercises enable participants to work with common data sources such as HDFS, MySQL, HBase, and Kafka. During the course, participants also work with a variety of data formats, including CSV, JSON, XML, log files, Avro, Parquet, and ORC, using the Spark framework.

Learn the basics of Scala
Understand Spark Core, SQL, and Streaming architecture and APIs
Get practical exposure to key projects like Spark, Kafka etc.
Develop distributed code using the Scala programming language
Optimize Spark jobs through partitioning, caching, and other techniques
Build, deploy, and run Spark code on Hadoop clusters
Transform structured data using Spark SQL and DataFrames
Process and analyze streaming data using Spark
Integrate Spark with Kafka, HBase etc.
Work with key data formats like Avro, Parquet etc.

Introduction to Scala

History of Scala Language
What is Scala?
Design Goals of Scala
Advantages of Functional Programming
Scala vs Java
Scala and Java
Introduction to Eclipse IDE
Scala Shell Overview
Scala with Zeppelin Notebooks

Quick Recap of Hadoop for Spark

Recap of HDFS for Spark
Recap of YARN w.r.t. Spark
Recap of HBase
How to use YARN Commands?
Recap of MapReduce Logical Architecture
Hands-on Exercise

User Defined Functions
Anonymous Functions
Classes and Objects
Packages
Traits
Ways to compile Scala Code
Compiling and Deploying Scala Code
Hands-on: Exercises

Dealing With Key Data Formats in Scala

Processing CSV data using Scala
Dealing with XML files in Scala
JSON processing using Scala
Regular expressions
Processing Semi-structured data
Extending Hive using Use
Hands-on: Exercises

Introduction to Spark

Spark Overview
Detailed discussion on “Why Spark”
Quick Recap of MapReduce
Spark vs MapReduce
Why Scala for Spark?

Spark Core Framework and API

High level Spark Architecture
Role of Executor, Driver, SparkContext etc.
Resilient Distributed Datasets
Basic operations in Spark Core API i.e. Actions and Transformations
Using the Spark REPL for performing interactive data analysis
Hands-on Exercises

Delving Deeper Into Spark API

Pair RDDs
Implementing MapReduce Algorithms using Spark
Ways to create Pair RDDs
JSON Processing
Code Example on JSON Processing
XML Processing
Joins
Playing with Regular Expressions
Log File Processing using Regular Expressions
Hands-on Exercises

Executing a Spark Application

Writing Standalone Spark Application
Building Standalone Scala Spark Application using Maven
Various commands to execute and configure Spark Applications in various modes
Discussion on Application, Job, Stage, Executor, Tasks
Interpreting RDD Metadata/Lineage/DAG
Controlling degree of Parallelism in Spark Job
Physical execution of a Spark application
Discussion on: How Spark is better than MapReduce?
Hands-on Exercises

Advanced Features Of Spark

Persistence
Location
Data Format of Persistence
Replication
Partitioned By
Coalesce
Accumulators
Broadcasting for optimizing performance of Spark jobs
Hands-on Exercises

Spark Streaming

Analyzing streaming data using Spark
Stateless Streaming
Stateful Streaming
Quick introduction to Kafka Architecture
Role of Zookeeper, Brokers etc.
Hands-on Exercises

Introduction
Dataframe API
Performing ad-hoc query analysis using Spark SQL
Working with Hive Partitioning
Hands-on Exercises

Iterative Processing Using Spark

Introduction to Iterative Processing
Checkpointing
Checkpointing vs Persist
Example of Iterative Processing
K Means Clustering
Hands-on Exercises

Introduction to Datasets
Why Datasets?
Datasets vs Dataframes
Using Dataset API
Hands-on Exercises

Structured Streaming

Structured Streaming Overview
How is it better than Spark Streaming?
Structured Streaming API
Hands-on Exercises

Stay ahead with DataCouch! Your partner in mastering the latest advancements in AI, Data Science, DevOps, and more.

Copyright 2025 © DataCouch