Big Data Analytics Project Based Training
Duration
5 Days
Level
Intermediate
Project Background
“DataCouch Power Ltd.” (a fictitious energy utility company) has installed smart meters at customers’ homes and premises across London, UK. These meters emit energy-consumption readings at short intervals of time. The readings are sent to the utility company, which bills each customer based on how much electrical energy they consume.
To optimize its cost of operations and provide a better customer experience, the utility company wants to detect overuse of sanctioned load and perform proactive maintenance, such as identifying faulty meters and planning service outages, with minimal impact on customers.
In addition to the above use cases, a third-party energy disaggregation service can analyze the smart meter logs and give deeper insights into the appliances and other settings that consume the energy. Such disaggregated data, covering customer lifestyle, customer profile, and the appliances in use and their usage patterns, can be cross-sold to third parties.
Project Objectives
- Analyze batch data and provide insights into energy usage patterns over the last couple of years.
- Identify faulty meters that require attention; this data should be captured in streaming mode.
- Find the average energy consumption of each consumer over every 10-second window (see the sketch after this list).
- Identify meters that consume energy above their sanctioned load.
- Provide a dashboard for the business to perform cube analytics on the data.
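To make the 10-second-average objective concrete, here is a minimal sketch using Spark Structured Streaming. The Kafka broker address, topic name, and the "meterId,timestamp,kwh" record layout are illustrative assumptions, not part of the project specification:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MeterAverages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MeterAverages").getOrCreate()
    import spark.implicits._

    // Hypothetical input: CSV records "meterId,timestamp,kwh" on a Kafka topic
    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "meter-readings")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")
      .select(
        split($"line", ",").getItem(0).as("meterId"),
        split($"line", ",").getItem(1).cast("timestamp").as("ts"),
        split($"line", ",").getItem(2).cast("double").as("kwh"))

    // Average consumption per meter over tumbling 10-second windows
    val averages = readings
      .groupBy(window($"ts", "10 seconds"), $"meterId")
      .agg(avg($"kwh").as("avgKwh"))

    averages.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```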
This project gives participants hands-on exposure to building an end-to-end Big Data analytics pipeline from scratch.
This intensive training course combines lectures and hands-on labs that help participants learn the theory and gain practical experience with Spark Core, Spark SQL, and Spark Streaming. Hands-on exercises have participants work with common data sources such as HDFS, MySQL, HBase, and Kafka. During the course, participants will also handle a variety of data formats, including CSV, JSON, XML, log files, Avro, Parquet, and ORC, using the Spark framework. Through this course, participants will:
- Learn the basics of Scala
- Understand Spark Core, SQL, and Streaming architecture and APIs
- Get practical exposure to key projects such as Spark and Kafka
- Develop distributed code using the Scala programming language
- Optimize Spark jobs through partitioning, caching, and other techniques
- Build, deploy, and run Spark code on Hadoop clusters
- Transform structured data using Spark SQL and DataFrames
- Process and analyze streaming data using Spark
- Integrate Spark with Kafka, HBase, etc.
- Work with key data formats such as Avro and Parquet
This program is designed for:
- Developers
- Analysts
- Architects
- Team Leads
- Data Scientists
Course Outline
- History of the Scala Language
- What is Scala?
- Design Goals of Scala
- Advantages of Functional Programming
- Scala vs Java
- Scala and Java
- Introduction to Eclipse IDE
- Scala Shell Overview
- Scala with Zeppelin Notebooks
- Recap of HDFS for Spark
- Recap of YARN w.r.t. Spark
- Recap of HBase
- Using YARN commands
- Recap of MapReduce Logical Architecture
- Hands-on Exercises
- User Defined Functions
- Anonymous Functions
- Classes and Objects
- Packages
- Traits
- Ways to compile Scala Code
- Compiling and Deploying Scala Code
- Hands-on Exercises
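To make these constructs concrete before the exercises, here is a small self-contained sketch that combines a user-defined function, an anonymous function, a class, and a trait; the metering domain and all names are purely illustrative:

```scala
// A trait defines behaviour that classes can mix in
trait Billable {
  def ratePerKwh: Double
  def bill(kwh: Double): Double = kwh * ratePerKwh
}

// A simple class with constructor parameters, mixing in the trait
class Meter(val id: String, val ratePerKwh: Double) extends Billable

object ScalaBasicsDemo {
  // A user-defined function
  def toKwh(wattHours: Double): Double = wattHours / 1000.0

  def main(args: Array[String]): Unit = {
    val meter = new Meter("MTR-001", 0.15)
    val readingsWh = List(1200.0, 800.0, 2500.0)

    // An anonymous function passed to a higher-order method
    val readingsKwh = readingsWh.map(wh => toKwh(wh))

    println(f"Total bill: ${meter.bill(readingsKwh.sum)}%.2f")
  }
}
```

Compiled with scalac and run with scala, this prints the bill for the summed readings.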
- Processing CSV data using Scala
- Dealing with XML files in Scala
- JSON processing using Scala
- Regular expressions
- Processing Semi-structured data
- Extending Hive using UDFs
- Hands-on Exercises
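As one illustration of the regular-expression topics above, the sketch below parses log lines with a Scala regex used as a pattern-matching extractor; the log line format ("timestamp meterId kwh") is an assumption:

```scala
object LogParserDemo {
  // Hypothetical log line format: "2024-01-15 10:32:01 MTR-042 3.75"
  val LogLine = """(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) ([\d.]+)""".r

  def main(args: Array[String]): Unit = {
    val lines = List(
      "2024-01-15 10:32:01 MTR-042 3.75",
      "malformed line")

    lines.foreach {
      case LogLine(ts, meterId, kwh) =>
        println(s"meter=$meterId time=$ts kwh=${kwh.toDouble}")
      case other =>
        println(s"skipping unparseable line: $other")
    }
  }
}
```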
- Spark Overview
- Detailed discussion on “Why Spark”
- Quick Recap of MapReduce
- Spark vs MapReduce
- Why Scala for Spark?
- High level Spark Architecture
- Role of Executor, Driver, SparkContext etc.
- Resilient Distributed Datasets
- Basic operations in the Spark Core API, i.e., actions and transformations
- Using the Spark REPL for performing interactive data analysis
- Hands-on Exercises
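A minimal spark-shell session along these lines, assuming a hypothetical HDFS file with one "meterId,kwh" record per line, shows how lazy transformations build a lineage that only runs when an action is invoked:

```scala
// In the Spark REPL (spark-shell), `sc` is a ready-made SparkContext
val lines = sc.textFile("hdfs:///data/meter_readings.csv")

// Transformations are lazy: nothing executes yet
val kwhValues = lines
  .map(_.split(","))
  .filter(_.length == 2)
  .map(fields => fields(1).toDouble)

// Actions trigger execution of the whole lineage
val total = kwhValues.sum()
val top5  = kwhValues.top(5)

println(s"Total consumption: $total kWh; top readings: ${top5.mkString(", ")}")
```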
- Pair RDDs
- Implementing MapReduce Algorithms using Spark
- Ways to create Pair RDDs
- JSON Processing
- Code Example on JSON Processing
- XML Processing
- Joins
- Playing with Regular Expressions
- Log File Processing using Regular Expressions
- Hands-on Exercises
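Here is a spark-shell sketch of Pair RDDs applied to the project's sanctioned-load objective; the file layout and the hard-coded sanctioned loads are assumptions:

```scala
// Pair RDD keyed by meter id, built from "meterId,kwh" lines
val readings = sc.textFile("hdfs:///data/meter_readings.csv")
  .map(_.split(","))
  .map(f => (f(0), f(1).toDouble))

// Classic MapReduce-style aggregation with reduceByKey
val totalByMeter = readings.reduceByKey(_ + _)

// A second Pair RDD of sanctioned loads, joined on the meter id
val sanctioned = sc.parallelize(Seq(("MTR-001", 5.0), ("MTR-002", 3.0)))
val overLimit = totalByMeter.join(sanctioned)
  .filter { case (_, (used, limit)) => used > limit }

overLimit.collect().foreach(println)
```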
- Writing Standalone Spark Application
- Building Standalone Scala Spark Application using Maven
- Commands to execute and configure Spark applications in different modes
- Discussion on Application, Job, Stage, Executor, Tasks
- Interpreting RDD Metadata/Lineage/DAG
- Controlling degree of Parallelism in Spark Job
- Physical execution of a Spark application
- Discussion: how Spark improves on MapReduce
- Hands-on Exercises
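A skeleton of such a standalone application might look like the sketch below; the class name, input path, and partition count are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Minimal standalone application: package it with Maven, submit with spark-submit
object EnergyBatchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("EnergyBatchJob").getOrCreate()
    val sc = spark.sparkContext

    // The second argument to textFile requests a minimum number of partitions,
    // one simple way to influence a job's degree of parallelism
    val readings = sc.textFile(args(0), minPartitions = 8)
    println(s"Partitions: ${readings.getNumPartitions}")
    println(s"Lines: ${readings.count()}")

    spark.stop()
  }
}
```

Once packaged, it could be launched with something like `spark-submit --master yarn --class EnergyBatchJob energy-job.jar hdfs:///data/meter_readings.csv`.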
- Persistence
- Location
- Data Format of Persistence
- Replication
- partitionBy
- Coalesce
- Accumulators
- Broadcasting for optimizing performance of Spark jobs
- Hands-on Exercises
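The sketch below combines these three optimization tools in one spark-shell pass; the "meterId,category,kwh" file layout and the tariff table are assumptions:

```scala
import org.apache.spark.storage.StorageLevel

val raw = sc.textFile("hdfs:///data/meter_readings.csv")

// Accumulator: counts malformed records on the executors, read on the driver
val badRecords = sc.longAccumulator("badRecords")

// Broadcast: ships the small lookup table once per executor, not once per task
val tariffsBc = sc.broadcast(Map("domestic" -> 0.15, "commercial" -> 0.22))

val costs = raw.flatMap { line =>
  line.split(",") match {
    case Array(id, category, kwh) =>
      Some((id, kwh.toDouble * tariffsBc.value.getOrElse(category, 0.20)))
    case _ =>
      badRecords.add(1)
      None
  }
}.persist(StorageLevel.MEMORY_AND_DISK) // cached because two actions follow

println(s"Total cost: ${costs.values.sum()}")
println(s"Meters: ${costs.keys.distinct().count()}, bad records: ${badRecords.value}")
```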
- Analyzing streaming data using Spark
- Stateless Streaming
- Stateful Streaming
- Quick introduction to Kafka Architecture
- Role of ZooKeeper, brokers, etc.
- Hands-on Exercises
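The contrast between stateless and stateful processing can be sketched with the classic DStream API; the socket source, host/port, and "meterId,kwh" line format are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MeterStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MeterStreamDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/meter-stream-checkpoint") // required for stateful ops

    val readings = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))

    // Stateless: each 10-second batch is processed independently
    val perBatchTotals = readings.reduceByKey(_ + _)

    // Stateful: a running total per meter is carried across batches
    val runningTotals = readings.updateStateByKey[Double] {
      (newKwh: Seq[Double], state: Option[Double]) =>
        Some(state.getOrElse(0.0) + newKwh.sum)
    }

    perBatchTotals.print()
    runningTotals.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```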
- Introduction
- DataFrame API
- Performing ad-hoc query analysis using Spark SQL
- Working with Hive Partitioning
- Hands-on Exercises
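A short spark-shell sketch of ad-hoc analysis, showing the same aggregation through both SQL and the DataFrame API; the file path and column names are assumptions:

```scala
import org.apache.spark.sql.functions.avg

val readings = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/meter_readings.csv")

// Register the DataFrame so it can be queried with plain SQL
readings.createOrReplaceTempView("readings")

// The same question asked two ways
spark.sql("SELECT meterId, AVG(kwh) AS avg_kwh FROM readings GROUP BY meterId").show()
readings.groupBy("meterId").agg(avg("kwh").as("avg_kwh")).show()
```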
- Introduction to Iterative Processing
- Checkpointing
- Checkpointing vs Persist
- Example of Iterative Processing
- K-Means Clustering
- Hands-on Exercises
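Why checkpointing matters for iterative jobs can be sketched in a few lines: each loop iteration extends the RDD lineage, and an occasional checkpoint truncates it by writing to reliable storage (unlike persist, which keeps the full lineage for recovery). The update rule and checkpoint directory below are stand-ins:

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var current = sc.parallelize(1 to 1000).map(_.toDouble)

for (i <- 1 to 20) {
  current = current.map(v => v * 0.99 + 0.01) // one "iteration" of an algorithm

  if (i % 5 == 0) {
    current.checkpoint() // cut the growing DAG every few iterations
    current.count()      // an action forces the checkpoint to materialise
  }
}

println(current.sum())
```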
- Introduction to Datasets
- Why Datasets?
- Datasets vs DataFrames
- Using Dataset API
- Hands-on Exercises
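A minimal spark-shell sketch of the typed Dataset API over an assumed reading schema; unlike DataFrames, the lambdas below are checked against the case class at compile time:

```scala
case class Reading(meterId: String, kwh: Double)

import spark.implicits._

val ds = Seq(Reading("MTR-001", 3.2), Reading("MTR-002", 5.9)).toDS()

// Typed transformations: r is a Reading, not an untyped Row
val heavyMeters = ds.filter(r => r.kwh > 4.0).map(r => r.meterId)
heavyMeters.show()
```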
- Structured Streaming Overview
- How it improves on DStream-based Spark Streaming
- Structured Streaming API
- Hands-on Exercises
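A tiny sketch using Spark's built-in rate source (a test source that emits timestamp/value rows) shows the key idea: a stream is treated as an unbounded table, so the usual DataFrame operations apply to each micro-batch:

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

val stream = spark.readStream
  .format("rate")               // built-in test source
  .option("rowsPerSecond", "5")
  .load()

// Standard DataFrame aggregation, now over an unbounded input
val counts = stream
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```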
Prerequisites
This is a hands-on training, so it is advisable to have basic knowledge of Hadoop, such as Hive queries and HDFS commands.