Join us for a FREE hands-on Meetup webinar on Mastering Snowflake Certifications: Associate & SnowPro Core Readiness | Friday, July 18th, 2025 · 6:00 PM IST / 08:30 AM ET

Scalable Machine Learning

Implement scalable machine learning with the Hadoop and Spark frameworks, in either Scala or Python

Duration

3 Days

Level

Intermediate Level

Design and Tailor this course

As per your team's needs

Introduction to Scalable Machine Learning

What is Scalable Machine Learning?
Why is it required?
Key platforms for performing Scalable Machine Learning
Scalable Machine Learning Project End to End Pipeline
Spark Introduction
Why Spark for Scalable Machine Learning?
Databricks Platform Demo
Approaches for scaling scikit-learn code
Hands-on Exercise(s): Experiencing the first notebook

Why Spark for Scalable Machine Learning (SML)?

Problems with Traditional Machine Learning Frameworks
Machine Learning at Scale – Various options
Iterative Algorithms
How does Spark perform well for Iterative Machine Learning Algorithms?
Hands-on Exercise(s)

Scalable Machine Learning on Enterprise Platform

Data Acquisition at Scale

Acquiring Structured content from Relational Databases
Acquiring Semi-structured content from Log Files
Acquiring Unstructured content from other key sources such as the Web
Tools for Performing Data Acquisition at Scale
Introduction to Sqoop, Flume and Kafka: use cases and architectures
Hands-on Exercise(s)

Data Pre-Processing for Modeling

Using the Spark Shell
Resilient Distributed Datasets (RDDs)
Functional Programming with Spark
RDD Operations
Key-Value Pair RDDs
MapReduce and Pair RDD Operations
Building and Running a Spark Application
Performing Data Validation
Data De-Duplication
Detecting Outliers
Hands-on Exercise(s)

Working with Iterative Algorithms

Dealing with RDD Infinite Lineages
Caching Overview
Distributed Persistence
Checkpointing of an Iterative Machine Learning Algorithm
Hands-on Exercise(s)

Introduction
DataFrame API
Performing ad-hoc query analysis using Spark SQL
Hands-on Exercise(s)

Spark Machine Learning using MLlib

Spark ML vs Spark MLlib
Data types and key terms
Feature Extraction
Linear Regression using Spark MLlib
Hands-on Exercise(s)

Spark Machine Learning using ML

Spark ML Overview
Transformers and Estimators
Pipelines
Implementing Decision Trees
K-Means Clustering using Spark ML
Hands-on Exercise(s)

Decision Trees and Random Forest

Types – Classification and Regression trees
Gini Index, Entropy and Information Gain
Building Decision Trees
Pruning the trees
Prediction using Trees
Ensemble Models
Bagging and Boosting
Advantages of using Random Forest
Working with Random Forest
Ensemble Learning
How ensemble learning works
Building models using Bagging
Random Forest algorithm
Random Forest model building
Fine tuning hyper-parameters
Hands-on Exercise(s)
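
As a small illustration of the split criteria listed in this module (Gini Index, Entropy and Information Gain), here is a minimal sketch in plain Python. It is not taken from the course material; the function names `gini`, `entropy` and `information_gain` are hypothetical, and the sketch assumes class labels are given as simple Python lists.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy data: a perfectly balanced node split into two pure children.
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]
print(gini(parent), entropy(parent), information_gain(parent, children))
```

A split that separates the classes completely, as in the toy data, recovers all of the parent's entropy as information gain; a split that leaves each child with the same class mix as the parent gains nothing.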

Model Evaluation, Optimization and Deployment

Model Evaluation
Optimizing a Model
Deploying a Model
Best Practices

Stay ahead with DataCouch! Your partner in mastering the latest advancements in AI, Data Science, DevOps, and more.

Copyright 2025 © DataCouch