Creating & Monitoring Big Data Pipelines with Apache Airflow
Duration: 3 Days
Level: Intermediate
Big data systems are becoming more complex every day. Even a simple big data system involves several stages, such as ingestion, transformation, and analytics, and several stakeholders, such as big data engineers, data scientists, and data analysts. It therefore becomes necessary to stitch all of these tasks together into a pipeline and to monitor it, so that the system becomes more scalable, less error prone, and more autonomous.
The Creating & Monitoring Big Data Pipelines with Apache Airflow training course teaches data engineers what they need to know to create, schedule, and monitor data pipelines using Apache Airflow, the de facto platform for programmatically authoring, scheduling, and monitoring workflows. The course begins with the core functionality of Apache Airflow and then moves on to building data pipelines. It then covers more advanced topics such as start_date and schedule_interval, dealing with time zones, alerting on failures, and more. The course concludes with a look at monitoring and security in Apache Airflow, as well as managing and deploying workflows in the cloud.
Purpose:
Promote an in-depth understanding of how to use Apache Airflow to create, schedule and monitor data pipelines.
Productivity Objectives:
Upon completion of this course, you should be able to:
- Code production-grade data pipelines with Airflow
- Schedule and monitor data pipelines using Apache Airflow
- Understand and apply core and advanced concepts of Apache Airflow
- Create data pipelines using Amazon Managed Workflows for Apache Airflow (MWAA)
Who Should Attend:
- Big Data Engineers
- Data Scientists
- DevOps
- Data Analysts
Course Outline:
Airflow Core Concepts
- What is Apache Airflow?
- How Apache Airflow works
- Installation and setup
- Understanding the Airflow architecture
- Understanding core concepts – DAGs, Tasks, and Operators (a minimal example DAG follows this list)
- Understanding the interface – a tour of the Airflow UI
- Using the Airflow CLI
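To make these concepts concrete, here is a minimal, illustrative DAG. It is not taken from the course materials: the dag_id, dates, and schedule are placeholder values, and it assumes Airflow 2.x, where BashOperator lives in airflow.operators.bash.

```python
# Minimal illustrative DAG: one DAG object, one task, one operator.
# The dag_id, dates, and schedule are placeholders, not course material.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_airflow",            # unique identifier shown in the Airflow UI
    start_date=datetime(2023, 1, 1),   # first logical date the scheduler considers
    schedule_interval="@daily",        # how often the DAG is triggered
    catchup=False,                     # do not back-fill runs between start_date and now
) as dag:
    say_hello = BashOperator(          # an Operator instance becomes a Task in the DAG
        task_id="say_hello",
        bash_command="echo 'Hello, Airflow!'",
    )
```

From the CLI, a DAG like this can be listed with `airflow dags list`, and a single task can be exercised with `airflow tasks test hello_airflow say_hello 2023-01-01`.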
Building a Data Pipeline
- Sqoop Operator – ingest data from an RDBMS
- HTTP Sensor – check API availability
- File Sensor – wait for a file to arrive
- Python Operator – download data
- Bash Operator – move data to HDFS
- Hive Operator – create Hive tables
- Spark Submit Operator – run a Spark job
- Email Operator – send email notifications
- The data pipeline in action (a hedged sketch of such a DAG follows this list)
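The sketch below wires several of the operators above into one DAG, in the spirit of the pipeline built in this module. It is an assumption-laden illustration: the connection IDs (forex_api, fs_default, spark_default), endpoint, file paths, email address, and Spark application are all hypothetical, the HTTP and Apache Spark provider packages must be installed, and the Sqoop and Hive steps are omitted for brevity.

```python
# Sketch of the pipeline outlined above, wired together as one DAG.
# Connection IDs, paths, endpoints, and the Spark job are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.http.sensors.http import HttpSensor
from airflow.sensors.filesystem import FileSensor


def download_rates(**context):
    """Placeholder for the Python download step (e.g. pull a file from the API)."""
    ...


with DAG(
    dag_id="forex_data_pipeline",       # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    api_available = HttpSensor(          # wait until the source API answers
        task_id="api_available",
        http_conn_id="forex_api",        # assumed Airflow connection
        endpoint="latest",               # assumed endpoint
    )

    file_available = FileSensor(         # wait for the landing file
        task_id="file_available",
        fs_conn_id="fs_default",
        filepath="input/rates.csv",      # assumed path
    )

    download = PythonOperator(
        task_id="download_rates",
        python_callable=download_rates,
    )

    to_hdfs = BashOperator(              # push the downloaded file into HDFS
        task_id="to_hdfs",
        bash_command="hdfs dfs -put -f /tmp/rates.csv /landing/rates.csv",
    )

    spark_job = SparkSubmitOperator(     # run the transformation as a Spark job
        task_id="process_rates",
        application="/opt/jobs/process_rates.py",  # assumed Spark application
        conn_id="spark_default",
    )

    notify = EmailOperator(              # send a completion notification
        task_id="notify",
        to="data-team@example.com",
        subject="Pipeline finished",
        html_content="The daily rates pipeline completed.",
    )

    api_available >> file_available >> download >> to_hdfs >> spark_job >> notify
```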
Logging & Monitoring
- Understanding the logging system
- Setting up custom logging
- Storing logs in S3 (see the configuration excerpt after this list)
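As an example of remote log storage, the excerpt below shows the airflow.cfg options for shipping task logs to S3 in Airflow 2.x (in Airflow 1.10 the equivalent settings lived under [core]). The bucket name and connection ID are placeholders.

```ini
[logging]
# Ship task logs to S3; the bucket and connection id below are placeholders.
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/logs
remote_log_conn_id = aws_logs
```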
Security
- Encrypting sensitive data with Fernet keys (a key-generation sketch follows this list)
- Rotating Fernet keys
- Hiding sensitive variables
- Enabling password authentication
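A quick sketch of the Fernet workflow: generate a key with the cryptography library (the same library Airflow uses internally) and point Airflow at it via fernet_key in airflow.cfg or the AIRFLOW__CORE__FERNET_KEY environment variable.

```python
# Generate a Fernet key for encrypting Airflow connections and variables at rest.
from cryptography.fernet import Fernet

key = Fernet.generate_key().decode()
print(key)  # set this as fernet_key in airflow.cfg or AIRFLOW__CORE__FERNET_KEY
```

To rotate keys, the documented pattern is to set fernet_key to the new key followed by the old one (comma-separated), run `airflow rotate-fernet-key` so stored connections and variables are re-encrypted with the new key, and then drop the old key.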
Airflow in the Cloud
- Using Amazon Managed Workflows for Apache Airflow (MWAA) (a sketch of publishing a DAG to the MWAA DAG bucket follows this list)
- Deploying Airflow on a Kubernetes (EKS) cluster on AWS
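With MWAA, DAGs are deployed by uploading them to the S3 bucket and prefix configured as the environment's DAG folder. Below is a minimal boto3 sketch; the bucket name, prefix, and file name are assumptions for illustration only.

```python
# Publish a DAG file to the S3 prefix that an MWAA environment reads DAGs from.
# Bucket name, prefix, and file name are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="dags/forex_data_pipeline.py",  # local DAG file (hypothetical)
    Bucket="my-mwaa-bucket",                 # assumed MWAA source bucket
    Key="dags/forex_data_pipeline.py",       # the environment's configured DAG folder prefix
)
```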
Prerequisites:
- Basic knowledge of Python
- Basic understanding of big data tools such as Spark, Hive, etc.