Big Data for Architects
This program focuses on key architectures and pipelines in the Big Data ecosystem.
Duration
3 Days
Level
Intermediate
Design and tailor this course to your team's needs.
Objectives
- Understand the thought process behind choosing Big Data ingestion, storage, processing, and analysis technologies
- Focus on key architectures and pipelines in the Big Data ecosystem
- Decide which Big Data technology to choose, and when
- Cover the breadth of Big Data technologies
- Get hands-on with a Google Cloud Dataproc pseudo-distributed cluster
- Understand the Hortonworks security modules
Who Should Attend
- Engineers/Scientists who want to understand the role of various Big Data technologies
- Big Data Leads/Architects who want to deepen their Big Data knowledge
- Big Data Engineers planning to appear for professional-level certifications such as DE575, Google Certified Data Professional, etc.
Course Outline
- Evolution of Big Data Technologies
- Big Data Technologies Landscape in Hortonworks Stack
- Key Big Data Architectures
- Deployment Architecture of Data Lake
- Typical Big Data Batch Pipeline
- More Examples of Big Data Batch Pipeline
- Typical Big Data Streaming Pipeline
- More Examples of Streaming Pipeline
- Factors to consider while comparing Ingestion frameworks
- Loading data into Data Lake
- Sqoop Internals
- Loading data using Sqoop
- Sqoop vs Kafka Connect
- High-Level Introduction to NiFi
- Hands-on: Confluent Kafka Installation
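As a sketch of the "Loading data using Sqoop" step above, a typical import might look like the following. The host, database, credentials, table, and target directory are placeholders, not values from this course:

```shell
# Hypothetical Sqoop import: pulls an RDBMS table into HDFS as Avro files.
# Connection string, credentials, and paths are illustrative placeholders.
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4 \
  --as-avrodatafile
```

The `--num-mappers` flag controls the degree of parallelism: Sqoop splits the table by primary key into that many ranges, each imported by one map task.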
- Interoperability
- Text vs Binary
- Row oriented vs Column oriented
- Splittable
- Schema Evolution
- Avro Data Format
- ORC Data Format
- Comparing Data Formats – which one to choose when?
- Hands-on: Big Data Batch Pipeline using the Avro Format
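To make the schema-evolution bullet above concrete, here is a minimal sketch of Avro-style resolution simulated with plain dicts rather than the Avro library: a record written with an old schema is read under a newer schema, and fields absent from the record take the newer schema's defaults. The field names and defaults are illustrative only.

```python
# Avro-style schema evolution, simulated with plain dicts (not the avro lib):
# missing fields in an old record take the reader schema's defaults.

def resolve(record, reader_defaults):
    """Resolve a record against a reader schema given as {field: default}."""
    return {field: record.get(field, default)
            for field, default in reader_defaults.items()}

old_record = {"id": 1, "name": "widget"}                    # written with schema v1
reader_defaults = {"id": None, "name": None, "price": 0.0}  # v2 adds 'price'

print(resolve(old_record, reader_defaults))
# → {'id': 1, 'name': 'widget', 'price': 0.0}
```

Real Avro performs this resolution from the writer and reader schemas stored with the data, which is why both Avro and ORC can add columns without rewriting old files.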
- Factors to consider while comparing Processing frameworks
- Introduction to YARN
- YARN Architecture
- YARN Internals
- How to troubleshoot issues on the cluster
- Things to consider for performance tuning
- Spark vs Tez
- MR vs Spark
- MR vs Spark Logical Architecture Perspective
- MR vs Spark Performance Perspective
- Why Spark?
- Spark Physical Architecture
- Spark Internals
- Spark Optimizations
- Things to consider when implementing Spark on YARN
- Kafka Stream vs Spark Streaming
- Spark Core vs Spark SQL
- Spark Execution Modes: YARN Client vs YARN Cluster
- Spark 2.x Streaming vs Spark 1.x Streaming
- KStreams vs Spark
- Hands-on: Spark on YARN
- Hands-on: Kafka & Spark Streaming Integration
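The "YARN Client vs YARN Cluster" comparison above comes down to where the driver runs, which the `--deploy-mode` flag of `spark-submit` selects. The application script and resource sizes below are placeholders:

```shell
# Contrasting the two YARN deploy modes. my_pipeline.py and the
# executor sizing are illustrative placeholders.

# Client mode: the driver runs in this shell; convenient for
# interactive work and debugging, but ties the job to your session.
spark-submit --master yarn --deploy-mode client \
  --num-executors 4 --executor-memory 2g \
  my_pipeline.py

# Cluster mode: the driver runs inside a YARN container on the
# cluster; the usual choice for production jobs.
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 2g \
  my_pipeline.py
```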
- Factors to consider while comparing Storage frameworks
- Why Hive?
- Hive Architecture
- Hive LLAP Architecture
- Spark SQL vs Hive
- KSQL vs Hive
- Hands-on exercises for Spark SQL and Hive
- Hands-on: Kafka, NiFi, HBase & Hive Integration
- Implementing Change Data Capture using Kafka
- Building ETL pipeline
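One way to picture the change-data-capture step above: fold a stream of change events, as a Kafka consumer might deliver them, into a keyed table. The event shape and field names here are assumptions for illustration, not any specific connector's format.

```python
# Hypothetical CDC apply logic: a stream of insert/update/delete events
# is folded into a dict keyed by primary key. The event schema below is
# an assumption, not a specific connector's wire format.

def apply_cdc(table, events):
    """Apply change events to a keyed table; returns the updated table."""
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            table.pop(key, None)
        else:  # "insert" and "update" both upsert the latest row image
            table[key] = event["row"]
    return table

events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 2, "row": None},
]
print(apply_cdc({}, events))  # only key 1 survives, with its latest row
```

Because inserts and updates are both upserts, replaying the event stream from the beginning reproduces the same final state, which is what makes Kafka a convenient CDC transport.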
- Introduction to Hortonworks Security
- Security Aspects: Key things to consider
- Securing the ecosystem
- Discussion of various tools/frameworks/technologies related to security in the following areas:
- Authentication – Kerberos
- Authorization – Ranger (also masking)
- Auditing
- Encryption – Data at Rest and Data in Motion
Prerequisites
Participants should preferably have basic knowledge of Unix/Linux administration. Basic knowledge of Hadoop, Spark, Kafka, etc. will also be helpful.