Big Data for Architects

This program focuses on key architectures and pipelines in the Big Data ecosystem.

Duration

3 Days

Level

Intermediate

Design and tailor this course

to your team's needs

Course Objectives
  • Understand the thought process behind choosing Big Data ingestion, storage, processing, and analysis technologies
  • Focus on key architectures and pipelines in the Big Data ecosystem
  • Learn which Big Data technology to choose, and when
  • Cover the breadth of Big Data technologies
  • Get hands-on experience with a Google Cloud Dataproc pseudo-distributed cluster
  • Understand Hortonworks security modules
Who Should Attend
  • Engineers/scientists who want to understand the role of various Big Data technologies
  • Big Data leads/architects who want to enhance their Big Data knowledge
  • Big Data engineers planning to take professional-level certifications such as DE575 or the Google Cloud Professional Data Engineer exam
Course Outline
  • Evolution of Big Data Technologies
  • Big Data Technologies Landscape in Hortonworks Stack
  • Key Big Data Architectures
  • Deployment Architecture of Data Lake
  • Typical Big Data Batch Pipeline
  • More Examples of Big Data Batch Pipelines
  • Typical Big Data Streaming Pipeline
  • More Examples of Streaming Pipelines
  • Factors to consider while comparing Ingestion frameworks
  • Loading data into Data Lake
  • Sqoop Internals
  • Loading data using Sqoop
  • Sqoop vs Kafka Connect
  • High-Level Introduction to NiFi
  • Hands-on: Confluent Kafka Installation
  • Interoperability
  • Text vs Binary Formats
  • Row-Oriented vs Column-Oriented Formats
  • Splittability
  • Schema Evolution
  • Avro Data Format
  • ORC Data Format 
  • Comparing Data Formats – which one to choose when?
  • Hands-on: Big Data Batch Pipeline using the Avro Format
  • Factors to consider while comparing Processing frameworks
  • Introduction to YARN
  • YARN Architecture
  • YARN Internals
  • How to troubleshoot issues on a cluster
  • Things to consider for performance tuning
  • Spark vs Tez
  • MR vs Spark
  • MR vs Spark Logical Architecture Perspective
  • MR vs Spark Performance Perspective
  • Why Spark?
  • Spark Physical Architecture
  • Spark Internals
  • Spark Optimizations
  • Things to consider when implementing Spark on YARN
  • Kafka Stream vs Spark Streaming
  • Spark Core vs Spark SQL
  • Spark Execution Modes: YARN Client vs YARN Cluster
  • Spark 2.x Streaming vs Spark 1.x Streaming
  • KStreams vs Spark 
  • Hands-on Spark on YARN
  • Hands-on Kafka & Spark Streaming Integration
  • Factors to consider while comparing Storage frameworks
  • Why Hive?
  • Hive Architecture
  • Hive LLAP Architecture
  • Spark SQL vs Hive
  • KSQL vs Hive
  • Hands-on exercises for Spark SQL and Hive
  • Hands-on Kafka, NiFi, HBase & Hive Integration
  • Implementing Change Data Capture (CDC) using Kafka
  • Building an ETL Pipeline
  • Introduction to Hortonworks Security
  • Security Aspects: Key things to consider
  • Securing the ecosystem
  • Discussion of various tools/frameworks/technologies related to security in the following areas:
    • Authentication – Kerberos
    • Authorization – Ranger (also masking)
    • Auditing 
    • Encryption – Data at Rest and Data in Motion
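To give a flavor of the "Row-Oriented vs Column-Oriented" discussion in the outline above, here is a minimal sketch in plain Python (no Big Data stack required; the records and field names are hypothetical, chosen only for illustration):

```python
# Sketch: row-oriented layouts (e.g. Avro) keep all fields of a record
# together; column-oriented layouts (e.g. ORC, Parquet) keep all values
# of one field together, which favors analytical scans over few columns.

records = [
    {"id": 1, "name": "alice", "amount": 10.0},
    {"id": 2, "name": "bob",   "amount": 20.0},
    {"id": 3, "name": "carol", "amount": 30.0},
]

# Row layout: a list of complete records (good for whole-row reads/writes).
row_store = [(r["id"], r["name"], r["amount"]) for r in records]

# Column layout: one sequence per field (good for single-column aggregates).
column_store = {
    "id":     [r["id"] for r in records],
    "name":   [r["name"] for r in records],
    "amount": [r["amount"] for r in records],
}

# An aggregate over one column touches only that column in the columnar layout...
total = sum(column_store["amount"])

# ...while the row layout must walk every record and pick the field out.
total_rows = sum(row[2] for row in row_store)

print(total, total_rows)  # both 60.0
```

Real columnar formats add compression and encoding on top of this idea, but the access-pattern trade-off the course compares is the same.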
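The "Schema Evolution" topic above can likewise be sketched without any Avro library: the snippet below simulates, in plain Python, how an Avro-style reader schema with a default value resolves records written under an older schema (the schemas and field names here are hypothetical):

```python
# Sketch of Avro-style schema evolution: writer schema v1 has no "country"
# field; reader schema v2 adds it with a default. At read time, fields
# missing from the written record take the reader schema's default.

reader_schema_v2 = {
    "fields": [
        {"name": "id"},
        {"name": "name"},
        {"name": "country", "default": "unknown"},  # new field in v2
    ]
}

def resolve(record, reader_schema):
    """Fill fields missing from an old-schema record using reader defaults."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return out

old_record = {"id": 1, "name": "alice"}  # written with schema v1
print(resolve(old_record, reader_schema_v2))
# {'id': 1, 'name': 'alice', 'country': 'unknown'}
```

This default-filling rule is what lets a new reader consume old data without rewriting it, which is the motivation behind comparing Avro and ORC on schema evolution in the course.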
Prerequisites

Participants should preferably have basic knowledge of Unix/Linux administration. Basic knowledge of Hadoop, Spark, Kafka, etc. will also be helpful.

Connect

We'd love to hear your feedback on your experience so far.