Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Brief overview of Python and Scala

Foundational Concepts (Theory):

  • Architecture
  • Resilient Distributed Datasets (RDDs)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies

Hands-on Workshop: Mastering the Basics in the Databricks Environment

  • Exercises utilizing the RDD API
      • Core action and transformation functions
      • PairRDDs
      • Join operations
      • Caching strategies
  • Exercises utilizing the DataFrame API
      • Spark SQL
      • DataFrame operations: select, filter, group, sort
      • User-Defined Functions (UDFs)
  • Exploration of the Dataset API
  • Streaming

Hands-on Workshop: Deployment in the AWS Environment

  • Fundamentals of AWS Glue
  • Comparing Amazon EMR and AWS Glue
  • Practical job examples in both environments
  • Analysis of advantages and disadvantages

Additional Content:

  • Introduction to Apache Airflow orchestration

Requirements

  • Programming skills (preferably in Python and Scala)
  • Foundational knowledge of SQL

Duration: 21 Hours
