Course Outline

PySpark & Machine Learning 

Module 1: Big Data & Spark Foundations

  • Overview of the Big Data ecosystem and Spark's role in contemporary data platforms
  • Understanding Spark architecture: drivers, executors, cluster managers, lazy evaluation, DAG, and execution planning
  • Distinctions between RDD and DataFrame APIs and when to utilise each approach
  • Creating and configuring a SparkSession, along with fundamental application configuration

Module 2: PySpark DataFrames

  • Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
  • Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
  • Implementing advanced operations such as window functions, timestamp handling, and working with nested data
  • Applying data quality checks and writing reusable, maintainable PySpark code

Module 3: Processing Large Datasets Efficiently

  • Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching, and persistence
  • Using optimisation techniques including broadcast joins and execution plan analysis
  • Processing large datasets efficiently and applying best practices for scalable data workflows
  • Understanding schema evolution and modern storage formats used in enterprise environments

Module 4: Feature Engineering at Scale

  • Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables, and feature scaling
  • Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
  • Introduction to feature selection and handling imbalanced datasets

Module 5: Machine Learning with Spark MLlib

  • Understanding MLlib architecture and the Estimator/Transformer pattern
  • Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
  • Comparing models and interpreting results within distributed Machine Learning workflows

Module 6: End-to-End ML Pipelines

  • Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering, and modelling
  • Applying train/validation/test split strategies
  • Performing cross-validation and hyperparameter tuning using grid search and random search
  • Structuring reproducible Machine Learning experiments

Module 7: Model Evaluation & Practical ML Decision Making

  • Applying appropriate evaluation metrics for regression and classification problems
  • Identifying overfitting and underfitting and making practical model selection decisions
  • Interpreting feature importance and understanding model behaviour

Module 8: Production & Enterprise Practices

  • Persisting and loading models in Spark
  • Implementing batch inference workflows on large datasets
  • Understanding the Machine Learning lifecycle in enterprise environments
  • Introduction to versioning, experiment tracking concepts, and basic testing strategies

 

Practical Outcome

  • Ability to work autonomously with PySpark
  • Ability to process large datasets efficiently
  • Ability to perform feature engineering at scale
  • Ability to build scalable Machine Learning pipelines

Requirements

Participants should possess the following background knowledge:

  • Fundamental Python programming skills, including proficiency with functions, data structures, and libraries
  • A solid grasp of data analysis concepts such as datasets, transformations, and aggregations
  • Basic SQL knowledge and understanding of relational data concepts
  • Introductory knowledge of Machine Learning concepts, including training datasets, features, and evaluation metrics
  • Familiarity with command-line interfaces and basic software development practices is advisable

Prior experience with Pandas, NumPy, or similar data processing libraries is beneficial but not required.

Duration: 21 Hours
