PySpark and Machine Learning Training Course

This training offers a hands-on introduction to constructing scalable data processing and Machine Learning workflows using PySpark. Participants will learn how Apache Spark functions within contemporary Big Data ecosystems and how to efficiently process extensive datasets by applying distributed computing principles.

The course progresses from Spark architecture and DataFrame operations to advanced subjects like feature engineering, Machine Learning model training, and the construction of end-to-end ML pipelines using Spark MLlib. Participants will also delve into performance optimisation techniques, model evaluation strategies, and enterprise best practices for deploying Machine Learning workflows at scale.

Through practical exercises and scenarios inspired by real-world use cases, participants will acquire the skills to design efficient data pipelines, prepare datasets for Machine Learning, and build distributed ML models capable of handling the large data volumes typical of enterprise settings.

By the end of the training, participants will understand how to integrate PySpark into modern data platforms and apply scalable Machine Learning techniques in production-oriented environments.

This course is available as onsite live training in Kenya or online live training.

Course Outline

PySpark & Machine Learning

Module 1: Big Data & Spark Foundations

Overview of the Big Data ecosystem and Spark's role in contemporary data platforms
Understanding Spark architecture: drivers, executors, cluster managers, lazy evaluation, DAG, and execution planning
Distinctions between RDD and DataFrame APIs and when to utilise each approach
Creating and configuring SparkSession, alongside fundamental application configuration

Module 2: PySpark DataFrames

Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
Implementing advanced operations such as window functions, timestamp handling, and working with nested data
Applying data quality checks and writing reusable, maintainable PySpark code

Module 3: Processing Large Datasets Efficiently

Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching, and persistence
Using optimisation techniques including broadcast joins and execution plan analysis
Efficient processing of large datasets and best practices for scalable data workflows
Understanding schema evolution and modern storage formats used in enterprise environments

Module 4: Feature Engineering at Scale

Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables, and feature scaling
Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
Introduction to feature selection and handling imbalanced datasets

Module 5: Machine Learning with Spark MLlib

Understanding MLlib architecture and the Estimator/Transformer pattern
Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
Comparing models and interpreting results within distributed Machine Learning workflows

Module 6: End-to-End ML Pipelines

Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering, and modelling
Applying train/validation/test split strategies
Performing cross-validation and hyperparameter tuning using grid search and random search
Structuring reproducible Machine Learning experiments

Module 7: Model Evaluation & Practical ML Decision Making

Applying appropriate evaluation metrics for regression and classification problems
Identifying overfitting and underfitting and making practical model selection decisions
Interpreting feature importance and understanding model behaviour

Module 8: Production & Enterprise Practices

Persisting and loading models in Spark
Implementing batch inference workflows on large datasets
Understanding the Machine Learning lifecycle in enterprise environments
Introduction to versioning, experiment tracking concepts, and basic testing strategies

Practical Outcome

Ability to work autonomously with PySpark
Ability to process large datasets efficiently
Ability to perform feature engineering at scale
Ability to build scalable Machine Learning pipelines

Requirements

Participants should possess the following background knowledge:

Fundamental Python programming skills, including proficiency with functions, data structures, and libraries
A solid grasp of data analysis concepts such as datasets, transformations, and aggregations
Basic SQL knowledge and understanding of relational data concepts
Introductory knowledge of Machine Learning concepts, including training datasets, features, and evaluation metrics
Familiarity with command-line interfaces and basic software development practices (advisable rather than required)

Prior experience with Pandas, NumPy, or similar data processing libraries is beneficial but not required.

Duration: 21 hours


Testimonials (1)

I liked that it was practical. Loved to apply the theoretical knowledge with practical examples.

Aurelia-Adriana - Allianz Services Romania

Course - Python and Spark for Big Data (PySpark)
