
Course Outline

Introduction to AIOps

  • Defining AIOps and its significance
  • Comparing traditional monitoring with AIOps-driven observability
  • Overview of AIOps architecture and essential components

Collecting and Normalizing Operational Data

  • Categories of observability data: metrics, logs, and traces
  • Ingesting data from diverse sources (servers, containers, cloud)
  • Leveraging agents and exporters (Prometheus exporters, Beats, Fluentd)

Data Correlation and Anomaly Detection

  • Time series correlation and statistical approaches
  • Applying ML models for anomaly detection
  • Identifying incidents across distributed systems

Alerting and Noise Reduction

  • Crafting intelligent alert rules and setting thresholds
  • Techniques for suppression, deduplication, and alert grouping
  • Integrating with platforms like Alertmanager, Slack, PagerDuty, or Opsgenie

Root Cause Analysis and Visualization

  • Utilizing dashboards to visualize metrics and identify trends
  • Investigating events and timelines for RCA
  • Tracing issues across layers using distributed tracing tools

Automation and Remediation

  • Triggering automated scripts or workflows in response to incidents
  • Connecting with ITSM systems (ServiceNow, Jira)
  • Use cases: self-healing, scaling, and traffic rerouting

Open Source and Commercial AIOps Platforms

  • Overview of tools: Prometheus, Grafana, ELK, Moogsoft, Dynatrace
  • Criteria for selecting the right AIOps platform
  • Demo and hands-on practice with a chosen stack

Summary and Next Steps

Requirements

  • A solid understanding of IT operations and system monitoring concepts
  • Practical experience with monitoring tools or dashboards
  • Familiarity with fundamental log and metric formats

Audience

  • Operations teams managing infrastructure and applications
  • Site Reliability Engineers (SREs)
  • IT monitoring and observability teams

Duration: 14 Hours
