
Course Outline

Tencent Hunyuan Production Fundamentals

  • Overview of Tencent Hunyuan model serving scenarios.
  • Production characteristics of large and MoE models.
  • Common latency, throughput, and cost bottlenecks.
  • Defining service-level objectives for inference workloads.

Deployment Architecture and Serving Flow

  • Core components of a production inference stack.
  • Selecting between containerized, on-premise, and cloud deployment models.
  • Fundamentals of model loading, request routing, and GPU allocation.
  • Designing for reliability and operational simplicity.

Latency Optimization in Practice

  • Using optimized inference engines such as TensorRT where appropriate.
  • Understanding KV-cache concepts and practical cache tuning.
  • Reducing startup, warmup, and response overhead.
  • Measuring time to first token and token generation speed.
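
As an illustration of the last topic, the sketch below derives time to first token (TTFT) and decode speed from a stream of tokens. It is a minimal sketch, not tied to any particular serving API: `measure_stream`, its `clock` parameter, and the `fake_stream` generator are hypothetical names introduced here, and the stream is assumed to yield one token per iteration.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a token stream (hypothetical helper, not a library API).

    TTFT is the delay from the start of the call to the first token.
    Decode speed is (tokens after the first) / (time after the first token).
    """
    start = clock()
    first: Optional[float] = None
    last: Optional[float] = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1

    if first is None:
        # The stream produced no tokens, so nothing can be measured.
        return {"ttft_s": None, "tokens": 0, "decode_tokens_per_s": None}

    decode_time = last - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"ttft_s": first - start, "tokens": count, "decode_tokens_per_s": rate}


def fake_stream(n: int):
    """Stand-in for a real streaming response (assumption for the demo)."""
    for i in range(n):
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(16)))
```

Separating TTFT from decode speed matters because the two are typically driven by different phases: prompt processing for the former, per-token generation for the latter.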

Throughput, Batching, and GPU Efficiency

  • Strategies for continuous batching and request batching.
  • Managing concurrency and queue behavior.
  • Enhancing GPU utilization without compromising user experience.
  • Handling long-context and mixed-workload requests.
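
To make the batching topics concrete, the toy simulation below compares static batching with continuous batching by counting decode steps. It is a simplified sketch under stated assumptions: each request needs a known, positive number of decode steps, every step costs the same regardless of batch occupancy, and prefill cost is ignored. The function names are hypothetical and do not correspond to any serving engine's API.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Finished requests free their slot, and waiting requests join
    at the next decode step instead of waiting for the whole batch."""
    queue = deque(lengths)
    active: List[int] = []  # remaining decode steps per in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step for every active request
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]  # illustrative output lengths, in tokens
    print("static:    ", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With mixed output lengths, short requests in a static batch leave their slot idle until the longest request completes, and that idle time is what continuous batching recovers.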

Quantization and Cost Control

  • Why quantization matters for production serving.
  • Practical trade-offs of FP16, INT8, and other common precision options.
  • Balancing model quality, latency, and infrastructure cost.
  • Creating a simple checklist for cost optimization.
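
As a rough illustration of the precision trade-off, the sketch below estimates weight memory from parameter count and bytes per parameter. It is a back-of-the-envelope estimate only: it ignores KV cache, activations, and the scale or zero-point metadata that quantized formats carry. The 7-billion-parameter figure is a hypothetical example, not a statement about any Hunyuan model, and `weight_memory_gb` is a name introduced here.

```python
# Storage cost per parameter for common precisions.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes).

    Weights only: excludes KV cache, activations, and quantization metadata.
    """
    try:
        bytes_per_param = BYTES_PER_PARAM[precision.lower()]
    except KeyError:
        raise ValueError(f"unknown precision: {precision!r}") from None
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    params = 7e9  # hypothetical model size, for illustration only
    for precision in ("fp16", "int8", "int4"):
        print(f"{precision}: {weight_memory_gb(params, precision):.1f} GB")
```

An estimate like this is a natural first item on a cost checklist, since it indicates how many GPUs the weights alone require before any quality evaluation of the quantized model.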

Operations, Monitoring, and Readiness Review

  • Autoscaling triggers for inference services.
  • Monitoring latency, throughput, cache usage, and GPU health.
  • Basics of logging, alerting, and incident response.
  • Reviewing a reference deployment and developing an improvement plan.

Requirements

  • Fundamental understanding of large language model deployment and inference workflows.
  • Experience with containers, cloud or on-premise infrastructure, and API-based services.
  • Proficiency in Python or experience with systems engineering tasks.

Audience

  • Machine learning engineers deploying LLMs into production.
  • Platform engineers responsible for GPU-based inference services.
  • Solution architects designing scalable AI serving platforms.

Duration: 14 hours
