Triton Inference Server

  • Programming Language: Python 3.12.3

  • Backend: Python, TensorRT-LLM or Python, vLLM (depending on the selected app version)

Triton Inference Server is an inference solution designed for large-scale deployment of AI models, optimized for execution on both CPUs and GPUs. It enables efficient processing of inference requests received through HTTP/REST, gRPC, or the C API, directing them to the appropriate model scheduler depending on model type.

The server streamlines the deployment of diverse machine learning models, offering both flexibility and high performance across a wide range of AI applications.
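
A minimal way to confirm that the server is reachable over the HTTP/REST interface is the Python sketch below. It uses the tritonclient package (installable with pip install tritonclient[http]) and assumes the server is listening on the default HTTP port 8000:

import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Basic health checks before sending any inference requests.
print("Server live: ", client.is_server_live())
print("Server ready:", client.is_server_ready())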

The app is a customized implementation of NVIDIA Triton Inference Server. By default, it launches a JupyterLab interface, providing an interactive environment for code execution, result visualization, and model management. This interface is particularly suited for development and testing, facilitating experimentation with different models and configurations within a web-based notebook environment.

NVIDIA maintains a dedicated GitHub repository with learning materials for the Triton Inference Server. This repository includes conceptual guides for configuring and deploying various types of models.

Initialization

Detailed instructions for using the Initialization parameter are available in the corresponding sections of the documentation.

Model Repository

The Model repository parameter defines the set of models served by Triton. All models present in the specified directory are loaded at startup. By default, Triton operates in explicit model control mode, so models can also be loaded and unloaded at runtime through the repository API (see below).

Triton supports multiple scheduling and batching strategies, configurable on a per-model basis. Each model’s scheduler can batch inference requests and dispatch them to the corresponding backend for execution.

Models in the repository typically follow a standardized directory structure. For example:

model_repository
├── simple-onnx-model
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
├── simple-pytorch-model
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
...

In this structure:

  • config.pbtxt defines model-specific configuration parameters.

  • Model files (model.onnx, model.pt, etc.) are placed in version-specific subdirectories (e.g., /1).

Hint

To disable model loading during development, specify an empty directory as the Model Repository. This prevents the server from starting any models while keeping the JupyterLab interface available for development and testing.
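
Once models have been loaded from the repository, their metadata and parsed configuration can also be inspected programmatically. The sketch below uses the tritonclient Python package; the model name simple-onnx-model is taken from the example layout above and should be adjusted to the actual repository contents:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Inputs, outputs and available versions of a loaded model.
metadata = client.get_model_metadata("simple-onnx-model")

# The server-side view of the model's config.pbtxt.
config = client.get_model_config("simple-onnx-model")

print(metadata)
print(config)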

Stop the Server

All running inference server instances can be stopped with the following command executed in the integrated terminal:

$ stop_tritonserver

To restart the server in the background, use a command such as:

$ nohup tritonserver --http-port 8000 --grpc-port 8001 --metrics-port 8002 --model-repository=/work/model_repository &> /work/triton-server-log.txt &

This command starts Triton with the specified ports and redirects output to /work/triton-server-log.txt.

Check Loaded Models

The list of models currently loaded and available for inference can be retrieved with the following command executed in the integrated terminal:

$ curl -X POST http://localhost:8000/v2/repository/index -H "Content-Type: application/json" -d '{"ready": true}' | jq '.[] | select(.state == "READY")'

This command returns all models in the READY state, indicating that they are fully initialized and available to process inference requests.
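
The same information can be retrieved from Python with the tritonclient package; a minimal sketch, assuming the default local endpoint:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Query the model repository index and keep only models ready for inference.
index = client.get_model_repository_index()
for model in index:
    if model.get("state") == "READY":
        print(model["name"], model.get("version", ""))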

Load and Unload Models

Triton supports dynamic loading and unloading of models without restarting the server. This capability is useful for updating models or reallocating resources.

To load a model:

$ curl -X POST http://localhost:8000/v2/repository/models/simple-onnx-model/load

To unload a model:

$ curl -X POST "http://localhost:8000/v2/repository/models/simple-onnx-model/unload?unload_dependents=true"
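
The same operations are available from Python through the tritonclient package; a minimal sketch, reusing the example model name:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dynamically load a model, check that it is ready, then unload it again
# (this requires the server to run in explicit model control mode).
client.load_model("simple-onnx-model")
print(client.is_model_ready("simple-onnx-model"))
client.unload_model("simple-onnx-model")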

Further details on model management through the server API are available in the model repository API documentation.

Leverage GPU and CPU for Inference

Triton Inference Server supports execution of models on both GPUs and CPUs, enabling flexibility based on machine type selection and performance requirements.

  • GPU Acceleration: Execution on GPUs leverages parallel processing to achieve significant performance gains. GPU acceleration is supported for models developed with frameworks such as PyTorch, ONNX, TensorRT, and TensorRT-LLM.

  • CPU Optimization: For CPU-only deployments, performance can be enhanced using the OpenVINO backend, which provides optimizations tailored for inference workloads.
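
Regardless of whether a model runs on a GPU or a CPU, clients submit inference requests in the same way. The sketch below sends a request through the tritonclient Python package; the model name simple-onnx-model, the tensor names INPUT0/OUTPUT0, the input shape, and the FP32 datatype are illustrative assumptions and must match the config.pbtxt of the deployed model:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input tensor; shape and datatype must match the model config.
data = np.random.rand(1, 4).astype(np.float32)

inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

# Run inference and read back the requested output as a NumPy array.
result = client.infer("simple-onnx-model", inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))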

Disable JupyterLab

The Disable JupyterLab interface parameter can be used to disable the lab environment and run only the Triton server.

When the interface is disabled, the server operates independently and focuses exclusively on handling inference requests. If the application is deployed with a public URL, the Triton server is accessible to external clients via the HTTP/REST protocol. This mode is well suited for production environments in which the server must process requests from distributed clients.

When making API calls from a remote location, update the endpoint URL to match the deployment address. For example, replace:

http://localhost:8000/v2/repository/models

with a custom remote address, such as:

https://app-custom_link.cloud.sdu.dk/v2/repository/models
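
When using the tritonclient Python package against a public deployment, point the client at the public hostname with TLS enabled. A sketch, reusing the placeholder address above; note that the client expects the host (and optional port) without the https:// scheme:

import tritonclient.http as httpclient

# Placeholder public hostname from the example above; 443 is assumed as the
# standard HTTPS port of the deployment.
client = httpclient.InferenceServerClient(
    url="app-custom_link.cloud.sdu.dk:443", ssl=True)
print(client.is_server_ready())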

Example Tutorials

The following tutorials provide practical guidance for deploying models with Triton Inference Server:

  • A step-by-step guide to deploying Hugging Face models using the Python Backend and Triton Ensembles is available here.

  • Examples of models supported by TensorRT-LLM can be found here.

  • For implementations integrated into the UCloud workflow, see the use cases, which demonstrate embedding Triton in practical scenarios, and the webinars section for additional learning material.