Triton Inference Server

  • Backend: Python, TensorRT-LLM

  • Backend: Python, vLLM

Triton Inference Server is a powerful inferencing solution designed to support the deployment of AI models at scale, optimized for both CPUs and GPUs. It facilitates the efficient execution of inference requests arriving via HTTP/REST, GRPC, or the C API, routing them to the appropriate model scheduler based on the model type.

Triton simplifies the deployment of various machine learning models, providing flexibility and performance for a wide range of AI applications.

The app is a customized implementation of NVIDIA Triton Inference Server. By default, it starts with a JupyterLab interface, which provides a user-friendly environment for interacting with Triton, running code, visualizing results, and managing models. This interface is particularly useful for development and testing purposes, allowing users to easily experiment with different models and configurations directly within a web-based notebook environment.

NVIDIA maintains a dedicated GitHub repository with learning materials for the Triton Inference Server. This repository contains conceptual guides for setting up different types of models.

Initialization

For instructions on how to use the Initialization parameter, please refer to the corresponding section of the documentation.

Model Repository Directory

The Model repository directory parameter lets you specify which models the Triton server will serve. All models present in the repository are loaded during startup. By default, Triton operates in explicit model control mode, which also allows models to be loaded and unloaded at runtime (see Loading and Unloading Models below).

Triton supports various scheduling and batching algorithms, configurable on a per-model basis. Each model's scheduler can batch inference requests and route them to the appropriate backend for processing.

Models in the repository typically follow a standardized directory structure. Here's an example:

model_repository
├── simple-onnx-model
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
├── simple-pytorch-model
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
...

In this structure:

  • config.pbtxt is the configuration file containing model-specific details (a sample configuration is sketched after this list).

  • Model files (model.onnx, model.pt, etc.) are stored in version-specific subdirectories (e.g., /1).
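
As an illustrative sketch, the config.pbtxt for the simple-onnx-model entry above might look like the following. The tensor names, shapes, and batch size here are placeholders and must match the actual model; dynamic_batching enables the per-model request batching mentioned earlier.

# Hypothetical config.pbtxt for simple-onnx-model; adjust tensors to your model
name: "simple-onnx-model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
# Optional: allow Triton to group individual requests into larger batches
dynamic_batching { }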

Stopping the Server

To stop all running inference server instances, execute the following command in the app's integrated terminal:

$ stop_tritonserver

To restart the server in the background, you can use a command like this:

$ nohup tritonserver --http-port 8000 --grpc-port 8001 --metrics-port 8002 --model-repository=/work/model_repository &> /work/triton-server-log.txt &

This will run the server with the specified ports and log output to the file /work/triton-server-log.txt.
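
To confirm that the restarted server is up, you can poll the standard health endpoint (adjust the port if you changed the --http-port value above):

$ curl -v http://localhost:8000/v2/health/ready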

Checking Loaded Models

To check the list of models currently loaded and ready for inference, run the following command in the app's integrated terminal:

$ curl -X POST http://localhost:8000/v2/repository/index -H "Content-Type: application/json" -d '{"ready": true}' | jq '.[] | select(.state == "READY")'

This will list all models that are in the "READY" state, indicating they are fully loaded and ready to handle inference requests.
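
Once a model reports READY, you can send it a request over the HTTP/REST endpoint. The sketch below uses the tritonclient Python package (pip install tritonclient[http]); the model name matches the repository example above, while the tensor names, shape, and datatype are assumptions that must match the model's config.pbtxt.

import numpy as np
import tritonclient.http as httpclient

# Connect to the local HTTP endpoint started above
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single FP32 input tensor; "INPUT0"/"OUTPUT0" are assumed tensor names
data = np.random.rand(1, 4).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Send the request and read back the result
response = client.infer("simple-onnx-model", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))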

Loading and Unloading Models

Triton allows dynamic loading and unloading of models without requiring a server restart. This is useful when updating models or adjusting resources.

To load a model:

$ curl -X POST http://localhost:8000/v2/repository/models/simple-onnx-model/load

To unload a model:

$ curl -X POST http://localhost:8000/v2/repository/models/simple-onnx-model/unload

For more information on using the server API for model management, refer to the model repository API documentation.
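
The same operations are available through the tritonclient Python package. A minimal sketch, reusing the model name from the examples above:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load the model; should print True once loading has finished
client.load_model("simple-onnx-model")
print(client.is_model_ready("simple-onnx-model"))

# Inspect the repository index, then unload the model again
for entry in client.get_model_repository_index():
    print(entry)
client.unload_model("simple-onnx-model")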

Leveraging GPU and CPU for Inference

Triton Inference Server supports running models on both GPUs and CPUs, providing flexibility based on the selected machine type and performance requirements.

  • GPU Acceleration: Models running on GPUs can benefit from parallel processing, which significantly boosts performance. Triton supports GPU acceleration for models built with frameworks like PyTorch, TensorFlow, ONNX, TensorRT, and TensorRT-LLM.

  • CPU Optimization: For CPU-only deployments, Triton supports optimization via the OpenVINO backend, which enhances performance for inference tasks.
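
Whether a model runs on GPU or CPU is controlled per model through the instance_group setting in its config.pbtxt. The snippet below is an illustrative sketch; the instance counts and GPU index are arbitrary examples.

# Run two copies of the model on GPU 0
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# On a CPU-only machine, request CPU instances instead:
# instance_group [
#   {
#     count: 4
#     kind: KIND_CPU
#   }
# ]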

Disable JupyterLab

In some scenarios, you may want to disable the lab interface and run only the Triton server. You can do this by using the Disable JupyterLab interface optional parameter.

When the lab interface is disabled, the server operates independently, focusing solely on serving inference requests. In this mode, if the application is deployed with a public URL, the Triton server becomes accessible to external clients via HTTP/REST protocol. This setup allows external applications or systems to send inference requests directly to the Triton server, making it ideal for production environments where the server is expected to handle incoming requests from distributed clients.

Note

When deployed with a public URL, the server HTTP port must be set to 7681.

When making API calls from a remote location, update the endpoint URL in your requests to reflect the actual deployment address. For instance, replace:

http://localhost:7681/v2/repository/models

with your custom remote address, such as:

https://app-custom_link.cloud.sdu.dk/v2/repository/models
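
For example, the repository index query shown earlier would then be issued against the public address (with app-custom_link replaced by your actual deployment link):

$ curl -X POST https://app-custom_link.cloud.sdu.dk/v2/repository/index -H "Content-Type: application/json" -d '{"ready": true}'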

Example Tutorials

The following tutorials provide practical guidance for deploying models on Triton Inference Server:

  • A step-by-step guide for deploying Hugging Face models using the Python Backend and Triton Ensembles is available here.

  • A few examples of models supported by the TensorRT-LLM backend.

  • For implementations integrated within the UCloud workflow, refer to these use cases, which offer examples of embedding Triton in various real-world scenarios.