Serving a Large Language Model on UCloud

In this tutorial, we will deploy several Large Language Models (LLMs) using the Triton Inference Server on UCloud. We will leverage the TensorRT-LLM backend to build engines that run the models efficiently on GPUs.

We will begin by launching the Triton Inference Server on UCloud, selecting a version with TensorRT-LLM (TRT-LLM) support. We will instantiate a u3-gpu-4 machine type, which provides 4 NVIDIA H100 GPUs.

Download the Model Weights

The model weights can be downloaded directly from Hugging Face using the huggingface-cli tool.

First, we need to authenticate with Hugging Face:

$ huggingface-cli login --token <HF_TOKEN>

Replace <HF_TOKEN> with your Hugging Face access token. This command authenticates your session, allowing you to download models from Hugging Face directly.

We are considering four LLMs in this tutorial. The Hugging Face model names and the download directories are shown below.

$ huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir llama-3.1/8B/hf
$ huggingface-cli download meta-llama/Meta-Llama-3.1-70B-Instruct --local-dir llama-3.1/70B/hf
$ huggingface-cli download meta-llama/CodeLlama-34b-Instruct-hf  --local-dir codellama/34B/hf
$ huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1  --local-dir mixtral-0.1/8x7B/hf
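
If you prefer to script the downloads, the same steps can be done from Python with the huggingface_hub library. The following is a minimal sketch (not part of the original workflow) that assumes the same target directories as above and reads the access token from an HF_TOKEN environment variable:

# Sketch: download the model weights with the huggingface_hub Python API,
# equivalent to the huggingface-cli commands above.
import os
from huggingface_hub import login, snapshot_download

login(token=os.environ["HF_TOKEN"])  # same token as used with huggingface-cli

models = {
    "meta-llama/Meta-Llama-3.1-8B-Instruct":  "llama-3.1/8B/hf",
    "meta-llama/Meta-Llama-3.1-70B-Instruct": "llama-3.1/70B/hf",
    "meta-llama/CodeLlama-34b-Instruct-hf":   "codellama/34B/hf",
    "mistralai/Mixtral-8x7B-Instruct-v0.1":   "mixtral-0.1/8x7B/hf",
}

for repo_id, local_dir in models.items():
    # snapshot_download fetches every file in the repository into local_dir
    snapshot_download(repo_id=repo_id, local_dir=local_dir)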

Build the Model Engine

To run these models efficiently on GPU, we first need to convert them from the Hugging Face format into the TensorRT-LLM checkpoint format. We will then use the trtllm-build command-line tool to build an optimized TensorRT engine from each converted checkpoint.

We will first convert each model into the required checkpoint format by running the convert_checkpoint.py script located in the $HOME directory. Then, trtllm-build is used to build the TensorRT engine from the converted checkpoint. Normally trtllm-build requires only a single GPU, but if more GPUs are available, the engine build can be parallelized by adding the --workers argument.

Note

For the Llama models, it may be necessary to edit the corresponding model configuration files llama-3.1/{8B,70B}/hf/config.json and change the rope_scaling settings to:

"rope_scaling": {
    "factor": 8.0,
    "type": "dynamic"
}
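
If you prefer not to edit the files by hand, a small Python snippet can apply the same change. This is a minimal sketch that assumes the download paths used above:

# Sketch: apply the rope_scaling change to both Llama config files.
import json

for config_path in ("llama-3.1/8B/hf/config.json", "llama-3.1/70B/hf/config.json"):
    with open(config_path) as f:
        config = json.load(f)

    config["rope_scaling"] = {"factor": 8.0, "type": "dynamic"}

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)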

The following commands are used to build the engine for each model.

$ python ~/llama/convert_checkpoint.py --model_dir ./llama-3.1/8B/hf --output_dir ./llama-3.1/8B/trt_ckpts/tp1 --dtype float16 --tp_size 1

This converts the Llama 3.1 8B model to the TensorRT-LLM checkpoint format using FP16 precision. tp_size is set to 1, meaning the model runs with single-GPU tensor parallelism (no splitting across GPUs).

Next, we build the model engine:

$ trtllm-build --checkpoint_dir ./llama-3.1/8B/trt_ckpts/tp1 --output_dir ./llama-3.1/8B/trt_engines/fp16/1-gpu/ --gemm_plugin auto

This command builds the TensorRT engine using the converted checkpoints and stores the result in the specified directory.
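
As a quick sanity check, you can list the engine directory after the build; it should contain the serialized engine file(s) and a config.json describing the build. A minimal sketch (the exact directory contents may vary between TensorRT-LLM versions):

# Sketch: list the files produced by trtllm-build for the 8B engine.
from pathlib import Path

engine_dir = Path("llama-3.1/8B/trt_engines/fp16/1-gpu")
for path in sorted(engine_dir.iterdir()):
    print(path.name, f"{path.stat().st_size / 1e6:.1f} MB")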

$ python ~/llama/convert_checkpoint.py --model_dir ./llama-3.1/70B/hf --output_dir ./llama-3.1/70B/trt_ckpts/tp2_pp2 --dtype bfloat16 --tp_size 2 --pp_size 2 --load_by_shard --workers 2

This command converts the Llama 3.1 70B model to the TensorRT-LLM checkpoint format using BF16 precision. We also enable tensor parallelism (tp_size=2) and pipeline parallelism (pp_size=2), so the model is split two ways by each method and runs across four GPUs in total.

Then, build the engine:

$ trtllm-build --checkpoint_dir ./llama-3.1/70B/trt_ckpts/tp2_pp2 --output_dir ./llama-3.1/70B/trt_engines/bp16/4-gpu/ --max_num_tokens 4096 --max_input_len 255000 --max_seq_len 256000 --use_paged_context_fmha enable --workers 4

This builds the 70B engine for deployment across four GPUs, with a large maximum sequence length (up to 256000 tokens) and paged context FMHA enabled.

$ python ~/llama/convert_checkpoint.py --model_dir ./codellama/34B/hf/ --output_dir ./codellama/34B/trt_ckpts/tp4 --dtype float16 --tp_size 4 --use_parallel_embedding  --workers 2

Here, we convert the CodeLlama 34B model with FP16 precision, using tp_size=4 to distribute the model across four GPUs and enabling parallel embedding.

$ trtllm-build --checkpoint_dir ./codellama/34B/trt_ckpts/tp4 --output_dir ./codellama/34B/trt_engines/fp16/4-gpu/ --gemm_plugin auto --max_input_len 15360 --max_seq_len 16384 --max_batch_size 4 --workers 4

This command builds the engine for CodeLlama across four GPUs, with a maximum batch size of 4.

$ python ~/llama/convert_checkpoint.py --model_dir ./mixtral-0.1/8x7B/hf/ --output_dir ./mixtral-0.1/8x7B/trt_ckpts/tp2 --dtype float16 --tp_size 2 --moe_tp_size 2 --workers 2

For Mixtral, we use tensor parallelism for both the regular layers (tp_size=2) and the mixture-of-experts layers (moe_tp_size=2), spreading the computation across two GPUs.

$ trtllm-build --checkpoint_dir ./mixtral-0.1/8x7B/trt_ckpts/tp2 --output_dir ./mixtral-0.1/8x7B/trt_engines/2-gpu/ --gemm_plugin float16 --workers 4

This builds the engine for Mixtral using float16 precision and parallel processing across two GPUs.

Test the Engine

Once the model engine is built, you can test it by running the model with the following commands:

$ python ~/run.py --max_output_len=160 --tokenizer_dir ./llama-3.1/8B/hf/ --engine_dir llama-3.1/8B/trt_engines/fp16/1-gpu/ --input_text "What is the capital of Denmark?"

This runs the Llama 8B model on a single GPU, generating a response of up to 160 tokens.

$ mpirun -n 4 python ~/run.py --max_output_len=160 --tokenizer_dir ./llama-3.1/70B/hf/ --engine_dir llama-3.1/70B/trt_engines/bp16/4-gpu/ --input_text "What is the capital of Denmark?"

This command runs the Llama 3.1 70B model in parallel across four GPUs, launching one MPI process per GPU.

$ mpirun -n 4 python ~/run.py --max_output_len=160 --tokenizer_dir ./codellama/34B/hf/ --engine_dir ./codellama/34B/trt_engines/fp16/4-gpu/ --input_text "In python, write a function for binary searching an element in an integer array."

This runs CodeLlama across four GPUs, generating a Python function as output.

$ mpirun -n 2 python ~/run.py --max_output_len=160 --tokenizer_dir ./mixtral-0.1/8x7B/hf/ --engine_dir ./mixtral-0.1/8x7B/trt_engines/2-gpu/ --input_text "What is the Capital of Denmark?"

This tests the Mixtral model on two GPUs.

Deploy the Model with Triton

The final step in deploying the LLM is to set up a repository that Triton will use to serve the model. For each model, you need to create a directory in the repository and place the relevant config.pbtxt and engine files there.

Stop any running inference server

Before deploying the new models, make sure to stop any running Triton server instances to avoid conflicts:

$ stop_tritonserver

Set up the model repository

Triton requires a model repository to serve models. In this case, we can utilize model ensembles located in $HOME/all_models. This directory contains two groups of models:

  • gpt: uses the pure Python TensorRT-LLM runtime.

  • inflight_batcher_llm: uses the C++ TensorRT-LLM backend with the executor API, which supports the latest features, such as inflight batching.

For this illustration, we'll focus on deploying the Llama 3.1:8B model.

Create the Triton repository directory

Start by creating a directory for the Llama 3.1:8B model in the Triton model repository:

$ mkdir -p llama-3.1/8B/triton

Next, copy the necessary model files from the appropriate directory (in this case, inflight_batcher_llm):

$ cp -r ~/all_models/inflight_batcher_llm/* llama-3.1/8B/triton/

The directory structure of the Llama 3.1:8B model deployment will look like this:

llama-3.1/8B/triton/
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── tensorrt_llm
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── tensorrt_llm_bls
    ├── 1
    │   ├── lib
    │   │   ├── decode.py
    │   │   └── triton_decoder.py
    │   └── model.py
    └── config.pbtxt

Configure the model parameters

The next step is to define the model parameters and apply them to the configuration files (config.pbtxt). This can be done using the fill_template.py script provided in the environment.

Here are the environment variables you'll use to configure the model:

$ ENGINE_DIR=llama-3.1/8B/trt_engines/fp16/1-gpu/
$ TOKENIZER_DIR=llama-3.1/8B/hf/
$ MODEL_FOLDER=llama-3.1/8B/triton/
$ TRITON_MAX_BATCH_SIZE=4
$ INSTANCE_COUNT=1
$ MAX_QUEUE_DELAY_MS=0
$ MAX_QUEUE_SIZE=0
$ FILL_TEMPLATE_SCRIPT=${HOME}/tools/fill_template.py
$ DECOUPLED_MODE=false

Now, use the fill_template.py script to apply these parameters to the relevant config files:

$ python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
$ python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
$ python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE}
$ python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
$ python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
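
To verify that every template parameter was filled in, you can scan the generated config files for leftover ${...} placeholders. This is a minimal sketch, assuming the templates in the model repository use the ${parameter} placeholder syntax shown above:

# Sketch: report any template placeholders left unfilled in the Triton configs.
import re
from pathlib import Path

model_folder = Path("llama-3.1/8B/triton")
for config in sorted(model_folder.glob("*/config.pbtxt")):
    leftover = re.findall(r"\$\{\w+\}", config.read_text())
    if leftover:
        print(f"{config}: unfilled parameters {sorted(set(leftover))}")
    else:
        print(f"{config}: OK")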

Start the Triton server

Now that the model repository is set up, start the Triton server and point it to the newly created model repository:

$ nohup tritonserver --model-repository=$MODEL_FOLDER &> /work/triton-server-log.txt &

This command runs the Triton server in the background, and the logs are stored in /work/triton-server-log.txt.
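
Before sending inference requests, you can confirm that the server and the deployed ensemble model are ready via Triton's standard HTTP health endpoints on port 8000. A minimal sketch using the requests library:

# Sketch: poll Triton's HTTP health endpoints until the server and model are ready.
import time
import requests

BASE = "http://localhost:8000"

# /v2/health/ready returns HTTP 200 once the server is ready to receive requests.
for _ in range(60):
    try:
        if requests.get(f"{BASE}/v2/health/ready", timeout=2).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)

# Per-model readiness for the ensemble we just deployed.
r = requests.get(f"{BASE}/v2/models/ensemble/ready", timeout=2)
print("ensemble ready:", r.status_code == 200)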

Inference test with Triton

Once the ensemble model is deployed on the Triton server and running, you can test it using a simple curl command to send a request to the server and get an inference response.

Here’s how to test the Llama 3.1:8B model via Triton:

$ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": ""}' | jq -r '.text_output'

This command sends a request to the Triton server at localhost:8000, processes the input, and returns the model's generated output.

Explanation of Parameters:

  • text_input: The input text that you want the model to process. In this example, we're asking the model "What is ML?".

  • max_tokens: The maximum number of tokens that the model is allowed to generate in its response. Here, it's set to 50.

  • bad_words: Words to avoid in the model’s response (if any). This is an empty string in this example.

  • stop_words: Words that will signal the model to stop generating. This is also left as an empty string in this case.
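
The same request can also be issued from Python, which is convenient for scripting. Here is a minimal sketch using the requests library against the generate endpoint shown above, with the same payload as the curl example:

# Sketch: query the ensemble model through Triton's generate endpoint from Python.
import requests

payload = {
    "text_input": "What is ML?",
    "max_tokens": 50,
    "bad_words": "",
    "stop_words": "",
}

response = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["text_output"])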

For more information about Triton ensemble models, see the Triton Inference Server documentation on model ensembles.

Evaluating model performance

After deploying the model, you can evaluate its performance using GenAI-Perf. This tool measures various performance metrics, including throughput and latency:

$ genai-perf profile -m ensemble --service-kind triton --backend tensorrtllm --num-prompts 100 --random-seed 123 --synthetic-input-tokens-mean 200 --synthetic-input-tokens-stddev 0 --output-tokens-mean 100 --output-tokens-stddev 0 --output-tokens-mean-deterministic --tokenizer $TOKENIZER_DIR --concurrency 1 --measurement-interval 4000 --profile-export-file model_profile.json --url localhost:8001 --generate-plots

You should expect an output that looks like this:

                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃              Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│   Request latency (ms) │ 677.55 │ 675.86 │ 699.82 │ 695.50 │ 677.71 │ 676.83 │
│ Output sequence length │ 299.95 │ 299.00 │ 301.00 │ 301.00 │ 300.00 │ 300.00 │
│  Input sequence length │ 199.95 │ 199.00 │ 201.00 │ 201.00 │ 200.00 │ 200.00 │
└────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Output token throughput (per sec): 442.65
Request throughput (per sec): 1.48

This table summarizes metrics such as latency, input/output sequence length, and throughput, providing an indication of how well the model performs under specific conditions.
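
As a quick sanity check of these numbers: at concurrency 1, each request takes about 0.678 s and produces roughly 300 output tokens, which reproduces the reported request and token throughput. A minimal sketch of the arithmetic:

# Sketch: back-of-the-envelope check of the GenAI-Perf numbers above.
avg_latency_s = 677.55 / 1000   # average request latency in seconds
avg_output_tokens = 299.95      # average output sequence length
concurrency = 1

request_throughput = concurrency / avg_latency_s           # ~1.48 requests/s
token_throughput = request_throughput * avg_output_tokens  # ~442.7 tokens/s
print(f"{request_throughput:.2f} req/s, {token_throughput:.1f} tokens/s")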