Chat UI


UCloud Chat UI is a versatile, feature-rich, and user-friendly self-hosted web chat interface designed to function completely offline. It efficiently supports various Large Language Model (LLM) runners, including Ollama and OpenAI-compatible APIs.

The app is a customized implementation of Open WebUI.

Initialization

For guidance on utilizing the Initialization parameter, please refer to the Initialization - Bash script section of the documentation.
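As a minimal sketch, an initialization script passed via this parameter might install extra packages before the app starts (the package chosen here is only an example; jq is used in the API examples later in this page):

#!/bin/bash
# Hypothetical initialization script: install extra packages at startup
sudo apt-get update
sudo apt-get install -y jq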

Data Management

Upon the initial launch, users are prompted to import a directory from UCloud. If the directory is empty, the app will automatically create a structured folder system for storing models, caches, configurations, and more:

my_data_volume/
├── cache
│   ├── embedding
│   │   └── models
│   ├── huggingface
│   │   └── hub
│   ├── image
│   │   └── generations
│   └── whisper
│       └── models
├── config.json
├── docs
├── litellm
│   └── config.yaml
├── models
│   └── blobs
├── readme.txt
├── uploads
├── vector_db
│   └── chroma.sqlite3
└── webui.db

This structured configuration is preserved for future data imports of the same data directory. For convenience, the path to the imported data volume is stored in the DATA_DIR environment variable.
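For example, the imported volume can be inspected from the app's terminal through this variable:

$ echo "$DATA_DIR"
$ ls "$DATA_DIR"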

The application initiates an Ollama server on port 11434 in the backend, while the frontend UI can be accessed from the job's progress view in UCloud.

Backend logs are stored at /work/ollama-logs.txt.

Users can interact with the models through the frontend UI and inspect the API calls generated, which can be exported as Python, Go, TypeScript code snippets, or shell scripts.

User Authentication

Once the app's data repository is set up, users must register an admin account via the web interface. Additional accounts can be created later on the same page, but they require admin approval before gaining access.

The admin panel and UI settings can be accessed by clicking on the user's name at the bottom left of the web interface.

Note

To prevent unauthorized sign-ups, use the Disable signup option, which is recommended if sharing the application through a custom public link.

Model Integration

Downloading and managing LLMs is seamless via the app's web interface. To pull a model, open Settings > Models and input the model's name:tag, as demonstrated below:

[Screenshot: downloading a model via Settings > Models]

Alternatively, models can be downloaded directly via the app’s terminal using the Ollama API:

$ ollama pull llama2:7b-chat

Example output:

pulling manifest
pulling 8934d96d3f08... 100% ▕████████████████████████████████████████▏ 3.8 GB
pulling 8c17c2ebb0ea... 100% ▕████████████████████████████████████████▏ 7.0 KB
pulling 7c23fb36d801... 100% ▕████████████████████████████████████████▏ 4.8 KB
pulling 2e0493f67d0c... 100% ▕████████████████████████████████████████▏ 59 B
pulling fa304d675061... 100% ▕████████████████████████████████████████▏ 91 B
pulling 42ba7f8a01dd... 100% ▕████████████████████████████████████████▏ 557 B
verifying sha256 digest
writing manifest
removing any unused layers
success

By default, models are stored within the imported data volume as shown here:

my_data_volume/models/
├── blobs
│   ├── sha256-2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988
│   ├── sha256-42ba7f8a01ddb4fa59908edd37d981d3baa8d8efea0e222b027f29f7bcae21f9
│   ├── sha256-7c23fb36d80141c4ab8cdbb61ee4790102ebd2bf7aeff414453177d4f2110e5d
│   ├── sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246
│   ├── sha256-8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b
│   └── sha256-fa304d6750612c207b8705aca35391761f29492534e90b30575e4980d6ca82f6
└── manifests
    └── registry.ollama.ai
        └── library
            └── llama2
                └── 7b-chat

The user can specify a different directory for models using the Import Ollama models optional parameter. The list of all models supported by Ollama is available in the Ollama model library (https://ollama.com/library).
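Models already pulled can also be listed and removed from the app's terminal:

$ ollama list                # show installed models and their sizes
$ ollama rm llama2:7b-chat   # delete a model (example tag)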

Furthermore, it is possible to upload a GGUF model file from the web interface, using this experimental feature in Settings > Models:

[Screenshot: uploading a GGUF model file in Settings > Models]

Hint

Use the Update All Models button beside the Ollama base URL in Settings > Models to fetch the latest releases of the installed models.

Model Loading

Once the models are integrated into the system, users can select one or more models from a drop-down menu at the top of the chat interface. This functionality is designed to provide flexibility in switching between different language models based on the user's needs and the specific tasks at hand.

The loading of a specific model can be monitored by inspecting the Ollama log file from the app's terminal interface, for example:

$ tail -f /work/ollama-logs.txt

Example output:

....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   320.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   320.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_new_context_with_model:  CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   400.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   400.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   32.02 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 3

Loading time and performance considerations

Loading LLMs can be resource-intensive and time-consuming. These models require significant computational power and occupy considerable VRAM when hosted on GPU-equipped compute nodes. Users can also deploy models on CPU-based nodes, but the best performance is typically achieved on GPU nodes due to faster processing capabilities.

The time required to load a model varies based on its size and complexity, and it may take a few minutes for the LLM to be fully operational.

Troubleshooting slow load times

In instances where the model is used for the first time and appears to be stuck during loading, the recommended approach is to edit the chat prompt and resubmit it. This action typically resolves any temporary glitches and completes the loading process.

Enhanced Document Interaction

The app supports Retrieval Augmented Generation (RAG) with local documents and web pages, allowing for the integration of diverse content sources into chats.

Document embedding model

Users can select their preferred document embedding model via the corresponding optional parameter in the app settings. By default, the app utilizes intfloat/multilingual-e5-large from Hugging Face. However, any model supported by Sentence-Transformers can be configured for use.
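As a sketch, assuming the Local settings file described under Custom Configuration below also accepts this Open WebUI variable, the embedding model could be pinned explicitly there:

# Hypothetical local settings entry selecting a Sentence-Transformers model
RAG_EMBEDDING_MODEL=intfloat/multilingual-e5-large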

When a specific model is selected, it is automatically downloaded if not already available and stored locally in:

my_data_volume/cache/embedding/
└── models
    └── models--intfloat--multilingual-e5-large

Additionally, users have the option to choose between specialized embedding models from Ollama and OpenAI directly in the Document Settings section of the web interface:

[Screenshot: embedding model selection in Document Settings]

These selections allow users to tailor the embedding functionality to their linguistic and computational needs.

Important

If you change the default embedding model, RAG chat will no longer work with documents embedded under the previous model; they must be re-embedded.

Uploading documents

Document files can be uploaded via the Documents tab on the web interface:

[Screenshot: the Documents tab]

Users can monitor the upload and embedding progress directly from the app's standard output, e.g.:

Example output:

INFO: Started server process [981]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:3000 (Press CTRL+C to quit)
Fetching 19 files: 100%|██████████| 19/19 [00:11<00:00, 1.70it/s]
Batches: 100%|██████████| 25/25 [01:44<00:00, 4.20s/it]

Once uploaded, documents are stored in the uploads folder within the imported data volume, as in this example:

my_data_volume/
└── uploads
    └── my_document.pdf

Files can also be imported from UCloud. In this case, they must be copied into the docs folder:

my_data_volume/
└── docs
    └── my_document.pdf
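For instance, a file already present in the job's /work folder can be copied there from the app's terminal (my_document.pdf is a placeholder name):

$ cp /work/my_document.pdf "$DATA_DIR/docs/"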

To embed documents in the vector database, users need to access Document Settings and click the scan button.

Sourcing external content

RAG is activated by starting the prompt with the # symbol. This displays a list of sources for inclusion in chats:

[Screenshot: list of sources shown after typing #]

Hint

It is also possible to source web pages by typing # followed by the target URL. The app fetches and parses the URL.
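For example, a prompt like the following (the URL is arbitrary) injects the page content into the chat context:

#https://example.com Summarize this page in three bullet points.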

Voice Input Capabilities

The app allows voice interactions with LLMs using built-in voice input support. The system auto-detects silence and initiates a response after 3 seconds of inactivity.

By default, speech-to-text transcription is performed by the app's built-in web speech API. Alternatively, it is possible to use OpenAI Whisper in the backend to perform transcription locally. By default, the large Whisper model is downloaded and stored in the imported data volume when the app starts:

my_data_volume/cache/whisper/
└── models
    └── models--Systran--faster-whisper-large-v3

A different Whisper model can be specified via the corresponding optional parameter in the app settings.
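Assuming the Local settings file described under Custom Configuration also accepts the corresponding Open WebUI variable, a smaller model could alternatively be selected there (shown as an assumption, not a documented parameter of this app):

# Hypothetical local settings entry selecting a smaller Whisper model
WHISPER_MODEL=base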

Image Generation Capabilities

The app integrates image generation capabilities through two backends: AUTOMATIC1111 and OpenAI DALL·E. This feature enhances the chat experience by allowing dynamic text-to-image creation.

To utilize AUTOMATIC1111, users must first activate the Enable text-to-image generation option in the app settings. Upon activation, the app downloads the Stable Diffusion repository to the imported data folder and installs all necessary dependencies for the first time. API activity and operational logs are maintained at /work/stable-diffusion-log.txt.
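These logs can be followed live from the app's terminal:

$ tail -f /work/stable-diffusion-log.txt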

Once configured, the image generation engine is enabled by default. Users can further customize the experience by adjusting additional image configuration parameters in Settings > Images, as shown here:

[Screenshot: image configuration parameters in Settings > Images]

This functionality significantly broadens the application’s utility, making it an ideal tool for users seeking to combine textual and visual content seamlessly.

Custom Configuration

Users can fine-tune the application's behavior by setting environment variables. These can be passed via a text file using the Local settings optional parameter in the app configuration.

For instance, by incorporating the following settings in the configuration file, users can ensure automatic updates for the RAG embedding and Whisper speech-to-text models at startup. Additionally, including the OpenAI API key enables the selection of advanced AI models such as OpenAI GPT-3.5 and GPT-4 Turbo:

# Custom app settings
OPENAI_API_KEY=<add_your_openai_api_key>

RAG_EMBEDDING_MODEL_AUTO_UPDATE=True
WHISPER_MODEL_AUTO_UPDATE=True

These configurations facilitate a more responsive and customized user experience, allowing for the integration of cutting-edge models and updates as they become available.

API Integration

The app provides seamless integration with OpenAI-compatible APIs, allowing for flexible and rich conversations alongside Ollama models, enhancing the overall user experience.

Example usage

To use the Ollama API, begin by accessing the app’s terminal interface. Here’s how to execute a command that interacts with the API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2:7b-chat",
  "prompt": "Write the first 10 Fibonacci numbers",
  "stream": false,
  "options": {
    "seed": 101,
    "temperature": 0
  }
}' | jq -r '.response'

Example output:

  % Total    % Received % Xferd  Average Speed   Time    Time    Time  Current
                                 Dload  Upload   Total   Spent   Left  Speed
100   992  100   810    0   182      3      0  0:04:30  0:04:21 0:00:09   256

Certainly! The first 10 Fibonacci numbers are:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

This command sends a request to generate responses based on the specified prompt, utilizing the already downloaded model. It demonstrates setting parameters such as seed for reproducibility and temperature to control the level of creativity in responses.
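Other endpoints of the Ollama API can be probed in the same way; for instance, listing the locally available models:

$ curl -s http://localhost:11434/api/tags | jq -r '.models[].name'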

Remote API probing

For remote interactions, ensure the app is deployed with a publicly accessible link and has the frontend disabled, using the Disable UI optional parameter. This setup is particularly beneficial for server-side operations or integration with other systems in a headless environment, where the graphical user interface is not required.

When making API calls from a remote location, update the endpoint URL in your requests to reflect the actual deployment address. For instance, replace:

http://localhost:11434/api/generate

with your custom remote address, such as:

https://app-custom_link.cloud.sdu.dk/api/generate

Here, custom_link should be replaced with the specific URL name associated with your app's deployment, ensuring secure and accurate API interactions in remote setups.
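Putting it together, the earlier generation request could be issued remotely like this (custom_link is again a placeholder):

curl https://app-custom_link.cloud.sdu.dk/api/generate -d '{
  "model": "llama2:7b-chat",
  "prompt": "Write the first 10 Fibonacci numbers",
  "stream": false
}' | jq -r '.response'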