Chat UI¶
UCloud Chat UI is a versatile, feature-rich, and user-friendly self-hosted web chat interface designed to function completely offline. It efficiently supports various Large Language Model (LLM) runners, including Ollama and OpenAI-compatible APIs.
The app is a customized implementation of Open WebUI.
Initialization¶
For guidance on utilizing the Initialization parameter, please refer to the Initialization: Bash Script section of the documentation.
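For example, a minimal initialization script might pre-install extra Python packages before the Chat UI starts. The package below is only a placeholder, and the sketch assumes pip is available in the app container:

#!/bin/bash
# Hypothetical initialization script: pre-install extra Python packages
# before the Chat UI starts. Package names are placeholders.
pip install --user pypdf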
Data Management¶
Upon the initial launch, users are prompted to import a directory from UCloud. If the directory is empty, the app will automatically create a structured folder system for storing models, caches, configurations, and more:
my_data_volume/
├── cache
│ ├── audio
│ │ └── speech
│ ├── embedding
│ │ └── models
│ ├── huggingface
│ │ └── hub
│ ├── image
│ │ └── generations
│ └── whisper
│ └── models
├── config.json
├── docs
├── functions
├── litellm
│ └── config.yaml
├── models
│ └── blobs
├── readme.txt
├── tools
├── uploads
├── vector_db
│ └── chroma.sqlite3
└── webui.db
This structured configuration is preserved for future data imports of the same data directory.
For convenience, the path to the imported data volume is stored in the DATA_DIR environment variable.
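For example, the contents of the data volume can be inspected from the app's terminal interface:

$ echo $DATA_DIR
$ ls $DATA_DIR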
The application starts an Ollama server on port 11434 in the backend, while the frontend UI can be accessed by clicking the corresponding button on the job's progress page. Backend logs are stored at /work/ollama-logs.txt.
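As a quick check, the backend can be probed from the app's terminal using the standard Ollama API, for example:

$ curl http://localhost:11434/api/version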
Users can interact with the models through the frontend UI and inspect the API calls generated, which can be exported as Python, Go, TypeScript code snippets, or shell scripts.
User Authentication¶
Once the app's data repository is set up, users must register an admin account via the web interface. Additional accounts can be created subsequently on the same page but require admin approval to access.
The admin panel and UI settings can be accessed by clicking on the user's name at the bottom left of the web interface.
Note
To prevent unauthorized sign-ups, use the Disable signup option, which is recommended if sharing the application through a custom public link.
Model Integration¶
Downloading and managing LLMs is seamless using the app's web interface. For this, the user needs to access Settings > Models and input the model's name:tag, as demonstrated below:
Alternatively, models can be downloaded directly via the app’s terminal using the Ollama API:
$ ollama pull llama3:8b-instruct-fp16
Tip
pulling manifest
pulling f2296999531d... 100% ▕████████████████████████████████████████▏ 16 GB
pulling 4fa551d4f938... 100% ▕████████████████████████████████████████▏ 12 KB
pulling 8ab4849b038c... 100% ▕████████████████████████████████████████▏ 254 B
pulling 577073ffcc6c... 100% ▕████████████████████████████████████████▏ 110 B
pulling c3dce7e8d73b... 100% ▕████████████████████████████████████████▏ 484 B
verifying sha256 digest
writing manifest
removing any unused layers
success
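The downloaded models can then be listed from the terminal:

$ ollama list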
By default, models are stored within the imported data volume as shown here:
my_data_volume/models/
├── blobs
│ ├── sha256-4fa551d4f938f68b8c1e6afa9d28befb70e3f33f75d0753248d530364aeea40f
│ ├── sha256-577073ffcc6ce95b9981eacc77d1039568639e5638e83044994560d9ef82ce1b
│ ├── sha256-8ab4849b038cf0abc5b1c9b8ee1443dca6b93a045c2272180d985126eb40bf6f
│ ├── sha256-c3dce7e8d73b1b31faa28ba7b91242c4fcc102de60eddeac1ec84fa10f86e4d4
│ └── sha256-f2296999531d6120801529a45b1d103f7370c5970be939ebfc2ba5d0833e9e1e
└── manifests
└── registry.ollama.ai
└── library
└── llama3
└── 8b-instruct-fp16
The user can specify a different directory for models using the Import Ollama models optional parameter. The list of all models supported by Ollama is available here.
Furthermore, it is possible to upload a GGUF model file from the web interface, using this experimental feature in Settings > Models:
Hint
Utilize the Update All Models button beside the Ollama base URL in Settings > Models to fetch the latest releases of the imported models.
Model Loading¶
Once the models are integrated into the system, users can select one or more models from a drop-down menu at the top of the chat interface. This functionality is designed to provide flexibility in switching between different language models based on the user's needs and the specific tasks at hand.
The loading of a specific model can be monitored by inspecting the Ollama log file via the app's terminal interface, for example:
$ tail -f /work/ollama-logs.txt
Tip
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 320.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 320.00 MiB
llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 400.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 400.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 32.02 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 3
Loading time and performance considerations¶
Loading LLMs can be resource-intensive and time-consuming. These models require significant computational power and occupy considerable VRAM when hosted on GPU-equipped compute nodes. Users can also deploy models on CPU-based nodes, but the best performance is typically achieved on GPU nodes due to faster processing capabilities.
The time required to load a model varies based on its size and complexity, and it may take a few minutes for the LLM to be fully operational.
Troubleshooting slow load times¶
In instances where the model is used for the first time and appears to be stuck during loading, the recommended approach is to edit the chat prompt and resubmit it. This action typically resolves any temporary glitches and completes the loading process.
Hint
To enhance the user experience, a model can be pre-loaded before the app interface starts by using the Select Ollama model option. If the selected model is not available in the imported repository, it is downloaded automatically. This ensures that the chat interface is ready for interaction as soon as it opens, providing a smoother and more efficient user experience.
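As a sketch, a model can also be pre-loaded manually from the app's terminal by sending a generate request without a prompt, which is the standard Ollama way to load a model into memory:

$ curl http://localhost:11434/api/generate -d '{"model": "llama3:8b-instruct-fp16"}'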
Enhanced Document Interaction¶
The app supports Retrieval Augmented Generation (RAG) with local documents and web pages, allowing for the integration of diverse content sources into chats.
Document embedding model¶
Users can select their preferred document embedding model via the corresponding optional parameter in the app settings. By default, the app utilizes intfloat/multilingual-e5-large from Hugging Face. However, any model supported by Sentence-Transformers can be configured for use.
When a specific model is selected, it is automatically downloaded if not already available and stored locally in:
my_data_volume/cache/embedding/
└── models
└── models--intfloat--multilingual-e5-large
Additionally, users have the option to choose between specialized embedding models from Ollama and OpenAI directly in the Document Settings section of the web interface:
For example, available model options include:
nomic-embed-text from Ollama.
text-embedding-3-large from OpenAI.
These selections allow users to tailor the embedding functionality to better suit their linguistic and computational needs.
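Assuming the app forwards the standard Open WebUI environment variables, the default embedding model could also be set in the Local settings file described in the Custom Configuration section, e.g.:

# Hypothetical local settings entry selecting a Sentence-Transformers model
RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2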
Important
If you change the default embedding model, RAG chat will not work with your previously embedded documents until you re-embed them.
Uploading documents¶
Document files can be uploaded via the Workspace > Documents tab on the web interface:
The user can monitor the progress of uploading and embedding documents directly from the app's standard output, e.g.:
Tip
INFO: Started server process [981]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:3000 (Press CTRL+C to quit)
Fetching 19 files: 100%|██████████| 19/19 [00:11<00:00, 1.70it/s]
Batches: 100%|██████████| 25/25 [01:44<00:00, 4.20s/it]
Once uploaded, documents are stored in the uploads folder within the imported data volume, as in this example:
my_data_volume/
└── uploads
└── my_document.pdf
Files can also be imported from UCloud. In this case, they must be copied into the docs folder:
my_data_volume/
└── docs
└── my_document.pdf
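For example, a file available in the job's /work folder can be copied into place from the app's terminal (the file name is only illustrative):

$ cp /work/my_document.pdf $DATA_DIR/docs/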
To embed documents in the vector database, users need to access Document Settings and click the scan button.
Sourcing external content¶
RAG is activated by starting the prompt with the # symbol. This displays a list of sources for inclusion in chats:
Hint
It is also possible to source web pages by typing # followed by the target URL. The app fetches and parses the URL.
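For example, a prompt of the following form asks the model to work on the fetched page (the URL is only illustrative):

# https://docs.cloud.sdu.dk Summarize the main points of this page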
Voice Input Capabilities¶
The app allows voice interactions with LLMs using built-in voice input support. The system auto-detects silence and can initiate responses after 3 seconds.
By default, speech-to-text transcription is executed by the app's built-in web API. Alternatively, it is possible to use OpenAI Whisper in the backend to perform transcription locally. In this case, the large Whisper model is downloaded and stored in the imported data volume when the app starts:
my_data_volume/cache/whisper/
└── models
└── models--Systran--faster-whisper-large-v3
A different Whisper model can be specified via the corresponding optional parameter in the app settings.
Image Generation Capabilities¶
The app integrates image generation capabilities through two backends: AUTOMATIC1111 and OpenAI DALL·E. This feature enhances the chat experience by allowing dynamic text-to-image creation.
To utilize AUTOMATIC1111, users must first activate the Enable text-to-image generation option in the app settings.
Upon activation, the app downloads the Stable Diffusion repository to the imported data folder and installs all necessary dependencies the first time the feature is used.
API activity and operational logs are maintained at /work/stable-diffusion-log.txt.
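These logs can be followed from the app's terminal in the same way as the Ollama logs:

$ tail -f /work/stable-diffusion-log.txt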
Once configured, the image generation engine is enabled by default. Users can further customize their experience by adjusting additional image configuration parameters in Settings > Images, as shown here:
This functionality significantly broadens the application’s utility, making it an ideal tool for users seeking to combine textual and visual content seamlessly.
Custom Configuration¶
Users can fine-tune the application's behavior by setting specific environment variables. These variables can be passed via a text file using the Local settings optional parameter in the app configuration.
For instance, by incorporating the following settings in the configuration file, users can ensure automatic updates for the RAG embedding and Whisper speech-to-text models at startup. Additionally, including the OpenAI API key enables the selection of advanced AI models such as OpenAI GPT-3.5 and GPT-4 Turbo:
# Custom app settings
OPENAI_API_KEY=<add_your_openai_api_key>
RAG_EMBEDDING_MODEL_AUTO_UPDATE=True
WHISPER_MODEL_AUTO_UPDATE=True
These configurations facilitate a more responsive and customized user experience, allowing for the integration of cutting-edge models and updates as they become available.
API Integration¶
The app provides seamless integration with OpenAI-compatible APIs, allowing for flexible and rich conversations alongside Ollama models, enhancing the overall user experience.
Example usage¶
To use the Ollama API, begin by accessing the app’s terminal interface. Here’s how to execute a command that interacts with the API:
curl http://localhost:11434/api/generate -d '{
"model": "llama2:7b-chat",
"prompt": "Write the first 10 Fibonacci numbers",
"stream": false,
"options": {
"seed": 101,
"temperature": 0
}
}' | jq -r '.response'
Tip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 992 100 810 0 182 3 0 0:04:30 0:04:21 0:00:09 256
Certainly! The first 10 Fibonacci numbers are:
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
This command sends a request to generate a response based on the specified prompt, utilizing a previously downloaded model. It demonstrates setting parameters such as seed for reproducibility and temperature to control the level of creativity in the response.
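The Ollama API also provides a chat endpoint for multi-turn conversations. A minimal sketch, assuming the same model is already available locally:

curl http://localhost:11434/api/chat -d '{
  "model": "llama2:7b-chat",
  "messages": [
    {"role": "user", "content": "Write the first 10 Fibonacci numbers"}
  ],
  "stream": false
}' | jq -r '.message.content'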
Remote API probing¶
For remote interactions, ensure the app is deployed with a publicly accessible link and has the frontend disabled, using the Disable UI optional parameter. This setup is particularly beneficial for server-side operations or integration with other systems in a headless environment, where the graphical user interface is not required.
When making API calls from a remote location, update the endpoint URL in your requests to reflect the actual deployment address. For instance, replace:
http://localhost:11434/api/generate
with your custom remote address, such as:
https://app-custom_link.cloud.sdu.dk/api/generate
Here, custom_link should be replaced with the specific URL name associated with your app's deployment, ensuring secure and accurate API interactions in remote setups.
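For example, the earlier generate request could be probed remotely as follows, with custom_link acting as a placeholder for your deployment:

curl https://app-custom_link.cloud.sdu.dk/api/generate -d '{
  "model": "llama2:7b-chat",
  "prompt": "Write the first 10 Fibonacci numbers",
  "stream": false
}' | jq -r '.response'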