Text Generation

Text Generation is an advanced Gradio web application designed for interfacing with Large Language Models (LLMs). It is engineered to be compatible with a diverse array of model backends, including notable projects such as Transformers, llama.cpp, and ExLlamaV2.

The application's source code and associated licensing terms are accessible through this GitHub repository.

Initialization

For information on how to use the Initialization parameter, please refer to the Initialization: Bash Script, Initialization: Conda Packages, and Initialization: PyPI Packages sections of the documentation.

Data Persistence

The application requires the Data volume parameter to mount an input folder, which is essential for data persistence. If this folder is initially empty, the application will automatically generate the following directory structure:

my_data_volume/
├── characters
│   └── Assistant.yaml
├── logs
│   └── chat
├── loras
│   └── place-your-loras-here.txt
├── models
│   ├── config.yaml
│   └── place-your-models-here.txt
├── presets
│   ├── Big O.yaml
│   ├── Contrastive Search.yaml
│   ├── Debug-deterministic.yaml
│   ├── Divine Intellect.yaml
│   ├── LLaMA-Precise.yaml
│   ├── Midnight Enigma.yaml
│   ├── Null preset.yaml
│   ├── Shortwave.yaml
│   ├── Yara.yaml
│   └── simple-1.yaml
├── prompts
│   ├── Alpaca-with-Input.txt
│   └── QA.txt
└── training
    ├── datasets
    │   └── put-trainer-datasets-here.txt
    └── formats
        ├── alpaca-chatbot-format.json
        ├── alpaca-format.json
        ├── llama2-chat-format.json
        └── vicuna-format.json

This configuration will be consistently maintained upon subsequent imports of the same data volume into the application. For convenience, the path to the imported data volume is stored in the DATAPATH environment variable.
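For a quick sanity check, the imported volume can be inspected from the app's terminal interface using the DATAPATH variable; a minimal sketch, assuming the default structure shown above:

$ echo $DATAPATH          # prints the path of the imported data volume
$ ls $DATAPATH/models     # downloaded model files go here
$ ls $DATAPATH/loras      # LoRA adapters go here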

Note

If the imported folder is not empty and its structure deviates from the format specified above, the application will terminate with an error.

Authentication

By default, the web application is configured to operate without the necessity for login credentials. However, for users seeking to implement authentication, this can be accomplished by utilizing the --gradio-auth parameter. The format for this parameter is as follows:

--gradio-auth username:password

This parameter enables the specification of login credentials to secure access to the application.

Additionally, login credentials can be set via an input file using the --gradio-auth-path parameter. This file should contain one or more username:password pairs, one pair per line. Using an input file for authentication is particularly useful for managing access for multiple users.
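As an illustration, such a credentials file (the usernames and passwords below are placeholders) could look like this:

alice:alice_secret
bob:bob_secret

and would then be passed via --gradio-auth-path followed by the path to the file.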

Note

It is advisable to add login credentials if the application is to be shared via a custom public link, enhancing security and restricting access to authorized users only.

Configuration Options

Numerous settings can be configured before starting the application via the optional parameters available on the job submission page.

Alternatively, these settings can be specified directly within the web application through its various tabs:

  1. Chat tab: This section is dedicated to interactive chat functionalities, empowering users to simulate conversations with AI models. It encompasses features for managing dialogue flow, engaging in character role-play, and customizing responses to create immersive experiences.

  2. Parameters tab: Here, users have granular control over AI response generation by adjusting a range of parameters. They can fine-tune the AI's creativity, coherence, and context sensitivity to align with their preferences or specific use cases.

  3. Model tab: This tab serves as a centralized hub for managing different LLMs and applying LoRAs to a loaded model. Users can explore model options, load their preferred models, and gain insights into specific model capabilities, streamlining the process of selecting the most suitable model for their needs.

  4. Training tab: Designed to guide users through the process of training LLMs on custom datasets, this tab covers everything from data preparation to the actual training procedure. Its objective is to facilitate personalized AI interactions tailored to individual requirements.

  5. Session tab: This section focuses on preserving and managing session states, enabling users to maintain continuity in their interactions with LLMs over extended periods. It ensures seamless continuity and context retention throughout multiple interactions.

Model Retrieval

Model artifacts are systematically organized within the file structure of the project. Specifically, models are stored under the $DATAPATH/models directory, whereas LoRA parameters are segregated into the $DATAPATH/loras folder. This structured storage facilitates streamlined access and management of model resources.

For the retrieval of individual or multiple model files, the huggingface-cli tool can be used instead of the web interface. The command syntax for downloading a specific model file from a repository such as TheBloke/Llama-2-7b-Chat-GGUF into the $DATAPATH/models/ directory is illustrated as follows:

$ huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q5_K_M.gguf --local-dir $DATAPATH/models/ --local-dir-use-symlinks False

Furthermore, the utility supports the downloading of multiple files through pattern matching, thereby enhancing operational efficiency for managing model resources, e.g.:

$ huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF --local-dir $DATAPATH/models/ --local-dir-use-symlinks False --include='*Q4_K*gguf'

Similarly, a Hugging Face access token (HF_TOKEN) can be passed on the command line:

$ huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir $DATAPATH/models/Meta-Llama-3-8B --local-dir-use-symlinks False --token $HF_TOKEN

For more documentation on downloading with huggingface-cli check here.

To select the model, go to the "Model tab" in the web interface and refresh the list of models by clicking on the 🔄 symbol.

Note

A model that was downloaded previously can also be automatically loaded upon job submission by using the basic --model optional parameter.
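For example, the GGUF file downloaded above could be loaded automatically at start-up by setting (a sketch; adjust the filename to the model actually present in $DATAPATH/models):

--model llama-2-7b-chat.Q5_K_M.gguf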

Model Precision

High-precision models, while delivering remarkable accuracy, are known for their substantial video RAM (VRAM) consumption. To mitigate this, users can leverage precision-reduction techniques during the model loading phase. The --load-in-4bit and --use_double_quant options, facilitated by the bitsandbytes library, enable the loading of models in 4-bit precision. This approach significantly reduces VRAM requirements, thereby enhancing the efficiency of resource utilization.
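As an example, a Transformers-format model such as Meta-Llama-3-8B (downloaded as shown above) could be loaded in 4-bit precision by combining the optional parameters (a sketch; the exact combination depends on the chosen loader):

--model Meta-Llama-3-8B --load-in-4bit --use_double_quant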

Chat Instructions

Upon loading a model in the "Model tab" of the web interface, an automatic detection mechanism attempts to identify the instruction template associated with the model, if available. This process, reliant on a predefined set of regular expressions within the $DATAPATH/models/config.yaml file, updates the "Parameters tab" > "Instruction template" section accordingly. However, due to the inherent limitations of this approach, the accuracy of automatic detection is not guaranteed. Users are advised to consult the model card on Hugging Face for verification of the correct prompt format, ensuring alignment with the intended use case.
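For illustration, an entry in that file maps a regular expression on the model name to an instruction template; a minimal sketch (key names may vary between application versions):

.*llama-2.*chat:
  instruction_template: 'Llama-v2'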

For example, after loading the llama-2-7b-chat.Q5_K_M.gguf model, the instruct or chat-instruct modes should be used in the "Chat tab".

API Integration

The project's API is engineered to integrate seamlessly with the OpenAI API, mirroring its endpoints for chat and completions. This integration expands the application's functionalities by making the loaded models and their features accessible programmatically, complementing the web interface.

As an illustrative example, open the app terminal interface and download the following LLM file using huggingface-cli:

$ huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir $DATAPATH/models/ --local-dir-use-symlinks False

After the download is completed, load the LLM in the web interface.

To interact with the LLM via the integrated API for generating completions, try the command:

curl -s -X POST http://localhost:8080/api/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
            "prompt": "Write the first 10 Fibonacci numbers:\n\n",
            "max_tokens": 250,
            "temperature": 1,
            "top_p": 0.9,
            "seed": 10
        }' | jq -r '.choices[0].text'

Tip

Output:
The first 10 Fibonacci numbers are: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34.

Here's how they're calculated:
- The first number in the sequence is 0.
- The second number is 1.
- Each subsequent number is the sum of the previous two numbers.

So, 0 + 1 = 1, 1 + 1 = 2, 1 + 2 = 3, 2 + 3 = 5, 3 + 5 = 8, 5 + 8 = 13, 8 + 13 = 21, and 13 + 21 = 34.

This command sends a simple completion request and extracts the generated text from the JSON response using jq.

Hint

If the app is deployed with a public link, the LLM can be probed from a remote Linux server using a similar command. However, you need to replace

http://localhost:8080/api/v1/completions

with

https://app-custom_link.cloud.sdu.dk/api/v1/completions

In this context, custom_link should be substituted with the actual URL name associated with your running app instance. Optionally, the user can disable the web UI with the --nowebui option and simply use the API to interact with the LLM. In this case, the model must be selected in advance using the --model basic parameter.
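As a complementary sketch, the OpenAI-style chat completions endpoint can be queried in the same way (the endpoint path below assumes it is exposed under the same /api prefix as the completions endpoint):

curl -s -X POST http://localhost:8080/api/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
            "messages": [
                {"role": "user", "content": "List the first 10 Fibonacci numbers."}
            ],
            "max_tokens": 250,
            "temperature": 1
        }' | jq -r '.choices[0].message.content'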

For additional examples involving chat completion endpoints check here.