Python Bindings

If you haven't already, you should first have a look at the docs of the Python bindings (aka GPT4All Python SDK). There is also API documentation, which is built from the docstrings of the gpt4all module.

Note: The docs suggest using venv or conda, although conda might not work in all configurations. Use a regular virtual environment if you run into trouble.

For a more complete program outside of the examples, have a look at the CLI (Command-Line Interface).

Understanding the Bindings

The bindings are based on the same underlying code (the "backend") as the GPT4All chat application. However, not all functionality of the latter is implemented in the backend.

Notably regarding LocalDocs: While you can create embeddings with the bindings, the rest of the LocalDocs machinery is solely part of the chat application.

Where it matters, namely in generating LLM responses, they are based on the same backend code: an adapted version of llama.cpp. This also means that the bindings interface with native code libraries; they are not written purely in Python. These libraries are included in the PyPI distribution and have some system requirements. In addition, you need to have Python 3.8 or higher installed.

You can install the bindings from PyPI with something like pip3 install --user gpt4all, but the recommendation is to use some kind of virtual environment.

Tip

If you have an Nvidia graphics card but don't have the CUDA libraries already on your system, consider specifying the package as gpt4all[cuda] instead. This way, it will pull in CUDA packages from PyPI as dependencies.

The main interaction points are the class gpt4all.GPT4All, its generate() method and the chat_session() context manager.

Basic Examples

The simplest way to interact is as follows, either without or with a session:

from gpt4all import GPT4All
model = GPT4All('Phi-3-mini-4k-instruct.Q4_0.gguf')
print(model.generate("What is the airspeed velocity of an unladen swallow?"))
from gpt4all import GPT4All
model = GPT4All('Phi-3-mini-4k-instruct.Q4_0.gguf')
with model.chat_session():
    print(model.generate("How do you know she is a witch?"))
    ...

The main difference is that the latter allows you to ask follow-up questions as long as you're within the session, while the former always produces one-off exchanges without memory. The session also uses templates. See the documentation for a full explanation.
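
For instance, within a session the model keeps track of previous exchanges, so a follow-up can refer back to them. A minimal sketch (the prompts are only illustrative):

from gpt4all import GPT4All
model = GPT4All('Phi-3-mini-4k-instruct.Q4_0.gguf')
with model.chat_session():
    print(model.generate("Name a famous bridge in San Francisco."))
    # within the session, the model remembers the previous exchange, so "it" can be resolved
    print(model.generate("When was it built?"))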

Options & Parameters

Streaming Generations

To process GPT4All responses while the model is still generating, use the streaming=True parameter during generation. This changes the return type from a string to a Python generator iterator.

Streaming Example

Note the for-loops:

from gpt4all import GPT4All
model = GPT4All('Phi-3-mini-4k-instruct.Q4_0.gguf')
tokens = []
with model.chat_session():
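    # collect the tokens in a list: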
    for token in model.generate("What is the capital of France?", streaming=True):
        tokens.append(token)
    print(tokens)
    # output to console:
    for token in model.generate("What is the highest mountain in South America?", streaming=True):
        print(token, end='', flush=True)
    print()

This may produce output similar to:

[' The', ' capital', ' of', ' France', ' is', ' Paris', '.']
The highest mountain in South America is Aconcagua, which is also considered the highest peak outside of Asia.

Temperature, Top-p and Top-K

The three most influential parameters in generation are Temperature (temp), Top-p (top_p) and Top-K (top_k). In a nutshell: during the selection of the next token, not just one or a few candidates are considered; instead, every single token in the vocabulary is assigned a probability. These parameters change the field of candidate tokens.

  • Temperature makes the process either more or less random. A Temperature above 1 increasingly "levels the playing field", while at a Temperature between 0 and 1 the likelihood of the best token candidates grows even more. A Temperature of 0 results in selecting the best token, making the output deterministic. A Temperature of 1 represents a neutral setting with regard to randomness in the process.

  • Top-p and Top-K both narrow the field:

    • Top-K limits candidate tokens to a fixed number after sorting by probability. Setting it higher than the vocabulary size deactivates this limit.

    • Top-p selects tokens based on their total probabilities. For example, a value of 0.8 means "include the best tokens, whose accumulated probabilities reach or just surpass 80%". Setting Top-p to 1, which is 100%, effectively disables it.

The recommendation is to keep at least one of Top-K and Top-p active. Other parameters can also influence generation, so be sure to have a look at all of their descriptions.
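
As a sketch, these parameters can be passed directly to generate(); the prompt and values below are only illustrative:

from gpt4all import GPT4All
model = GPT4All('Phi-3-mini-4k-instruct.Q4_0.gguf')
with model.chat_session():
    # temp=0 always selects the most likely token, so the output is deterministic
    print(model.generate("Name three primary colors.", temp=0))
    # a higher temperature combined with wider Top-K/Top-p broadens the field of candidates
    print(model.generate("Name three primary colors.", temp=1.2, top_k=60, top_p=0.9))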

Without Online Connectivity (allow_download Parameter)

To prevent GPT4All from accessing online resources, instantiate it with allow_download=False. In that case, there is no predefined session system prompt, and you must specify the prompt template yourself.

You also need to make sure a model is already available and the model path is set correctly.

from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf', allow_download=False)

system_prompt = '### System:\nYou are Arthur Dent, a mostly harmless guy from planet Earth.\n\n'
prompt_template = '### User:\n{0}\n\n### Response:\n'

with model.chat_session(system_prompt=system_prompt, prompt_template=prompt_template):
    for token in model.generate("What is a good way to get around the galaxy?", streaming=True):
        print(token)
        ...

Retrieving and Inspecting a System Prompt and Prompt Template

You can retrieve a model's default system prompt and prompt template with an online instance of GPT4All; when not specified otherwise, the allow_download parameter defaults to True:

from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
print(repr(model.config['systemPrompt']))
print(repr(model.config['promptTemplate']))

Output:

'### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n'
'### User:\n{0}\n\n### Response:\n'

Then you can pass these explicitly when creating an offline instance later.
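
For example, a sketch of that workflow, reusing the values retrieved above (the model file must already be present in the model folder):

from gpt4all import GPT4All

# values as retrieved above for this model
system_prompt = '### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n'
prompt_template = '### User:\n{0}\n\n### Response:\n'

# offline instance
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf', allow_download=False)
with model.chat_session(system_prompt=system_prompt, prompt_template=prompt_template):
    print(model.generate("Who discovered gravity?"))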

Additional Examples & Explanations

Specifying the Model Folder

The model folder can be set with the model_path parameter when creating a GPT4All instance. The example below is the same as if the parameter weren't provided; that is, ~/.cache/gpt4all/ is the default folder.

from pathlib import Path
from gpt4all import GPT4All
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf', model_path=Path.home() / '.cache' / 'gpt4all')

If you want to point it at the chat application's default folder, try one of the following instead:

macOS:

from pathlib import Path
from gpt4all import GPT4All

model_name = 'Phi-3-mini-4k-instruct.Q4_0.gguf'
model_path = Path.home() / 'Library' / 'Application Support' / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)

Windows:

import os
from pathlib import Path
from gpt4all import GPT4All

model_name = 'Phi-3-mini-4k-instruct.Q4_0.gguf'
model_path = Path(os.environ['LOCALAPPDATA']) / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)

Linux:

from pathlib import Path
from gpt4all import GPT4All

model_name = 'Phi-3-mini-4k-instruct.Q4_0.gguf'
model_path = Path.home() / '.local' / 'share' / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)

Alternatively, you can change the module's default model directory:

from pathlib import Path
from gpt4all import GPT4All, gpt4all
gpt4all.DEFAULT_MODEL_DIRECTORY = Path.home() / 'my' / 'models-directory'
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')

Managing Templates

When using a chat_session(), you may customize the system prompt, and set the prompt template if necessary. They take precedence over the defaults provided when allow_download=True.

Custom Session Templates Example

from gpt4all import GPT4All
model = GPT4All('wizardlm-13b-v1.2.Q4_0.gguf')
system_template = 'A chat between a curious user and an artificial intelligence assistant.\n'
# many models use triple hash '###' for keywords, Vicunas are simpler:
prompt_template = 'USER: {0}\nASSISTANT: '
with model.chat_session(system_template, prompt_template):
    response1 = model.generate('why is the grass green?')
    print(response1)
    print()
    response2 = model.generate('why is the sky blue?')
    print(response2)

The responses could look similar to:

The color of grass can be attributed to its chlorophyll content, which allows it to absorb light energy from sunlight through photosynthesis. Chlorophyll absorbs blue and red wavelengths of light while reflecting other colors such as yellow and green. This is why the leaves appear green to our eyes.

The color of the sky appears blue due to a phenomenon called Rayleigh scattering, which occurs when sunlight enters Earth's atmosphere and interacts with air molecules such as nitrogen and oxygen. Blue light has shorter wavelength than other colors in the visible spectrum, so it is scattered more easily by these particles, making the sky appear blue to our eyes.

Interrupting Generation

The simplest way to stop generation is to set a fixed upper limit with the max_tokens parameter.
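
For example, a minimal sketch that caps the response length (the limit counts tokens, not characters, so the cutoff point is approximate and may fall mid-sentence):

from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
with model.chat_session():
    # generation stops after at most 50 tokens
    print(model.generate("Tell me about blue whales.", max_tokens=50))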

Additionally, if you know exactly when a model should stop responding, you can add a custom callback to generate(), which lets you react to individual tokens.

Example Custom Stop Callback

The following will stop when it encounters the first period, so typically after one sentence:

from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')

def stop_on_token_callback(token_id, token_string):
    if '.' in token_string:  # one sentence is enough; note the 'in', tokens can include whitespace
        return False
    else:
        return True

response = model.generate("Blue Whales are the biggest animal to ever inhabit the Earth.",
                          callback=stop_on_token_callback, temp=0)
print(response)

Response:

 They can grow up to 100 feet long and weigh over 20 tons.