[Image: Ancient Greeks consulting a Pythia computer]

May 31, 2023

OpenAssistant fine-tuned Pythia-12B: open-source ChatGPT alternative

Written By:

Steve Barlow


One of the most exciting open-source chatbot alternatives to ChatGPT, OpenAssistant's OASST1 fine-tuned Pythia-12B, is now available to run for free on Paperspace, powered by Graphcore IPUs. This truly open-source model can be used commercially without restrictions.

oasst-sft-4-pythia-12b is a variant of EleutherAI’s Pythia model family, fine-tuned using the Open Assistant Conversations (OASST1) dataset, a crowdsourced “human-generated, human-annotated assistant-style conversation corpus”.
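The checkpoint is published on the Hugging Face Hub as OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5, so it can also be loaded with the standard transformers API outside of the IPU notebook. Below is a minimal sketch, assuming the <|prompter|>/<|assistant|> prompt convention described on the model card (verify against the card before relying on it), and noting that a 12-billion-parameter model needs tens of gigabytes of host memory to load this way:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # 12B parameters: expect tens of GB

# The OpenAssistant SFT models are prompted with special role tokens
prompt = "<|prompter|>What is the Pythia model family?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))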

The OASST1 dataset consists of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. 
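If you want to inspect the corpus itself, OASST1 is also published on the Hugging Face Hub and can be browsed with the datasets library, independently of the IPU notebook. A minimal sketch follows; the lang, role and text field names reflect the dataset card at the time of writing, so verify them against the current schema:

from datasets import load_dataset

# Download the OASST1 message rows (train and validation splits)
ds = load_dataset("OpenAssistant/oasst1")
print(ds)

# Each row is a single message within a conversation tree
msg = ds["train"][0]
print(msg["lang"], msg["role"], msg["text"][:100])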

Running OASST1 Fine-tuned Pythia-12B inference on Paperspace 

OpenAssistant's fine-tuned Pythia can easily be run on Graphcore IPUs using a Paperspace Gradient notebook. New users can try out Pythia on an IPU-POD4 with Paperspace's six-hour free trial. For higher performance, you can scale up to an IPU-POD16.

The notebook guides you through creating and configuring an inference pipeline and running the pipeline to build a turn-by-turn conversation. 

Because the OpenAssistant model uses the same underlying Pythia-12B model as Dolly, we run it using the Dolly pipeline. 

Let's begin by loading the inference config. We use the same configuration file as Dolly and manually modify the vocabulary size, which is the only difference between the two model graphs. A configuration suitable for your instance will be selected automatically.


from utils.setup import dolly_config_setup

config_name = "dolly_pod4" if number_of_ipus == 4 else "dolly_pod16"
config, *_ = dolly_config_setup("config/inference.yml", "release", config_name)

# Update vocab size for oasst-sft-4-pythia-12b-epoch-3.5 - 50288 rather than 50280
config.model.embedding.vocab_size = 50288
config

Next, we want to create our inference pipeline. Here we define the maximum sequence length and maximum micro batch size. Before a model can be executed on IPUs, it must be compiled into an executable format; this happens when the pipeline is created. All input shapes must be known at compile time, so if the maximum sequence length or micro batch size is changed later, the pipeline will need to be recompiled.

Selecting a longer sequence length or larger batch size will use more IPU memory. This means that increasing one may require you to decrease the other.

This cell will take approximately 18 minutes to complete, which includes downloading the model weights.
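For reference, here is a sketch of roughly what that cell contains, based on the Dolly pipeline this notebook reuses. The api.DollyPipeline name and its keyword arguments are assumptions about the notebook's helper module rather than a documented public API, so treat this as illustrative:

import api  # helper module shipped alongside the notebook (assumed name)

sequence_length = 512   # maximum tokens per prompt plus generated reply
micro_batch_size = 1    # prompts processed in parallel per step

# Constructing the pipeline triggers compilation for the IPU. Input shapes are
# fixed at this point, so changing sequence_length or micro_batch_size later
# means the model must be recompiled.
pipeline = api.DollyPipeline(
    config,
    sequence_length=sequence_length,
    micro_batch_size=micro_batch_size,
    hf_dolly_checkpoint="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
)

# A single conversational turn might then look like:
answer = pipeline("What is the OASST1 dataset?")

The exact call signature, including any sampling controls, is defined in the notebook's helper code, so check there before adapting this sketch.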