Inference

In this section, we will walk you through the process of obtaining predictions from your deployed machine learning models. The Open Innovation Platform provides a variety of methods for performing inference, depending on the model type (LLM or classical ML model), allowing you to choose the approach that best suits your needs.

1. LLM Inference

The Open Innovation Platform provides different types of LLM inference depending on the model type. Currently, you can deploy and run inference on these types of LLMs:
1. Text Generation
2. Sequence Classification

1.1 Text Generation

Text generation inference is available for models deployed with TGI or VLLM. There are two types of inference:

1.1.1 Chat

In chat inference, the user provides the chat history as a list of dictionaries in this format:

{
    "messages": [
        {"role": "system", "content": "Be friendly"},
        {"role": "user", "content": "What's the capital of UAE?"},
        {"role": "assistant", "content": ""}
    ]
}
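
As a minimal sketch, such a chat request could be sent to the deployed model's REST endpoint from Python; the endpoint URL and authorization header below are placeholders, not the platform's actual values:

import requests

# Placeholder values -- replace with your deployment's actual endpoint and credentials.
ENDPOINT = "https://<your-oi-deployment>/chat"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

payload = {
    "messages": [
        {"role": "system", "content": "Be friendly"},
        {"role": "user", "content": "What's the capital of UAE?"},
        {"role": "assistant", "content": ""},
    ]
}

# Send the chat history and print the generated response.
response = requests.post(ENDPOINT, json=payload, headers=HEADERS, timeout=60)
print(response.json())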

1.1.2 Completion

In completion inference, the user provides the messages as a single string:

{
  "messages": "once upon a time, "
}

Note: in the background, when the user provides the chat history as a list of dictionaries, the OI Platform converts it to a single string using the chat template of the model, but when the user provides a single string in "messages", the OI Platform passes the string to the LLM without any formatting. This means that if you want to send a string already formatted with the chat template directly to the LLM, completion inference is the right choice.
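For example, assuming a LLAMA 2 family model, a prompt that has already been formatted with the chat template could be sent through completion inference like this (the template below is only an illustration; use the one that matches your model family):

{
  "messages": "<s>[INST] <<SYS>>\nBe friendly\n<</SYS>>\n\nWhat's the capital of UAE? [/INST]"
}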

1.1.3 Inference Parameters

Besides the messages the user provides to the LLM, the user can control the LLM's response by providing additional parameters. The available parameters depend on the LLM inference server (VLLM, TGI).
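For illustration, a completion request with common sampling parameters might look like the payload below; the exact parameter names (for example temperature, top_p, and the maximum-tokens setting) and where they go in the request differ between TGI and VLLM, so check the documentation of the server your model is deployed with:

{
  "messages": "once upon a time, ",
  "temperature": 0.7,
  "top_p": 0.9,
  "max_new_tokens": 128
}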

1.1.4 Chat Templates

To convert the chat history into a formatted inference input, the OI Platform uses a chat template compatible with the model family. By default, the OI Platform supports these families:
- LLAMA 2
- LLAMA 3
- Falcon
- Yi
- Mistral
- Aya-23

So, if your model family is one of these, you don't need to add a custom chat template in the model version information and you can use chat inference directly.

Note: if your model family is not supported in the OI Platform, you have to add a chat template in the model version; otherwise, you can't use chat inference, but you can still use completion inference.

1.2 Sequence Classification Inference

For models deployed with OI_SERVE whose model type is sequence_classification, the user can use sequence classification inference.
In this type of inference, the user provides a single string and the LLM returns a score for each class.
Input:

{
  "text": "The product is good!"
}

Output:

{
  "positive": 0.99,
  "neutral": 0.01,
  "negative": 0
}

2. Classical ML Inference

2.1 Input Format

When deploying Tracked experiments, the platform supports various input formats. You can choose the format that suits your data best and provide the necessary inputs.

Note: log the model signature to ensure the model input is parsed correctly.

TrackingClient.mlflow.log_model(my_model, "model", signature=my_model_signature)
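
If you do not want to build the signature by hand, MLflow can infer it from sample data; my_model and X_sample below are placeholders for your own model and input data:

from mlflow.models import infer_signature

# Infer the signature from sample input data and the model's corresponding output.
# X_sample and my_model are placeholders for your own objects.
my_model_signature = infer_signature(X_sample, my_model.predict(X_sample))

TrackingClient.mlflow.log_model(my_model, "model", signature=my_model_signature)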

2.1.1 Tensor input (serializing a numpy array)

If the model input is a numpy array, then the REST endpoint expects data as a JSON list.

For instance, if the model input has shape (-1, 3, 2) (batch input), then a well-formatted JSON payload is:

[
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 0], [1, 2]]
]
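
As a sketch, a numpy array can be serialized into this JSON list and posted to the endpoint from Python; the endpoint URL below is a placeholder, not the platform's actual value:

import numpy as np
import requests

# Placeholder URL -- replace with your deployment's actual endpoint.
ENDPOINT = "https://<your-oi-deployment>/invocations"

# A batch of two samples with shape (2, 3, 2).
batch = np.array([
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 0], [1, 2]],
])

# tolist() turns the array into nested Python lists, which requests serializes as JSON.
response = requests.post(ENDPOINT, json=batch.tolist(), timeout=60)
print(response.json())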

2.1.2 Named parameters (serializing a Pandas dataframe)

If the model input is a list of named records, then the REST endpoint expects data as a list of JSON dictionaries.

For instance:

[
    {"age": 18, "weight": 65},
    {"age": 47, "weight": 73}
]
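
Similarly, a Pandas dataframe can be converted to this list of records before posting it; again, the endpoint URL is a placeholder:

import pandas as pd
import requests

# Placeholder URL -- replace with your deployment's actual endpoint.
ENDPOINT = "https://<your-oi-deployment>/invocations"

df = pd.DataFrame({"age": [18, 47], "weight": [65, 73]})

# to_dict(orient="records") produces one JSON dictionary per row.
response = requests.post(ENDPOINT, json=df.to_dict(orient="records"), timeout=60)
print(response.json())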

2.2. Output Format

The REST endpoint returns a JSON dictionary with a single "predictions" key containing the list of predictions, one for each input.

The format of the single prediction depends on the actual model.

If the model returns a Python list or a numpy array, the prediction is a JSON list:

{
    "predictions": [
        [
            -3.644273519515991,
            -4.824134826660156,
            -3.8084142208099365,
            -5.363550662994385
        ],
        [
            -4.997870922088623,
            -4.3103718757629395,
            -0.13021154701709747,
            -3.2400429248809814
        ]
    ]
}

If the model returns a Pandas dataframe, the prediction is a JSON dictionary:

{
    "predictions": [
        {"sentiment": "POSITIVE", "score": 0.976},
        {"sentiment": "NEUTRAL", "score": 0.7345},
    ]
}
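
In both cases, the individual predictions can be read from the "predictions" key of the response; continuing the request sketches above:

# response is the requests.Response object returned by the POST call.
for prediction in response.json()["predictions"]:
    print(prediction)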