
Inference Gateway Guide

Overview

The Inference Gateway feature allows you to create and manage inference channels for your AI/ML models through REST APIs. It provides secure token-based authentication and supports various model types including LLMs, computer vision, speech processing, and classical ML models. The gateway simplifies model deployment by handling the complexities of serving while allowing you to focus on building applications.

Setting Up an Inference Gateway

Step 1: Access Model Settings

  1. Navigate to your registered model in the workspace
  2. Go to the "Settings" tab in the model version page
  3. Look for the "Access tokens" section


Step 2: Generate Access Token

  1. Click the "+ Add new token" button
  2. Provide a name for your token (e.g., "new-token")
  3. The system will generate a new access token
  4. Important: Make sure to save the token immediately - you won't be able to access it again


Step 3: Configure Inference Settings

The inference gateway provides built-in support for several AI tasks. When you select your deployment type, the system automatically generates the appropriate inference code with optimized payload structures. Supported tasks include text generation and chat, sequence classification, speech recognition, text-to-speech conversion, image generation, translation, reranking and embedding generation.

The example below shows how to use the API access key and generate a Python code snippet for calling a Large Language Model deployment.

Step 4: Making API Calls

To make inference calls, you'll need:

  1. Your API token
  2. The model version ID
  3. The appropriate endpoint for your use case

Detailed examples of API calls for each AI task are given in the sections that follow.
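
For orientation, here is a minimal sketch of how these three pieces combine into a single request. The endpoint path mirrors the text generation example below, and the placeholder values are yours to fill in.

import requests

api_key = "<api_key>"                    # the access token generated in Step 2
model_version_id = "<model_version_id>"  # shown on the model version page
base_url = "<base_url>"                  # e.g. https://inference.develop.openinnovation.ai

# every call goes through the model's proxy endpoint and carries the token
headers = {"Authorization": f"Bearer {api_key}"}
url = f"{base_url}/models/{model_version_id}/proxy/v1/chat/completions"

response = requests.post(url, headers=headers, json={"inputs": "Hello"})
print(response.status_code)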

Advanced Configuration

Temperature and Top-p Settings

  • Temperature (default: 0.7): Controls randomness in the output. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
  • Top-p (default: 0.9): An alternative to sampling with temperature, also known as nucleus sampling. The model considers only the tokens whose cumulative probability mass falls within top_p. Both parameters can be set per request, as in the sketch below.
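
As an example, both parameters are passed directly in the request payload (a minimal sketch; the field names follow the chat completion examples later in this guide, where model_name is retrieved from the /models endpoint):

payload = {
    "model": model_name,  # as retrieved from the /models endpoint in the examples below
    "messages": [{"role": "user", "content": "Summarize this paragraph."}],
    "temperature": 0.2,   # lower -> more focused and deterministic
    "top_p": 0.9,         # nucleus sampling: keep only the top 90% probability mass
    "max_tokens": 200,
}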

Request Timeout

  • Default timeout: 120 seconds
  • A custom timeout can be set per request using the OICM-Request-Timeout header:
headers = {
    "Authorization": f"Bearer {api_key}",
    "OICM-Request-Timeout": "300"  # 5 minutes
}

Token Management

  • Tokens can be created with different expiration periods
  • Active tokens can be viewed and managed in the Settings page
  • Tokens can be revoked at any time using the delete action
  • The platform supports multiple active tokens for different applications or use cases

Inference Payload Examples

2.1 Text Generation

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/chat/completions

Usage example

import requests
import json

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"  # example: https://inference.develop.openinnovation.ai

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/chat/completions"
)

headers = {"Authorization": f"Bearer {api_key}", "accept": "text/event-stream"}

payload = {
    "inputs": "What is Deep Learning?",
    "max_new_tokens": 200,
    "do_sample": False,
    "stream": True,
}
chat_response = requests.post(
    f"{inference_url}", headers=headers, json=payload, stream=True
)

for token in chat_response.iter_lines():
    try:
        string_data = token.decode('utf-8')
        string_data = string_data[6:]  # strip the leading "data: " SSE prefix
        json_data = json.loads(string_data)
        content = json_data['choices'][0]['delta']['content']
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        # skip keep-alive lines, the final "[DONE]" marker, and chunks without content
        pass

Chat Template

You can provide a custom chat template in the payload. If provided, the OI platform uses this template to format the chat history before sending it to the LLM.

Note: the chat template must be a valid Jinja template that uses a single variable, messages.

Example
{
    "messages": [
        {"role": "system", "content": "Be friendly"},
        {"role": "user", "content": "What's the capital of UAE?"},
        {"role": "assistant", "content": ""}
    ],
    "chat_template": '''
        {% if messages[0]['role'] == 'system' %}
            {% set loop_messages = messages[1:] %}
            {% set system_message = messages[0]['content'] %}
        {% else %}
            {% set loop_messages = messages %}
            {% set system_message = '' %}
        {% endif %}

        {% for message in loop_messages %}
            {% if loop.index0 == 0 %}
                {{ system_message.strip() }}
            {% endif %}
            {{ '\n\n' + message['role'] + ': ' + message['content'].strip().replace('\r\n', '\n').replace('\n\n', '\n') }}

            {% if loop.last and message['role'] == 'user' %}
                {{ '\n\nAssistant: ' }}
            {% endif %}
        {% endfor %}
    '''
}
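
To use the template from Python, include it in the body of the same chat completions call shown in 2.1 (a sketch assuming the Jinja string above is stored in a chat_template variable, and reusing inference_url and headers from the earlier example):

payload = {
    "messages": [
        {"role": "system", "content": "Be friendly"},
        {"role": "user", "content": "What's the capital of UAE?"},
        {"role": "assistant", "content": ""},
    ],
    "chat_template": chat_template,  # the Jinja string from the example above
    "max_new_tokens": 200,
}
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())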

2.2 Text completion - vLLM

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1

Usage example

import requests
import json

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = f"{base_url}/models/{model_version_id}/proxy/v1"


headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}

endpoints = {
    "models": "models",
    "chat_completion": "chat/completions"
}

model_info = requests.get(f"{inference_url}/{endpoints['models']}", headers=headers).json()
model_name = model_info["data"][0]["id"]


payload = {
    "model": model_name,
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "describe gravity to a 6-year-old child in 50 words"
            # "content": "tell me a long story"
        }
    ],
    "temperature": 0.9,
    "top_p": 0.7,
    "max_tokens": 1000,
    "stream": True
}

chat_response = requests.post(f"{url}/{endpoints['chat_completion']}", headers=headers, json=payload, stream=True)

for token in chat_response.iter_lines():
    try:
        string_data = token.decode('utf-8')
        string_data = string_data[6:]  # strip the leading "data: " SSE prefix
        json_data = json.loads(string_data)
        content = json_data['choices'][0]['delta']['content']
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        # skip keep-alive lines, the final "[DONE]" marker, and chunks without content
        pass

2.3 Text completion - TGI

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1

Usage example

import requests
import json

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = f"{base_url}/models/{model_version_id}/proxy/v1"


headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}

endpoints = {
    "models": "models",
    "chat_completion": "chat/completions"
}

model_info = requests.get(f"{inference_url}/{endpoints['models']}", headers=headers).json()
model_name = model_info["data"][0]["id"]


payload = {
    "model": model_name,
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "describe gravity to a 6-year-old child in 50 words"
            # "content": "tell me a long story"
        }
    ],
    "temperature": 0.9,
    "top_p": 0.7,
    "max_tokens": 1000,
    "stream": True
}

chat_response = requests.post(f"{url}/{endpoints['chat_completion']}", headers=headers, json=payload, stream=True)

for token in chat_response.iter_lines():
    try:
        string_data = token.decode('utf-8')
        string_data = string_data[6:]  # strip the leading "data: " SSE prefix
        json_data = json.loads(string_data)
        content = json_data['choices'][0]['delta']['content']
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        # skip keep-alive lines, the final "[DONE]" marker, and chunks without content
        pass

2.4 Sequence Classification

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/classify

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/classify"
)

headers = {"Authorization": f"Bearer {api_key}"}

payload = {
    "inputs": "this is good!",
}
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())

The same request with curl:

curl -X POST "${inference_url}" \
  -H "Authorization: Bearer ${api_key}" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "this is good!"
  }'

Response Format

{
    "classification": [
        {
            "label": "admiration",
            "score": 0.7764764428138733
        },
        {
            "label": "excitement",
            "score": 0.11938948929309845
        },
        {
            "label": "joy",
            "score": 0.04363647475838661
        },
        {
            "label": "approval",
            "score": 0.012329215183854103
        },
        {
            "label": "gratitude",
            "score": 0.010198703035712242
        },
        ...
    ]
}
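
To pick the predicted label on the client side, take the highest-scoring entry (a minimal sketch assuming the response format shown above):

result = response.json()
# entries are already sorted by score in the example above, but sorting
# explicitly keeps the code robust if the order is not guaranteed
top = max(result["classification"], key=lambda item: item["score"])
print(top["label"], top["score"])  # e.g. "admiration" 0.776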

2.5 Automatic Speech Recognition

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/transcript

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/transcript"
)


headers = {"Authorization": f"Bearer {api_key}"}


files = {
    "file": (
        "file_name.mp3",
        open("/path/to/audio_file", "rb"),
        "audio/mpeg",  # adjust the content type to match your audio file
    )
}

response = requests.post(f"{inference_url}", headers=headers, files=files)

Response Format

{
    "text": "Hi, can you help me with the driving license?"
}

2.6 Text To Speech

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/generate-speech

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/generate-speech"
)

headers = {"Authorization": f"Bearer {api_key}"}

# get the model's schema; each TTS model exposes its own set of fields
schema = requests.get(f"{inference_url}/schema", headers=headers).json()

# based on the schema, build the POST request with the supported params
Example response:

[
    {
        "desc": "Hey, how are you doing today?",
        "label": "Prompt",
        "name": "prompt",
        "required": true,
        "type": "string"
    },
    {
        "desc": "A female speaker with a slightly low-pitched voice",
        "label": "Speaker Description",
        "name": "description",
        "required": true,
        "type": "string"
    }
]

Request Body Format

The request body should be in the following format. Use the field names exactly as received in the schema:

{
    "prompt": "Hi, can you help me?",
    "description": "A man with clear voice"
}

Response Format

The received audio is a base64-encoded string:

{
    "audio": "UklGRiRcAgBXQVZFZm10IBAAAA...",
    "sampling_rate": 44100
}
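
Putting it together: a minimal sketch that posts the example request body from above (reusing inference_url and headers from the usage example) and decodes the base64 audio. The sample response starts with a RIFF/WAVE header, so writing the raw bytes as a .wav file works in that case; check your model's actual output format.

import base64

# post the payload built from the schema (field names from the Request Body Format above)
payload = {
    "prompt": "Hi, can you help me?",
    "description": "A man with clear voice",
}
response = requests.post(inference_url, headers=headers, json=payload)
result = response.json()

# decode the base64 audio and write the raw bytes to disk
audio_bytes = base64.b64decode(result["audio"])
with open("output.wav", "wb") as f:
    f.write(audio_bytes)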

2.7 Text To Image

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/generate-image

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/generate-image"
)


headers = {"Authorization": f"Bearer {api_key}"}


payload = {
    "prompt": "A man walking on the moon",
    "num_inference_steps": 20,
    "high_noise_frac": 8,
}
response = requests.post(f"{inference_url}", headers=headers, json=payload)

Response Format

The received image is a base64-encoded string:

{
    "image": "iVBORw0KGgoAAAANS..."
}
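
To save the generated image, decode the base64 string (a minimal sketch, reusing response from the usage example above):

import base64

result = response.json()
image_bytes = base64.b64decode(result["image"])

# the example response above begins with a PNG signature, so the bytes can be
# saved directly as a .png file; other models may return a different format
with open("output.png", "wb") as f:
    f.write(image_bytes)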

2.8 Translation

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/translate

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/translate"
)

headers = {"Authorization": f"Bearer {api_key}"}

payload = {"text": "A man walking on the moon"}
response = requests.post(f"{inference_url}", headers=headers, json=payload)

Response Format

{
    "translation": " qué s'est votre nom?"
}

2.9 Reranking - Embedding - Classification

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

url = (
    f"{base_url}/models/{model_version_id}/proxy/v1"
)

headers = {
    "Authorization": f"Bearer {api_key}",
}

# embedding
embed_payload = {"inputs": "What is Deep Learning?"}
embed_response = requests.post(f"{url}/embed", headers=headers, json=embed_payload)

# re-ranking
rerank_payload = {
    "query": "What is Deep Learning?",
    "texts": ["Deep Learning is not...", "Deep learning is..."],
}
rerank_response = requests.post(f"{url}/rerank", headers=headers, json=rerank_payload)

# classify
classify_payload = {"inputs": "Abu Dhabi is great!"}
classify_response = requests.post(
    f"{url}/predict", headers=headers, json=classify_payload
)
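
Each call returns plain JSON. A quick way to inspect the results; the exact fields depend on the serving backend, so treat the comments below as typical rather than guaranteed:

print(embed_response.json())     # typically a list of embedding vectors
print(rerank_response.json())    # typically a relevance score per candidate text
print(classify_response.json())  # typically predicted labels with scores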

3. Classical ML Models

3.1 API Endpoint

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/predict

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/predict"
)


headers = {"Authorization": f"Bearer {api_key}"}


payload = [1, 2, 3]  # the expected shape depends on your model's input schema (see the response meta below)
response = requests.post(f"{inference_url}", headers=headers, json=payload)

Response Format

{
    "data": [83.4155584413916, 209.9168121704531],
    "meta": {
        "input_schema": [
            {
                "tensor-spec": {
                    "dtype": "float64",
                    "shape": [-1, 10]
                },
                "type": "tensor"
            }
        ],
        "output_schema": [
            {
                "tensor-spec": {
                    "dtype": "float64",
                    "shape": [-1]
                },
                "type": "tensor"
            }
        ]
    }
}
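
The meta block echoes the model's signature. For the schema above, each input row needs 10 features, so a batch of two rows would look like this (a sketch; adapt the feature count and values to your own model's input_schema):

# two rows of 10 features each, matching the [-1, 10] input_schema shown above
payload = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
]
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json()["data"])  # one prediction per input row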

4. Advanced options

Timeout

By default, requests have a timeout of 120 seconds.

This timeout can be adjusted on a per-request basis using the header OICM-Request-Timeout. The value should be the request timeout in seconds.

import requests

requests.post(endpoint, json=data, headers={
    "Authorization": f"Bearer {api_key}",
    "OICM-Request-Timeout": "300"
})

Best Practices

  1. Always store API tokens securely, outside your source code (see the sketch after this list)
  2. Use appropriate timeout values based on your use case
  3. Monitor token expiration dates and rotate them as needed
  4. Use system messages effectively for text generation tasks
  5. Test different temperature and top-p values to find the optimal settings for your use case
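
For the first point, one simple option is to keep the token in an environment variable rather than in source code (a minimal sketch; OICM_API_KEY is just an assumed variable name):

import os
import requests

# read the token from the environment instead of hard-coding it
api_key = os.environ["OICM_API_KEY"]  # assumed variable name; any secret store works

headers = {"Authorization": f"Bearer {api_key}"}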

Troubleshooting

  • If a request fails, check the token's validity and expiration, and inspect the error response (see the snippet after this list)
  • Ensure the model version ID is correct
  • Verify that the request format matches the endpoint specifications
  • Check if the token has the necessary permissions for the requested operation
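
A quick way to surface error details while debugging (a minimal sketch, reusing the variables from the earlier examples):

response = requests.post(inference_url, headers=headers, json=payload)

# print the gateway's error details instead of failing silently
if not response.ok:
    print(f"Request failed with status {response.status_code}: {response.text}")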