Inference API

Run inference on hosted models via simple API calls without managing infrastructure.

What is Inference?

The Inference API lets you call any public model on the platform, or your own private models, without downloading or hosting them yourself.

  • Instant access - No setup or installation required
  • Scalable - Auto-scales with your traffic
  • Fast - Optimized runtimes on GPU infrastructure
  • Simple - REST API with JSON requests and responses

Quick Start

Make your first inference call in seconds:

Example: Text generation

curl https://api.platform.dev/v1/inference/text-generation \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "username/gpt2-finetuned",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.7
  }'

# Response
{
  "generated_text": "Once upon a time, in a land far away...",
  "tokens_used": 42
}

Python example:

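import requests

token = "YOUR_TOKEN"  # replace with your API token

response = requests.post(
    "https://api.platform.dev/v1/inference/text-generation",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "username/gpt2-finetuned",
        "prompt": "Once upon a time",
        "max_tokens": 100
    },
    timeout=30,  # seconds; avoids hanging on a slow or cold-starting model
)
response.raise_for_status()  # surface 4xx/5xx errors before reading the body
print(response.json()["generated_text"])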

Supported Tasks

Text Generation

Text completion, chat, and code generation

POST /v1/inference/text-generation

Text Classification

Sentiment, topic classification, NER

POST /v1/inference/text-classification

Image Classification

Classify images into categories

POST /v1/inference/image-classification

Embeddings

Generate vector embeddings for text

POST /v1/inference/embeddings

Object Detection

Detect and locate objects in images

POST /v1/inference/object-detection

Automatic Speech Recognition

Transcribe audio to text

POST /v1/inference/asr

Task Examples

Text Classification (Sentiment Analysis)

POST /v1/inference/text-classification
{
  "model": "username/sentiment-model",
  "inputs": "I love this product! It's amazing."
}

# Response
{
  "labels": ["positive", "neutral", "negative"],
  "scores": [0.95, 0.03, 0.02]
}
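The response above pairs each label with a score by position. A minimal sketch of picking the winning label, assuming the two arrays are always the same length and are not guaranteed to be sorted (`top_label` is a hypothetical helper name, not part of the API):

```python
def top_label(result: dict) -> tuple:
    """Return the (label, score) pair with the highest score."""
    labels, scores = result["labels"], result["scores"]
    if not labels or len(labels) != len(scores):
        raise ValueError("labels and scores must be non-empty and the same length")
    # Index of the largest score; do not rely on the server sorting the arrays
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


# The sample response shown above
response = {
    "labels": ["positive", "neutral", "negative"],
    "scores": [0.95, 0.03, 0.02],
}
label, score = top_label(response)
print(f"{label} ({score:.2f})")  # positive (0.95)
```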

Image Classification

POST /v1/inference/image-classification
{
  "model": "username/image-classifier",
  "image_url": "https://example.com/image.jpg"
}

# Or with a base64-encoded image
{
  "model": "username/image-classifier",
  "image": "data:image/jpeg;base64,/9j/4AAQ..."
}
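The base64 form above is a data URI: a MIME type followed by the encoded bytes. A small sketch of building that request body from raw image bytes, using only the standard library (`image_payload` is a hypothetical helper, and the default MIME type is an assumption that matches the JPEG example):

```python
import base64


def image_payload(model: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a request body that carries the image as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"model": model, "image": f"data:{mime_type};base64,{encoded}"}


# In practice the bytes would come from a file, e.g. open("photo.jpg", "rb").read().
# Here we use only the first four bytes of a JPEG header to keep the output short.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])
payload = image_payload("username/image-classifier", jpeg_header)
print(payload["image"])  # data:image/jpeg;base64,/9j/4A==
```

Base64 inflates the payload by roughly a third, so `image_url` is usually the lighter option when the image is already hosted somewhere reachable.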

Embeddings

POST /v1/inference/embeddings
{
  "model": "username/embedding-model",
  "inputs": ["Hello world", "Machine learning is fun"]
}

# Response
{
  "embeddings": [
    [0.123, -0.456, 0.789, ...],  # 768 dimensions
    [0.234, -0.567, 0.890, ...]
  ]
}
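Embedding vectors are typically compared with cosine similarity. The sketch below shows the calculation on toy two-dimensional vectors; real responses would contain the full-length vectors shown above, and nothing here is specific to this API:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only, not real model output
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```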

Streaming Responses

Get real-time streaming output for text generation:

Streaming example:

POST /v1/inference/text-generation/stream
{
  "model": "username/llama-finetuned",
  "prompt": "Write a short story",
  "max_tokens": 500,
  "stream": true
}

# Server-Sent Events (SSE) response
data: {"token": "Once", "finished": false}
data: {"token": " upon", "finished": false}
data: {"token": " a", "finished": false}
data: {"token": " time", "finished": false}
...
data: {"token": ".", "finished": true}

JavaScript streaming client:

const response = await fetch('https://api.platform.dev/v1/inference/text-generation/stream', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ' + token,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ model: 'username/model', prompt: 'Hello', stream: true })
});

const reader = response.body.getReader();
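const decoder = new TextDecoder();
let buffer = '';  // holds a partial line carried over between chunks

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } keeps multi-byte characters split across chunks intact
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();  // the last element may be an incomplete line

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.token);
    }
  }
}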
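The same event format can be consumed from Python. This is a minimal sketch of the parsing step only, assuming each event is a single `data:` line carrying the `token` and `finished` fields shown above; `iter_tokens` is a hypothetical helper, and the sample lines stand in for a live connection:

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield tokens from SSE lines, stopping after the event marked finished."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        event = json.loads(line[len("data:"):].strip())
        yield event["token"]
        if event.get("finished"):
            break


# Sample stream; the last line is never reached because the previous event is final
sample = [
    'data: {"token": "Once", "finished": false}',
    "",
    'data: {"token": " upon", "finished": false}',
    'data: {"token": " a", "finished": false}',
    'data: {"token": " time", "finished": true}',
    'data: {"token": " ignored", "finished": false}',
]
print("".join(iter_tokens(sample)))  # Once upon a time
```

With the `requests` library, one way to feed this function is to post with `stream=True` and pass `response.iter_lines(decode_unicode=True)` as `lines`.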

Rate Limits & Quotas

Inference API rate limits by plan:

Free Tier

100 requests/minute • 10 GPU-hours/month

Pro ($9/month)

1,000 requests/minute • 100 GPU-hours/month

Team ($20/user/month)

5,000 requests/minute • 500 GPU-hours/month

Enterprise

Custom limits • Dedicated infrastructure

⚠️ Rate limit exceeded? The API returns HTTP 429. Check the Retry-After header for the wait time.
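A sketch of turning the Retry-After header into a wait time. The page does not say which form the header takes, so this handles both forms allowed by HTTP (a number of seconds or an HTTP date) and falls back to exponential backoff with jitter when the header is missing. The function name and the `base` and `cap` defaults are assumptions:

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(retry_after: Optional[str], attempt: int,
                base: float = 1.0, cap: float = 60.0,
                now: Optional[datetime] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            return float(value)  # delay-seconds form, e.g. "7"
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable header: exponential backoff with full jitter, capped at `cap`
    return random.uniform(0, min(cap, base * 2 ** attempt))


print(retry_delay("7", attempt=0))  # 7.0
```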

Error Handling

Common error codes and how to handle them:

401 Unauthorized

Invalid or missing API token. Check your Authorization header.

403 Forbidden

No access to this model. Ensure the model is public or that you have permission to use it.

404 Not Found

Model not found. Check the model name and namespace.

429 Too Many Requests

Rate limit exceeded. Implement exponential backoff or upgrade your plan.

503 Service Unavailable

Model is loading (cold start) or temporarily unavailable. Retry after a few seconds.
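The codes above split into two groups: 429 and 503 are worth retrying, while 401, 403, and 404 will not succeed until the request itself is fixed. A sketch of a retry loop built on that split; `send` is a hypothetical zero-argument callable returning `(status, body)`, treating any 2xx status as success is an assumption, and the delays are illustrative:

```python
import time

RETRYABLE = {429, 503}  # rate limited; model loading or temporarily unavailable
FATAL_HINTS = {
    401: "check the Authorization header",
    403: "model is private or you lack permission",
    404: "check the model name and namespace",
}


class InferenceError(Exception):
    def __init__(self, status, hint=""):
        super().__init__(f"HTTP {status}: {hint}" if hint else f"HTTP {status}")
        self.status = status


def call_with_retries(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, retrying only 429 and 503 responses."""
    for attempt in range(max_attempts):
        status, body = send()
        if 200 <= status < 300:
            return body
        if status not in RETRYABLE:
            raise InferenceError(status, FATAL_HINTS.get(status, ""))
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise InferenceError(status, "still failing after retries")


# Demo with a fake transport: cold start, then rate limit, then success
replies = iter([(503, None), (429, None), (200, {"generated_text": "ok"})])
waits = []
result = call_with_retries(lambda: next(replies), sleep=waits.append)
print(result, waits)  # {'generated_text': 'ok'} [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the loop testable without real delays.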

Advanced Options

Control generation behavior with these request parameters:

# Text generation parameters
{
  "model": "username/model",
  "prompt": "Your input text",
  "max_tokens": 100,           // Max tokens to generate
  "temperature": 0.7,          // 0.0 = deterministic, 2.0 = very random
  "top_p": 0.9,                // Nucleus sampling
  "top_k": 50,                 // Top-K sampling
  "frequency_penalty": 0.0,    // Penalize tokens in proportion to how often they have appeared
  "presence_penalty": 0.0,     // Penalize tokens that have appeared at least once
  "stop_sequences": ["\n"],    // Stop generation at these sequences
  "seed": 42                   // For reproducible outputs
}
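To make `temperature`, `top_k`, and `top_p` concrete, here is a sketch of how these three are commonly applied to a next-token distribution. It illustrates the general technique on toy logits and is not a description of how this platform implements sampling; the function name is hypothetical:

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    # Temperature: divide logits before softmax; lower values sharpen the distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top-K: keep only the k most likely tokens
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-P (nucleus): keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}  # made-up values for illustration
print(sorted(filter_distribution(toy_logits, top_k=2)))  # ['a', 'b']
print(filter_distribution(toy_logits, top_p=0.5))        # {'a': 1.0}
```

A sampler would then draw the next token from the returned distribution; with a fixed `seed`, that draw becomes repeatable.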

Best Practices

  • Cache responses - Cache identical requests to reduce costs and latency
  • Handle rate limits - Implement exponential backoff for 429 errors
  • Set timeouts - Always set request timeouts (30-60s recommended)
  • Batch requests - Send multiple inputs in one request when possible
  • Monitor usage - Track API calls and costs in your dashboard
  • Use streaming - For long responses, use streaming to improve UX
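A sketch of the caching and batching advice above. It keys the cache on a hash of the endpoint plus the canonical JSON body, and splits a long input list into fixed-size batches. All names are hypothetical, `send` stands in for whatever function performs the HTTP call, and the in-memory dict is the simplest possible store:

```python
import hashlib
import json


def cache_key(endpoint: str, payload: dict) -> str:
    """Stable key: identical requests hash the same regardless of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{endpoint}\n{canonical}".encode("utf-8")).hexdigest()


def batched(items: list, size: int) -> list:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


class CachedClient:
    """Wraps a send(endpoint, payload) callable with an in-memory cache."""

    def __init__(self, send):
        self._send = send
        self._cache = {}

    def infer(self, endpoint: str, payload: dict):
        key = cache_key(endpoint, payload)
        if key not in self._cache:
            self._cache[key] = self._send(endpoint, payload)
        return self._cache[key]


# Demo with a fake transport that records how often it is called
sent = []
client = CachedClient(lambda endpoint, payload: sent.append(endpoint) or {"ok": True})
client.infer("/v1/inference/embeddings", {"model": "m", "inputs": ["a"]})
client.infer("/v1/inference/embeddings", {"inputs": ["a"], "model": "m"})
print(len(sent))                      # 1
print(batched([1, 2, 3, 4, 5], 2))    # [[1, 2], [3, 4], [5]]
```

Caching only pays off for repeatable requests, such as embeddings or classification, or text generation with a fixed `seed`; sampled generations differ from call to call, so a cached answer would hide that variation.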