Inference API
Run inference on hosted models via simple API calls without managing infrastructure.
What is Inference?
The Inference API lets you call any public model on the platform, or your own private models, without downloading or hosting them yourself.
- Instant access - No setup or installation required
- Scalable - Auto-scales with your traffic
- Fast - Optimized runtimes on GPU infrastructure
- Simple - REST API with JSON requests and responses
Quick Start
Make your first inference call in seconds:
Example: Text generation
curl https://api.platform.dev/v1/inference/text-generation \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "username/gpt2-finetuned",
"prompt": "Once upon a time",
"max_tokens": 100,
"temperature": 0.7
}'
# Response
{
"generated_text": "Once upon a time, in a land far away...",
"tokens_used": 42
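}
Python example:
import os
import requests

# Read the token from the environment; the variable name here is a placeholder
token = os.environ["PLATFORM_API_TOKEN"]

response = requests.post(
    "https://api.platform.dev/v1/inference/text-generation",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "username/gpt2-finetuned",
        "prompt": "Once upon a time",
        "max_tokens": 100,
    },
    timeout=30,  # avoid hanging forever on a slow or cold-starting model
)
response.raise_for_status()  # raise on 4xx/5xx instead of failing on the key lookup below
print(response.json()["generated_text"])
Supported Tasks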
Text Generation
Complete text, chat, code generation
POST /v1/inference/text-generation
Text Classification
Sentiment, topic classification, NER
POST /v1/inference/text-classification
Image Classification
Classify images into categories
POST /v1/inference/image-classification
Embeddings
Generate vector embeddings for text
POST /v1/inference/embeddings
Object Detection
Detect and locate objects in images
POST /v1/inference/object-detection
Automatic Speech Recognition
Transcribe audio to text
POST /v1/inference/asr
Task Examples
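Every task example in this section follows the same pattern: a POST with a JSON body to the task's endpoint. A minimal Python sketch of that pattern is shown here. The `build_request` helper and the `TASK_PATHS` table are illustrative names, not part of the API; the base URL and paths are the ones listed under Supported Tasks.

```python
BASE_URL = "https://api.platform.dev"

# Endpoint paths as listed under Supported Tasks
TASK_PATHS = {
    "text-generation": "/v1/inference/text-generation",
    "text-classification": "/v1/inference/text-classification",
    "image-classification": "/v1/inference/image-classification",
    "embeddings": "/v1/inference/embeddings",
    "object-detection": "/v1/inference/object-detection",
    "asr": "/v1/inference/asr",
}


def build_request(task, token, payload):
    """Return the URL, headers, and JSON body for one inference call.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    if task not in TASK_PATHS:
        raise ValueError(f"Unsupported task: {task!r}")
    return {
        "url": BASE_URL + TASK_PATHS[task],
        "headers": {"Authorization": f"Bearer {token}"},
        "json": payload,
    }


if __name__ == "__main__":
    request = build_request(
        "embeddings",
        "YOUR_TOKEN",
        {"model": "username/embedding-model", "inputs": ["Hello world"]},
    )
    print(request["url"])
```

With the `requests` library, the returned dictionary can be passed on as `requests.post(**request, timeout=30)`.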
Text Classification (Sentiment Analysis)
POST /v1/inference/text-classification
{
"model": "username/sentiment-model",
"inputs": "I love this product! It's amazing."
}
# Response
{
"labels": ["positive", "neutral", "negative"],
"scores": [0.95, 0.03, 0.02]
}
Image Classification
POST /v1/inference/image-classification
{
"model": "username/image-classifier",
"image_url": "https://example.com/image.jpg"
}
# Or with base64 encoded image
{
"model": "username/image-classifier",
"image": "data:image/jpeg;base64,/9j/4AAQ..."
}
Embeddings
POST /v1/inference/embeddings
{
"model": "username/embedding-model",
"inputs": ["Hello world", "Machine learning is fun"]
}
# Response
{
"embeddings": [
[0.123, -0.456, 0.789, ...], # 768 dimensions
[0.234, -0.567, 0.890, ...]
]
}
Streaming Responses
Get real-time streaming output for text generation:
Streaming example:
POST /v1/inference/text-generation/stream
{
"model": "username/llama-finetuned",
"prompt": "Write a short story",
"max_tokens": 500,
"stream": true
}
# Server-Sent Events (SSE) response
data: {"token": "Once", "finished": false}
data: {"token": " upon", "finished": false}
data: {"token": " a", "finished": false}
data: {"token": " time", "finished": false}
...
data: {"token": ".", "finished": true}
JavaScript streaming client:
const response = await fetch('https://api.platform.dev/v1/inference/text-generation/stream', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ' + token,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ model: 'username/model', prompt: 'Hello', stream: true })
});
if (!response.ok) throw new Error(`Request failed: ${response.status}`);
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = ''; // holds a partial line until the rest of it arrives
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // the last element may be an incomplete line
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.token);
    }
  }
}
Rate Limits & Quotas
Inference API rate limits by plan:
Free Tier
100 requests/minute • 10 GPU-hours/month
Pro ($9/month)
1,000 requests/minute • 100 GPU-hours/month
Team ($20/user/month)
5,000 requests/minute • 500 GPU-hours/month
Enterprise
Custom limits • Dedicated infrastructure
⚠️ Rate limit exceeded? The API returns HTTP 429. Check the Retry-After header for the wait time.
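One way to honor that header is sketched here in Python. It assumes `Retry-After` carries a plain number of seconds (HTTP also allows a date form, which this sketch does not parse) and falls back to a default wait otherwise. The `wait_seconds` name, the default, and the cap are illustrative choices, not values defined by the API.

```python
import math


def wait_seconds(headers, default=1.0, maximum=60.0):
    """Return how long to wait after a 429, based on the Retry-After header.

    headers is any mapping of response headers. A plain dict is case-sensitive,
    so the key must be spelled "Retry-After" exactly.
    """
    raw = headers.get("Retry-After")
    try:
        delay = float(raw)
    except (TypeError, ValueError):
        # Header missing, or not a plain number of seconds (e.g. an HTTP date)
        return default
    if not math.isfinite(delay):
        return default
    # Clamp so a negative or very large value cannot break or stall the client
    return min(max(delay, 0.0), maximum)


if __name__ == "__main__":
    print(wait_seconds({"Retry-After": "5"}))
```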
Error Handling
Common error codes and how to handle them:
401 Unauthorized
Invalid or missing API token. Check your Authorization header.
403 Forbidden
No access to this model. Ensure the model is public or you have permissions.
404 Not Found
Model not found. Check the model name and namespace.
429 Too Many Requests
Rate limit exceeded. Implement exponential backoff or upgrade your plan.
503 Service Unavailable
Model is loading (cold start) or temporarily unavailable. Retry after a few seconds.
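The handling advice above can be sketched as a small retry loop in Python. It treats 429 and 503 as retryable with exponential backoff and returns every other status to the caller immediately. The `send` callable, the attempt count, and the delays are assumptions chosen for illustration, not values prescribed by the API.

```python
import time

RETRYABLE = {429, 503}  # rate limited, or model loading / temporarily unavailable


def call_with_retries(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    sleep is injectable so the loop can be tested without real waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            # Success, or an error such as 401/403/404 that a retry will not fix
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return status, body
```

The last response is returned even when all attempts fail, so the caller can still inspect the final status code.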
Advanced Options
Fine-tune your inference requests with these parameters:
# Text generation parameters
{
"model": "username/model",
"prompt": "Your input text",
"max_tokens": 100, // Max tokens to generate
"temperature": 0.7, // 0.0 = deterministic, 2.0 = very random
"top_p": 0.9, // Nucleus sampling: sample from the smallest token set whose cumulative probability reaches top_p
"top_k": 50, // Top-K sampling: sample only from the k most likely tokens
"frequency_penalty": 0.0, // Penalize tokens in proportion to how often they have already appeared
"presence_penalty": 0.0, // Penalize any token that has already appeared at least once
"stop_sequences": ["\n"], // Stop generation at these sequences
"seed": 42 // For reproducible outputs
}
Best Practices
- ✅ Cache responses - Cache identical requests to reduce costs and latency
- ✅ Handle rate limits - Implement exponential backoff for 429 errors
- ✅ Set timeouts - Always set request timeouts (30-60s recommended)
- ✅ Batch requests - Send multiple inputs in one request when possible
- ✅ Monitor usage - Track API calls and costs in your dashboard
- ✅ Use streaming - For long responses, use streaming to improve UX
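As an illustration of the caching advice, the Python sketch below keys an in-memory cache on the endpoint plus a canonical form of the request body. The `cache_key` and `cached_call` helpers and the `send` callable are hypothetical names, not part of the API. Caching like this is only appropriate for requests whose output is expected to be the same each time, such as embeddings or classification.

```python
import hashlib
import json

_cache = {}  # simple in-process cache; not shared across processes


def cache_key(endpoint, payload):
    """Build a stable key: same endpoint and same JSON body give the same key."""
    # sort_keys makes the key independent of dictionary ordering
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{endpoint}\n{canonical}".encode("utf-8")).hexdigest()


def cached_call(endpoint, payload, send):
    """Return a cached result if present, otherwise call send(endpoint, payload)."""
    key = cache_key(endpoint, payload)
    if key not in _cache:
        _cache[key] = send(endpoint, payload)
    return _cache[key]
```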