Everything an AI agent can do with the Replicate API.

A reference guide for building AI agents: every method, how to authenticate, and the permissions each one needs.

Endpoints: 27
API version: v1
Last updated: 23 June 2026
Orientation

How the Replicate API works.

The Replicate API is how an app or AI agent runs machine-learning models: creating a prediction to generate an image or transcribe audio, fine-tuning a model through a training, fetching a result, or listing models and deployments. Access is granted through an account API token, which carries the full access of the account it belongs to with no per-endpoint scopes to narrow it. Replicate can push a prediction's state changes to a webhook, so an integration learns when a long-running job finishes without polling.

Endpoints: 27
Capability groups: 7
Read: 14
Write: 13
Permissions: 1
Authentication
Replicate authenticates every call with an account API token sent as 'Authorization: Bearer <token>'. A token is created and revoked on the account's API tokens page. There is no OAuth flow for first-party calls, and a token represents the user or organization it belongs to.
Permissions
A Replicate token is account-level and has no granular per-endpoint or per-resource scopes. It carries the full access of its account, so the same token that lists a model can also create predictions that cost money, delete a private model, or cancel a training. Limiting what a token can do is left to whatever sits in front of the API, not to the token itself.
Versioning
The HTTP API is served as a single current version, with changes shipped through a public changelog rather than new dated version strings. A model is versioned separately: each push creates a new model version with its own id and input and output schema, and a prediction can pin the exact version it runs.
Data model
Replicate is resource-oriented JSON over HTTPS at https://api.replicate.com/v1. A prediction runs a model version with a set of inputs and returns output and logs, moving through starting, processing, then succeeded, failed, or canceled. Models, deployments, trainings, and files are the other core resources, and a state change can be pushed to a webhook. Lists are cursor-paginated.
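
The lifecycle above can be sketched as a small helper that tells an agent whether a prediction is still running. This is a minimal sketch: the `status` field name and the split into in-progress and terminal sets are assumptions drawn from the state names in the text, not a confirmed schema.

```python
# States named in the text above; grouping them into "in progress" and
# "terminal" is this sketch's own reading of that sequence.
IN_PROGRESS = {"starting", "processing"}
TERMINAL = {"succeeded", "failed", "canceled"}


def is_finished(prediction: dict) -> bool:
    """Return True once a prediction object has reached a final state."""
    status = prediction.get("status")  # assumed field name
    if status in TERMINAL:
        return True
    if status in IN_PROGRESS:
        return False
    # Unknown values are surfaced rather than silently treated as done.
    raise ValueError(f"unexpected prediction status: {status!r}")
```

An agent that polls would stop as soon as `is_finished` returns True, then inspect the output or the logs.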
Connect & authenticate

Connection & authentication methods.

How an app or AI agent connects to Replicate determines what it can reach. There is a route for making calls, a route for receiving events when a prediction changes state, and a hosted server that exposes Replicate operations to agents, and each is governed by the API token behind it.

Ways to connect

HTTP API

The HTTP API answers at https://api.replicate.com/v1. It takes JSON request bodies, returns JSON, and pages through lists with a cursor. Every call authenticates with an account API token sent as 'Authorization: Bearer <token>'.

Best for: Connecting an app or AI agent to Replicate.
Governed by: The API token, which carries the full access of its account.

Webhooks

Replicate POSTs the prediction or training object to an HTTPS URL named on the request when the job changes state, filtered by start, output, logs, and completed. To confirm the request came from Replicate, the receiver checks the webhook-id, webhook-timestamp, and webhook-signature headers by computing an HMAC-SHA256 over the signed content with the default endpoint's signing secret (whsec_...).

Best for: Receiving Replicate events at an app or AI agent.
Governed by: The signing secret on the default webhook endpoint.

MCP server

Replicate's official Model Context Protocol server exposes the operations of the HTTP API to AI agents and LLM clients, like searching and fetching models, running predictions and retrieving results, and managing deployments and webhooks. The remote server at mcp.replicate.com authenticates through a web flow where an account API key is provided for the server to use; a local npm package, replicate-mcp, runs with an API token set in the client. It stays current as the HTTP API adds features.

Best for: Connecting an AI agent to Replicate through MCP.
Governed by: The API token the server is given.
Authentication

API token

Replicate authenticates every call with an account API token sent as a Bearer token in the Authorization header. A token is account-level: it carries the full access of the user or organization it belongs to, with no per-endpoint or per-resource scopes to narrow it. The same token that reads a model can create predictions that cost money, delete a private model, or cancel a training. A token is created and revoked on the account's API tokens page, and an organization token can be tied to a service account.

Token: Bearer API token (r8_...)
Best for: Server-side calls with full account access.
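
As an illustration of the Bearer scheme described above, the sketch below builds an authenticated request without sending it. The helper name `build_request` and the `/account` path used in the usage note are assumptions for illustration; only the base URL and the header format come from the text.

```python
import urllib.request

API_BASE = "https://api.replicate.com/v1"


def build_request(path: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) an authenticated request for the given path."""
    if not token:
        raise ValueError("an API token is required")
    return urllib.request.Request(
        url=f"{API_BASE}{path}",
        method=method,
        headers={
            # The token carries the full access of its account.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Calling `build_request("/account", token)` and passing the result to `urllib.request.urlopen` would be one way to check which account a token authenticates as, assuming that path exists as described in the account group.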
Capability map

What an AI agent can do in Replicate.

The Replicate API is split into areas an agent can act on, like predictions, models, deployments, trainings, files, and the account. A Replicate API token carries the full access of the account it belongs to, so the same token that lists a model can also create predictions that cost money, delete a private model, or cancel a training.

Endpoint reference

Every Replicate API method.


Method | Endpoint | What it does | Access | Permission | Version

Predictions

Create a prediction to run a model, retrieve its state and output, list past predictions, and cancel a running one.
Endpoints: 5

Runs a model and bills the account. Takes a version and input, and an optional webhook plus webhook_events_filter (start, output, logs, completed).

Acts on: prediction
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: prediction.completed
Rate limit: 600 requests per minute
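
A minimal sketch of the request body this endpoint takes, using the field names given above (version, input, webhook, webhook_events_filter). The helper name `prediction_body` and the example values in the usage note are hypothetical.

```python
import json
from typing import Optional, Sequence

# Filter values named in the text above.
ALLOWED_EVENTS = ("start", "output", "logs", "completed")


def prediction_body(
    version: str,
    model_input: dict,
    webhook: Optional[str] = None,
    events: Optional[Sequence[str]] = None,
) -> str:
    """Serialize a create-prediction body, validating the webhook options."""
    body = {"version": version, "input": model_input}
    if events and webhook is None:
        raise ValueError("webhook_events_filter needs a webhook URL")
    if webhook is not None:
        if not webhook.startswith("https://"):
            raise ValueError("webhook must be an HTTPS URL")
        body["webhook"] = webhook
        if events:
            unknown = [e for e in events if e not in ALLOWED_EVENTS]
            if unknown:
                raise ValueError(f"unknown webhook events: {unknown}")
            body["webhook_events_filter"] = list(events)
    return json.dumps(body)
```

For example, `prediction_body("<version-id>", {"prompt": "a red fox"}, "https://example.com/hook", ["completed"])` asks for a single delivery when the job finishes. The version id, input, and URL here are placeholders.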

Status moves through starting, processing, then succeeded, failed, or canceled.

Acts on: prediction
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: prediction
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Stops billing for any remaining run time.

Acts on: prediction
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Runs a model and bills the account, addressing the model by owner and name rather than a version id.

Acts on: prediction
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: prediction.completed
Rate limit: 600 requests per minute

Models

Get and list models, create and update a model, run a model's official version, and manage its versions.
Endpoints: 8

Read-only.

Acts on: model
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: model
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Sets owner, name, visibility (public or private), and the hardware it runs on.

Acts on: model
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Changes properties on a model the account owns.

Acts on: model
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Only a private model with no published versions can be deleted.

Acts on: model
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: model version
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: model version
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Deletes the version and the predictions and output files tied to it.

Acts on: model version
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Deployments

Create, read, update, and delete deployments, and run predictions against a deployment.
Endpoints: 4

Sets the model version, hardware, and the minimum and maximum number of running instances.

Acts on: deployment
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: deployment
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Changing instance counts affects running cost.

Acts on: deployment
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Runs the deployment's model and bills the account.

Acts on: prediction
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: prediction.completed
Rate limit: 600 requests per minute

Trainings

Start a training to fine-tune a model, retrieve its state, list past trainings, and cancel a running one.
Endpoints: 4

Runs a training job and bills the account, writing the result into a destination model.

Acts on: training
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: training.completed
Rate limit: 3000 requests per minute

Read-only.

Acts on: training
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: training
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Stops billing for any remaining run time.

Acts on: training
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Files

Upload a file to use as model input, retrieve a file's metadata, and list uploaded files.
Endpoints: 3

Stores a file on the account and returns a URL to reference as model input.

Acts on: file
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: file
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only.

Acts on: file
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Account & hardware

Read the authenticated account and list the hardware a model can run on.
Endpoints: 2

Read-only; confirms which account a token authenticates as.

Acts on: account
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Read-only; each entry has a name and an SKU used when creating a model.

Acts on: hardware
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute

Webhooks

Retrieve the signing secret used to verify that a webhook came from Replicate.
Endpoints: 1

Returns a key prefixed with whsec_ used to check the webhook-signature header. Read-only.

Acts on: webhook
Permission (capability): API token
Version: Available since the API’s base version
Webhook event: None
Rate limit: 3000 requests per minute
Webhooks

Webhook events.

Replicate can notify an app when a prediction or training changes state, like starting, producing output, or completing. It posts the prediction or training object to an HTTPS URL named on the request, so an integration learns when a long-running job finishes without polling.

Event: prediction completed
What it signals: A prediction finished, reaching succeeded, failed, or canceled. The completed event in webhook_events_filter delivers the final prediction object, while start, output, and logs deliver earlier states.
Triggered by:
  • /v1/predictions
  • /v1/models/{model_owner}/{model_name}/predictions
  • /v1/deployments/{deployment_owner}/{deployment_name}/predictions

Event: training completed
What it signals: A training finished, reaching succeeded, failed, or canceled. A training is a kind of prediction, so the same webhook and webhook_events_filter options apply.
Triggered by:
  • /v1/models/{model_owner}/{model_name}/versions/{version_id}/trainings
Rate limits & pagination

Rate limits, pagination & request size.

Replicate limits calls by a per-minute request rate, with a separate, lower ceiling on creating predictions. Stricter limits apply as account credit runs low.

Request rate

Replicate meters requests by a per-minute rate, not by a point or quota cost. Creating predictions is capped at 600 requests per minute, and every other endpoint at 3000 requests per minute. Short bursts above these defaults are allowed before throttling begins, and the ceilings tighten as account credit runs low to prevent overspending. Going over returns HTTP 429 with a detail field that names when the limit resets, for example 'Request was throttled. Expected available in 30s.'
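
One way to act on a 429 is to read the wait time out of the detail message and retry after it. The sketch below assumes the detail text follows the example quoted above; the helper names, the `(status, body)` shape returned by `send`, and the fallback delay are hypothetical.

```python
import re
import time
from typing import Callable, Tuple

# Matches the wait time in messages like "Expected available in 30s."
_WAIT = re.compile(r"Expected available in (\d+(?:\.\d+)?)\s*s")


def retry_delay(detail: str, default: float = 1.0) -> float:
    """Extract the wait in seconds from a throttle message, or use a default."""
    match = _WAIT.search(detail)
    return float(match.group(1)) if match else default


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 3,
) -> Tuple[int, dict]:
    """Call send() until it stops returning 429 or attempts run out."""
    status, body = send()
    for _ in range(max_attempts - 1):
        if status != 429:
            break
        sleep(retry_delay(body.get("detail", "")))
        status, body = send()
    return status, body
```

Passing `sleep` in as a parameter keeps the retry loop testable without real waiting.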

Pagination

List endpoints, like predictions, trainings, models, and files, are cursor-paginated. A response carries next and previous URLs. Follow the next URL to fetch each subsequent page until it is absent, rather than building the URL by hand.
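
Following the next URL can be sketched as a generator. This is a minimal sketch: the `results` key holding each page's items is an assumption, and `fetch` is a hypothetical caller-supplied function that performs the authenticated GET and returns the parsed JSON.

```python
from typing import Callable, Iterator, Optional


def iter_results(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item from a cursor-paginated list, page by page."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        # Assumed shape: {"results": [...], "next": <url or None>, ...}
        yield from page.get("results", [])
        # Use the URL the API returns instead of constructing one.
        url = page.get("next")
```

Because it is a generator, a caller can stop early without fetching the remaining pages.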

Request size

Responses are JSON. A file can be uploaded through the files endpoint to reference as model input rather than inlining large data, and output files and the predictions tied to a model version are removed when that version is deleted.

Errors

Status codes & error handling.

The status codes an agent should handle, and what to do about each.

Status: 401
Code: Unauthorized
Meaning: No valid API token was provided, or it is invalid or revoked.
What to do: Send a valid token in the Authorization header as 'Bearer <token>', and rotate it if it has leaked.

Status: 402
Code: Payment Required
Meaning: The account cannot be billed for the request, for example when it has no payment method or has run out of credit.
What to do: Add or update the account's billing details, then retry.

Status: 404
Code: Not Found
Meaning: The requested object does not exist, or the token's account cannot see it.
What to do: Check the path, owner, name, or id, and confirm the token's account has access.

Status: 422
Code: Unprocessable Entity
Meaning: The request was well-formed but a field failed validation, such as model input that does not match the version's input schema.
What to do: Read the detail field, fix the named input, and resend.

Status: 429
Code: Throttled
Meaning: The request rate was exceeded. Creating predictions is capped lower than other endpoints, and limits tighten as account credit runs low. The body's detail field names when the limit resets, for example 'Request was throttled. Expected available in 30s.'
What to do: Back off and retry after the time named in the detail message, and smooth the request rate.
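
The status codes above can be folded into a small dispatch helper so an agent reacts consistently. The action labels below are hypothetical names chosen for this sketch, not values the API returns.

```python
# Action labels are hypothetical names chosen for this sketch.
ACTIONS = {
    401: "replace_token",
    402: "fix_billing",
    404: "check_path_or_access",
    422: "fix_input",
    429: "back_off_and_retry",
}


def next_step(status: int) -> str:
    """Map an HTTP status code to the handling suggested in the text above."""
    if 200 <= status < 300:
        return "ok"
    return ACTIONS.get(status, "unhandled")


def is_retryable(status: int) -> bool:
    """Only throttling is retried unchanged; the others need a fix first."""
    return status == 429
```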
Versioning & freshness

Version history.

Replicate serves a single current version of its HTTP API and ships changes through a public changelog rather than minting new version strings. A model is versioned separately, and a prediction can pin the exact model version it runs.

Version history

What changed, and when

Latest version: v1
v1: Current version
Current HTTP API (single version)

Replicate serves one current version of its HTTP API at the /v1 path and ships changes through a public changelog rather than minting new dated version strings. Models are versioned separately from the API, and a prediction can pin the exact model version it runs. The entries below are notable dated changes from the changelog.

What changed
  • The HTTP API is a single, continuously updated version.
  • Models carry their own versions, each with an id and input and output schema.
2026-02-10: Feature update
MCP server auto-discovery

Replicate's MCP server became discoverable through the official MCP Registry, publishing metadata at a /.well-known/mcp/server.json endpoint following the server.json specification.

What changed
  • MCP server metadata published for auto-discovery via the MCP Registry.

The HTTP API is a single current version; pin the model version a prediction runs.

Replicate changelog ↗
Questions

Replicate API, answered.

How does authentication work, and does Replicate use OAuth?
Every request carries an account API token in the Authorization header as 'Bearer <token>'. Replicate does not use OAuth for first-party API calls. A token is created and revoked on the account's API tokens page, and it authenticates as the user or organization it belongs to. The remote MCP server adds a web-based flow where an account API key is provided for the server to use on the account's behalf.
Can I limit what a token can do, by endpoint or resource?
No. A Replicate token is account-level and has no per-endpoint or per-resource scopes. It carries the full access of its account, so the same token that reads a model can also create billable predictions, delete a private model, or cancel a training. Narrowing access has to come from whatever sits in front of the API, such as a gateway, rather than from the token.
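
As a rough illustration of narrowing access in front of the API, a gateway could check each call against a deny-by-default allowlist before forwarding it with the real token. The policy below is hypothetical, and the `/v1/predictions/<id>` path shape is an assumption; only `/v1/predictions` appears in this guide.

```python
import re

# Hypothetical policy: this agent may create and read predictions only.
ALLOWED = [
    ("POST", re.compile(r"^/v1/predictions$")),
    ("GET", re.compile(r"^/v1/predictions(/[^/]+)?$")),
]


def is_allowed(method: str, path: str) -> bool:
    """Deny by default: a call passes only if a rule matches it exactly."""
    return any(
        rule_method == method.upper() and pattern.match(path) is not None
        for rule_method, pattern in ALLOWED
    )
```

With this policy, `is_allowed("POST", "/v1/predictions")` passes while a DELETE on a model path is refused before it ever reaches Replicate.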
What are the rate limits?
Creating predictions is limited to 600 requests per minute, and every other endpoint to 3000 requests per minute. Short bursts above these defaults are allowed before throttling, and the ceilings tighten as account credit runs low. Going over returns HTTP 429 with a detail field saying when the limit resets, like 'Request was throttled. Expected available in 30s.'
How do I receive a result instead of polling?
Name an HTTPS webhook URL on the create-prediction or create-training request, and choose which states to receive with webhook_events_filter from start, output, logs, and completed. Replicate POSTs the prediction object as those states occur. Server-sent events are an alternative for streaming output without polling.
How do I verify a webhook really came from Replicate?
Fetch the default endpoint's signing secret from GET /v1/webhooks/default/secret, which returns a key prefixed with whsec_. Each webhook carries webhook-id, webhook-timestamp, and webhook-signature headers. Concatenate the id, timestamp, and raw body, compute an HMAC-SHA256 with the secret, and compare it in constant time against the signature in the header, also checking the timestamp is recent to block replays.
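
The verification steps above can be sketched as follows. Several details are assumptions rather than facts from this guide: that the id, timestamp, and body are joined with "." separators, that the text after "whsec_" is base64-encoded key material, that the header holds space-separated "v1,<base64 signature>" entries, and the five-minute tolerance.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(
    secret: str,
    webhook_id: str,
    timestamp: str,
    signature_header: str,
    body: bytes,
    tolerance: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Return True if the signature matches and the timestamp is recent."""
    current = time.time() if now is None else now
    try:
        if abs(current - int(timestamp)) > tolerance:
            return False  # too old or too far ahead: possible replay
    except ValueError:
        return False
    # Assumption: the part after "whsec_" is base64-encoded key bytes.
    raw = secret[len("whsec_"):] if secret.startswith("whsec_") else secret
    key = base64.b64decode(raw)
    # Assumption: signed content is "<id>.<timestamp>.<raw body>".
    signed = f"{webhook_id}.{timestamp}.".encode() + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest())
    # Assumption: header entries look like "v1,<base64 signature>".
    for entry in signature_header.split():
        _, _, candidate = entry.partition(",")
        if hmac.compare_digest(candidate.encode(), expected):
            return True
    return False
```

The raw request bytes should be passed as `body`, since re-serializing parsed JSON can change the bytes and break the comparison.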
How does versioning work, for the API and for models?
The HTTP API is a single current version, and changes are published in the changelog rather than as new dated version strings. Models are versioned independently: each push creates a new model version with its own id and input and output schema. A prediction can pin the exact model version it runs, so a fixed version keeps producing consistent results.
What does it cost to run a prediction?
A prediction bills the account for the time the model runs on its hardware, so creating a prediction or a training is a billable action, not a free read. Public models are billed per run, and a deployment or a private model is billed for the hardware it uses. A 402 response means the account cannot be billed, for example with no payment method or no credit.
What is Bollard AI?

Control what every AI agent can do in Replicate.

Bollard AI sits between a team's AI agents and Replicate. Grant each agent exactly the access it needs, read or write, action by action, and every call is checked and logged.

  • Allow running predictions while blocking model and deployment changes, without handing agents a shared Replicate token.
  • Denied by default, so an agent reaches only what has been explicitly allowed.
  • Every call recorded in plain English: who, what, where, and the decision.