
Everything an AI agent can do with the Replicate API.

A reference guide for building AI agents: every method, how to authenticate, and the permissions each one needs.

Endpoints: 27

API version: v1

Last updated: 23 June 2026

Orientation

How the Replicate API works.

The Replicate API is how an app or AI agent runs machine-learning models: creating a prediction to generate an image or transcribe audio, fine-tuning a model through a training, fetching a result, or listing models and deployments. Access is granted through an account API token, which carries the full access of the account it belongs to with no per-endpoint scopes to narrow it. Replicate can push a prediction's state changes to a webhook, so an integration learns when a long-running job finishes without polling.

27 endpoints

7 capability groups

14 read

13 write

1 permission

Authentication

Replicate authenticates every call with an account API token sent as 'Authorization: Bearer <token>'. A token is created and revoked on the account's API tokens page. There is no OAuth flow for first-party calls, and a token represents the user or organization it belongs to.
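As a rough illustration of the header described above, the sketch below builds (but does not send) an authenticated request with Python's standard library. The helper name build_request and the example token value are hypothetical; the base URL and the Bearer scheme come from this guide.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://api.replicate.com/v1"  # base URL given in this guide


def build_request(token: str, path: str, method: str = "GET",
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request object without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(API_BASE + path, data=data, method=method)
    # The account API token travels as a Bearer token on every call.
    request.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


# GET /v1/account confirms which account a token authenticates as.
req = build_request("r8_example_token", "/account")  # placeholder token
```

Passing the returned object to urllib.request.urlopen would perform the call; keeping construction separate makes the header logic easy to check without network access.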

Permissions

A Replicate token is account-level and has no granular per-endpoint or per-resource scopes. It carries the full access of its account, so the same token that lists a model can also create predictions that cost money, delete a private model, or cancel a training. Limiting what a token can do is left to whatever sits in front of the API, not to the token itself.
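Because the token itself cannot be narrowed, any restriction has to be enforced before a request reaches Replicate. The sketch below shows one possible shape for such a check: a method-and-path allowlist. The policy contents and the name is_allowed are illustrative assumptions, not part of the Replicate API.

```python
import re

# Hypothetical policy: this agent may create and read predictions and read
# a model, and nothing else. Patterns are matched against the full path.
ALLOWED = [
    ("POST", r"/v1/predictions"),
    ("GET", r"/v1/predictions/[^/]+"),
    ("GET", r"/v1/models/[^/]+/[^/]+"),
]


def is_allowed(method: str, path: str) -> bool:
    """Return True only if the method and path match an allowlisted rule."""
    return any(
        method.upper() == allowed_method and re.fullmatch(pattern, path)
        for allowed_method, pattern in ALLOWED
    )
```

A proxy or gateway would call is_allowed before forwarding the request with the real token, so the agent never holds the token directly.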

Versioning

The HTTP API is served as a single current version, with changes shipped through a public changelog rather than new dated version strings. A model is versioned separately: each push creates a new model version with its own id and input and output schema, and a prediction can pin the exact version it runs.

Data model

Replicate is resource-oriented JSON over HTTPS at https://api.replicate.com/v1. A prediction runs a model version with a set of inputs and returns output and logs, moving through starting, processing, then succeeded, failed, or canceled. Models, deployments, trainings, and files are the other core resources, and a state change can be pushed to a webhook. Lists are cursor-paginated.
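The lifecycle above can be sketched as a small polling helper that stops once a prediction reaches one of the three terminal states. It assumes the prediction object exposes its state in a status field; the fetch callable is a hypothetical stand-in for a GET on /v1/predictions/{prediction_id}.

```python
import time
from typing import Callable

# Terminal states named in this guide.
TERMINAL_STATES = {"succeeded", "failed", "canceled"}


def wait_for_prediction(fetch: Callable[[], dict], interval: float = 1.0,
                        max_polls: int = 60) -> dict:
    """Call fetch() until the prediction is terminal or max_polls is reached."""
    for _ in range(max_polls):
        prediction = fetch()
        if prediction["status"] in TERMINAL_STATES:
            return prediction
        time.sleep(interval)
    raise TimeoutError("prediction did not reach a terminal state")
```

Where a webhook is available it avoids this loop entirely; polling is a fallback for cases where the caller cannot receive inbound requests.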

Connect & authenticate

Connection & authentication methods.

How an app or AI agent connects to Replicate determines what it can reach. There is a route for making calls, a route for receiving events when a prediction changes state, and a hosted server that exposes Replicate operations to agents, and each is governed by the API token behind it.

Ways to connect

HTTP API

The HTTP API answers at https://api.replicate.com/v1. It takes JSON request bodies, returns JSON, and pages through lists with a cursor. Every call authenticates with an account API token sent as 'Authorization: Bearer <token>'.

Best for: Connecting an app or AI agent to Replicate.

Governed by: The API token, which carries the full access of its account.

Webhooks

Replicate POSTs the prediction or training object to an HTTPS URL named on the request when the job changes state, filtered by start, output, logs, and completed. To confirm the request came from Replicate, the receiver checks the webhook-id, webhook-timestamp, and webhook-signature headers by computing an HMAC-SHA256 over the signed content with the default endpoint's signing secret (whsec_...).

Best for: Receiving Replicate events at an app or AI agent.

Governed by: The signing secret on the default webhook endpoint.

MCP server

Replicate's official Model Context Protocol server exposes the operations of the HTTP API to AI agents and LLM clients, like searching and fetching models, running predictions and retrieving results, and managing deployments and webhooks. The remote server at mcp.replicate.com authenticates through a web flow where an account API key is provided for the server to use; a local npm package, replicate-mcp, runs with an API token set in the client. It stays current as the HTTP API adds features.

Best for: Connecting an AI agent to Replicate through MCP.

Governed by: The API token the server is given.

Authentication

API token

Replicate authenticates every call with an account API token sent as a Bearer token in the Authorization header. A token is account-level: it carries the full access of the user or organization it belongs to, with no per-endpoint or per-resource scopes to narrow it. The same token that reads a model can create predictions that cost money, delete a private model, or cancel a training. A token is created and revoked on the account's API tokens page, and an organization token can be tied to a service account.

Token: Bearer API token (r8_...)

Best for: Server-side calls with full account access.

Capability map

What an AI agent can do in Replicate.

The Replicate API is split into areas an agent can act on, like predictions, models, deployments, trainings, files, and the account. A Replicate API token carries the full access of the account it belongs to, so the same token that lists a model can also create predictions that cost money, delete a private model, or cancel a training.

Predictions

5 endpoints

Create a prediction to run a model, retrieve its state and output, list past predictions, and cancel a running one.

Creating a prediction runs a model and costs money.

Models

8 endpoints

Get and list models, create and update a model, run a model's official version, and manage its versions.

A write here changes or deletes real model data.

Deployments

4 endpoints

Create, read, update, and delete deployments, and run predictions against a deployment.

A write here changes how a deployment serves a model.

Trainings

4 endpoints

Start a training to fine-tune a model, retrieve its state, list past trainings, and cancel a running one.

Creating a training runs a job and costs money.

Files

3 endpoints

Upload a file to use as model input, retrieve a file's metadata, and list uploaded files.

Uploading stores a file on the account.

Account & hardware

2 endpoints

Read the authenticated account and list the hardware a model can run on.

Reads expose account and hardware details.

Webhooks

1 endpoint

Retrieve the signing secret used to verify that a webhook came from Replicate.

The signing secret confirms a webhook is genuine.

Endpoint reference

Every Replicate API method.

Method	Endpoint	What it does	Access	Permission	Version
Predictions (5 endpoints): Create a prediction to run a model, retrieve its state and output, list past predictions, and cancel a running one.
POST	`/v1/predictions`	Create a prediction to run a model version, optionally with a webhook for state changes.	write	`API token`	Current
Runs a model and bills the account. Takes a version and input, and an optional webhook plus webhook_events_filter (start, output, logs, completed). Acts on: prediction. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: `prediction.completed`. Rate limit: 600 requests per minute.
GET	`/v1/predictions/{prediction_id}`	Retrieve the current state and output of a prediction.	read	`API token`	Current
Status moves through starting, processing, then succeeded, failed, or canceled. Acts on: prediction. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/predictions`	List the authenticated account's predictions (cursor-paginated).	read	`API token`	Current
Read-only. Acts on: prediction. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
POST	`/v1/predictions/{prediction_id}/cancel`	Cancel a prediction that is still running.	write	`API token`	Current
Stops billing for any remaining run time. Acts on: prediction. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
POST	`/v1/models/{model_owner}/{model_name}/predictions`	Create a prediction using an official model's latest or pinned version.	write	`API token`	Current
Runs a model and bills the account, addressing the model by owner and name rather than a version id. Acts on: prediction. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: `prediction.completed`. Rate limit: 600 requests per minute.
Models (8 endpoints): Get and list models, create and update a model, run a model's official version, and manage its versions.
GET	`/v1/models/{model_owner}/{model_name}`	Get a model's details, including its latest version.	read	`API token`	Current
Read-only. Acts on: model. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/models`	List public models (cursor-paginated).	read	`API token`	Current
Read-only. Acts on: model. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
POST	`/v1/models`	Create a new model.	write	`API token`	Current
Sets owner, name, visibility (public or private), and the hardware it runs on. Acts on: model. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
PATCH	`/v1/models/{model_owner}/{model_name}`	Update a model's metadata.	write	`API token`	Current
Changes properties on a model the account owns. Acts on: model. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
DELETE	`/v1/models/{model_owner}/{model_name}`	Delete a private model that has no versions.	write	`API token`	Current
Only a private model with no published versions can be deleted. Acts on: model. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/models/{model_owner}/{model_name}/versions/{version_id}`	Get a specific version of a model, including its input and output schema.	read	`API token`	Current
Read-only. Acts on: model version. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/models/{model_owner}/{model_name}/versions`	List all versions of a model.	read	`API token`	Current
Read-only. Acts on: model version. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
DELETE	`/v1/models/{model_owner}/{model_name}/versions/{version_id}`	Delete a model version and its associated output files.	write	`API token`	Current
Deletes the version and the predictions and output files tied to it. Acts on: model version. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
Deployments (4 endpoints): Create, read, update, and delete deployments, and run predictions against a deployment.
POST	`/v1/deployments`	Create a deployment that serves a model version on chosen hardware.	write	`API token`	Current
Sets the model version, hardware, and the minimum and maximum number of running instances. Acts on: deployment. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/deployments/{deployment_owner}/{deployment_name}`	Get a deployment's details, including its current release.	read	`API token`	Current
Read-only. Acts on: deployment. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
PATCH	`/v1/deployments/{deployment_owner}/{deployment_name}`	Update a deployment, such as its version or instance count.	write	`API token`	Current
Changing instance counts affects running cost. Acts on: deployment. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
POST	`/v1/deployments/{deployment_owner}/{deployment_name}/predictions`	Create a prediction against a deployment.	write	`API token`	Current
Runs the deployment's model and bills the account. Acts on: prediction. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: `prediction.completed`. Rate limit: 600 requests per minute.
Trainings (4 endpoints): Start a training to fine-tune a model, retrieve its state, list past trainings, and cancel a running one.
POST	`/v1/models/{model_owner}/{model_name}/versions/{version_id}/trainings`	Start a training to fine-tune a model from a base version.	write	`API token`	Current
Runs a training job and bills the account, writing the result into a destination model. Acts on: training. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: `training.completed`. Rate limit: 3000 requests per minute.
GET	`/v1/trainings/{training_id}`	Retrieve the current state of a training.	read	`API token`	Current
Read-only. Acts on: training. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/trainings`	List the authenticated account's trainings (cursor-paginated).	read	`API token`	Current
Read-only. Acts on: training. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
POST	`/v1/trainings/{training_id}/cancel`	Cancel a training that is still running.	write	`API token`	Current
Stops billing for any remaining run time. Acts on: training. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
Files (3 endpoints): Upload a file to use as model input, retrieve a file's metadata, and list uploaded files.
POST	`/v1/files`	Upload a file to use as input to a model.	write	`API token`	Current
Stores a file on the account and returns a URL to reference as model input. Acts on: file. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/files/{file_id}`	Retrieve a file's metadata.	read	`API token`	Current
Read-only. Acts on: file. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/files`	List the files uploaded by the authenticated account.	read	`API token`	Current
Read-only. Acts on: file. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
Account & hardware (2 endpoints): Read the authenticated account and list the hardware a model can run on.
GET	`/v1/account`	Get the authenticated user or organization the token belongs to.	read	`API token`	Current
Read-only; confirms which account a token authenticates as. Acts on: account. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
GET	`/v1/hardware`	List the hardware a model can run on.	read	`API token`	Current
Read-only; each entry has a name and an SKU used when creating a model. Acts on: hardware. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.
Webhooks (1 endpoint): Retrieve the signing secret used to verify that a webhook came from Replicate.
GET	`/v1/webhooks/default/secret`	Get the signing secret for the default webhook endpoint, used to verify webhooks.	read	`API token`	Current
Returns a key prefixed with whsec_ used to check the webhook-signature header. Read-only. Acts on: webhook. Permission (capability): `API token`. Version: available since the API’s base version. Webhook event: none. Rate limit: 3000 requests per minute.

Webhooks

Webhook events.

Replicate can notify an app when a prediction or training changes state, like starting, producing output, or completing. It posts the prediction or training object to an HTTPS URL named on the request, so an integration learns when a long-running job finishes without polling.

Event	What it signals	Triggered by
`prediction.completed`	A prediction finished, reaching succeeded, failed, or canceled. The completed event in webhook_events_filter delivers the final prediction object, while start, output, and logs deliver earlier states.	`/v1/predictions` `/v1/models/{model_owner}/{model_name}/predictions` `/v1/deployments/{deployment_owner}/{deployment_name}/predictions`
`training.completed`	A training finished, reaching succeeded, failed, or canceled. A training is a kind of prediction, so the same webhook and webhook_events_filter options apply.	`/v1/models/{model_owner}/{model_name}/versions/{version_id}/trainings`

Rate limits & pagination

Rate limits, pagination & request size.

Replicate caps how fast an app can call with a per-minute request rate, with a separate, lower ceiling on creating predictions. Stricter limits apply as account credit runs low.

Request rate

Replicate meters requests by a per-minute rate, not by a point or quota cost. Creating predictions is capped at 600 requests per minute, and every other endpoint at 3000 requests per minute. Short bursts above these defaults are allowed before throttling begins, and the ceilings tighten as account credit runs low to prevent overspending. Going over returns HTTP 429 with a detail field that names when the limit resets, for example 'Request was throttled. Expected available in 30s.'
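One way to act on a 429 is to read the wait time out of the detail message. The sketch below assumes the message keeps the wording quoted above ('Expected available in 30s.'); the function name is hypothetical, and a caller should fall back to its own backoff when the pattern does not match.

```python
import re
from typing import Optional


def retry_delay_seconds(detail: str) -> Optional[int]:
    """Extract the wait in seconds from a throttling detail message, if present."""
    match = re.search(r"available in (\d+)\s*s", detail)
    return int(match.group(1)) if match else None


delay = retry_delay_seconds("Request was throttled. Expected available in 30s.")
```

Sleeping for the returned delay, plus a little jitter, before retrying keeps a burst of agents from hitting the limit again at the same instant.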

Pagination

List endpoints, like predictions, trainings, models, and files, are cursor-paginated. A response carries next and previous URLs. Follow the next URL to fetch each subsequent page until it is absent, rather than building the URL by hand.
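Cursor pagination of this kind can be sketched as a generator that keeps following the next URL. The sketch assumes each page is a JSON object with a results list alongside next; fetch_json is a hypothetical callable that performs the authenticated GET and returns the decoded body.

```python
from typing import Callable, Iterator, Optional


def iterate_pages(first_url: str,
                  fetch_json: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item across pages, following 'next' until it is absent."""
    url: Optional[str] = first_url
    while url:
        page = fetch_json(url)
        # 'results' is an assumed field name for the items on a page.
        yield from page.get("results", [])
        url = page.get("next")
```

Because the generator is lazy, a caller that only needs the first few items stops requesting pages as soon as it stops iterating.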

Request size

Responses are JSON. A file can be uploaded through the files endpoint to reference as model input rather than inlining large data, and output files and the predictions tied to a model version are removed when that version is deleted.

Errors

Status codes & error handling.

The status codes an agent should handle, and what to do about each.

Status	Code	Meaning	What to do
401	`Unauthorized`	No valid API token was provided, or it is invalid or revoked.	Send a valid token in the Authorization header as 'Bearer <token>', and rotate it if it has leaked.
402	`Payment Required`	The account cannot be billed for the request, for example when it has no payment method or has run out of credit.	Add or update the account's billing details, then retry.
404	`Not Found`	The requested object does not exist, or the token's account cannot see it.	Check the path, owner, name, or id, and confirm the token's account has access.
422	`Unprocessable Entity`	The request was well-formed but a field failed validation, such as model input that does not match the version's input schema.	Read the detail field, fix the named input, and resend.
429	`Too Many Requests`	The request was throttled because the request rate was exceeded. Creating predictions is capped lower than other endpoints, and limits tighten as account credit runs low. The body's detail field names when the limit resets, for example 'Request was throttled. Expected available in 30s.'	Back off and retry after the time named in the detail message, and smooth the request rate.
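The table above can be condensed into a small dispatch helper an agent consults after each response. The action labels and function names below are illustrative assumptions; only the status codes and their meanings come from the table.

```python
# Suggested next step per status code, paraphrasing the table above.
ACTIONS = {
    401: "replace_token",          # missing, invalid, or revoked token
    402: "fix_billing",            # account cannot be billed
    404: "check_path_and_access",  # wrong path or id, or not visible
    422: "fix_input",              # a field failed validation
    429: "back_off_and_retry",     # throttled
}


def next_step(status: int) -> str:
    """Map an HTTP status to a coarse action label."""
    if 200 <= status < 300:
        return "ok"
    return ACTIONS.get(status, "raise")


def is_retryable(status: int) -> bool:
    """Only throttling is safe to retry without changing the request."""
    return status == 429
```

Treating 402 and 422 as non-retryable avoids loops that resend a request which cannot succeed until billing or input is fixed.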

Versioning & freshness

Version history.

Replicate serves a single current version of its HTTP API and ships changes through a public changelog rather than minting new dated version strings. A model is versioned separately, and a prediction can pin the exact model version it runs.

Version history

What changed, and when

Latest version: v1

v1: Current version

Current HTTP API (single version)

Replicate serves one current version of its HTTP API at the /v1 path and ships changes through a public changelog rather than minting new dated version strings. Models are versioned separately from the API, and a prediction can pin the exact model version it runs. The entries below are notable dated changes from the changelog.

What changed

The HTTP API is a single, continuously updated version.
Models carry their own versions, each with an id and input and output schema.

2026-02-10: Feature update

MCP server auto-discovery

Replicate's MCP server became discoverable through the official MCP Registry, publishing metadata at a /.well-known/mcp/server.json endpoint following the server.json specification.

What changed

MCP server metadata published for auto-discovery via the MCP Registry.

The HTTP API is a single current version; pin the model version a prediction runs.

Replicate changelog ↗

Questions

Replicate API, answered.

How does authentication work, and does Replicate use OAuth?

Every request carries an account API token in the Authorization header as 'Bearer <token>'. Replicate does not use OAuth for first-party API calls. A token is created and revoked on the account's API tokens page, and it authenticates as the user or organization it belongs to. The remote MCP server adds a web-based flow where an account API key is provided for the server to use on the account's behalf.

Can I limit what a token can do, by endpoint or resource?

No. A Replicate token is account-level and has no per-endpoint or per-resource scopes. It carries the full access of its account, so the same token that reads a model can also create billable predictions, delete a private model, or cancel a training. Narrowing access has to come from whatever sits in front of the API, such as a gateway, rather than from the token.

What are the rate limits?

Creating predictions is limited to 600 requests per minute, and every other endpoint to 3000 requests per minute. Short bursts above these defaults are allowed before throttling, and the ceilings tighten as account credit runs low. Going over returns HTTP 429 with a detail field saying when the limit resets, like 'Request was throttled. Expected available in 30s.'

How do I receive a result instead of polling?

Name an HTTPS webhook URL on the create-prediction or create-training request, and choose which states to receive with webhook_events_filter from start, output, logs, and completed. Replicate POSTs the prediction object as those states occur. Server-sent events are an alternative for streaming output without polling.
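A create-prediction request body with a webhook might be assembled as in the sketch below. The field names version, input, webhook, and webhook_events_filter are the ones this guide lists for POST /v1/predictions; the helper name, the validation rules, and the example values are assumptions for illustration.

```python
from typing import Iterable, Optional

# Event names accepted by webhook_events_filter, per this guide.
VALID_EVENTS = {"start", "output", "logs", "completed"}


def prediction_body(version: str, model_input: dict,
                    webhook: Optional[str] = None,
                    events: Iterable[str] = ("completed",)) -> dict:
    """Build the JSON body for creating a prediction, optionally with a webhook."""
    body = {"version": version, "input": model_input}
    if webhook is not None:
        if not webhook.startswith("https://"):
            raise ValueError("webhook must be an HTTPS URL")
        event_list = list(events)
        unknown = set(event_list) - VALID_EVENTS
        if unknown:
            raise ValueError(f"unknown webhook events: {sorted(unknown)}")
        body["webhook"] = webhook
        body["webhook_events_filter"] = event_list
    return body


# Placeholder version id, input, and receiver URL.
example = prediction_body("example-version-id", {"prompt": "a lighthouse"},
                          webhook="https://example.com/hooks/replicate")
```

Filtering to completed alone keeps traffic low when only the final result matters; adding output or logs trades more deliveries for earlier visibility.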

How do I verify a webhook really came from Replicate?

Fetch the default endpoint's signing secret from GET /v1/webhooks/default/secret, which returns a key prefixed with whsec_. Each webhook carries webhook-id, webhook-timestamp, and webhook-signature headers. Concatenate the id, timestamp, and raw body, separated by periods, compute an HMAC-SHA256 with the secret, and compare it in constant time against the signature in the header. Also check that the timestamp is recent to block replays.
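The verification steps above can be sketched with Python's standard library. Several details here are assumptions to confirm against Replicate's webhook documentation: that the key is the base64-decoded portion of the secret after whsec_, that the signed content is the id, timestamp, and body joined by periods, that the header holds one or more space-separated entries of the form v1,<base64 signature>, and that a five-minute tolerance is acceptable.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(secret: str, webhook_id: str, timestamp: str, body: bytes,
                   signature_header: str, tolerance: int = 300,
                   now: Optional[float] = None) -> bool:
    """Return True if the signature matches and the timestamp is recent."""
    current = time.time() if now is None else now
    try:
        if abs(current - int(timestamp)) > tolerance:
            return False  # too old or too far in the future: possible replay
    except ValueError:
        return False
    # Assumed: the key is the base64 payload after the whsec_ prefix.
    raw = secret[len("whsec_"):] if secret.startswith("whsec_") else secret
    key = base64.b64decode(raw)
    signed = f"{webhook_id}.{timestamp}.".encode("utf-8") + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest())
    # Assumed header format: space-separated "v1,<signature>" entries.
    candidates = [part.split(",", 1)[-1] for part in signature_header.split()]
    return any(hmac.compare_digest(expected, c.encode("utf-8")) for c in candidates)
```

The body must be the raw bytes as received; re-serializing parsed JSON can change whitespace or key order and break the comparison.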

How does versioning work, for the API and for models?

The HTTP API is a single current version, and changes are published in the changelog rather than as new dated version strings. Models are versioned independently: each push creates a new model version with its own id and input and output schema. A prediction can pin the exact model version it runs, so a fixed version keeps producing consistent results.

What does it cost to run a prediction?

A prediction bills the account for the time the model runs on its hardware, so creating a prediction or a training is a billable action, not a free read. Public models are billed per run, and a deployment or a private model is billed for the hardware it uses. A 402 response means the account cannot be billed, for example with no payment method or no credit.

What is Bollard AI?

Control what every AI agent can do in Replicate.

Bollard AI sits between a team's AI agents and Replicate. Grant each agent exactly the access it needs, read or write, action by action, and every call is checked and logged.

Allow running predictions while blocking model and deployment changes, without handing agents a shared Replicate token.
Denied by default, so an agent reaches only what has been explicitly allowed.
Every call recorded in plain English: who, what, where, and the decision.

