Everything an AI agent can do with the Google Gemini API.

A reference guide for building AI agents: every method, how to authenticate, and the permissions each one needs.

Endpoints26
API versionv1beta
Last updated23 June 2026
Orientation

How the Google Gemini API works.

The Google Gemini API is how an app or AI agent works with Google's Gemini models: generating text and images from a prompt, turning text into embeddings for search, uploading files for a model to read, caching context to reuse, and fine-tuning a model. Access is granted through an account-wide API key from Google AI Studio, which has no per-endpoint permissions, so any call the key can reach, it can make. The model is the versioned, dated thing rather than the API path, and the API can notify an endpoint when asynchronous batch work finishes.

26Endpoints
7Capability groups
11Read
15Write
0Permissions
Authentication
Gemini authenticates calls with an API key created in Google AI Studio, sent in the x-goog-api-key header or as a query parameter. The key is account-wide and is the credential behind nearly every call. Some operations, like calling a tuned model, can instead use an OAuth 2.0 access token tied to a Google account. This is the Gemini Developer API on generativelanguage.googleapis.com, which is separate from Gemini on Google Cloud Vertex AI, where the same models are reached through Google Cloud authentication and project-level controls.
Permissions
The API key has no granular per-endpoint permissions. It is not scoped to specific methods or resources, so any call the key can reach, it can make, from generating content to uploading files, caching context, and creating or deleting tuned models. There is no built-in way on the key itself to allow reads but block writes, or to limit it to one area. This is the gap Bollard fills, by deciding per agent which methods a key is allowed to call. Vertex AI, by contrast, layers Google Cloud Identity and Access Management roles on top, which the Developer API key does not have.
Versioning
The API exposes a stable surface under v1 and a broader preview surface under v1beta that carries newer features first, and the SDKs default to v1beta. There is no dated version string on the API path itself. Instead the model is the versioned, dated thing: each model has a name and version, models are promoted from preview to general availability and later retired, and release notes track those model changes by date.
Data model
Gemini is resource-oriented JSON over HTTPS at generativelanguage.googleapis.com. Generation calls address a model by name, like /v1beta/models/{model}:generateContent, while files, cached content, batches, and tuned models are stored resources at /v1beta/files, /v1beta/cachedContents, /v1beta/batches, and /v1beta/tunedModels. The Live API runs over a separate WebSocket connection for real-time audio and video, and there is no general webhook for ordinary generation, only completion notifications for asynchronous batch and long-running work.
Connect & authenticate

Connection & authentication methods.

How an app or AI agent connects to Gemini determines what it can reach. There is a route for making calls, a real-time route for live audio and video, and a hosted documentation server, and each is governed by the API key behind it.

Ways to connect

REST API

The REST API answers at https://generativelanguage.googleapis.com, with the stable surface under /v1 and the broader preview surface under /v1beta. A call authenticates with an API key from Google AI Studio, sent in the x-goog-api-key header or as a query parameter.

Best forConnecting an app or AI agent to Gemini.
Governed byThe API key behind the call.
Docs ↗

Live API (WebSocket)

The Live API uses a bidirectional WebSocket connection for real-time, low-latency interaction, streaming audio and video in and audio and text out. It authenticates with the same API key and suits voice and live multimodal agents rather than single request-and-response calls.

Best forReal-time voice and live multimodal agents.
Governed byThe API key behind the connection.
Docs ↗

MCP server (documentation)

Google hosts a public Model Context Protocol server at https://gemini-api-docs-mcp.dev that exposes a search_documentation function, so an agent can pull current Gemini API definitions and patterns into its context. It serves documentation lookup, not calls against the generative API itself, which are made over REST or the Live API.

Best forGiving a coding agent current Gemini API documentation.
Governed byPublic documentation access; it makes no account calls.
Docs ↗
Authentication

API key

A Gemini API key from Google AI Studio authenticates every call, sent in the x-goog-api-key header or as a query parameter. The key is account-wide and carries no per-endpoint scopes, so any call it can reach, it can make. A key must never be exposed in client code.

TokenAPI key (x-goog-api-key)
Best forServer-side access to the Gemini Developer API
Docs ↗

OAuth 2.0

Some operations, like calling a tuned model, can use OAuth 2.0 access tokens tied to a Google account rather than a plain API key. OAuth suits flows that act on behalf of a user, while most generation work uses an API key.

TokenOAuth 2.0 access token
Best forActing on behalf of a user, such as tuned-model access
Docs ↗
Capability map

What an AI agent can do with Google Gemini.

The Gemini API is split into areas an agent can act on, like generating content, creating embeddings, counting tokens, uploading files, caching context, running batches, and tuning models. Each area has its own methods, and some create lasting resources or spend against the account's quota.

Endpoint reference

Every Google Gemini API method.

Filter by method, access, or permission, or search any path. Select a row for version detail, rate limits, the related webhook event, and the source.

MethodEndpointWhat it doesAccessPermissionVersion

Models & content

Generate content from a model, stream it back as it is produced, and list or read the available models.4

The API key carries no per-method scope. Marked as a write because the call consumes token quota and can run tools, though it creates no stored resource.

Acts onmodel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Takes the same request as generateContent and returns a stream of partial responses. The API key carries no per-method scope.

Acts onmodel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts onmodel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts onmodel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Embeddings

Turn text into embedding vectors, one input at a time or in a batch, for search and similarity work.2

Marked as a write because it consumes token quota; it creates no stored resource. The API key carries no per-method scope.

Acts onembedding
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Each input in the batch is an EmbedContentRequest. The API key carries no per-method scope.

Acts onembedding
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Token counting

Run a model's tokenizer over input to count tokens before sending a generation call.1

A read-only utility used to size a prompt before a generation call. The API key carries no per-method scope.

Acts onmodel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Files

Upload files for a model to read, then list, read, or delete them. Files are held for 48 hours.4

Uses a separate upload host. Stored files are held for 48 hours, up to 2 GB per file and 20 GB per project. The API key carries no per-method scope.

Acts onfile
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts onfile
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts onfile
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Irreversible. The API key carries no per-method scope.

Acts onfile
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Context caching

Save precomputed input tokens as cached content and reuse them across calls, then list, read, update, or delete the cache.5

A time-to-live is set with the ttl field or an expireTime. The API key carries no per-method scope.

Acts oncachedContent
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts oncachedContent
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts oncachedContent
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

The API key carries no per-method scope.

Acts oncachedContent
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Irreversible. The API key carries no per-method scope.

Acts oncachedContent
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Batch

Submit many generation requests as one asynchronous job at half the standard cost, then read, list, or cancel it.5

Targets a 24-hour turnaround and can notify a registered endpoint on completion. The API key carries no per-method scope.

Acts onbatch
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventbatch-completed
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts onbatch
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts onbatch
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

The API key carries no per-method scope.

Acts onbatch
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

The API key carries no per-method scope.

Acts onbatch
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Tuned models

Create a fine-tuned model from training data, then list, read, generate from, or delete it.5

Starts a long-running tuning job that produces a lasting tuned model. The API key carries no per-method scope.

Acts ontunedModel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts ontunedModel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only. The API key carries no per-method scope.

Acts ontunedModel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Calling a tuned model can need proper authentication beyond a plain API key. The key carries no per-method scope.

Acts ontunedModel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Irreversible. The API key carries no per-method scope.

Acts ontunedModel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply
No endpoints match those filters.
Webhooks

Webhook events.

Gemini has no general webhook system for content generation, so an app reads results back from the call it made. The exception is asynchronous work, like a batch job or a long-running operation, which can notify a registered endpoint when it finishes.

EventWhat it signalsTriggered by
Batch / long-running operation completionFires when an asynchronous batch job or long-running operation finishes, so an integration learns the result is ready without polling. This event-driven notification was introduced for the Batch API and long-running operations in May 2026./v1beta/{batch.model=models/*}:batchGenerateContent
No events match that search.
Rate limits & pagination

Rate limits, pagination & request size.

Gemini limits how fast and how much an app or AI agent can call, through per-model ceilings on requests per minute, tokens per minute, and requests per day, with the ceilings rising as the billing account moves up a usage tier.

Request rate

Gemini sets per-model limits across three dimensions at once: requests per minute, tokens per minute, and requests per day. Usage is checked against each, and exceeding any one returns HTTP 429 with the status RESOURCE_EXHAUSTED. The ceilings rise as the billing account moves up a usage tier, decided by cumulative spend: a Free tier, then Tier 1 once billing is set up, Tier 2 after 100 US dollars of spend and 30 days, and Tier 3 after 1,000 US dollars and 30 days. Higher tiers and newer or larger models carry higher ceilings, and a separate, much larger token allowance applies to enqueued batch work, for example several million enqueued tokens for a flash model at Tier 1 rising into the billions at Tier 3.

Pagination

List endpoints page through results with a pageSize parameter that caps the page and a pageToken parameter that requests the next page. A response returns a nextPageToken when more results remain, and an empty nextPageToken means the last page has been reached.

Request size

Limits are set by the model and the resource rather than one global cap. The Files API holds up to 20 GB per project, with a per-file maximum of 2 GB and files retained for 48 hours; uploaded PDFs are capped at 50 MB. Each model has its own input and output token limits, readable on the model resource, and context caching has its own minimum token count to be eligible.

Errors

Status codes & error handling.

The status codes an agent should handle, and what to do about each.

StatusCodeMeaningWhat to do
400INVALID_ARGUMENTThe request body is malformed, with a typo, a missing required field, or an invalid value.Check the request against the API reference, correct the named field, and resend.
400FAILED_PRECONDITIONThe free tier is not available in the caller's region, or billing is not enabled for a request that needs it.Enable a paid plan or billing on the project in Google AI Studio.
403PERMISSION_DENIEDThe API key lacks permission for the request, often a wrong key or missing authentication for a tuned model.Confirm the right key is sent and use the correct authentication for the resource.
404NOT_FOUNDThe requested resource was not found, such as a model, file, or tuned model that does not exist for this key or version.Check the resource name and the API version in the path, then retry.
429RESOURCE_EXHAUSTEDA rate limit was exceeded for the model, on requests per minute, tokens per minute, or requests per day.Back off and retry, smooth the request rate, or request a quota increase or a higher usage tier.
500INTERNALAn unexpected error on Google's side, sometimes triggered by an unusually long input context.Retry with backoff, and reduce the input context if the error persists.
503UNAVAILABLEThe service is temporarily overloaded or down.Retry with backoff, or switch to another model for the moment.
504DEADLINE_EXCEEDEDThe request could not finish within the deadline, often because the prompt or context is too large.Raise the client timeout, or reduce the prompt and context size.
Versioning & freshness

Version history.

Gemini exposes a stable v1 surface and a broader v1beta surface that carries preview features first, and the model itself is the thing that is versioned and dated, not the API path.

Version history

What changed, and when

Latest versionv1beta
v1betaCurrent version
Preview surface (SDK default)

The API exposes a stable v1 surface and a broader v1beta preview surface that carries newer features first, and the SDKs default to v1beta. The API path itself has no dated version string; the model is the versioned, dated thing, promoted from preview to general availability and later retired. The dated entries below are notable model and platform changes from the Gemini API release notes.

What changed
  • v1beta exposes preview features ahead of the stable v1 surface
  • Both v1 and v1beta support the Interactions API as of June 2026
2026-06-17Feature update
Streaming speech generation

Streaming support was added for the text-to-speech preview model.

What changed
  • Streaming output for the text-to-speech preview model
2026-05-28Feature update
Native visual models reach general availability

The native visual models known as Nano Banana 2 and Pro were released as generally available versions.

What changed
  • Native visual models released as GA
2026-05-19Feature update
Gemini 3.5 Flash GA and managed agents preview

Gemini 3.5 Flash was released as generally available, and managed agents launched in public preview.

What changed
  • Gemini 3.5 Flash reached general availability
  • Managed agents launched in public preview
2026-05-04Feature update
Event-driven webhooks for asynchronous work

Event-driven webhook support was introduced for the Batch API and long-running operations, so an integration can be notified on completion rather than polling.

What changed
  • Webhook completion notifications added for the Batch API and long-running operations
2026-04-22Feature update
Embedding model 2 reaches general availability

The second-generation embedding model was released as generally available.

What changed
  • Embedding model 2 reached general availability
2026-04-01Feature update
Flex and Priority inference tiers

New Flex and Priority inference tiers were introduced to trade off cost against latency.

What changed
  • Flex and Priority inference tiers introduced

An integration can target the stable surface or opt into the preview surface for newer features.

Gemini API release notes ↗
Questions

Google Gemini API, answered.

How do I authenticate, and where does the API key go?+
A call authenticates with an API key created in Google AI Studio. The key is sent in the x-goog-api-key request header, or as a key query parameter on the URL. It is account-wide, so it must stay on a server and never appear in client code, where anyone could read and reuse it. A few operations, like calling a tuned model, can use an OAuth 2.0 access token instead.
Can an API key be limited to read-only, or to one part of the API?+
Not on the key itself. A Gemini API key carries no per-endpoint scopes, so it cannot be set to allow reads but block writes, or to reach only one area like files or tuning. Any method the key can call, it can call. Limiting an agent to a subset of methods is exactly what a gateway like Bollard adds in front of the key. The Vertex AI route does support Google Cloud roles, but the Developer API key does not.
What is the difference between the Gemini Developer API and Vertex AI?+
Both serve the same Gemini models, but through different front doors. The Gemini Developer API on generativelanguage.googleapis.com uses a simple API key and is quick to start with. Vertex AI is part of Google Cloud, reached through Google Cloud authentication, with project-level Identity and Access Management roles, regional controls, and enterprise billing. Teams already on Google Cloud or needing those controls tend to use Vertex AI; this page covers the Developer API.
How do the rate limits work?+
Each model has limits on requests per minute, tokens per minute, and requests per day, and usage is checked against all three at once, so exceeding any one returns HTTP 429 with the status RESOURCE_EXHAUSTED. The ceilings rise with the account's usage tier, which is based on cumulative spend, from a Free tier up through Tier 1, Tier 2, and Tier 3. The fix for a 429 is to back off and retry, smooth the call rate, or request a quota increase.
Does Gemini send webhooks?+
Not for ordinary content generation, where the result comes back on the call that was made. The exception is asynchronous work: a batch job or a long-running operation can notify a registered endpoint when it completes, support that was added in May 2026, so an integration learns the result is ready without polling for it.
How long are uploaded files kept, and how big can they be?+
Files uploaded through the Files API are held for 48 hours, then deleted automatically, and can also be deleted sooner. A project can store up to 20 GB at once, with a maximum of 2 GB per file; uploaded PDFs are capped at 50 MB. While a file exists, any call using the same key can reference it.
What is context caching and how much can it save?+
Context caching saves precomputed input tokens so they can be reused across calls, for example asking several questions about the same large document or media file. Instead of resending and reprocessing that content each time, a call points at the cached content. Each cache carries a time-to-live, set with the ttl field or an expireTime, and expires or can be deleted when it is no longer needed.
Related

More ai API guides for agents

What is Bollard AI?

Control what every AI agent can do with Gemini.

Bollard AI sits between a team's AI agents and Google Gemini. Grant each agent exactly the access it needs, read or write, area by area, and every call is checked and logged.

  • Set read, write, or full access per agent, never a shared Gemini key.
  • Denied by default, so an agent reaches only what has been explicitly allowed.
  • Every call recorded in plain English: who, what, where, and the decision.
Google Gemini
Drafting Agent
Generate text and images ActionOffReadFull use
Read uploaded files ResourceOffReadFull use
Fine-tune models ActionOffReadFull use
Per-agent access, set in Bollard AI, not in Gemini