Google AI API (Gemini)

Google's Gemini API covers both Google AI Studio (developer) and Vertex AI (enterprise GCP) entry points, with the largest context window of any commercial model and competitive pricing for high-volume workloads.

Google's LLM API surface spans two entry points: Google AI Studio (direct API, developer-friendly) and Vertex AI (enterprise, GCP-integrated). Both serve the Gemini model family. 650M monthly active Gemini users as of April 2026.


Models (April 2026)

ModelContextStrengthPricing (in/out per M)
gemini-2.5-pro1MBest reasoning, long context$1.25 / $10
gemini-2.5-flash1MFast, cheap, thinking mode$0.15 / $0.60
gemini-2.0-flash1MPrevious gen, stable$0.10 / $0.40
gemini-1.5-pro2MLargest context window$1.25 / $5
text-embedding-0042KEmbeddings$0.025 / —

Gemini 2.5 Pro has the highest context window of any commercial model at 2M tokens (Gemini 1.5 Pro). Strong on coding and reasoning; competitive with Claude Opus and o3.


Setup

pip install google-generativeai          # Google AI Studio SDK
pip install google-cloud-aiplatform      # Vertex AI SDK

Get an API key from Google AI Studio (aistudio.google.com).


Google AI Studio SDK

import google.generativeai as genai

genai.configure(api_key="GOOGLE_API_KEY")  # or GOOGLE_API_KEY env var

model = genai.GenerativeModel(
    model_name="gemini-2.5-pro",
    system_instruction="You are a helpful assistant.",
)

# Simple generation
response = model.generate_content("Explain attention mechanisms.")
print(response.text)

# With generation config
response = model.generate_content(
    "Write a haiku about transformers.",
    generation_config=genai.GenerationConfig(
        temperature=0.9,
        max_output_tokens=100,
    ),
)

Streaming

for chunk in model.generate_content("Tell me a story.", stream=True):
    print(chunk.text, end="", flush=True)

Multi-turn Chat

chat = model.start_chat(history=[])

response = chat.send_message("What is RAG?")
print(response.text)

response = chat.send_message("How does it compare to fine-tuning?")
print(response.text)

# Access history
for message in chat.history:
    print(f"{message.role}: {message.parts[0].text[:100]}")

Vision and Multimodal

import PIL.Image

model = genai.GenerativeModel("gemini-2.5-pro")

# Image from file
image = PIL.Image.open("diagram.png")
response = model.generate_content(["Explain this architecture diagram:", image])

# Image from URL
response = model.generate_content([
    "What's in this image?",
    {"mime_type": "image/jpeg", "data": base64_image_bytes},
])

# PDF analysis (Gemini handles PDFs natively)
with open("contract.pdf", "rb") as f:
    pdf_data = f.read()

response = model.generate_content([
    {"mime_type": "application/pdf", "data": pdf_data},
    "Summarise the key terms and obligations in this contract.",
])

Function Calling

def get_stock_price(ticker: str) -> dict:
    """Get current stock price."""
    return {"ticker": ticker, "price": 185.42, "currency": "USD"}

# Define tool
get_stock_tool = genai.protos.Tool(
    function_declarations=[
        genai.protos.FunctionDeclaration(
            name="get_stock_price",
            description="Get the current stock price for a given ticker symbol.",
            parameters=genai.protos.Schema(
                type=genai.protos.Type.OBJECT,
                properties={
                    "ticker": genai.protos.Schema(
                        type=genai.protos.Type.STRING,
                        description="Stock ticker symbol, e.g. AAPL, GOOGL",
                    )
                },
                required=["ticker"],
            ),
        )
    ]
)

model = genai.GenerativeModel("gemini-2.5-pro", tools=[get_stock_tool])
response = model.generate_content("What's Apple's current stock price?")

# Handle tool call
if response.candidates[0].content.parts[0].function_call:
    fc = response.candidates[0].content.parts[0].function_call
    result = get_stock_price(**dict(fc.args))
    
    # Send result back
    response = model.generate_content([
        "What's Apple's stock price?",
        response.candidates[0].content,
        genai.protos.Content(
            parts=[genai.protos.Part(
                function_response=genai.protos.FunctionResponse(
                    name=fc.name, response=result
                )
            )],
            role="function",
        ),
    ])

Structured Output (JSON Mode)

import json

model = genai.GenerativeModel(
    "gemini-2.5-flash",
    generation_config={"response_mime_type": "application/json"},
)

response = model.generate_content(
    "List the top 3 Python web frameworks with a brief description each. Return as JSON array."
)
data = json.loads(response.text)

Embeddings

result = genai.embed_content(
    model="models/text-embedding-004",
    content="What is the capital of France?",
    task_type="retrieval_query",  # or "retrieval_document", "semantic_similarity"
)
embedding = result["embedding"]  # list of 768 floats

Task types affect the embedding. Use retrieval_query for queries and retrieval_document for documents being indexed.


Vertex AI (Enterprise)

Vertex AI adds: IAM/VPC security, regional data residency, enterprise SLAs, audit logging, private endpoints.

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel("gemini-2.5-pro")
response = model.generate_content("Explain quantum computing.")
print(response.text)

Vertex AI uses Google Cloud credentials (Application Default Credentials) rather than API keys:

gcloud auth application-default login

LangChain Integration

from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",
    google_api_key="GOOGLE_API_KEY",
    temperature=0.7,
)

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
    google_api_key="GOOGLE_API_KEY",
)

Thinking Mode (Gemini 2.5)

Gemini 2.5 Pro/Flash support a thinking mode similar to Claude's extended thinking:

model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content(
    "Prove that there are infinitely many prime numbers.",
    generation_config=genai.GenerationConfig(
        thinking_config=genai.ThinkingConfig(thinking_budget=5000)
    ),
)
# response includes thinking steps + final answer

Google AI vs Anthropic vs OpenAI

FeatureGemini 2.5 ProClaude Sonnet 4.6GPT-4o
Max context2M tokens1M tokens128K tokens
Pricing (input)$1.25/M$3/M$2.50/M
Code (SWE-bench)~70%+79.6%~73%
Google Workspace integrationNativeNoneNone
MultimodalStrongStrongStrong

Gemini is the natural choice for teams already on GCP or deeply integrated with Google Workspace.


Key Facts

  • Gemini 2.5 Pro context window: 1M tokens; Gemini 1.5 Pro: 2M tokens (largest commercial)
  • 650M monthly active Gemini users as of April 2026
  • Gemini 2.5 Pro pricing: $1.25/M input, $10/M output
  • text-embedding-004 output: 768 floats, $0.025/M tokens
  • Vertex AI adds IAM/VPC, regional data residency, audit logging, and private endpoints
  • task_type matters for embeddings: retrieval_query vs retrieval_document produce different vectors
  • Thinking mode configured via ThinkingConfig(thinking_budget=N) in generation config

Common Failure Cases

generate_content raises BlockedPromptException with no useful message
Why: Gemini's safety filters block the request; the error message often doesn't specify which filter triggered.
Detect: google.generativeai.types.BlockedPromptException; response.prompt_feedback.block_reason contains the actual reason.
Fix: check response.prompt_feedback.safety_ratings to identify the triggering category; adjust content or use SafetySettings to lower thresholds for the relevant category.

Vertex AI ADC credentials fail in GitHub Actions
Why: Application Default Credentials require gcloud auth application-default login which is interactive; GitHub Actions can't run it.
Detect: google.auth.exceptions.DefaultCredentialsError in CI; the workflow works locally but not in Actions.
Fix: use Workload Identity Federation (OIDC) in Actions; or set GOOGLE_APPLICATION_CREDENTIALS to a service account JSON key file stored in GitHub Secrets.

Thinking mode significantly exceeds expected token budget
Why: thinking_budget=5000 sets a maximum, not a minimum; complex reasoning problems may use the full budget, multiplying cost unexpectedly.
Detect: response.usage_metadata.thoughts_token_count is consistently near the budget limit; cost per call is higher than expected.
Fix: lower thinking_budget for simpler queries; use a tiered approach — start with Flash (no thinking), escalate to Pro with thinking only for hard problems.

Function calling loop fails because Gemini passes None for optional parameters
Why: Gemini includes optional parameters with null value in the function call payload; the Python function receiving None for a required-looking param raises an error.
Detect: tool function raises TypeError: expected str, got None; the function call in the trace shows null for an optional argument.
Fix: use Optional[str] = None type hints and handle None explicitly in the function body; or set "nullable": True in the schema.

text-embedding-004 task_type mismatch degrades retrieval quality by 5-10%
Why: using retrieval_document type for both indexing and querying, or omitting task_type, bypasses Google's asymmetric embedding training.
Detect: retrieval quality doesn't match Google's benchmarks; swap types and compare recall on a test set.
Fix: always use task_type="retrieval_query" for user queries and task_type="retrieval_document" for documents being indexed.

Connections

Open Questions

  • How does Gemini 2M context quality degrade on retrieval tasks at full window utilisation?
  • What is the practical cost difference between Vertex AI and AI Studio for production workloads?
  • How does Gemini thinking mode quality compare to Claude extended thinking on complex reasoning benchmarks?