Audio and Voice AI
ASR (Whisper/Deepgram) → LLM → TTS (ElevenLabs/OpenAI) voice pipeline, targeting 300-700 ms response latency by streaming at sentence boundaries and picking the fastest viable model at each stage.
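The sentence-boundary trick can be sketched in a few lines: buffer the LLM's streamed chunks and flush each sentence to TTS the moment it completes, rather than waiting for the full response. This is an illustrative sketch; the function name and the regex heuristic are assumptions, not any library's API.

```python
import re

# Flush on ., !, or ? followed by whitespace or end-of-buffer.
SENTENCE_END = re.compile(r"([.!?])(\s|$)")

def stream_sentences(chunks):
    """Yield complete sentences from a stream of text chunks."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            m = SENTENCE_END.search(buf)
            if not m:
                break
            end = m.end(1)            # cut right after the punctuation
            yield buf[:end].strip()
            buf = buf[end:]
    if buf.strip():                   # flush any trailing partial sentence
        yield buf.strip()

# Each yielded sentence would be handed to the TTS API immediately,
# so audio playback starts after the first sentence, not the last.
tokens = ["Hel", "lo there. How ", "are you today? I", "'m fine."]
sentences = list(stream_sentences(tokens))
# sentences -> ["Hello there.", "How are you today?", "I'm fine."]
```

In practice this is what brings perceived latency into the few-hundred-millisecond range: the first sentence is usually ready long before the LLM finishes generating.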
Document Processing with AI
Claude is the strongest model for document understanding — send PDFs directly via the API or Files API; at pipeline scale, pair pymupdf/unstructured for extraction with Claude for interpretation.
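Sending a PDF directly means base64-encoding it into a document content block in the Messages request. A minimal sketch of the request body, assuming Anthropic's document-block shape; the model id and the PDF bytes here are placeholders:

```python
import base64
import json

# Placeholder bytes; in a real call, read the file from disk.
pdf_bytes = b"%PDF-1.4 ... (real file contents here)"

payload = {
    "model": "claude-sonnet-4-20250514",   # placeholder model id
    "max_tokens": 1024,
    "messages": [{
        "role": "user",
        "content": [
            # The PDF goes in as a base64 document block...
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.b64encode(pdf_bytes).decode()}},
            # ...followed by the instruction about it.
            {"type": "text",
             "text": "Summarize the key findings in this document."},
        ],
    }],
}
body = json.dumps(payload)  # POST this to the Messages endpoint
```

The Files API variant swaps the base64 source for a file id, which avoids re-uploading the same document on every request.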
Image Generation Models
FLUX.1 has displaced Stable Diffusion as the open-source default; DALL-E 3 leads on text-prompt adherence; all major image generation models are accessible via API, so no local GPU is needed.
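The API path is a single JSON POST. As one example, a sketch of a request body for OpenAI's image generation endpoint (`/v1/images/generations`); the prompt is illustrative, and the real call also needs an API key header:

```python
import json

request = {
    "model": "dall-e-3",
    "prompt": "A street sign that reads 'OPEN 24 HOURS', photorealistic",
    "size": "1024x1024",
    "n": 1,                     # dall-e-3 accepts only one image per request
    "response_format": "url",   # or "b64_json" to get the bytes back inline
}
body = json.dumps(request)      # POST to https://api.openai.com/v1/images/generations
```

Hosted FLUX.1 endpoints (e.g. via inference providers) follow the same pattern: prompt and size in, an image URL or base64 payload out.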
Video AI
Gemini 1.5 Pro / 2.0 Flash are the frontier for video understanding — 1M tokens of context fits roughly an hour of video, with feature-length films reachable at 2M; video generation (Sora, Veo, Runway) is improving fast but still unreliable for complex motion.
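The video-length limit falls out of simple token arithmetic. A back-of-envelope sketch, assuming ~1 frame/second sampling at roughly 258 tokens per frame plus ~32 audio tokens per second — these rates are approximate and model-dependent:

```python
TOKENS_PER_SECOND = 258 + 32   # ~frame tokens + ~audio tokens (rough assumption)

def max_video_minutes(context_tokens, prompt_budget=10_000):
    """How many minutes of video fit, leaving room for the prompt and answer."""
    usable = context_tokens - prompt_budget
    return usable / TOKENS_PER_SECOND / 60

one_m = max_video_minutes(1_000_000)   # roughly an hour at 1M context
two_m = max_video_minutes(2_000_000)   # roughly two hours at 2M context
```

This is why "full-length film" workloads want the 2M-token tier: a 100-minute film alone consumes well over a million tokens before the prompt is counted.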
Vision and Multimodal AI
VLM architecture (ViT encoder → projection → LLM), Claude's document-understanding strength, multimodal RAG with ColPali, and image generation model comparison.
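The ViT → projection → LLM wiring is easiest to see at the level of tensor shapes: patch embeddings come out of the vision encoder, a learned projection maps them into the LLM's embedding space, and the result is prepended to the text embeddings. A shape-only sketch; all dimensions and the random "weights" are illustrative, not any particular model's:

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, vit_dim, llm_dim = 256, 1024, 4096   # e.g. a 16x16 patch grid

patch_embeds = rng.normal(size=(num_patches, vit_dim))   # ViT encoder output
W_proj = rng.normal(size=(vit_dim, llm_dim))             # learned projection

image_tokens = patch_embeds @ W_proj                     # now in LLM space
text_tokens = rng.normal(size=(12, llm_dim))             # embedded prompt

# The LLM sees image tokens as just more sequence positions.
llm_input = np.concatenate([image_tokens, text_tokens])  # (268, 4096)
```

Real models differ mainly in the middle box — a single linear layer, an MLP, or a cross-attention resampler — but the shape contract (patches in, LLM-dimension tokens out) is the same, and it is also what ColPali exploits for multimodal retrieval.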