Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.

Vision-Agents is an open-source framework by Stream for building real-time AI agents that process video and audio with ultra-low latency. The project enables developers to create intelligent applications combining computer vision, language models, and speech processing. It uses WebRTC-based direct streaming to LLM providers for live video analysis, with a frame processor pipeline as a fallback for providers lacking WebRTC capability. The framework achieves sub-30ms audio/video latency via Stream's edge network.

Core conversational capabilities include turn detection, speaker diarization, voice activity detection, speech-to-text and text-to-speech integration, and function/tool calling for executing code mid-conversation. Vision-Agents supports 25+ integrations across LLM providers (OpenAI, Gemini, Anthropic, Mistral, xAI, Hugging Face), speech services (Deepgram, ElevenLabs, AWS Polly, Cartesia), and vision tools (YOLO, Roboflow, Moondream, NVIDIA Cosmos 2).

Practical use cases include sports coaching with real-time pose tracking, security monitoring with face recognition, phone-based RAG assistants via Twilio, and silent interview coaching. Built with Python and installable via uv, the framework features a modular architecture separating edge network management, media processing, and LLM integration.
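To illustrate the frame-processor fallback described above, here is a minimal Python sketch of the pattern: frames flow through an ordered chain of processing steps before reaching a provider that cannot consume a WebRTC stream directly. All names here (Frame, FramePipeline, the processor functions) are hypothetical stand-ins for illustration, not the actual Vision-Agents API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    """Hypothetical container for one decoded video frame."""
    timestamp_ms: int
    data: bytes


# A processor takes a frame and returns a (possibly transformed) frame.
FrameProcessor = Callable[[Frame], Frame]


class FramePipeline:
    """Runs each frame through an ordered list of processors, standing in
    for the fallback path used when an LLM provider lacks WebRTC support."""

    def __init__(self, processors: List[FrameProcessor]):
        self.processors = processors

    def process(self, frame: Frame) -> Frame:
        for step in self.processors:
            frame = step(frame)
        return frame


def downscale(frame: Frame) -> Frame:
    # Placeholder for a real resize/compression step before upload.
    return Frame(frame.timestamp_ms, frame.data[: len(frame.data) // 2])


def annotate(frame: Frame) -> Frame:
    # Placeholder for a vision-tool step (e.g. object-detection labels).
    return Frame(frame.timestamp_ms, frame.data + b"|annotated")


pipeline = FramePipeline([downscale, annotate])
result = pipeline.process(Frame(timestamp_ms=0, data=b"rawframebytes"))
print(result.data)  # b"rawfra|annotated"
```

The design choice being sketched is composition: each stage stays independently testable, and the same pipeline can be reused in front of any provider that only accepts preprocessed frames rather than a live stream.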