Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.

Vision-Agents is an open-source framework by Stream for building real-time AI agents that process video and audio with ultra-low latency. The project enables developers to create intelligent applications combining computer vision, language models, and speech processing. It uses WebRTC-based direct streaming to LLM providers for live video analysis, with a frame processor pipeline as a fallback for providers lacking WebRTC capability. The framework achieves sub-30ms audio/video latency via Stream's edge network.

Core conversational capabilities include turn detection, speaker diarization, voice activity detection, speech-to-text and text-to-speech integration, and function/tool calling for executing code mid-conversation. Vision-Agents supports 25+ integrations across LLM providers (OpenAI, Gemini, Anthropic, Mistral, xAI, Hugging Face), speech services (Deepgram, ElevenLabs, AWS Polly, Cartesia), and vision tools (YOLO, Roboflow, Moondream, NVIDIA Cosmos 2).

Practical use cases include sports coaching with real-time pose tracking, security monitoring with face recognition, phone-based RAG assistants via Twilio, and silent interview coaching. Built with Python and installable via uv, the framework features a modular architecture separating edge network management, media processing, and LLM integration.
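To illustrate the frame-processor fallback described above, here is a minimal Python sketch of the pattern: frames flow through an ordered chain of processing steps before reaching a provider that cannot consume a WebRTC stream directly. All names here (Frame, FramePipeline, the processor functions) are hypothetical stand-ins for illustration, not the actual Vision-Agents API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    """Hypothetical container for one decoded video frame."""
    timestamp_ms: int
    data: bytes


# A processor takes a frame and returns a (possibly transformed) frame.
FrameProcessor = Callable[[Frame], Frame]


class FramePipeline:
    """Runs each frame through an ordered list of processors, standing in
    for the fallback path used when an LLM provider lacks WebRTC support."""

    def __init__(self, processors: List[FrameProcessor]):
        self.processors = processors

    def process(self, frame: Frame) -> Frame:
        for step in self.processors:
            frame = step(frame)
        return frame


def downscale(frame: Frame) -> Frame:
    # Placeholder for a real resize/compression step before upload.
    return Frame(frame.timestamp_ms, frame.data[: len(frame.data) // 2])


def annotate(frame: Frame) -> Frame:
    # Placeholder for a vision-tool step (e.g. object-detection labels).
    return Frame(frame.timestamp_ms, frame.data + b"|annotated")


pipeline = FramePipeline([downscale, annotate])
result = pipeline.process(Frame(timestamp_ms=0, data=b"rawframebytes"))
print(result.data)  # b"rawfra|annotated"
```

The design choice being sketched is composition: each stage stays independently testable, and the same pipeline can be reused in front of any provider that only accepts preprocessed frames rather than a live stream.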