## Introduction

Midscene.js is an AI-powered UI automation framework that uses vision-driven technology to automate interactions across web, mobile, and desktop platforms. With 12,100+ GitHub stars and active development by web-infra-dev, Midscene represents a paradigm shift in test automation: instead of relying on brittle CSS selectors and XPath queries, it uses visual language models to understand and interact with interfaces the same way a human does, by looking at screenshots.

This approach eliminates the maintenance nightmare of traditional selector-based automation that breaks whenever UI layouts change. By operating on visual understanding rather than DOM structure, Midscene tests remain stable across UI redesigns, framework migrations, and even cross-platform deployments.

## Architecture and Design

Midscene operates through a pure-vision localization pipeline. When an action is requested, the system captures a screenshot of the current interface, sends it to a visual language model for understanding, identifies the target element's location, and executes the interaction. This approach works identically whether the UI is rendered by React, Vue, native Android, iOS, or even a canvas-based application.

The framework supports multiple visual language models as its perception backbone:

| Model | Type | Best For |
|-------|------|----------|
| Qwen3-VL | Cloud/Self-hosted | General-purpose vision tasks |
| Doubao-1.6-vision | Cloud | High-accuracy element detection |
| Gemini-3-Pro | Cloud | Complex multi-step workflows |
| UI-TARS | Open-source | Cost-effective self-hosted deployment |

The architecture separates the perception layer (visual understanding) from the action layer (element interaction), allowing users to swap models without changing test logic. DOM extraction is available as a fallback but is not required for most operations, significantly reducing token consumption.
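The capture-localize-act cycle described above can be sketched in a few lines of TypeScript. This is an illustrative model of a pure-vision pipeline, not Midscene's actual internals; the `VisionModel` and `Driver` interfaces and the `aiClick` helper are hypothetical:

```typescript
type Point = { x: number; y: number };

// Hypothetical perception backend: maps (screenshot, instruction) to target coordinates.
interface VisionModel {
  locate(screenshotBase64: string, instruction: string): Promise<Point>;
}

// Hypothetical platform driver: the only two primitives the loop needs.
interface Driver {
  screenshot(): Promise<string>;        // capture the current UI as a base64 image
  clickAt(point: Point): Promise<void>; // dispatch a click at pixel coordinates
}

// One perception -> localization -> action cycle. Nothing here depends on how
// the UI is rendered, which is why the same loop covers web, mobile, and canvas.
async function aiClick(
  driver: Driver,
  model: VisionModel,
  instruction: string,
): Promise<Point> {
  const shot = await driver.screenshot();
  const target = await model.locate(shot, instruction);
  await driver.clickAt(target);
  return target;
}
```

Because the driver exposes only `screenshot` and `clickAt`, the perception model (or the platform) can be swapped without touching the loop itself, mirroring the perception/action separation described above.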
## Key Capabilities

- **Natural Language Interaction**: Instead of writing `page.click('#submit-btn')`, developers write `ai('click the submit button')`. This natural language interface makes tests self-documenting and accessible to non-technical team members.
- **Data Extraction APIs**: Beyond clicking and typing, Midscene can extract structured data from any interface. Ask it to read a table, capture form values, or extract text from a complex dashboard layout, and it returns structured JSON.
- **Cross-Platform Coverage**: A single API works across Puppeteer and Playwright for web testing, adb-based Android automation, WebDriverAgent-powered iOS testing, and custom SDK integrations for desktop applications. Write once, automate everywhere.
- **Zero-Code Chrome Extension**: For teams that want to automate without writing any code, the Chrome Extension provides a visual interface for recording and replaying browser interactions with AI-powered element recognition.
- **Built-In Playgrounds**: Dedicated Android and iOS playgrounds let developers experiment with mobile automation interactively before committing to full test suites.
- **Visualization Reporting**: Every test run generates visual replay reports showing exactly what the AI saw, where it clicked, and how the interface responded. This makes debugging failed tests straightforward.
- **Assertions and Waits**: Built-in assertion capabilities verify UI state through natural language ("assert the login button is visible"), while conditional waits handle asynchronous interfaces gracefully.
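The structured-JSON extraction idea can be illustrated with a small validator: a vision model replies with text, and the caller narrows it into a typed record. The `Product` shape and `parseExtraction` helper below are hypothetical illustrations, not part of Midscene's API:

```typescript
type Product = { name: string; price: number };

// Parse a model's raw JSON reply and validate it into typed rows (hypothetical helper).
function parseExtraction(reply: string): Product[] {
  const rows = JSON.parse(reply) as unknown;
  if (!Array.isArray(rows)) throw new Error('expected a JSON array of rows');
  return rows.map((row) => {
    const { name, price } = row as Record<string, unknown>;
    if (typeof name !== 'string' || typeof price !== 'number') {
      throw new Error('row is missing a string name or numeric price');
    }
    return { name, price };
  });
}
```

Validating at the boundary like this keeps the rest of the test suite typed even though the extraction itself comes back as free-form model output.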
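A conditional wait of the kind mentioned above can be sketched as a poll-until-true loop; this is a generic illustration of the pattern, not Midscene's implementation:

```typescript
// Retry an async boolean check until it passes or the timeout elapses (hypothetical sketch).
async function waitFor(
  check: () => Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 250,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await check()) return;                       // condition met: stop waiting
    if (Date.now() >= deadline) {
      throw new Error('condition not met within timeout');
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs)); // back off, then retry
  }
}
```

Pairing a natural-language assertion with a loop like this is what lets vision-based tests tolerate spinners, lazy loading, and other asynchronous interface behavior.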
## Developer Integration

Installation via npm:

```bash
npm install @midscene/web
```

A basic Playwright integration (Midscene's AI helpers are provided as Playwright fixtures):

```typescript
import { test as base } from '@playwright/test';
import { PlaywrightAiFixture, type PlayWrightAiFixtureType } from '@midscene/web/playwright';

// Extend Playwright's test with Midscene's AI fixtures (ai, aiAssert, ...).
const test = base.extend<PlayWrightAiFixtureType>(PlaywrightAiFixture());

test('login flow', async ({ page, ai, aiAssert }) => {
  await page.goto('https://app.example.com');
  await ai('type "user@example.com" in the email field');
  await ai('type "password123" in the password field');
  await ai('click the sign in button');
  await aiAssert('the dashboard heading is visible');
});
```

For Android testing, the SDK connects via adb and provides the same natural language interface over mobile screens.

## Limitations

Vision-based automation is inherently slower than selector-based approaches due to the overhead of screenshot capture and model inference. Each action incurs API costs when using cloud-hosted vision models, though self-hosted options like UI-TARS mitigate this. Highly dynamic UIs with frequent layout changes during interactions can confuse the vision model. Small or visually similar elements may occasionally be misidentified, requiring more specific natural language descriptions. The framework's TypeScript-only nature limits server-side integration options for teams using other languages. Performance in pixel-dense interfaces with many similar elements can degrade without careful prompt engineering.

## Who Should Use This

Midscene is ideal for QA teams tired of maintaining brittle selector-based test suites that break with every UI update. Front-end developers who want end-to-end tests that survive framework migrations find the vision-based approach liberating. Mobile development teams needing cross-platform test coverage benefit from the unified API across Android, iOS, and web. Product managers and designers who want to write or review test scenarios in natural language appreciate the accessibility. Any organization that wants its test automation to remain stable as the UI evolves will find that Midscene dramatically reduces maintenance overhead.