JoyAI-Image - Open Source | Evermx

JoyAI-Image

jd-opensource | Apache-2.0

Multimodal | 2.2K Stars | 157 Forks | 50 views

## Introduction

JoyAI-Image is an open-source multimodal model from JD.com's AI research division that attempts to close the loop between image understanding and image generation, two capabilities that the open-source ecosystem has historically treated as separate problems requiring separate model families. The core claim is that a system which genuinely understands spatial relationships in images will also generate and edit them more accurately, and that a system trained to generate coherent spatial layouts will develop richer scene understanding.

JoyAI-Image operationalizes this idea by coupling an 8B Multimodal LLM (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT) through a shared interface, then training them jointly so that improvements in one component feed back into the other. Released with technical reports, datasets, and benchmarks in early to mid 2026, the project has reached over 2,100 GitHub stars.

## What It Is

The system is built around three tightly integrated components:

- **JoyAI-Image-Und**: The understanding backbone, an 8B MLLM fine-tuned for spatial scene parsing, relational grounding, and structured image question answering. It extracts detailed spatial metadata (object locations, relative positions, depth cues) that the generation side can condition on.
- **JoyAI-Image-Edit**: The editing model, a 16B MMDiT checkpoint that accepts instruction text and spatial parameters derived from the understanding module, enabling precise manipulations such as object translation, rotation, and camera perspective shifts.
- **OpenSpatial-3M**: A dataset of 3 million spatially annotated image-instruction pairs released alongside the model to support community research on spatial editing and generation tasks.

The closed-loop architecture means that generation is guided by explicit spatial understanding rather than relying purely on diffusion model priors.
When a user asks to "move the red chair to the left side of the room," the understanding module first parses the current spatial layout, then passes structured spatial parameters to the diffusion model as conditioning signals. The resulting edits respect scene geometry rather than making plausible-looking but physically inconsistent changes.

## Key Capabilities

### Spatial Intelligence

JoyAI-Image's defining feature is its treatment of spatial reasoning as a first-class capability. The MLLM backbone is trained specifically on scene relational understanding (answering questions about object positions, occlusion relationships, and viewpoint geometry) rather than generic visual QA. This spatial grounding directly improves both the quality of edits and the faithfulness of generated scenes to text descriptions.

### Instruction-Guided Spatial Editing

Users can issue natural-language editing instructions ("rotate the object 45 degrees," "change the camera angle to a top-down view," "remove the background clutter") and the model applies the transformation while maintaining scene coherence. The SpatialEdit benchmark introduced by the team provides a standardized evaluation for this class of tasks.

### Text-to-Image Generation with Advanced Typography

The generative component handles text-to-image synthesis with particular strength in scenes involving complex text layouts: multi-panel comics, multilingual signage, dense information graphics, and structured diagrams. Typography-heavy generation is a known weakness of most diffusion models; JoyAI-Image addresses it through dedicated training on layout-rich data.

### Multi-View and 3D-Consistent Generation

The model supports multi-view generation, producing consistent images of the same object or scene from different camera angles. This capability is useful for product visualization, game asset creation, and 3D reconstruction pipelines.
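
To make "different camera angles" concrete, here is a minimal sketch of one common way to describe a set of viewpoints: cameras spaced evenly on a circular orbit around the subject. The source does not say how JoyAI-Image parameterizes views, so `CameraView`, `orbit_views`, and the azimuth/elevation convention are illustrative assumptions rather than the project's API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraView:
    """Hypothetical view description; not a JoyAI-Image type."""
    azimuth_deg: float
    elevation_deg: float
    position: tuple[float, float, float]  # (x, y, z), camera looks at the origin


def orbit_views(num_views: int, radius: float = 2.0,
                elevation_deg: float = 20.0) -> list[CameraView]:
    """Place num_views cameras evenly around a subject centred at the origin."""
    if num_views < 1:
        raise ValueError("num_views must be at least 1")
    elevation = math.radians(elevation_deg)
    views = []
    for i in range(num_views):
        azimuth_deg = 360.0 * i / num_views
        azimuth = math.radians(azimuth_deg)
        # Spherical-to-Cartesian conversion with y as the vertical axis.
        x = radius * math.cos(elevation) * math.cos(azimuth)
        y = radius * math.sin(elevation)
        z = radius * math.cos(elevation) * math.sin(azimuth)
        views.append(CameraView(azimuth_deg, elevation_deg, (x, y, z)))
    return views


if __name__ == "__main__":
    for view in orbit_views(4):
        print(view)
```

Calling `orbit_views(4)` yields azimuths of 0, 90, 180, and 270 degrees, each at the same distance from the subject. A multi-view request could then pair one such view description with each generated image, which is the kind of consistency a 3D reconstruction pipeline depends on.
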
### Unified Understanding-Generation Feedback

Unlike pipeline approaches where understanding and generation are run in sequence, JoyAI-Image uses bidirectional feedback during training: generative transformations provide complementary spatial evidence that improves the understanding model, while better understanding improves generation conditioning. This mutual reinforcement is the architectural innovation the team reports as key to its performance.

## Deployment and Integration

JoyAI-Image requires Python 3.10+, CUDA, PyTorch 2.8+, transformers 4.57.0+, and flash-attn 2.8.0+. The project ships with:

- **ComfyUI integration** for node-based visual workflow deployment
- **Hugging Face Diffusers compatibility** added in May 2026, enabling standard diffusers pipeline usage
- Inference scripts, training code, and benchmark evaluation utilities

The two published checkpoints (Und and Edit) allow teams to deploy only the capability they need; a distilled inference-optimized version and a multi-image editing variant are listed as forthcoming.

## Why It Matters

The dominant open-source approach to image editing in 2025–2026 remains prompt-based diffusion inpainting, which is effective for creative tasks but imprecise for spatial manipulation. Moving an object to a specific location, maintaining consistent scale, and adjusting camera perspective all require explicit geometric reasoning that pure diffusion models lack. JoyAI-Image's architecture, which anchors generation in structured spatial understanding from the MLLM backbone, represents a credible path toward more controllable, geometry-aware image editing.

The release of OpenSpatial-3M as a public dataset is also notable: it provides the community with training data specifically annotated for spatial editing tasks, which had previously been scarce.
Researchers working on controllable generation, robotic manipulation planning, product visualization, or e-commerce image processing all stand to benefit from the dataset independently of the model weights themselves.
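
The "explicit geometric reasoning" discussed above can be illustrated with a small sketch of what a structured spatial parameter for the "move the red chair to the left" request might look like. This is an assumption-laden illustration, not JoyAI-Image's interface: the normalized bounding-box representation, `TranslationPlan`, and `plan_horizontal_move` are all hypothetical.

```python
from dataclasses import dataclass

# (x_min, y_min, x_max, y_max), each normalized to the [0, 1] image frame.
Box = tuple[float, float, float, float]


@dataclass(frozen=True)
class TranslationPlan:
    """Hypothetical conditioning record: where an object is and where it should go."""
    label: str
    source_box: Box
    target_box: Box


def plan_horizontal_move(label: str, box: Box,
                         target_center_x: float) -> TranslationPlan:
    """Shift a box horizontally while keeping its width and height unchanged."""
    x_min, y_min, x_max, y_max = box
    half_width = (x_max - x_min) / 2
    # Clamp the requested centre so the moved object stays fully inside the frame.
    center_x = min(max(target_center_x, half_width), 1.0 - half_width)
    target = (center_x - half_width, y_min, center_x + half_width, y_max)
    return TranslationPlan(label, box, target)


if __name__ == "__main__":
    plan = plan_horizontal_move("red chair", (0.6, 0.5, 0.9, 0.9), 0.1)
    print(plan)
```

The point of the sketch is that scale and vertical placement are preserved by construction, and only the horizontal position changes. A prompt-only inpainting approach has no such explicit constraint to enforce.
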

Key Features

Unified architecture combining an 8B MLLM and 16B MMDiT in a closed-loop understanding-generation system
Instruction-guided spatial editing: object translation, rotation, and camera perspective control
Spatial intelligence backbone for relational scene parsing and grounded image question answering
Advanced typography generation for multi-panel comics, multilingual layouts, and dense information graphics
Multi-view consistent generation for product visualization and 3D reconstruction pipelines
OpenSpatial-3M: 3 million spatially annotated image-instruction pairs released as an open dataset
SpatialEdit benchmark for standardized evaluation of controllable image manipulation
ComfyUI and Hugging Face Diffusers integration for node-based and pipeline-based deployment
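
Before trying either integration path, it can help to confirm that the local environment meets the version floors listed under Deployment and Integration. The following is a minimal sketch using only the standard library. It assumes the requirements map to the PyPI distributions `torch`, `transformers`, and `flash-attn`; the helper names are illustrative, and the sketch does not check CUDA availability.

```python
import re
import sys
from importlib import metadata

# Version floors taken from the requirements stated in this article.
# The distribution names are assumed PyPI names, not confirmed by the project.
MINIMUMS = {"torch": "2.8", "transformers": "4.57.0", "flash-attn": "2.8.0"}


def parse_release(version: str) -> tuple[int, ...]:
    """Extract leading numeric components, e.g. '2.8.0+cu128' -> (2, 8, 0).

    Pre-release suffixes such as 'rc1' are ignored, which is a simplification.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically so that '2.10' counts as newer than '2.8'."""
    have, need = parse_release(installed), parse_release(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment() -> dict[str, bool]:
    """Return a pass/fail flag for Python and each required distribution."""
    results = {"python": sys.version_info >= (3, 10)}
    for name, minimum in MINIMUMS.items():
        try:
            results[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = False  # not installed at all
    return results


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Comparing parsed integer tuples rather than raw strings matters here: a plain string comparison would rank "2.10" below "2.8".
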