Table of Contents


Introduction

 

Definition of Image Reasoning

Overview of Image Reasoning
    Historical Context
    Key Components of Image Reasoning
    Example of Image Reasoning

Top Open Source Image Reasoning Models in 2025
    Qwen QvQ
      Model Architecture and Specifications
      Performance Metrics
      Limitations
    DeepSeek R1
      Model Architecture and Specifications
      Performance Metrics
      Efficiency and Accessibility
    Llama Vision Models
      Llama 4 Series (April 2025)
        Model Architecture and Specifications
        Key Architectural Features
        Training Methodology
        Performance Metrics
      Llama 3.2 Vision (September 2024)
        Model Architecture and Specifications
        Key Architectural Features
        Performance Capabilities
        Limitations
      Applications in Image Reasoning
    Janus-Pro-7B
      Model Architecture and Specifications
      Performance Metrics
      Unique Capabilities
    Qwen QwQ
      Model Architecture and Specifications
      Performance Metrics
      Efficiency and Accessibility
    Lumina-Image 2.0
      Model Architecture and Specifications
      Performance Metrics
      Versatility and Applications

Comparative Analysis
    Raw Benchmark Performance
    Visual Reasoning Capabilities
    Architectural Approaches
    Efficiency vs. Performance

Comparison Table of Open Source Image Reasoning Models (2025)

Key Insights from Comparison

Conclusion

References

Image Reasoning: State-of-the-Art Open Source AI Models in 2025

16 min read
Artificial Intelligence
Rohit Aggarwal
Stephen Hayes
Harpreet Singh



Introduction

Artificial intelligence has made remarkable strides in recent years, with one of the most significant advancements being in the field of image reasoning. This capability represents a fundamental shift in how AI systems process and understand visual information, moving beyond simple recognition to complex reasoning about visual content. This report examines the current state of image reasoning technology in 2025, focusing on the top open source AI models that excel in this domain.

The ability of machines not just to see but to reason with and about images represents a critical step toward more general artificial intelligence. As we'll explore, today's leading open source models don't merely identify objects in images; they can analyze relationships, infer context, solve problems, and generate insights based on visual information, capabilities that were barely imaginable just a few years ago.

This report provides a comprehensive overview of image reasoning, detailed analysis of the top open source models' architectures and performance metrics, and a comparative evaluation to help researchers, developers, and organizations understand the current landscape and make informed decisions about which models might best suit their needs.

 

Definition of Image Reasoning

Image reasoning refers to the advanced cognitive capability of AI systems to not only perceive and recognize visual content but to actively think with and about images during problem-solving processes. It represents the integration of visual perception with higher-order reasoning, enabling AI to:

  1. Analyze visual information beyond simple object recognition or classification
  2. Incorporate images directly into reasoning chains rather than merely translating them to text
  3. Manipulate visual content mentally (e.g., rotating, zooming, or transforming images) during reasoning
  4. Draw logical inferences from visual data
  5. Solve complex problems that require understanding both the content and context of images

Unlike traditional computer vision, which focuses primarily on what is in an image, image reasoning is concerned with the relationships and implications of what is seen, and with reasoning about it. It represents a fusion of visual and linguistic intelligence, where models can seamlessly integrate information from both modalities to perform complex cognitive tasks.

 

Overview of Image Reasoning

Historical Context

Image reasoning has evolved from earlier computer vision and multimodal AI approaches. Traditional computer vision focused on tasks like object detection, image classification, and segmentation—identifying what was in an image. Early multimodal models could generate text descriptions of images but struggled with deeper understanding.

The breakthrough came with the development of models that could integrate visual information directly into their reasoning processes. Rather than treating images as separate inputs requiring translation to text, these models began to "think with" images, incorporating visual information directly into their chain of thought.
 

Key Components of Image Reasoning

Modern image reasoning systems typically incorporate several key components:

  1. Visual Encoders: Specialized neural networks that transform image data into rich feature representations that capture both low-level visual features and high-level semantic content.
  2. Multimodal Integration Mechanisms: Architectures that allow seamless fusion of visual and textual information, enabling models to reason across modalities.
  3. Visual Working Memory: The ability to maintain and manipulate visual information during extended reasoning processes.
  4. Visual Manipulation Capabilities: Functions that allow models to mentally transform images (zoom, rotate, crop) as part of their reasoning process.
  5. Chain-of-Thought Visual Reasoning: The ability to break down complex visual problems into step-by-step reasoning processes that incorporate visual information at each stage.
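To make these components concrete, the sketch below shows, in schematic PyTorch, how patch embeddings from a visual encoder can be projected into a language model's embedding space and fused with text tokens before a shared reasoning backbone. The module names, dimensions, and single-layer "backbone" are illustrative assumptions, not the architecture of any model covered in this report.

```python
# Schematic sketch of the components above (illustrative only; dimensions and
# module choices are assumptions, not any specific model's architecture).
import torch
import torch.nn as nn

class TinyImageReasoner(nn.Module):
    def __init__(self, vision_dim=768, text_dim=1024, vocab_size=32000):
        super().__init__()
        # 1. Visual encoder stand-in: turns an image into patch embeddings.
        self.vision_encoder = nn.Conv2d(3, vision_dim, kernel_size=16, stride=16)
        # 2. Multimodal integration: project visual features into the text space.
        self.projector = nn.Linear(vision_dim, text_dim)
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        # 3-5. A single transformer layer stands in for the reasoning backbone
        # that attends over visual and textual tokens at every step.
        self.backbone = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8,
                                                   batch_first=True)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image, text_ids):
        patches = self.vision_encoder(image)             # (B, C, H/16, W/16)
        patches = patches.flatten(2).transpose(1, 2)     # (B, num_patches, C)
        visual_tokens = self.projector(patches)          # into the LM's space
        text_tokens = self.text_embed(text_ids)
        fused = torch.cat([visual_tokens, text_tokens], dim=1)   # early fusion
        return self.lm_head(self.backbone(fused))

model = TinyImageReasoner()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)   # (1, 196 visual tokens + 16 text tokens, vocab_size)
```

Production systems add far more machinery (pretrained encoders, cross-attention or expert routing, visual working memory), but this projection-and-fusion pattern is the common core.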

 

Example of Image Reasoning

To illustrate the concept of image reasoning, consider a model presented with an image of a complex physics problem showing a pulley system with weights and angles. A traditional computer vision system might identify the components (pulleys, weights, ropes) but would struggle to solve the problem. A basic multimodal system might generate a text description of the setup but wouldn't reason about the physics.

In contrast, an advanced image reasoning model would:

  1. Analyze the visual components and their relationships
  2. Identify the relevant physical principles
  3. Extract key measurements and parameters from the image
  4. Mentally manipulate the system to understand forces and tensions
  5. Apply mathematical reasoning to solve for unknown variables
  6. Generate a step-by-step solution that references specific visual elements

Throughout this process, the model doesn't just convert the image to text and then reason; it actively thinks with the visual information, referring back to specific parts of the image and potentially manipulating the visual representation as part of its reasoning process.
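As a concrete illustration of step 5 above, suppose the diagram shows a simple ideal Atwood machine: two masses hanging from a frictionless, massless pulley (a hypothetical setup, not taken from any benchmark). The final numerical step the model's reasoning would need to carry out looks like this:

```python
# Hypothetical worked example: ideal Atwood machine (frictionless, massless pulley).
# The masses are assumed values "read off" the diagram for illustration.
g = 9.81            # m/s^2
m1, m2 = 5.0, 3.0   # kg

# Newton's second law on each mass, solved simultaneously:
#   m1*g - T = m1*a   and   T - m2*g = m2*a
a = (m1 - m2) * g / (m1 + m2)      # acceleration of the system
T = 2 * m1 * m2 * g / (m1 + m2)    # rope tension

print(f"a = {a:.2f} m/s^2, T = {T:.2f} N")   # a = 2.45 m/s^2, T = 36.79 N
```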

 

Top Open Source Image Reasoning Models in 2025

Qwen QvQ

Model Architecture and Specifications

Qwen QvQ represents a significant advancement in multimodal AI, specifically designed for visual reasoning tasks. Built upon the Qwen2-VL-72B architecture, this model features:

  • Parameter Count: 72 billion parameters
  • Architecture Type: Transformer-based design with specialized visual reasoning capabilities
  • License: Open source (Apache 2.0)
  • Key Innovations:
    • Grouped query attention mechanism
    • Dual chunk attention for enhanced multimodal processing
    • Hierarchical architecture tailored for complex multimodal reasoning tasks

The model's visual processing components integrate visual and language information through advanced multimodal fusion techniques, enabling it to process and reason with both images and text simultaneously. Its specialized visual encoder is designed to extract and understand complex visual features.

Qwen QvQ was built on the Qwen2-VL foundation with additional specialized training for visual reasoning, including extensive training on multimodal datasets with image-text pairs and fine-tuning specifically for visual reasoning tasks with a focus on mathematical and scientific reasoning.
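Because the weights are openly released, the model can be run with standard Hugging Face tooling. The snippet below is a minimal sketch that assumes the Qwen/QVQ-72B-Preview checkpoint and the Qwen2-VL processing classes available in recent versions of transformers; verify the exact identifier and prompt format against the model card.

```python
# Minimal inference sketch (model id, classes, and message format assumed from
# Qwen2-VL family conventions; check the QVQ model card before use).
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/QVQ-72B-Preview"   # assumed Hugging Face identifier
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("pulley_problem.png")   # hypothetical local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Solve for the rope tension. Reason step by step."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```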


Performance Metrics

Without fine-tuning, Qwen QvQ demonstrates impressive performance on several key benchmarks:

  • MMMU (Massive Multi-discipline Multimodal Understanding): Achieved a score of 70.3, a substantial improvement over its predecessor
  • MathVista: Scored 71.4 on this mathematics-focused visual reasoning test
  • MathVision: Excellent results on multimodal mathematical reasoning derived from real mathematics competitions
  • OlympiadBench: Competitive performance (20.4%) on this Olympiad-level bilingual multimodal science benchmark

The model excels in tasks requiring sophisticated reasoning with visual inputs, particularly in domains that demand analytical thinking, such as physics problems. It can methodically reason through complex visual problems with step-by-step analysis and demonstrates enhanced capabilities in understanding and manipulating visual information during reasoning.

With fine-tuning, Qwen QvQ shows improved performance on domain-specific visual reasoning tasks, enhanced ability to maintain focus on image content during multi-step reasoning, reduced tendency for "hallucinations," and better handling of language mixing and circular logic patterns.

 

Limitations

Despite its impressive capabilities, Qwen QvQ has several limitations:

  • May occasionally mix languages or switch between them unexpectedly
  • Can get stuck in circular logic patterns during complex reasoning
  • During multi-step visual reasoning, may gradually lose focus on the image content, leading to hallucinations
  • Requires enhanced safety measures for reliable performance

 

DeepSeek R1

Model Architecture and Specifications

DeepSeek R1 represents a massive-scale approach to reasoning capabilities:

  • Model Type: Advanced reasoning model using Mixture-of-Experts (MoE) architecture
  • Total Parameter Count: 671 billion parameters
  • Activated Parameter Count: Approximately 37 billion parameters activated per token
  • License: Open source (MIT License)
  • Base Architecture: Built on DeepSeek-V3-Base

The model's key architectural features include an MoE framework that activates only a subset of parameters for each query, efficient processing of complex reasoning tasks, and specialization for mathematical problem-solving and logical reasoning.
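The core MoE idea, routing each token to a small subset of expert networks so only a fraction of the parameters is active per token, can be sketched in a few lines. The sizes below are toy values for illustration, not DeepSeek R1's actual expert configuration or gating scheme.

```python
# Toy sketch of top-k Mixture-of-Experts routing (sizes and gating details are
# illustrative assumptions, not DeepSeek R1's actual configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Each token is processed by only top_k experts, so only a fraction of
        # the layer's parameters is activated per token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

print(ToyMoELayer()(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```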

While not specifically designed for visual tasks, the model can be applied to visual reasoning with strong general reasoning capabilities that can be leveraged for image understanding and cross-domain problem-solving including visual inputs.

DeepSeek R1 uses a multi-stage training approach that includes initial supervised fine-tuning with high-quality examples, reinforcement learning focused on reasoning tasks, collection of new training data through rejection sampling, and final reinforcement learning across all types of tasks. It employs group relative policy optimization (GRPO) with a focus on accuracy and format rewards.
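The group-relative idea behind GRPO can be sketched very simply: for each prompt, several candidate answers are sampled and scored with accuracy and format rewards, and each sample's advantage is its reward normalized against the group's mean and standard deviation, with no separate value network. The reward values below are made up for illustration.

```python
# Sketch of GRPO-style group-relative advantages (reward values are made up).
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled answer's reward against its own group's statistics."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four sampled reasoning chains, scored by accuracy + format rules.
rewards = [1.0, 0.0, 1.0, 0.0]   # 1.0 = correct, well-formatted final answer
print(group_relative_advantages(rewards))   # ~[1.0, -1.0, 1.0, -1.0]
```

The positive advantages reinforce the correct samples and the negative ones suppress the rest, which is what steers the policy toward more accurate reasoning.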

 

Performance Metrics

DeepSeek R1 has demonstrated exceptional performance across multiple benchmarks:

  • AIME (American Invitational Mathematics Examination) 2024: Achieved a score of 79.8% Pass@1, slightly surpassing OpenAI-o1
  • MATH-500: Scored an impressive 97.3%, ahead of o1's 96.4%
  • SWE-bench Verified: Outperformed competing models in programming tasks
  • MMLU (Pass@1): 90.8%, showing strong general knowledge capabilities
  • MMLU-Redux (EM): 92.9%, demonstrating excellent reasoning abilities
  • MMLU-Pro (EM): 84.0%, indicating advanced reasoning on complex topics
  • DROP (3-shot F1): 92.2%, showing strong reading comprehension and numerical reasoning
  • GPQA-Diamond (Pass@1): 71.5%, demonstrating graduate-level physics reasoning

While not specifically designed for visual tasks, DeepSeek R1 shows strong general reasoning capabilities that can be applied to visual reasoning, including effective breakdown of complex visual problems into manageable steps, strong performance on mathematical and scientific problems with visual components, and capability for cross-domain problem-solving including visual inputs.

DeepSeek R1 is also available in several "distilled" versions, in which smaller base models ranging from 1.5 billion to 70 billion parameters are fine-tuned on R1's reasoning outputs. The smallest can run on a laptop while maintaining reasonable performance, and the distilled versions show improved performance on specific tasks while reducing computational requirements.

 

Efficiency and Accessibility

DeepSeek R1 balances massive scale with accessibility options:

  • Full model requires significant computational resources due to its 671 billion parameters
  • Each token activates only about 37 billion parameters, making efficient use of the large total parameter count
  • Available through DeepSeek's API at prices 90%-95% cheaper than proprietary alternatives
  • Open-source under MIT License, allowing commercial use without restrictions
  • Distilled versions provide options for deployment on more modest hardware
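For local experimentation, a distilled checkpoint can be loaded like any other causal language model. The sketch below assumes the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B repository on Hugging Face; verify the identifier and the recommended sampling settings on the model card.

```python
# Sketch: running a distilled DeepSeek R1 variant locally (repository id and
# sampling settings are assumptions; check the model card).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "A ladder leans against a wall at 60 degrees "
                                        "to the ground, with its base 2 m from the wall. "
                                        "How long is the ladder?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```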

 

Llama Vision Models

Meta has developed multiple generations of open source multimodal models with strong image reasoning capabilities, with the latest being the Llama 4 series released in April 2025.

 

Llama 4 Series (April 2025)

Model Architecture and Specifications

Meta's latest Llama 4 series represents a significant advancement in open source multimodal AI, featuring native integration of vision capabilities:

  • Llama 4 Scout:
    • Parameter Count: 17 billion active parameters with 16 experts (109B total parameters)
    • Architecture Type: Mixture-of-Experts (MoE) with early fusion for multimodal processing
    • License: Open source
    • Context Window: Industry-leading 10M tokens
    • Deployment Requirements: Can fit on a single NVIDIA H100 GPU with Int4 quantization (see the back-of-envelope check after this list)
  • Llama 4 Maverick:
    • Parameter Count: 17 billion active parameters with 128 experts (400B total parameters)
    • Architecture Type: Mixture-of-Experts (MoE) with alternating dense and MoE layers
    • License: Open source
    • Context Window: 1M tokens
    • Deployment Requirements: Can run on a single NVIDIA H100 DGX host
  • Llama 4 Behemoth (Preview only, not yet released):
    • Parameter Count: 288 billion active parameters with 16 experts (2T total parameters)
    • Architecture Type: Advanced MoE architecture
    • Status: Still in training, not yet publicly available
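The single-GPU claim for Llama 4 Scout is easy to sanity-check with back-of-envelope arithmetic, assuming 4-bit weights dominate memory and ignoring the KV cache and activations:

```python
# Back-of-envelope memory check for Llama 4 Scout on one H100
# (assumption: 4-bit weights dominate; KV cache and activations ignored).
total_params = 109e9              # Scout's total parameter count
bytes_per_param = 0.5             # Int4 quantization = 4 bits per weight
weight_gb = total_params * bytes_per_param / 1e9
print(f"~{weight_gb:.1f} GB of weights vs. 80 GB of HBM on a single H100")   # ~54.5 GB
```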

Key Architectural Features
  • Native Multimodality: Designed with early fusion to seamlessly integrate text and vision tokens into a unified model backbone
  • Mixture-of-Experts Architecture: Each token activates only a fraction of the total parameters, making the models more compute-efficient for training and inference
  • Improved Vision Encoder: Based on MetaCLIP but trained separately in conjunction with a frozen Llama model to better adapt the encoder to the LLM
  • Multilingual Support: Pre-trained on 200 languages, including over 100 with more than 1 billion tokens each
Training Methodology
  • Joint Pre-training: Pre-trained with large amounts of unlabeled text, image, and video data
  • MetaP Training Technique: New approach for reliably setting critical model hyper-parameters such as per-layer learning rates and initialization scales
  • FP8 Precision: Used for efficient model training without sacrificing quality
  • Mid-training: Continued training to improve core capabilities with new training recipes including long context extension using specialized datasets
  • Distillation: Smaller models (Scout and Maverick) were distilled from the larger Behemoth model
Performance Metrics
  • Llama 4 Scout: Outperforms Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of widely reported benchmarks
  • Llama 4 Maverick: Beats GPT-4o and Gemini 2.0 Flash across multiple benchmarks, while achieving comparable results to DeepSeek v3 on reasoning and coding with fewer active parameters
  • Llama 4 Behemoth: Outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks
     

Llama 3.2 Vision (September 2024)

Model Architecture and Specifications
  • Parameter Sizes: Available in 11B and 90B parameter versions
  • Architecture Type: Transformer-based with integrated image encoder
  • License: Open source
  • Visual Processing: Integrates a pre-trained image encoder into the language model using adapters
Key Architectural Features
  • Adapter Integration: Uses adapters to connect image data to the text-processing layers
  • Multimodal Processing: Capable of handling both image and text inputs simultaneously
  • Customizability: Can be fine-tuned for custom applications using Torchtune
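A minimal sketch of querying one of these checkpoints with an image and a question is shown below; it assumes the meta-llama/Llama-3.2-11B-Vision-Instruct repository and the Mllama classes available in recent transformers releases, so verify both against the model card.

```python
# Sketch: image + text inference with Llama 3.2 Vision (model id and classes
# assumed from recent transformers releases; verify against the model card).
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("quarterly_revenue_chart.png")   # hypothetical chart image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Which quarter shows the largest revenue growth, and by roughly how much?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```
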
Performance Capabilities
  • Image-Text Tasks: Performs well on generating captions, answering image-based questions, and complex visual reasoning
  • Chart and Diagram Understanding: Both the 11B and 90B versions outperform some proprietary models in tasks involving chart and diagram understanding
  • OCR Capabilities: Can recognize and process text within images
Limitations
  • Math Reasoning: Shows room for improvement in math-heavy tasks, especially the 11B version
  • Language Support: For image+text applications, only English is fully supported (though text-only tasks support multiple languages)

Applications in Image Reasoning

  • Complex Visual Problem-Solving: Can analyze and reason about complex visual information
  • Document Understanding: Capable of extracting and reasoning about information from documents with text and visual elements
  • Chart and Graph Analysis: Strong performance in understanding and interpreting data visualizations
  • Visual Question Answering: Can answer detailed questions about image content with explanatory reasoning
  • Multimodal Chain-of-Thought: Demonstrates ability to incorporate visual information into step-by-step reasoning processes

 

Janus-Pro-7B

Model Architecture and Specifications

Janus-Pro-7B features a novel approach to multimodal AI:

  • Parameter Count: 7 billion parameters
  • Base Model: Built upon DeepSeek-LLM-7b-base
  • Architecture Type: Novel autoregressive framework
  • License: Open source (MIT License)

The model's key architectural features include unified multimodal understanding and generation, decoupled visual encoding into separate pathways for understanding and generation, a single unified transformer architecture for processing, and enhanced framework flexibility through decoupled visual encoding.

For visual processing, Janus-Pro-7B uses SigLIP-L as the vision encoder for multimodal understanding, supports 384 x 384 image input, and for image generation, uses a specialized tokenizer with a downsample rate of 16. This decoupled visual encoding alleviates conflict between the visual encoder's roles in understanding and generation.
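The downsample rate determines how many discrete tokens the generation pathway uses per image. A quick check of the arithmetic implied by the numbers above (assuming the rate applies to both spatial dimensions):

```python
# Image-token arithmetic implied by the reported settings
# (assumption: the downsample rate of 16 applies to both spatial dimensions).
image_size = 384
downsample = 16
tokens_per_side = image_size // downsample   # 24
image_tokens = tokens_per_side ** 2          # 576 discrete tokens per image
print(tokens_per_side, image_tokens)         # 24 576
```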
 

Performance Metrics

Janus-Pro-7B demonstrates impressive performance across various benchmarks:

  • MMBench: Achieved a score of 79.2 on this multimodal understanding benchmark, surpassing state-of-the-art unified models
  • GenEval: Scored 80% overall accuracy in text-to-image tasks, compared to 67% for DALL-E 3 and 74% for Stable Diffusion
  • DPG-Bench: Achieved 84.2%, setting a new benchmark for multimodal models

The model excels in both multimodal understanding and generation tasks, surpasses previous unified models in performance, matches or exceeds the performance of task-specific models, shows strong performance in text-to-image generation tasks, and maintains high accuracy in image fidelity (92%).

In comparative evaluations, Janus-Pro-7B outperformed DALL-E 3 on multiple benchmarks, surpassed Stable Diffusion in text-to-image generation tasks, demonstrated superior handling of dense prompts and multimodal understanding, and achieved competitive performance against specialized models despite its unified architecture.
 

Unique Capabilities

Janus-Pro-7B stands out for its ability to both understand and generate visual content within a single model:

  • Can analyze images and reason about their content
  • Can generate high-quality images from text descriptions
  • Unified architecture eliminates the need for separate models for understanding and generation
  • Decoupled visual encoding provides flexibility and improved performance

 

Qwen QwQ

Model Architecture and Specifications

Qwen QwQ demonstrates that smaller models can achieve remarkable reasoning capabilities with the right architecture and training:

  • Model Size: 32 billion parameters
  • Design Philosophy: Advanced transformer-based design optimized for reasoning tasks
  • License: Open source (Apache 2.0)
  • Key Architectural Features:
    • Specialized for iterative problem-solving
    • Optimized for memory retention and contextual reasoning
    • Advanced contextual embedding for deeper understanding of nuances
    • Integrated agent-related capabilities for tool use and environmental feedback adaptation

While less visual-specific than QvQ, it has strong reasoning capabilities applicable to visual tasks, can be integrated with visual inputs for multimodal reasoning, and is designed for iterative problem-solving across domains including visual reasoning.

Qwen QwQ uses a multi-stage reinforcement learning approach, starting from a cold-start checkpoint and scaling reinforcement learning with outcome-based rewards. The first stage focused on math and coding tasks, using accuracy verifiers rather than traditional reward models, while the second stage added general capabilities training with rewards from general reward models and rule-based verifiers.
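The "accuracy verifier" mentioned above can be pictured as a simple rule rather than a learned reward model: extract the final answer and compare it to the reference. The answer-format convention below is an assumption made purely for illustration.

```python
# Sketch of a rule-based accuracy verifier (the "FINAL ANSWER:" convention is
# an assumed format for illustration, not Qwen's actual verifier).
import re

def accuracy_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted final answer matches the reference, else 0.0."""
    match = re.search(r"FINAL ANSWER:\s*(.+)", model_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0                    # unparseable output earns no reward
    answer = match.group(1).strip().rstrip(".")
    return 1.0 if answer == ground_truth.strip() else 0.0

print(accuracy_reward("... so the area is 12 cm^2.\nFinal answer: 12 cm^2", "12 cm^2"))   # 1.0
```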
 

Performance Metrics

Despite having only 32 billion parameters (compared to DeepSeek R1's 671 billion), Qwen QwQ achieves comparable performance:

  • GPQA: Achieved an impressive 65.2%, showcasing its reasoning capabilities
  • AIME24: Matches or beats DeepSeek-R1 and OpenAI's o1-mini
  • LiveBench: Competitive performance against larger models
  • BFCL (Berkeley Function Calling Leaderboard): Strong results comparable to much larger models

The model is effective at breaking down complex problems into manageable steps, shows strong performance on mathematical problems with visual components, and is capable of iterative problem-solving across domains including visual reasoning.

Fine-tuning for specific domains shows further improvements while maintaining core capabilities.
 

Efficiency and Accessibility

A standout feature of Qwen QwQ is its efficiency:

  • Achieves performance comparable to models 20x its size
  • Performance-to-parameter ratio significantly better than larger models
  • Open-weight under the Apache 2.0 license
  • Accessible via Hugging Face, ModelScope, and Qwen Chat
  • Demonstrates the effectiveness of reinforcement learning when applied to robust foundation models
  • Can be deployed on consumer-grade hardware with reasonable performance
     

Lumina-Image 2.0

Model Architecture and Specifications

Lumina-Image 2.0 offers an efficient approach to image generation and understanding:

  • Parameter Count: 2.6 billion parameters
  • Architecture Type: Flow-based diffusion transformer
  • License: Open source (Apache 2.0)
  • Text Encoder: Gemma-2-2B
  • VAE: FLUX-VAE-16CH

The model's key architectural features include a unified and efficient image generation framework, support for high-resolution image generation (1024x1024), multiple solver options including Midpoint Solver, Euler Solver, and DPM Solver for inference, and design optimized for efficiency while maintaining high-quality output.

Lumina-Image 2.0 supports single-task and multi-task fine-tuning, capabilities for controllable generation, image editing, and identity preservation, PEFT (Parameter-Efficient Fine-Tuning) using LLaMa-Adapter V2, and integration with popular frameworks like ComfyUI and Diffusers.
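Given the Diffusers integration mentioned above, generation can be sketched with the generic pipeline loader. The repository id and call parameters below follow common Diffusers text-to-image conventions and are assumptions; check the project's documentation for exact usage.

```python
# Sketch: text-to-image with Lumina-Image 2.0 via Diffusers (repository id and
# call parameters are assumed conventions; consult the project docs).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0",        # assumed Hugging Face repository id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="A labeled diagram of a pulley system drawn on a chalkboard",
    height=1024, width=1024,              # the model's supported high resolution
    num_inference_steps=50,
    guidance_scale=4.0,
).images[0]
image.save("lumina_sample.png")
```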


Performance Metrics

Lumina-Image 2.0 demonstrates impressive efficiency and performance:

  • Achieves state-of-the-art performance across multiple image generation benchmarks
  • Delivers strong performance on academic benchmarks and public text-to-image arenas
  • Outperforms almost all open-source models (e.g., SD3) in comparative evaluations
  • Uses 38% less compute than comparable models
  • Delivers strong performance despite having only 2.6B parameters
  • Efficient resource utilization while maintaining high-quality output

The model excels in high-quality image generation at 1024x1024 resolution, shows strong performance in both qualitative and quantitative benchmarks, delivers competitive results across multiple image-related tasks with its unified approach, and is particularly effective for controllable generation and image editing tasks.
 

Versatility and Applications

Lumina-Image 2.0 supports a wide range of image-related tasks:

  • Text-to-image generation
  • Image editing
  • Controllable generation
  • Identity preservation
  • Unified multi-image generation
  • Fine-tuning for specific domains and tasks

 

Comparative Analysis

When comparing these leading open source image reasoning models, several key patterns and distinctions emerge:

Raw Benchmark Performance

  • DeepSeek R1 leads on mathematical benchmarks like AIME and MATH-500, demonstrating superior performance on structured reasoning tasks
  • Qwen QvQ excels on multimodal benchmarks like MMMU, showing its specialized capabilities in integrating visual and textual information
  • Llama 4 Maverick achieves impressive results across a broad range of benchmarks, outperforming many proprietary models despite its efficient architecture
  • Janus-Pro-7B achieves impressive scores on MMBench and image generation benchmarks, highlighting its dual capabilities
  • Qwen QwQ achieves comparable results to much larger models across multiple benchmarks, demonstrating the power of efficient architecture and training
  • Lumina-Image 2.0 delivers strong performance on image generation benchmarks while using significantly fewer resources

 

Visual Reasoning Capabilities

  • Qwen QvQ offers specialized visual reasoning with strong multimodal integration, particularly excelling at mathematical and scientific visual reasoning
  • Llama 4 Series provides native multimodality with early fusion for seamless integration of text and vision, enabling sophisticated visual reasoning
  • Janus-Pro-7B provides a unique combination of visual understanding and generation capabilities
  • DeepSeek R1 and Qwen QwQ apply strong general reasoning to visual tasks, demonstrating that powerful reasoning capabilities can transfer to visual domains even without specialized visual architectures
  • Lumina-Image 2.0 focuses on image generation but incorporates understanding capabilities for editing and controllable generation

 

Architectural Approaches

  • Mixture-of-Experts (DeepSeek R1, Llama 4): Enables massive parameter counts with efficient activation
  • Specialized Visual Components (Qwen QvQ): Provide dedicated mechanisms for visual reasoning
  • Early Fusion (Llama 4): Seamlessly integrates text and vision tokens into a unified model backbone
  • Decoupled Visual Encoding (Janus-Pro-7B): Separates understanding and generation pathways while maintaining a unified architecture
  • Reinforcement Learning Optimization (Qwen QwQ): Demonstrates how RL can dramatically improve efficiency and performance
  • Flow-based Diffusion Transformer (Lumina-Image 2.0): Offers efficient image generation with understanding capabilities

 

Efficiency vs. Performance

  • Llama 4 Scout offers an excellent balance of performance and efficiency, fitting on a single H100 GPU while outperforming many larger models
  • Qwen QwQ offers the best performance-to-parameter ratio among general reasoning models, achieving results comparable to models 20x its size
  • Lumina-Image 2.0 provides the most efficient resource utilization, using 38% less computing resources than comparable models
  • DeepSeek R1 has the highest raw performance but requires the most computational resources, though its MoE architecture makes efficient use of its parameters
  • Qwen QvQ balances specialized visual reasoning capabilities with reasonable computational requirements
  • Janus-Pro-7B offers dual capabilities (understanding and generation) in a relatively compact 7B parameter model

 

Comparison Table of Open Source Image Reasoning Models (2025)

Each model is summarized below along the same four dimensions: architecture, available sizes, performance without fine-tuning, and performance after fine-tuning.

Qwen QvQ
  • Architecture: Transformer-based with specialized visual reasoning capabilities; grouped query attention; dual chunk attention for multimodal processing
  • Sizes available: 72B parameters
  • Performance without fine-tuning: MMMU 70.3; MathVista 71.4; strong performance on multimodal mathematical reasoning; OlympiadBench 20.4%
  • Performance after fine-tuning: Improved focus on image content during multi-step reasoning; reduced hallucinations; better handling of language mixing; enhanced performance on domain-specific visual reasoning tasks

DeepSeek R1
  • Architecture: Mixture-of-Experts (MoE); each token activates only a subset of parameters; built on DeepSeek-V3-Base
  • Sizes available: 671B total parameters (37B activated per token); distilled versions from 1.5B to 70B
  • Performance without fine-tuning: AIME 2024 79.8% Pass@1; MATH-500 97.3%; MMLU 90.8%; GPQA-Diamond 71.5%; strong general reasoning that transfers to visual tasks
  • Performance after fine-tuning: Distilled versions maintain strong performance with reduced computational requirements; domain-specific fine-tuning improves performance on targeted tasks; better handling of specialized visual reasoning tasks

Llama 4 Scout
  • Architecture: Mixture-of-Experts (MoE) with early fusion for multimodal processing; native integration of vision capabilities
  • Sizes available: 17B active parameters with 16 experts (109B total)
  • Performance without fine-tuning: Outperforms Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of benchmarks; 10M-token context window; fits on a single H100 GPU with Int4 quantization
  • Performance after fine-tuning: Improved performance on domain-specific tasks; enhanced multilingual capabilities when fine-tuned for specific languages; better handling of specialized visual reasoning tasks

Llama 4 Maverick
  • Architecture: MoE with alternating dense and MoE layers; early fusion for multimodal processing
  • Sizes available: 17B active parameters with 128 experts (400B total)
  • Performance without fine-tuning: Beats GPT-4o and Gemini 2.0 Flash across multiple benchmarks; comparable results to DeepSeek v3 on reasoning and coding with fewer active parameters; LMArena ELO: 1417
  • Performance after fine-tuning: Enhanced performance on specialized domains; improved handling of complex visual reasoning tasks; better integration of visual information in reasoning chains

Llama 3.2 Vision
  • Architecture: Transformer-based with integrated image encoder connected through adapters
  • Sizes available: 11B and 90B parameter versions
  • Performance without fine-tuning: Strong performance on image-text tasks such as captioning and visual question answering; outperforms some proprietary models in chart and diagram understanding; room for improvement in math-heavy tasks, especially the 11B version
  • Performance after fine-tuning: Improved performance with Torchtune fine-tuning; enhanced capabilities for domain-specific applications; better handling of specialized visual reasoning tasks

Janus-Pro-7B
  • Architecture: Novel autoregressive framework with unified multimodal understanding and generation; decoupled visual encoding
  • Sizes available: 7B parameters
  • Performance without fine-tuning: MMBench 79.2; GenEval 80% overall accuracy in text-to-image tasks; DPG-Bench 84.2%; image fidelity 92%
  • Performance after fine-tuning: Enhanced performance on domain-specific tasks; improved balance between understanding and generation capabilities; better handling of specialized visual reasoning tasks

Qwen QwQ
  • Architecture: Advanced transformer-based design optimized for reasoning tasks; specialized for iterative problem-solving
  • Sizes available: 32B parameters
  • Performance without fine-tuning: GPQA 65.2%; AIME24 matches or beats DeepSeek-R1 and OpenAI's o1-mini; strong performance on LiveBench; BFCL results comparable to much larger models
  • Performance after fine-tuning: Improved performance on domain-specific reasoning tasks; enhanced ability to maintain reasoning chains; better handling of complex problem decomposition

Lumina-Image 2.0
  • Architecture: Flow-based diffusion transformer; unified and efficient image generation framework
  • Sizes available: 2.6B parameters
  • Performance without fine-tuning: State-of-the-art performance across multiple image generation benchmarks; outperforms most open-source models (e.g., SD3); uses 38% less compute than comparable models
  • Performance after fine-tuning: Enhanced performance on specific image generation domains; improved controllable generation; better handling of image editing tasks; enhanced identity preservation


Key Insights from Comparison

  1. Parameter Efficiency: Models like Qwen QwQ and Lumina-Image 2.0 demonstrate that smaller models can achieve competitive performance through optimized architectures and training methodologies.
  2. Mixture-of-Experts Dominance: The MoE architecture (used by DeepSeek R1 and Llama 4 models) enables efficient scaling to massive parameter counts while maintaining reasonable computational requirements during inference.
  3. Specialized vs. General Reasoning: Some models (like Qwen QvQ) are specifically designed for visual reasoning, while others (like DeepSeek R1 and Qwen QwQ) apply strong general reasoning capabilities to visual tasks.
  4. Fine-tuning Benefits: All models show significant improvements after fine-tuning, particularly in domain-specific applications and handling of complex visual reasoning tasks.
  5. Multimodal Integration Approaches: Different architectural approaches to integrating visual and textual information (early fusion in Llama 4, adapter-based in Llama 3.2 Vision, decoupled visual encoding in Janus-Pro-7B) offer various trade-offs in performance and efficiency.
     

Conclusion

The field of image reasoning has advanced significantly in 2025, with open source models demonstrating unprecedented capabilities in understanding, manipulating, and reasoning with visual information. The models examined in this report—Qwen QvQ, DeepSeek R1, Llama Vision models, Janus-Pro-7B, Qwen QwQ, and Lumina-Image 2.0—represent different approaches to achieving these capabilities, with varying trade-offs between performance, efficiency, and specialization.

Several key trends emerge from this analysis:

  1. Efficiency Gains: Smaller models like Qwen QwQ, Llama 4 Scout, and Lumina-Image 2.0 are achieving performance comparable to much larger predecessors through advanced training techniques, particularly reinforcement learning and optimized architectures.
  2. Multimodal Integration: The most effective image reasoning models don't just process images and text separately but deeply integrate these modalities in their reasoning processes, as demonstrated by Qwen QvQ, Llama 4 series, and Janus-Pro-7B.
  3. Mixture-of-Experts Architecture: The adoption of MoE architectures by models like DeepSeek R1 and Llama 4 enables efficient scaling to massive parameter counts while maintaining reasonable computational requirements during inference.
  4. Native Multimodality: The latest models like Llama 4 are designed with native multimodal capabilities from the ground up, rather than adding vision capabilities to existing language models, resulting in more seamless integration of visual and textual information.
  5. Open Source Momentum: The strength and diversity of these open source models demonstrate the growing importance of open research and development in advancing AI capabilities. This trend is particularly significant as it democratizes access to cutting-edge AI technologies.

As these technologies continue to evolve, we can expect further improvements in efficiency, capabilities, and accessibility. The ability to reason with and about images represents a significant step toward more general artificial intelligence, with applications across numerous domains including education, science, medicine, design, and engineering.

The growing availability of powerful open source models is particularly noteworthy, as it enables broader adoption and innovation across industries and research communities. These models provide researchers, developers, and organizations with powerful tools for advancing the state of the art in AI and applying these capabilities to solve real-world problems.

 

References

  1. Qwen Team. (2024, December 25). QVQ: To See the World with Wisdom. Qwen Blog. https://qwenlm.github.io/blog/qvq-72b-preview/
     
  2. DeepSeek AI. (2025, January 21). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Hugging Face. https://huggingface.co/deepseek-ai/DeepSeek-R1
     
  3. Meta AI. (2025, April 5). The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/
     
  4. Xu, S. (2025, March 26). Multimodal AI: A Guide to Open-Source Vision Language Models. BentoML. https://www.bentoml.com/blog/multimodal-ai-a-guide-to-open-source-vision-language-models
     
  5. Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., & Ruan, C. (2025). Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv preprint arXiv:2501.17811.
     
  6. Qwen Team. (2025, March 6). QwQ-32B: Embracing the Power of Reinforcement Learning. Qwen Blog. https://qwenlm.github.io/blog/qwq-32b/
     
  7. Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., Beddow, W., Millon, E., Perez, V., Wang, W., Qiao, Y., Zhang, B., Liu, X., Li, H., Xu, C., & Gao, P. (2025). Lumina-Image 2.0: A Unified and Efficient Image Generative Framework. arXiv preprint arXiv:2503.21758.
     
  8. Meta AI. (2024, September 25). Llama 3.2: Revolutionizing edge AI and vision with open source models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
     
  9. DeepSeek AI. (2025, January 24). DeepSeek R1: All you need to know. Fireworks AI Blog. https://fireworks.ai/blog/deepseek-r1-deepdive
     
  10. Gupta, M. (2024, December 25). Qwen QVQ-72B: Best open-sourced Image Reasoning LLM. Medium. https://medium.com/data-science-in-your-pocket/qwen-qvq-72b-best-open-sourced-image-reasoning-llm-95b474d3b9a0
     
  11. Alpha-VLLM. (2025, March 27). Lumina-Image 2.0: A Unified and Efficient Image Generative Framework. GitHub. https://github.com/Alpha-VLLM/Lumina-Image-2.0
     
  12. Ozen, H. (2025). A Guide to Reasoning with Qwen QwQ 32B. Groq. https://groq.com/a-guide-to-reasoning-with-qwen-qwq-32b/