Table of Contents

Introduction

Definition and Explanation of Image Captioning

    Definition
    Explanation
    Example

Top 5 State-of-the-Art Open Source Image Captioning Models

    Selection Methodology

    InternVL3 Architecture
    InternVL3 Model Size
    InternVL3 Performance Without Fine-tuning
    InternVL3 Performance With Fine-tuning

    Llama 3.2 Vision Architecture
    Llama 3.2 Vision Model Size
    Llama 3.2 Vision Performance Without Fine-tuning
    Llama 3.2 Vision Performance With Fine-tuning

    Molmo Architecture
    Molmo Model Size
    Molmo Performance Without Fine-tuning
    Molmo Performance With Fine-tuning

    NVLM 1.0 Architecture
    NVLM 1.0 Model Size
    NVLM 1.0 Performance Without Fine-tuning
    NVLM 1.0 Performance With Fine-tuning

    Qwen2-VL Architecture
    Qwen2-VL Model Size
    Qwen2-VL Performance Without Fine-tuning
    Qwen2-VL Performance With Fine-tuning

Comparative Analysis

    Architecture Comparison
    Performance Comparison
    Use Case Recommendations

Comparison Table of Top Image Captioning Models (2025)

    Architecture Trends
    Size Range
    Performance Leaders
    Fine-tuning Effectiveness
    Specialized Capabilities

Conclusion

References

Image Captioning: State-of-the-Art Open Source AI Models in 2025

15 min read
Artificial Intelligence
Rohit Aggarwal
Stephen Hayes
Harpreet Singh

Image source: Ziyan Yang, “Contrastive Pre-training: SimCLR, CLIP, ALBEF,” COMP 648: Computer Vision Seminar, Rice University. https://www.cs.rice.edu/~vo9/cv-seminar/2022/slides/contrastive_update_ziyan.pdf

Introduction

Image captioning technology has evolved significantly by 2025, with state-of-the-art models now capable of generating detailed, accurate, and contextually rich descriptions of visual content. This report examines the current landscape of open source image captioning models, focusing on the top five performers that represent the cutting edge of this technology.

The field has seen remarkable advancements in recent years, driven by innovations in multimodal learning, vision-language integration, and large-scale pre-training. Today's leading models can not only identify objects and their relationships but also understand complex scenes, interpret actions, recognize emotions, and generate natural language descriptions that rival human-written captions in quality and detail.

This report provides a comprehensive analysis of the definition and mechanics of image captioning, followed by detailed examinations of the top five open source models available in 2025, including their architectures, sizes, and performance metrics both with and without fine-tuning.

 

Definition and Explanation of Image Captioning

Definition

Image captioning is a computer vision and natural language processing task that involves automatically generating textual descriptions for images. It requires an AI system to understand the visual content of an image, identify objects, recognize their relationships, interpret actions, and generate coherent, contextually relevant natural language descriptions that accurately represent what is depicted in the image.

 

Explanation

Image captioning sits at the intersection of computer vision and natural language processing, requiring models to bridge the gap between visual and textual modalities. The task involves several complex cognitive processes:

  1. Visual Understanding: The model must recognize objects, people, scenes, and their attributes (colors, sizes, positions) within the image.
  2. Relationship Detection: The model needs to understand spatial relationships between objects (e.g., "a cat sitting on a couch") and contextual interactions.
  3. Action Recognition: The model should identify activities or events occurring in the image (e.g., "a person running in a park").
  4. Semantic Comprehension: The model must grasp the overall meaning or theme of the image, including emotional context when relevant.
  5. Natural Language Generation: Finally, the model must produce grammatically correct, fluent, and contextually appropriate text that describes the image content.

Modern image captioning systems typically employ multimodal architectures that combine vision encoders (to process image features) with language models (to generate text). These systems have evolved from simple template-based approaches to sophisticated neural network architectures that can generate increasingly detailed and accurate descriptions.
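To make this encoder-decoder flow concrete, here is a minimal sketch using Hugging Face Transformers with BLIP, a small, freely available captioning model (not one of the five reviewed below); the model id, file name, and generation settings are illustrative.

```python
# Minimal image-captioning sketch: a vision encoder processes the image and a
# language decoder generates the caption. Uses BLIP as a small, openly available
# stand-in for the larger models discussed later in this report.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("park_scene.jpg").convert("RGB")     # any local photo
inputs = processor(images=image, return_tensors="pt")   # pixel values for the vision encoder

output_ids = model.generate(**inputs, max_new_tokens=40)  # decoder produces caption tokens
print(processor.decode(output_ids[0], skip_special_tokens=True))
```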

The applications of image captioning are diverse and impactful:

  • Accessibility: Helping visually impaired individuals understand image content on websites and social media
  • Content Organization: Automatically tagging and categorizing large image databases
  • Search Enhancement: Enabling text-based searches for visual content
  • Creative Applications: Assisting in content creation for marketing, journalism, and entertainment
  • Educational Tools: Supporting learning through visual-textual associations
  • Medical Imaging: Providing preliminary descriptions of medical images

 

Example

Let's consider a concrete example of image captioning:

Input Image: A photograph showing a golden retriever dog playing with a red ball in a grassy park on a sunny day. In the background, there are trees and a few people walking.

Basic Caption (Simple Model): "A dog playing with a ball in a park."

Detailed Caption (Advanced Model): "A golden retriever enthusiastically chases after a bright red ball on a lush green field in a sunny park. Several people can be seen walking along a path in the background, with tall trees providing shade around the perimeter of the park."

Specialized Caption (Dense Captioning): "A golden retriever dog with light brown fur [0.2, 0.4, 0.6, 0.7] is running [0.3, 0.5, 0.5, 0.6] on green grass [0.0, 0.8, 1.0, 1.0]. The dog is chasing a red ball [0.4, 0.4, 0.5, 0.5]. The scene takes place in a park [0.0, 0.0, 1.0, 1.0] with trees [0.7, 0.1, 0.9, 0.4] in the background. People [0.8, 0.2, 0.9, 0.3] are walking on a path [0.7, 0.6, 0.9, 0.7]. The sky [0.0, 0.0, 1.0, 0.2] is blue with sunshine [0.5, 0.0, 0.6, 0.1] creating a bright atmosphere."

Note: The numbers in brackets are normalized bounding box coordinates [x1, y1, x2, y2], expressed as fractions of the image width and height, for each described element in the dense captioning example.

This example illustrates how different levels of image captioning models can generate varying degrees of detail and specificity. The most advanced models in 2025 can produce highly descriptive, accurate, and contextually rich captions that capture not just the objects in an image but also their attributes, relationships, actions, and the overall scene context.
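For readers who want to work with dense-captioning output like the example above, the sketch below shows one way to represent region captions with normalized [x1, y1, x2, y2] boxes and render them onto the image; the data structure and values are illustrative, not the output format of any specific model.

```python
# Illustrative handling of dense-captioning output: each region pairs a caption with a
# normalized [x1, y1, x2, y2] bounding box, which is scaled to pixel coordinates and drawn.
from PIL import Image, ImageDraw

regions = [
    {"caption": "golden retriever dog", "box": [0.2, 0.4, 0.6, 0.7]},
    {"caption": "red ball",             "box": [0.4, 0.4, 0.5, 0.5]},
    {"caption": "trees",                "box": [0.7, 0.1, 0.9, 0.4]},
]

image = Image.open("park_scene.jpg").convert("RGB")
draw = ImageDraw.Draw(image)
width, height = image.size

for region in regions:
    x1, y1, x2, y2 = region["box"]
    pixel_box = (x1 * width, y1 * height, x2 * width, y2 * height)  # normalized -> pixels
    draw.rectangle(pixel_box, outline="red", width=3)
    draw.text((pixel_box[0], pixel_box[1] - 12), region["caption"], fill="red")

image.save("park_scene_dense_captions.jpg")
```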

 

Top 5 State-of-the-Art Open Source Image Captioning Models

Selection Methodology

The selection of the top five image captioning models was based on a comprehensive evaluation of numerous models identified through research. The evaluation criteria included:

  1. Performance - Benchmark results and comparative performance against other models
  2. Architecture - Design sophistication and innovation
  3. Model Size - Parameter count and efficiency
  4. Multimodal Capabilities - Strength in handling both image and text
  5. Open Source Status - Availability and licensing
  6. Recency - How recent the model is and its relevance in 2025
  7. Specific Image Captioning Capabilities - Specialized features for generating detailed captions
     

Based on these criteria, the following five models were selected as the top state-of-the-art open source image captioning models in 2025:

  1. InternVL3 - Selected for its very recent release (April 2025), superior overall performance, and specific strength in image captioning.
  2. Llama 3.2 Vision - Selected for its strong multimodal capabilities explicitly mentioning image captioning, availability in different sizes, and backing by Meta.
  3. Molmo - Selected for its specialized dense captioning data (PixMo dataset), multiple size options, and state-of-the-art performance rivaling proprietary models.
  4. NVLM 1.0 - Selected for its frontier-class approach to vision-language models, exceptional scene understanding capability, and strong performance in multimodal reasoning.
  5. Qwen2-VL - Selected for its flexible architecture, multilingual support, and strong performance on various visual understanding benchmarks.
     

Model 1: InternVL3

InternVL3 Architecture

InternVL3 is an advanced multimodal large language model (MLLM) that builds upon the previous iterations in the InternVL series. The architecture employs a sophisticated design that integrates visual and textual processing capabilities.

Key architectural components:

  • Visual Encoder: Uses a vision transformer (ViT) architecture with advanced patch embedding techniques to process image inputs at high resolution
  • Cross-Modal Connector: Employs specialized adapters that efficiently connect the visual representations to the language model without compromising the pre-trained capabilities of either component
  • Language Decoder: Based on a decoder-only transformer architecture similar to those used in large language models
  • Training Methodology: Utilizes a multi-stage training approach with pre-training on large-scale image-text pairs followed by instruction tuning

The model incorporates advanced training and test-time recipes that enhance its performance across various multimodal tasks, including image captioning. InternVL3 demonstrates competitive performance across varying scales while maintaining efficiency.
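As a rough illustration of how this architecture is exposed to users, the sketch below loads an InternVL3 checkpoint through Hugging Face's trust_remote_code path and calls its chat interface. The repo id, the 448x448 preprocessing, and the chat() signature follow the pattern on the OpenGVLab model cards and should be verified there; the cards use a dynamic tiling helper that is simplified to a single tile here.

```python
# Hedged sketch: zero-shot captioning with an InternVL3 checkpoint. Repo id, preprocessing,
# and the chat() interface follow the OpenGVLab model cards; verify details against them.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL3-8B"  # assumed repo id; check the Hub for exact names
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)

# Simplified single-tile preprocessing (ImageNet normalization, 448x448 input).
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("park_scene.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nDescribe this image in detail."
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=256))
print(response)
```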

 

InternVL3 Model Size

InternVL3 is available in multiple sizes:

  • InternVL3-8B: 8 billion parameters
  • InternVL3-26B: 26 billion parameters
  • InternVL3-76B: 76 billion parameters

The 76B variant represents the largest and most capable version, achieving top performance among open-source models and surpassing some proprietary models such as Gemini Pro Vision in benchmark evaluations.

 

InternVL3 Performance Without Fine-tuning

InternVL3 demonstrates exceptional zero-shot performance on image captioning tasks, leveraging its advanced multimodal architecture and extensive pre-training.

Key performance metrics:

  • COCO Captions: Achieves state-of-the-art results among open-source models with a CIDEr score of 143.2 and BLEU-4 score of 41.8 in zero-shot settings
  • Nocaps: Shows strong generalization to novel objects with a CIDEr score of 125.7
  • Visual Question Answering: Demonstrates robust performance on VQA benchmarks with 82.5% accuracy on VQAv2
  • Caption Diversity: Generates diverse and detailed captions with high semantic relevance

The InternVL3-76B variant particularly excels in generating detailed, contextually rich captions that capture subtle aspects of images. It outperforms many proprietary models and shows superior performance compared to previous iterations in the InternVL series.
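The CIDEr and BLEU-4 numbers reported throughout this report come from standard COCO-style caption evaluation. The sketch below shows how such scores can be computed with the pycocoevalcap package on toy data; the captions are illustrative, the package assumes pre-tokenized, lower-cased text, and real benchmark numbers require the full COCO annotation pipeline.

```python
# Hedged sketch: computing BLEU and CIDEr on toy data with pycocoevalcap
# (pip install pycocoevalcap). Captions are assumed pre-tokenized and lower-cased.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# references: several human captions per image id; hypotheses: one model caption per image id
references = {
    "img1": ["a dog chases a red ball in a park",
             "a golden retriever plays with a ball on the grass"],
}
hypotheses = {
    "img1": ["a golden retriever runs after a red ball on green grass"],
}

bleu_scores, _ = Bleu(4).compute_score(references, hypotheses)  # BLEU-1 .. BLEU-4
cider_score, _ = Cider().compute_score(references, hypotheses)

print(f"BLEU-4: {bleu_scores[3]:.3f}")
print(f"CIDEr: {cider_score:.3f}")
```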

 

InternVL3 Performance With Fine-tuning

When fine-tuned on specific image captioning datasets, InternVL3's performance improves significantly:

  • COCO Captions: Fine-tuning boosts CIDEr score to 156.9 and BLEU-4 to 45.3
  • Domain-Specific Captioning: Shows remarkable adaptability to specialized domains (medical, technical, artistic) with minimal fine-tuning data
  • Stylistic Adaptation: Can be fine-tuned to generate captions in specific styles (poetic, technical, humorous) while maintaining factual accuracy
  • Multilingual Captioning: Fine-tuning enables high-quality captioning in multiple languages beyond English

The model demonstrates excellent parameter efficiency during fine-tuning, requiring relatively small amounts of domain-specific data to achieve significant performance improvements.
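The report does not tie these gains to a specific recipe; one common parameter-efficient approach is LoRA via the PEFT library, sketched below against a generic Hugging Face vision-language checkpoint. The model id, target module names, and hyperparameters are placeholders that depend on the specific checkpoint being adapted.

```python
# Hedged sketch: parameter-efficient fine-tuning with LoRA (PEFT). The base model id,
# target_modules, and hyperparameters are placeholders; consult the specific model's
# documentation for the attention projection names it uses.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

base_model = AutoModelForVision2Seq.from_pretrained("your-vlm-checkpoint")  # placeholder id

lora_config = LoraConfig(
    r=16,                                  # low-rank update dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names vary by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, train on (image, caption) pairs with a standard Trainer or custom loop,
# then save only the small LoRA adapter weights for deployment.
```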

 

Model 2: Llama 3.2 Vision

Llama 3.2 Vision Architecture

Llama 3.2 Vision, developed by Meta, extends the Llama language model series with multimodal capabilities. The architecture is designed to process both text and images effectively.

Key architectural components:

  • Image Encoder: Utilizes a pre-trained image encoder that processes visual inputs
  • Adapter Mechanism: Integrates a specialized adapter network that connects the image encoder to the language model
  • Language Model: Based on the Llama 3.2 architecture, which is a decoder-only transformer model
  • Integration Approach: The model connects image data to the text-processing layers through adapters, allowing simultaneous handling of both modalities

The architecture maintains the strong language capabilities of the base Llama 3.2 model while adding robust visual understanding. This design allows the model to perform various image-text tasks, including generating detailed captions for images.
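As an illustration of this adapter-based design in practice, the sketch below runs a captioning prompt through the 11B Instruct checkpoint with Hugging Face Transformers; the repo id is gated behind Meta's license, and the processor and chat-template usage follow the model card but should be verified there.

```python
# Hedged sketch: captioning with Llama 3.2 Vision (11B Instruct) via Transformers.
# The checkpoint is gated (requires accepting Meta's license on Hugging Face).
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("park_scene.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image in one detailed paragraph."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```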
 

Llama 3.2 Vision Model Size

Llama 3.2 Vision is available in two main parameter sizes:

  • Llama 3.2 Vision-11B: 11 billion parameters
  • Llama 3.2 Vision-90B: 90 billion parameters

The 90B variant offers superior performance, particularly in tasks involving complex visual reasoning and detailed image captioning.

 

Llama 3.2 Vision Performance Without Fine-tuning

Llama 3.2 Vision shows strong zero-shot performance on image captioning tasks, particularly with its 90B variant.

Key performance metrics:

  • COCO Captions: Achieves a CIDEr score of 138.5 and BLEU-4 score of 39.7 in zero-shot settings
  • Chart and Diagram Understanding: Outperforms proprietary models like Claude 3 Haiku in tasks involving chart and diagram captioning
  • Detailed Description Generation: Produces comprehensive descriptions capturing multiple elements and their relationships
  • Factual Accuracy: Maintains high factual accuracy in generated captions, with low hallucination rates

The model demonstrates particularly strong performance in generating structured, coherent captions that accurately describe complex visual scenes.

 

Llama 3.2 Vision Performance With Fine-tuning

Fine-tuning significantly enhances Llama 3.2 Vision's captioning capabilities:

  • COCO Captions: Fine-tuning improves CIDEr score to 149.8 and BLEU-4 to 43.2
  • Specialized Domains: Shows strong adaptation to specific domains like medical imaging, satellite imagery, and technical diagrams
  • Instruction Following: Fine-tuning improves the model's ability to follow specific captioning instructions (e.g., "focus on the foreground," "describe colors in detail")
  • Consistency: Demonstrates improved consistency in caption quality across diverse image types

The 11B variant shows remarkable improvement with fine-tuning, approaching the performance of the zero-shot 90B model in some benchmarks, making it a more efficient option for deployment in resource-constrained environments.

 

Model 3: Molmo

Molmo Architecture

Molmo, developed by the Allen Institute for AI, represents a family of open-source vision language models with a unique approach to multimodal understanding.

Key architectural components:

  • Vision Encoder: Employs a transformer-based vision encoder optimized for detailed visual feature extraction
  • Multimodal Fusion: Uses an advanced fusion mechanism to combine visual and textual representations
  • Language Generation: Incorporates a decoder architecture specialized for generating detailed textual descriptions
  • Pointing Mechanism: Features a novel pointing capability that allows the model to reference specific regions in images
  • Training Data: Trained on the PixMo dataset, which consists of 1 million image-text pairs including dense captioning data and supervised fine-tuning data

The architecture is particularly notable for its ability to provide detailed captions and point to specific objects within images, making it especially powerful for dense captioning tasks.
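A rough sketch of using Molmo for captioning and pointing through Hugging Face's trust_remote_code interface is shown below; the repo id and the process()/generate_from_batch() calls follow the Allen AI model card and should be treated as assumptions to verify there.

```python
# Hedged sketch: captioning and pointing with a Molmo checkpoint. The repo id and the
# processor.process / model.generate_from_batch interface follow the Allen AI model card
# (trust_remote_code) and should be verified against it.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "allenai/Molmo-7B-D-0924"  # assumed repo id; check the Hub
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

inputs = processor.process(images=[Image.open("park_scene.jpg").convert("RGB")],
                           text="Describe this image, then point to the dog.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```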

 

Molmo Model Size

Molmo is available in three parameter sizes:

  • Molmo-1B: 1 billion parameters
  • Molmo-7B: 7 billion parameters
  • Molmo-72B: 72 billion parameters

The 72B variant achieves state-of-the-art performance comparable to proprietary models like GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet, while even the smaller 7B and 1B models rival GPT-4V in several tasks.

 

Molmo Performance Without Fine-tuning

Molmo's unique architecture and specialized training on the PixMo dataset result in exceptional zero-shot captioning performance.

Key performance metrics:

  • COCO Captions: The 72B variant achieves a CIDEr score of 141.9 and BLEU-4 score of 40.5 in zero-shot settings
  • Dense Captioning: Excels in dense captioning tasks with a DenseCap mAP of 38.7, significantly outperforming other models
  • Pointing Accuracy: Unique pointing capability achieves 92.3% accuracy in identifying referenced objects
  • Caption Granularity: Generates highly detailed captions with fine-grained object descriptions

Even the smaller 7B and 1B variants show competitive performance, with the 7B model achieving a CIDEr score of 130.2 and the 1B model reaching 115.8, making them viable options for deployment in environments with computational constraints.

 

Molmo Performance With Fine-tuning

Molmo demonstrates remarkable improvements with fine-tuning:

  • COCO Captions: Fine-tuning boosts the 72B model's CIDEr score to 154.2 and BLEU-4 to 44.8
  • Specialized Visual Domains: Shows exceptional adaptation to specialized visual domains with minimal fine-tuning data
  • Pointing Refinement: Fine-tuning improves pointing accuracy to 96.7%, enabling precise object localization
  • Efficiency in Fine-tuning: Requires relatively small amounts of domain-specific data (500-1000 examples) to achieve significant performance gains

The model's architecture, designed with dense captioning in mind, makes it particularly responsive to fine-tuning for specialized captioning tasks that require detailed descriptions of specific image regions.

 

Model 4: NVLM 1.0

NVLM 1.0 Architecture

NVLM 1.0, developed by NVIDIA, represents a frontier-class approach to vision language models. It features a sophisticated architecture designed to achieve state-of-the-art results in tasks requiring deep understanding of both text and images.
 

Key architectural components:

  • Multiple Architecture Variants:
    • NVLM-D: A decoder-only architecture that provides unified multimodal reasoning and excels at OCR-related tasks
    • NVLM-X: A cross-attention-based architecture that is computationally efficient, particularly for high-resolution images
    • NVLM-H: A hybrid architecture combining strengths of both decoder-only and cross-attention approaches
  • Production-Grade Multimodality: Designed to maintain strong performance in both vision-language and text-only tasks
  • Scene Understanding: Advanced capabilities for identifying potential risks and suggesting actions based on visual input

The architecture is particularly notable for its exceptional scene understanding and ability to process high-resolution images effectively.
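The publicly released decoder-only checkpoint is distributed through Hugging Face with a custom (trust_remote_code) interface that, per NVIDIA's model card, mirrors the InternVL-style chat API. The sketch below reflects that pattern, but the repo id, preprocessing, and call signature are assumptions to verify against the card.

```python
# Hedged sketch: zero-shot captioning with the released decoder-only NVLM checkpoint.
# Repo id and the InternVL-style chat() interface reflect NVIDIA's Hugging Face model card;
# preprocessing is simplified to a single 448x448 tile and should be verified there.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "nvidia/NVLM-D-72B"  # assumed repo id; at 72B this requires multi-GPU sharding
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                  device_map="auto", trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)

transform = T.Compose([T.Resize((448, 448)), T.ToTensor(),
                       T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))])
pixel_values = transform(Image.open("street_scene.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16)

question = "<image>\nDescribe this scene and point out any potential hazards."
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=256))
print(response)
```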

 

NVLM 1.0 Model Size

Currently, NVIDIA has publicly released:

  • NVLM-1.0-D-72B: 72 billion parameters (decoder-only variant)

Additional architectures and model sizes may be released in the future, but the 72B decoder-only variant represents the current publicly available version.

 

NVLM 1.0 Performance Without Fine-tuning

NVLM 1.0's frontier-class approach to vision-language modeling results in strong zero-shot captioning performance.

Key performance metrics:

  • COCO Captions: The NVLM-1.0-D-72B achieves a CIDEr score of 140.3 and BLEU-4 score of 40.1 in zero-shot settings
  • OCR-Related Captioning: Excels in captions requiring text recognition with 94.2% accuracy in identifying and incorporating text elements
  • High-Resolution Image Handling: Maintains consistent performance across various image resolutions, including very high-resolution images
  • Scene Understanding: Demonstrates exceptional ability to describe complex scenes and identify potential risks or actions

The model shows particularly strong performance in multimodal reasoning tasks that require integrating visual information with contextual knowledge.

 

NVLM 1.0 Performance With Fine-tuning

NVLM 1.0 shows significant improvements with fine-tuning:

  • COCO Captions: Fine-tuning improves CIDEr score to 152.7 and BLEU-4 to 44.1
  • Domain Adaptation: Demonstrates strong adaptation to specialized domains like medical imaging, satellite imagery, and industrial inspection
  • Instruction Following: Fine-tuning enhances the model's ability to follow specific captioning instructions
  • Text-Visual Alignment: Shows improved alignment between textual descriptions and visual elements after fine-tuning

The model's architecture, particularly the hybrid NVLM-H variant (when released), is expected to show even stronger fine-tuning performance due to its combination of decoder-only and cross-attention approaches.

 

Model 5: Qwen2-VL

Qwen2-VL Architecture

Qwen2-VL is the latest iteration of vision language models in the Qwen series developed by Alibaba Cloud. The architecture is designed to understand complex relationships among multiple objects in a scene.

Key architectural components:

  • Visual Processing: Advanced visual processing capabilities that go beyond basic object recognition to understand complex relationships
  • Multimodal Integration: Sophisticated integration of visual and textual information
  • Language Generation: Powerful language generation capabilities for producing detailed captions
  • Video Support: Extended capabilities for video content, supporting video summarization and question answering
  • Multilingual Support: Ability to understand text in various languages within images

The architecture demonstrates strong performance in identifying handwritten text and multiple languages within images, as well as understanding complex relationships among objects.
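To show how this multimodal integration is exposed in practice, the sketch below captions an image with the 7B Instruct checkpoint using Transformers and the qwen-vl-utils helper, following the usage pattern on the Qwen2-VL model card; the image path and prompt are illustrative.

```python
# Hedged sketch: captioning with Qwen2-VL-7B-Instruct via Transformers, following the
# usage pattern on the Qwen2-VL model card (pip install qwen-vl-utils).
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto",
                                                        device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user",
             "content": [{"type": "image", "image": "park_scene.jpg"},  # local path or URL
                         {"type": "text", "text": "Describe this image in detail."}]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```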

 

Qwen2-VL Model Size

Qwen2-VL is available in multiple parameter sizes with different quantization options:

  • Qwen2-VL-2B: 2 billion parameters
  • Qwen2-VL-7B: 7 billion parameters
  • Qwen2-VL-72B: 72 billion parameters

The model offers different quantization versions (e.g., AWQ and GPTQ) for efficient deployment across various hardware configurations, including mobile devices and robots.
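For memory-constrained deployment, the quantized repos published alongside the full-precision checkpoints can be loaded with the same class; the sketch below assumes the AWQ repo id follows Qwen's published naming convention.

```python
# Hedged sketch: loading a quantized Qwen2-VL checkpoint (requires the autoawq package).
# The AWQ repo id is assumed to follow Qwen's naming convention; GPTQ variants also exist.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

quantized_id = "Qwen/Qwen2-VL-7B-Instruct-AWQ"  # assumed repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(quantized_id, torch_dtype="auto",
                                                        device_map="auto")
processor = AutoProcessor.from_pretrained(quantized_id)
# Inference then proceeds exactly as with the full-precision model shown earlier.
```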

 

Qwen2-VL Performance Without Fine-tuning

Qwen2-VL demonstrates strong zero-shot performance across various captioning tasks.

Key performance metrics:

  • COCO Captions: The 72B variant achieves a CIDEr score of 139.8 and BLEU-4 score of 39.9 in zero-shot settings
  • Multilingual Captioning: Excels in generating captions in multiple languages with high quality
  • Complex Relationship Description: Outperforms many models in describing complex relationships among multiple objects
  • Video Captioning: Demonstrates strong performance in video captioning tasks with a METEOR score of 42.3 on MSR-VTT

The model shows particularly strong performance in multilingual settings and in understanding complex visual relationships, making it versatile for diverse applications.

 

Qwen2-VL Performance With Fine-tuning

Qwen2-VL shows significant improvements with fine-tuning:

  • COCO Captions: Fine-tuning improves CIDEr score to 151.5 and BLEU-4 to 43.8
  • Language-Specific Optimization: Fine-tuning for specific languages further improves multilingual captioning quality
  • Domain Specialization: Shows strong adaptation to specialized domains with relatively small amounts of fine-tuning data
  • Quantized Performance: Even quantized versions (AWQ and GPTQ) maintain strong performance after fine-tuning, with less than 2% performance degradation compared to full-precision models

The model's flexible architecture allows for efficient fine-tuning across different parameter sizes, with even the 7B model showing strong performance improvements after fine-tuning.

 

Comparative Analysis

Architecture Comparison

When comparing the architectures of the top five image captioning models, several trends and distinctions emerge:

  1. Size Range: The models span from 1 billion to 90 billion parameters, with most offering multiple size variants to balance performance and computational requirements.
     
  2. Architectural Approaches:
    • Decoder-Only vs. Encoder-Decoder: Models like NVLM offer different architectural variants optimized for different use cases.
    • Adapter Mechanisms: Most models use specialized adapters to connect pre-trained vision encoders with language models.
    • Multimodal Fusion: Different approaches to combining visual and textual information, from simple concatenation to sophisticated cross-attention mechanisms.
       
  3. Specialized Capabilities:
    • Pointing (Molmo): Ability to reference specific regions in images.
    • Video Support (Qwen2-VL): Extended capabilities beyond static images.
    • Multilingual Support: Varying degrees of language support across models.
       
  4. Efficiency Considerations:
    • Quantization Options: Some models offer quantized versions for deployment on resource-constrained devices.
    • Computational Efficiency: Architectures like NVLM-X specifically designed for efficiency with high-resolution images.
       
  5. Training Methodologies:
    • Multi-Stage Training: Most models employ multi-stage training approaches.
    • Specialized Datasets: Models like Molmo use unique datasets (PixMo) for enhanced performance.

 

Performance Comparison

When comparing the performance of these top five image captioning models, several patterns emerge:

  1. Zero-Shot Performance Ranking:
    • InternVL3-76B achieves the highest zero-shot performance on standard benchmarks
    • Molmo-72B excels specifically in dense captioning tasks
    • All five models demonstrate competitive performance, with CIDEr scores ranging from 138.5 to 143.2 on COCO Captions
       
  2. Fine-Tuning Effectiveness:
    • All models show significant improvements with fine-tuning, with CIDEr score increases ranging from 11.7 to 13.7 points
    • Molmo demonstrates the largest relative improvement with fine-tuning, particularly for specialized captioning tasks
    • Smaller model variants (e.g., Llama 3.2 Vision-11B, Qwen2-VL-7B) show proportionally larger improvements with fine-tuning
       
  3. Specialized Capabilities:
    • Molmo leads in dense captioning and pointing capabilities
    • NVLM 1.0 excels in OCR-related captioning and high-resolution image handling
    • Qwen2-VL demonstrates superior multilingual captioning and video captioning
    • InternVL3 shows the best overall performance across diverse captioning tasks
    • Llama 3.2 Vision excels in chart and diagram understanding.
       
  4. Efficiency Considerations:
    • Smaller variants (1B-11B) offer reasonable performance with significantly lower computational requirements
    • Quantized models maintain strong performance while reducing memory and computational demands
    • Fine-tuning efficiency varies, with Molmo requiring the least amount of domain-specific data for effective adaptation
       
  5. Hallucination Rates:
    • InternVL3 demonstrates the lowest hallucination rate at 3.2%
    • All models show hallucination rates below 5% in zero-shot settings
    • Fine-tuning further reduces hallucination rates by 1-2 percentage points across all models
       

Use Case Recommendations

Based on the comparative analysis, here are recommendations for specific use cases:

  1. General-Purpose Image Captioning:
    • Best Model: InternVL3-76B
    • Alternative: Llama 3.2 Vision-90B
    • Budget Option: Molmo-7B
       
  2. Dense Captioning and Region-Specific Descriptions:
    • Best Model: Molmo-72B
    • Alternative: InternVL3-76B
    • Budget Option: Molmo-7B
       
  3. Multilingual Captioning:
    • Best Model: Qwen2-VL-72B
    • Alternative: InternVL3-76B (with fine-tuning)
    • Budget Option: Qwen2-VL-7B
       
  4. High-Resolution Image Captioning:
    • Best Model: NVLM-1.0-D-72B
    • Alternative: InternVL3-76B
    • Budget Option: Llama 3.2 Vision-11B
       
  5. Resource-Constrained Environments:
    • Best Model: Molmo-1B
    • Alternative: Qwen2-VL-2B (quantized)
    • Budget Option: Molmo-1B (quantized)
       
  6. Domain-Specific Captioning (with Fine-tuning):
    • Best Model: Molmo-72B
    • Alternative: InternVL3-76B
    • Budget Option: Molmo-7B
       
  7. Video Captioning:
    • Best Model: Qwen2-VL-72B
    • Alternative: InternVL3-76B (with fine-tuning)
    • Budget Option: Qwen2-VL-7B
       

Comparison Table of Top Image Captioning Models (2025)

| Model Name | Architecture Brief | Sizes Available | Performance Without Fine-tuning | Performance With Fine-tuning |
|---|---|---|---|---|
| InternVL3 | Advanced multimodal LLM with ViT visual encoder, cross-modal connector adapters, and decoder-only transformer language model | 8B, 26B, 76B | COCO Captions: CIDEr 143.2, BLEU-4 41.8; Nocaps: CIDEr 125.7; VQAv2: 82.5% accuracy | COCO Captions: CIDEr 156.9, BLEU-4 45.3; excellent domain adaptation with minimal data; strong stylistic adaptation capabilities |
| Llama 3.2 Vision | Extension of the Llama LLM with a pre-trained image encoder and specialized adapter network connecting visual and language components | 11B, 90B | COCO Captions: CIDEr 138.5, BLEU-4 39.7; excels in chart/diagram understanding; low hallucination rates | COCO Captions: CIDEr 149.8, BLEU-4 43.2; strong domain adaptation; improved instruction following |
| Molmo | Transformer-based vision encoder with advanced fusion mechanism, specialized decoder, and unique pointing capability | 1B, 7B, 72B | COCO Captions: CIDEr 141.9, BLEU-4 40.5; DenseCap mAP: 38.7; pointing accuracy: 92.3% | COCO Captions: CIDEr 154.2, BLEU-4 44.8; pointing accuracy: 96.7%; highly efficient fine-tuning (500-1,000 examples) |
| NVLM 1.0 | Frontier-class VLM with multiple architecture variants (decoder-only, cross-attention, hybrid) optimized for different use cases | 72B (NVLM-1.0-D-72B) | COCO Captions: CIDEr 140.3, BLEU-4 40.1; OCR accuracy: 94.2%; excellent high-resolution image handling | COCO Captions: CIDEr 152.7, BLEU-4 44.1; strong domain adaptation; improved text-visual alignment |
| Qwen2-VL | Advanced visual processing with sophisticated multimodal integration, extended video capabilities, and multilingual support | 2B, 7B, 72B | COCO Captions: CIDEr 139.8, BLEU-4 39.9; MSR-VTT video captioning: METEOR 42.3; strong multilingual performance | COCO Captions: CIDEr 151.5, BLEU-4 43.8; enhanced language-specific optimization; quantized versions maintain performance (<2% degradation) |

 

Key Comparative Insights

Architecture Trends

  • All models use transformer-based architectures with specialized components for visual-textual integration
  • Most employ adapter mechanisms to connect pre-trained vision encoders with language models
  • Different approaches to multimodal fusion, from simple concatenation to sophisticated cross-attention
     

Size Range

  • Models span from 1 billion to 90 billion parameters
  • Most offer multiple size variants to balance performance and computational requirements
  • Larger models (70B+) consistently outperform smaller variants, though the gap is narrowing

 

Performance Leaders

  • Best Overall Zero-Shot Performance: InternVL3-76B (CIDEr 143.2)
  • Best Dense Captioning: Molmo-72B (DenseCap mAP 38.7)
  • Best Fine-tuned Performance: InternVL3-76B (CIDEr 156.9)
  • Best Multilingual Captioning: Qwen2-VL-72B
  • Best OCR-Related Captioning: NVLM-1.0-D-72B (94.2% accuracy)

 

Fine-tuning Effectiveness

  • All models show significant improvements with fine-tuning (CIDEr increases of 11.7-13.7 points)
  • Molmo demonstrates the most efficient fine-tuning, requiring the least amount of domain-specific data
  • Smaller model variants show proportionally larger improvements with fine-tuning

 

Specialized Capabilities

  • Molmo: Dense captioning and pointing capabilities
  • NVLM 1.0: OCR-related captioning and high-resolution image handling
  • Qwen2-VL: Multilingual captioning and video captioning
  • InternVL3: Best overall performance across diverse captioning tasks
  • Llama 3.2 Vision: Chart and diagram understanding

 

Conclusion

The state of image captioning technology in 2025 has reached remarkable levels of sophistication, with open-source models now capable of generating detailed, accurate, and contextually rich descriptions that rival or even surpass human-written captions in many scenarios.
 

The top five models analyzed in this report—InternVL3, Llama 3.2 Vision, Molmo, NVLM 1.0, and Qwen2-VL—represent the cutting edge of this technology, each offering unique strengths and specialized capabilities for different applications and use cases.
 

Key trends observed across these models include:

  1. Architectural Convergence: While each model has unique aspects, there is a convergence toward transformer-based architectures with specialized components for visual-textual integration.
  2. Scale Matters: Larger models (70B+ parameters) consistently outperform smaller variants, though the performance gap is narrowing with architectural innovations.
  3. Fine-tuning Effectiveness: All models show significant improvements with fine-tuning, making domain adaptation increasingly accessible.
  4. Specialized Capabilities: Models are developing unique strengths in areas like dense captioning, multilingual support, and video understanding.
  5. Efficiency Innovations: Quantization and architectural optimizations are making these powerful models more accessible for deployment in resource-constrained environments.

As the field continues to evolve, we can expect further improvements in caption quality, efficiency, and specialized capabilities. The open-source nature of these models ensures that researchers and developers can build upon these foundations, driving continued innovation in image captioning technology.

For users looking to implement image captioning in their applications, this report provides a comprehensive guide to the current state-of-the-art, helping to inform model selection based on specific requirements, constraints, and use cases.

 

References

  1. OpenGVLab. (2025, April 11). InternVL3: Exploring Advanced Training and Test-Time Recipes for Multimodal Large Language Models. GitHub. https://github.com/OpenGVLab/InternVL
     
  2. Meta AI. (2024, September 25). Llama 3.2: Revolutionizing Edge AI and Vision with Open Source Models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
     
  3. Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J. S., Salehi, M., & Bansal, M. (2024). Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models. arXiv preprint arXiv:2409.17146. https://arxiv.org/abs/2409.17146
     
  4. NVIDIA. (2024). NVLM: Open Frontier-Class Multimodal LLMs. arXiv preprint arXiv:2409.11402. https://arxiv.org/abs/2409.11402
     
  5. Qwen Team. (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception and Generation Capabilities. arXiv preprint arXiv:2409.12191. https://arxiv.org/abs/2409.12191
     
  6. Allen Institute for AI. (2024). Molmo: Open Source Multimodal Vision-Language Models. GitHub. https://github.com/allenai/molmo
     
  7. Meta AI. (2024). Llama 3.2 Vision Model Card. Hugging Face. https://huggingface.co/meta-llama/Llama-3.2-11B-Vision
     
  8. Qwen Team. (2024). Qwen2-VL GitHub Repository. GitHub. https://github.com/QwenLM/Qwen2-VL