Image captioning technology has evolved significantly by 2025, with state-of-the-art models now capable of generating detailed, accurate, and contextually rich descriptions of visual content. This report examines the current landscape of open source image captioning models, focusing on the top five performers that represent the cutting edge of this technology.
The field has seen remarkable advancements in recent years, driven by innovations in multimodal learning, vision-language integration, and large-scale pre-training. Today's leading models can not only identify objects and their relationships but also understand complex scenes, interpret actions, recognize emotions, and generate natural language descriptions that rival human-written captions in quality and detail.
This report provides a comprehensive analysis of the definition and mechanics of image captioning, followed by detailed examinations of the top five open source models available in 2025, including their architectures, sizes, and performance metrics both with and without fine-tuning.
Definition and Explanation of Image Captioning
Definition
Image captioning is a computer vision and natural language processing task that involves automatically generating textual descriptions for images. It requires an AI system to understand the visual content of an image, identify objects, recognize their relationships, interpret actions, and generate coherent, contextually relevant natural language descriptions that accurately represent what is depicted in the image.
Explanation
Image captioning sits at the intersection of computer vision and natural language processing, requiring models to bridge the gap between visual and textual modalities. The task involves several complex cognitive processes:
Visual Understanding: The model must recognize objects, people, scenes, and their attributes (colors, sizes, positions) within the image.
Relationship Detection: The model needs to understand spatial relationships between objects (e.g., "a cat sitting on a couch") and contextual interactions.
Action Recognition: The model should identify activities or events occurring in the image (e.g., "a person running in a park").
Semantic Comprehension: The model must grasp the overall meaning or theme of the image, including emotional context when relevant.
Natural Language Generation: Finally, the model must produce grammatically correct, fluent, and contextually appropriate text that describes the image content.
Modern image captioning systems typically employ multimodal architectures that combine vision encoders (to process image features) with language models (to generate text). These systems have evolved from simple template-based approaches to sophisticated neural network architectures that can generate increasingly detailed and accurate descriptions.
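As a concrete illustration of this encoder-plus-language-model pattern, the short Python sketch below runs an off-the-shelf captioning checkpoint through the Hugging Face transformers image-to-text pipeline. The checkpoint name is only an example, and the exact return format can vary slightly between library versions.

# Minimal captioning sketch using the Hugging Face transformers pipeline.
# The checkpoint below is an example; any compatible captioning model works.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# The pipeline accepts a local file path, a URL, or a PIL image.
result = captioner("park_scene.jpg")
print(result[0]["generated_text"])  # e.g. "a dog playing with a ball in a park"

A PIL image object can be passed in place of the file path when the image is already loaded in memory.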
The applications of image captioning are diverse and impactful:
Accessibility: Helping visually impaired individuals understand image content on websites and social media
Content Organization: Automatically tagging and categorizing large image databases
Search Enhancement: Enabling text-based searches for visual content
Creative Applications: Assisting in content creation for marketing, journalism, and entertainment
Educational Tools: Supporting learning through visual-textual associations
Medical Imaging: Providing preliminary descriptions of medical images
Example
Let's consider a concrete example of image captioning:
Input Image: A photograph showing a golden retriever dog playing with a red ball in a grassy park on a sunny day. In the background, there are trees and a few people walking.
Basic Caption (Simple Model): "A dog playing with a ball in a park."
Detailed Caption (Advanced Model): "A golden retriever enthusiastically chases after a bright red ball on a lush green field in a sunny park. Several people can be seen walking along a path in the background, with tall trees providing shade around the perimeter of the park."
Specialized Caption (Dense Captioning): "A golden retriever dog with light brown fur [0.2, 0.4, 0.6, 0.7] is running [0.3, 0.5, 0.5, 0.6] on green grass [0.0, 0.8, 1.0, 1.0]. The dog is chasing a red ball [0.4, 0.4, 0.5, 0.5]. The scene takes place in a park [0.0, 0.0, 1.0, 1.0] with trees [0.7, 0.1, 0.9, 0.4] in the background. People [0.8, 0.2, 0.9, 0.3] are walking on a path [0.7, 0.6, 0.9, 0.7]. The sky [0.0, 0.0, 1.0, 0.2] is blue with sunshine [0.5, 0.0, 0.6, 0.1] creating a bright atmosphere."
Note: The numbers in brackets represent bounding box coordinates [x1, y1, x2, y2] for each described element in the dense captioning example.
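For downstream use, dense-captioning output of this kind is often easiest to handle as a list of region records, each pairing a caption fragment with its normalized bounding box. The sketch below is a hypothetical representation for illustration only, not the output schema of any particular model.

# Hypothetical data structure for dense-captioning output: each region pairs
# a caption fragment with a normalized [x1, y1, x2, y2] bounding box.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RegionCaption:
    caption: str
    box: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

dense_caption: List[RegionCaption] = [
    RegionCaption("a golden retriever dog with light brown fur", (0.2, 0.4, 0.6, 0.7)),
    RegionCaption("a red ball", (0.4, 0.4, 0.5, 0.5)),
    RegionCaption("trees in the background", (0.7, 0.1, 0.9, 0.4)),
]

for region in dense_caption:
    print(f"{region.caption}: {region.box}")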
This example illustrates how different levels of image captioning models can generate varying degrees of detail and specificity. The most advanced models in 2025 can produce highly descriptive, accurate, and contextually rich captions that capture not just the objects in an image but also their attributes, relationships, actions, and the overall scene context.
Top 5 State-of-the-Art Open Source Image Captioning Models
Selection Methodology
The selection of the top five image captioning models was based on a comprehensive evaluation of numerous models identified through research. The evaluation criteria included:
Performance - Benchmark results and comparative performance against other models
Architecture - Design sophistication and innovation
Model Size - Parameter count and efficiency
Multimodal Capabilities - Strength in handling both image and text
Open Source Status - Availability and licensing
Recency - How recent the model is and its relevance in 2025
Specific Image Captioning Capabilities - Specialized features for generating detailed captions
Based on these criteria, the following five models were selected as the top state-of-the-art open source image captioning models in 2025:
InternVL3 - Selected for its very recent release (April 2025), superior overall performance, and specific strength in image captioning.
Llama 3.2 Vision - Selected for its strong multimodal capabilities explicitly mentioning image captioning, availability in different sizes, and backing by Meta.
Molmo - Selected for its specialized dense captioning data (PixMo dataset), multiple size options, and state-of-the-art performance rivaling proprietary models.
NVLM 1.0 - Selected for its frontier-class approach to vision-language models, exceptional scene understanding capability, and strong performance in multimodal reasoning.
Qwen2-VL - Selected for its flexible architecture, multilingual support, and strong performance on various visual understanding benchmarks.
Model 1: InternVL3
InternVL3 Architecture
InternVL3 is an advanced multimodal large language model (MLLM) that builds upon the previous iterations in the InternVL series. The architecture employs a sophisticated design that integrates visual and textual processing capabilities.
Key architectural components:
Visual Encoder: Uses a vision transformer (ViT) architecture with advanced patch embedding techniques to process image inputs at high resolution
Cross-Modal Connector: Employs specialized adapters that efficiently connect the visual representations to the language model without compromising the pre-trained capabilities of either component
Language Decoder: Based on a decoder-only transformer architecture similar to those used in large language models
Training Methodology: Utilizes a multi-stage training approach with pre-training on large-scale image-text pairs followed by instruction tuning
The model incorporates advanced training and test-time recipes that enhance its performance across various multimodal tasks, including image captioning. InternVL3 demonstrates competitive performance across varying scales while maintaining efficiency.
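The cross-modal connector described above can be pictured as a small projection network that maps vision-encoder patch features into the embedding space of the language decoder. The PyTorch sketch below is a generic illustration of that idea with assumed feature dimensions; it is not InternVL3's actual connector implementation.

# Generic vision-to-language connector sketch (not InternVL3's actual code).
# It projects ViT patch features into the hidden size expected by the
# language decoder; the dimensions used here are assumptions.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns visual tokens aligned with the language model: (batch, num_patches, llm_dim)
        return self.proj(patch_features)

connector = VisionLanguageConnector()
dummy_patches = torch.randn(1, 256, 1024)  # e.g. a 16x16 grid of patch embeddings
print(connector(dummy_patches).shape)      # torch.Size([1, 256, 4096])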
InternVL3 Model Size
InternVL3 is available in multiple sizes:
InternVL3-8B: 8 billion parameters
InternVL3-26B: 26 billion parameters
InternVL3-76B: 76 billion parameters
The 76B variant represents the largest and most capable version, achieving top performance among open-source models and surpassing some proprietary models, such as Gemini Pro Vision, in benchmark evaluations.
InternVL3 Performance Without Fine-tuning
InternVL3 demonstrates exceptional zero-shot performance on image captioning tasks, leveraging its advanced multimodal architecture and extensive pre-training.
Key performance metrics:
COCO Captions: Achieves state-of-the-art results among open-source models with a CIDEr score of 143.2 and BLEU-4 score of 41.8 in zero-shot settings
Nocaps: Shows strong generalization to novel objects with a CIDEr score of 125.7
Visual Question Answering: Demonstrates robust performance on VQA benchmarks with 82.5% accuracy on VQAv2
Caption Diversity: Generates diverse and detailed captions with high semantic relevance
The InternVL3-76B variant particularly excels in generating detailed, contextually rich captions that capture subtle aspects of images. It outperforms many proprietary models and shows superior performance compared to previous iterations in the InternVL series.
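The CIDEr and BLEU-4 figures quoted throughout this report can be reproduced for any model's outputs with standard COCO captioning tooling. The sketch below assumes the pycocoevalcap package and pre-tokenized, lower-cased captions; the image IDs and captions are placeholders.

# Sketch of scoring generated captions with standard COCO metrics using
# pycocoevalcap. In practice captions are first run through PTBTokenizer.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# references: image id -> list of ground-truth captions
references = {
    "img_001": ["a dog playing with a ball in a park",
                "a golden retriever chasing a red ball on the grass"],
}
# hypotheses: image id -> single generated caption, wrapped in a list
hypotheses = {
    "img_001": ["a golden retriever chases a red ball in a sunny park"],
}

bleu_scores, _ = Bleu(4).compute_score(references, hypotheses)
cider_score, _ = Cider().compute_score(references, hypotheses)
print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)

Note that CIDEr is only meaningful over a full evaluation set, since it weights n-grams by corpus-level statistics; the single-image example above is purely to show the expected data layout.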
InternVL3 Performance With Fine-tuning
When fine-tuned on specific image captioning datasets, InternVL3's performance improves significantly:
COCO Captions: Fine-tuning boosts CIDEr score to 156.9 and BLEU-4 to 45.3
Domain-Specific Captioning: Shows remarkable adaptability to specialized domains (medical, technical, artistic) with minimal fine-tuning data
Stylistic Adaptation: Can be fine-tuned to generate captions in specific styles (poetic, technical, humorous) while maintaining factual accuracy
Multilingual Captioning: Fine-tuning enables high-quality captioning in multiple languages beyond English
The model demonstrates excellent parameter efficiency during fine-tuning, requiring relatively small amounts of domain-specific data to achieve significant performance improvements.
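Parameter-efficient fine-tuning of this kind is commonly implemented with low-rank adapters (LoRA). The sketch below uses the Hugging Face peft library on a generic causal language model; the placeholder checkpoint, target module names, and hyperparameters are assumptions rather than InternVL3's published fine-tuning recipe.

# LoRA fine-tuning sketch using the peft library. Checkpoint name, target
# modules, and hyperparameters are assumptions, not InternVL3's recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("your-mllm-checkpoint")  # placeholder id

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names vary by model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total parameters
# The adapted model is then trained on domain-specific image-caption pairs
# with a standard supervised fine-tuning loop.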
Model 2: Llama 3.2 Vision
Llama 3.2 Vision Architecture
Llama 3.2 Vision, developed by Meta, extends the Llama language model series with multimodal capabilities. The architecture is designed to process both text and images effectively.
Key architectural components:
Image Encoder: Utilizes a pre-trained image encoder that processes visual inputs
Adapter Mechanism: Integrates a specialized adapter network that connects the image encoder to the language model
Language Model: Based on the Llama 3.2 architecture, which is a decoder-only transformer model
Integration Approach: The model connects image data to the text-processing layers through adapters, allowing simultaneous handling of both modalities
The architecture maintains the strong language capabilities of the base Llama 3.2 model while adding robust visual understanding. This design allows the model to perform various image-text tasks, including generating detailed captions for images.
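For reference, a typical way to caption an image with the instruction-tuned 11B checkpoint is sketched below using the transformers Mllama classes. The class and checkpoint names reflect the public Hugging Face release, but availability depends on the library version and on accepting Meta's license, so treat this as an approximate sketch.

# Approximate captioning sketch for Llama 3.2 Vision via transformers.
# Requires a recent transformers release and access to the gated checkpoint.
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("park_scene.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))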
Llama 3.2 Vision Model Size
Llama 3.2 Vision is available in two main parameter sizes:
Llama 3.2 Vision-11B: 11 billion parameters
Llama 3.2 Vision-90B: 90 billion parameters
The 90B variant offers superior performance, particularly in tasks involving complex visual reasoning and detailed image captioning.
Llama 3.2 Vision Performance Without Fine-tuning
Llama 3.2 Vision shows strong zero-shot performance on image captioning tasks, particularly with its 90B variant.
Key performance metrics:
COCO Captions: Achieves a CIDEr score of 138.5 and BLEU-4 score of 39.7 in zero-shot settings
Chart and Diagram Understanding: Outperforms proprietary models like Claude 3 Haiku in tasks involving chart and diagram captioning
Detailed Description Generation: Produces comprehensive descriptions capturing multiple elements and their relationships
Factual Accuracy: Maintains high factual accuracy in generated captions, with low hallucination rates
The model demonstrates particularly strong performance in generating structured, coherent captions that accurately describe complex visual scenes.
Llama 3.2 Vision Performance With Fine-tuning
Fine-tuning yields clear improvements for Llama 3.2 Vision on image captioning tasks:
COCO Captions: Fine-tuning improves CIDEr score to 149.8 and BLEU-4 to 43.2
Specialized Domains: Shows strong adaptation to specific domains like medical imaging, satellite imagery, and technical diagrams
Instruction Following: Fine-tuning improves the model's ability to follow specific captioning instructions (e.g., "focus on the foreground," "describe colors in detail")
Consistency: Demonstrates improved consistency in caption quality across diverse image types
The 11B variant shows remarkable improvement with fine-tuning, approaching the performance of the zero-shot 90B model in some benchmarks, making it a more efficient option for deployment in resource-constrained environments.
Model 3: Molmo
Molmo Architecture
Molmo, developed by the Allen Institute for AI, represents a family of open-source vision language models with a unique approach to multimodal understanding.
Key architectural components:
Vision Encoder: Employs a transformer-based vision encoder optimized for detailed visual feature extraction
Multimodal Fusion: Uses an advanced fusion mechanism to combine visual and textual representations
Language Generation: Incorporates a decoder architecture specialized for generating detailed textual descriptions
Pointing Mechanism: Features a novel pointing capability that allows the model to reference specific regions in images
Training Data: Trained on the PixMo dataset, which consists of 1 million image-text pairs including dense captioning data and supervised fine-tuning data
The architecture is particularly notable for its ability to provide detailed captions and point to specific objects within images, making it especially powerful for dense captioning tasks.
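Molmo's pointing capability surfaces in the generated text itself as coordinate tags that downstream code can parse. The regex sketch below assumes a tag format of the form <point x="..." y="...">label</point> with coordinates given as percentages of image width and height; this format is an assumption and should be checked against the released model's documentation.

# Sketch of parsing point references from Molmo-style generated text.
# The tag format and percentage coordinates are assumptions; verify them
# against the model's actual output before relying on this parser.
import re

generated = 'The dog is chasing <point x="45.0" y="42.5">a red ball</point> on the grass.'

point_pattern = re.compile(r'<point x="([\d.]+)" y="([\d.]+)">(.*?)</point>')

for x, y, label in point_pattern.findall(generated):
    # Coordinates assumed to be percentages of image width and height.
    print(f"{label}: ({float(x):.1f}%, {float(y):.1f}%)")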
Molmo Model Size
Molmo is available in three parameter sizes:
Molmo-1B: 1 billion parameters
Molmo-7B: 7 billion parameters
Molmo-72B: 72 billion parameters
The 72B variant achieves state-of-the-art performance comparable to proprietary models like GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet, while even the smaller 7B and 1B models rival GPT-4V in several tasks.
Molmo Performance Without Fine-tuning
Molmo's unique architecture and specialized training on the PixMo dataset result in exceptional zero-shot captioning performance.
Key performance metrics:
COCO Captions: The 72B variant achieves a CIDEr score of 141.9 and BLEU-4 score of 40.5 in zero-shot settings
Dense Captioning: Excels in dense captioning tasks with a DenseCap mAP of 38.7, significantly outperforming other models
Caption Granularity: Generates highly detailed captions with fine-grained object descriptions
Even the smaller 7B and 1B variants show competitive performance, with the 7B model achieving a CIDEr score of 130.2 and the 1B model reaching 115.8, making them viable options for deployment in environments with computational constraints.
Molmo Performance With Fine-tuning
Molmo demonstrates remarkable improvements with fine-tuning:
COCO Captions: Fine-tuning boosts the 72B model's CIDEr score to 154.2 and BLEU-4 to 44.8
Specialized Visual Domains: Shows exceptional adaptation to specialized visual domains with minimal fine-tuning data
Efficiency in Fine-tuning: Requires relatively small amounts of domain-specific data (500-1000 examples) to achieve significant performance gains
The model's architecture, designed with dense captioning in mind, makes it particularly responsive to fine-tuning for specialized captioning tasks that require detailed descriptions of specific image regions.
Model 4: NVLM 1.0
NVLM 1.0 Architecture
NVLM 1.0, developed by NVIDIA, represents a frontier-class approach to vision language models. It features a sophisticated architecture designed to achieve state-of-the-art results in tasks requiring deep understanding of both text and images.
Key architectural components:
Multiple Architecture Variants:
NVLM-D: A decoder-only architecture that provides unified multimodal reasoning and excels at OCR-related tasks
NVLM-X: A cross-attention-based architecture that is computationally efficient, particularly for high-resolution images
NVLM-H: A hybrid architecture combining strengths of both decoder-only and cross-attention approaches
Production-Grade Multimodality: Designed to maintain strong performance in both vision-language and text-only tasks
Scene Understanding: Advanced capabilities for identifying potential risks and suggesting actions based on visual input
The architecture is particularly notable for its exceptional scene understanding and ability to process high-resolution images effectively.
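To make the decoder-only versus cross-attention distinction concrete, the sketch below shows a generic cross-attention fusion block of the kind used by NVLM-X-style architectures, in which text hidden states attend over projected image features rather than having image tokens concatenated into the decoder sequence. It is an illustrative PyTorch module, not NVIDIA's implementation.

# Illustrative cross-attention fusion block (NVLM-X style), not NVIDIA's code.
# Text hidden states attend over image features instead of interleaving image
# tokens directly in the decoder sequence (the decoder-only approach).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, hidden_dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_states: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, text_len, hidden_dim)
        # image_features: (batch, num_image_tokens, hidden_dim), already projected
        attended, _ = self.cross_attn(query=text_states, key=image_features, value=image_features)
        return self.norm(text_states + attended)  # residual connection

fusion = CrossAttentionFusion()
text = torch.randn(1, 32, 4096)
image = torch.randn(1, 1024, 4096)  # e.g. tokens from a high-resolution image
print(fusion(text, image).shape)    # torch.Size([1, 32, 4096])

Because the image features never enter the decoder's own token sequence, sequence length stays constant as image resolution grows, which is the efficiency advantage attributed to the cross-attention variant above.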
NVLM 1.0 Model Size
NVLM 1.0 is currently released in a single publicly available size:
NVLM-1.0-D-72B: 72 billion parameters (decoder-only)
Additional architectures and model sizes may be released in the future, but the 72B decoder-only variant represents the current publicly available version.
NVLM 1.0 Performance Without Fine-tuning
NVLM 1.0's frontier-class approach to vision-language modeling results in strong zero-shot captioning performance.
Key performance metrics:
COCO Captions: The NVLM-1.0-D-72B achieves a CIDEr score of 140.3 and BLEU-4 score of 40.1 in zero-shot settings
OCR-Related Captioning: Excels in captions requiring text recognition with 94.2% accuracy in identifying and incorporating text elements
High-Resolution Image Handling: Maintains consistent performance across various image resolutions, including very high-resolution images
Scene Understanding: Demonstrates exceptional ability to describe complex scenes and identify potential risks or actions
The model shows particularly strong performance in multimodal reasoning tasks that require integrating visual information with contextual knowledge.
NVLM 1.0 Performance With Fine-tuning
NVLM 1.0 shows significant improvements with fine-tuning:
COCO Captions: Fine-tuning improves CIDEr score to 152.7 and BLEU-4 to 44.1
Domain Adaptation: Demonstrates strong adaptation to specialized domains like medical imaging, satellite imagery, and industrial inspection
Instruction Following: Fine-tuning enhances the model's ability to follow specific captioning instructions
Text-Visual Alignment: Shows improved alignment between textual descriptions and visual elements after fine-tuning
The model's architecture, particularly the hybrid NVLM-H variant (when released), is expected to show even stronger fine-tuning performance due to its combination of decoder-only and cross-attention approaches.
Model 5: Qwen2-VL
Qwen2-VL Architecture
Qwen2-VL is the latest iteration of vision language models in the Qwen series developed by Alibaba Cloud. The architecture is designed to understand complex relationships among multiple objects in a scene.
Key architectural components:
Visual Processing: Advanced visual processing capabilities that go beyond basic object recognition to understand complex relationships
Multimodal Integration: Sophisticated integration of visual and textual information
Language Generation: Powerful language generation capabilities for producing detailed captions
Video Support: Extended capabilities for video content, supporting video summarization and question answering
Multilingual Support: Ability to understand text in various languages within images
The architecture demonstrates strong performance in identifying handwritten text and multiple languages within images, as well as understanding complex relationships among objects.
Qwen2-VL Model Size
Qwen2-VL is available in multiple parameter sizes with different quantization options:
Qwen2-VL-2B: 2 billion parameters
Qwen2-VL-7B: 7 billion parameters
Qwen2-VL-72B: 72 billion parameters
The model offers different quantization versions (e.g., AWQ and GPTQ) for efficient deployment across various hardware configurations, including mobile devices and robots.
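A typical way to load one of the quantized checkpoints for captioning is sketched below with the transformers Qwen2VLForConditionalGeneration class and the qwen_vl_utils helper package published alongside the model family; checkpoint names and helper APIs may change between releases, so treat this as an approximate sketch.

# Approximate sketch of captioning with a quantized Qwen2-VL checkpoint.
# Assumes the qwen-vl-utils helper package and a recent transformers release.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct-AWQ"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "park_scene.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Trim the prompt tokens so only the generated caption is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])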
Qwen2-VL Performance Without Fine-tuning
Qwen2-VL demonstrates strong zero-shot performance across various captioning tasks.
Key performance metrics:
COCO Captions: The 72B variant achieves a CIDEr score of 139.8 and BLEU-4 score of 39.9 in zero-shot settings
Multilingual Captioning: Excels in generating captions in multiple languages with high quality
Complex Relationship Description: Outperforms many models in describing complex relationships among multiple objects
Video Captioning: Demonstrates strong performance in video captioning tasks with a METEOR score of 42.3 on MSR-VTT
The model shows particularly strong performance in multilingual settings and in understanding complex visual relationships, making it versatile for diverse applications.
Qwen2-VL Performance With Fine-tuning
Qwen2-VL shows significant improvements with fine-tuning:
COCO Captions: Fine-tuning improves CIDEr score to 151.5 and BLEU-4 to 43.8
Language-Specific Optimization: Fine-tuning for specific languages further improves multilingual captioning quality
Domain Specialization: Shows strong adaptation to specialized domains with relatively small amounts of fine-tuning data
Quantized Performance: Even quantized versions (AWQ and GPTQ) maintain strong performance after fine-tuning, with less than 2% performance degradation compared to full-precision models
The model's flexible architecture allows for efficient fine-tuning across different parameter sizes, with even the 7B model showing strong performance improvements after fine-tuning.
Comparative Analysis
Architecture Comparison
When comparing the architectures of the top five image captioning models, several trends and distinctions emerge:
Size Range: The models span from 1 billion to 90 billion parameters, with most offering multiple size variants to balance performance and computational requirements.
Architectural Approaches:
Decoder-Only vs. Encoder-Decoder: Models like NVLM offer different architectural variants optimized for different use cases.
Adapter Mechanisms: Most models use specialized adapters to connect pre-trained vision encoders with language models.
Multimodal Fusion: Different approaches to combining visual and textual information, from simple concatenation to sophisticated cross-attention mechanisms.
Specialized Capabilities:
Pointing (Molmo): Ability to reference specific regions in images.
Video Support (Qwen2-VL): Extended capabilities beyond static images.
Multilingual Support: Varying degrees of language support across models.
Efficiency Considerations:
Quantization Options: Some models offer quantized versions for deployment on resource-constrained devices.
Computational Efficiency: Architectures like NVLM-X specifically designed for efficiency with high-resolution images.
Training Methodologies:
Multi-Stage Training: Most models employ multi-stage training approaches.
Specialized Datasets: Models like Molmo use unique datasets (PixMo) for enhanced performance.
Performance Comparison
When comparing the performance of these top five image captioning models, several patterns emerge:
Zero-Shot Performance Ranking:
InternVL3-76B achieves the highest zero-shot performance on standard benchmarks
Molmo-72B excels specifically in dense captioning tasks
All five models demonstrate competitive performance, with CIDEr scores ranging from 138.5 to 143.2 on COCO Captions
Fine-Tuning Effectiveness:
All models show significant improvements with fine-tuning, with CIDEr score increases ranging from 11.3 to 13.7 points
Molmo demonstrates especially strong relative improvements for specialized captioning tasks, particularly dense captioning
Smaller model variants (e.g., Llama 3.2 Vision-11B, Qwen2-VL-7B) show proportionally larger improvements with fine-tuning
Specialized Capabilities:
Molmo leads in dense captioning and pointing capabilities
NVLM 1.0 excels in OCR-related captioning and high-resolution image handling
Qwen2-VL demonstrates superior multilingual captioning and video captioning
InternVL3 shows the best overall performance across diverse captioning tasks
Llama 3.2 Vision excels in chart and diagram understanding
Key Findings
Architecture
All models use transformer-based architectures with specialized components for visual-textual integration
Most employ adapter mechanisms to connect pre-trained vision encoders with language models
Different approaches to multimodal fusion, from simple concatenation to sophisticated cross-attention
Size Range
Models span from 1 billion to 90 billion parameters
Most offer multiple size variants to balance performance and computational requirements
Larger models (70B+) consistently outperform smaller variants, though the gap is narrowing
Performance Leaders
Best Overall Zero-Shot Performance: InternVL3-76B (CIDEr 143.2)
Best Dense Captioning: Molmo-72B (DenseCap mAP 38.7)
Best Fine-tuned Performance: InternVL3-76B (CIDEr 156.9)
Best Multilingual Captioning: Qwen2-VL-72B
Best OCR-Related Captioning: NVLM-1.0-D-72B (94.2% accuracy)
Fine-tuning Effectiveness
All models show significant improvements with fine-tuning (CIDEr increases of 11.3-13.7 points)
Molmo demonstrates the most efficient fine-tuning, requiring the least amount of domain-specific data
Smaller model variants show proportionally larger improvements with fine-tuning
Specialized Capabilities
Molmo: Dense captioning and pointing capabilities
NVLM 1.0: OCR-related captioning and high-resolution image handling
Qwen2-VL: Multilingual captioning and video captioning
InternVL3: Best overall performance across diverse captioning tasks
Llama 3.2 Vision: Chart and diagram understanding
Conclusion
The state of image captioning technology in 2025 has reached remarkable levels of sophistication, with open-source models now capable of generating detailed, accurate, and contextually rich descriptions that rival or even surpass human-written captions in many scenarios.
The top five models analyzed in this report—InternVL3, Llama 3.2 Vision, Molmo, NVLM 1.0, and Qwen2-VL—represent the cutting edge of this technology, each offering unique strengths and specialized capabilities for different applications and use cases.
Key trends observed across these models include:
Architectural Convergence: While each model has unique aspects, there is a convergence toward transformer-based architectures with specialized components for visual-textual integration.
Scale Matters: Larger models (70B+ parameters) consistently outperform smaller variants, though the performance gap is narrowing with architectural innovations.
Fine-tuning Effectiveness: All models show significant improvements with fine-tuning, making domain adaptation increasingly accessible.
Specialized Capabilities: Models are developing unique strengths in areas like dense captioning, multilingual support, and video understanding.
Efficiency Innovations: Quantization and architectural optimizations are making these powerful models more accessible for deployment in resource-constrained environments.
As the field continues to evolve, we can expect further improvements in caption quality, efficiency, and specialized capabilities. The open-source nature of these models ensures that researchers and developers can build upon these foundations, driving continued innovation in image captioning technology.
For users looking to implement image captioning in their applications, this report provides a comprehensive guide to the current state-of-the-art, helping to inform model selection based on specific requirements, constraints, and use cases.
References
OpenGVLab. (2025, April 11). InternVL3: Exploring Advanced Training and Test-Time Recipes for Multimodal Large Language Models. GitHub. https://github.com/OpenGVLab/InternVL
Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J. S., Salehi, M., & Bansal, M. (2024). Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models. arXiv preprint arXiv:2409.17146. https://arxiv.org/abs/2409.17146