Computer vision has undergone remarkable advancements in recent years, with image classification remaining one of its most fundamental and widely applied tasks. As of 2025, state-of-the-art image classification models have achieved unprecedented levels of accuracy, efficiency, and versatility, enabling applications that were once considered science fiction.
This report provides a comprehensive overview of image classification, its applications, and the current leading models in the field. We begin with a definition and explanation of image classification, followed by detailed analyses of the top five open-source models available in 2025. For each model, we examine its architecture, size, and performance metrics both with and without fine-tuning.
The models featured in this report represent diverse approaches to image classification, from pure convolutional architectures to transformer-based designs and hybrid models that combine multiple techniques. By understanding these cutting-edge approaches, researchers and practitioners can make informed decisions about which models best suit their specific use cases and constraints.
Definition of Image Classification
Image classification is a fundamental computer vision task that involves categorizing an entire image into one or more predefined classes or labels. It is the process by which an artificial intelligence system analyzes the visual content of an image and assigns it to specific categories based on the patterns, features, and objects it contains. The goal of image classification is to accurately identify what an image represents at a holistic level, rather than identifying individual objects within the image or their precise locations.
In technical terms, image classification is a supervised learning problem where a model is trained on a dataset of labeled images. The model learns to extract meaningful features from the pixel data and map these features to class labels. During inference, when presented with a new, unseen image, the model processes the visual information and outputs a probability distribution across all possible classes, with the highest probability indicating the most likely classification.
Image classification serves as the foundation for numerous computer vision applications and has evolved significantly with the advancement of deep learning techniques, particularly convolutional neural networks (CNNs) and, more recently, vision transformers (ViTs) and hybrid architectures.
How Image Classification Works
The process of image classification typically involves several key steps, illustrated by the code sketch that follows this list:
Input Processing: The input image is preprocessed, which may include resizing, normalization, and data augmentation techniques to enhance model robustness.
Feature Extraction: The model extracts relevant features from the image. In traditional machine learning, this might involve manually engineered features, while deep learning models automatically learn hierarchical feature representations.
Classification: The extracted features are passed through a classifier that maps them to class probabilities.
Output: The model produces a probability distribution across all possible classes, and the class with the highest probability is typically chosen as the prediction.
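To make these steps concrete, the following sketch runs the full pipeline end to end with PyTorch and torchvision. The pretrained ResNet-50 classifier and the local file name are illustrative assumptions, not part of this report's benchmarks.

```python
# Minimal image classification pipeline: preprocess -> extract features -> classify -> output.
# Assumes torch/torchvision are installed and "example.jpg" exists locally (illustrative only).
import torch
from PIL import Image
from torchvision import models, transforms

# 1. Input processing: resize, crop, convert to tensor, normalize with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# 2-3. Feature extraction and classification happen inside the pretrained network.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()

image = Image.open("example.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

# 4. Output: a probability distribution over the 1,000 ImageNet classes.
with torch.no_grad():
    probs = torch.softmax(model(batch), dim=1)

top_prob, top_class = probs.max(dim=1)
print(weights.meta["categories"][top_class.item()], float(top_prob))
```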
Real-World Applications
Image classification has diverse applications across numerous domains:
Medical Diagnosis: In healthcare, image classification models analyze medical images such as X-rays, MRIs, and CT scans to detect abnormalities or diseases. For example, a model might classify a chest X-ray as showing signs of pneumonia, COVID-19, or appearing normal.
Agricultural Monitoring: Farmers use image classification to identify crop diseases, assess plant health, and monitor growth stages. A model might classify images of crop leaves as healthy or affected by specific diseases, enabling early intervention.
Retail and E-commerce: In retail, image classification helps categorize products, power visual search features, and enhance inventory management. For instance, a fashion retailer might use image classification to automatically tag clothing items by type, color, and style.
Security and Surveillance: Security systems employ image classification to detect suspicious activities or unauthorized access. A surveillance system might classify scenes as normal or potentially concerning based on the activities captured.
Autonomous Vehicles: Self-driving cars use image classification as part of their perception systems to identify road signs, traffic signals, pedestrians, and other vehicles, enabling safe navigation.
Example Scenario: Wildlife Conservation
Consider a wildlife conservation project that uses camera traps to monitor animal populations in a protected forest. The project generates thousands of images daily, making manual classification impractical. An image classification system can automatically categorize these images by:
Identifying which images contain animals versus empty scenes
Classifying the species of animals present
Detecting potential poaching activities
This automated classification enables researchers to efficiently track population trends, study animal behavior patterns, and allocate conservation resources effectively.
Evolution of Image Classification Models
Image classification has evolved dramatically over the past decade, with several key milestones:
Traditional Machine Learning Era (pre-2012): Used hand-crafted features like SIFT, HOG, and traditional classifiers like SVMs.
CNN Revolution (2012-2017): AlexNet's victory in the 2012 ImageNet competition marked the beginning of deep learning dominance in image classification. This was followed by increasingly deep architectures like VGG, GoogLeNet (Inception), and ResNet.
Efficiency-Focused Models (2017-2020): Models like MobileNet and EfficientNet optimized the trade-off between accuracy and computational efficiency.
Transformer Era (2020-2023): Vision Transformer (ViT) and its variants adapted the transformer architecture from NLP to computer vision, challenging CNN dominance.
Multimodal and Hybrid Architectures (2023-2025): The latest models combine the strengths of CNNs and transformers, while also incorporating multimodal learning from both images and text.
The current state-of-the-art models in 2025 represent the culmination of these evolutionary trends, offering unprecedented accuracy, efficiency, and versatility across diverse applications.
Top 5 State-of-the-Art Models in 2025
After evaluating numerous open-source image classification models available in 2025, we have selected the following five models as the current state of the art, representing diverse approaches and trade-offs between performance and efficiency.
1. CoCa (Contrastive Captioners)
Model Architecture
CoCa (Contrastive Captioners) represents a significant advancement in image classification by combining contrastive learning and generative captioning in a unified framework. Developed as an image-text foundation model, CoCa employs an encoder-decoder architecture with several innovative design choices:
Dual-purpose Encoder: The image encoder extracts visual features using a Vision Transformer (ViT) backbone.
Cascaded Decoder: Unlike standard encoder-decoder transformers, CoCa's decoder is split into two parts:
The first half of decoder layers operates without cross-attention to encode unimodal text representations
The second half incorporates cross-attention to the image encoder, creating multimodal image-text representations
Dual Training Objectives: CoCa is trained with two complementary objectives:
A contrastive loss between unimodal image and text embeddings
A captioning loss on the multimodal decoder outputs that predicts text tokens autoregressively
This architecture allows CoCa to simultaneously learn strong visual representations through contrastive learning while developing generative capabilities through captioning, all within a single computational graph with minimal overhead.
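The sketch below shows, in simplified form, how the two training objectives could be combined in a single loss. The tensor names, temperature, and loss weights are illustrative assumptions rather than the authors' implementation, which should be consulted for exact details.

```python
# Simplified sketch of CoCa's dual objective: a contrastive loss on unimodal embeddings
# plus an autoregressive captioning loss on the multimodal decoder output.
# Module names, temperature, and weighting factors are illustrative, not the official code.
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_tokens,
                    temperature=0.07, contrastive_weight=1.0, caption_weight=2.0):
    # img_emb, txt_emb: (batch, dim) unimodal embeddings, L2-normalized below.
    # caption_logits: (batch, seq_len, vocab) predictions from the multimodal decoder.
    # caption_tokens: (batch, seq_len) ground-truth text token ids.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Contrastive objective: symmetric cross-entropy over the image-text similarity matrix.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Captioning objective: next-token cross-entropy over the decoder outputs.
    captioning = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                                 caption_tokens.reshape(-1))

    return contrastive_weight * contrastive + caption_weight * captioning
```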
Model Size
CoCa is available in several configurations, with the largest and most capable comprising:
Parameters: 2.1 billion parameters
Image encoder: Based on ViT-L/14 architecture
Text decoder: Transformer with approximately 1B parameters
Training data: Combination of web-scale alt-text data and annotated images
Performance Without Fine-tuning (Zero-shot)
CoCa demonstrates exceptional zero-shot capabilities, leveraging its multimodal understanding to classify images without task-specific training:
ImageNet classification: 86.3% top-1 accuracy
Kinetics-400 video classification: 79.4% top-1 accuracy
Moments-in-Time: 44.5% top-1 accuracy
These zero-shot results are particularly impressive as they approach or exceed the performance of specialized models trained specifically for these tasks.
Performance With Fine-tuning
When fine-tuned on specific datasets, CoCa achieves state-of-the-art performance:
ImageNet classification: 91.0% top-1 accuracy (highest reported as of 2025)
With a frozen encoder and learned classification head: 90.6% top-1 accuracy
COCO image captioning: 143.6 CIDEr score
VQA: 80.4% accuracy
CoCa's fine-tuned performance demonstrates its exceptional ability to adapt to specific tasks while maintaining the benefits of its pre-trained multimodal representations.
2. DaViT (Dual Attention Vision Transformer)
Model Architecture
DaViT (Dual Attention Vision Transformer) introduces a novel approach to vision transformers by incorporating two complementary self-attention mechanisms:
Spatial Attention: Processes tokens along the spatial dimension, where:
The spatial dimension defines the token scope
The channel dimension defines the token feature dimension
Tokens are grouped into windows to maintain linear complexity
Channel Attention: Processes tokens along the channel dimension, where:
The channel dimension defines the token scope
The spatial dimension defines the token feature dimension
Each channel token contains an abstract representation of the entire image
These two attention mechanisms complement each other:
Channel attention naturally captures global interactions by considering all spatial positions
Spatial attention refines local representations through fine-grained interactions across spatial locations
The DaViT architecture is organized into stages with progressively increasing channel dimensions and decreasing spatial resolution, similar to hierarchical vision transformers.
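To illustrate the less familiar of the two mechanisms, the sketch below implements channel self-attention in a simplified single-group, single-head form: attention scores relate whole-image channel descriptors rather than spatial patches. The shapes, scaling factor, and the single-group simplification are assumptions for clarity, not the paper's exact grouped implementation.

```python
# Simplified channel self-attention in the spirit of DaViT's channel group attention.
# Single group, single head for clarity; the published model groups channels and stacks
# this block with windowed spatial attention. The scaling choice here is illustrative.
import torch
import torch.nn as nn

class SimpleChannelAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, num_patches, channels) -- the usual vision-transformer token layout.
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)            # each (b, n, c)

        # Treat channels as tokens: the attention map is (c x c), so the cost grows
        # linearly with the number of spatial positions rather than quadratically.
        attn = (q.transpose(-2, -1) @ k) * (n ** -0.5)    # (b, c, c)
        attn = attn.softmax(dim=-1)
        out = v @ attn                                    # (b, n, c): mixes channels globally
        return self.proj(out)

# Usage: SimpleChannelAttention(96)(torch.randn(2, 196, 96)).shape -> (2, 196, 96)
```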
Model Size
DaViT is available in several configurations:
DaViT-Tiny: 28.3M parameters
DaViT-Small: 49.7M parameters
DaViT-Base: 87.9M parameters
DaViT-Giant: 1.4B parameters (trained with 1.5B weakly supervised image and text pairs)
Performance Without Fine-tuning
DaViT models demonstrate strong performance even without task-specific fine-tuning:
DaViT-Giant: ~85% top-1 accuracy on ImageNet (zero-shot)
Strong transfer learning capabilities to downstream tasks like object detection and segmentation
Performance With Fine-tuning
When fine-tuned on specific datasets, DaViT achieves excellent results:
DaViT-Tiny: 82.8% top-1 accuracy on ImageNet-1K
DaViT-Small: 84.2% top-1 accuracy on ImageNet-1K
DaViT-Base: 84.6% top-1 accuracy on ImageNet-1K
DaViT-Giant: 90.4% top-1 accuracy on ImageNet-1K
DaViT also excels in other computer vision tasks:
Object detection on COCO: 54.6% mAP with DaViT-Base
Instance segmentation on COCO: 47.1% mask AP with DaViT-Base
Semantic segmentation on ADE20K: 53.2% mIoU with DaViT-Base
3. CLIP (Contrastive Language-Image Pretraining)
Model Architecture
CLIP (Contrastive Language-Image Pretraining) pioneered the approach of learning visual concepts from natural language supervision. Its architecture consists of two parallel encoders:
Image Encoder: Processes images to extract visual features
Can be implemented as either a Vision Transformer (ViT) or a ResNet
Multiple variants available (ViT-B/32, ViT-B/16, ViT-L/14, etc.)
Text Encoder: Processes text to extract textual features
Based on a Transformer architecture
Tokenizes and encodes text descriptions or labels
CLIP is trained using a contrastive learning approach:
The model learns to maximize the cosine similarity between embeddings of matching image-text pairs
It simultaneously minimizes similarity between non-matching pairs
This is achieved using a symmetric cross-entropy loss over the similarity matrix
This training approach allows CLIP to learn a joint embedding space where related images and text are positioned close together, enabling zero-shot classification by comparing image embeddings with text embeddings of potential labels.
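In practice, zero-shot classification with CLIP reduces to embedding the image and a set of candidate label prompts, then picking the closest text embedding. The sketch below uses the Hugging Face transformers implementation; the checkpoint name, prompts, and local file name are common illustrative choices, not requirements.

```python
# Zero-shot image classification with CLIP via Hugging Face transformers.
# Checkpoint, prompts, and "example.jpg" are illustrative; any CLIP checkpoint works the same way.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
image = Image.open("example.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```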
Performance Without Fine-tuning (Zero-shot)
CLIP's most distinctive feature is its zero-shot classification capability:
ImageNet: 76.2% top-1 accuracy (ViT-L/14)
CIFAR-100: 72.3% top-1 accuracy
Kinetics 400: 60.4% top-1 accuracy
Oxford Pets: 89.4% top-1 accuracy
These results are achieved without any training on the target datasets, demonstrating CLIP's ability to generalize from natural language supervision.
Performance With Fine-tuning
When fine-tuned on specific datasets, CLIP achieves even stronger results:
ImageNet: 85-89% top-1 accuracy (depending on model size and fine-tuning approach)
CIFAR-100: 90.1% top-1 accuracy
Oxford Pets: 93.5% top-1 accuracy
CLIP's fine-tuned performance is competitive with specialized models, while maintaining the flexibility of its multimodal representations.
4. ConvNeXt V2
Model Architecture
ConvNeXt V2 represents a modern evolution of convolutional neural networks, incorporating innovations from transformer architectures while maintaining the efficiency of CNNs. Key architectural features include:
Fully Convolutional Masked Autoencoder (FCMAE): A self-supervised pre-training approach that masks random patches of the input image and trains the network to reconstruct them
Global Response Normalization (GRN): A novel normalization layer that enhances inter-channel feature competition, improving representation quality
ConvNeXt Block: The basic building block includes:
Depthwise convolution with large kernel size (7×7)
Pointwise convolutions for channel mixing
Layer normalization and GELU activation functions
Residual connections
The architecture follows a hierarchical design with four stages, progressively reducing spatial resolution while increasing channel dimensions, similar to traditional CNN architectures but with modern design choices.
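The GRN layer is compact enough to show in full. The sketch below follows the formulation described in the ConvNeXt V2 paper, assuming channels-last tensors of shape (N, H, W, C); the epsilon value is a common stabilizing choice and should be checked against the official code.

```python
# Global Response Normalization (GRN) sketch, following the ConvNeXt V2 paper's description:
# aggregate per-channel spatial norms, normalize them across channels, and recalibrate features.
import torch
import torch.nn as nn

class GRN(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        # Learnable affine parameters, initialized to zero so GRN starts as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # x: (N, H, W, C), channels-last.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)        # per-channel global L2 norm
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)     # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x             # recalibrate, keep residual path
```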
Model Size
ConvNeXt V2 is available in multiple configurations, ranging from extremely lightweight to very large:
ConvNeXt V2-Atto: 3.7M parameters, 0.55G FLOPs
ConvNeXt V2-Femto: 5.2M parameters, 0.78G FLOPs
ConvNeXt V2-Pico: 9.1M parameters, 1.37G FLOPs
ConvNeXt V2-Nano: 15.6M parameters, 2.45G FLOPs
ConvNeXt V2-Tiny: 28.6M parameters, 4.47G FLOPs
ConvNeXt V2-Base: 89M parameters, 15.4G FLOPs
ConvNeXt V2-Large: 198M parameters, 34.4G FLOPs
ConvNeXt V2-Huge: 660M parameters, 115G FLOPs
Performance Without Fine-tuning
ConvNeXt V2 models are pre-trained using the FCMAE approach, which provides strong representations for transfer learning (a linear-probing sketch follows this list):
Linear probing on ImageNet: 78.2% top-1 accuracy (ConvNeXt V2-Base)
Strong feature representations for various downstream tasks
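Linear probing means freezing the pre-trained backbone and training only a linear classification head on its pooled features. The sketch below uses a timm backbone; the model name, class count, and optimizer settings are illustrative assumptions, and any feature-extracting backbone could be substituted.

```python
# Linear probing sketch: freeze a pre-trained backbone, train only a linear head on top.
# The timm model name, class count, and learning rate are illustrative placeholders.
import timm
import torch
import torch.nn as nn

backbone = timm.create_model("convnextv2_tiny", pretrained=True, num_classes=0)  # pooled features only
for param in backbone.parameters():
    param.requires_grad = False                  # frozen encoder

head = nn.Linear(backbone.num_features, 1000)    # linear classifier for 1,000 classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    with torch.no_grad():                        # backbone stays fixed during probing
        features = backbone(images)
    logits = head(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```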
Performance With Fine-tuning
When fine-tuned on ImageNet-1K:
ConvNeXt V2-Atto: 76.7% top-1 accuracy
ConvNeXt V2-Femto: 78.5% top-1 accuracy
ConvNeXt V2-Pico: 80.3% top-1 accuracy
ConvNeXt V2-Nano: 81.9% top-1 accuracy
ConvNeXt V2-Tiny: 83.0% top-1 accuracy
ConvNeXt V2-Base: 84.9% top-1 accuracy
ConvNeXt V2-Large: 85.8% top-1 accuracy
ConvNeXt V2-Huge: 86.3% top-1 accuracy
When fine-tuned on ImageNet-22K and then ImageNet-1K:
ConvNeXt V2-Large (384×384): 88.2% top-1 accuracy
ConvNeXt V2-Huge (512×512): 88.9% top-1 accuracy
ConvNeXt V2 also demonstrates excellent performance on object detection and segmentation tasks, showing the versatility of its learned representations.
5. EfficientNet
Model Architecture
EfficientNet pioneered a systematic approach to model scaling through compound scaling, which uniformly scales network width, depth, and resolution. The architecture includes:
MBConv (Mobile Inverted Bottleneck Convolution) blocks: The primary building block, inspired by MobileNetV2
Expands channels in the first 1×1 convolution
Applies depthwise convolution for spatial mixing
Projects back to a smaller number of channels
Includes squeeze-and-excitation optimization for channel attention
Compound Scaling Method: Uses a compound coefficient φ to uniformly scale:
Network depth (d = α^φ)
Network width (w = β^φ)
Input resolution (r = γ^φ)
where α, β, and γ are constants determined through a small grid search, subject to the constraint α · β² · γ² ≈ 2 so that total FLOPs grow by roughly 2^φ.
The architecture follows a mobile-first design philosophy, prioritizing efficiency while maintaining high accuracy.
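Using the coefficients reported in the EfficientNet paper (α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15, found by grid search at φ = 1), the helper below shows how the depth, width, and resolution multipliers grow with the compound coefficient φ. The rounding policy is a simplification; the official B1-B7 models apply additional channel and depth rounding rules.

```python
# Compound scaling sketch: depth, width, and resolution all grow with one coefficient phi.
# Alpha/beta/gamma are the grid-searched constants reported in the EfficientNet paper;
# the rounding here simplifies the official channel/depth rounding rules.
BASE_RESOLUTION = 224
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: float):
    depth_mult = ALPHA ** phi                             # more layers per stage
    width_mult = BETA ** phi                              # more channels per layer
    resolution = round(BASE_RESOLUTION * GAMMA ** phi)    # larger input images
    return depth_mult, width_mult, resolution

for phi in range(0, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {r}px")
```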
Model Size
EfficientNet is available in multiple configurations, from B0 (smallest) to B7 (largest):
EfficientNet-B0: 5.3M parameters, 0.39B FLOPs
EfficientNet-B1: 7.8M parameters, 0.70B FLOPs
EfficientNet-B2: 9.2M parameters, 1.0B FLOPs
EfficientNet-B3: 12M parameters, 1.8B FLOPs
EfficientNet-B4: 19M parameters, 4.2B FLOPs
EfficientNet-B5: 30M parameters, 9.9B FLOPs
EfficientNet-B6: 43M parameters, 19B FLOPs
EfficientNet-B7: 66M parameters, 37B FLOPs
EfficientNetV2, an improved version, offers even better efficiency and training speed.
Performance Without Fine-tuning
EfficientNet models are typically trained in a supervised manner and do not offer the zero-shot capabilities of models like CLIP or CoCa. However, they serve as excellent feature extractors for transfer learning (a fine-tuning sketch follows this list):
Linear probing on various datasets shows strong performance
Feature representations transfer well to downstream tasks
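A common transfer-learning recipe is to load ImageNet-pretrained weights and replace the classification head for the target task, as sketched below with torchvision's EfficientNet-B0. The target class count and learning rates are placeholders for a real dataset.

```python
# Transfer-learning sketch: reuse a pretrained EfficientNet-B0 and swap its classifier head.
# Target class count and learning rates are placeholders, not tuned values.
import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 10  # e.g. a small domain-specific dataset

model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)

# The final classifier is a Dropout + Linear pair; replace the Linear layer for the new task.
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, num_target_classes)

# Fine-tune the whole network, with a smaller learning rate on the pretrained backbone.
optimizer = torch.optim.AdamW([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()
```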
Performance With Fine-tuning
When fine-tuned on ImageNet-1K:
EfficientNet-B0: 77.1% top-1 accuracy
EfficientNet-B1: 79.1% top-1 accuracy
EfficientNet-B2: 80.1% top-1 accuracy
EfficientNet-B3: 81.6% top-1 accuracy
EfficientNet-B4: 82.9% top-1 accuracy
EfficientNet-B5: 83.6% top-1 accuracy
EfficientNet-B6: 84.0% top-1 accuracy
EfficientNet-B7: 84.3% top-1 accuracy
EfficientNetV2-L achieves 85.7% top-1 accuracy when trained on ImageNet-1K alone, and higher accuracy still when pretrained on ImageNet-21K before fine-tuning on ImageNet-1K.
EfficientNet models excel in resource-constrained environments, offering an excellent balance between accuracy and computational efficiency, making them ideal for mobile and edge devices.
Comparative Analysis
When comparing the top five image classification models of 2025, several key trends and trade-offs emerge:
Performance vs. Model Size
Highest Accuracy: CoCa achieves the best overall performance with 91.0% top-1 accuracy on ImageNet after fine-tuning, but requires 2.1B parameters.
Efficiency Leader: EfficientNet provides the best accuracy-to-parameter ratio, with EfficientNet-B0 achieving 77.1% accuracy with only 5.3M parameters.
Middle Ground: ConvNeXt V2 offers a strong balance, with the Tiny variant (28.6M parameters) achieving 83.0% accuracy.
Zero-Shot Capabilities
Superior Zero-Shot: CLIP and CoCa excel in zero-shot classification, enabling them to generalize to new classes without specific training.
Limited Zero-Shot: ConvNeXt V2 and EfficientNet require fine-tuning for optimal performance on new tasks.
Emerging Capability: DaViT-Giant shows promising zero-shot abilities when scaled to larger sizes.
Architectural Approaches
Pure Transformer: CLIP (ViT variant) and DaViT are based primarily on transformer architectures.
Pure CNN: EfficientNet maintains a traditional CNN design with modern optimizations.
Hybrid Approaches: CoCa combines transformer-based vision and language models, while ConvNeXt V2 incorporates transformer-inspired elements into a CNN framework.
Deployment Considerations
Edge Devices: EfficientNet and smaller ConvNeXt V2 variants (Atto, Femto, Pico) are well-suited for mobile and edge deployment.
Cloud Deployment: Larger models like CoCa and DaViT-Giant are more appropriate for cloud-based applications where computational resources are abundant.
Versatility: CLIP offers unique capabilities for applications requiring flexible classification without retraining.
Comparison Table of State-of-the-Art Image Classification Models (2025)
Model Comparison by Key Metrics
| Model | Architecture Brief | Sizes Available | Performance Without Fine-tuning | Performance After Fine-tuning |
|---|---|---|---|---|
| CoCa | Encoder-decoder with cascaded decoder; ViT-based image encoder | Single large model (2.1B parameters) | ImageNet: 86.3% top-1; Kinetics-400: 79.4% top-1 | ImageNet: 91.0% top-1; with frozen encoder: 90.6% top-1 |
| DaViT | Transformer with dual attention mechanisms (spatial + channel) | Tiny: 28.3M; Small: 49.7M; Base: 87.9M; Giant: 1.4B | DaViT-Giant: ~85% top-1 on ImageNet (zero-shot) | Tiny: 82.8%; Small: 84.2%; Base: 84.6%; Giant: 90.4% |
| CLIP | Dual-encoder with separate image and text encoders | ViT-B/32: ~150M; ViT-B/16: ~150M; ViT-L/14: ~400M; ResNet variants: 102-167M | ImageNet: 76.2%; CIFAR-100: 72.3%; Oxford Pets: 89.4% | ImageNet: 85-89%; CIFAR-100: 90.1%; Oxford Pets: 93.5% |
| ConvNeXt V2 | CNN architecture with transformer-inspired elements (FCMAE pre-training, GRN) | Atto: 3.7M; Femto: 5.2M; Pico: 9.1M; Nano: 15.6M; Tiny: 28.6M; Base: 89M; Large: 198M; Huge: 660M | Linear probing on ImageNet: 78.2% top-1 (Base) | Atto: 76.7%; Tiny: 83.0%; Base: 84.9%; Huge: 86.3% (ImageNet-1K only); Huge at 512×512 with ImageNet-22K pretraining: 88.9% |
| EfficientNet | CNN with MBConv blocks and compound scaling | B0: 5.3M through B7: 66M | Supervised only (no zero-shot); strong transfer-learning features | B0: 77.1%; B4: 82.9%; B7: 84.3%; EfficientNetV2-L: 85.7% |
Note: All performance metrics for fine-tuned models are on ImageNet-1K unless otherwise specified. The "Performance After Fine-tuning" column shows the accuracy achieved after model fine-tuning on specific datasets.
Conclusion
The landscape of image classification in 2025 reflects the remarkable progress made in computer vision over the past decade. The five models highlighted in this report—CoCa, DaViT, CLIP, ConvNeXt V2, and EfficientNet—represent diverse approaches to the fundamental task of categorizing images, each with its own strengths and optimal use cases.
Several key trends are evident in these state-of-the-art models:
Multimodal Learning: The integration of vision and language, as exemplified by CoCa and CLIP, has enabled more flexible and powerful classification systems that can leverage natural language supervision.
Architectural Convergence: The boundaries between CNNs and transformers are blurring, with hybrid approaches like ConvNeXt V2 incorporating the best aspects of both paradigms.
Scaling Efficiency: Models like EfficientNet and the smaller ConvNeXt V2 variants demonstrate that thoughtful architecture design can yield impressive performance even with limited parameters.
Zero-Shot Capabilities: The ability to classify images without specific training on target categories, pioneered by CLIP and enhanced by CoCa, represents a significant advancement toward more general visual intelligence.
As computer vision continues to evolve, we can expect further innovations that build upon these foundations, potentially combining the efficiency of CNNs, the representational power of transformers, and the flexibility of multimodal learning into even more capable systems.
For practitioners, the choice of model should be guided by specific requirements:
For maximum accuracy with abundant computational resources, CoCa represents the current pinnacle.
For deployment on resource-constrained devices, EfficientNet and smaller ConvNeXt V2 variants offer excellent efficiency.
For applications requiring flexible classification without retraining, CLIP provides unmatched zero-shot capabilities.
For a balance of global and local feature modeling, DaViT offers a compelling dual-attention approach.
As these models continue to be refined and new approaches emerge, image classification will remain a cornerstone of computer vision, enabling increasingly sophisticated applications across diverse domains.
References
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., & Wu, Y. (2022). CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv:2205.01917.
Ding, M., Xiao, B., Codella, N., Luo, P., Wang, J., & Yuan, L. (2022). DaViT: Dual Attention Vision Transformers. arXiv:2204.03645.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I. S., & Xie, S. (2023). ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. arXiv:2301.00808.
Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning.
Tan, M., & Le, Q. (2021). EfficientNetV2: Smaller Models and Faster Training. In International Conference on Machine Learning.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211-252.