Image segmentation is a fundamental computer vision task that has seen remarkable advancements in recent years. As of 2025, the field has evolved significantly with the emergence of foundation models, unified architectures, and specialized networks that push the boundaries of what's possible in visual understanding. This report provides a comprehensive overview of image segmentation, its applications, and the top five state-of-the-art models currently dominating the field.
Definition and Explanation
Image segmentation is a computer vision technique that divides a digital image into multiple segments or regions, each corresponding to a different object or part of the image. Unlike simple classification that identifies what is in an image, or object detection that locates objects with bounding boxes, image segmentation creates a pixel-level understanding of the image by assigning a class label to each pixel. This process transforms the representation of an image from a grid of pixels into a more meaningful and easier-to-analyze collection of segments.
The goal of image segmentation is to simplify and/or change the representation of an image into something more meaningful and easier to analyze. It is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
Types of Image Segmentation
There are several types of image segmentation tasks, each serving different purposes:
Semantic Segmentation: Assigns a class label to each pixel in the image without differentiating between instances of the same class. For example, all pixels belonging to "person" would have the same label regardless of how many people are in the image.
Instance Segmentation: Goes beyond semantic segmentation by distinguishing between different instances of the same class. For example, if there are multiple people in an image, each person would be segmented separately with a unique identifier.
Panoptic Segmentation: Combines semantic and instance segmentation, providing a complete scene understanding. It segments both countable objects (like people, cars) as individual instances and uncountable background elements (like sky, road) as semantic regions.
Video Segmentation: Extends image segmentation to video frames, maintaining temporal consistency across frames to track objects over time.
Interactive Segmentation: Allows user input (like clicks or rough outlines) to guide the segmentation process, enabling more precise control over the results.
Open-Vocabulary Segmentation: Can segment objects described by arbitrary text prompts, even if they weren't explicitly included in the training data.
How Image Segmentation Works
Modern image segmentation approaches primarily use deep learning techniques, particularly Convolutional Neural Networks (CNNs) and Transformer architectures. These models typically follow an encoder-decoder structure:
Encoder: Extracts features from the input image at multiple scales, capturing both fine details and broader contextual information.
Decoder: Uses the encoded features to generate a segmentation mask, often through upsampling operations that restore the spatial resolution of the image.
Skip Connections: Many architectures use skip connections between encoder and decoder layers to preserve fine spatial details that might otherwise be lost during encoding.
The output is a segmentation mask—a matrix with the same dimensions as the input image where each element corresponds to a pixel's class assignment.
Example of Image Segmentation
Consider a street scene photograph containing cars, pedestrians, buildings, and a road. Image segmentation would process this image as follows:
Input: The original RGB image (e.g., 1024×768 pixels).
Processing: The segmentation model analyzes the image, identifying patterns and features that correspond to different objects.
Output: A segmentation mask where each pixel is assigned a class label. For instance:
Red pixels might represent cars
Blue pixels might represent pedestrians
Green pixels might represent vegetation
Gray pixels might represent the road
Brown pixels might represent buildings
This segmentation mask provides a detailed understanding of the scene, showing precisely where each object is located down to the pixel level. In instance segmentation, each car and each pedestrian would have a unique identifier, allowing the system to count and track individual objects.
Applications of Image Segmentation
Image segmentation has numerous practical applications across various domains:
Autonomous Driving: Identifying road boundaries, vehicles, pedestrians, and obstacles for navigation and safety.
Medical Imaging: Detecting and outlining tumors, organs, or other structures in MRI, CT, or ultrasound scans to assist in diagnosis and treatment planning.
Satellite Imagery Analysis: Mapping land use, monitoring deforestation, urban planning, and disaster response.
Augmented Reality: Enabling realistic object placement and interaction by understanding the 3D structure of scenes.
Industrial Inspection: Detecting defects in manufacturing processes, quality control, and product sorting.
Video Editing and Production: Facilitating background replacement, special effects, and object removal in video content.
Robotics: Helping robots understand their environment for navigation, manipulation, and interaction.
Agriculture: Monitoring crop health, detecting diseases, and optimizing resource usage in precision farming.
The versatility and precision of image segmentation make it a fundamental technique in computer vision with far-reaching implications for how machines perceive and interact with the visual world.
Top 5 Image Segmentation Models in 2025
After comprehensive research and evaluation of the latest state-of-the-art open source AI models used for image segmentation in 2025, the following five models have been identified as the leaders in the field:
1. SAM 2 (Segment Anything Model 2)
Architecture
SAM 2 is Meta's latest foundation model for image and video segmentation, building upon the success of the original SAM. It features a unified architecture that can handle both image and video segmentation tasks through a transformer-based framework with streaming memory.
The architecture consists of:
Image Encoder: Processes input images to extract high-level features
Video Encoder: Extends image capabilities to video with temporal modeling
Prompt Encoder: Transforms various types of prompts (points, boxes, masks, text) into embeddings
Mask Decoder: Generates segmentation masks based on the encoded features and prompts
Streaming Memory: Enables efficient processing of video sequences
Building upon SAM 2, Grounded SAM 2 integrates additional models to enhance its capabilities:
Grounding DINO: Provides open-set object detection, allowing the model to identify and localize objects based on textual prompts. Florence-2: A multimodal model that facilitates open-vocabulary object detection and grounding, enabling the system to understand and process complex visual tasks.
This integration allows Grounded SAM 2 to perform tasks such as grounding and tracking any object in videos using textual prompts, enhancing its applicability in various domains.
Model Size and Variants
SAM 2 comes in four distinct variants to accommodate different computational requirements:
SAM 2 Tiny: 38.9 million parameters, optimized for speed (47.2 FPS on A100 GPU)
SAM 2 Small: Balanced performance and speed
SAM 2 Base Plus: Enhanced capabilities for complex tasks
SAM 2 Large: Maximum accuracy for demanding applications
Performance Without Fine-tuning (Zero-Shot)
SAM 2 demonstrates exceptional zero-shot capabilities:
Excellent generalization on open-domain images
Strong performance on common objects and scenes
Can segment almost anything without prior training on specific classes
Handles both image and video segmentation tasks
Struggles with domain-specific tasks (industrial inspection, medical imaging)
Issues with edge alignment and fragmented masks in specialized domains
Performance With Fine-tuning
When fine-tuned on specific domains, SAM 2 shows significant improvements:
Better edge alignment and contour definition
Reduced fragmentation in masks
Improved handling of domain-specific artifacts and lighting conditions
Enhanced ability to respond to non-standard prompts
Critical performance improvements for industrial QA, pathology, and satellite imaging
Fine-tuning on VIPOSeg training set improves performance to G=79.7 on VIPOSeg validation
Training Dataset
SA-V dataset: ~600K+ masklets on ~51K videos
Geographically diverse data from 47 countries
Annotations include whole objects, parts, and challenging occlusions
2. OMG-Seg (One Model for Many Segmentation Tasks)
Architecture
OMG-Seg is a unified segmentation framework capable of handling 10 different segmentation tasks in a single model. It follows a transformer-based encoder-decoder architecture with specific modifications:
VLM Encoder as Backbone: Uses a frozen CLIP model as a feature extractor
Pixel Decoder: Consists of multi-layer deformable attention layers that transform frozen features into fused features
Combined Object Queries: Generates mask outputs for different tasks
Shared Multi-task Decoder: Produces segmentation masks for all supported tasks
Model Size and Variants
ConvNeXt-Large (frozen) backbone: Primary variant
ConvNeXt-XL Large (frozen) backbone: Enhanced variant for higher accuracy
Performance Without Fine-tuning (Zero-Shot)
OMG-Seg demonstrates strong zero-shot capabilities due to its CLIP backbone:
Can generalize to unseen classes without specific training
Performs well on open-vocabulary tasks without additional training
Comparable performance to specialized models in zero-shot settings
Effective across both image and video domains
Performance With Fine-tuning
Performance improves significantly with task-specific fine-tuning:
Co-training on multiple datasets enhances cross-task performance
Fine-tuning on specific domains yields 5-15% improvement in accuracy
Training conducted using 32 A100 GPUs in a distributed environment
Performance Across Tasks
Semantic Segmentation (COCO-PS): 33.5 mAP
Semantic Segmentation (Cityscapes-PS): 65.7 mAP
Instance Segmentation (COCO-IS): 44.5 mAP
Panoptic Segmentation (VIPSeg-VPS): 49.1 mAP
Video Semantic Segmentation (YT-VIS-19): 60.3 mAP
Video Instance Segmentation (YT-VIS-21-OV): 55.2 mAP
DeepLabV3+ is an advanced semantic segmentation model with an encoder-decoder structure. Key architectural components include:
Encoder: Typically uses Xception network as backbone
Atrous (Dilated) Convolution: Enables multi-scale feature extraction without increasing parameters
Atrous Spatial Pyramid Pooling (ASPP): Captures multi-scale context by applying parallel atrous convolutions with different rates
Decoder Module: Refines segmentation boundaries through upsampling and skip connections
Model Size and Variants
Standard DeepLabV3+: ~40-60M parameters depending on backbone
MST-DeepLabV3+: Uses MobileNetV2 as backbone to reduce parameters while incorporating SENet attention mechanism
LM-DeepLabV3+: Lightweight version aimed at reducing parameters and computations
Performance Without Fine-tuning (Zero-Shot)
Traditional DeepLabV3+ is not designed for zero-shot learning:
Limited generalization to unseen classes without fine-tuning
Requires domain-specific training for optimal performance
Recent adaptations incorporate foundation model features to improve zero-shot capabilities
Performance With Fine-tuning
DeepLabV3+ shows excellent performance when fine-tuned:
MST-DeepLabV3+ on ISPRS dataset: 82.47% Mean IoU, 92.13% Overall Accuracy
Strong performance on high-resolution images
Effective edge detection and boundary preservation
Adaptable to various domains through transfer learning
Fine-tuning on domain-specific data shows 10-20% improvement over zero-shot approaches
4. HRNet (Modified 2025 Version)
Architecture
High-Resolution Network (HRNet) maintains high-resolution representations throughout the network, which is crucial for precise segmentation. The 2025 modified version includes:
Parallel Multi-Resolution Subnetworks: Processes information at multiple scales simultaneously
Repeated Multi-Scale Fusions: Exchanges information across parallel subnetworks
Feature Pyramids: Extracts multi-scale features for comprehensive scene understanding
Optimized Feature Blocks: Enhanced feature extraction in the 2025 version
Advanced Feature Extraction Techniques: Improved computational efficiency while maintaining accuracy
Model Size and Variants
HRNet-W18: Smaller variant with ~10M parameters
HRNet-W32: Medium variant with ~28M parameters
HRNet-W48: Larger variant with ~65M parameters
Modified HRNet (2025): Enhanced architecture with optimized blocks
Performance Without Fine-tuning (Zero-Shot)
Similar to DeepLabV3+, traditional HRNet is not designed for zero-shot segmentation:
Requires task-specific training for optimal performance
Limited generalization to unseen domains without adaptation
Recent modifications incorporate foundation model features to improve zero-shot capabilities
Performance With Fine-tuning
The 2025 modified HRNet shows significant improvements when fine-tuned:
Cityscapes dataset: 85.8% validation accuracy, 63.43% Mean IoU
Improvement over original HRNet: 3.39% (accuracy) and 3.43% (mIoU)
Produces more defined segmentation contours
Accurate object identifications across diverse scales
Robust handling of diverse object scales and complexities
Precise delineation of intricate landscapes
5. Mask-RCNN
Architecture
Mask R-CNN is a two-stage instance segmentation model that extends Faster R-CNN with a mask prediction branch:
Backbone Network: Typically ResNet-50 or ResNet-101 for feature extraction
Region Proposal Network (RPN): Generates region proposals for potential objects
RoI Align: Precisely aligns extracted features with input regions
Parallel Branches: Separate branches for classification, bounding box regression, and mask prediction
Model Size and Variants
Mask R-CNN with ResNet-50 backbone: ~44M parameters
Mask R-CNN with ResNet-101 backbone: ~63M parameters
Mask R-CNN with FPN (Feature Pyramid Network): Additional ~2M parameters
Mask R-CNN with ResNeXt-101 backbone: ~85M parameters
Performance Without Fine-tuning (Zero-Shot)
Traditional Mask R-CNN is not designed for zero-shot learning:
Limited generalization to unseen classes without fine-tuning
Resource-Constrained Environments: SAM 2 Tiny or lightweight DeepLabV3+ variants
Multi-Task Requirements: OMG-Seg
Interactive Segmentation: SAM 2
Video Segmentation: SAM 2 or OMG-Seg
Future Trends
The field of image segmentation continues to evolve rapidly, with several emerging trends that will likely shape its future:
Unified Multi-Task Models: Following OMG-Seg's approach, more models will aim to handle multiple segmentation tasks within a single architecture, reducing the need for task-specific models.
Foundation Model Integration: Traditional segmentation architectures will increasingly incorporate features from foundation models like CLIP to improve zero-shot capabilities and generalization.
Efficient Zero-Shot Learning: Research will focus on improving zero-shot segmentation performance while reducing computational requirements, making these capabilities more accessible.
Video-First Approaches: As demonstrated by SAM 2, future models will be designed with video segmentation as a primary capability rather than an extension of image segmentation.
Edge Deployment Optimization: Continued development of lightweight variants and quantization techniques to enable high-quality segmentation on edge devices.
Domain-Specific Fine-Tuning Techniques: More efficient methods for adapting general-purpose models to specialized domains with minimal data and computational resources.
Multimodal Integration: Increasing integration of text, audio, and other modalities to enhance segmentation capabilities and enable more intuitive interfaces.
Comparison Table of Top Image Segmentation Models
Model
Architecture (Brief)
Sizes Available (Model size considered for accuracy)
• Cityscapes: 85.8% validation accuracy, 63.43% Mean IoU
• 3.39% accuracy and 3.43% mIoU improvement over original HRNet
• More defined segmentation contours
• Accurate object identification across scales
Mask-RCNN
Two-stage instance segmentation model extending Faster R-CNN with a mask prediction branch, including backbone network, region proposal network, RoI Align, and parallel branches
• Fine-tuning on 10 examples per class yields significant improvements
• 15-25% improvement over training from scratch
Conclusion
Image segmentation has evolved significantly in 2025, with models like SAM 2 and OMG-Seg pushing the boundaries of what's possible in visual understanding. The trend toward unified architectures capable of handling multiple tasks represents a significant shift from the specialized models of previous years. While traditional architectures like DeepLabV3+, HRNet, and Mask-RCNN continue to be relevant, especially in specific domains, the integration of foundation model capabilities is transforming the field.
The choice between zero-shot capabilities and fine-tuned performance presents an important trade-off, with different models excelling in different scenarios. For applications requiring immediate deployment without task-specific training, SAM 2 and OMG-Seg offer compelling options. For scenarios where maximum accuracy is critical and domain-specific data is available, fine-tuned models like DeepLabV3+ and HRNet remain strong choices.
As the field continues to advance, we can expect further improvements in model efficiency, generalization capabilities, and ease of adaptation to specific domains, making powerful image segmentation increasingly accessible across a wide range of applications.
References
Li, X., Yuan, H., Li, W., Ding, H., Wu, S., Zhang, W., Li, Y., Chen, K., & Loy, C. C. (2024). OMG-Seg: Is One Model Good Enough For All Segmentation? Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2401.10229
GitHub. (n.d.). matterport/Mask_RCNN: Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN