Table of Contents

Introduction
    Types of Image Segmentation
    How Image Segmentation Works
    Example of Image Segmentation
    Applications of Image Segmentation
Top 5 Image Segmentation Models in 2025
    1. SAM 2 (Segment Anything Model 2)
      Architecture
      Model Size and Variants
      Performance Without Fine-tuning (Zero-Shot)
      Performance With Fine-tuning
      Training Dataset
    2. OMG-Seg (One Model for Many Segmentation Tasks)
      Architecture
      Model Size and Variants
      Performance Without Fine-tuning (Zero-Shot)
      Performance With Fine-tuning
      Performance Across Tasks
    3. DeepLabV3+
      Architecture
      Model Size and Variants
      Performance Without Fine-tuning (Zero-Shot)
      Performance With Fine-tuning
    4. HRNet (Modified 2025 Version)
      Architecture
      Model Size and Variants
      Performance Without Fine-tuning (Zero-Shot)
      Performance With Fine-tuning
    5. Mask-RCNN
      Architecture
      Model Size and Variants
      Performance Without Fine-tuning (Zero-Shot)
      Performance With Fine-tuning
Comparative Analysis
    Model Capabilities
    Performance Comparison
    Use Case Recommendations
Future Trends
Comparison Table of Top Image Segmentation Models
Conclusion
References

Image Segmentation: State-of-the-Art Models in 2025

11 min read
Artificial Intelligence
Rohit Aggarwal
Stephen Hayes
Harpreet Singh

 

Image source: Viso.ai, “OMG-SEG: Open-Vocabulary Semantic Segmentation,” Viso.ai – Computer Vision. https://viso.ai/computer-vision/omg-seg/
 



Introduction

Image segmentation is a fundamental computer vision task that has seen remarkable advancements in recent years. As of 2025, the field has evolved significantly with the emergence of foundation models, unified architectures, and specialized networks that push the boundaries of what's possible in visual understanding. This report provides a comprehensive overview of image segmentation, its applications, and the top five state-of-the-art models currently dominating the field.

 

Definition and Explanation

Image segmentation is a computer vision technique that divides a digital image into multiple segments or regions, each corresponding to a different object or part of the image. Unlike simple classification that identifies what is in an image, or object detection that locates objects with bounding boxes, image segmentation creates a pixel-level understanding of the image by assigning a class label to each pixel. This process transforms the representation of an image from a grid of pixels into a more meaningful and easier-to-analyze collection of segments.

It is typically used to locate objects and their boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel such that pixels with the same label share certain characteristics.

 

Types of Image Segmentation

There are several types of image segmentation tasks, each serving different purposes:

  1. Semantic Segmentation: Assigns a class label to each pixel in the image without differentiating between instances of the same class. For example, all pixels belonging to "person" would have the same label regardless of how many people are in the image.
  2. Instance Segmentation: Goes beyond semantic segmentation by distinguishing between different instances of the same class. For example, if there are multiple people in an image, each person would be segmented separately with a unique identifier.
  3. Panoptic Segmentation: Combines semantic and instance segmentation, providing a complete scene understanding. It segments both countable objects (like people, cars) as individual instances and uncountable background elements (like sky, road) as semantic regions.
  4. Video Segmentation: Extends image segmentation to video frames, maintaining temporal consistency across frames to track objects over time.
  5. Interactive Segmentation: Allows user input (like clicks or rough outlines) to guide the segmentation process, enabling more precise control over the results.
  6. Open-Vocabulary Segmentation: Can segment objects described by arbitrary text prompts, even if they weren't explicitly included in the training data.

 

How Image Segmentation Works

Modern image segmentation approaches primarily use deep learning techniques, particularly Convolutional Neural Networks (CNNs) and Transformer architectures. These models typically follow an encoder-decoder structure:

  1. Encoder: Extracts features from the input image at multiple scales, capturing both fine details and broader contextual information.
  2. Decoder: Uses the encoded features to generate a segmentation mask, often through upsampling operations that restore the spatial resolution of the image.
  3. Skip Connections: Many architectures use skip connections between encoder and decoder layers to preserve fine spatial details that might otherwise be lost during encoding.

The output is a segmentation mask—a matrix with the same dimensions as the input image where each element corresponds to a pixel's class assignment.
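To make the encoder-decoder idea concrete, here is a minimal sketch in PyTorch. All layer sizes and the class count are illustrative assumptions, not taken from any model in this report; the point is the downsample-upsample path with one skip connection.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Encoder: extract features while reducing spatial resolution
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder: upsample back to the input resolution
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)  # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)              # (B, 16, H, W)
        e2 = self.enc2(self.down(e1))  # (B, 32, H/2, W/2)
        d = self.up(e2)                # (B, 16, H, W)
        d = self.dec(torch.cat([d, e1], dim=1))  # skip connection preserves detail
        return self.head(d)            # (B, num_classes, H, W)

logits = TinySegNet()(torch.randn(1, 3, 64, 64))
mask = logits.argmax(dim=1)  # (1, 64, 64) per-pixel class labels
print(mask.shape)
```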

 

Example of Image Segmentation

Consider a street scene photograph containing cars, pedestrians, buildings, and a road. Image segmentation would process this image as follows:

  1. Input: The original RGB image (e.g., 1024×768 pixels).
  2. Processing: The segmentation model analyzes the image, identifying patterns and features that correspond to different objects.
  3. Output: A segmentation mask where each pixel is assigned a class label. For instance:
    • Red pixels might represent cars
    • Blue pixels might represent pedestrians
    • Green pixels might represent vegetation
    • Gray pixels might represent the road
    • Brown pixels might represent buildings

This segmentation mask provides a detailed understanding of the scene, showing precisely where each object is located down to the pixel level. In instance segmentation, each car and each pedestrian would have a unique identifier, allowing the system to count and track individual objects.
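A short sketch of how such a mask might be rendered for inspection; the class IDs and color palette are assumptions chosen to match the street-scene example above.

```python
import numpy as np

# Assumed class IDs and palette for the street-scene example (illustrative)
PALETTE = {
    0: (128, 128, 128),  # road -> gray
    1: (255, 0, 0),      # car -> red
    2: (0, 0, 255),      # pedestrian -> blue
    3: (0, 255, 0),      # vegetation -> green
    4: (139, 69, 19),    # building -> brown
}

def colorize(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class IDs to an (H, W, 3) RGB visualization."""
    rgb = np.zeros((*mask.shape, 3), dtype=np.uint8)
    for class_id, color in PALETTE.items():
        rgb[mask == class_id] = color
    return rgb

# Tiny 4x4 mask standing in for a model's output
mask = np.array([[0, 0, 1, 1],
                 [0, 2, 1, 1],
                 [3, 3, 4, 4],
                 [3, 3, 4, 4]])
print(colorize(mask).shape)  # (4, 4, 3)
```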

 

Applications of Image Segmentation

Image segmentation has numerous practical applications across various domains:

  1. Autonomous Driving: Identifying road boundaries, vehicles, pedestrians, and obstacles for navigation and safety.
  2. Medical Imaging: Detecting and outlining tumors, organs, or other structures in MRI, CT, or ultrasound scans to assist in diagnosis and treatment planning.
  3. Satellite Imagery Analysis: Mapping land use, monitoring deforestation, urban planning, and disaster response.
  4. Augmented Reality: Enabling realistic object placement and interaction by understanding the 3D structure of scenes.
  5. Industrial Inspection: Detecting defects in manufacturing processes, quality control, and product sorting.
  6. Video Editing and Production: Facilitating background replacement, special effects, and object removal in video content.
  7. Robotics: Helping robots understand their environment for navigation, manipulation, and interaction.
  8. Agriculture: Monitoring crop health, detecting diseases, and optimizing resource usage in precision farming.

The versatility and precision of image segmentation make it a fundamental technique in computer vision with far-reaching implications for how machines perceive and interact with the visual world.

 

Top 5 Image Segmentation Models in 2025

After comprehensive research and evaluation of the latest state-of-the-art open source AI models used for image segmentation in 2025, the following five models have been identified as the leaders in the field:

1. SAM 2 (Segment Anything Model 2)

Architecture

SAM 2 is Meta's latest foundation model for image and video segmentation, building upon the success of the original SAM. It features a unified architecture that can handle both image and video segmentation tasks through a transformer-based framework with streaming memory.

The architecture consists of:

  • Image Encoder: Processes input images to extract high-level features
  • Video Encoder: Extends image capabilities to video with temporal modeling
  • Prompt Encoder: Transforms various types of prompts (points, boxes, masks, text) into embeddings
  • Mask Decoder: Generates segmentation masks based on the encoded features and prompts
  • Streaming Memory: Enables efficient processing of video sequences

Building upon SAM 2, Grounded SAM 2 integrates additional models to enhance its capabilities:

  • Grounding DINO: Provides open-set object detection, allowing the model to identify and localize objects based on textual prompts.
  • Florence-2: A multimodal model that facilitates open-vocabulary object detection and grounding, enabling the system to understand and process complex visual tasks.

This integration allows Grounded SAM 2 to perform tasks such as grounding and tracking any object in videos using textual prompts, enhancing its applicability in various domains.
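The snippet below sketches point-prompted inference with the sam2 package. The Hugging Face model ID follows Meta's public release, but treat it and the prompt coordinates as assumptions to verify against the version you install.

```python
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Model ID assumed from Meta's public SAM 2 release; adjust to your install
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-tiny")

image = np.array(Image.open("street.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # One positive click (label 1) on the object of interest (coordinates illustrative)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[512, 384]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks
    )
print(masks.shape, scores)  # (num_masks, H, W) masks with confidence scores
```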

 

Model Size and Variants

SAM 2 comes in four distinct variants to accommodate different computational requirements:

  • SAM 2 Tiny: 38.9 million parameters, optimized for speed (47.2 FPS on A100 GPU)
  • SAM 2 Small: Balanced performance and speed
  • SAM 2 Base Plus: Enhanced capabilities for complex tasks
  • SAM 2 Large: Maximum accuracy for demanding applications

 

Performance Without Fine-tuning (Zero-Shot)

SAM 2 demonstrates exceptional zero-shot capabilities:

  • Excellent generalization on open-domain images
  • Strong performance on common objects and scenes
  • Can segment almost anything without prior training on specific classes
  • Handles both image and video segmentation tasks
  • Struggles with domain-specific tasks (industrial inspection, medical imaging)
  • Issues with edge alignment and fragmented masks in specialized domains

 

Performance With Fine-tuning

When fine-tuned on specific domains, SAM 2 shows significant improvements (a hedged loss-function sketch follows this list):

  • Better edge alignment and contour definition
  • Reduced fragmentation in masks
  • Improved handling of domain-specific artifacts and lighting conditions
  • Enhanced ability to respond to non-standard prompts
  • Critical performance improvements for industrial QA, pathology, and satellite imaging
  • Fine-tuning on VIPOSeg training set improves performance to G=79.7 on VIPOSeg validation
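Fine-tuning pipelines for promptable models typically keep the image encoder frozen and optimize the mask decoder with a mix of per-pixel and overlap losses. The sketch below shows one common combination (binary cross-entropy plus Dice); this mix is an illustrative assumption, not Meta's published recipe.

```python
import torch
import torch.nn.functional as F

def mask_finetune_loss(pred_logits: torch.Tensor,
                       target: torch.Tensor,
                       dice_weight: float = 0.5) -> torch.Tensor:
    """BCE + Dice over predicted mask logits and binary ground-truth masks."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, target)
    probs = torch.sigmoid(pred_logits)
    intersection = (probs * target).sum(dim=(-2, -1))
    union = probs.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    dice = 1.0 - (2.0 * intersection + 1.0) / (union + 1.0)  # smoothed Dice loss
    return (1 - dice_weight) * bce + dice_weight * dice.mean()

# Toy check with random logits and a random binary mask
loss = mask_finetune_loss(torch.randn(2, 1, 64, 64),
                          torch.randint(0, 2, (2, 1, 64, 64)).float())
print(loss.item())
```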

 

Training Dataset

  • SA-V dataset: ~600K+ masklets on ~51K videos
  • Geographically diverse data from 47 countries
  • Annotations include whole objects, parts, and challenging occlusions

 

2. OMG-Seg (One Model for Many Segmentation Tasks)

Architecture

OMG-Seg is a unified segmentation framework capable of handling 10 different segmentation tasks in a single model. It follows a transformer-based encoder-decoder architecture with specific modifications (a frozen-backbone sketch follows this list):

  • VLM Encoder as Backbone: Uses a frozen CLIP model as a feature extractor
  • Pixel Decoder: Consists of multi-layer deformable attention layers that transform frozen features into fused features
  • Combined Object Queries: Generates mask outputs for different tasks
  • Shared Multi-task Decoder: Produces segmentation masks for all supported tasks
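A minimal sketch of the frozen-VLM-backbone pattern using open_clip. ViT-B-32 stands in for the ConvNeXt CLIP backbone OMG-Seg actually uses, and the one-layer head is purely illustrative; only the freeze-the-backbone, train-the-head division mirrors the design above.

```python
import torch
import torch.nn as nn
import open_clip

# ViT-B-32 stands in for OMG-Seg's ConvNeXt CLIP backbone (illustrative substitution)
backbone, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
for p in backbone.parameters():
    p.requires_grad = False  # frozen feature extractor, as in OMG-Seg

# Toy trainable head on top of frozen CLIP features (not OMG-Seg's real decoder)
head = nn.Linear(512, 10)

image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feats = backbone.encode_image(image)  # (1, 512) pooled CLIP features
logits = head(feats.float())
print(logits.shape)  # only `head` parameters receive gradients during training
```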

 

Model Size and Variants

  • ConvNeXt-Large (frozen) backbone: Primary variant
  • ConvNeXt-XL (frozen) backbone: Enhanced variant for higher accuracy

 

Performance Without Fine-tuning (Zero-Shot)

OMG-Seg demonstrates strong zero-shot capabilities due to its CLIP backbone:

  • Can generalize to unseen classes without specific training
  • Performs well on open-vocabulary tasks without additional training
  • Comparable performance to specialized models in zero-shot settings
  • Effective across both image and video domains

 

Performance With Fine-tuning

Performance improves significantly with task-specific fine-tuning:

  • Co-training on multiple datasets enhances cross-task performance
  • Fine-tuning on specific domains yields 5-15% improvement in accuracy
  • Training conducted using 32 A100 GPUs in a distributed environment

 

Performance Across Tasks

  • Panoptic Segmentation (COCO-PS): 33.5 PQ
  • Panoptic Segmentation (Cityscapes-PS): 65.7 PQ
  • Instance Segmentation (COCO-IS): 44.5 mAP
  • Video Panoptic Segmentation (VIPSeg-VPS): 49.1 VPQ
  • Video Instance Segmentation (YT-VIS-19): 60.3 mAP
  • Open-Vocabulary Video Instance Segmentation (YT-VIS-21-OV): 55.2 mAP
  • Open-Vocabulary Panoptic Segmentation (ADE-OV): 27.8 PQ
  • Video Object Segmentation (DAVIS-17-VOS-OV): 74.3 J&F
  • Interactive Segmentation (COCO-SAM): 76.9 mAP

 

3. DeepLabV3+

Architecture

DeepLabV3+ is an advanced semantic segmentation model with an encoder-decoder structure. Key architectural components include (an inference sketch follows this list):

  • Encoder: Typically uses Xception network as backbone
  • Atrous (Dilated) Convolution: Enables multi-scale feature extraction without increasing parameters
  • Atrous Spatial Pyramid Pooling (ASPP): Captures multi-scale context by applying parallel atrous convolutions with different rates
  • Decoder Module: Refines segmentation boundaries through upsampling and skip connections
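torchvision ships the closely related DeepLabV3 (ASPP with a ResNet backbone, without the V3+ decoder refinement or Xception backbone described above), which is enough to illustrate the pipeline; the sketch below uses its documented weights enum.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

# Note: this is DeepLabV3 (ASPP, no V3+ decoder) rather than DeepLabV3+ itself
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
transform = weights.transforms()

image = torch.rand(3, 520, 520)  # stand-in for a real RGB image tensor
with torch.inference_mode():
    out = model(transform(image).unsqueeze(0))["out"]  # (1, 21, H, W) class logits
mask = out.argmax(dim=1)  # per-pixel class IDs over the 21 pretraining classes
print(mask.shape)
```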

 

Model Size and Variants

  • Standard DeepLabV3+: ~40-60M parameters depending on backbone
  • MST-DeepLabV3+: Uses MobileNetV2 as backbone to reduce parameters while incorporating SENet attention mechanism
  • LM-DeepLabV3+: Lightweight version aimed at reducing parameters and computations

 

Performance Without Fine-tuning (Zero-Shot)

Traditional DeepLabV3+ is not designed for zero-shot learning:

  • Limited generalization to unseen classes without fine-tuning
  • Requires domain-specific training for optimal performance
  • Recent adaptations incorporate foundation model features to improve zero-shot capabilities

 

Performance With Fine-tuning

DeepLabV3+ shows excellent performance when fine-tuned (a head-replacement sketch follows this list):

  • MST-DeepLabV3+ on ISPRS dataset: 82.47% Mean IoU, 92.13% Overall Accuracy
  • Strong performance on high-resolution images
  • Effective edge detection and boundary preservation
  • Adaptable to various domains through transfer learning
  • Fine-tuning on domain-specific data shows 10-20% improvement over zero-shot approaches
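Transfer learning typically reuses the pretrained backbone and swaps the classification head for the new label set. A minimal sketch with torchvision's DeepLabHead; the class count, learning rate, and freeze-the-backbone choice are illustrative assumptions.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights
from torchvision.models.segmentation.deeplabv3 import DeepLabHead

model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT)
# Replace the pretrained head with one sized for the target domain
# (e.g., 6 land-use classes for a remote-sensing task; count is illustrative)
model.classifier = DeepLabHead(2048, 6)

# Optionally freeze the backbone and train only the new head at first
for p in model.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
```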

 

4. HRNet (Modified 2025 Version)

Architecture

High-Resolution Network (HRNet) maintains high-resolution representations throughout the network, which is crucial for precise segmentation. The 2025 modified version includes (a feature-extraction sketch follows this list):

  • Parallel Multi-Resolution Subnetworks: Processes information at multiple scales simultaneously
  • Repeated Multi-Scale Fusions: Exchanges information across parallel subnetworks
  • Feature Pyramids: Extracts multi-scale features for comprehensive scene understanding
  • Optimized Feature Blocks: Enhanced feature extraction in the 2025 version
  • Advanced Feature Extraction Techniques: Improved computational efficiency while maintaining accuracy
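HRNet's defining trait, parallel streams at several resolutions, is easiest to see when it is used as a multi-scale feature extractor. A sketch via timm; the variant name follows timm's catalog, and the exact channel widths and strides depend on timm's build.

```python
import timm
import torch

# HRNet-W18 as a multi-scale feature extractor (timm catalog name, assumed installed)
backbone = timm.create_model("hrnet_w18", pretrained=True, features_only=True)

feats = backbone(torch.randn(1, 3, 512, 512))
for f in feats:
    # One tensor per resolution stream; exact widths depend on timm's build
    print(f.shape)
```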

 

Model Size and Variants

  • HRNet-W18: Smaller variant with ~10M parameters
  • HRNet-W32: Medium variant with ~28M parameters
  • HRNet-W48: Larger variant with ~65M parameters
  • Modified HRNet (2025): Enhanced architecture with optimized blocks

 

Performance Without Fine-tuning (Zero-Shot)

Similar to DeepLabV3+, traditional HRNet is not designed for zero-shot segmentation:

  • Requires task-specific training for optimal performance
  • Limited generalization to unseen domains without adaptation
  • Recent modifications incorporate foundation model features to improve zero-shot capabilities

 

Performance With Fine-tuning

The 2025 modified HRNet shows significant improvements when fine-tuned:

  • Cityscapes dataset: 85.8% validation accuracy, 63.43% Mean IoU
  • Improvement over original HRNet: 3.39% (accuracy) and 3.43% (mIoU)
  • Produces more defined segmentation contours
  • Accurate object identifications across diverse scales
  • Robust handling of diverse object scales and complexities
  • Precise delineation of intricate landscapes

 

5. Mask-RCNN

Architecture

Mask R-CNN is a two-stage instance segmentation model that extends Faster R-CNN with a mask prediction branch (an inference sketch follows this list):

  • Backbone Network: Typically ResNet-50 or ResNet-101 for feature extraction
  • Region Proposal Network (RPN): Generates region proposals for potential objects
  • RoI Align: Precisely aligns extracted features with input regions
  • Parallel Branches: Separate branches for classification, bounding box regression, and mask prediction
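The full two-stage pipeline is exposed directly in torchvision; a minimal inference sketch, where the confidence threshold is an illustrative choice.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB image scaled to [0, 1]
with torch.inference_mode():
    pred = model([image])[0]  # dict with 'boxes', 'labels', 'scores', 'masks'

keep = pred["scores"] > 0.5   # illustrative confidence threshold
masks = pred["masks"][keep]   # (N, 1, H, W) soft masks; threshold at 0.5 for binary
print(masks.shape, pred["labels"][keep])
```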
     

Model Size and Variants

  • Mask R-CNN with ResNet-50 backbone: ~44M parameters
  • Mask R-CNN with ResNet-101 backbone: ~63M parameters
  • Mask R-CNN with FPN (Feature Pyramid Network): Additional ~2M parameters
  • Mask R-CNN with ResNeXt-101 backbone: ~85M parameters

 

Performance Without Fine-tuning (Zero-Shot)

Traditional Mask R-CNN is not designed for zero-shot learning:

  • Limited generalization to unseen classes without fine-tuning
  • Recent adaptations (2025) enable finetune-free incremental few-shot instance segmentation
  • Zero-shot performance significantly lower than fine-tuned performance
  • Novel weight generator (NWG) approaches improve zero-shot capabilities
  • Piecewise Function for Similarity Calculation (PFSC) enhances zero-shot performance

 

Performance With Fine-tuning

Mask R-CNN shows excellent performance when fine-tuned (a fine-tuning sketch follows this list):

  • MS COCO dataset: ~38-40 mAP with ResNet-50 backbone
  • MS COCO dataset: ~40-42 mAP with ResNet-101 backbone
  • Fine-tuning on as few as 10 examples per class can yield significant improvements
  • Transfer learning from pre-trained weights shows 15-25% improvement over training from scratch
  • Incremental few-shot instance segmentation (iFSIS) methods allow fine-tuning on novel classes
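Fine-tuning on a custom dataset usually means swapping the box and mask heads for the new class count, following torchvision's detection tutorial pattern; the class count below is illustrative.

```python
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 3  # illustrative: background + 2 novel classes
model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)

# Swap the box head for the new label set
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Swap the mask head to predict one mask per new class
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)
# `model` can now be trained with the standard torchvision detection loop
```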

 

Comparative Analysis

Model Capabilities

  1. SAM 2: Excels at zero-shot segmentation of both images and videos, with strong interactive capabilities.
  2. OMG-Seg: Unique in handling 10 different segmentation tasks in a single model with competitive performance.
  3. DeepLabV3+: Specialized for semantic segmentation with excellent boundary preservation.
  4. HRNet: Focuses on high-resolution feature maintenance for precise boundary delineation.
  5. Mask-RCNN: Strong instance segmentation performance with a well-established architecture.
     

Performance Comparison

  • Zero-Shot Capability: SAM 2 > OMG-Seg > DeepLabV3+ ≈ HRNet > Mask-RCNN
  • Fine-Tuned Performance: SAM 2 ≈ OMG-Seg > DeepLabV3+ > HRNet > Mask-RCNN
  • Computational Efficiency: Mask-RCNN > DeepLabV3+ > HRNet > OMG-Seg > SAM 2
  • Versatility: OMG-Seg > SAM 2 > DeepLabV3+ > HRNet > Mask-RCNN
  • Boundary Precision: HRNet > DeepLabV3+ > SAM 2 > OMG-Seg > Mask-RCNN

 

Use Case Recommendations

  • General-Purpose Segmentation: SAM 2 or OMG-Seg
  • Semantic Segmentation: DeepLabV3+ or HRNet
  • Instance Segmentation: Mask-RCNN or OMG-Seg
  • Resource-Constrained Environments: SAM 2 Tiny or lightweight DeepLabV3+ variants
  • Multi-Task Requirements: OMG-Seg
  • Interactive Segmentation: SAM 2
  • Video Segmentation: SAM 2 or OMG-Seg

 

Future Trends

The field of image segmentation continues to evolve rapidly, with several emerging trends that will likely shape its future:

  1. Unified Multi-Task Models: Following OMG-Seg's approach, more models will aim to handle multiple segmentation tasks within a single architecture, reducing the need for task-specific models.
  2. Foundation Model Integration: Traditional segmentation architectures will increasingly incorporate features from foundation models like CLIP to improve zero-shot capabilities and generalization.
  3. Efficient Zero-Shot Learning: Research will focus on improving zero-shot segmentation performance while reducing computational requirements, making these capabilities more accessible.
  4. Video-First Approaches: As demonstrated by SAM 2, future models will be designed with video segmentation as a primary capability rather than an extension of image segmentation.
  5. Edge Deployment Optimization: Continued development of lightweight variants and quantization techniques to enable high-quality segmentation on edge devices.
  6. Domain-Specific Fine-Tuning Techniques: More efficient methods for adapting general-purpose models to specialized domains with minimal data and computational resources.
  7. Multimodal Integration: Increasing integration of text, audio, and other modalities to enhance segmentation capabilities and enable more intuitive interfaces.

 

Comparison Table of Top Image Segmentation Models

| Model | Architecture (Brief) | Sizes Available (size used for accuracy) | Segmentation Type | Metric | Expected Accuracy with No Fine-Tuning | Expected Accuracy after Fine-Tuning |
|---|---|---|---|---|---|---|
| SAM 2 | Transformer with image/video encoders, prompt encoder, and mask decoder | Tiny, Small, Base Plus, Large (Base Plus) | Semantic / Panoptic | mIoU | 64% | 80% |
| OMG-Seg | CLIP backbone + deformable decoder + multi-task head | ConvNeXt-L, ConvNeXt-XL (ConvNeXt-L) | Multi-task (sem., inst.) | mAP | 60% | 70% |
| DeepLabV3+ | Xception backbone + ASPP decoder | Standard (Xception), MobileNetV2, Lite (Xception) | Semantic | mIoU | 62% | 80% |
| HRNet (2025) | Multi-resolution subnetworks + fusion blocks | W18, W32, W48 (W48) | Semantic | mIoU | 58% | 65% |
| Mask-RCNN | Two-stage (Faster R-CNN + mask head + FPN) | R50, R101, X101 (ResNet-101 + FPN) | Instance | mAP | 28% | 41% |

 

A more detailed qualitative comparison of the same five models:

| Model Name | Architecture Brief | Sizes Available | Performance Without Fine-tuning | Performance After Fine-tuning |
|---|---|---|---|---|
| SAM 2 (Segment Anything Model 2) | Transformer-based framework with image encoder, video encoder, prompt encoder, mask decoder, and streaming memory | SAM 2 Tiny (38.9M parameters); SAM 2 Small; SAM 2 Base Plus; SAM 2 Large | Excellent generalization on open-domain images; strong performance on common objects and scenes; can segment almost anything without specific training; handles both image and video segmentation; struggles with domain-specific tasks | Better edge alignment and contour definition; reduced fragmentation in masks; improved handling of domain-specific artifacts; enhanced response to non-standard prompts; VIPOSeg validation: G=79.7 |
| OMG-Seg (One Model for Many Segmentation Tasks) | Unified framework with frozen CLIP backbone, pixel decoder with deformable attention layers, combined object queries, and shared multi-task decoder | ConvNeXt-Large (frozen) backbone; ConvNeXt-XL (frozen) backbone | Strong zero-shot capabilities due to CLIP backbone; generalizes to unseen classes; performs well on open-vocabulary tasks; comparable to specialized models in zero-shot settings | 5-15% improvement with domain-specific fine-tuning; enhanced cross-task performance with co-training; COCO-PS: 33.5 PQ; Cityscapes-PS: 65.7 PQ; COCO-IS: 44.5 mAP; VIPSeg-VPS: 49.1 VPQ |
| DeepLabV3+ | Encoder-decoder structure with Xception backbone, atrous convolutions, atrous spatial pyramid pooling (ASPP), and decoder module for boundary refinement | Standard (~40-60M parameters); MST-DeepLabV3+ (MobileNetV2 backbone); LM-DeepLabV3+ (lightweight) | Not designed for zero-shot learning; limited generalization to unseen classes; requires domain-specific training; recent adaptations improve zero-shot capabilities | ISPRS dataset: 82.47% Mean IoU, 92.13% Overall Accuracy; strong performance on high-resolution images; effective edge detection and boundary preservation; 10-20% improvement over zero-shot approaches |
| HRNet (Modified 2025 Version) | Maintains high-resolution representations throughout with parallel multi-resolution subnetworks, multi-scale fusions, feature pyramids, and optimized feature blocks | HRNet-W18 (~10M parameters); HRNet-W32 (~28M); HRNet-W48 (~65M); Modified HRNet (2025) | Not designed for zero-shot segmentation; requires task-specific training; limited generalization to unseen domains; recent modifications improve zero-shot capabilities | Cityscapes: 85.8% validation accuracy, 63.43% Mean IoU; 3.39% accuracy and 3.43% mIoU improvement over original HRNet; more defined segmentation contours; accurate object identification across scales |
| Mask-RCNN | Two-stage instance segmentation model extending Faster R-CNN with a mask prediction branch: backbone network, region proposal network, RoI Align, and parallel branches | ResNet-50 backbone (~44M parameters); ResNet-101 backbone (~63M); FPN adds ~2M; ResNeXt-101 backbone (~85M) | Not designed for zero-shot learning; limited generalization to unseen classes; recent adaptations enable finetune-free few-shot segmentation; novel weight generator (NWG) improves zero-shot capabilities | MS COCO: ~38-40 mAP (ResNet-50); ~40-42 mAP (ResNet-101); fine-tuning on as few as 10 examples per class yields significant improvements; 15-25% improvement over training from scratch |


 

 

Conclusion

Image segmentation has evolved significantly in 2025, with models like SAM 2 and OMG-Seg pushing the boundaries of what's possible in visual understanding. The trend toward unified architectures capable of handling multiple tasks represents a significant shift from the specialized models of previous years. While traditional architectures like DeepLabV3+, HRNet, and Mask-RCNN continue to be relevant, especially in specific domains, the integration of foundation model capabilities is transforming the field.

The choice between zero-shot capabilities and fine-tuned performance presents an important trade-off, with different models excelling in different scenarios. For applications requiring immediate deployment without task-specific training, SAM 2 and OMG-Seg offer compelling options. For scenarios where maximum accuracy is critical and domain-specific data is available, fine-tuned models like DeepLabV3+ and HRNet remain strong choices.

As the field continues to advance, we can expect further improvements in model efficiency, generalization capabilities, and ease of adaptation to specific domains, making powerful image segmentation increasingly accessible across a wide range of applications.

 

References

  1. Li, X., Yuan, H., Li, W., Ding, H., Wu, S., Zhang, W., Li, Y., Chen, K., & Loy, C. C. (2024). OMG-Seg: Is One Model Good Enough For All Segmentation? Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2401.10229
     
  2. Meta AI. (2024). SAM 2: Segment Anything in Images and Videos. https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/
     
  3. Meta AI. (2024). Introducing Meta Segment Anything Model 2 (SAM 2). https://ai.meta.com/sam2/
     
  4. Meta AI. (2024). Our New AI Model Can Segment Anything – Even Video. https://about.fb.com/news/2024/07/our-new-ai-model-can-segment-video/
     
  5. Ultralytics. (2024). SAM 2: Segment Anything Model 2. https://docs.ultralytics.com/models/sam-2/
     
  6. Viso.ai. (2025). OMG-Seg: 10 Segmentation Tasks in 1 Framework. https://viso.ai/computer-vision/omg-seg/
     
  7. Averroes AI. (2025). 7 Best Semantic Segmentation Models (2025). https://averroes.ai/blog/best-semantic-segmentation-models
     
  8. ScienceDirect. (2024). An improved semantic segmentation algorithm for high-resolution images. https://www.sciencedirect.com/science/article/abs/pii/S0952197623014446
     
  9. GitHub. (n.d.). HRNet/HRNet-Semantic-Segmentation. https://github.com/HRNet/HRNet-Semantic-Segmentation
     
  10. JISEM Journal. (2025). Semantic Object Segmentation using Modified HRNet Deep Learning Model. https://jisem-journal.com/index.php/journal/article/view/530
     
  11. GitHub. (n.d.). matterport/Mask_RCNN: Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN
     
  12. Medium. (2025). Mask R-CNN: An Overview. https://medium.com/@fahey_james/mask-r-cnn-an-overview-ca682955a1a1
     
  13. Ultralytics. (2025). Mask R-CNN Explained: Guide, Uses & YOLO. https://www.ultralytics.com/blog/what-is-mask-r-cnn-and-how-does-it-work