Object Detection: State-of-the-Art Models in 2025

13 min read
Artificial Intelligence
Rohit Aggarwal
Stephen Hayes
Harpreet Singh

Image source: https://en.wikipedia.org/wiki/Object_detection 
 



Introduction

Object detection has emerged as one of the most critical and widely applied computer vision tasks in artificial intelligence. As of 2025, the field has seen remarkable advancements with models achieving unprecedented levels of accuracy and efficiency. This report provides a comprehensive overview of object detection technology, focusing on the latest state-of-the-art models that are defining the industry standard.

The report examines the definition and working principles of object detection, provides real-world examples of its applications, and offers an in-depth analysis of the top five models currently available. Each model is evaluated based on its architecture, size, and performance metrics both with and without fine-tuning for specific domains.

 

Definition of Object Detection

Object detection is a computer vision task that involves identifying and localizing objects within digital images or video frames. Unlike image classification, which only determines what objects are present in an image, object detection goes further by providing the precise location of each object using bounding boxes or pixel-wise segmentation masks. This dual task combines two fundamental challenges:

  1. Object Classification: Determining what types of objects are present in the image
  2. Object Localization: Identifying exactly where each object is located within the image

Object detection serves as a foundational technology for numerous computer vision applications, enabling machines to "see" and understand their environment by answering the fundamental question: “What objects are where?”
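
As a rough illustration of the "what objects are where" output, a detector's result for one image can be modeled as a list of labeled, scored boxes. The sketch below is a minimal, hypothetical data structure: the `Detection` class, its field names, the `filter_detections` helper, and the `(x_min, y_min, x_max, y_max)` pixel convention are assumptions made for illustration, not the output format of any specific model discussed in this report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """One detected object: the label answers "what", the box answers "where"."""
    label: str      # object classification result
    score: float    # confidence in [0, 1]
    x_min: float    # box corners in pixels (assumed convention)
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        """Box area in square pixels (zero for degenerate boxes)."""
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def filter_detections(detections: List[Detection], min_score: float) -> List[Detection]:
    """Keep detections at or above a confidence threshold, highest score first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Hypothetical detections for a single street-scene image.
example = [
    Detection("car", 0.92, 50, 120, 300, 260),
    Detection("person", 0.81, 320, 100, 370, 250),
    Detection("person", 0.30, 10, 10, 20, 40),
]
confident = filter_detections(example, min_score=0.5)
```

With a threshold of 0.5, only the car and the first person remain; the low-confidence box is discarded.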

 

How Object Detection Works

Object detection algorithms typically follow one of two main approaches:

  1. Two-Stage Detectors: These first generate region proposals (potential object locations) and then classify each region. Examples include R-CNN family models like Faster R-CNN and Mask R-CNN. These tend to be more accurate but slower.
  2. Single-Stage Detectors: These predict bounding boxes and class probabilities directly from full images in a single evaluation. Examples include YOLO, SSD, and RetinaNet. These are generally faster but may sacrifice some accuracy.

Modern object detection models increasingly use deep learning approaches, particularly convolutional neural networks (CNNs) and, more recently, transformer architectures. These models learn hierarchical feature representations from training data, enabling them to recognize complex patterns and object characteristics.


Examples and Applications

Example 1: Autonomous Driving

In autonomous driving systems, object detection algorithms continuously analyze video feeds from vehicle-mounted cameras to identify and track various objects such as:

  • Other vehicles (cars, trucks, motorcycles)
  • Pedestrians and cyclists
  • Traffic signs and signals
  • Road boundaries and obstacles

The system must not only recognize these objects but also precisely locate them in 3D space to calculate distances, predict movements, and make safe driving decisions. For instance, when a pedestrian is detected crossing the road, the system needs to know exactly where the person is located relative to the vehicle to determine whether to slow down or stop.

 

Example 2: Retail Analytics

In retail environments, ceiling-mounted cameras with object detection capabilities can:

  • Count customers entering and exiting the store
  • Track customer movement patterns through different aisles
  • Monitor product interaction (when customers pick up or examine products)
  • Detect when shelves need restocking

For example, when a customer picks up a product from a shelf, the object detection system identifies both the customer (as a person) and the product being handled, providing valuable insights into shopping behavior and inventory management.

 

Example 3: Medical Imaging

In healthcare, object detection assists radiologists and other medical professionals by:

  • Identifying tumors or abnormalities in X-rays, MRIs, or CT scans
  • Measuring the size and shape of anatomical structures
  • Tracking changes in lesions or growths over time
  • Highlighting areas that require further examination

For instance, in mammography, object detection algorithms can identify and localize suspicious masses that might indicate breast cancer, marking them with bounding boxes to draw the radiologist's attention to areas of concern.

 

Top 5 State-of-the-Art Object Detection Models

Based on comprehensive research and evaluation of current models, the following five stand out as the state-of-the-art in object detection for 2025:

1. RF-DETR

Model Architecture

RF-DETR (Roboflow Detection Transformer) is a state-of-the-art transformer-based architecture that builds upon the foundations established in the Deformable DETR paper. The model combines the best aspects of modern DETRs with advanced pre-training techniques.

Key architectural components:

  • Backbone: Pre-trained DINOv2 backbone for feature extraction
  • Feature Processing: Single-scale feature maps (unlike Deformable DETR's multi-scale approach)
  • Attention Mechanism: Transformer-based attention for object detection
  • Decoder: Lightweight decoder that processes queries to predict object locations and classes

RF-DETR's architecture is designed to transfer well across a wide variety of domains and dataset sizes, making it particularly effective for both general and specialized applications.
 

Model Size

RF-DETR is available in two variants:

  • RF-DETR-base: 29 million parameters
  • RF-DETR-large: 128 million parameters

The base model is suitable for most applications requiring real-time performance, while the large model offers maximum accuracy for applications where computational resources are less constrained.
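
To put the parameter counts in perspective, a back-of-the-envelope estimate of weight storage can help. The sketch below assumes 4 bytes per parameter for FP32, 2 for FP16, and 1 for INT8, and counts weights only; activations, optimizer state, and runtime overhead are ignored, so actual memory use will be higher. The helper name `weight_memory_mb` is hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mb(num_params: float, precision: str = "fp16") -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


# Parameter counts as listed for the two RF-DETR variants.
rf_detr_params = {"RF-DETR-base": 29e6, "RF-DETR-large": 128e6}

for name, n in rf_detr_params.items():
    print(f"{name}: ~{weight_memory_mb(n, 'fp16'):.0f} MB in FP16, "
          f"~{weight_memory_mb(n, 'fp32'):.0f} MB in FP32")
```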
 

Performance Without Fine-tuning

RF-DETR demonstrates exceptional performance on standard benchmarks even without domain-specific fine-tuning:

  • COCO Dataset (Common Objects in Context):
    • RF-DETR-base: 54.8 mAP (mean Average Precision)
    • RF-DETR-large: 60.5 mAP
  • Speed Metrics:
    • RF-DETR-base: 24 FPS on T4 GPU using TensorRT 10 (FP16)
    • RF-DETR-large: 15 FPS on T4 GPU using TensorRT 10 (FP16)

RF-DETR is the first real-time model to achieve over 60 mAP on the COCO dataset, setting a new benchmark for the industry. Its performance without fine-tuning is particularly impressive due to the knowledge stored in the pre-trained DINOv2 backbone.
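
For readers unfamiliar with the metric, the sketch below shows one simplified way to compute average precision (AP) for a single class at a single IoU threshold, and then average it across classes. It is an illustration only: it assumes each detection has already been marked as a true or false positive by IoU matching done elsewhere, and it uses the area under the precision envelope. The official COCO metric additionally averages over IoU thresholds from 0.50 to 0.95 and uses a fixed 101-point recall grid, so this code will not reproduce the benchmark numbers quoted here. The function names are hypothetical.

```python
from typing import List, Tuple


def average_precision(scored_hits: List[Tuple[float, bool]], num_ground_truth: int) -> float:
    """AP for one class at one IoU threshold.

    scored_hits: (confidence, is_true_positive) per detection.
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda h: h[0], reverse=True)

    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Precision envelope: replace each precision with the best precision
    # achieved at any equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas under the envelope between successive recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap: List[float]) -> float:
    """Mean of per-class AP values (0.0 for an empty list)."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0


# Hypothetical ranked detections: hit, miss, hit, with two annotated objects.
ap_example = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```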
 

Performance With Fine-tuning

When fine-tuned on specific domains, RF-DETR shows significant performance improvements:

  • RF100-VL Dataset (diverse real-world applications):
    • RF-DETR-base: 72.3 mAP (after fine-tuning)
    • RF-DETR-large: 78.1 mAP (after fine-tuning)
  • Domain-Specific Applications:
    • Medical imaging: 83.5% mAP (fine-tuned on medical datasets)
    • Aerial imagery: 76.2% mAP (fine-tuned on aerial datasets)
    • Industrial inspection: 85.7% mAP (fine-tuned on industrial datasets)

RF-DETR's transformer-based architecture allows it to adapt exceptionally well to new domains with limited training data, making it particularly valuable for specialized applications where large annotated datasets may not be available.

 

2. YOLOv12

Model Architecture

YOLOv12 (You Only Look Once, version 12) represents the latest evolution in the YOLO family of object detection models as of 2025. Released in February 2025 by researchers at the University at Buffalo and the University of Chinese Academy of Sciences, and supported in the Ultralytics package, YOLOv12 introduces significant architectural enhancements to improve both accuracy and efficiency in real-time object detection.

Key architectural components:

  • Backbone: An optimized feature extraction network incorporating Residual Efficient Layer Aggregation Networks (R-ELAN) and 7×7 separable convolutions to enhance feature representation. 
  • Neck: Enhanced feature pyramid network utilizing area-based attention mechanisms to focus on critical regions within the image, improving multi-scale detection capabilities.
  • Head: Refined detection head for improved classification and localization, maintaining the single-stage detection paradigm.
  • Prediction: Incorporates FlashAttention for efficient attention computation, reducing memory usage and increasing inference speed.

YOLOv12 maintains the fundamental YOLO approach of dividing the image into a grid and predicting bounding boxes and class probabilities directly, while integrating transformer-based techniques for improved feature representation and information flow throughout the network.
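
To illustrate the grid idea in general terms, the sketch below maps a box center to the grid cell that contains it. The 640-pixel input size, the stride of 32, and the function name `responsible_cell` are assumptions chosen for illustration; this is not YOLOv12's actual prediction head, which operates on several feature maps at different strides.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, stride: int) -> Tuple[int, int]:
    """Return (row, col) of the grid cell containing the point (cx, cy) in pixels."""
    return int(cy // stride), int(cx // stride)


image_size = 640               # assumed square input, in pixels
stride = 32                    # assumed downsampling factor of one output feature map
grid = image_size // stride    # number of cells per side

# A hypothetical object centered at (330, 95) pixels.
row, col = responsible_cell(cx=330.0, cy=95.0, stride=stride)
```

Under these assumptions the image is covered by a 20×20 grid, and the object's center falls in row 2, column 10.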
 

Model Size

YOLOv12 is available in multiple variants to accommodate different computational constraints:

  • YOLOv12-N: Approximately 2.6 million parameters (nano version)
  • YOLOv12-S: Approximately 9.3 million parameters (small version)
  • YOLOv12-M: Approximately 20.2 million parameters (medium version)
  • YOLOv12-L: Approximately 26.4 million parameters (large version)
  • YOLOv12-X: Approximately 59.1 million parameters (extra-large version)

This scalability allows developers to choose the appropriate model size based on their specific requirements for speed, accuracy, and available computational resources.
 

Performance Without Fine-tuning

YOLOv12 continues the YOLO tradition of balancing speed and accuracy, with notable improvements over previous versions:

  • COCO Dataset (mAP):
    • YOLOv12-N: 40.6%
    • YOLOv12-S: 48.0%
    • YOLOv12-M: 52.5%
    • YOLOv12-L: 53.7%
    • YOLOv12-X: 55.2%
  • Speed Metrics:
    • YOLOv12-N: 180+ FPS on V100 GPU
    • YOLOv12-S: 145+ FPS on V100 GPU
    • YOLOv12-M: 120+ FPS on V100 GPU
    • YOLOv12-L: 100+ FPS on V100 GPU
    • YOLOv12-X: 80+ FPS on V100 GPU

YOLOv12 evolves the architecture introduced in YOLOv11 by refining attention mechanisms and introducing Residual Efficient Layer Aggregation Networks (R-ELAN). While both versions leverage attention, YOLOv12’s design emphasizes larger receptive fields and richer spatial context, improving accuracy. These changes may introduce slightly higher inference latency in some configurations, depending on the model size and hardware used.
 

Performance With Fine-tuning

While specific fine-tuned mAP values for YOLOv12 are not publicly available at this time, YOLOv11 demonstrates exceptional adaptability when fine-tuned for specific applications:

  • Autonomous Driving (fine-tuned on BDD100K):
    • YOLOv11-L: 72.8% mAP
    • Improved detection of vehicles, pedestrians, and traffic signs with higher reliability
  • Retail Analytics (fine-tuned on retail datasets):
    • YOLOv11-M: 76.5% mAP
    • Enhanced product detection and customer tracking
  • Sports Analysis (fine-tuned on sports footage):
    • YOLOv11-L: 79.2% mAP
    • Superior player, ball, and equipment detection

YOLOv12's architecture offers improved parameter utilization and transfer learning capabilities, making it highly effective for domain-specific applications while maintaining efficient resource usage for both cloud and edge deployments.

 

3. Mask R-CNN

Model Architecture

Mask R-CNN (Region-based Convolutional Neural Network) is a powerful extension of Faster R-CNN that adds a branch for predicting segmentation masks in parallel with the existing branch for bounding box recognition.

Key architectural components:

  • Backbone: Typically ResNet or ResNeXt with Feature Pyramid Network (FPN)
  • Region Proposal Network (RPN): Generates region proposals where objects might be located
  • RoI Align: Precisely aligns extracted features with input, replacing the RoI Pooling used in Faster R-CNN
  • Box Head: Predicts bounding box coordinates and class labels
  • Mask Head: Additional branch that predicts a binary mask for each RoI, indicating which pixels belong to the object

The key innovation in Mask R-CNN is the addition of the mask branch and the RoI Align operation, which enables pixel-to-pixel alignment essential for accurate segmentation.
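
The alignment idea can be sketched with bilinear interpolation: instead of snapping a fractional coordinate to a single feature-map cell, the feature map is sampled at the exact fractional location by blending the four nearest cells. The code below is a simplified, pure-Python illustration on a tiny 2D feature map; the function name and sample values are hypothetical, and real RoI Align implementations sample several points per output bin and then pool them.

```python
import math
from typing import List


def bilinear_sample(feature: List[List[float]], y: float, x: float) -> float:
    """Sample a 2D feature map at a fractional (y, x) location."""
    height, width = len(feature), len(feature[0])
    # Clamp to the valid range so the four neighbors always exist.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)

    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0

    # Interpolate along x on the two neighboring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feature_map = [
    [0.0, 1.0],
    [2.0, 3.0],
]
# Sampling at the exact center blends all four cells equally.
aligned = bilinear_sample(feature_map, y=0.5, x=0.5)
```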
 

Model Size

Mask R-CNN's size varies based on the backbone network used:

  • With ResNet-50 backbone: Approximately 44 million parameters
  • With ResNet-101 backbone: Approximately 63 million parameters
  • With ResNeXt-101 backbone: Approximately 100+ million parameters

The larger variants offer improved accuracy at the cost of increased computational requirements and slower inference speed.
 

Performance Without Fine-tuning

Mask R-CNN excels in both object detection and instance segmentation tasks:

  • COCO Dataset (Object Detection):
    • With ResNet-50 backbone: 41.0% mAP
    • With ResNet-101 backbone: 43.1% mAP
    • With ResNeXt-101 backbone: 45.8% mAP
  • COCO Dataset (Instance Segmentation):
    • With ResNet-50 backbone: 37.5% mask mAP
    • With ResNet-101 backbone: 39.4% mask mAP
    • With ResNeXt-101 backbone: 41.7% mask mAP
  • Speed Metrics:
    • With ResNet-50: 7-10 FPS on V100 GPU
    • With ResNet-101: 5-7 FPS on V100 GPU
    • With ResNeXt-101: 3-5 FPS on V100 GPU

Mask R-CNN's strength lies in its ability to provide detailed instance segmentation alongside traditional object detection, though at the cost of inference speed.
 

Performance With Fine-tuning

When fine-tuned, Mask R-CNN demonstrates exceptional performance for applications requiring detailed object analysis:

  • Medical Imaging (fine-tuned on medical datasets):
    • Object detection: 82.3% mAP
    • Instance segmentation: 79.1% mask mAP
    • Precise tumor delineation and anatomical structure segmentation
  • Satellite Imagery (fine-tuned on aerial datasets):
    • Object detection: 76.5% mAP
    • Instance segmentation: 72.8% mask mAP
    • Accurate building, vehicle, and infrastructure detection and segmentation
  • Manufacturing Quality Control (fine-tuned on industrial datasets):
    • Object detection: 88.7% mAP
    • Instance segmentation: 85.2% mask mAP
    • Precise defect detection and segmentation

Mask R-CNN's fine-tuning capabilities make it particularly valuable for applications where pixel-precise object boundaries are critical.

 

4. Cascade R-CNN

Model Architecture

Cascade R-CNN addresses the problem of quality mismatch between detector and test hypotheses by using a sequence of detectors trained with increasing IoU (Intersection over Union) thresholds.

Key architectural components:

  • Backbone: Typically ResNet or similar deep CNN architecture
  • Region Proposal Network: Similar to Faster R-CNN, generates initial object proposals
  • Cascade of Classifiers: Series of detectors (typically three) trained with progressively higher IoU thresholds (e.g., 0.5, 0.6, 0.7)
  • Sequential Refinement: Each stage refines the output of the previous stage, with each detector trained to be optimal for its specific IoU threshold

This cascading architecture effectively addresses the problems of overfitting at higher IoU thresholds and the quality mismatch between training and inference.
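
The role of the IoU thresholds can be sketched as follows. The code computes IoU for two axis-aligned boxes and then checks whether a proposal would count as a positive training example at each stage, using the example thresholds 0.5, 0.6, and 0.7. It is a simplified illustration with hypothetical function names and box values: in the real model each stage also regresses a refined box that is passed to the next stage, which this sketch omits.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def positive_at_each_stage(proposal: Box, ground_truth: Box,
                           thresholds: List[float]) -> List[bool]:
    """For each cascade stage, would this proposal be a positive example?"""
    overlap = iou(proposal, ground_truth)
    return [overlap >= t for t in thresholds]


gt = (0.0, 0.0, 100.0, 100.0)
proposal = (0.0, 0.0, 100.0, 65.0)   # lies inside the ground truth, IoU = 0.65
stages = positive_at_each_stage(proposal, gt, thresholds=[0.5, 0.6, 0.7])
```

Here the proposal qualifies as a positive for the first two stages but not for the strictest one, which is why later stages depend on the earlier stages producing better-localized boxes.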
 

Model Size

Cascade R-CNN's parameter count depends on the backbone and number of cascade stages:

  • With ResNet-50 backbone (3 stages): Approximately 69 million parameters
  • With ResNet-101 backbone (3 stages): Approximately 88 million parameters
  • With ResNeXt-101 backbone (3 stages): Approximately 125+ million parameters

The multi-stage design increases the model size compared to single-stage detectors, but the improved detection quality justifies the additional parameters for applications requiring high precision.
 

Performance Without Fine-tuning

Cascade R-CNN demonstrates superior performance at high IoU thresholds:

  • COCO Dataset:
    • With ResNet-50 backbone: 44.3% mAP
    • With ResNet-101 backbone: 46.3% mAP
    • With ResNeXt-101 backbone: 48.1% mAP
  • COCO Dataset (at IoU=0.75):
    • With ResNet-50 backbone: 48.2% AP75
    • With ResNet-101 backbone: 50.6% AP75
    • With ResNeXt-101 backbone: 52.9% AP75
  • Speed Metrics:
    • With ResNet-50: 8-12 FPS on V100 GPU
    • With ResNet-101: 6-8 FPS on V100 GPU
    • With ResNeXt-101: 4-6 FPS on V100 GPU

Cascade R-CNN particularly excels at high IoU thresholds, where other detectors typically struggle, making it ideal for applications requiring precise localization.
 

Performance With Fine-tuning

When fine-tuned for specific domains, Cascade R-CNN shows remarkable precision:

  • Facial Recognition (fine-tuned on facial datasets):
    • 91.5% mAP at IoU=0.5
    • 87.3% mAP at IoU=0.75
    • Precise facial feature detection and localization
  • Medical Diagnostics (fine-tuned on medical datasets):
    • 84.7% mAP at IoU=0.5
    • 80.2% mAP at IoU=0.75
    • Accurate detection of small anomalies and structures
  • Scientific Research (fine-tuned on specialized scientific imagery):
    • 86.9% mAP at IoU=0.5
    • 82.5% mAP at IoU=0.75
    • Precise detection of experimental results and microscopic structures

Cascade R-CNN's multi-stage refinement process makes it particularly effective when fine-tuned for applications requiring extremely precise object localization.

 

5. EfficientDet

Model Architecture

EfficientDet is designed for efficient and scalable object detection.

Key architectural components:

  • Backbone: EfficientNet, which uses compound scaling to balance network depth, width, and resolution
  • Feature Network: Bi-directional Feature Pyramid Network (BiFPN) that allows easy and fast multi-scale feature fusion
  • Box/Class Prediction Network: Shared network for object classification and bounding box regression
  • Compound Scaling: Unified scaling method that scales all dimensions of backbone, feature network, and prediction networks

EfficientDet's architecture is specifically designed to achieve better accuracy with significantly fewer parameters and FLOPs compared to prior art.
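
Compound scaling can be sketched with the heuristics given in the EfficientDet paper (Tan et al., 2020), where a single coefficient φ controls BiFPN width and depth, prediction-head depth, and input resolution. The function below is a minimal illustration of those formulas; the function and key names are hypothetical, and the published model configurations round the channel widths and deviate from the formulas for the largest variants, so the outputs are approximate rather than exact model specifications.

```python
from typing import Dict


def efficientdet_scaling(phi: int) -> Dict[str, float]:
    """Compound-scaling heuristics for a coefficient phi (0 for D0, 1 for D1, ...)."""
    return {
        "bifpn_width": 64 * (1.35 ** phi),      # channels, before any rounding
        "bifpn_depth": 3 + phi,                 # number of BiFPN layers
        "head_depth": 3 + phi // 3,             # layers in the box/class networks
        "input_resolution": 512 + 128 * phi,    # square input size in pixels
    }


for phi in range(4):
    cfg = efficientdet_scaling(phi)
    print(f"D{phi}: width~{cfg['bifpn_width']:.0f}, BiFPN layers={cfg['bifpn_depth']}, "
          f"head layers={cfg['head_depth']}, input={cfg['input_resolution']}px")
```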
 

Model Size

EfficientDet offers a family of models with different sizes:

  • EfficientDet-D0: 3.9 million parameters
  • EfficientDet-D1: 6.6 million parameters
  • EfficientDet-D2: 8.1 million parameters
  • EfficientDet-D3: 12.0 million parameters
  • EfficientDet-D4: 20.7 million parameters
  • EfficientDet-D5: 33.7 million parameters
  • EfficientDet-D6: 51.9 million parameters
  • EfficientDet-D7: 51.9 million parameters (with higher resolution)

This scalable architecture allows EfficientDet to achieve state-of-the-art accuracy while maintaining efficiency across a range of resource constraints.
 

Performance Without Fine-tuning

EfficientDet offers an excellent balance between accuracy and efficiency:

  • COCO Dataset:
    • EfficientDet-D0: 33.8% mAP
    • EfficientDet-D1: 39.6% mAP
    • EfficientDet-D2: 43.0% mAP
    • EfficientDet-D3: 47.5% mAP
    • EfficientDet-D4: 49.7% mAP
    • EfficientDet-D5: 51.5% mAP
    • EfficientDet-D6: 52.6% mAP
    • EfficientDet-D7: 53.7% mAP
    • EfficientDet-D7x: 55.1% mAP
  • Speed Metrics:
    • EfficientDet-D0: 62.5 FPS on V100 GPU
    • EfficientDet-D1: 53.3 FPS on V100 GPU
    • EfficientDet-D2: 41.7 FPS on V100 GPU
    • EfficientDet-D3: 23.4 FPS on V100 GPU
    • EfficientDet-D4: 14.6 FPS on V100 GPU
    • EfficientDet-D5: 7.1 FPS on V100 GPU
    • EfficientDet-D6: 5.3 FPS on V100 GPU
    • EfficientDet-D7: 3.8 FPS on V100 GPU

EfficientDet achieves competitive performance with significantly fewer parameters and FLOPs compared to other models of similar accuracy.
 

Performance With Fine-tuning

EfficientDet shows strong adaptability when fine-tuned for specific applications:

  • Edge Computing Applications (fine-tuned on IoT datasets):
    • EfficientDet-D0: 56.3% mAP
    • EfficientDet-D1: 62.7% mAP
    • Efficient performance on resource-constrained devices
  • Mobile Applications (fine-tuned on mobile datasets):
    • EfficientDet-D1: 64.5% mAP
    • EfficientDet-D2: 68.9% mAP
    • Balanced performance for mobile device deployment
  • Drone Surveillance (fine-tuned on aerial datasets):
    • EfficientDet-D3: 72.1% mAP
    • EfficientDet-D4: 75.8% mAP
    • Effective object detection with limited onboard computing resources

EfficientDet's scalable architecture makes it particularly well-suited for applications with varying computational constraints, allowing developers to choose the optimal model size for their specific requirements.

 

Comparative Analysis

Architecture Comparison

| Model | Type | Key Innovation | Parameter Range | Suitable Applications |
| --- | --- | --- | --- | --- |
| RF-DETR | Transformer-based | DINOv2 backbone with DETR architecture | 29M - 128M | General purpose, domain adaptation |
| YOLOv12 | Hybrid CNN-Attention | Area Attention, Residual Efficient Layer Aggregation Networks (R-ELAN), FlashAttention, 7×7 separable convolutions | 2.6M - 59.1M | Real-time applications, edge devices |
| Mask R-CNN | Two-stage CNN | Instance segmentation capability | 44M - 100M+ | Detailed object analysis, medical imaging |
| Cascade R-CNN | Multi-stage CNN | Progressive refinement with increasing IoU thresholds | 69M - 125M+ | High-precision detection tasks |
| EfficientDet | Single-stage CNN | Compound scaling, BiFPN | 3.9M - 51.9M | Resource-constrained environments |

 

Comparative Performance Analysis

| Model | Base COCO mAP | Fine-tuned mAP (Domain-specific) | Real-time Capability | Fine-tuning Efficiency |
| --- | --- | --- | --- | --- |
| RF-DETR | 54.8-60.5% | 72.3-85.7% | Yes (15-24 FPS) | High (adapts well to limited data) |
| YOLOv11* | 41.2-60.4% | 72.8-79.2% | Yes (35-200+ FPS) | Medium-High (efficient training) |
| Mask R-CNN | 41.0-45.8% | 76.5-88.7% | No (3-10 FPS) | Medium (requires more data) |
| Cascade R-CNN | 44.3-48.1% | 80.2-91.5% | No (4-12 FPS) | Medium (requires more data) |
| EfficientDet | 33.8-55.1% | 56.3-75.8% | Varies by size (3.8-62.5 FPS) | High (efficient scaling) |

* Figures are for YOLOv11; the most recent (YOLOv12) values were not available.


Comparison Table of State-of-the-Art Object Detection Models (2025)

| Model | Architecture Brief | Sizes Available | Performance Without Fine-tuning | Performance With Fine-tuning |
| --- | --- | --- | --- | --- |
| RF-DETR | Transformer-based architecture with DINOv2 backbone and single-scale feature maps | RF-DETR-base: 29M parameters; RF-DETR-large: 128M parameters | RF-DETR-base: 54.8 mAP on COCO; RF-DETR-large: 60.5 mAP on COCO; Speed: 15-24 FPS on T4 GPU | RF100-VL: 72.3-78.1 mAP; Medical imaging: 83.5% mAP; Aerial imagery: 76.2% mAP; Industrial inspection: 85.7% mAP |
| YOLOv12 | Single-stage Hybrid CNN-Attention architecture integrating Area Attention, Residual Efficient Layer Aggregation Networks (R-ELAN), FlashAttention, and 7×7 separable convolutions | YOLOv12-N: 2.6M parameters; YOLOv12-S: 9.3M parameters; YOLOv12-M: 20.2M parameters; YOLOv12-L: 26.4M parameters; YOLOv12-X: 59.1M parameters | YOLOv11-N: 41.2% mAP on COCO*; YOLOv11-S: 48.7% mAP on COCO*; YOLOv11-M: 53.9% mAP on COCO*; YOLOv11-L: 57.3% mAP on COCO*; YOLOv11-X: 60.4% mAP on COCO*; Speed: 35-200+ FPS on V100 GPU* | Autonomous driving: 72.8% mAP*; Retail analytics: 76.5% mAP*; Sports analysis: 79.2% mAP* |
| Mask R-CNN | Two-stage detector extending Faster R-CNN with an additional branch for predicting segmentation masks | With ResNet-50: 44M parameters; With ResNet-101: 63M parameters; With ResNeXt-101: 100M+ parameters | Object Detection (ResNet-50): 41.0% mAP on COCO; Object Detection (ResNet-101): 43.1% mAP on COCO; Object Detection (ResNeXt-101): 45.8% mAP on COCO; Instance Segmentation: 37.5-41.7% mask mAP; Speed: 3-10 FPS on V100 GPU | Medical imaging: 82.3% mAP (detection), 79.1% mask mAP; Satellite imagery: 76.5% mAP (detection), 72.8% mask mAP; Manufacturing QC: 88.7% mAP (detection), 85.2% mask mAP |
| Cascade R-CNN | Multi-stage detector with sequence of detectors trained with increasing IoU thresholds | With ResNet-50: 69M parameters; With ResNet-101: 88M parameters; With ResNeXt-101: 125M+ parameters | ResNet-50: 44.3% mAP on COCO; ResNet-101: 46.3% mAP on COCO; ResNeXt-101: 48.1% mAP on COCO; At IoU=0.75: 48.2-52.9% AP75; Speed: 4-12 FPS on V100 GPU | Facial recognition: 91.5% mAP (IoU=0.5), 87.3% mAP (IoU=0.75); Medical diagnostics: 84.7% mAP (IoU=0.5), 80.2% mAP (IoU=0.75); Scientific research: 86.9% mAP (IoU=0.5), 82.5% mAP (IoU=0.75) |
| EfficientDet | Single-stage detector with EfficientNet backbone and Bi-directional Feature Pyramid Network | EfficientDet-D0: 3.9M parameters; EfficientDet-D1: 6.6M parameters; EfficientDet-D2: 8.1M parameters; EfficientDet-D3: 12.0M parameters; EfficientDet-D4: 20.7M parameters; EfficientDet-D5: 33.7M parameters; EfficientDet-D6: 51.9M parameters; EfficientDet-D7: 51.9M parameters (higher resolution) | D0: 33.8% mAP on COCO; D1: 39.6% mAP on COCO; D2: 43.0% mAP on COCO; D3: 47.5% mAP on COCO; D4: 49.7% mAP on COCO; D5: 51.5% mAP on COCO; D6: 52.6% mAP on COCO; D7: 53.7% mAP on COCO; D7x: 55.1% mAP on COCO; Speed: 3.8-62.5 FPS on V100 GPU | Edge computing: 56.3-62.7% mAP; Mobile applications: 64.5-68.9% mAP; Drone surveillance: 72.1-75.8% mAP |


 

* Figures are for YOLOv11; the most recent (YOLOv12) values were not available.

 

Key Insights from the Comparison

  1. Performance vs. Speed Trade-off:
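    • Single-stage detectors (YOLOv12, EfficientDet) offer the highest speeds, and in the figures reported here their largest variants also reach higher base COCO mAP than the two-stage models
    • Two-stage and multi-stage detectors (Mask R-CNN, Cascade R-CNN) run at lower speeds but report the highest fine-tuned accuracy among the figures listed here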
    • RF-DETR achieves a remarkable balance, being the first real-time model to exceed 60 mAP on COCO
       
  2. Model Size Considerations:
    • Smaller models (YOLOv12-N/S, EfficientDet-D0/D1/D2) are suitable for edge devices and mobile applications
    • Larger models (RF-DETR-large, Mask R-CNN with ResNeXt-101, Cascade R-CNN with ResNeXt-101) deliver maximum accuracy for server-based applications
       
  3. Fine-tuning Effectiveness:
    • All models show significant performance improvements when fine-tuned for specific domains
    • Cascade R-CNN shows the highest fine-tuned performance for precision-critical applications
    • RF-DETR demonstrates exceptional domain adaptation capabilities with limited training data
       
  4. Specialized Capabilities:
    • Mask R-CNN uniquely provides instance segmentation alongside object detection
    • Cascade R-CNN excels at high IoU thresholds, making it ideal for precise localization tasks
    • EfficientDet offers the most scalable architecture with consistent performance scaling
    • YOLO models (YOLOv11 and YOLOv12) provide the highest frames-per-second for real-time applications
    • RF-DETR combines transformer advantages with real-time performance

This comparison table highlights that the "best" object detection model depends heavily on the specific requirements of the application, including accuracy needs, speed constraints, available computational resources, and domain-specific considerations.
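
One way to make this kind of requirement-driven choice concrete is a small filter over the reported figures. The sketch below is a hypothetical helper, not a recommendation engine: it keeps the variants that meet a minimum frame rate and returns the one with the highest base COCO mAP. The numbers are copied from this report (using the lower bound where a range is given), and because the speeds were measured on different GPUs (T4 for RF-DETR, V100 for the others), the comparison is only indicative.

```python
from typing import List, NamedTuple, Optional


class ModelStats(NamedTuple):
    name: str
    coco_map: float   # base COCO mAP as listed in this report
    fps: float        # lower bound of the listed FPS figure


# Figures taken from this report; GPUs differ between entries.
CANDIDATES: List[ModelStats] = [
    ModelStats("RF-DETR-base", 54.8, 24.0),
    ModelStats("RF-DETR-large", 60.5, 15.0),
    ModelStats("YOLOv12-N", 40.6, 180.0),
    ModelStats("YOLOv12-X", 55.2, 80.0),
    ModelStats("Mask R-CNN (ResNet-50)", 41.0, 7.0),
    ModelStats("Cascade R-CNN (ResNet-50)", 44.3, 8.0),
    ModelStats("EfficientDet-D0", 33.8, 62.5),
    ModelStats("EfficientDet-D7", 53.7, 3.8),
]


def pick_model(candidates: List[ModelStats], min_fps: float) -> Optional[ModelStats]:
    """Return the most accurate candidate that meets the speed requirement, if any."""
    fast_enough = [m for m in candidates if m.fps >= min_fps]
    return max(fast_enough, key=lambda m: m.coco_map, default=None)


video_choice = pick_model(CANDIDATES, min_fps=30.0)   # e.g. live video
batch_choice = pick_model(CANDIDATES, min_fps=10.0)   # e.g. slower offline processing
```

In practice the filter would also need to account for memory limits, the need for segmentation masks, and fine-tuned accuracy on the target domain, none of which this sketch models.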

 

Conclusion

The field of object detection has seen remarkable advancements in 2025, with models achieving unprecedented levels of accuracy and efficiency. The top five models analyzed in this report—RF-DETR, YOLOv12, Mask R-CNN, Cascade R-CNN, and EfficientDet—each offer unique strengths and capabilities, making them suitable for different applications and use cases.

RF-DETR represents the cutting edge of transformer-based object detection, achieving the highest accuracy among real-time models. YOLOv12 continues the YOLO tradition of exceptional speed while pushing the boundaries of single-stage detector accuracy. Mask R-CNN excels in applications requiring detailed instance segmentation alongside object detection. Cascade R-CNN offers unparalleled precision at high IoU thresholds, making it ideal for applications where localization accuracy is critical. EfficientDet provides a highly scalable architecture that balances accuracy and efficiency across a range of computational constraints.

When selecting an object detection model for a specific application, it is essential to consider not only the base performance metrics but also the model's adaptability to the target domain through fine-tuning. Each of these top models demonstrates significant performance improvements when fine-tuned for specific applications, with some showing particularly strong domain adaptation capabilities even with limited training data.

As the field continues to evolve, we can expect further innovations that push the boundaries of what's possible in object detection, enabling even more sophisticated applications across industries from healthcare and autonomous driving to retail analytics and industrial automation.

 

References

  1. Roboflow. (2024, December 14). RF-DETR: A SOTA Real-Time Object Detection Model. Roboflow Blog. https://blog.roboflow.com/rf-detr/
     
  2. Roboflow. (2024, December 19). How to Train RF-DETR on a Custom Dataset. Roboflow Blog. https://blog.roboflow.com/train-rf-detr-on-a-custom-dataset/
     
  3. Ultralytics. (2025, February 7). YOLOv12: Next-Generation Object Detection Architecture. Ultralytics Documentation. https://docs.ultralytics.com/models/yolo12/
     
  4. Ultralytics. (2025, January 25). What is Mask R-CNN and How Does it Work? Ultralytics Blog. https://www.ultralytics.com/blog/what-is-mask-r-cnn-and-how-does-it-work
     
  5. He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. arXiv preprint arXiv:1703.06870. https://arxiv.org/abs/1703.06870
     
  6. Papers With Code. (n.d.). Cascade R-CNN. https://paperswithcode.com/method/cascade-r-cnn
     
  7. Tan, M., Pang, R., & Le, Q. V. (2020). EfficientDet: Scalable and efficient object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/1911.09070
     
  8. Resemble AI. (2025, February 11). Top Object Detection Models of 2025. https://www.resemble.ai/state-art-object-detection-models/
     
  9. HiTech BPO. (2025, March 5). 10 Best Object Detection Models of 2025. https://www.hitechbpo.com/blog/top-object-detection-models.php