RF100 Dataset Catalog

Overview

The Roboflow 100 (RF100) benchmark consists of 34 diverse object detection datasets organized into 6 collections. This vignette catalogs them so you can find the right dataset for your task.

The RF100 datasets cover a wide range of domains including:

  • Biology: Microscopy, cells, bacteria, parasites (9 datasets)
  • Medical: X-rays, MRI, pathology (8 datasets)
  • Infrared: Thermal imaging, FLIR cameras (4 datasets)
  • Damage: Defect detection, infrastructure inspection (3 datasets)
  • Underwater: Marine life, coral, infrastructure (4 datasets)
  • Document: OCR, document parsing, diagrams (6 datasets)

Example: Finding a Photovoltaic Dataset

One of the motivations for this catalog was answering questions like: “Is there a photovoltaic dataset in torchvision?”

library(torchvision)

# Search for solar/photovoltaic datasets
search_rf100("solar")
search_rf100("photovoltaic")

# Result shows:
# - solar_panel in infrared collection
# - solar_panel in damage collection
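To make the idea of a keyword search concrete, here is a minimal base R sketch of how such a lookup could work on a catalog-like data frame. It is not the implementation of search_rf100(). Both mock_catalog and search_catalog() are hypothetical names introduced for illustration; the three rows reuse dataset names and descriptions from this vignette.

```r
# Hypothetical three-row stand-in for the real catalog
# (names and descriptions copied from this vignette)
mock_catalog <- data.frame(
  collection = c("infrared", "damage", "biology"),
  dataset = c("solar_panel", "solar_panel", "blood_cell"),
  description = c(
    "Solar panel detection in infrared imagery",
    "Solar panel defect and damage detection",
    "Blood cell detection (RBC, WBC, platelets)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive regex match against the
# dataset name and the description of each row
search_catalog <- function(catalog, keyword) {
  haystack <- paste(catalog$dataset, catalog$description)
  catalog[grepl(keyword, haystack, ignore.case = TRUE), , drop = FALSE]
}

hits <- search_catalog(mock_catalog, "solar")
print(hits[, c("collection", "dataset")])
```

Matching on both the name and the description lets a query find a dataset even when the keyword appears only in the description.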

Complete Catalog

The code below displays the key columns for every dataset in the catalog:

library(torchvision)
library(knitr)

catalog <- get_rf100_catalog()

# Display key columns
kable(catalog[, c("collection", "dataset", "description", "total_size_mb", "estimated_images")])
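As a quick sanity check, the per-collection counts quoted in the overview can be tabulated in base R. The sketch below rebuilds only the collection column from those counts, so it runs without downloading anything. On the real catalog, table(catalog$collection) should give the same kind of summary, assuming the column is named collection as in the call above.

```r
# Rebuild the collection labels from the counts stated in the overview
collections <- rep(
  c("biology", "medical", "infrared", "damage", "underwater", "document"),
  times = c(9, 8, 4, 3, 4, 6)
)

# Number of datasets per collection, largest first
counts <- sort(table(collections), decreasing = TRUE)
print(counts)

# Total number of datasets across all collections
total <- sum(counts)
print(total)
```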

Collections

Biology Collection (9 datasets)

Microscopy and biological imaging datasets for research and diagnostics:

search_rf100(collection = "biology")

Available datasets:

  • stomata_cell: Plant stomata cells for biology research
  • blood_cell: Blood cell detection (RBC, WBC, platelets)
  • parasite: Parasite detection in microscopy images
  • cell: General cell detection in microscopy
  • bacteria: Bacteria detection in microscopy images
  • cotton_disease: Cotton plant disease detection
  • mitosis: Mitosis phase detection in cell images
  • phage: Bacteriophage detection in microscopy
  • liver_disease: Liver disease pathology detection

Medical Collection (8 datasets)

Medical imaging datasets for clinical and research applications:

search_rf100(collection = "medical")

Available datasets:

  • radio_signal: Radio signal detection in medical imaging
  • rheumatology: Rheumatology X-ray abnormality detection
  • knee: ACL and knee X-ray analysis
  • abdomen_mri: Abdomen MRI organ detection
  • brain_axial_mri: Brain axial MRI structure detection
  • gynecology_mri: Gynecology MRI structure detection
  • brain_tumor: Brain tumor detection in MRI scans
  • fracture: Bone fracture detection in X-rays

Infrared Collection (4 datasets)

Thermal and infrared imaging datasets:

search_rf100(collection = "infrared")

Available datasets:

  • thermal_dog_and_people: Thermal imaging of dogs and people
  • solar_panel: Solar panel detection in infrared imagery
  • thermal_cheetah: Thermal imaging of cheetahs
  • ir_object: FLIR camera object detection

Damage Collection (3 datasets)

Infrastructure damage and defect detection:

search_rf100(collection = "damage")

Available datasets:

  • liquid_crystals: 4-fold defect detection in LCD displays
  • solar_panel: Solar panel defect and damage detection
  • asbestos: Asbestos detection for safety inspection

Underwater Collection (4 datasets)

Marine and underwater imaging datasets:

search_rf100(collection = "underwater")

Available datasets:

  • pipes: Underwater pipe detection for infrastructure
  • aquarium: Aquarium fish and species detection
  • objects: Underwater object detection
  • coral: Coral reef detection and monitoring

Document Collection (6 datasets)

Document analysis and OCR datasets:

search_rf100(collection = "document")

Available datasets:

  • tweeter_post: Twitter post element detection
  • tweeter_profile: Twitter profile element detection
  • document_part: Document structure and part detection
  • activity_diagram: Activity diagram element detection
  • signature: Signature detection in documents
  • paper_part: Academic paper structure detection

Usage Example

Once you’ve found a dataset, loading it is straightforward:

library(torchvision)

# Search for blood cell dataset
search_rf100("blood")

# Load the dataset
ds <- rf100_biology_collection(
  dataset = "blood_cell",
  split = "train",
  download = TRUE
)

# Inspect a sample
item <- ds[1]
print(item$y$labels)  # Object classes
print(item$y$boxes)   # Bounding boxes

# Visualize with bounding boxes
boxed <- draw_bounding_boxes(item)
tensor_image_browse(boxed)
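Once the boxes are available as a plain numeric matrix, simple per-object statistics are easy to compute. The sketch below uses a hypothetical two-row matrix and assumes each row is (xmin, ymin, xmax, ymax) in pixels. Check the layout of item$y$boxes for your dataset before relying on this. Converting the tensor to an R array first (for example with torch::as_array()) is left out so the example runs without torch.

```r
# Hypothetical boxes: one row per object,
# columns assumed to be xmin, ymin, xmax, ymax (pixels)
boxes <- matrix(
  c(10, 20, 50,  80,
    30, 40, 90, 100),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("xmin", "ymin", "xmax", "ymax"))
)

# Width, height, and area of each box
widths  <- boxes[, "xmax"] - boxes[, "xmin"]
heights <- boxes[, "ymax"] - boxes[, "ymin"]
areas   <- widths * heights

print(data.frame(width = widths, height = heights, area = areas))
```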

Dataset Statistics

catalog <- get_rf100_catalog()

# Total size of all datasets
sum(catalog$total_size_mb) / 1024  # In GB

# Datasets by size
catalog[order(-catalog$total_size_mb), c("dataset", "collection", "total_size_mb")]

# Smallest and largest datasets
catalog[which.min(catalog$total_size_mb), ]
catalog[which.max(catalog$total_size_mb), ]

# Average size by collection
aggregate(total_size_mb ~ collection, data = catalog, FUN = mean)
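The same kind of summary can be extended to show how much of the total download each collection accounts for. The sketch below runs on a toy data frame. The dataset names and sizes are made up for illustration only and do not reflect the real catalog; only the column names collection and total_size_mb are taken from the code above.

```r
# Toy catalog: placeholder names and made-up sizes, for illustration only
toy <- data.frame(
  collection = c("biology", "biology", "medical", "medical", "damage"),
  dataset = c("ds_a", "ds_b", "ds_c", "ds_d", "ds_e"),
  total_size_mb = c(10, 30, 100, 300, 50),
  stringsAsFactors = FALSE
)

# Total size per collection, in MB
totals <- tapply(toy$total_size_mb, toy$collection, sum)

# Share of the overall size contributed by each collection, in percent
share <- round(100 * totals / sum(totals), 1)

print(data.frame(total_mb = totals, share_pct = share))
```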

Filtering and Exploration

The catalog is a regular data frame, so you can use standard R operations:

# Find small datasets (< 20 MB total)
subset(catalog, total_size_mb < 20)

# Find large datasets (> 200 MB total)
subset(catalog, total_size_mb > 200)

# Find datasets with specific keywords
subset(catalog, grepl("tumor|cancer|disease", description, ignore.case = TRUE))

# Datasets with all three splits
subset(catalog, has_train & has_test & has_valid)
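These filters combine naturally. As one possible pattern, the sketch below keeps only datasets that have all three splits and then picks the smallest ones that fit within a download budget. The toy data frame and budget_mb are hypothetical; the sizes are made up, and only the column names mirror the ones used above.

```r
# Toy catalog: placeholder names and made-up values, for illustration only
toy <- data.frame(
  dataset = c("ds_a", "ds_b", "ds_c", "ds_d"),
  total_size_mb = c(120, 15, 60, 300),
  has_train = TRUE,
  has_test = c(TRUE, TRUE, FALSE, TRUE),
  has_valid = TRUE,
  stringsAsFactors = FALSE
)

budget_mb <- 200  # hypothetical download budget

# Keep datasets that provide train, test, and valid splits
complete <- subset(toy, has_train & has_test & has_valid)

# Sort smallest first, then keep rows while the running total fits the budget
complete <- complete[order(complete$total_size_mb), ]
picked <- complete[cumsum(complete$total_size_mb) <= budget_mb, ]

print(picked[, c("dataset", "total_size_mb")])
```

Sorting before taking the cumulative sum favors many small datasets over a few large ones, which is one reasonable choice when the goal is breadth.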

Additional Resources

  • Roboflow Universe: Browse datasets at https://universe.roboflow.com/browse/
  • Collection Functions: See ?rf100_biology_collection, ?rf100_medical_collection, etc.
  • Visualization: See ?draw_bounding_boxes for visualizing detections

Citation

If you use RF100 datasets in your research, please cite:

@article{roboflow100,
  title={Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark},
  author={Roboflow},
  journal={arXiv preprint},
  year={2022}
}