
Top 10 Open-Source Datasets for Computer Vision

Explore top 10 open-source datasets for diverse applications in computer vision!


For the past few months, we've been crafting an educational series on the best resources for learning and using AI, ML, and Computer Vision. This series includes articles on top open-source computer vision models, computer vision courses, computer vision blogs, AI podcasts, and more. Continuing this effort, we're now offering an overview of the top ten open-source datasets for machine learning, with a primary focus on computer vision.


Open source has been a game-changer in technology, evolving from Linux to programming languages, deep learning algorithms, and now, accessible open-source datasets available on platforms like Kaggle, Papers with Code, and Hugging Face. These diverse datasets cater to countless applications, allowing learners, independent engineers, and researchers to make significant strides.

Below, we'll highlight ten prominent open-source datasets for computer vision, each focusing on different aspects of AI and ML.

🎓
Subscribe to our blog for quality content on computer vision.

Datasets

1. SA-1B Dataset

The SA-1B dataset boasts 11 million diverse, high-quality images with 1.1 billion pixel-perfect annotations, making it an ideal resource for training and evaluating advanced computer vision models.

SA-1B Dataset

The SA-1B (Segment Anything 1 Billion) dataset, developed by Meta AI, marks a monumental leap in computer vision, particularly for image segmentation. With over 11 million varied images and 1.1 billion segmentation masks, it stands as the largest dataset for this task.

This dataset was designed for diversity and inclusivity, featuring examples across a wide range of domains, objects, people, and regions, making it a gold standard for coverage. Its immense scale and variety enable the training of highly robust and generalizable segmentation models; it is no surprise that the popular SAM (Segment Anything Model) was trained on it.

  • Focus: Segmentation
  • Research Paper: Segment Anything
  • Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick
  • Dataset Size: 11 million images, 1.1 billion masks, average image resolution of 1500×2250 pixels
  • License: Limited; Research purpose only
  • Access Link: Official Webpage
📖
Learn more about SAM here, with examples.
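
SA-1B is distributed as tar shards in which every image is paired with a JSON sidecar, and the masks inside are stored in COCO run-length encoding (RLE). Here is a minimal sketch of decoding one such file with pycocotools; the file path is illustrative, and the field names follow the released shards.

```python
import json

from pycocotools import mask as mask_utils  # pip install pycocotools

# Each SA-1B image ships with a JSON sidecar; the path here is illustrative.
with open("sa_000000/sa_1.json") as f:
    record = json.load(f)

print("image size:", record["image"]["width"], "x", record["image"]["height"])

# Masks are stored in COCO run-length encoding (RLE); decode each into a
# binary HxW numpy array.
for ann in record["annotations"][:5]:
    binary_mask = mask_utils.decode(ann["segmentation"])
    print(f"mask {ann['id']}: {binary_mask.shape}, {int(binary_mask.sum())} px")
```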

2. VisualQA

The Visual Question Answering (VQA) dataset includes over 260,000 images, combining real images from COCO with abstract scenes, each paired with multiple questions and answers, alongside an automatic evaluation metric. This dataset challenges ML models to combine vision, language, and commonsense knowledge to comprehend images and answer open-ended questions.

VisualQA Dataset

Visual Question Answering (VQA) is a complex task that merges computer vision and natural language processing. The VisualQA dataset provides a rich collection of images paired with natural language questions and their corresponding answers. It encourages models to grasp both visual content and linguistic subtleties to provide accurate responses.

This dataset is well suited to training multimodal models that must reason jointly over visual content and natural language.
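
The v2 release pairs a questions file with an annotations file, joined by question_id and image_id. Below is a minimal sketch of that join, assuming the standard v2 JSON layout; the file names are illustrative.

```python
import json

# File names are illustrative; download the v2 files from visualqa.org.
with open("v2_OpenEnded_mscoco_val2014_questions.json") as f:
    questions = {q["question_id"]: q for q in json.load(f)["questions"]}

with open("v2_mscoco_val2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

# Each annotation carries ten human answers plus a consensus answer.
for ann in annotations[:3]:
    q = questions[ann["question_id"]]
    print(f"image {q['image_id']}: {q['question']} "
          f"-> {ann['multiple_choice_answer']}")
```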

3. ADE20K

The ADE20K dataset features over 25,000 diverse and densely annotated images, serving as a key benchmark for developing computer vision models focused on semantic segmentation.

ADE20K Dataset

The ADE20K dataset, created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), is another comprehensive scene parsing benchmark, offering pixel-level annotations for a wide array of scenes and objects. It contains more than 25,000 images with detailed segmentation masks for objects and their parts, making it invaluable for training models that need to understand the composition of complex visual scenes.
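
In the commonly used SceneParsing release, each image comes with a single-channel PNG mask whose pixel values are class indices (0 meaning "unlabeled"). A minimal sketch of inspecting one mask with Pillow and NumPy, with an illustrative path:

```python
import numpy as np
from PIL import Image  # pip install pillow

# Path is illustrative; masks in the SceneParsing release are single-channel
# PNGs where each pixel value is a class index (0 means "unlabeled").
mask_path = "ADEChallengeData2016/annotations/training/ADE_train_00000001.png"
mask = np.array(Image.open(mask_path))

# Count how many pixels each class occupies in this scene.
classes, counts = np.unique(mask, return_counts=True)
for cls, cnt in zip(classes, counts):
    print(f"class {cls}: {cnt} pixels")
```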

4. YouTube-8M

YouTube-8M is a large-scale video dataset comprising 7 million YouTube videos annotated with visual and audio labels for various machine learning tasks.

YouTube-8M: Videos for an RTS Game

This video dataset was designed specifically for video understanding and classification. It includes millions of YouTube video IDs with video-level labels drawn from a rich vocabulary of 4,716 classes.

Essentially, you can find videos on almost any topic you need. For instance, we tested it by searching for a popular RTS game from 2003. Its massive size and variety make it well-suited for training powerful video analysis models.
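
YouTube-8M is distributed as TFRecord files of precomputed features rather than raw video. The sketch below parses the video-level records with TensorFlow, assuming the documented schema (an id string, integer labels, and averaged 1024-d visual / 128-d audio features); the file path is illustrative.

```python
import tensorflow as tf  # pip install tensorflow

# Video-level records carry an id, integer labels, and precomputed
# mean visual (1024-d) and audio (128-d) feature vectors.
feature_spec = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
}

# Path is illustrative; shards are downloaded from the official site.
dataset = tf.data.TFRecordDataset("video/train0000.tfrecord")
for raw in dataset.take(2):
    example = tf.io.parse_single_example(raw, feature_spec)
    labels = tf.sparse.to_dense(example["labels"]).numpy()
    print(example["id"].numpy().decode(), "labels:", labels)
```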

5. Google's Open Images

Google's Open Images is a publicly accessible dataset providing roughly 9 million labeled images, offering a valuable resource for diverse computer vision tasks and research.

Google's Open Images

Google's Open Images dataset is an expansive collection of images featuring rich annotations, including image-level labels, object bounding boxes, object segmentation masks, and visual relationships.
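
One convenient way to pull a manageable slice of Open Images is the FiftyOne dataset zoo. A minimal sketch, assuming fiftyone is installed; the class names and sample cap are just an example:

```python
import fiftyone as fo
import fiftyone.zoo as foz  # pip install fiftyone

# Download a small slice with detection labels only; restricting
# label_types, classes, and max_samples keeps the download small.
dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
    classes=["Cat", "Dog"],
    max_samples=50,
)

print(dataset)
session = fo.launch_app(dataset)  # browse the samples and their boxes
```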

6. MS COCO

MS COCO (Common Objects in Context) is a widely used large-scale dataset featuring 330,000 diverse images with extensive annotations for object detection, segmentation, and captioning.

MS COCO Dataset

This popular dataset contains a large number of images with annotations for common objects found in their natural environments, including bounding boxes, segmentation masks, and five descriptive captions per image. Many foundational computer vision models have been developed using the MS COCO dataset.

  • Focus: Object Detection, Image Captioning, Segmentation
  • Research Paper: Microsoft COCO: Common Objects in Context
  • Authors: Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollar
  • Dataset Size: 330,000 images, 1.5 million object instances, 80 object categories, and 91 stuff categories
  • License: CC BY 4.0
  • Access Links: Official Webpage, PyTorch, TensorFlow
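
The official pycocotools API (linked above) is the standard way to traverse COCO annotations. A minimal sketch listing instances of one category, assuming the 2017 validation annotations have been downloaded; the path is illustrative:

```python
from pycocotools.coco import COCO  # pip install pycocotools

# Path is illustrative; instances_val2017.json ships with the dataset.
coco = COCO("annotations/instances_val2017.json")

# Find images containing the "dog" category and list their annotations.
cat_ids = coco.getCatIds(catNms=["dog"])
img_ids = coco.getImgIds(catIds=cat_ids)

img = coco.loadImgs(img_ids[0])[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"], catIds=cat_ids))
print(img["file_name"], f"has {len(anns)} dog instance(s)")
for ann in anns:
    print("bbox (x, y, w, h):", ann["bbox"])
```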

7. CT Medical Images

The CT Medical Image dataset is a small sample drawn from the Cancer Imaging Archive, selected to meet certain criteria on age, modality, and contrast tags.

CT Medical Images

This dataset is designed to train models that can recognize image textures, statistical patterns, and highly correlated features. That capability supports straightforward tools that automatically flag misclassified images and identify outliers, which might indicate suspicious cases, inaccurate measurements, or poorly calibrated machines in cancer treatment.

💉
Learn how computer vision is being implemented in healthcare.
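
The archive ships as DICOM files, so pydicom is a natural starting point. Below is a minimal sketch computing per-slice intensity statistics, the kind of simple features the outlier screening described above could start from; the file path is illustrative.

```python
import numpy as np
import pydicom  # pip install pydicom

# Path is illustrative; each file in the archive is a single CT slice.
ds = pydicom.dcmread("dicom_dir/ID_0000_AGE_0060_CONTRAST_1_CT.dcm")

pixels = ds.pixel_array.astype(np.float32)
print("modality:", ds.Modality)
print("shape:", pixels.shape)
print("mean intensity:", pixels.mean(), "std:", pixels.std())

# A crude outlier screen would flag slices whose intensity statistics
# fall far outside the cohort's typical range.
```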

8. Aff-Wild2

The Aff-Wild2 dataset comprises 564 videos, totaling approximately 2.8 million frames from 554 subjects, designed for the task of emotion recognition using facial images.

Aff-Wild2 Dataset

Aff-Wild2 is a challenging dataset for the automatic analysis of in-the-wild facial expressions and affective states. It contains videos of participants exhibiting a wide range of emotions and expressions—such as sadness, anger, and satisfaction—in unconstrained environments. These videos are richly annotated for valence, arousal, and discrete emotion categories.
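
Since the labels are typically provided per frame, the first practical step is aligning decoded video frames with the label stream. A minimal sketch with OpenCV follows; the paths and the one-label-per-line annotation layout are assumptions for illustration, so check the official release notes for the exact format.

```python
import cv2  # pip install opencv-python

# Paths and annotation layout are assumptions for illustration.
cap = cv2.VideoCapture("videos/subject_001.mp4")
with open("annotations/subject_001.txt") as f:
    labels = [line.strip() for line in f if line.strip()]

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok or frame_idx >= len(labels):
        break
    valence_arousal = labels[frame_idx]  # e.g. "0.35,-0.12" (assumed format)
    # ... feed (frame, valence_arousal) into your training pipeline ...
    frame_idx += 1

cap.release()
print(f"paired {frame_idx} frames with labels")
```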

9. DensePose-COCO

DensePose-COCO provides dense human pose annotations for 50,000 people across images from the COCO dataset, allowing for a detailed understanding of the human body's pose and shape.

DensePose-COCO Dataset

DensePose-COCO extends the MS COCO dataset by offering dense correspondence annotations between 2D images and a 3D surface model of the human body. This enables mapping every pixel of a human body in an image to a specific location on a 3D model, providing a fine-grained understanding of human pose and shape.
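
Because DensePose-COCO reuses the COCO annotation format, pycocotools can read it directly; the extra dense-correspondence fields (dp_x/dp_y for sampled points, dp_I for body-part index, dp_U/dp_V for surface coordinates) follow the public release, but treat those names as assumptions and verify them against the files you download.

```python
from pycocotools.coco import COCO  # pip install pycocotools

# Path is illustrative; densepose_coco_2014_minival.json is one of the
# released annotation files.
coco = COCO("annotations/densepose_coco_2014_minival.json")

for ann in coco.loadAnns(coco.getAnnIds()):
    # Not every person annotation carries DensePose fields.
    if "dp_U" not in ann:
        continue
    # dp_x/dp_y: sampled points, dp_I: body-part index,
    # dp_U/dp_V: UV coordinates on the 3D surface model.
    print(f"annotation {ann['id']}: {len(ann['dp_U'])} surface points")
    break
```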

10. BDD100K

The BDD100K dataset is a large-scale, diverse driving video dataset containing over 100,000 videos.

BDD100K Dataset

BDD100K comprises 100,000 videos of diverse driving scenarios with rich annotations, including object bounding boxes, drivable areas, lane markings, and traffic lights. Its scale and variety make it ideal for training robust perception models for self-driving cars.
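
Labels for the 100K images come as one JSON list with per-image label entries. A minimal sketch counting detection categories, assuming the published schema (a category plus box2d corners per object); the path is illustrative.

```python
import json
from collections import Counter

# Path is illustrative; detection labels ship as one JSON list with entries
# like {"name": ..., "labels": [{"category": ..., "box2d": {...}}, ...]}.
with open("labels/bdd100k_labels_images_val.json") as f:
    records = json.load(f)

counts = Counter()
for record in records:
    for label in record.get("labels", []):
        if "box2d" in label:  # skip lane and drivable-area polylines
            counts[label["category"]] += 1

for category, n in counts.most_common(5):
    print(category, n)
```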

Conclusion

The open-source datasets listed above are just a fraction of the valuable datasets available to the global machine learning and AI community. Whether you are a researcher, an AI/ML engineer, or a learner, you can use them to train, evaluate, and test your AI/ML models, or even build smaller-scale applications on top of them.

Their accessibility and richness have accelerated research and development across numerous industries, from computer vision and natural language processing to medical imaging and autonomous systems. Leveraging these datasets is crucial for building robust, generalizable, and high-performing machine learning models.

Explore More

For additional insights into datasets, check out the references below:

References

  1. Akruti Acharya (Jun 27, 2023). Top 10 Open Source Datasets for Machine Learning. Encord Blog: Link
  2. Michael Abramov (Feb 9, 2024). Best Datasets for Training Semantic Segmentation Models. Keymakr Blog: Link