Efficient dataset management is essential for building AI and Machine Learning (ML) models and computer vision projects. Unitlab simplifies this process with version control, similar to git
branches, allowing datasets to evolve over time while staying consistent and organized. Annotators can release specific versions for use, update them as new data are added, and keep track of all changes, ensuring smooth collaboration and reproducibility.
Unitlab centralizes dataset management by streamlining data ingestion and storage. It ensures datasets remain consistent and accessible, promoting collaboration and reproducibility. The platform assigns access permissions, tracks usage, and automates workflows to maintain data integrity.
Releasing a dataset involves creating a version that is ready for use in AI or business applications. As the dataset grows, newer versions can be released to reflect improvements. Version control ensures that every change is tracked, with options to revert to earlier versions or compare updates. This prevents inconsistencies across computer vision projects and fosters better teamwork.
Datasets can have different data annotation formats, depending on the use case. This flexibility is useful as formats can evolve with a project's needs and requirements.
Enhance AI Development Speed: Annotate Data 15x Faster with Unitlab
Setting up a project
In our previous post, we learned how to set up a project at Unitlab. To create and release a dataset, we need a project. If you do not have a project, you can check out our guide.
Alternatively, users can clone a public dataset and start working with the dataset. There are numerous public official and user-created datasets available at Unitlab.
Data Labeling
To release a dataset, we first need to label our data. If an image is not labeled, then it will not be included in the dataset that will be released. The project contains all images, whereas the released dataset contains only the labeled images. Depending on how the project is set up, an AI Model can be used to automatically annotate, or annotators can label images themselves.
After the data annotation process, the first version of our dataset will be ready to be released and used for business purposes for computer vision. In our case, 6 of the images in our project are labeled. When we release this dataset, only the labeled images will be included in its first version.
Releasing
After annotating data of interest, we can release a dataset in the dashboard of our project. Because we are releasing the dataset for the first time, we do not have previous versions. The version automatically starts at 0.01. We can now release a dataset, by clicking the "+ Release Dataset" button.
Since we are currently under the free plan. which is ideal for hobbyists, students, and individual users, we cannot release a private dataset. Under the current plan, the released dataset will become a public dataset in the Unitlab database. If you need private datasets, you can check out other plans available at Unitlab.
We must choose an annotation format or model for our dataset. We can choose COCO, YOLOv5 or YOLOv8. The COCO format is a general purpose model. However, it is possible to choose other formats depending on the business needs.
Currently, we are making our dataset public, for which we need to choose a license. An open-source license grants certain permissions to the user that is using a particular piece of software or dataset. In essence, the license dictates what can and cannot be done with the dataset. There are, as of now, four licensing options available at Unitlab:
- MIT - Permissive, allows commercial use
- CC BY 4.0 - Requires attribution
- Public Domain - No restrictions
- BY-NC-SA 4.0 - Restricts commercial use and requires the same license for derivatives
For our purposes here, we use the MIT license, which is widely used for open-source software and open datasets. After choosing the format and the license, we can release the dataset.
Downloading
Our dataset is now released and ready to be used. Users can now clone it and use it for their own purposes. It is a public dataset that can be cloned and used by others under the MIT license. Since we annotated only 6 images, the resulting dataset contains only these labeled images.
If we want to download the image, we can download the images in a zip format. We can separately download the annotations in a JSON format. As the primary use case of annotation is for AI/ML purposes, the models expect the data annotations in a standardized format that a machine can understand.
We have three options to download the results:
- Web interface: the standard browser functionality.
- Unitlab CLI: the command-line interface.
- Python package: the
unitlab
Python package.
For this tutorial, we’ll use the first and simplest option, the web browser.
Conclusion
Unitlab’s version-controlled dataset management ensures that data remains consistent and accessible throughout the project. Whether the goal is to build an AI/ML model, collaborate with a team, or share work publicly, the platform provides a smooth, reliable way to manage and release datasets. With options for tracking changes, selecting licenses, and flexible download methods, Unitlab makes it easy to create structured, reproducible datasets for any purpose.
By keeping datasets organized and up to date, projects benefit from improved collaboration, reduced errors, and better AI performance.