These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4]
Many organizations, including governments, publish and share their datasets. Depending on their licenses, these datasets are classified as open data or non-open data.
Datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces like Open API. They are organized into various types and subtypes.
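As an illustration of programmatic access, the following is a minimal sketch that assumes a portal running CKAN and exposing its Action API `package_search` endpoint. The portal address `https://data.example.org` and the sample response are hypothetical; no network request is made, and other portal software will use different endpoints.

```python
from urllib.parse import urlencode, urljoin


def build_package_search_url(portal_base, query, rows=10):
    """Build a CKAN Action API dataset-search URL for a portal."""
    endpoint = urljoin(portal_base.rstrip("/") + "/", "api/3/action/package_search")
    return endpoint + "?" + urlencode({"q": query, "rows": rows})


def dataset_titles(response):
    """Return dataset titles from a decoded package_search response."""
    if not response.get("success"):
        return []
    return [pkg.get("title", "") for pkg in response["result"]["results"]]


# Hypothetical portal address (assumption, not taken from the article).
url = build_package_search_url("https://data.example.org", "air quality", rows=5)

# Hand-written stand-in for the JSON a CKAN portal would return.
sample_response = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"title": "Hourly air quality readings"},
            {"title": "Air quality monitoring stations"},
        ],
    },
}

titles = dataset_titles(sample_response)
```

In practice the URL would be fetched with an HTTP client and the JSON body decoded before being passed to `dataset_titles`.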
Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.
https://github.com/sebneu/ckan_instances/blob/master/instances.csv
https://dataverse.org/metrics
A data portal sometimes lists a wide variety of dataset subtypes pertaining to many machine learning applications.
Data portals suited to a specific subtype of machine learning application are listed in the subsequent sections.
These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Categorization
citation analysis
These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.
These datasets contain electric signal information that requires some form of signal processing before further analysis.
Datasets from physical systems.
Datasets from biological systems.
This section includes datasets that deal with structured data.
Further details are provided in the project's GitHub repository and respective Hugging Face dataset card.
This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.
Taskmaster-2: 17,289 dialogs in seven domains (restaurants, food ordering, movies, hotels, flights, music and sports).
Taskmaster-3: 23,757 movie ticketing dialogs.
Taskmaster-3: conversation id, utterances, vertical, scenario, instructions.
For further details check the project's GitHub repository or the Hugging Face dataset cards (taskmaster-1, taskmaster-2, taskmaster-3).
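The record layout listed above for Taskmaster-3 (conversation id, utterances, vertical, scenario, instructions) can be sketched as a small data structure. This is an illustrative sketch only: the attribute names, the per-utterance `speaker` and `text` fields, and the sample dialog are assumptions, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Utterance:
    speaker: str  # assumed labels: "user" or "agent"
    text: str


@dataclass
class Dialog:
    conversation_id: str
    vertical: str
    scenario: str
    instructions: str
    utterances: List[Utterance] = field(default_factory=list)


def turns_by_speaker(dialog: Dialog) -> Dict[str, int]:
    """Count how many turns each actor takes in a dialog."""
    counts: Dict[str, int] = {}
    for utterance in dialog.utterances:
        counts[utterance.speaker] = counts.get(utterance.speaker, 0) + 1
    return counts


def alternates(dialog: Dialog) -> bool:
    """Return True if no actor takes two consecutive turns."""
    pairs = zip(dialog.utterances, dialog.utterances[1:])
    return all(a.speaker != b.speaker for a, b in pairs)


# Made-up example dialog; not taken from the dataset.
example = Dialog(
    conversation_id="example-0001",
    vertical="movie ticketing",
    scenario="hypothetical scenario",
    instructions="hypothetical instructions",
    utterances=[
        Utterance("user", "Two tickets for tonight, please."),
        Utterance("agent", "Which movie would you like to see?"),
        Utterance("user", "Whatever starts around eight."),
    ],
)

counts = turns_by_speaker(example)
```

Keeping the actor label on each utterance makes it straightforward to split a dialog into request and response pairs for training.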
Additionally, each task contains a task definition.
Further information is provided in the GitHub repository of the project and the Hugging Face data card.
The dataset can be downloaded here, and the rejected data here.
The scripts to process the data are available in the GitHub repo mentioned in the paper: https://github.com/google-research/FLAN/tree/main/flan.
Another FLAN GitHub repo was created as well. This is the one associated with the dataset card on Hugging Face.
Mechanisms of Attack, Domains of Attack
Software Development, Hardware Design, Research Concepts
2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022.
Data files can also be downloaded here.
Data is also available here.
Alternate list of reports.
Workshops
As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.
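To illustrate what standardizing a format can involve, the following is a minimal sketch that maps two differently shaped sources, a CSV file and a JSON Lines file, onto one common record layout. The column names, keys, target layout (`text`, `label`), and sample rows are all hypothetical and are not drawn from any dataset listed here.

```python
import csv
import io
import json


def from_csv(text, text_col, label_col):
    """Read CSV text and map the chosen columns to the common layout."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"text": row[text_col], "label": row[label_col]} for row in reader]


def from_jsonl(text, text_key, label_key):
    """Read JSON Lines text and map the chosen keys to the common layout."""
    records = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            obj = json.loads(line)
            # Labels are stored as strings so both sources share one type.
            records.append({"text": obj[text_key], "label": str(obj[label_key])})
    return records


# Hypothetical inputs with different column names and label encodings.
csv_source = "review,sentiment\nGreat film,pos\nToo long,neg\n"
jsonl_source = '{"body": "Loved it", "y": 1}\n{"body": "Dull", "y": 0}\n'

unified = from_csv(csv_source, "review", "sentiment") + from_jsonl(jsonl_source, "body", "y")
```

The sketch only unifies field names and types; reconciling label vocabularies (for example `pos` versus `1`) would be a separate curation step.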