Audio datasets are harder to build than visual datasets because both collecting the recordings and annotating them by hand take a lot of work. This page maintains a list of audio datasets suitable for artistic research.

Speech Datasets

VoxCeleb | Celebrity Identification Dataset

VoxCeleb is a speaker identification dataset for determining which celebrity a voice belongs to. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is roughly gender balanced (55% of the speakers are male). The celebrities span a diverse range of accents, professions, and ages.

File Size : 150 GB
Link :

MELD | Audio Emotion Identification Dataset

MELD has more than 1,400 dialogues and 13,000 utterances from the TV series Friends. Multiple speakers take part in each dialogue. Each utterance in a dialogue is labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, or Fear. MELD also has a sentiment annotation (positive, negative, or neutral) for each utterance.

File Size : 10.1 GB
Link :

TED-LIUM | Audio Database of TED talks

TED-LIUM contains audio recordings of 1,495 TED talks along with full-text transcriptions of those recordings, created by the Laboratoire d’Informatique de l’Université du Maine (LIUM).

File Size : 54 GB
Link :

Environmental Sound Datasets

UrbanSound | Urban Sound Classification Dataset

The dataset is called UrbanSound and contains 8,732 labeled sound excerpts (each under 4 s) of urban sounds from 10 classes: Air Conditioner, Car Horn, Children Playing, Dog Bark, Drilling, Engine Idling, Gun Shot, Jackhammer, Siren, and Street Music.

File Size : 6 GB
Link :

Mivia | Audio Dataset for Surveillance Applications

The dataset contains 6,000 audio events relevant to surveillance applications: glass breaking, gunshots, and screams. The events are divided into a training set of 4,200 events and a test set of 1,800 events. In audio surveillance applications, the events of interest (for instance a scream) can occur at different distances from the microphone, which correspond to different signal-to-noise ratios. Moreover, in these applications the events are generally mixed with a complex background, usually composed of several different types of sound that depend on the specific environment, indoor or outdoor (household appliances, cheering crowds, people talking, traffic jams, passing cars or motorbikes, etc.).

File Size : 6 GB (registration required)
Link :
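
The Mivia description mentions events heard at different signal-to-noise ratios over a complex background. As a rough illustration of what that means, the sketch below scales an event so that it sits at a chosen SNR (in dB) relative to a background signal and then adds the two. It is not taken from the Mivia tooling: the function names (mean_power, mix_at_snr, snr_db) and the synthetic sine and noise signals are assumptions made up for this example, and real recordings would be loaded from audio files instead.

```python
import math
import random


def mean_power(samples):
    """Mean squared amplitude of a sequence of float samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


def mix_at_snr(event, background, target_snr_db):
    """Scale `event` to the target SNR against `background`, then add them.

    Returns the mixed samples and the gain applied to the event.
    Hypothetical helper for illustration only.
    """
    p_event = mean_power(event)
    p_background = mean_power(background)
    if p_event == 0 or p_background == 0:
        raise ValueError("event and background must both be non-silent")
    # Power the event needs in order to reach the requested SNR.
    target_power = p_background * 10 ** (target_snr_db / 10)
    gain = math.sqrt(target_power / p_event)
    mixed = [gain * e + b for e, b in zip(event, background)]
    return mixed, gain


# Synthetic stand-ins: a 440 Hz tone as the "event", uniform noise as background.
random.seed(0)
sample_rate = 8000
event = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
background = [random.uniform(-0.5, 0.5) for _ in range(sample_rate)]

# A lower SNR corresponds to an event that is quieter relative to the background.
for target in (5.0, 15.0, 30.0):
    mixed, gain = mix_at_snr(event, background, target)
    scaled_event = [gain * e for e in event]
    print(target, round(snr_db(scaled_event, background), 3))
```

Scaling the event rather than the background keeps the background level fixed, so the different SNR values can be read as the same sound placed closer to or farther from the microphone.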

  • audio_dataset.txt
  • Last modified: 2019/07/17 10:31
  • by waag