Machine learning requires data. One way to obtain it is to scrape or retrieve it directly from a source. Another is to reuse datasets that are freely available on the internet, many of them built for the most common tasks. Because data scientists so often spend a lot of time searching for and processing data, Google offers its own Dataset Search tool. There are also large lists of datasets for research purposes, such as this one.
Here is my personal list of datasets I have used or been interested in using, organized by type of data.
Music
Name | Notes |
---|---|
MIDI DB | midi, music |
Feelyoursound | music, chord, progression |
Jingle | jingle, music |
Lakh pianoroll | midi, pianoroll, music |
RWC | popular, music |
Composing.ai | midi |
Chord progressions of 5000 songs | chords |
Links to dataset lists
Lyrics
MoodyLyrics is a sentiment-annotated lyrics dataset. DALI is a dataset of lyrics whose words, lines and paragraphs are aligned with notes.
Text
Name | Notes |
---|---|
Sentiment140 | sentiment, twitter |
Twitter 2010 | |
Twitter sentiment | twitter, sentiment |
OPUS | parallel, translation |
Emergent | stance, politics |
Europarl | speech |
DEFT | speech, french |
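
Most of the text datasets above ship as plain CSV or TSV files, so getting started is usually a matter of parsing rows. As a rough illustration, here is a minimal sketch for reading a Sentiment140-style file. The six-column layout and the 0/2/4 polarity encoding are my assumptions about that dataset's format, so check them against the copy you download; the two sample rows are made up, and `read_tweets` is just a helper name I picked.

```python
import csv
import io

# Assumed column layout of a Sentiment140-style CSV (the file has no header row).
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# Assumed label encoding: 0 = negative, 2 = neutral, 4 = positive.
LABELS = {"0": "negative", "2": "neutral", "4": "positive"}


def read_tweets(handle):
    """Yield (label, text) pairs from a Sentiment140-style CSV stream."""
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        yield LABELS.get(row["polarity"], "unknown"), row["text"]


# Two made-up rows standing in for the real file.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my week has been awful"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","loving the sunshine today"\n'
)

tweets = list(read_tweets(io.StringIO(SAMPLE)))
for label, text in tweets:
    print(label, "|", text)
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, encoding="latin-1", newline="")`, where `path` is wherever you saved the download and the encoding is a guess you may need to adjust.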
Links to dataset lists
- ICSWSM list of Twitter datasets