Dataset Guide

This page has two main aims: first, to serve as a catalogue of datasets for various data science and machine learning tasks; and second, to record how easy each of these datasets is to access. Only resources that I have actually used are included.

Free and immediately-accessible datasets

NLP datasets

  • WikiPlots: “A dataset containing story plots from Wikipedia (books, movies, etc.)”. The dataset is made up of 112,936 such story plots. One sentence per line; stories separated by a special line. 236MB unzipped (including a title list). Note: The GitHub repository also includes the code for extracting plots from Wikipedia dumps.
  • The Westbury Lab Wikipedia corpus (2010): “This corpus was created from a snapshot of all the articles in the English part of the Wikipedia that was taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc) The corpus is untagged, raw text.” The dataset is made up of 990,248,478 words from over 2 million documents. One paragraph per line; titles/headings/subheadings included as sentences; articles separated by special lines. 6.1GB unzipped.
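Both corpora above share the same basic layout: one unit of text per line, with records (stories or articles) delimited by a special separator line. A minimal sketch of how such a file might be read lazily is given below. The function name `iter_records` and the `<SEP>` marker are hypothetical; the actual separator string is not given here and should be taken from each dataset's own documentation.

```python
from typing import Iterable, Iterator, List


def iter_records(lines: Iterable[str], separator: str) -> Iterator[List[str]]:
    """Group a stream of lines into records delimited by a separator line.

    `separator` is an assumption supplied by the caller: look up the real
    marker in the dataset's documentation before using this on real data.
    """
    record: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line == separator:
            # End of the current story/article: emit it if it has content.
            if record:
                yield record
            record = []
        elif line:
            # Skip blank lines, keep everything else.
            record.append(line)
    # Emit a trailing record that is not followed by a separator.
    if record:
        yield record


# Tiny in-memory stand-in for a corpus file ("<SEP>" is a placeholder).
sample = [
    "First story, sentence one.\n",
    "First story, sentence two.\n",
    "<SEP>\n",
    "Second story, sentence one.\n",
    "<SEP>\n",
]
for number, story in enumerate(iter_records(sample, "<SEP>"), start=1):
    print(number, len(story), "line(s)")
```

For a real file, an open file handle can be passed in place of `sample`, e.g. `open(path, encoding="utf-8")`. Because the function is a generator, a multi-gigabyte corpus never has to be held in memory all at once.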

Other guides/lists/repositories

NLP

  • Nicolas Iderhoff’s NLP Datasets repository: “Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)”. Not all of these are free to download or immediately accessible. For example, several require the user to fill in a form and await a response. Many of the larger datasets are hosted on AWS, and at least some of these (e.g. the arXiv dataset) use Requester Pays buckets, where the user requesting the data, rather than the bucket owner, pays the request and data transfer costs of the download.
  • Julian McAuley’s Recommender Systems Datasets: Datasets include user/item interactions; star ratings; timestamps; product reviews; social networks; item-to-item relationships (e.g. copurchases); product images; price, brand, and category information; GPS data; and other metadata.
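As a rough illustration of what the Requester Pays arrangement mentioned above means in practice: a download from such a bucket has to state explicitly that the requester accepts the charges. The sketch below assumes the third-party boto3 library and configured AWS credentials. The helper names are hypothetical, and the bucket, key, and file names in the usage comment are placeholders, not the real location of any dataset.

```python
def requester_pays_args(bucket: str, key: str) -> dict:
    """Build the arguments for an S3 GetObject call on a Requester Pays bucket.

    The RequestPayer field is what acknowledges that the caller, not the
    bucket owner, will be billed for the request and the data transfer.
    """
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download_object(bucket: str, key: str, destination: str) -> None:
    """Download one object to a local file (this incurs AWS charges)."""
    import boto3  # third-party; imported here so the helper above has no dependencies

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_args(bucket, key))
    with open(destination, "wb") as out:
        # Stream the body in chunks rather than reading it all into memory.
        for chunk in response["Body"].iter_chunks():
            out.write(chunk)


# Hypothetical usage; replace the placeholders with a real bucket and key:
# download_object("example-bucket", "path/to/archive.tar", "archive.tar")
```

It is worth checking the size of an object before downloading it, since the cost scales with the amount of data transferred.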

Other

  • Awesome Public Datasets: A large number of datasets organised by content.
  • Vincent Arel-Bundock’s Rdatasets Archive: “[A] collection of over 1200 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages”. Data downloadable in CSV format, and listed by numbers of rows and columns.
  • UC Irvine Machine Learning Repository: A collection of 452 datasets (as of 2018-11-15) of various types for machine learning.
  • Kaggle Datasets: A collection of 12,398 datasets (as of 2018-11-15) of various types, related to Kaggle data science/machine learning competitions and challenges.