This page is for archiving text-based datasets for natural language processing.




Yelp Reviews | Reviews from Yelp

Open dataset released by Yelp for learning purposes. It consists of millions of user reviews, business attributes, and over 200,000 pictures from multiple metropolitan areas.


File Size : 2.66 GB
Link : Yelp Reviews



IMDB Reviews | 25,000 highly polar movie reviews

Dataset of movie reviews from IMDB with multiple review scores. It is meant for binary sentiment classification and contains substantially more data than earlier datasets in this field. Besides the labeled training and test reviews, additional unlabeled data is available as well.


File Size : 80MB
Link : IMDB Reviews


Sentiment 140 | Dataset for sentiment analysis

Sentiment 140 is a dataset of Twitter text that can be used for sentiment analysis. Emoticons have been removed from the data in advance. The final dataset has the following 6 features:

  • polarity of the tweet
  • id of the tweet
  • date of the tweet
  • the query
  • username of the tweeter
  • text of the tweet
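
The six features listed above can be read into a small record type. The following is a minimal sketch, assuming the data is a headerless CSV with the columns in the order shown and an integer polarity value; the names Tweet and read_tweets and the two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Tweet(NamedTuple):
    """One row, using the six features in the order listed above (assumed)."""
    polarity: int
    id: int
    date: str
    query: str
    user: str
    text: str


def read_tweets(handle: TextIO) -> Iterator[Tweet]:
    """Yield one Tweet per well-formed CSV row; skip rows without 6 columns."""
    for row in csv.reader(handle):
        if len(row) != 6:
            continue
        polarity, tweet_id, date, query, user, text = row
        yield Tweet(int(polarity), int(tweet_id), date, query, user, text)


# Made-up rows for illustration only; they are not real dataset entries.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","example_user",'
    '"this is a made-up negative tweet"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","another_user",'
    '"this is a made-up, positive tweet"\n'
)

tweets = list(read_tweets(io.StringIO(SAMPLE)))
for tweet in tweets:
    print(tweet.polarity, tweet.user, tweet.text)
```

To read the downloaded file instead of the sample, pass an open file handle to read_tweets; the file name and its text encoding depend on the download and are not specified here.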


File Size : 80MB
Link : Sentiment 140


Wikipedia Corpus | Full text of Wikipedia

This dataset is a collection of the full text of Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. The corpus can be searched by word, phrase, or part of a paragraph.


File Size : 20MB
Link : Wikipedia Corpus


  • text_dataset.txt
  • Last modified: 2019/08/04 15:57
  • by waag