Text Dataset
This page is for archiving text-based datasets for natural language processing.
Yelp Reviews | Reviews from Yelp
Open dataset released by Yelp for learning purposes. It consists of millions of user reviews, business attributes, and over 200,000 pictures from multiple metropolitan areas.
File Size : 2.66 GB
Link : Yelp Reviews
IMDB Reviews | 25,000 highly polar movie reviews
A dataset of movie reviews from IMDB with their review scores. It is meant for binary sentiment classification and contains substantially more data than earlier datasets in this field. Apart from the training and test review examples, there is further unlabeled data for use as well.
File Size : 80 MB
Link : IMDB Reviews
Sentiment 140 | Dataset for sentiment analysis
Sentiment 140 is a dataset of Twitter text that can be used for sentiment analysis. Emoticons have been removed from the data beforehand. The final dataset has the following six features:
- polarity of the tweet
- id of the tweet
- date of the tweet
- the query
- username of the tweeter
- text of the tweet
File Size : 80 MB
Link : Sentiment 140
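The six features listed for Sentiment 140 map naturally onto a small record type. The following is a minimal sketch, assuming the data is stored as headerless CSV rows with the fields in the order given above and that polarity is an integer code. The names Tweet and read_tweets are hypothetical, and the sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Tweet(NamedTuple):
    """One row, with fields in the order listed for Sentiment 140."""
    polarity: int   # assumed to be an integer code
    tweet_id: str
    date: str
    query: str
    user: str
    text: str


def read_tweets(lines: Iterable[str]) -> Iterator[Tweet]:
    """Yield one Tweet per CSV row; rows are assumed to have no header."""
    for row in csv.reader(lines):
        if len(row) != len(Tweet._fields):
            continue  # skip malformed rows instead of guessing at their layout
        polarity, tweet_id, date, query, user, text = row
        yield Tweet(int(polarity), tweet_id, date, query, user, text)


# Made-up sample rows for illustration only; not real dataset content.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","example_user_a","missed the bus again"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","example_user_b","great coffee, great morning"\n'
)

tweets = list(read_tweets(io.StringIO(SAMPLE)))
for t in tweets:
    print(t.polarity, t.user, t.text)
```

To read a downloaded copy instead of the sample, the same function can be given an open file object, for example open(path, newline=""), where path and the file's text encoding depend on the copy you have.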
Wikipedia Corpus | Full text of Wikipedia
This dataset is a collection of the full text of Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. You can search it by word, phrase, or part of a paragraph.
File Size : 20 MB
Link : Wikipedia Corpus