text_dataset [2019/07/17 10:44]
waag
text_dataset [2019/08/04 15:57] (current)
waag
Line 1: Line 1:
-Underdevelopment
 +===== Text Dataset =====
 + 
 +This page is for archiving text-based datasets for natural language processing. 
 + 
 +\\ 
 + 
 +---- 
 +\\ 
 +== Yelp Reviews | Reviews from Yelp == 
 + 
 +Open dataset released by Yelp for learning purposes. It consists of millions of user reviews, business attributes, and over 200,000 pictures from multiple metropolitan areas.
 + 
 + 
 +\\ 
 +File Size : 2.66 GB  
 +\\ 
 +Link : [[https://www.yelp.com/dataset | Yelp Reviews ]]
 + 
 +---- 
 +\\ 
 +== IMDB Reviews | 25,000 highly polar movie reviews == 
 + 
 +Dataset of movie reviews from IMDB, built for binary sentiment classification. It contains substantially more data than earlier benchmark datasets in this field: 25,000 labeled reviews for training and 25,000 for testing. Additional unlabeled reviews are included as well.
 + 
 +\\ 
 +File Size : 80 MB
 +\\
 +Link : [[http://ai.stanford.edu/~amaas/data/sentiment/ | IMDB Reviews ]]
 + 
 +---- 
 + 
 +== Sentiment 140 | Dataset for sentiment analysis == 
 + 
 +Sentiment 140 is a dataset of Twitter text that can be used for sentiment analysis. Emoticons have been removed from the data. Each record has the following six fields:
 + 
 +  * polarity of the tweet 
 +  * id of the tweet 
 +  * date of the tweet 
 +  * the query 
 +  * username of the tweeter 
 +  * text of the tweet 
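
The six fields listed above can be read into named records with Python's standard `csv` module. This is a minimal sketch, not the dataset's official loader: it assumes the file is a header-less CSV whose columns appear in the order listed, and the `Tweet` type, the `read_tweets` helper, and the sample rows are hypothetical names and made-up data used only for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Tweet(NamedTuple):
    """One record, with fields in the order listed on this page."""
    polarity: int
    tweet_id: str
    date: str
    query: str
    user: str
    text: str


def read_tweets(handle) -> Iterator[Tweet]:
    """Yield one Tweet per CSV row (assumes no header row, six columns)."""
    for row in csv.reader(handle):
        if len(row) != 6:
            continue  # skip rows that do not match the assumed layout
        polarity, tweet_id, date, query, user, text = row
        yield Tweet(int(polarity), tweet_id, date, query, user, text)


# Made-up sample rows standing in for the downloaded file.
SAMPLE = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","feeling sick today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","great coffee this morning"\n'
)

tweets = list(read_tweets(SAMPLE))
for tweet in tweets:
    print(tweet.polarity, tweet.user, tweet.text)
```

To run this against the real download, pass an open file handle instead of `SAMPLE`, for example `open(path, newline="", encoding="latin-1")`. Both the path and the encoding are assumptions to check against the file you actually downloaded.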
 + 
 +\\ 
 +File Size : 80 MB
 +\\
 +Link : [[http://help.sentiment140.com/for-students/ | Sentiment 140 ]]
 + 
 +---- 
 + 
 +== Wikipedia Corpus | Full text of Wikipedia == 
 + 
 +This dataset is a collection of the full text of Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. The corpus can be searched by word, phrase, or part of a paragraph.
 + 
 + 
 +\\ 
 +File Size : 20 MB
 +\\
 +Link : [[https://nlp.cs.nyu.edu/wikipedia-data/ | Wikipedia Corpus ]]
 + 
 +----
  • text_dataset.txt
  • Last modified: 2019/08/04 15:57
  • by waag