Part I. English Corpus

As part of a project on extracting sentiment from text, we have collected a corpus of movie, book, and consumer product reviews. For more information on the corpus collection, and the project it is part of, see the Project Description for "Computational analysis of text sentiment" and the following publications:

  • Taboada, M., C. Anthony and K. Voll (2006) Methods for Creating Semantic Orientation Dictionaries. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC). Genoa, Italy. May 2006. pp. 427-432.
  • Taboada, M. and J. Grieve (2004) Analyzing Appraisal Automatically. American Association for Artificial Intelligence Spring Symposium on Exploring Attitude and Affect in Text. Stanford. March 2004. AAAI Technical Report SS-04-07. pp. 158-161.

The reviews were downloaded in 2004 from the Epinions web site by Jack Grieve. They are divided into the following categories, with 25 positive and 25 negative reviews in each category. The classification into positive and negative was based on the "recommended" or "not recommended" tag that the reviewer provided.

  • Books
  • Cars
  • Computers
  • Cookware
  • Hotels
  • Movies
  • Music
  • Phones

Raw corpus

The entire corpus is provided in .txt format, organized into directories by product category, with file names indicating whether each review is positive or negative.
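A layout like this can be read with a few lines of Python. The following is a minimal sketch under stated assumptions: the function names and the "yes"/"no" file-name prefixes are hypothetical, chosen only for illustration, since the description says only that file names indicate polarity. Adjust the prefixes to match the names in the actual download.

```python
from pathlib import Path

# Assumed markers: the corpus description only states that file names
# indicate polarity. These prefixes are placeholders to adapt.
POSITIVE_PREFIX = "yes"
NEGATIVE_PREFIX = "no"


def polarity_from_name(file_name):
    """Return 'positive', 'negative', or None based on the assumed prefix."""
    stem = Path(file_name).stem.lower()
    if stem.startswith(POSITIVE_PREFIX):
        return "positive"
    if stem.startswith(NEGATIVE_PREFIX):
        return "negative"
    return None


def load_corpus(root):
    """Read root/<category>/<file>.txt into a list of review records."""
    reviews = []
    for path in sorted(Path(root).glob("*/*.txt")):
        label = polarity_from_name(path.name)
        if label is None:
            continue  # skip files that do not follow the assumed naming
        reviews.append({
            "category": path.parent.name,
            "polarity": label,
            "text": path.read_text(encoding="utf-8", errors="replace"),
        })
    return reviews
```

Calling load_corpus on the unpacked corpus directory would return one record per review, with the category taken from the directory name.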

Annotated corpus

1. RST annotations

The corpus is annotated with RST relations at the sentence level (i.e., no full-text analysis; only those relations found within sentences). Texts were annotated with RST Tool, and the tool is necessary to view the annotations. Annotations by Maite Taboada and Montana Hay.

2. Appraisal annotations

Annotations of subjectivity types using the Appraisal framework. The first step in the annotation was the creation of system networks to use in the coding process. The system networks were created using UAM Corpus Tool, and are based on the explanations in J. Martin and P. White (2005) The Language of Evaluation: Appraisal in English. London: Palgrave Macmillan. The system networks and annotations were created by Maite Taboada and Patrick Larrivee-Woods.
Only a subset of the corpus has been annotated using Appraisal: movies, books and hotels (150 texts).

3. Negation and speculation annotations

Annotation of negation and speculation, loosely based on the BioScope corpus annotation. For more detail on the annotation, please see:

Konstantinova, N., S. de Sousa, N. P. Cruz, M. J. Maña, M. Taboada and R. Mitkov (2012) A review corpus annotated for negation, speculation and their scope. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey. May 2012.

Part II. Spanish Corpus

This corpus is a collection of 400 reviews of cars, hotels, washing machines, books, cell phones, music, computers, and movies. Each category contains 50 positive and 50 negative reviews, defined as positive or negative based on the number of stars given by the reviewer (1-2 = negative; 4-5 = positive; 3-star reviews are not included). The reviews were collected from the website Ciao.es. They are intended to be a Spanish parallel to the SFU Review Corpus (in English).
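The star-based labelling rule can be written as a small function. This is an illustrative sketch only: the function name is hypothetical and it is not the script used to build the corpus. It simply restates the rule above (1-2 stars negative, 4-5 stars positive, 3 stars excluded).

```python
def label_from_stars(stars):
    """Map a 1-5 star rating to a polarity label.

    Returns 'negative' for 1-2 stars, 'positive' for 4-5 stars,
    and None for 3 stars, which are left out of the corpus.
    """
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None  # 3-star reviews are not included
```

Returning None for the middle rating keeps the exclusion explicit, so a caller can filter those reviews out rather than mislabel them.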

Any comments or suggestions on the corpora and the annotations are more than welcome. Please let me know if you use them in your research: Maite Taboada (mtaboada@sfu.ca).

©2004-2017 Maite Taboada, Julian Brooke, Jack Grieve, Montana Hay, Patrick Larrivee-Woods
©2017 Maite Taboada, Natalia Konstantinova, Sheila de Sousa