SFU Opinion and Comments Corpus

July 13, 2024

The Discourse Processing Lab has released the SFU Opinion & Comments Corpus (SOCC)!

It contains five years of Globe & Mail opinion articles with all the comments, from January 2012 to December 2016. That's a total of 10,339 articles (6,895,696 words), plus 1,280,454 comments (77,238,179 words).

The corpus currently includes three annotated subsets: the constructiveness & toxicity corpus, the negation corpus, and the appraisal corpus.

The SOCC can be used to study, among other aspects, the connections between articles and comments; the connections of comments to each other; the types of topics discussed in comments; the nice (constructive) or mean (toxic) ways in which commenters respond to each other; and how language is used to convey very specific types of evaluation.

The corpus presented here constitutes an invaluable resource for the study of online comments. While our focus is on comments posted in response to opinion news articles, the phenomena in this corpus are likely to be present on many commenting platforms: other news comments, comments and replies in fora such as Reddit, feedback on blogs, or YouTube comments.

You can find the SOCC on GitHub.