1.3 million labelled comments from Reddit
License: Data files © Original Authors
Tags: arts and entertainment, internet, online communities, social science
Context
This dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. The dataset was generated by scraping comments from Reddit (not by me :)) containing the /s (sarcasm) tag. Redditors often use this tag to indicate that their comment is in jest and not meant to be taken seriously, and it is generally a reliable indicator of sarcastic content.
Content
The data comes in balanced and imbalanced (i.e., true distribution) versions; the true ratio of sarcastic to non-sarcastic comments is about 1:100. The corpus has 1.3 million sarcastic statements, along with the comments they responded to and many non-sarcastic comments from the same source.
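To make the difference between the two versions concrete, here is a minimal sketch of one common way to get a balanced sample from an imbalanced one: random downsampling of the majority class. This is an illustration only, not necessarily how the balanced file in this dataset was built; the function name `balance_by_downsampling` and the toy rows are assumptions made for the example.

```python
import random

def balance_by_downsampling(rows, label_of, seed=0):
    """Return a shuffled list with equal numbers of label-1 and label-0 rows.

    Hypothetical helper: keeps every row of the smaller class and draws an
    equally sized random sample (without replacement) from the larger class.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy data at roughly the 1:100 ratio mentioned above (made-up rows).
rows = [("s%d" % i, 1) for i in range(5)] + [("n%d" % i, 0) for i in range(500)]
balanced = balance_by_downsampling(rows, label_of=lambda r: r[1])
```

Downsampling discards most of the non-sarcastic comments, so a model trained on the balanced version should still be evaluated with the true ratio in mind.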
Labelled comments are in the train-balanced-sarcasm.csv file.
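A minimal sketch of reading the labelled comments with Python's standard `csv` module follows. The column names `label` and `comment` are assumptions (check the header row of the file or the readme), and the in-memory sample rows are made up so the snippet runs without the real file.

```python
import csv
import io
from collections import Counter

def read_labelled_comments(stream, label_col="label", text_col="comment"):
    """Yield (label, text) pairs from an open CSV stream.

    Column names are assumed, not taken from the dataset description.
    Rows with an empty comment are skipped.
    """
    for row in csv.DictReader(stream):
        text = (row.get(text_col) or "").strip()
        if text:
            yield int(row[label_col]), text

def label_counts(pairs):
    """Count how many comments carry each label."""
    return Counter(label for label, _ in pairs)

# Tiny stand-in for the real file; the third row has an empty comment.
sample = io.StringIO(
    "label,comment\n"
    "1,Oh great another Monday\n"
    "0,The patch notes are out\n"
    "1,\n"
)
counts = label_counts(read_labelled_comments(sample))
```

For the real file, pass `open("train-balanced-sarcasm.csv", newline="", encoding="utf-8")` in place of `sample`; in the balanced version the two counts should come out roughly equal.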
Acknowledgements
The data was gathered by Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli for their article “A Large Self-Annotated Corpus for Sarcasm”. The data is hosted here.
Citation:
@unpublished{SARC,
author={Mikhail Khodak and Nikunj Saunshi and Kiran Vodrahalli},
title={A Large Self-Annotated Corpus for Sarcasm},
url={https://arxiv.org/abs/1704.05579},
year=2017
}
Annotation of files in the original dataset: readme.txt.
Inspiration
- Predicting sarcasm and relevant NLP features (e.g. subjective determinant, racism, conditionals, sentiment heavy words, “Internet Slang” and specific phrases).
- Sarcasm vs Sentiment
- Unusual linguistic features such as caps, italics, or elongated words. e.g., “Yeahhh, I’m sure THAT is the right answer”.
- Topics that people tend to react to sarcastically
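As a starting point for the "unusual linguistic features" idea, the sketch below counts all-caps words and elongated words in a comment. The regular expressions and the function name `surface_features` are illustrative assumptions, not part of the dataset or the original paper, and italics are not handled here.

```python
import re

# A letter repeated three or more times in a row, e.g. the "hhh" in "Yeahhh".
ELONGATED = re.compile(r"([A-Za-z])\1{2,}")
# Whole words of two or more capital letters, e.g. "THAT" (a lone "I" is ignored).
ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")

def surface_features(text):
    """Return simple surface-level counts for one comment (hypothetical helper)."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        "n_all_caps": len(ALL_CAPS.findall(text)),
        "n_elongated": sum(1 for t in tokens if ELONGATED.search(t)),
    }

features = surface_features("Yeahhh, I'm sure THAT is the right answer")
```

Counts like these can be appended to a bag-of-words or embedding representation; acronyms and strings such as "www" will also trigger these patterns, so treat them as noisy signals.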