Sarcasm on Reddit

user1

3 years ago

1.3 million labelled comments from Reddit

Tags: arts and entertainment, internet, online communities, social science

Context

This dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. The dataset was generated by scraping comments from Reddit (not by me :)) containing the /s (sarcasm) tag. Redditors often use this tag to indicate that a comment is in jest and not meant to be taken seriously, and it is generally a reliable indicator of sarcastic content.
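As a rough illustration of how a trailing sarcasm tag can act as a self-annotated label, here is a minimal sketch. It is not the corpus authors' pipeline: the helper name `split_sarcasm_tag` is hypothetical, and it assumes the tag is written as a standalone `/s` token at the very end of a comment.

```python
import re
from typing import Tuple

# Assumption: the tag is a standalone "/s" at the end of the comment,
# preceded by whitespace or the start of the string.
_SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$", re.IGNORECASE)


def split_sarcasm_tag(comment: str) -> Tuple[str, int]:
    """Return (comment text without the trailing tag, label).

    The label is 1 when a trailing tag was found and 0 otherwise.
    """
    match = _SARCASM_TAG.search(comment)
    if match is None:
        return comment.strip(), 0
    return comment[:match.start()].strip(), 1
```

Requiring whitespace before the tag keeps strings such as a URL ending in "/s" from being labelled sarcastic.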

Content

The data comes in balanced and imbalanced (i.e., true distribution) versions; the true ratio of sarcastic to non-sarcastic comments is about 1:100. The corpus has 1.3 million sarcastic statements, along with the comments they responded to, as well as many non-sarcastic comments from the same source.

Labelled comments are in the train-balanced-sarcasm.csv file.
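A quick sanity check after downloading is to count how many rows carry each label. The sketch below uses only the Python standard library. The column name `label` is an assumption about the CSV header, not something stated here, so adjust it to match the actual file.

```python
import csv
import io
from collections import Counter


def count_labels(handle, label_column="label"):
    """Count rows per label value in an open CSV file.

    Assumption: the file has a header row that includes `label_column`.
    Values are returned as strings because csv does not parse types.
    """
    reader = csv.DictReader(handle)
    return Counter(row[label_column] for row in reader)


# Tiny in-memory stand-in for the real file, so the sketch runs anywhere.
sample = io.StringIO("label,comment\n1,Oh great\n0,Thanks\n1,Sure\n")
sample_counts = count_labels(sample)

# With the real file (path assumed to be in the working directory):
# with open("train-balanced-sarcasm.csv", newline="", encoding="utf-8") as f:
#     print(count_labels(f))
```

On the balanced file the two counts should come out roughly equal.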

Acknowledgements

The data was gathered by Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli for their article “A Large Self-Annotated Corpus for Sarcasm”. The data is hosted here.

Citation:

@unpublished{SARC,
  author={Mikhail Khodak and Nikunj Saunshi and Kiran Vodrahalli},
  title={A Large Self-Annotated Corpus for Sarcasm},
  url={https://arxiv.org/abs/1704.05579},
  year=2017
}

Annotation of files in the original dataset: readme.txt.

Inspiration

Predicting sarcasm and relevant NLP features (e.g. subjective determinant, racism, conditionals, sentiment heavy words, “Internet Slang” and specific phrases).
Sarcasm vs Sentiment
Unusual linguistic features such as caps, italics, or elongated words, e.g., “Yeahhh, I’m sure THAT is the right answer”.
Topics that people tend to react to sarcastically
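As a starting point for the surface-level ideas above, the sketch below counts all-caps words and elongated words in a comment. The function name `surface_features` and the thresholds (two or more capitals, three or more repeated letters) are illustrative assumptions. Italics are left out because detecting them depends on whether the raw Markdown markup survives in the data.

```python
import re

# A word of two or more capital letters, e.g. "THAT".
_CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")
# A word containing the same letter three or more times in a row, e.g. "Yeahhh".
_ELONGATED = re.compile(r"\b[A-Za-z]*([A-Za-z])\1{2,}[A-Za-z]*\b")


def surface_features(text):
    """Return simple counts of emphasis cues in a comment."""
    return {
        "caps_words": len(_CAPS_WORD.findall(text)),
        "elongated_words": sum(1 for _ in _ELONGATED.finditer(text)),
    }
```

For the example sentence quoted above, this counts one all-caps word ("THAT") and one elongated word ("Yeahhh"). Such counts could be fed to a classifier alongside text features or compared between the two label groups.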