Sarcasm on Reddit

1.3 million labelled comments from Reddit

License: Data files © Original Authors

Tags: arts and entertainment, internet, online communities, social science

Context

This dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. The dataset was generated by scraping comments from Reddit (not by me :)) containing the /s (sarcasm) tag. Redditors often use this tag to indicate that a comment is in jest and not meant to be taken seriously, and it is generally a reliable indicator of sarcastic content.
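As a rough illustration of tag-based self-annotation, the sketch below checks whether a comment ends with the sarcasm tag (conventionally written /s on Reddit) and strips it. The exact matching rule the corpus authors used is not described here, so the pattern and the function name split_sarcasm_tag are assumptions made for illustration only.

```python
import re
from typing import Tuple

# Assumption: the marker is a standalone "/s" at the very end of the comment,
# preceded by whitespace or the start of the string. The corpus authors'
# actual filtering rule may differ.
SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$")


def split_sarcasm_tag(comment: str) -> Tuple[str, bool]:
    """Return (comment text without a trailing tag, whether the tag was present)."""
    match = SARCASM_TAG.search(comment)
    if match is None:
        return comment, False
    return comment[:match.start()].rstrip(), True
```

Requiring whitespace before the tag keeps URLs that happen to end in "/s" from being counted as sarcastic.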

Content

The data comes in balanced and imbalanced (i.e. true-distribution) versions; the true ratio of sarcastic to non-sarcastic comments is about 1:100. The corpus has 1.3 million sarcastic statements, along with the comments they responded to and many non-sarcastic comments from the same source.
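To make the balanced/imbalanced distinction concrete, here is a minimal sketch of random undersampling, one common way to turn a roughly 1:100 sample into a 1:1 sample. How the balanced version of this corpus was actually built is not stated here; the function name downsample_to_balanced, the label_of accessor, and the 1/0 label encoding are assumptions.

```python
import random


def downsample_to_balanced(rows, label_of, seed=0):
    """Randomly undersample the larger class so both classes have equal size.

    rows: a list of records; label_of: a function returning 1 (sarcastic)
    or 0 (non-sarcastic) for a record. Both conventions are assumed here.
    """
    rng = random.Random(seed)
    sarcastic = [r for r in rows if label_of(r) == 1]
    other = [r for r in rows if label_of(r) == 0]
    n = min(len(sarcastic), len(other))
    balanced = rng.sample(sarcastic, n) + rng.sample(other, n)
    rng.shuffle(balanced)  # avoid all sarcastic rows coming first
    return balanced
```

A classifier evaluated on the balanced version sees sarcasm half of the time, so accuracy there is not comparable to accuracy on the true distribution.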

Labelled comments are in the train-balanced-sarcasm.csv file.
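A quick sanity check on the file is to count how many rows carry each label. The sketch below uses only the standard library; the column name "label" and the helper name count_labels are assumptions, so check the CSV header before relying on them.

```python
import csv
from collections import Counter


def count_labels(csv_file, label_column="label"):
    """Count rows per label value in an open CSV file with a header row.

    label_column is an assumed column name; adjust it to match the file.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Example usage (assumes the file is in the working directory):
# with open("train-balanced-sarcasm.csv", newline="", encoding="utf-8") as f:
#     print(count_labels(f))
```

For the balanced file, the two counts should come out roughly equal.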

Acknowledgements

The data was gathered by Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli for their article “A Large Self-Annotated Corpus for Sarcasm”. The data is hosted here.

Citation:

@unpublished{SARC,
  author={Mikhail Khodak and Nikunj Saunshi and Kiran Vodrahalli},
  title={A Large Self-Annotated Corpus for Sarcasm},
  url={https://arxiv.org/abs/1704.05579},
  year=2017
}

Annotation of files in the original dataset: readme.txt.

Inspiration
