Ubuntu Dialogue Corpus

26 million turns from natural two-person dialogues

License: Other (specified in description)

Tags: online communities, linguistics, languages

Context:

Building dialogue systems, where a human can have a natural-feeling conversation with a virtual agent, is a difficult task in Natural Language Processing and the focus of much ongoing research. Some of the challenges include linking references to the same entity over time, tracking what’s happened in the conversation previously, and generating appropriate responses. This corpus of naturally-occurring dialogues can be helpful for building and evaluating dialogue systems.

Content:

The new Ubuntu Dialogue Corpus consists of almost one million two-person conversations extracted from the Ubuntu chat logs, in which users seek technical support for various Ubuntu-related problems. The conversations have an average of 8 turns each, with a minimum of 3 turns. All conversations are carried out in text form (not audio).

The full dataset contains 930,000 dialogues and over 100,000,000 words and is available here. The .csv files provided on this page hold a sample of that dataset. They contain more than 269 million words of text, spread out over 26 million turns.
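As a rough way to check figures like these against the .csv files, the sketch below counts dialogues, turns, and words from a file with one row per turn. It is only an illustration: the column names `dialogueID` and `text` are assumptions about the file layout, the function name `summarize_dialogues` is hypothetical, and the sample rows are made up so that the example runs without the real data.

```python
import csv
import io
from collections import Counter


def summarize_dialogues(csv_file, id_column="dialogueID", text_column="text"):
    """Summarize a CSV that has one row per turn.

    The default column names are assumptions; adjust them to match
    the header of the actual .csv files.
    """
    turns_per_dialogue = Counter()
    total_words = 0
    for row in csv.DictReader(csv_file):
        turns_per_dialogue[row[id_column]] += 1
        # Whitespace tokenization is a crude word count, used here only
        # for a ballpark figure.
        total_words += len((row[text_column] or "").split())

    n_dialogues = len(turns_per_dialogue)
    total_turns = sum(turns_per_dialogue.values())
    return {
        "dialogues": n_dialogues,
        "turns": total_turns,
        "words": total_words,
        "mean_turns": total_turns / n_dialogues if n_dialogues else 0.0,
        "min_turns": min(turns_per_dialogue.values(), default=0),
    }


# Made-up sample rows (not taken from the corpus), one row per turn.
SAMPLE = """dialogueID,text
1.tsv,how do I install a package
1.tsv,use apt-get install
1.tsv,thanks that worked
2.tsv,my wifi is not detected
2.tsv,which card do you have
2.tsv,a broadcom one
2.tsv,install the bcmwl driver
"""

stats = summarize_dialogues(io.StringIO(SAMPLE))
print(stats)
```

To run this on a real file, pass an open file handle instead of the in-memory sample, for example one opened with `open(path, newline="", encoding="utf-8")`. Reading row by row keeps memory use low, which matters for files with millions of turns.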

Acknowledgements:

This dataset was collected by Ryan Lowe, Nissan Pow, Iulian V. Serban and Joelle Pineau. It is made available here under the Apache License, 2.0. If you use this data in your work, please include the following citation:

Ryan Lowe, Nissan Pow, Iulian V. Serban and Joelle Pineau, “The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems”, SIGDial 2015. URL: http://www.sigdial.org/workshops/conference16/proceedings/pdf/SIGDIAL40.pdf

Inspiration:

Can you use these chat logs to build a chatbot that offers help with Ubuntu?
