Ubuntu Dialogue Corpus

user1

3 years ago

26 million turns from natural two-person dialogues

Tags: online communities, linguistics, languages

Context:

Building dialogue systems, where a human can have a natural-feeling conversation with a virtual agent, is a difficult task in Natural Language Processing and the focus of much ongoing research. Some of the challenges include linking references to the same entity over time, tracking what’s happened in the conversation previously, and generating appropriate responses. This corpus of naturally-occurring dialogues can be helpful for building and evaluating dialogue systems.

Content:

The new Ubuntu Dialogue Corpus consists of almost one million two-person conversations extracted from the Ubuntu chat logs, where users go to receive technical support for various Ubuntu-related problems. The conversations have an average of 8 turns each, with a minimum of 3 turns. All conversations are carried out in text form (not audio).

The full dataset contains 930,000 dialogues and over 100,000,000 words and is available here. This page provides a sample of that dataset, spread across .csv files. The sample contains more than 269 million words of text, spread out over 26 million turns.

Each .csv file has the following fields:

folder: The folder that a dialogue comes from. Each file contains dialogues from one folder.
dialogueID: An ID number for a specific dialogue. Dialogue IDs are reused across folders.
date: A timestamp of the time this line of dialogue was sent.
from: The user who sent that line of dialogue.
to: The user to whom they were replying. On the first turn of a dialogue, this field is blank.
text: The text of that turn of dialogue, enclosed in double quotes ("). Line breaks (\n) have been removed.
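
Because dialogue IDs are reused across folders, a dialogue is only uniquely identified by the pair (folder, dialogueID). The following is a minimal sketch of reading one file and grouping its rows into dialogues on that pair. The sample rows, user names, and the ISO-style timestamp format are made up for illustration and are not taken from the dataset; the function name group_dialogues is likewise hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the field layout described above (not real corpus data).
# The timestamp format is an assumption; check the actual files.
SAMPLE_CSV = """folder,dialogueID,date,from,to,text
1,7,2010-01-01T10:00:00Z,user_a,,"my wifi stopped working, any ideas?"
1,7,2010-01-01T10:01:00Z,user_b,user_a,"which wireless card do you have?"
1,7,2010-01-01T10:02:00Z,user_a,user_b,"an intel one"
2,7,2011-05-05T09:00:00Z,user_c,,"how do I install vim?"
"""


def group_dialogues(csv_file):
    """Group rows into dialogues keyed by (folder, dialogueID)."""
    dialogues = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # dialogueID alone is ambiguous, so the folder is part of the key.
        dialogues[(row["folder"], row["dialogueID"])].append(row)
    # Order turns by timestamp; this relies on the assumed ISO-style
    # format, which sorts correctly as plain text.
    for turns in dialogues.values():
        turns.sort(key=lambda r: r["date"])
    return dict(dialogues)


dialogues = group_dialogues(io.StringIO(SAMPLE_CSV))
for (folder, dialogue_id), turns in dialogues.items():
    print(folder, dialogue_id, len(turns))
```

For a real file, pass an open file handle (for example open(path, newline="", encoding="utf-8")) instead of the in-memory sample.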

Acknowledgements:

This dataset was collected by Ryan Lowe, Nissan Pow, Iulian V. Serban and Joelle Pineau. It is made available here under the Apache License, 2.0. If you use this data in your work, please include the following citation:

Ryan Lowe, Nissan Pow, Iulian V. Serban and Joelle Pineau, “The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems”, SIGDial 2015. URL: http://www.sigdial.org/workshops/conference16/proceedings/pdf/SIGDIAL40.pdf

Inspiration:

Can you use these chat logs to build a chatbot that offers help with Ubuntu?
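
One possible starting point, sketched here under assumptions rather than prescribed by the dataset authors, is to turn each dialogue into (context, response) pairs: every turn after the first becomes a response, and the turns just before it become its context. The function name context_response_pairs, the max_context parameter, and the example turns are all hypothetical.

```python
def context_response_pairs(turns, max_context=3):
    """Build (context, response) pairs from an ordered list of turn texts.

    Each turn after the first is treated as a response; its context is
    the preceding turns, limited to the last `max_context` of them.
    """
    pairs = []
    for i in range(1, len(turns)):
        context = turns[max(0, i - max_context):i]
        pairs.append((context, turns[i]))
    return pairs


# Made-up turns for illustration (not real corpus data).
example_turns = [
    "my wifi stopped working, any ideas?",
    "which wireless card do you have?",
    "an intel one",
    "try reloading the driver",
]

for context, response in context_response_pairs(example_turns, max_context=2):
    print(context, "->", response)
```

Pairs like these could feed either a retrieval model that ranks candidate responses or a generative model; which works better for Ubuntu support questions is left open here.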