McGill & UdeM
A Survey of Available Corpora for Building Data-Driven Dialogue Systems

A basic outline of a dialogue system.
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss the appropriate choice of evaluation metrics for the learning objective.
The authors gratefully acknowledge financial support by the Samsung Advanced Institute of Technology (SAIT), the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs, the Canadian Institute for Advanced Research (CIFAR) and Compute Canada. Early versions of the manuscript benefited greatly from the proofreading of Melanie Lyman-Abramovitch, and later versions were extensively revised by Genevieve Fried and Nicolas Angelard-Gontier. The authors also thank Nissan Pow, Michael Noseworthy, Chia-Wei Liu, Gabriel Forgues, Alessandro Sordoni, Yoshua Bengio and Aaron Courville for helpful discussions.
1. Although the model does not require intermediate labels, it consists of sub-components whose parameters are trained with different objective functions. Therefore, strictly speaking, this is not an end-to-end model.
2. Machine-machine dialogue corpora are not of interest to us, because they typically differ significantly from natural human language. Furthermore, user simulation models are outside the scope of this survey.
3. For more information on dialogue hot spots and how they relate to dialogue acts, see [Wrede and Shriberg, 2003].
15. Most of the largest technical support datasets are based on commercial technical support channels, which are proprietary and never released to the public for privacy reasons.