Exploring Datasets And Approaches For Conversational Summarization


In this blog post, we explore datasets suited to conversational summarization, along with modelling approaches for summarizing conversational data.

1. AMI Meeting Dataset

The AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers.

The AMI Meeting Corpus includes high-quality, manually produced orthographic transcription for each individual speaker, including word-level timings that have been derived by running a speech recognizer in forced-alignment mode. It also contains a wide range of other annotations, not just for linguistic phenomena but also detailing behaviours in other modalities. These include dialogue acts; topic segmentation; extractive and abstractive summaries; named entities; the types of head gesture, hand gesture, and gaze direction that are most related to communicative intention; movement around the room; emotional state; and where heads are located on the video frames.
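
The AMI annotations are distributed in NXT XML format. Below is a minimal sketch of pulling word-level timings from one speaker's words file; the path and the starttime/endtime attribute names follow the manual-annotation release, but should be verified against your download:

```python
import xml.etree.ElementTree as ET

# Path to one speaker's word-level annotation file from the AMI NXT
# release (hypothetical example; check the layout of your download).
WORDS_FILE = "words/ES2002a.A.words.xml"

def read_words(path):
    """Yield (start, end, token) from an AMI NXT words file.

    Assumes <w> elements carry 'starttime'/'endtime' attributes, as in
    the AMI manual-annotation release.
    """
    for elem in ET.parse(path).getroot().iter():
        # Match <w> elements regardless of any XML namespace prefix.
        if elem.tag.split("}")[-1] == "w" and elem.text:
            yield (float(elem.get("starttime", "nan")),
                   float(elem.get("endtime", "nan")),
                   elem.text)

for start, end, token in read_words(WORDS_FILE):
    print(f"{start:8.2f} {end:8.2f}  {token}")
```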

2. Multimodal Dialogs (MMD) Dataset

  • Each turn in this dataset contains the following fields (see the loading sketch after this list):
    • speaker
    • utterance
    • question-type
  • Dataset statistics
    • Train: 105,439 chat sessions
    • Valid: 22,595 chat sessions
    • Test: 22,595 chat sessions
  • The dataset also contains the states of the automata used to generate the dialogues.
  • Each conversation is between a customer and an agent.
    • Examples: https://amritasaha1812.github.io/MMD/_pages/example.html
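
A minimal sketch of reading one chat session is below. The directory layout and key names are assumptions made for illustration, not the documented schema, and should be checked against the actual release:

```python
import json
from pathlib import Path

# Hypothetical location of one extracted chat session; the directory
# layout and key names below are assumptions, not the documented schema.
session = json.loads(Path("mmd/train/session_00001.json").read_text())

# Assumed structure: a list of turns carrying the fields listed above.
for turn in session:
    speaker = turn.get("speaker")        # e.g. user or system
    utterance = turn.get("utterance")    # text (or multimodal payload)
    qtype = turn.get("question-type")    # question-type annotation
    print(f"[{speaker}] ({qtype}) {utterance}")
```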

3. Multimodal EmotionLines Dataset

  • The Multimodal EmotionLines Dataset (MELD) was created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses the audio and visual modalities along with text.
  • MELD has more than 1,400 dialogues and 13,000 utterances from the Friends TV series.
  • Each utterance in a dialogue has been labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, or Fear. MELD also has sentiment annotation (positive, negative, or neutral) for each utterance (a loading sketch follows the statistics below).
  • No. of dialogues:
    • Train: 1,039
    • Dev: 114
    • Test: 100
  • No. of utterances:
    • Train: 9,989
    • Dev: 1,109
    • Test: 2,610
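
The text side of MELD ships as CSV files; assuming the train_sent_emo.csv file and the Utterance/Speaker/Emotion/Sentiment/Dialogue_ID columns of the official release, the label balance can be inspected with pandas:

```python
import pandas as pd

# train_sent_emo.csv and these column names come from the official MELD
# release; verify them against your download.
df = pd.read_csv("MELD/train_sent_emo.csv")

print(df["Emotion"].value_counts())               # balance of the 7 emotion labels
print(df["Sentiment"].value_counts())             # positive / negative / neutral
print(df["Dialogue_ID"].nunique(), "dialogues")   # ~1,039 expected for train
```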

4. Situated Interactive MultiModal Conversations (SIMMC)

The SIMMC paper provides two datasets totalling ~13K human-human dialogs (~169K utterances), collected using a multimodal Wizard-of-Oz (WoZ) setup over two shopping domains:

  • Furniture
  • Fashion

5. SAMSum Corpus

  • The SAMSum dataset contains about 16k messenger-like conversations, each paired with a summary.
  • The style and register are diverse: conversations may be informal, semi-formal, or formal, and may contain slang, emoticons, and typos (a loading sketch follows the splits below).
  • Data fields
    • dialogue
    • summary
    • id
  • Data splits
    • Train: 14,732
    • Val: 818
    • Test: 819
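
Assuming a copy of the corpus is hosted on the Hugging Face Hub under the samsum id (the canonical id at the time of writing), it can be loaded in a few lines:

```python
from datasets import load_dataset

# "samsum" is the dataset id on the Hugging Face Hub; adjust if you use
# a mirror. Splits and field names below match the corpus description.
ds = load_dataset("samsum")

print(ds)                       # DatasetDict: train / validation / test
example = ds["train"][0]
print("id:      ", example["id"])
print("dialogue:", example["dialogue"])
print("summary: ", example["summary"])
```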

6. Enron Corpus

The Enron Corpus is a large collection of several hundred thousand real emails from about 150 senior Enron employees, released publicly during the federal investigation of the company. Its email threads make it a common resource for studying summarization of asynchronous written conversations.

7. How2 Dataset

How2 is a multimodal dataset of instructional videos paired with spoken utterances, English subtitles and their crowdsourced Portuguese translations, as well as English video summaries.

The How2 dataset consists of 79,114 instructional videos (2,000 hours in total, with an average length of 90 seconds) with English subtitles.

To re-create the corpus from metadata and scripts, use the repository provided by the authors (a file-pairing sketch follows the topic distribution below).

Method of creating the dataset:

  • The videos were downloaded from YouTube along with various types of metadata, including ground-truth subtitles and summaries written in English by the video creators.
  • Videos were identified on the YouTube platform using a keyword-based spider.
  • To produce a multilingual and multimodal dataset, the English subtitles were first re-segmented into full sentences, which were then aligned to the speech at the word level.
  • The Figure Eight crowdsourcing platform was used to collect the Portuguese translations.

Topic distribution:

  • The dataset spans 22 topics.
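
Once the corpus is re-created, transcripts and summaries can be joined on the shared video id. A minimal sketch, with hypothetical file names standing in for whatever the recreation scripts actually produce:

```python
# Hypothetical tab-separated files keyed by video id; treat these names
# as placeholders for whatever the recreation scripts produce.
def load_id_text(path):
    pairs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            vid, text = line.rstrip("\n").split("\t", 1)
            pairs[vid] = text
    return pairs

subtitles = load_id_text("how2/train.subtitles.tsv")   # placeholder name
summaries = load_id_text("how2/train.summaries.tsv")   # placeholder name

# Join each transcript with its reference summary by video id.
for vid in list(subtitles)[:3]:
    print(vid, "->", summaries.get(vid, "<no summary>")[:80])
```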

8. Yelp Dataset

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes. It is available as JSON files, each composed of a single object type with one JSON object per line (see the streaming sketch after the file list below).

The data is available as:

  • business.json
    • Contains business data including location data, attributes, and categories.
  • review.json
    • Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
  • user.json
    • Contains user data including the user’s friend mapping and all the metadata associated with the user.
  • checkin.json
    • Contains checkins on a business.
  • tip.json
    • Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.
  • photo.json
    • Contains photo data including the caption and classification (one of “food”, “drink”, “menu”, “inside” or “outside”).
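
Because each file is newline-delimited JSON, even the multi-gigabyte review file can be streamed one object at a time instead of being loaded into memory. A minimal sketch (the stars, business_id, and text fields follow Yelp's documented review schema):

```python
import json

def iter_records(path):
    """Stream one JSON object per line without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: print a snippet of the first few 5-star reviews.
shown = 0
for review in iter_records("review.json"):
    if review["stars"] == 5:
        print(review["business_id"], review["text"][:80])
        shown += 1
    if shown == 5:
        break
```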

The dataset contains:

  • 8,635,403 reviews
  • 160,585 businesses
  • 200,000 pictures
  • 8 metropolitan areas
  • 1,162,119 tips
  • 2,189,457 users
  • Over 1.2 million business attributes such as hours, parking, availability, and ambience
  • Aggregated check-ins over time for each of the 138,876 businesses

9. Recommender Systems and Personalization Datasets

This set of datasets is provided by Julian McAuley and covers many different domains.

10. MultiWOZ Dataset

MultiWOZ (Multi-Domain Wizard-of-Oz) is a large-scale dataset of roughly 10K human-human written dialogues spanning multiple domains such as restaurant, hotel, attraction, train, and taxi, annotated with dialogue states and dialogue acts.

11. Taskmaster

Taskmaster is a Google dataset of task-based dialogs; its first release contains around 13K dialogs collected through both a two-person Wizard-of-Oz setup and written self-dialogs, covering everyday tasks such as ordering food and booking tickets.

Approaches for Conversational Summarization

The following approaches can be used:

  • Abstractive unsupervised multimodal summarization
    • Autoencoder-based techniques that reconstruct the input and score candidate summaries with similarity measures.
    • Sentence-level self-attention can also be used.
    • CycleGANs for learning summaries from reviews.
    • A replacement strategy is another option.
    • References:
      • https://arxiv.org/pdf/1810.05739.pdf
      • https://arxiv.org/abs/2009.06851
      • https://aclanthology.org/2021.acl-long.471.pdf
  • Supervised multimodal summarization
    • Using BERT as a key-sentence selector to produce an extractive summary, which is then encoded and decoded to generate an abstractive summary (a simplified selector sketch follows this list).
    • References:
      • https://www.analyticsvidhya.com/blog/2021/02/dialogue-summarization-a-deep-learning-approach/
      • https://yale-lily.github.io/public/s2020/tony.pdf
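
The references above describe a BERT-based key-sentence selector; as a simplified stand-in, the sketch below ranks utterances by cosine similarity between each utterance's sentence embedding and the dialogue's mean embedding, keeping the top-k as an extractive summary. The sentence-transformers library and the all-MiniLM-L6-v2 encoder are my choices for illustration, not something the references prescribe.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Generic pretrained encoder chosen for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

def extractive_summary(utterances, k=2):
    """Keep the k utterances closest to the dialogue's mean embedding."""
    emb = model.encode(utterances)                    # shape (n, d)
    centroid = emb.mean(axis=0)
    sims = emb @ centroid / (
        np.linalg.norm(emb, axis=1) * np.linalg.norm(centroid)
    )
    top = np.argsort(-sims)[:k]
    return [utterances[i] for i in sorted(top)]       # keep dialogue order

dialogue = [
    "Amanda: I baked cookies. Do you want some?",
    "Jerry: Sure!",
    "Amanda: I'll bring you some tomorrow :-)",
]
print(extractive_summary(dialogue, k=1))
```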

A few important models relevant to this project (a BART summarization sketch follows the list):

  • BART
  • CycleGANs
  • Auto-encoders
  • Seq2Seq
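
Of these, BART is the most direct fit for abstractive summarization. A minimal sketch using the Hugging Face pipeline API is below; facebook/bart-large-cnn is a news-summarization checkpoint chosen for illustration, and in practice one would fine-tune it on a dialogue corpus such as SAMSum first.

```python
from transformers import pipeline

# facebook/bart-large-cnn is a news-summarization checkpoint; for dialogue
# data you would normally fine-tune on a corpus like SAMSum first.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

dialogue = (
    "Amanda: I baked cookies. Do you want some? "
    "Jerry: Sure! "
    "Amanda: I'll bring you some tomorrow."
)
print(summarizer(dialogue, max_length=30, min_length=5, do_sample=False))
```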