Exploring Image And Text Modality Datasets For Conversational Summarization

5 minute read

Published: October 25, 2021

In this blog we will be exploring and brainstorming on the image and text modality datasets for conversation summarization.

The first step is to list down the requirements and questions for which we are finding the answers :

What all conversation based Image+Text Datasets are available? How large are they?
What are their process of generation?
What scope do we have of this method in our project?
What all different things can be incorporated to make this process of dataset generation simpler?
How to get the ground truth summaries?

For Image-Text dataset without conversation, WIT comes out to be a clear winner.

For the Image-Text Conversation Modality, I explored the following datasets :

Image-Chat
PhotoChat

WIT : Wikipedia-Based Image-Text Dataset

Information : https://ai.googleblog.com/2021/09/announcing-wit-wikipedia-based-image.html
About the dataset :
- Text-Image Tuples : 37.6M
- Unique Images : 11.5M
- Ref.Text : ~17M
Process of data collection :
- Crawl Data
  - Started with content pages
  - Ignored other pages that have discussion, comments, etc:
    - This however is really important for us.
  - Extracted images and texts related to the images.
  - This yields ~150M tuples.
- Texts were filtered using description sections in wikipedia.
- Text based filtering
  - Retained texts of at least length 3
  - Removed generic phrases
  - Alt-text filtering
- Image & Image-Text based filtering
  - Filtering on image height and width
  - Filtering generic or irrelevant images
  - Retained only licensed images
  - Removed generic image-text pairs
- Additional Filtering
  - Removed objectionable images or texts.
  - Kept top 100 languages
Learnings for our own dataset :
- Process of crawling the dataset and applying the steps to filter.
- WIT dataset ignores the discussions and comments, however, we are interested in them as they can be a conversation it itself.
  - This gives the cue to think around twitter or any publicly available platform which contains images, posts, replies and comments.
- The paper has not outlined the filtering details as in how to extract the text which is relevant to the image or how to determine if some image should be removed.

Image-Chat : Engaging Grounded Conversations

Paper : https://aclanthology.org/2020.acl-main.219.pdf
About the dataset :
- Images : ~200000
- Dialogues : ~200000
- Utterances : ~400000
Process of data collection :
- Collected a large set of human-human crowd-worker conversations.
- Each dialogue consists of consecutive turns by speaker A and B. No particular constraints are placed on the kinds of utterance, only that they are asked to both use the provided style trait, and to respond to the given image and dialogue history in an engaging way.
- Style traits :
  - 215 style traits.
  - Traits are classified as :
    - Positive
    - Netural
    - Negative
- Images : YFCC100M Dataset
  - Contains 99.2 Million photos
  - Contains 0.8 million videos
- Dialogue : Generated using crowdworkers.
Learnings for our own dataset :
- Detailed process of annotation.
- Using attention for combination of different modalities as well.
- A very useful and challenging dataset for summarization as it gives
  - images
  - dialogue
  - style

Paper : https://aclanthology.org/2021.acl-long.479.pdf
About the paper :
- Photo based chat
- Propose tasks for
  - Photo sharing intent
  - Dialogue-based image retrieval
About the dataset :
- Images : 10917
- Dialogues : 12286
Process of data collection :
- Collected photos from Open Image Dataset V4 (OID)
- Open-ended conversations collected from Mech. turk
- Filtered image based on :
  - People
  - Food
  - Animal
  - Product (shopping scenario)
- Conversation Generation
  - Crowdworkers
Learnings for our own dataset :
- Provides a good dataset for multiple images and dialogue.
- However, the dataset is small, can prove to be good for testing.

Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation

Paper : https://aclanthology.org/I17-1047.pdf
About the paper :
- Images based conversations
- For Question Answering
- Proposal of retrieval approaches
About the ICG dataset :
- Images : 4222
- Utterances : 25332
Process of data collection :
- QA dialogues from VQG dataset.
- Photo gallery selection from crowd-sourcing platform.
- IGC-Twitter dataset consists of
  - 250K quadruples of visual context, textual context, question and response.
  - This dataset was further refined.
Learnings for our own dataset :
- ICG-Twitter dataset can be used for training
- Also, the paper enlists the process of fetching conversation based data from twitter.
- However, the ICG dataset is small, can prove to be good for testing.

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems

Paper : https://arxiv.org/pdf/1704.00200.pdf
About the dataset :
- Dialogue : 150000
Process of data collection :
- Domain knowledge curation
  - Crawling
  - Parsing
  - Taxonomical filtering
  - Relevent fashion attributes identification
  - Creating fashion profile
- Gathering multimodal dialogs :
  - Crowdsourcing
Learnings for our own dataset :
- A useful dataset with multiple images
- Can be used for summarization
- Lays out the details of data collection

Situated Interactive MultiModal Conversations (SIMMC)

This dataset is also along similar lines.

Other important papers or dataset :

RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
- Paper : https://arxiv.org/pdf/1809.00812.pdf
Ubuntu dialogue corpus
- http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/
Conversational AI datasets
- https://www.topbots.com/conversational-ai-datasets/
The Dialogue Dodecathlon
- Paper : https://arxiv.org/pdf/1911.03768.pdf

What all different things can be incorporated to make this process of dataset generation simpler?

As pointed out by some papers already creating our own dataset can be really laborious and demanding.
In addition to these I believe twitter conversations are a very good place to start at.
Filtering with images based posts can be really helpful.

How to get the ground truth summaries?

Getting ground truth summaries still does not have any shortcut method.
Unsupervised learning to the rescue!

Share on

Twitter Facebook Google+ LinkedIn

Ashwin Pathak

Exploring Image And Text Modality Datasets For Conversational Summarization

WIT : Wikipedia-Based Image-Text Dataset

Image-Chat : Engaging Grounded Conversations

Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems

Situated Interactive MultiModal Conversations (SIMMC)

Other important papers or dataset :

What all different things can be incorporated to make this process of dataset generation simpler?

How to get the ground truth summaries?

Share on

You May Also Enjoy

GSOC 2017 - Week 4 of GSoC 17

GSOC 2017 - Week 3 of GSoC 17

GSOC 2017 - Week 2 of GSoC 17

GSOC 2017 - Week 1 of GSoC 17

Ashwin Pathak

WIT : Wikipedia-Based Image-Text Dataset

Image-Chat : Engaging Grounded Conversations

PhotoChat : A Human-Human Dialogue Dataset with Photo Sharing Behavior for Joint Image-Text Modeling

Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems

Situated Interactive MultiModal Conversations (SIMMC)

Other important papers or dataset :

What all different things can be incorporated to make this process of dataset generation simpler?

How to get the ground truth summaries?

Share on

You May Also Enjoy

GSOC 2017 - Week 4 of GSoC 17

GSOC 2017 - Week 3 of GSoC 17

GSOC 2017 - Week 2 of GSoC 17

GSOC 2017 - Week 1 of GSoC 17