The model trained on a dataset of 5,714 samples across 36 classes did not perform well. I tried approaches based on unlabeled data as well as fully labeled data; however, the accuracy still remained low. The graphs are attached below:
Multimodal abstractive summarization (MAS) aims to take advantage of data from multiple modalities and provide a short, concise, and readable textual summary that lets users quickly acquire the essential information.
Opinion summarization is the task of automatically generating summaries from multiple documents containing users’ thoughts on businesses or products. Summarizing these opinions can provide information that helps other users make purchasing decisions.
When dealing with multi-modal information retrieval tasks, the extent to which a particular modality contributes to the final output may differ from that of other modalities. Among the modalities, there is often a preferred mode of representation, determined by its significance and its ability to fulfill the task. We denote these preferred modalities as key or central modalities (referred to as central modalities from here onwards). The other modalities assist the central modalities in fulfilling the desired task and are known as adjacent modalities. The adjacent modalities can enhance the user experience either by supplementing or by complementing the information represented via the central modality. When the adjacent modalities reinforce the facts and ideas presented in the central modality, the enhancement is known as supplementary enhancement. On the other hand, when the adjacent modalities complete the central modality by providing additional or alternative information that is relevant but not covered by the central modality, the enhancement is known as complementary enhancement.
Li et al. proposed a hierarchical attention model for the multimodal sentence summarization task, but the image is not involved in the process of text encoding. It would clearly be easier for the decoder to generate an accurate summary if the encoder could filter out trivial information when encoding the input sentence. Based on this idea, the paper proposes a multimodal selective mechanism that selects the highlights from the input text using visual signals, after which the decoder generates the summary from the filtered encoding. Concretely, an encoder reads the input text and produces hidden representations. Then, multimodal selective gates measure the relevance between the input words and the image to construct a selected hidden representation. Finally, a decoder generates the summary using the selected hidden representation.
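Below is a minimal PyTorch sketch of what such a selective gate could look like; the module name, dimensions, and exact gating form are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalSelectiveGate(nn.Module):
    """Illustrative sketch: for each source word, a sigmoid gate conditioned
    on a global image feature decides how much of that word's hidden state
    is passed on to the decoder."""

    def __init__(self, hidden_dim: int, image_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + image_dim, hidden_dim)

    def forward(self, text_hidden: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, hidden_dim) from the text encoder
        # image_feat:  (batch, image_dim) global visual feature
        seq_len = text_hidden.size(1)
        image_expanded = image_feat.unsqueeze(1).expand(-1, seq_len, -1)
        # Gate in [0, 1] measures the relevance of each word to the image
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, image_expanded], dim=-1)))
        # Selected hidden representation fed to the decoder
        return g * text_hidden
```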
Multimodal summarization for open-domain videos is an emerging task that aims to generate a summary from multi-source information (video, audio, transcript). Despite the success of recent multi-encoder-decoder frameworks on this task, existing methods lack fine-grained multimodal interactions among the multi-source inputs. Moreover, unlike other multimodal tasks, this task has longer multimodal sequences with more redundancy and noise. To address these two issues, the paper proposes a multistage fusion network with a fusion forget gate module, which models fine-grained interactions between the multi-source modalities through a multistep fusion schema and controls the flow of redundant information between long multimodal sequences via a forgetting module.
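A hedged sketch of such a forgetting module is given below; the class name, dimensions, and gating equations are assumptions made for illustration, not the paper's multistage fusion network itself.

```python
import torch
import torch.nn as nn

class FusionForgetGate(nn.Module):
    """Illustrative sketch of a fusion forget gate: a sigmoid gate computed
    from both streams damps redundant or noisy parts of the cross-modal
    fusion before it is passed to the next fusion stage."""

    def __init__(self, text_dim: int, video_dim: int, fused_dim: int):
        super().__init__()
        self.fuse = nn.Linear(text_dim + video_dim, fused_dim)
        self.forget = nn.Linear(text_dim + video_dim, fused_dim)

    def forward(self, text_state: torch.Tensor, video_state: torch.Tensor) -> torch.Tensor:
        # text_state:  (batch, steps, text_dim)  aligned text-side states
        # video_state: (batch, steps, video_dim) aligned video-side states
        joint = torch.cat([text_state, video_state], dim=-1)
        candidate = torch.tanh(self.fuse(joint))   # candidate fused representation
        f = torch.sigmoid(self.forget(joint))      # forget gate in [0, 1]
        return f * candidate                       # redundant content is damped
```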
Multimodal text summarization is the task of condensing information from the interacting modalities into an output summary. The generated summary may be unimodal or multimodal.
In real-world applications, the input is usually a video consisting of hundreds of frames. Consequently, the temporal dependency in a video cannot simply be modeled by static encoding methods. Hence, this work proposes Video-based Multimodal Summarization with Multimodal Output (VMSMO), which selects a cover frame from the news video and generates a textual summary of the news article at the same time.
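As an illustration only (not the VMSMO architecture itself), the sketch below encodes frame features temporally and scores them against a pooled article representation to pick a cover frame; all names and dimensions are assumed.

```python
import torch
import torch.nn as nn

class CoverFrameSelector(nn.Module):
    """Illustrative sketch: frames are encoded with a recurrent layer to
    capture temporal dependency, then each frame is scored against the
    article representation to select a cover frame."""

    def __init__(self, frame_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.temporal = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, frames: torch.Tensor, article_repr: torch.Tensor) -> torch.Tensor:
        # frames:       (batch, num_frames, frame_dim) per-frame features
        # article_repr: (batch, text_dim) pooled article representation
        frame_states, _ = self.temporal(frames)  # temporal encoding of the frame sequence
        scores = torch.bmm(frame_states,
                           self.text_proj(article_repr).unsqueeze(-1)).squeeze(-1)
        return scores.argmax(dim=-1)             # index of the selected cover frame
```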
Existing models suffer from the following drawbacks:
Most existing approaches extract visual information from the accompanying images but ignore related videos. The paper contends that videos contain abundant content and have temporal characteristics, with events represented chronologically, which are crucial for text summarization.
Although attention mechanisms and early fusion are used extensively, they introduce noise because they are ill-suited to unaligned multi-modal data, which is characterized by a large modality gap that requires intensive cross-modal communication.
Various multi-modal summarization works have focused on a single task, such as text or video summarization with added information from other modalities. The paper observes that both summarization tasks share the same goal of refining long source material, and therefore they can be performed jointly owing to their common characteristics.
There are three differences between query-focused video summarization and generic video summarization:
Firstly, the video summary needs to take the subjectivity of users into account, as different user queries may receive different video summaries.
Secondly, trained video summarizers cannot meet all users’ preferences, and performance evaluation often measures temporal overlap, which makes it hard to capture the semantic similarity between summaries and the original videos.
Thirdly, the textual query will bring additional semantic information to the task.
Commercial product advertisements, a critical component of marketing management on e-commerce platforms, aim to attract consumers’ interest and arouse their desire to purchase the products. However, most product advertisements are so miscellaneous and tedious that consumers cannot be expected to read through them patiently.
This paper introduces ROCMMS, a system that automatically converts existing text into multimodal summaries (MMS) that capture the meaning of a complex sentence in a diagram containing pictures and simplified text, related by structure extracted from the original sentence.
The model in this paper is trained on a wide variety of images with a wide variety of natural language supervision that is abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks without directly optimizing for each benchmark’s performance.
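As a rough illustration of this zero-shot usage, the sketch below runs a public CLIP checkpoint through the Hugging Face transformers API; the image path and the label prompts are placeholders, not examples from the paper.

```python
# Minimal zero-shot classification sketch with a public CLIP checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # placeholder prompts

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Image-text similarity scores; softmax gives per-label probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```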
So far, I have gone through a large number of papers and have been working with autoencoders to transform canny-edge images from one viewpoint to another. This approach, however, does not work on doodle-style, highly abstract sketches, which are more commonly drawn by humans. We tried CycleGAN to bring the required abstraction to the canny-edge images, but that approach did not give any fruitful results.
With the emergence of a variety of social media platforms and the freedom to express one’s thoughts, there is, sadly, a lot of hateful content on social media. Some platforms, like Twitter, filter out posts that involve abusive and highly provocative language. However, Gab is a platform where freedom of speech is retained, so hateful content can easily be found there. It therefore becomes important to analyze the data, posts, and comments. Hate speech detection thus plays an important role in identifying any kind of trend, troll, or threat.