With the emergence of a variety of social media platforms and the freedom to express one's thoughts, there is, sadly, a lot of hateful content on social media. Some platforms, like Twitter, filter out posts that involve abusive and highly provocative language. Gab, however, is a platform where freedom of speech is largely unrestricted, so hateful content can easily be found there. It therefore becomes important to analyze its data, posts, and comments. Hate speech detection thus plays an important role in identifying trends, trolls, threats, and the like.
To concretely define and settle on an approach for classification, it is necessary to identify the best architecture and techniques in order to beat the state of the art. I therefore explored the recent literature on the newest approaches and read the following papers:
This blog is dedicated to the fourth week of Google Summer of Code (i.e. June 24 - July 1). This week was focused on cross-testing and analysis of the API with some challenging tests.
This blog is dedicated to the third week of Google Summer of Code (i.e. June 16 - June 23). But first, a brief overview is given of how the code works.
This blog is dedicated to the second week of Google Summer of Code (i.e. June 8 - June 15). The target of the second week, according to my timeline, was to implement the Jacobian and gradient using numdifftools.
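As a rough illustration of what that looks like in practice, here is a minimal sketch of computing a gradient and a Jacobian with numdifftools; the test functions are hypothetical, not the actual project code.

```python
import numpy as np
import numdifftools as nd

def rosen(x):
    # scalar-valued function -> gradient
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def residuals(x):
    # vector-valued function -> Jacobian
    return np.array([x[0]**2 + x[1], np.sin(x[0]) * x[1]])

x0 = np.array([1.0, 2.0])
grad = nd.Gradient(rosen)(x0)      # gradient at x0, shape (2,)
jac = nd.Jacobian(residuals)(x0)   # Jacobian at x0, shape (2, 2)
print(grad)
print(jac)
```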
This blog is dedicated to the first week of Google Summer of Code (i.e. June 1 - June 7). The target of the first week, according to my timeline, was to get conversant with the code structure and implement the derivative using statsmodels and partly using numdifftools.
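For context, the two back-ends mentioned above can be exercised roughly as follows; the function f is purely illustrative.

```python
import numpy as np
import numdifftools as nd
from statsmodels.tools.numdiff import approx_fprime

def f(x):
    return np.sin(x) * np.exp(-x)

# statsmodels: simple forward finite differences on an array argument
d_sm = approx_fprime(np.array([0.5]), f)

# numdifftools: adaptive finite differences on the same function
d_nd = nd.Derivative(f)(0.5)

print(d_sm, d_nd)
```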
This is the basic outline of the design of the scipy.diff module. I have thoroughly investigated how statsmodels and numdifftools work, and I have also looked at scipy.misc.derivative. In scipy.diff I propose to adapt ideas mainly from numdifftools and statsmodels, owing to their accurate results, sophisticated computation techniques, and ease of use.
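To make the proposal concrete, a possible (purely hypothetical) top-level interface could look like the sketch below, delegating to a numdifftools-style backend; none of these names are part of SciPy.

```python
import numpy as np
import numdifftools as nd

def derivative(f, x, n=1, method="central", order=2):
    """Approximate the n-th derivative of f at x (illustrative wrapper only)."""
    # Delegate to numdifftools, which chooses step sizes adaptively.
    return nd.Derivative(f, n=n, method=method, order=order)(x)

print(derivative(np.exp, 1.0))  # approximately e
```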
I have been selected for GSoC 2017 under the umbrella organisation of the Python Software Foundation, working with SciPy. The topic of my project is the implementation of a new module: scipy.diff.
Up until last week, the AR application that we developed consisted of 2D face detection running in parallel with AR object rendering. However, there was no interaction between the two separate applications because of communication issues between Unity and Android.
The app we needed to build had to be highly responsive and sensitive to even small movements of the head. The movements of the augmented object were imprecise and highly irregular, so I regularized them to make the motion perceptible and more concrete.
With deep learning proving impactful in so many applications, with high accuracy and precision, it is important to ensure its safety and security against adversarial attacks. It has been observed that deep neural networks are susceptible to adversarial attacks even in the form of small perturbations that are imperceptible to humans. My literature survey on this topic consists of the following papers and their details:
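To make the notion of a small perturbation concrete, below is a minimal sketch of the fast gradient sign method (FGSM), one of the simplest attacks discussed in this literature; `model`, `x`, and `y` are placeholders, not code from any of the surveyed papers.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    # x: input batch, y: true labels; epsilon bounds the perturbation size
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then clip to the valid pixel range
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()
```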
As discussed in the previous meeting, we shifted our attention to generative models, especially variational autoencoders. So, I thoroughly read the following papers:
Last week's task was to build a dataset for Nazi-element detection. I was provided with the positive examples and had to generate the negative ones. The initial dataset consisted of around 2800 images in total, belonging to various categories such as:
The model trained on 5714 samples across 36 classes didn't perform that well. I tried approaches based on unlabeled data as well as fully labeled data; however, the accuracy still remained low. The graphs are attached below:
Multimodal abstractive summarization (MAS) aims to take advantage of data from multiple modalities and provide a short, concise, and readable textual summary that lets users quickly acquire the essential information.
Opinion summarization is the task of automatically generating summaries from multiple documents containing users' thoughts on businesses or products. This summarization of users' opinions can provide information that helps other users with their purchase decisions.
When dealing with multi-modal information retrieval tasks, the extent to which a particular modality contributes to the final output might differ from other modalities. Amongst the modalities, there is often a preferable mode of representation based on its significance and ability to fulfill the task. We denote these preferred modalities as key or central modalities (referred to as central modalities from here onwards). The other modalities help the central modalities fulfill the desired task, and are known as adjacent modalities. The adjacent modalities can enhance the user experience either by supplementing or by complementing the information represented via the central modality. When these adjacent modalities reinforce the facts and ideas presented in the central modality, the enhancement is known as supplementary enhancement. On the other hand, when these adjacent modalities complete the central modality by providing additional or alternate information that is relevant, albeit not covered by the central modality, the enhancement is known as complementary enhancement.
Li et al. proposed a hierarchical attention model for the multimodal sentence summarization task, but the image is not involved in the process of text encoding. Obviously, it will be easier for the decoder to generate an accurate summary if the encoder can filter out trivial information when encoding the input sentence. Based on this idea, the paper proposes a multimodal selective mechanism that aims to select the highlights from the input text using visual signals; the decoder then generates the summary using the filtered encoding information. Concretely, an encoder reads the input text and produces hidden representations. Then, multimodal selective gates measure the relevance between the input words and the image to construct the selected hidden representation. Finally, a decoder generates the summary using the selected hidden representation.
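The gating step can be sketched roughly as follows; this is my reading of the mechanism, not the paper's actual code, and the exact parameterization may differ.

```python
import torch
import torch.nn as nn

class MultimodalSelectiveGate(nn.Module):
    def __init__(self, text_dim, image_dim):
        super().__init__()
        self.w_h = nn.Linear(text_dim, text_dim)
        self.w_v = nn.Linear(image_dim, text_dim)

    def forward(self, hidden, image_feat):
        # hidden: (batch, seq_len, text_dim) encoder hidden states
        # image_feat: (batch, image_dim) global visual feature
        gate = torch.sigmoid(self.w_h(hidden) + self.w_v(image_feat).unsqueeze(1))
        # Keep only the visually relevant parts of each hidden state
        return gate * hidden
```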
Multimodal summarization for open-domain videos is an emerging task, aiming to generate a summary from multi-source information (video, audio, transcript). Despite the success of recent multi-encoder-decoder frameworks on this task, existing methods lack fine-grained multimodal interactions over the multi-source inputs. Moreover, unlike other multimodal tasks, this task involves longer multimodal sequences with more redundancy and noise. To address these two issues, the paper proposes a multistage fusion network with a fusion forget gate module, which models fine-grained interactions between the multi-source modalities through a multistep fusion schema and controls the flow of redundant information between long multimodal sequences via a forgetting module.
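An illustrative forget-gate step is sketched below; it conveys the idea of filtering redundant fused information, but it is not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class FusionForgetGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, src, fused):
        # src, fused: (batch, seq_len, dim) source states and cross-modal fused states
        f = torch.sigmoid(self.gate(torch.cat([src, fused], dim=-1)))
        # The gate decides, per position, how much fused information to keep
        return f * fused + (1 - f) * src
```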
Multimodal text summarization is the task of condensing the information from the interacting modalities into an output summary. This generated output summary may be unimodal or multimodal.
In real-world applications, the input is usually a video consisting of hundreds of frames. Consequently, the temporal dependency in a video cannot simply be modeled by static encoding methods. Hence, in this work, Video-based Multimodal Summarization with Multimodal Output (VMSMO) is proposed, which selects a cover frame from the news video and generates a textual summary of the news article at the same time.
Existing models suffer from the following drawbacks:
Most existing approaches extract visual information from the accompanying images but ignore related videos. The paper contends that videos contain abundant content and have temporal characteristics, with events represented chronologically, which are crucial for text summarization.
Although attention mechanisms and early fusion are used extensively, they introduce noise because they are unsuitable for unaligned multi-modal data, which is characterized by a large gap between modalities that requires intensive communication.
Various multi-modal summarization works have focused on a single task, such as text or video summarization with added information from other modalities. The paper observes that both summarization tasks share the same goal of refining long source material, and as such can be performed jointly owing to their common characteristics.
There are three differences between query-focused video summarization and generic video summarization:
Firstly, the video summary needs to take the subjectivity of users into account, as different user queries may receive different video summaries.
Secondly, trained video summarizers cannot meet all users' preferences, and performance evaluation often measures only temporal overlap, which makes it hard to capture the semantic similarity between summaries and the original videos.
Thirdly, the textual query will bring additional semantic information to the task.
Commercial product advertisements, as a critical component of marketing management in e-commerce platforms, aim to attract consumers' interest and arouse their desire to purchase the products. However, most product advertisements are so miscellaneous and tedious that consumers cannot be expected to be patient enough to read through them carefully.
This paper introduces ROCMMS, a system that automatically converts existing text to multimodal summaries (MMS) that capture the meaning of a complex sentence in a diagram containing pictures and simplified text related by structure extracted from the original sentence.
The model in this paper is trained on a wide variety of images paired with the wide variety of natural language supervision that is abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks without directly optimizing for each benchmark's performance.
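The zero-shot recipe the paper describes can be sketched generically as below: class names are turned into natural-language prompts and the image is assigned to the closest prompt in embedding space. `image_encoder`, `text_encoder`, and the prompt template are placeholders, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    # Turn class names into natural-language prompts
    prompts = [f"a photo of a {name}" for name in class_names]
    img_emb = F.normalize(image_encoder(image), dim=-1)   # (1, d)
    txt_emb = F.normalize(text_encoder(prompts), dim=-1)  # (num_classes, d)
    sims = img_emb @ txt_emb.t()                          # cosine similarities
    return class_names[sims.argmax(dim=-1).item()]
```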
So far, I have gone through a lot of papers and have been working with autoencoders to transform Canny edge images from one viewpoint to another. This approach, however, does not work on doodle-based sketches with a very low level of abstraction, which are more commonly drawn by humans. We tried CycleGAN to add the required abstraction to the Canny images, but that approach did not give any fruitful results.
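For reference, the Canny edge inputs mentioned above can be produced with OpenCV along these lines; the file path and thresholds are illustrative.

```python
import cv2

img = cv2.imread("view.png", cv2.IMREAD_GRAYSCALE)  # placeholder input image
edges = cv2.Canny(img, 100, 200)                     # low/high hysteresis thresholds
cv2.imwrite("view_edges.png", edges)
```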