Annotation Guidelines Refinement
Published:
To define a concrete approach for classification, we need an architecture and techniques that can beat the state of the art. I therefore surveyed recent literature on the newest approaches and read the following papers:
Subword-level Composition Functions for Learning Word Embeddings: This paper covers sub-word level analysis. It evaluates several types of sub-word composition for training word embeddings in a skip-gram-like model, comparing fastText, skip-gram, CNNs and RNNs over sub-word level elements.
Looking Beyond the Obvious: Code-Mixed Sentiment Analysis: This paper carries the idea of sub-word level elements over to code-mixed sentiment analysis, using CNNs and attention mechanisms with a collective encoder and a specific encoder, and also incorporating a feature network. It seems to perform very well on the classification task.
Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text: This paper provides a Hi-En code-mixed dataset for sentiment analysis and explores sub-word level representations. It proposes a CNN-based architecture combined with LSTMs for sub-word level elements and classification.
SWDE: A Sub-Word And Document Embedding Based Engine for Clickbait Detection: This paper also discusses a sub-word level architecture.
Neural Machine Translation of Rare Words with Subword Units: This paper presents a byte pair encoding (BPE) algorithm for word segmentation. BPE represents an open vocabulary through a fixed-size vocabulary of variable-length character sequences.
Sentiment Analysis of Code-Mixed Languages leveraging Resource Rich Languages: This paper introduces a Siamese network with a BiLSTM RNN for the classification task.
Hierarchical CVAE for Fine-Grained Hate Speech Classification: This paper performs fine-grained classification of hate speech into 40 categories drawn from SPLC hate groups, using probabilistic models for classification.
We decided to move forward with a sub-word level approach that combines BPE with hand-crafted feature analysis.
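Since BPE is the segmentation step we plan to build on, here is a minimal sketch of how BPE merges can be learned from word frequencies. It follows the general idea described in the BPE paper above (repeatedly merge the most frequent adjacent symbol pair), but the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker and the toy word counts are assumptions chosen for illustration, not our actual pipeline or data.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds make sure we only match whole, space-delimited symbols.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word -> count mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


if __name__ == "__main__":
    # Hypothetical toy counts, for illustration only.
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, vocab = learn_bpe(toy_counts, num_merges=5)
    print(merges)
    print(vocab)
```

The number of merges controls the size of the sub-word vocabulary: few merges keep the segmentation close to characters, while many merges approach whole words. For code-mixed text, this value would need to be tuned on our own data.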
Annotation
Based on a discussion with Pulkit Parikh sir, we decided to change the guidelines. We added a "both" class alongside "group" and "individual", and introduced proper definitions. However, some confusion remains about the classification of hate classes. The guidelines are available here:
https://docs.google.com/document/d/1__HEQjTVmcONpc_LY1J0R-zZN1sf7l39oWtioUpWbVg/edit?usp=sharing
We referred to the following papers and articles to come up with the classes stated above:
- https://arxiv.org/abs/1807.03688?fbclid=IwAR3aaPCppgrCUXijDCjPooVNqxhuFZHPl28EiM2M6jE3v8oKxF4DnYcZky0
- https://arxiv.org/abs/1812.01693?fbclid=IwAR2eFSlWnWhwjw5yHlRvZ0jyirFZ44AfufXKuiOQE3hfUFfj318iBgjYt3Q
- https://www.facebook.com/communitystandards/hate_speech
- https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
- and other related sources
We will finalize the annotation guidelines soon and then start annotating around 400 posts to check that everything works as intended.