Twitter and Sentiment Analysis
Twitter is a popular micro-blogging service where users create status messages (called "tweets"). These tweets sometimes express opinions about different topics. Generally, this type of sentiment analysis is useful for consumers who are trying to research a product or service, or marketers researching public opinion of their company.
Dataset Credits
Sentiment140 dataset: 1,600,000 tweets annotated as positive or negative. http://help.sentiment140.com/for-students/
Preprocessing
- Case folding (converting all text to lowercase)
- Punctuation removal
- Expansion of common abbreviations and acronyms
- Hashtag removal
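The preprocessing steps above can be sketched as follows. The abbreviation map here is a tiny illustrative sample (the real list would be much larger), and whole hashtags are stripped; these are assumptions, not the system's exact choices:

```python
import re
import string

# Tiny illustrative abbreviation map (assumed; the real list is much larger).
ABBREVIATIONS = {"afaik": "as far as i know", "gn": "good night"}

def preprocess(tweet):
    """Apply the four preprocessing steps: case folding, hashtag removal,
    punctuation removal, and abbreviation expansion."""
    text = tweet.lower()                                              # case folding
    text = re.sub(r"#\w+", "", text)                                  # hashtag removal
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]           # abbreviation expansion
    return " ".join(words)

print(preprocess("GN everyone! #sleepy"))  # -> "good night everyone"
```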
The Main System
Training Distributed Semantic Representation (Word2Vec Model)
We use the Python module gensim.models.word2vec for this. We train a model on the preprocessed sentences from the corpus, which generates a vector for every word in the corpus. The model can then be queried for word vectors. For unknown (out-of-vocabulary) words, we fall back to the vectors of words that occur only once in the corpus.
Language Model
- Unigram: the word vectors are taken individually for training. E.g. "I am not dexter" is taken as [I, am, not, dexter]
- Bigram: the word vectors are taken two at a time for training. E.g. "I am not dexter" is taken as [(I, am), (am, not), (not, dexter)]
- Hybrid of Unigram and Bigram: unigrams are used normally, but bigrams are used when sentiment-reversing words such as "not" or "no" are present. E.g. "I am not dexter" is taken as [I, am, (not, dexter)]
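The three tokenization schemes can be sketched as follows. The `NEGATORS` set is an assumed, illustrative list of sentiment-reversing words:

```python
NEGATORS = {"not", "no", "never"}  # assumed list of sentiment-reversing words

def unigrams(tokens):
    """Each word on its own."""
    return list(tokens)

def bigrams(tokens):
    """Consecutive word pairs."""
    return list(zip(tokens, tokens[1:]))

def hybrid(tokens):
    """Unigrams normally, but pair a negator with the word that follows it."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATORS and i + 1 < len(tokens):
            out.append((tokens[i], tokens[i + 1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["i", "am", "not", "dexter"]
print(unigrams(tokens))  # ['i', 'am', 'not', 'dexter']
print(bigrams(tokens))   # [('i', 'am'), ('am', 'not'), ('not', 'dexter')]
print(hybrid(tokens))    # ['i', 'am', ('not', 'dexter')]
```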
Machine Learning Scores
We use the various language models to train several two-class classifiers and compare their results. The classifiers we used are:
- Support Vector Machines - scikit-learn (Python)
- Multi-Layer Perceptron Neural Network - scikit-learn (Python)
- Naive Bayes Classifier - scikit-learn (Python)
- Decision Tree Classifier - scikit-learn (Python)
- Random Forest Classifier - scikit-learn (Python)
- Logistic Regression Classifier - scikit-learn (Python)
- Recurrent Neural Networks - PyBrain (Python)

We chose the hybrid of unigram and bigram with the Random Forest classifier for our system, as this combination gave the best results.
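A sketch of the chosen classifier over per-tweet features. Random stand-in vectors replace the trained Word2Vec vectors, and each tweet is reduced to an average of its word vectors; averaging is one plausible featurization, assumed here rather than taken from the original system:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in word vectors (in the real system these come from the Word2Vec model).
vocab = {w: rng.normal(size=50) for w in ["good", "great", "bad", "awful", "phone"]}

def tweet_vector(tokens):
    """Average the word vectors of a tweet into one fixed-length feature vector."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Toy labeled tweets: 1 = positive, 0 = negative.
train = [(["good", "phone"], 1), (["great"], 1), (["bad", "phone"], 0), (["awful"], 0)]
X = np.array([tweet_vector(t) for t, _ in train])
y = np.array([label for _, label in train])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([tweet_vector(["great", "phone"])]))
```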
Emoticon Scores
- Search for emoticons in the given text using a regular expression (or string find).
- Score each emoticon found using a dictionary.
- Use the resulting emoticon score in the model.
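The steps above can be sketched as follows; the emoticon lexicon and its scores are illustrative assumptions:

```python
import re

# Tiny illustrative emoticon lexicon (assumed scores).
EMOTICON_SCORES = {":)": 1, ":D": 2, ":(": -1, ":'(": -2}

# Regex alternation over the known emoticons, longest first so ":'(" wins over ":(".
pattern = re.compile("|".join(
    re.escape(e) for e in sorted(EMOTICON_SCORES, key=len, reverse=True)))

def emoticon_score(text):
    """Sum the dictionary scores of all emoticons found in the text."""
    return sum(EMOTICON_SCORES[m] for m in pattern.findall(text))

print(emoticon_score("great day :) :D"))  # 1 + 2 = 3
print(emoticon_score("so sad :'("))       # -2
```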
Lexical Scores
- Get the text.
- Lemmatize the text.
- Score the lemmatized text using sentiment dictionaries.
- Use the resulting score in the final system.
- This score is given more weight, as dictionary-based scores are more definite.
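The steps above can be sketched as follows. The lexicon is an illustrative sample, and a toy suffix-stripping lemmatizer stands in for a real one (e.g. NLTK's WordNetLemmatizer) to keep the sketch self-contained:

```python
# Tiny illustrative sentiment lexicon keyed by lemma (assumed scores).
LEXICON = {"love": 2, "good": 1, "bad": -1, "hate": -2}

def lemmatize(word):
    """Toy lemmatizer: strip a common suffix if enough of the word remains.
    A real system would use a proper lemmatizer such as NLTK's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lexical_score(text):
    """Lemmatize each token and sum its dictionary score."""
    return sum(LEXICON.get(lemmatize(w), 0) for w in text.lower().split())

print(lexical_score("She loves the good things"))  # love(2) + good(1) = 3
print(lexical_score("hates bad food"))             # hate(-2) + bad(-1) = -3
```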
Challenges
- Randomness in Data: tweets are written by users, so the language is informal.
- Emoticons: there are many kinds of emoticons, with new ones appearing frequently.
- Abbreviations: users rely on slang abbreviations such as AFAIK and GN; capturing all of them is difficult.
- Grapheme Stretching: emotions are expressed by stretching ordinary words, e.g. Please -> Pleaaaaaseeeeee.
- Reversing Words: some words completely reverse the sentiment of another word. E.g: not good == opposite(good)
- Technical Challenges: classifiers take a long time to train, so small mistakes cost a lot of time.
Future Improvements
- Handle Grapheme Stretching
- Handle authenticity of Data and Users
- Handle Sarcasm and Humor
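For the grapheme-stretching item above, one common approach (a sketch, not something the system currently does) is to collapse runs of three or more repeated letters before lookup:

```python
import re

def collapse_stretching(word):
    """Collapse runs of 3+ repeated letters to a single letter, mapping
    stretched words like 'pleaaaaaseeeeee' back toward their base form."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

print(collapse_stretching("pleaaaaaseeeeee"))  # "please"
```

A fuller version would also try the double-letter variant (for words like "cool" -> "cooooool"), checking both candidates against the vocabulary.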
Contributors
This tool was made by @Akirato as part of the course Information Retrieval and Extraction.
Support or Contact
Having trouble with the tool? Check out our VIDEO for a demo or Raise An Issue and we'll help you sort it out. Here's a Presentation of the same.