Twitter and Sentiment Analysis
Twitter is a popular micro-blogging service where users create status messages (called "tweets"). These tweets sometimes express opinions about different topics. Generally, this type of sentiment analysis is useful for consumers who are trying to research a product or service, or marketers researching public opinion of their company.
Dataset Credits
Sentiment140 dataset: 1,600,000 tweets annotated as positive or negative. http://help.sentiment140.com/for-students/
Preprocessing
- Case folding (converting all text to lowercase)
- Punctuation removal
- Expansion of common abbreviations and acronyms
- Hashtag removal
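The preprocessing steps above can be sketched as follows. The abbreviation map here is a tiny illustrative sample (the real list would be much larger), and whole hashtags are stripped; these are assumptions, not the system's exact choices:

```python
import re
import string

# Tiny illustrative abbreviation map (assumed; the real list is much larger).
ABBREVIATIONS = {"afaik": "as far as i know", "gn": "good night"}

def preprocess(tweet):
    """Apply the four preprocessing steps: case folding, hashtag removal,
    punctuation removal, and abbreviation expansion."""
    text = tweet.lower()                                              # case folding
    text = re.sub(r"#\w+", "", text)                                  # hashtag removal
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]           # abbreviation expansion
    return " ".join(words)

print(preprocess("GN everyone! #sleepy"))  # -> "good night everyone"
```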
The Main System
Training Distributed Semantic Representation (Word2Vec Model)
We use the Python module gensim.models.word2vec for this. We train a model on the preprocessed sentences from the corpus, which generates a vector for every word in the corpus. The model can then be queried for word vectors. For unknown (out-of-vocabulary) words, we fall back to the vectors of words that occur only once in the corpus.
Language Model
- Unigram: the word vectors are taken individually for training. E.g. "I am not dexter" is taken as [I, am, not, dexter]
- Bigram: the word vectors are taken two at a time for training. E.g. "I am not dexter" is taken as [(I, am), (am, not), (not, dexter)]
- Hybrid of Unigram and Bigram: unigrams are used normally, but bigrams are used when sentiment-reversing words such as "not" or "no" are present. E.g. "I am not dexter" is taken as [I, am, (not, dexter)]
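The three tokenization schemes can be sketched as follows. The `NEGATORS` set is an assumed, illustrative list of sentiment-reversing words:

```python
NEGATORS = {"not", "no", "never"}  # assumed list of sentiment-reversing words

def unigrams(tokens):
    """Each word on its own."""
    return list(tokens)

def bigrams(tokens):
    """Consecutive word pairs."""
    return list(zip(tokens, tokens[1:]))

def hybrid(tokens):
    """Unigrams normally, but pair a negator with the word that follows it."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATORS and i + 1 < len(tokens):
            out.append((tokens[i], tokens[i + 1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["i", "am", "not", "dexter"]
print(unigrams(tokens))  # ['i', 'am', 'not', 'dexter']
print(bigrams(tokens))   # [('i', 'am'), ('am', 'not'), ('not', 'dexter')]
print(hybrid(tokens))    # ['i', 'am', ('not', 'dexter')]
```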
Machine Learning Scores
We use the various language models to train several two-class classifiers and compare their results. The classifiers we used are:
- Support Vector Machines - scikit-learn (Python)
- Multi-Layer Perceptron Neural Network - scikit-learn (Python)
- Naive Bayes Classifier - scikit-learn (Python)
- Decision Tree Classifier - scikit-learn (Python)
- Random Forest Classifier - scikit-learn (Python)
- Logistic Regression Classifier - scikit-learn (Python)
- Recurrent Neural Networks - PyBrain (Python)

We chose the hybrid of unigram and bigram with the Random Forest classifier for our system, as this combination gave the best results.
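A sketch of the chosen classifier over per-tweet features. Random stand-in vectors replace the trained Word2Vec vectors, and each tweet is reduced to an average of its word vectors; averaging is one plausible featurization, assumed here rather than taken from the original system:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in word vectors (in the real system these come from the Word2Vec model).
vocab = {w: rng.normal(size=50) for w in ["good", "great", "bad", "awful", "phone"]}

def tweet_vector(tokens):
    """Average the word vectors of a tweet into one fixed-length feature vector."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Toy labeled tweets: 1 = positive, 0 = negative.
train = [(["good", "phone"], 1), (["great"], 1), (["bad", "phone"], 0), (["awful"], 0)]
X = np.array([tweet_vector(t) for t, _ in train])
y = np.array([label for _, label in train])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([tweet_vector(["great", "phone"])]))
```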
Emoticon Scores
- Search for emoticons in the given text using a regular expression (or string find).
- Score each emoticon found using a dictionary.
- Use the resulting emoticon score in the model.
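The steps above can be sketched as follows; the emoticon lexicon and its scores are illustrative assumptions:

```python
import re

# Tiny illustrative emoticon lexicon (assumed scores).
EMOTICON_SCORES = {":)": 1, ":D": 2, ":(": -1, ":'(": -2}

# Regex alternation over the known emoticons, longest first so ":'(" wins over ":(".
pattern = re.compile("|".join(
    re.escape(e) for e in sorted(EMOTICON_SCORES, key=len, reverse=True)))

def emoticon_score(text):
    """Sum the dictionary scores of all emoticons found in the text."""
    return sum(EMOTICON_SCORES[m] for m in pattern.findall(text))

print(emoticon_score("great day :) :D"))  # 1 + 2 = 3
print(emoticon_score("so sad :'("))       # -2
```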
Lexical Scores
- Get the text.
- Lemmatize the text.
- Score the lemmatized text using sentiment dictionaries.
- Use the resulting score in the final system.
- This score is given more weight, as dictionary-based scores are more definite.
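The steps above can be sketched as follows. The lexicon is an illustrative sample, and a toy suffix-stripping lemmatizer stands in for a real one (e.g. NLTK's WordNetLemmatizer) to keep the sketch self-contained:

```python
# Tiny illustrative sentiment lexicon keyed by lemma (assumed scores).
LEXICON = {"love": 2, "good": 1, "bad": -1, "hate": -2}

def lemmatize(word):
    """Toy lemmatizer: strip a common suffix if enough of the word remains.
    A real system would use a proper lemmatizer such as NLTK's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lexical_score(text):
    """Lemmatize each token and sum its dictionary score."""
    return sum(LEXICON.get(lemmatize(w), 0) for w in text.lower().split())

print(lexical_score("She loves the good things"))  # love(2) + good(1) = 3
print(lexical_score("hates bad food"))             # hate(-2) + bad(-1) = -3
```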
Challenges
- Randomness in Data: tweets are written by users, so the language is informal.
- Emoticons: there are many kinds of emoticons, with new ones appearing frequently.
- Abbreviations: users rely on slang abbreviations such as AFAIK and GN; capturing all of them is difficult.
- Grapheme Stretching: emotions are expressed by stretching ordinary words, e.g. Please -> Pleaaaaaseeeeee.
- Reversing Words: some words completely reverse the sentiment of another word. E.g: not good == opposite(good)
- Technical Challenges: classifiers take a long time to train, so small mistakes cost a lot of time.
Future Improvements
- Handle Grapheme Stretching
- Handle authenticity of Data and Users
- Handle Sarcasm and Humor
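For the grapheme-stretching item above, one common approach (a sketch, not something the system currently does) is to collapse runs of three or more repeated letters before lookup:

```python
import re

def collapse_stretching(word):
    """Collapse runs of 3+ repeated letters to a single letter, mapping
    stretched words like 'pleaaaaaseeeeee' back toward their base form."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

print(collapse_stretching("pleaaaaaseeeeee"))  # "please"
```

A fuller version would also try the double-letter variant (for words like "cool" -> "cooooool"), checking both candidates against the vocabulary.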
Contributors
This tool was made by @Akirato as part of the course Information Retrieval and Extraction.
Support or Contact
Having trouble with the tool? Check out our VIDEO for a demo or Raise An Issue and we'll help you sort it out. Here's a Presentation of the same.