Twitter and Sentiment Analysis

Twitter is a popular micro-blogging service where users create status messages (called "tweets"). These tweets sometimes express opinions about different topics. Sentiment analysis of such tweets is useful for consumers who are researching a product or service, and for marketers researching public opinion of their company.

Dataset Credits

1,600,000 sentences annotated as positive or negative, from the Sentiment140 dataset.


  1. Case folding of the data (turning everything to lowercase).
  2. Punctuation removal from the data.
  3. Expansion of common abbreviations and acronyms.
  4. Hashtag removal.
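
The four preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the `ABBREVIATIONS` table and the `preprocess` function are hypothetical names, and the sketch assumes that "hashtag removal" means dropping the whole tag rather than only the `#` character.

```python
import re
import string

# Hypothetical abbreviation table; the list actually used by the tool
# is not given in this document.
ABBREVIATIONS = {"u": "you", "gr8": "great", "idk": "i do not know"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Apply the four preprocessing steps to a single tweet."""
    text = tweet.lower()                       # 1. case folding
    # 4. hashtag removal, done before punctuation removal so that the
    #    '#' character is still there to mark the tag (assumption: the
    #    whole tag is dropped, not just the '#').
    text = re.sub(r"#\w+", " ", text)
    text = text.translate(_PUNCT_TABLE)        # 2. punctuation removal
    # 3. expand abbreviations and acronyms token by token
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


print(preprocess("U are GR8!!! #happy"))
```

Running the hashtag step first is a design choice of this sketch; if only the `#` symbol should be removed, the punctuation step alone already covers it.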

The Main System

Training Distributed Semantic Representation (Word2Vec Model)

We use the Python module gensim.models.word2vec for this. We train a model using only the preprocessed sentences from the corpus, which produces a vector for every word in the corpus. The model can then be used to look up word vectors. For unknown words, we use the vectors of words with frequency one.
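
The fallback for unknown words can be illustrated with a small sketch. It uses a plain dictionary in place of the trained gensim model so that it runs on its own, and it assumes the vectors of frequency-one words are averaged into a single "unknown" vector; the text above does not say how those vectors are combined, so the averaging and the names `build_unknown_vector` and `lookup` are assumptions.

```python
from collections import Counter


def build_unknown_vector(sentences, vectors):
    """Average the vectors of all words that occur exactly once.

    `sentences` is a list of token lists; `vectors` maps word -> list of
    floats (a stand-in for the trained word2vec vectors).
    Averaging is an assumption of this sketch.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    rare = [vectors[w] for w, c in counts.items() if c == 1 and w in vectors]
    if not rare:
        return None
    dim = len(rare[0])
    return [sum(vec[i] for vec in rare) / len(rare) for i in range(dim)]


def lookup(word, vectors, unknown_vector):
    """Return the word's vector, or the fallback for unseen words."""
    return vectors.get(word, unknown_vector)


# Toy corpus and toy 2-dimensional vectors, for illustration only.
sentences = [["good", "day"], ["good", "night"]]
vectors = {"good": [1.0, 0.0], "day": [0.0, 2.0], "night": [2.0, 0.0]}

unk = build_unknown_vector(sentences, vectors)
print(lookup("good", vectors, unk))    # known word
print(lookup("zzz", vectors, unk))     # unknown word, falls back
```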

Language Model

Machine Learning Scores

We train various two-class classifiers on the different language models to obtain results. The classifiers we used are:

Emoticon Scores

  1. Search for Emoticons in the given text using RegEx or find.
  2. Use a dictionary to score the emoticons.
  3. Use this emoticon score in the model.
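
A minimal sketch of the emoticon scoring steps above, using the regular-expression route. The `EMOTICON_SCORES` dictionary and its values are hypothetical; the dictionary used by the tool is not listed in this document.

```python
import re

# Hypothetical emoticon dictionary; scores are illustrative only.
EMOTICON_SCORES = {
    ":)": 1.0,
    ":-)": 1.0,
    ":D": 2.0,
    ":(": -1.0,
    ":-(": -1.0,
    ":'(": -2.0,
}

# One alternation pattern over all known emoticons. Longer emoticons come
# first so that a long emoticon is tried before any shorter one.
EMOTICON_RE = re.compile(
    "|".join(
        re.escape(e) for e in sorted(EMOTICON_SCORES, key=len, reverse=True)
    )
)


def emoticon_score(text):
    """Sum the dictionary scores of every emoticon found in the text."""
    return sum(EMOTICON_SCORES[match] for match in EMOTICON_RE.findall(text))


print(emoticon_score("great day :) :D"))
print(emoticon_score("so sad :'("))
```

In this sketch the score would have to be computed on the raw tweet, since punctuation removal would delete the emoticons before they could be matched.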

Lexical Scores

  1. Get the text
  2. Lemmatize the text
  3. Score the Lemmatized text using dictionaries
  4. Use the score in the final system.
  5. Give this score more weight, as it is more definite.
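
The lexical scoring steps above might look like the following sketch. Everything named here is an assumption made for illustration: the tiny `LEMMAS` table stands in for a real lemmatizer, `LEXICON` stands in for the sentiment dictionaries, and the weighted sum in `combine` is just one simple way to give the lexical score more weight.

```python
# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {
    "loved": "love",
    "loving": "love",
    "hated": "hate",
    "movies": "movie",
    "better": "good",
}

# Hypothetical sentiment lexicon: lemma -> score.
LEXICON = {"love": 1.0, "good": 1.0, "hate": -1.0, "bad": -1.0}


def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


def lexical_score(text):
    """Steps 1-3: tokenize, lemmatize, and sum the dictionary scores."""
    tokens = text.lower().split()
    return sum(LEXICON.get(lemma, 0.0) for lemma in lemmatize(tokens))


def combine(lexical, other, lexical_weight=2.0):
    """Steps 4-5: weight the lexical score more heavily than the other
    score. The weight of 2.0 is an arbitrary illustrative value."""
    return lexical_weight * lexical + other


print(lexical_score("I loved the movies"))
print(combine(lexical_score("I loved the movies"), 0.5))
```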


Future Improvements


This tool was made by @Akirato as part of the course Information Retrieval and Extraction.

Support or Contact

Having trouble with the tool? Check out our demo video or raise an issue and we’ll help you sort it out. Here's a presentation of the same.