Research Notebook
'Mixology' is an open research project that aims to extract opinions in times of crisis, here from a corpus collected via the Twitter API from December 12 to 31, 2021.
Blog 12: Main Dictionaries for Sentiment Analysis
January 20, 2022
While the scientific literature lists many dictionaries (or lexicons) for sentiment analysis (see Blog 10), some are more regularly used (Bogdan and Borza, 2020). Six of them are explored in this post.
1) General Inquirer. This lexicon was developed in 1962 for content analysis in the social sciences. It consists of a list of frequent words taken from the Harvard IV Dictionary and the Lasswell Dictionary. The hand-tagged categories have been improved over time by various researchers. The version distributed via the SentimentAnalysis R package (loaded as DictionaryGI) counts 1,637 positive and 2,005 negative terms. It differs markedly from the version retrieved for this research, which has 1,935 positive and 2,291 negative terms (Stone et al., 1966; Khoo and Johnkhan, 2018).
2) MPQA Subjectivity Lexicon. The General Inquirer lexicon inspired this dictionary. It counts over 8,000 words divided between polarity labels: positive, negative, neutral, and both positive and negative. Their number may vary depending on the version. The one used for this research has 8,222 terms, of which 570 are labelled 'neutral'. It includes adjectives, adverbs, verbs and nouns. This lexicon has been aggregated from several sources, some manually developed and others automatically constructed (Wilson et al., 2005; Khoo and Johnkhan, 2018). It is also available through the abhy/sentiment R package.
3) Bing. It is one of the most used dictionaries. It includes a list of about 6,800 terms, regularly updated since 2004 and classified by negative or positive polarity (Hu and Liu, 2004). This dictionary is available via the get_sentiments function of the R tidytext package; other dictionaries are also available through this package (NRC, Loughran and Afinn).
4) NRC. This lexicon comes from three sources: the 200 unigrams and 200 bigrams (adjectives, adverbs, nouns and verbs) from the Macquarie Thesaurus crossed with terms matched in the Google n-gram corpus (which tracks language evolution in print publications), 640 terms extracted from the WordNet Affect Lexicon, and the terms of the General Inquirer. The aim of this dictionary is to explore emotions through eight categories of emotion (anger, fear, anticipation, trust, surprise, sadness, joy and disgust) and two categories of sentiment (positive and negative). It is, therefore, a much larger dictionary: the version available via the R tidytext package has 13,875 terms (Mohammad and Turney, 2013; Khoo and Johnkhan, 2018).
5) Afinn. The particularity of this lexicon, developed between 2009 and 2011 by Finn Årup Nielsen, is that it rates terms on a Likert-type scale ranging from –5 to +5 (Nielsen, 2011). The version available via tidytext has 2,477 words (this dictionary is also available via the R corpus package).
6) Loughran. This dictionary includes a list of financial terms (Loughran and McDonald, 2011). It has six categories of feeling: constraining, litigious, negative, positive, superfluous and uncertainty.
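The dictionaries above fall into two broad shapes: lists of terms with a polarity label (as in Bing) and lists of terms with a numeric score (as in Afinn). The following is a minimal sketch of how each shape is typically applied to a text. It is not the procedure used in this research: the two small lexicons, their entries and their scores are hypothetical and are not taken from the real word lists, and the function names are invented for illustration.

```python
import re

# Toy lexicons for illustration only. The entries and scores are hypothetical
# and are NOT taken from the real Bing or Afinn word lists.
POLARITY_LEXICON = {
    "good": "positive",
    "safe": "positive",
    "bad": "negative",
    "fear": "negative",
}
VALENCE_LEXICON = {"good": 3, "safe": 1, "bad": -3, "fear": -2}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity_counts(text, lexicon):
    """Count positive and negative terms found in a label-based lexicon."""
    counts = {"positive": 0, "negative": 0}
    for token in tokenize(text):
        label = lexicon.get(token)
        if label in counts:
            counts[label] += 1
    return counts


def valence_score(text, lexicon):
    """Sum the numeric scores of the terms found in a score-based lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


example = "The vaccine is safe and good, no fear"
counts = polarity_counts(example, POLARITY_LEXICON)
score = valence_score(example, VALENCE_LEXICON)
print(counts, score)
```

Note that the sketch counts 'fear' as negative even though the sentence negates it, which is the kind of context a plain word list does not capture.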
These different dictionaries have been developed in English, but many other lexicons exist in other languages. In French, these include the Lexicoder, available via the R quanteda package. The syuzhet package also provides a translation of the NRC dictionary and a translation of the lexicon of the same name.
The results of a sentiment analysis vary depending on the dictionary used. They must also be interpreted with caution, because a word is always used in a given context (a context that general lexicons do not consider) and can be ambiguous or polysemous. The analysis of the English-language corpus of 'vaccination' tweets shows that the number of negative or positive terms in a dictionary does not push the results one way or the other (especially since the terms categorized as negative are generally more numerous). It also highlights that a context-adapted lexicon, or a dictionary containing more terms, is likely to better reflect the richness of a language. See also, on this page, a test carried out on a sample of 5,000 tweets (to be continued).
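To make the dependence on the dictionary concrete, here is a small hedged sketch in which the same sentence is scored with two lexicons. Both word lists are hypothetical and invented for this example: one stands in for a general lexicon, the other for a lexicon adapted to a vaccination context that leaves out the ambiguous word 'positive' (as in 'tested positive'). The sentence is also made up and does not come from the corpus.

```python
import re

# Hypothetical word lists, invented for illustration. Scores are +1 or -1.
GENERAL_LEXICON = {"positive": 1, "shot": -1, "sick": -1}
ADAPTED_LEXICON = {"sick": -1, "protected": 1, "booster": 1}


def score_with(text, lexicon):
    """Return the summed score and the list of terms matched in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    matched = [token for token in tokens if token in lexicon]
    total = sum(lexicon[token] for token in matched)
    return total, matched


# Invented example sentence, not taken from the tweet corpus.
tweet = "Tested positive, feeling sick, but protected after my booster shot"

general_total, general_terms = score_with(tweet, GENERAL_LEXICON)
adapted_total, adapted_terms = score_with(tweet, ADAPTED_LEXICON)

print("general:", general_total, general_terms)
print("adapted:", adapted_total, adapted_terms)
```

In this toy case the two lexicons match the same number of terms yet reach opposite signs, because they disagree on which words carry sentiment in this context.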
Bogdan, M., & Borza, A. (2020). Big Data Analytics and Firm Performance: A Text Mining Approach. In Proceedings of the International Management Conference (Vol. 14, No. 1, pp. 549-560). Faculty of Management, Academy of Economic Studies, Bucharest, Romania.
Hu, M., & Liu, B. (2004, August). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 168-177).
Khoo, C. S., & Johnkhan, S. B. (2018). Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science, 44(4), 491-511.
Loughran, T., & McDonald, B. (2020). Textual analysis in finance. Annual Review of Financial Economics, 12, 357-375.
Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 29(3), 436-465.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.
Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1966). The General Inquirer: A computer approach to content analysis in the behavioral sciences.
Wilson, T., Wiebe, J., & Hoffmann, P. (2005, October). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of human language technology conference and conference on empirical methods in natural language processing (pp. 347-354).