Research Notebook

'Mixology' is an open research project, which aims to extract opinions in times of crisis, here from a corpus collected via the Twitter API, from December 12 to 31, 2021.

Blog 15: Comparative sentiment analysis of the ‘vaccination’ sub-corpus (en, part.1)

February 1, 2022


The first sentiment analysis of the ‘vaccine’ sub-corpus in English (see Blog 7) compared the results obtained with six sentiment dictionaries and the two lexicons developed in this research (Blog 12 and Blog 14).

General Inquirer: this dictionary contains more negative terms than positive ones. Did this labelling influence the results? Negative sentiment dominates here, with 51.17%. This narrow majority nevertheless reveals the polarization of debates around Covid vaccines.

MPQA Subjectivity Lexicon: negative sentiment also dominates the results (41.68%). This dictionary contains twice as many terms as the previous one and an overwhelming majority of negative terms (4,911). Do the number of terms in a dictionary, and the share of them labelled negative, influence the result?

Bing: the results show a majority negative sentiment (57.1%). Put more simply, more than one out of two tweets in this sub-corpus is negative. Here too, however, most terms in the dictionary are labelled negative.
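As a rough illustration of how a share such as "one out of two tweets is negative" can be obtained from a polarity dictionary, here is a minimal Python sketch. It assumes a simple majority rule per tweet, which may differ from the procedure actually used in this project. The lexicon, the tweets and the function names are hypothetical and are not taken from Bing or from the corpus.

```python
from collections import Counter

# Hypothetical toy lexicon: NOT taken from Bing or any real dictionary.
TOY_LEXICON = {
    "safe": "positive",
    "protect": "positive",
    "grateful": "positive",
    "fear": "negative",
    "risk": "negative",
    "angry": "negative",
}


def label_tweet(text, lexicon):
    """Label a tweet by the majority polarity of its matched terms."""
    counts = Counter(lexicon[tok] for tok in text.lower().split() if tok in lexicon)
    if counts["positive"] > counts["negative"]:
        return "positive"
    if counts["negative"] > counts["positive"]:
        return "negative"
    return "neutral"  # ties, and tweets with no matched term


def sentiment_shares(tweets, lexicon):
    """Return the percentage of tweets assigned to each label."""
    labels = Counter(label_tweet(t, lexicon) for t in tweets)
    total = sum(labels.values())
    return {label: round(100 * n / total, 2) for label, n in labels.items()}


# Hypothetical tweets, written for this example only.
tweets = [
    "grateful to feel safe after my booster",
    "angry about the risk nobody explains",
    "fear and risk everywhere",
    "appointment booked for monday",
]
shares = sentiment_shares(tweets, TOY_LEXICON)
print(shares)
```

With a rule of this kind, a dictionary that holds many more negative than positive terms has more chances of matching a negative term in any given tweet, which is the possible bias questioned here.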

NRC: this dictionary was first approached through its negative/positive polarity, and only the results for these two categories were kept at first. It is therefore not the percentages obtained that should be considered here, but a relatively positive trend (22.22%). Note that this dictionary has a much larger number of terms (13,875). When the view is broadened to the eight other categories of feeling in this dictionary, a rather negative trend predominates, particularly with fear (11%) and anger (8.6%).

Afinn: the Likert scale from -5 to 5 has been recategorized (negative, positive, neutral). This dictionary has fewer terms overall, and more of them are categorized negative. The results refute the hypothesis that the number of terms in one category or the other drives the outcome, since they show a relatively positive trend with 52.29%.
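The recategorization of the -5 to 5 scale can be sketched as follows in Python. The cut-off used here (below zero is negative, above zero is positive, exactly zero is neutral) is an assumption, since the post does not state where the boundaries were placed. The function name and the toy entries are hypothetical and do not reproduce Afinn's actual scores.

```python
def recategorize(score: int) -> str:
    """Collapse an integer score on the -5..5 scale into three classes.

    Assumed cut-off: < 0 negative, > 0 positive, 0 neutral.
    """
    if not -5 <= score <= 5:
        raise ValueError(f"score {score} is outside the -5..5 scale")
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"


# Hypothetical entries for illustration only; these are not Afinn's actual scores.
toy_scores = {"disaster": -4, "worry": -2, "vaccine": 0, "relief": 2, "superb": 5}
categories = {term: recategorize(score) for term, score in toy_scores.items()}
print(categories)
```

Note that collapsing the scale discards intensity: under this assumption a term scored -1 and a term scored -5 both count once as negative.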

Loughran: the trend observed is rather negative with 47.47%, against a positive sentiment of 22.8%. Uncertainty comes in third place (17.52%). Here too, the dictionary has more terms categorized negative.

When a dictionary includes an ‘ambiguous’ or ‘neutral’ category, this generally does not influence the results. One explanation could be the relatively low volume of terms in these categories. Negative sentiment tends to dominate with dictionaries containing more negative terms, yet the results obtained with the Afinn dictionary show that a category holding the majority of terms does not necessarily tilt the results towards that category. Moreover, the total number of terms in the dictionary does not seem to influence the results. It should be recalled that all these dictionaries are either general or were developed for a particular context (economy, for Loughran). Is quantity (of terms overall, and of terms within a category) the only possible explanation for these results? How do the results evolve when a dictionary is adapted to the application domain?
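The comparison between dictionary composition and dominant sentiment can be tabulated with a short Python sketch. It only restates the figures reported above for the five dictionaries described as containing more negative than positive terms. NRC is left out because this post does not describe its composition and its percentages are not directly comparable. The variable names are illustrative.

```python
# (dictionary, dominant sentiment reported, share in %), as stated in this post.
# All five are described above as containing more negative than positive terms.
REPORTED = [
    ("General Inquirer", "negative", 51.17),
    ("MPQA", "negative", 41.68),
    ("Bing", "negative", 57.1),
    ("Afinn", "positive", 52.29),
    ("Loughran", "negative", 47.47),
]

# Dictionaries whose dominant sentiment matches their majority (negative) labelling.
follows = [name for name, dominant, _ in REPORTED if dominant == "negative"]
# Dictionaries whose dominant sentiment goes against their majority labelling.
exceptions = [name for name, dominant, _ in REPORTED if dominant != "negative"]

print(f"{len(follows)} of {len(REPORTED)} follow the majority labelling")
print("Exceptions:", ", ".join(exceptions))
```

Four of the five dictionaries follow their majority labelling and Afinn is the single exception, which is why composition alone cannot be taken as the full explanation.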


The use of the Mixology Covid Lexicon (MCL), built from the frequency of unigrams (see Blog 14) and containing practically as many terms categorized positive as negative, highlights that matching the dictionary to the application domain gives rise to more accurate results. This observation relies on the reversed general negative trend, although the margin is not very large (54.75% for the negative sentiment). A more substantial dictionary does not change the results: the same trend is observed between the MCL and the Mixology Lexicon (53.62% for the negative sentiment). Beyond quantity, it is therefore the quality of the lexicon that appears to be the most significant variable to consider in a sentiment analysis.

From a qualitative perspective, the main lesson of this analysis is that the results do not show the balance tilting significantly in one direction or the other, despite the differences observed between the eight dictionaries used. This can therefore be understood as an indication of the polarization of the debates around vaccination, which would have practically as many supporters as opponents. However, this is not enough to determine what the debates crystallize around. Nor does it allow the results to be generalized to the nine Western European countries retained in this sample, all the more so since the United Kingdom dominates this sub-corpus (to be continued).


# # #

Read more

Blog 21: Politicians, experts, and journalists

Blog 20: For vaccination, against restrictions

Blog 19: Comparative Sentiment Analysis

Blog 18: A health and political crisis

Blog 17: Anatomy of the “political/sanitary measures” sub-corpus (en)

Blog 16: Sentiment analysis of the ‘vaccination’ sub-corpus (en, part.2)

Blog 15: Comparative sentiment analysis of the ‘vaccination’ sub-corpus (en, part.1)

Blog 14: An adapted dictionary for the Covid crisis and sentiment analysis

Blog 13: Building a stop words list

Blog 12: Main Dictionaries for Sentiment Analysis

Blog 11: Statistical description of the corpus #RStats

Blog 10: Sentiment analysis or the assessment of subjectivity

Blog 9: Topic modeling of the ‘vaccination’ corpus (English)

Blog 8: Linguistic and quantitative processing of the ‘vaccination’ corpus (English, part.2)

Blog 7: Linguistic and quantitative processing of the ‘vaccination’ corpus (English, part.1)

Blog 6: Collecting the corpus and preparing the lexical analysis

Blog 5: The textclean package

Blog 4: Refining the queries

Blog 3: The rtweet package

Blog 2: Collecting the corpus

Blog 1: An open research project

The challenges of research on media use in times of crisis