Research Notebook

The « Mixology » open research project aims to probe opinions in times of crisis from a corpus collected via the Twitter API. Its second objective is to develop an original research tool that can also be reused to analyse headlines or media content (computational linguistics and machine learning methods), in line with media studies and journalism studies.

Blog 10: Sentiment analysis or the assessment of subjectivity

14 January 2022

Sentiment analysis is a text mining technique that analyses and classifies subjective opinions. Regularly used to monitor the opinions conveyed on social networks, it can be approached in several ways.

Not used to machine learning? Read this first!

Public opinion refers to the views and desires of the majority of a population on political, commercial, social, or other matters (El Barachi et al., 2021). ‘Public opinion’ carries a kind of internal syntactic contradiction: while the term public designates the group and the universal, opinion is generally associated with the individual and considered an internal and subjective formulation (Glynn and Huge, 2008). It is, therefore, a relatively broad concept, which also covers the concepts of feeling, evaluation, appreciation, or attitude. Moreover, an opinion can stand alone or form part of a set of opinions. A feeling can be defined as an affective state of consciousness that results from emotions (Cambria et al., 2017).

Sentiment analysis is a discipline of text mining and opinion mining that analyses and classifies subjective opinions, feelings, and emotions towards products, organizations, individuals, and other subjects (Medhat et al., 2014; Keshavarz and Abadeh, 2017). Based on natural language processing (NLP), it is studied in a variety of heterogeneous fields (politics, marketing, business, sociology, etc.) (Mowlaei et al., 2020; Parvin et al., 2021). It is regularly used to study the opinions of Internet users expressing themselves on social networks, or even to predict election results, because it facilitates the understanding of the extracted data (Chauhan et al., 2021; Dang et al., 2020; Jain, 2021).

Polarity and intensity are two components used in sentiment analysis. Polarity indicates whether the sentiment is negative, neutral, or positive, while intensity indicates the relative strength of the feeling (Dang et al., 2020).
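To make the distinction concrete, here is a minimal Python sketch of how a scored lexicon can yield both components. The four-word lexicon, its scores, and the function name score_text are illustrative assumptions, not taken from any published dictionary.

```python
# Toy lexicon: words and scores are illustrative assumptions only.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def score_text(text, lexicon=TOY_LEXICON):
    """Return (polarity, intensity) from the summed scores of known terms."""
    tokens = text.lower().split()
    # Strip trailing/leading punctuation so that 'great!' matches 'great'.
    total = sum(lexicon.get(tok.strip(".,!?;:"), 0.0) for tok in tokens)
    if total > 0:
        polarity = "positive"
    elif total < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    # Intensity is taken here as the absolute value of the summed score.
    return polarity, abs(total)


print(score_text("The vaccine rollout was great!"))
print(score_text("A bad and terrible idea"))
print(score_text("Where is the clinic?"))
```

In this sketch the sign of the summed score gives the polarity and its absolute value stands in for the intensity; real lexicons use far larger vocabularies and more careful scoring.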

Dictionary vs corpus

Sentiment analysis can be tackled through two approaches. The first is dictionary-based (or lexicon-based) and relies on a collection of scored terms, usually positive and negative (Zhang et al., 2018; Rice and Zorn, 2021). The second is corpus-based: a corpus, annotated or not (annotations comprise opinion terms and syntactic rules), serves as the basis for classification operations in supervised or unsupervised learning (Dang et al., 2020; Jain, 2021).

According to Medhat et al. (2014), a supervised learning scheme is less efficient than a dictionary-based approach because of its complexity: it is challenging to prepare a corpus large enough to cover all the terms of a given language. However, it has the advantage of helping to identify opinions specific to a given domain or context. Other researchers consider it generally more precise, although much slower than a dictionary- or lexicon-based method because of the annotation work involved (Augustyniak et al., 2015; Khoo and Johnkhan, 2018). Still others believe that a supervised learning scheme does not lead to sufficiently meaningful results (van Atteveldt et al., 2021).

In an unsupervised learning scheme, the data used for training are neither classified nor labelled. This method models the underlying structure of the data to learn about the dataset (Jain, 2021). It includes clustering methods, which group data and can produce precise results without any human intervention, and association methods, which explore the relationships between large portions of data when dealing with an extensive database (Jain, 2021).

Although it is recognized that a hybrid approach – which combines lexicon-based and machine learning methods – potentially leads to better results (Prabowo and Thelwall, 2009; Jain, 2021; Hardeniya and Borikar, 2016), a lexicon-based approach is the most appropriate when data are insufficient or when no training data are available (Khoo and Johnkhan, 2018). It is also the most relevant for processing small corpora (Deng et al., 2017). Moreover, some researchers believe that a lexicon transfers better to other contexts than a classifier applied to a problem for which it was not trained (Turner et al., 2021).

Limits of a lexicon-based approach

In a lexicon-based approach, sentiment analysis is usually performed in two phases: detecting subjectivity, which relates to the subject to which the feeling is directed, and assigning polarity, using a lexicon or dictionaries (Hardeniya and Borikar, 2016).

While they have the advantage of being ready-to-use (Rice and Zorn, 2021), lexicons do not cover all application domains – that would be practically impossible – and can lead to erroneous or invalid results (Grimmer and Stewart, 2013). In addition, the number of terms they contain is not always sufficient to cover the full richness of a language that counts tens of thousands of words.

This is why van Atteveldt et al. (2021) recommend using as many dictionaries as possible for sentiment analysis and considering customizing an existing dictionary or creating one's own. But building a dictionary takes time and raises the whole question of its validation, since it is compiled under human supervision (Grimmer and Stewart, 2013; Rice and Zorn, 2021; Mowlaei et al., 2020; Deng et al., 2017; Bagheri et al., 2013).

An original lexicon can also be built dynamically via classification algorithms, for example Naive Bayes or the Support Vector Machine, both regularly used for classification operations (Bonta et al., 2019; Medhat et al., 2014). Here, it is a question of training a classifier on an annotated corpus (Keshavarz and Abadeh, 2017). This form of supervised learning has the advantage of giving rise to contextual lexicons. However, it requires a significant amount of data to work well.
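As a rough illustration of how a contextual lexicon might be derived from an annotated corpus, the Python sketch below scores each term by the log-ratio of its smoothed likelihoods in the positive and negative classes, the quantity a Naive Bayes classifier relies on. The four annotated tweets, the labels "pos" and "neg", and the function name learn_lexicon are hypothetical and chosen only for illustration; a real corpus would need to be far larger.

```python
import math
from collections import Counter


def learn_lexicon(labelled_texts, smoothing=1.0):
    """Derive term scores from (text, label) pairs, label in {"pos", "neg"}.

    Each score is log P(term | pos) - log P(term | neg), with add-one
    (Laplace) smoothing so that unseen terms do not produce log(0).
    """
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labelled_texts:
        counts[label].update(text.lower().split())
    vocab = set(counts["pos"]) | set(counts["neg"])
    totals = {
        label: sum(c.values()) + smoothing * len(vocab)
        for label, c in counts.items()
    }
    return {
        term: math.log((counts["pos"][term] + smoothing) / totals["pos"])
        - math.log((counts["neg"][term] + smoothing) / totals["neg"])
        for term in vocab
    }


# Hypothetical hand-annotated tweets, for illustration only.
TRAINING = [
    ("tested positive again", "neg"),
    ("tested positive feeling awful", "neg"),
    ("vaccine works great", "pos"),
    ("great news today", "pos"),
]

lexicon = learn_lexicon(TRAINING)
for term in ("positive", "great"):
    print(term, round(lexicon[term], 2))
```

With this toy corpus the word "positive" receives a negative score, because it only appears in tweets annotated as negative. This is the sense in which a learned lexicon is contextual.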

A lexicon-based approach has several limits:

1) commonly used lexicons are too general and do not take into account the context in which the term is used, whereas the meaning of the words and the feeling depends precisely on context;

2) the lexicons are therefore often insufficient and lack terms adapted to the field;

3) they do not deal very well with ambiguous semantics (for example, in the context of the Covid crisis, ‘tested positive’ is nothing positive, while ‘the government is great’ can express a positive feeling from its author when taken literally, or a negative one when read as irony);

4) a term judged to be positive can become negative as the field evolves;

5) a term judged favourable in one field may be negative in another;

6) building an original lexicon takes time when undertaken manually;

7) negation is not handled in an approach relying solely on unigrams, even though it reverses the polarity of the sentence;

8) a sentence may contain no sentiment, as in the case of some interrogative sentences (Khoo and Johnkhan, 2018; Liu, 2012; Deng et al., 2017; Jain, 2021; Mejova, 2009; Hardeniya and Borikar, 2016).
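The negation problem in point 7 can be shown with a short Python sketch that compares a pure unigram score with a score that flips a term's value when the preceding token is a negator. The lexicon, the negator list, and the one-token window are simplifying assumptions; contracted forms such as "isn't" are not handled.

```python
# Toy lexicon and negator list: illustrative assumptions only.
TOY_LEXICON = {"good": 1.0, "happy": 1.0, "safe": 1.0, "bad": -1.0}
NEGATORS = {"not", "no", "never"}


def unigram_score(text):
    """Sum term scores without looking at neighbouring words."""
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in text.lower().split())


def negation_aware_score(text):
    """Flip the score of a term when the previous token is a negator."""
    tokens = text.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        value = TOY_LEXICON.get(tok, 0.0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        total += value
    return total


sentence = "the vaccine is not safe"
print(unigram_score(sentence))         # unigram view: 'safe' counts as positive
print(negation_aware_score(sentence))  # the negator reverses the polarity
```

The unigram score treats the sentence as positive because it only sees "safe", whereas the negation-aware variant reverses it. Irony and ambiguity (point 3) are not addressed by this kind of rule.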

In addition, a feeling expressed in a sentence cannot be reduced to the sum of the feelings of its constituents. The sentiment is a function of the sentence constituents (Cambria et al., 2017, p.70).

Sentiment analysis is thus an assessment of subjectivity that, whether it relies on a lexicon or on a supervised method, itself involves a degree of subjectivity.

Challenges of content published on Twitter

User-generated content on Twitter relies on social interactions. In terms of quality, it presents several difficulties that must be resolved before the sentiment analysis:

1) the text is generally not well-formed in terms of natural language grammar, structure, and formality;

2) there is no spelling harmonization, and some terms may therefore have several different spellings;

3) abbreviations are widely used and do not always follow a standardized logic;

4) slang or jargon words are not necessarily included in lexicons intended for sentiment analysis;

5) out-of-scope terms add ‘noise’ to content;

6) the context is not always well defined;

7) tweets may contain signs, usernames, emoticons, hashtags, hyperlinks, and non-textual content. In addition, several opinions can coexist in a single tweet (Deng et al., 2017; Jain, 2021; Kumar and Sebastian, 2012; Cambria et al., 2017, p.142; Martínez-Cámara et al., 2014).

That is why sentiment analysis must be carefully prepared upstream, with priority given to data processing. This is all the more critical since it has been demonstrated that improving the quality of the data leads to a better quality of sentiment classification (Li et al., 2020).
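As an indication of what such upstream processing can look like, the Python sketch below removes hyperlinks and usernames, keeps the word of a hashtag, and shortens elongated spellings. The regular expressions, the function name clean_tweet, and the sample tweet are assumptions made for illustration; they are not the cleaning pipeline used in the project.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
ELONGATION_RE = re.compile(r"(\w)\1{2,}")  # a character repeated 3+ times


def clean_tweet(text):
    """Apply a few simple normalisation steps to a raw tweet."""
    text = URL_RE.sub(" ", text)             # drop hyperlinks
    text = MENTION_RE.sub(" ", text)         # drop usernames
    text = HASHTAG_RE.sub(r"\1", text)       # keep the hashtag word, drop '#'
    text = ELONGATION_RE.sub(r"\1\1", text)  # 'soooo' becomes 'soo'
    return " ".join(text.lower().split())    # lowercase, collapse whitespace


# Hypothetical tweet, for illustration only.
raw = "@WHO Vaccines are soooo GOOD!!! #GetVaccinated https://t.co/abc123"
print(clean_tweet(raw))
# -> vaccines are soo good!!! getvaccinated
```

Punctuation and emoticons are deliberately left in place here, since they may carry sentiment; whether to keep them is a design choice that depends on the lexicon used afterwards.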



Augustyniak, Ł., Szymański, P., Kajdanowicz, T., & Tuligłowicz, W. (2015). Comprehensive study on lexicon-based ensemble classification sentiment analysis. Entropy (Basel, Switzerland), 18(1), 4. https://doi.org/10.3390/e18010004

Bagheri, A., Saraee, M., & de Jong, F. (2013). Care more about customers: Unsupervised domain-independent aspect detection for sentiment analysis of customer reviews. Knowledge-Based Systems, 52, 201–213. https://doi.org/10.1016/j.knosys.2013.08.011

Bogdan, M., & Borza, A. (2020). Big Data Analytics And Firm Performance: A Text Mining Approach. In Proceedings of the INTERNATIONAL MANAGEMENT CONFERENCE (Vol. 14, No. 1, pp. 549-560). Faculty of Management, Academy of Economic Studies, Bucharest, Romania.

Bonta, V., Kumaresh, N., & Janardhan, N. (2019). A comprehensive study on lexicon based approaches for sentiment analysis. Asian Journal of Computer Science and Technology, 8(S2), 1–6. https://doi.org/10.51983/ajcst-2019.8.s2.2037

Cambria, E., Das, D., Bandyopadhyay, S., & Feraco, A. (Eds.). (2017). A practical guide to sentiment analysis (1st ed.). Springer International Publishing.

Chauhan, P., Sharma, N., & Sikka, G. (2021). The emergence of social media data and sentiment analysis in election prediction. Journal of Ambient Intelligence and Humanized Computing, 12(2), 2601–2627. https://doi.org/10.1007/s12652-020-02423-y

Dang, N. C., Moreno-García, M. N., & De la Prieta, F. (2020). Sentiment analysis based on deep learning: A comparative study. In arXiv [cs.CL]. http://arxiv.org/abs/2006.03541

Deng, S., Sinha, A. P., & Zhao, H. (2017). Adapting sentiment lexicons to domain-specific social media texts. Decision Support Systems, 94, 65–76. https://doi.org/10.1016/j.dss.2016.11.001

El Barachi, M., AlKhatib, M., Mathew, S., & Oroumchian, F. (2021). A novel sentiment analysis framework for monitoring the evolving public opinion in real-time: Case study on climate change. Journal of Cleaner Production, 312, 127820. https://doi.org/10.1016/j.jclepro.2021.127820

Glynn, C. J., & Huge, M. E. (2008). Public Opinion. In The International Encyclopedia of Communication. John Wiley & Sons, Ltd.

Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis: An Annual Publication of the Methodology Section of the American Political Science Association, 21(3), 267–297. https://doi.org/10.1093/pan/mps028

Hardeniya, T., & Borikar, D. A. (2016). Dictionary based approach to sentiment analysis-a review. International Journal of Advanced Engineering, 2(5).

Jain, S. (2021). A systematic study on sentiment analysis based on text mining and Deep learning for predictions in Stock Market trends through social and news media data. International Journal for Research in Applied Science and Engineering Technology, 9(10), 1589–1593. https://doi.org/10.22214/ijraset.2021.38662

Keshavarz, H., & Abadeh, M. S. (2017). ALGA: Adaptive lexicon learning using genetic algorithm for sentiment analysis of microblogs. Knowledge-Based Systems, 122, 1–16. https://doi.org/10.1016/j.knosys.2017.01.028

Khoo, C. S. G., & Johnkhan, S. B. (2018). Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science, 44(4), 491–511. https://doi.org/10.1177/0165551517703514

Kumar, A., & Sebastian, T. M. (2012). Sentiment analysis on Twitter. International Journal of Computer Science Issues (IJCSI), 9(4), 372.

Kuznetsov, I., & Gurevych, I. (2018). From text to lexicon: Bridging the gap between word embeddings and lexical resources. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 233–244).

Li, L., Goh, T.-T., & Jin, D. (2020). How textual quality of online reviews affect classification performance: a case of deep learning sentiment analysis. Neural Computing & Applications, 32(9), 4387–4415. https://doi.org/10.1007/s00521-018-3865-7

Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool.

Martínez-Cámara, E., Martín-Valdivia, M. T., Ureña-López, L. A., & Montejo-Ráez, A. R. (2014). Sentiment analysis in Twitter. Natural Language Engineering, 20(1), 1–28. https://doi.org/10.1017/s1351324912000332

Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4), 1093–1113. https://doi.org/10.1016/j.asej.2014.04.011

Mejova, Y. (2009). Sentiment analysis: An overview. University of Iowa, Computer Science Department.

Mowlaei, M. E., Saniee Abadeh, M., & Keshavarz, H. (2020). Aspect-based sentiment analysis using adaptive aspect-based lexicons. Expert Systems with Applications, 148, 113234. https://doi.org/10.1016/j.eswa.2020.113234

Parvin, S. A., Sumathi, M., & Mohan, C. (2021). Challenges of sentiment analysis – A survey. 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI).

Prabowo, R., & Thelwall, M. (2009). Sentiment analysis: A combined approach. Journal of Informetrics, 3(2), 143–157. https://doi.org/10.1016/j.joi.2009.01.003

Rice, D. R., & Zorn, C. (2021). Corpus-based dictionaries for sentiment analysis of specialized vocabularies. Political Science Research and Methods, 9(1), 20–35. https://doi.org/10.1017/psrm.2019.10

Turner, Z., Labille, K., & Gauch, S. (2021). Lexicon-based sentiment analysis for stock movement prediction. Journal of Construction Materials, 2(3). https://doi.org/10.36756/jcm.v2.3.5

van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Communication Methods and Measures, 15(2), 121–140. https://doi.org/10.1080/19312458.2020.1869198

Zhang, L., Wang, S., & Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews. Data Mining and Knowledge Discovery, 8(4), e1253. https://doi.org/10.1002/widm.1253

