In this presentation, I discuss the implications of using text mining, and in particular topic models, as a methodological tool for interpretive research. The latter term refers to a loosely defined set of approaches in the social sciences whose main epistemic goal is generating understanding about social phenomena, and that typically employs methods labeled as qualitative.
Text mining is a term used to describe a set of statistical algorithms for the analysis of unstructured data (Aggarwal and Zhai 2012). A commonly used approach in text mining is topic modeling (Blei 2012), in which Bayesian methods - e.g., Latent Dirichlet Allocation (LDA) - are used to fit a topic model to a dataset of unstructured text. A topic model estimates, for each document in the dataset, a probability distribution over a set of topics. The estimation of the model parameters rests on two model assumptions: a) every text document is characterized by a distribution of topics, and b) each topic is characterized by a distribution of words. By placing Dirichlet priors on these topic and word distributions, LDA uses the observed words in a document to estimate the probability that each word in the document belongs to a specific topic, and that the document as a whole is more or less focused on particular topics. LDA is an unsupervised method, meaning that it fits a topic model without being instructed with the vocabularies of preselected theoretical topics.
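The two generative assumptions can be made concrete with a minimal pure-Python sketch, not taken from the source: a document gets its own topic distribution (drawn from a Dirichlet prior, here via normalized Gamma draws), and each word is emitted by a topic sampled from that distribution. The topic-word table and hyperparameters below are illustrative.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories
    by normalizing independent Gamma(alpha, 1) draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def generate_document(n_words, topic_word, alpha, rng):
    """Generate one document under the LDA assumptions:
    (a) the document has its own topic distribution theta;
    (b) each word is drawn from the word distribution of a sampled topic."""
    k = len(topic_word)
    theta = sample_dirichlet(alpha, k, rng)       # assumption (a)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(k), weights=theta)[0]             # pick a topic
        w = rng.choices(range(len(topic_word[z])),
                        weights=topic_word[z])[0]               # assumption (b)
        words.append((z, w))
    return theta, words

rng = random.Random(0)
# Two illustrative topics over a 4-word vocabulary (each row sums to 1).
topic_word = [[0.7, 0.2, 0.05, 0.05],
              [0.05, 0.05, 0.2, 0.7]]
theta, doc = generate_document(20, topic_word, alpha=0.5, rng=rng)
```

Fitting LDA is the inverse of this procedure: only the words are observed, and the topic assignments and distributions are inferred.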
It is intuitive to conceptualize LDA as providing an automated interpretation of the text data, and, possibly based on this intuition, topic modeling methods have recently been considered a potentially useful tool for qualitative interpretive analysis (Janasik, Honkela, and Bruun 2008; Rose and Lennerholt 2017; Wiedemann 2016; Yu, Jannasch-Pennell, and DiGangi 2011). One recurring theme in this literature is how LDA and similar instruments can make qualitative methods more valid, more reliable, or more objective. Pääkkönen and Ylikoski (2021) have discussed the relationship between interpretive methods and text mining, focusing on “how unsupervised machine learning methods might make hermeneutic interpretive text analysis more objective in the social sciences” (2021, p. 1461). In this article, they claim that topic modeling can extend the evidential base and improve the objectivity of interpretive research. These conclusions are independent of the attitude researchers take concerning what topic models represent: topic realism or topic instrumentalism. According to the former, topic models are measurements of the real meaning structures underlying a text. According to the latter, the models generated using LDA are only statistical patterns that have no evidential relationship with claims about the meaning of the text documents. These are researchers’ attitudes towards topic models, not philosophical positions. In my discussion, I focus on topic instrumentalism and discuss the following theses:
First, I criticize the claim that topic modeling only provides an overview or a data reduction of the analyzed documents. This reduced data model can be related – according to topic instrumentalism – to a theory about the meaning of the documents only after it is interpreted by the researcher. To show that this attitude is problematic, I reconstruct the learning algorithm typically used in LDA – Gibbs sampling – and show that it rests on two main assumptions. The first is that each document is mainly about a specific topic, and the second is that each term (typically, a word) is mainly connected to a specific topic. These two assumptions are important because they make some document types – for example, texts that are no longer than a paragraph – more appropriate for the application of LDA than others. Moreover, these two assumptions entail that a topic model is not only a data reduction or overview but an interpretation of the text. Even if topics here are only latent variables and not theoretical concepts, the choice of these two assumptions is informed by a theoretical claim concerning how texts convey meaning. As I argue, this entails that topic models’ generative algorithms cannot be detached from interpretive reasoning (and thus from evidential reasoning), contrary to what topic instrumentalism seems to entail.
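A minimal collapsed Gibbs sampler makes the two assumptions visible in code. This is an illustrative sketch, not the source's own implementation; the small values of alpha and beta are the concrete expression of the assumptions that each document concentrates on few topics and each topic on few words. The toy corpus and hyperparameters are hypothetical.

```python
import random

def gibbs_lda(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word ids.
    Small alpha pushes each document toward few topics (assumption 1);
    small beta pushes each topic toward few words (assumption 2)."""
    rng = random.Random(seed)
    vocab = 1 + max(w for d in docs for w in d)
    n_dk = [[0] * k for _ in docs]           # topic counts per document
    n_kw = [[0] * vocab for _ in range(k)]   # word counts per topic
    n_k = [0] * k                            # total words per topic
    z = []                                   # current topic of each token
    for d, doc in enumerate(docs):           # random initialization
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                  # remove the token's assignment
                n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1
                # Resample its topic from the conditional distribution.
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + vocab * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
    return n_dk, n_kw

# Toy corpus: words 0-2 co-occur in two documents, words 3-5 in the others.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 2], [3, 4, 5, 3], [4, 5, 3, 4, 5]]
n_dk, n_kw = gibbs_lda(docs, k=2)
```

The resampling step is where the interpretive commitment enters: a token is more likely to be assigned to a topic already dominant in its document and already associated with its word, which is precisely the claim about how texts convey meaning discussed above.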
Secondly, and related to the previous issue, the assumptions underlying Gibbs sampling entail that topic models have crucial inferential features, which make it impossible to detach topic models from the broader interpretive process, as topic instrumentalism seems to suggest. As I argue, the Gibbs sampling algorithm in LDA is not qualitatively distinct from the procedures humans seem to use when solving text classification problems (Lee and Corlett 2003). What seems to distinguish the heuristics people use to assign topics to a text is their reliance on thresholds and on the tallying of textual features, which make text interpretation cognitively affordable. However, a simplified Gibbs sampling heuristic could plausibly describe human text interpretation, which entails that the distinction is only a matter of level of applicability. The real distinction between LDA and human interpretation seems instead to be that humans use their knowledge of the text’s context to simplify the interpretation process. Unsupervised topic modeling algorithms lack this ability but can still be conceptualized as incorporating an interpretive process. In simple terms, topic models mimic the process of human interpretation when background information about the text’s context is very scarce.
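The threshold-and-tally idea can be sketched as follows. This is a hypothetical illustration in the spirit of sequential sampling models, not an implementation from Lee and Corlett (2003): cue words vote for a topic, and reading stops as soon as one topic leads by a fixed threshold. The cue lexicon, function name, and threshold are all invented for the example.

```python
def classify_by_tally(words, cue_topic, threshold=3):
    """Hypothetical threshold-and-tally heuristic: read words one at a
    time, tally a point for the topic each known cue word favors, and
    stop as soon as one topic leads the others by `threshold` points."""
    tally = {}
    for w in words:
        t = cue_topic.get(w)            # known cue words vote for a topic
        if t is None:
            continue                    # unknown words carry no evidence
        tally[t] = tally.get(t, 0) + 1
        leader = max(tally, key=tally.get)
        runner_up = max((v for k, v in tally.items() if k != leader),
                        default=0)
        if tally[leader] - runner_up >= threshold:
            return leader               # early decision: stop reading
    return max(tally, key=tally.get) if tally else None

# Illustrative cue lexicon (hypothetical, not from the source).
cues = {"ball": "sports", "goal": "sports", "team": "sports",
        "vote": "politics", "law": "politics", "senate": "politics"}
text = "the team scored a goal and the ball went in".split()
```

Unlike the Gibbs sampler, this heuristic commits to a decision before exhausting the evidence, which is what makes it cognitively affordable; the contextual knowledge humans additionally bring is what such a lexicon only crudely approximates.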
Therefore, I conclude that the application of topic modeling to interpretive research should be considered an integral part of the interpretive process rather than a simple data reduction technique.
References

Aggarwal, Charu C., and ChengXiang Zhai. 2012. Mining Text Data. Springer Science & Business Media.
Blei, David M. 2012. “Topic Modeling and Digital Humanities.” Journal of Digital Humanities 2 (1): 8–11.
Janasik, Nina, Timo Honkela, and Henrik Bruun. 2008. “Text Mining in Qualitative Research: Application of an Unsupervised Learning Method.” Organizational Research Methods, April. https://doi.org/10.1177/1094428108317202.
Lee, Michael D., and Elissa Y. Corlett. 2003. “Sequential Sampling Models of Human Text Classification.” Cognitive Science 27 (2): 159–93. https://doi.org/10.1207/s15516709cog2702_2.
Pääkkönen, Juho, and Petri Ylikoski. 2021. “Humanistic Interpretation and Machine Learning.” Synthese 199 (1): 1461–97. https://doi.org/10.1007/s11229-020-02806-w.
Rose, Jeremy, and Christian Lennerholt. 2017. “Low Cost Text Mining as a Strategy for Qualitative Researchers.” Electronic Journal of Business Research Methods 15 (1): 2–16.
Wiedemann, Gregor. 2016. Text Mining for Qualitative Data Analysis in the Social Sciences: A Study on Democratic Discourse in Germany. Kritische Studien Zur Demokratie. VS Verlag für Sozialwissenschaften. https://doi.org/10.1007/978-3-658-15309-0.
Yu, C., Angel Jannasch-Pennell, and S. DiGangi. 2011. “Compatibility between Text Mining and Qualitative Research in the Perspectives of Grounded Theory, Content Analysis, and Reliability.”
Keywords: Topic modeling, Latent Dirichlet Allocation, Qualitative Interpretive Methods, Interpretation
11th Conference of the European Network for the Philosophy of the Social Sciences (ENPOSS), organised by UNED and Universidad de Málaga. Universidad de Málaga, Facultad de Filosofía y Letras, September 21-23, 2022.