2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]
Making sense of large data sets is a general and important challenge that arises in many research fields and real-world scenarios. Consequently, many computational methods for data mining and analysis have been developed, some of which are specific to certain data types and some of which are more general. Such methods often seek to reveal the intrinsic structure of relations between the data items in order to provide important insights beyond the individual data values. This can be done in many different ways, but interestingly, several of the most prominent methods (such as clustering and dimensionality reduction) are based on similarity/closeness calculations. The concept of similarity may at first glance seem both intuitive and simple, but it poses several conceptual, visual and computational challenges due to its inherently subjective nature.
Given the prevalence of similarity-based analysis methods within visual analytics (VA), we argue that there is a need for a better understanding of the potential and limitations of such methods, not only in their own specific contexts but also on a more general level. With this in mind, we have identified a research gap: the lack of a comprehensive approach to evaluating, comparing and combining different models within the context of similarity calculations. In this thesis, we seek to fill this gap through a series of publications around the common thread of developing a coherent VA framework for similarity-based analysis of large textual data sets. Although we have founded our work on embedding-based similarity calculations on textual data, many of the general ideas and implications carry over to other computational approaches and data types as well.
Our work covers several important aspects of the problem area, each of which is needed in order to construct a comprehensive methodological framework. As a foundation for our work, and to position our contribution in the context of the current research frontier, we provide a comprehensive survey of the use of embeddings within VA applications. For a solid conceptual understanding of similarity, we provide an analysis of its inherently subjective nature and the challenges this entails. Computationally, we develop several new methods for evaluating, comparing and combining different models. As a direct result of this, we also uncover a surprisingly high level of model disagreement, even though only state-of-the-art models are used. Visually, we provide several new prototype VA tools aimed at including the analyst in the loop and promoting trust and deep understanding. All in all, our work provides several new and important insights into a previously under-researched problem area.
Place, publisher, year, edition, pages
Linnaeus University Press, 2025
Keywords
Embeddings, Similarity Calculations, Visual Analytics, Text Mining
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:lnu:diva-138916 (URN)
10.15626/LUD.571.2025 (DOI)
9789180822985 (ISBN)
978-91-8082-299-2 (ISBN)
Public defence
2025-06-12, Newton, hus C, Växjö, 09:30 (English)
Opponent
2025-06-02 2025-05-28 2025-06-02 Bibliographically approved