Exploring Similarity Patterns in a Large Scientific Corpus
Linnaeus University, Faculty of Technology, Department of computer science and media technology (CM). (ISOVIS) ORCID iD: 0000-0001-6150-0787
Blekinge Institute of Technology, Sweden. ORCID iD: 0000-0001-6745-4398
Linköping University, Sweden. ORCID iD: 0000-0002-1907-7820
Linnaeus University, Faculty of Technology, Department of computer science and media technology (CM). Linköping University, Sweden. (Lnuc DISA; ISOVIS) ORCID iD: 0000-0002-0519-2537
2025 (English) In: PLOS ONE, E-ISSN 1932-6203, Vol. 20, no 4, article id e0321114. Article in journal (Refereed). Published
Abstract [en]

Similarity-based analysis is a common and intuitive tool for exploring large data sets. For instance, grouping data items by their level of similarity, regarding one or several chosen aspects, can reveal patterns and relations from the intrinsic structure of the data and thus provide important insights in the sense-making process. Existing analytical methods (such as clustering and dimensionality reduction) tend to target questions such as "Which objects are similar?" Since they are not necessarily well suited to answering questions such as "How does the result change if we change the similarity criteria?" or "How are the items linked together by the similarity relations?", they do not unlock the full potential of similarity-based analysis, and here we see a gap to fill. In this paper, we propose that the concept of similarity can be regarded as both (1) a relation between items and (2) a property in its own right, with a specific distribution over the data set. Based on this approach, we developed an embedding-based computational pipeline together with a prototype visual analytics tool that allows the user to perform similarity-based exploration of a large set of scientific publications. To demonstrate the potential of our method, we present two different use cases, and we also discuss the strengths and limitations of our approach.
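
The two views of similarity named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy three-dimensional vectors, the item names, the thresholds, and all function names below are hypothetical, and cosine similarity is assumed as the measure purely for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Similarity as a relation: one score per unordered pair of items."""
    return {
        (i, j): cosine_similarity(embeddings[i], embeddings[j])
        for i, j in combinations(sorted(embeddings), 2)
    }


def links_above(similarities, threshold):
    """Pairs that count as linked under a given similarity criterion."""
    return sorted(pair for pair, s in similarities.items() if s >= threshold)


def similarity_histogram(similarities, bins=5):
    """Similarity as a property: score counts in equal-width bins over [-1, 1]."""
    counts = [0] * bins
    for s in similarities.values():
        index = int((s + 1.0) / 2.0 * bins)
        counts[max(0, min(index, bins - 1))] += 1
    return counts


# Hypothetical toy embeddings; real text embeddings have hundreds of dimensions.
toy_embeddings = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
    "paper_d": [0.0, 0.9, 0.1],
}

sims = pairwise_similarities(toy_embeddings)
strict_links = links_above(sims, 0.95)  # a strict criterion keeps few links
loose_links = links_above(sims, 0.05)   # a loose criterion adds weaker links
histogram = similarity_histogram(sims)  # how the scores are distributed
```

Comparing `strict_links` with `loose_links` shows how the set of links changes with the similarity criterion, while `histogram` summarises the scores as a distribution over the whole toy data set.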

Place, publisher, year, edition, pages
Public Library of Science (PLoS), 2025. Vol. 20, no 4, article id e0321114
Keywords [en]
Visual Text Analytics, Text Mining, Text Embedding, Network Embedding, Similarity Calculations
National Category
Computer Sciences; Human Computer Interaction
Research subject
Computer Science, Information and software visualization
Identifiers
URN: urn:nbn:se:lnu:diva-137304; DOI: 10.1371/journal.pone.0321114; ISI: 001488705600008; Scopus ID: 2-s2.0-105003254126; OAI: oai:DiVA.org:lnu-137304; DiVA, id: diva2:1946258
Funder
ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
Note

This work was partially supported through the ELLIIT environment for strategic research in Sweden. The work of Ilir Jusufi was supported in part by the Knowledge Foundation, Sweden, through the project ”Rekryteringar 21, Universitetslektor i spelteknik” under Contract 20210077.

Available from: 2025-03-20. Created: 2025-03-20. Last updated: 2025-05-28. Bibliographically approved
In thesis
1. Using Multiple Embeddings for Visually Guided Text Similarity Analysis
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Making sense of large sets of data is a general and important challenge that arises in many research fields and real-world scenarios. Therefore, many different computational methods for data mining and analysis have been developed, some of which are specific to certain data types and some of which are more general. Such methods often seek to reveal the intrinsic structure of relations between the data items in order to provide important insights beyond the individual data values. This can be done in many different ways, but interestingly several of the most prominent methods (such as clustering and dimensionality reduction) are based on similarity/closeness calculations. The concept of similarity may at first glance seem both intuitive and simple, but it poses several challenges conceptually, visually and computationally due to its inherently subjective nature.

Given the prevalence of similarity-based analysis methods within visual analytics (VA), we argue that there is a need for a better understanding of the potential and limitations of such methods, not only in their own specific contexts but also at a more general level. With this in mind, we have identified a current research gap regarding the need for a comprehensive approach to evaluating, comparing and combining different models within the context of similarity calculations. In this thesis, we seek to fill this gap through a series of publications around the common thread of developing a coherent VA framework for similarity-based analysis of large textual data sets. Although we have founded our work on embedding-based similarity calculations on textual data, many of the general ideas and implications are generalizable to other computational approaches and data types as well.

Our work covers several important aspects of the problem area, each of which is needed in order to construct a comprehensive methodology framework. As a foundation for our work, and for positioning our contribution in the context of the current research frontier, we provide a comprehensive survey of the use of embeddings within VA applications. For a solid conceptual understanding of similarity, we provide an analysis of its inherently subjective nature and the challenges this entails. Computationally, we develop several new methods for evaluating, comparing and combining different models. As a direct result of this, we also uncover a surprisingly high level of model disagreement, even though only state-of-the-art models are used. Visually, we provide several new prototype VA tools aimed at including the analyst in the loop and promoting trust and deep understanding. All in all, our work provides several new and important insights into a previously underresearched problem area.
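
One simple way to put a number on disagreement between models is to compare the nearest-neighbour sets they produce for the same query item. The sketch below is an illustration under stated assumptions, not one of the thesis's methods: the Jaccard overlap measure, the function names, and the similarity scores for two unnamed models are all hypothetical.

```python
def top_k_neighbours(similarity_row, k):
    """Return the k items with the highest similarity scores (ties broken by name)."""
    ranked = sorted(similarity_row, key=lambda item: (-similarity_row[item], item))
    return set(ranked[:k])


def neighbour_agreement(row_a, row_b, k):
    """Jaccard overlap of two models' top-k neighbour sets (1.0 means identical sets)."""
    top_a = top_k_neighbours(row_a, k)
    top_b = top_k_neighbours(row_b, k)
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical similarity scores from one query document to four others,
# as they might be produced by two different embedding models.
model_a_scores = {"doc_1": 0.91, "doc_2": 0.85, "doc_3": 0.40, "doc_4": 0.10}
model_b_scores = {"doc_1": 0.88, "doc_2": 0.30, "doc_3": 0.79, "doc_4": 0.15}

agreement = neighbour_agreement(model_a_scores, model_b_scores, k=2)
```

Here the two models share only one of their two nearest neighbours, so `agreement` is well below 1.0; averaging such a score over many query items would give a rough corpus-level indication of disagreement.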

Place, publisher, year, edition, pages
Linnaeus University Press, 2025
Keywords
Embeddings, Similarity Calculations, Visual Analytics, Text Mining
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:lnu:diva-138916 (URN); 10.15626/LUD.571.2025 (DOI); 9789180822985 (ISBN); 978-91-8082-299-2 (ISBN)
Public defence
2025-06-12, Newton, hus C, Växjö, 09:30 (English)
Opponent
Available from: 2025-06-02. Created: 2025-05-28. Last updated: 2025-06-02. Bibliographically approved

Open Access in DiVA

fulltext (6193 kB), 9 downloads
File information
File name: FULLTEXT01.pdf; File size: 6193 kB; Checksum: SHA-512
5dd277c1e3523fc82c1a6cf7ef950fb9ddb04ac4cdca4911ca72e819b6d0a0960bfe8949a406bb8a7cbbe68e1cba44d0c4a080186f102d8836232c89483c0ff0
Type: fulltext; Mimetype: application/pdf

Other links

Publisher's full text; Scopus

Authority records

Witschard, Daniel; Kucher, Kostiantyn; Kerren, Andreas

Search in DiVA

By author/editor
Witschard, Daniel; Jusufi, Ilir; Kucher, Kostiantyn; Kerren, Andreas