Project

Project type/Form of grant
Project grant
Title [sv]
Artificial Intelligence as a risk and opportunity for the authenticity of archives
Title [en]
Artificial Intelligence as a risk and opportunity for the authenticity of archives
Abstract [sv]
This project examines how artificial intelligence can future-proof our cultural heritage by managing archival material and providing better search in and access to it. The aim is to compare the requirements placed on archives of various kinds with the possibilities that artificial intelligence (AI) offers. The work uses open data from the Swedish National Archives (Riksarkivet) and the Swedish National Heritage Board (Riksantikvarieämbetet).
Abstract [en]
The purpose of this project is to conceptualise and guide the design of AI-based methods that embody and address archival professional imperatives, using open and available datasets from the Swedish National Archives (Riksarkivet) and the Swedish National Heritage Board (Riksantikvarieämbetet).
Publications (10 of 12)
Widegren, J. (2026). Automatic subject indexing of Sámi oral history interviews with an LLM and thesaurus. Paper presented at AI and Archives, Uppsala, Sweden, March 23-24, 2026.
2026 (English). In: AI and Archives, Uppsala, March 23-24, 2026. Conference paper, Oral presentation only (Other academic)
Abstract [en]

In this pilot study I used Whisper, the LLM Gemini by Google and the thesaurus developed by SAMLA and ISOF to automatically transcribe and index 14 interviews in Swedish mixed with Sámi from ISOF’s Sámi collections. The results were very accurate, although language mixing caused some difficulties. This raises both intriguing possibilities and concerns for digitized oral history collections.

Keywords
oral history, LLM, thesaurus, Sámi, tradition archives
National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-145825 (URN)
Conference
Presented at AI and Archives, Uppsala, Sweden, March 23-24, 2026.
Funder
Wallenberg AI, Autonomous Systems and Software Program – Humanity and Society (WASP-HS)
Available from: 2026-04-07 Created: 2026-04-07 Last updated: 2026-05-08 Bibliographically approved
Widegren, J. (2025). Arkiv + Sápmi + AI = ?. Paper presented at Svenska Arkivförbundets vårkonferens 2025, Karlskrona, Sweden, 7-8 May 2025.
2025 (Swedish). Conference paper, Oral presentation only (Other (popular science, discussion, etc.))
Keywords
artificial intelligence, archives, Sámi
National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-138420 (URN)
Conference
Svenska Arkivförbundets vårkonferens 2025, Karlskrona, Sweden, 7-8 May 2025
Funder
Wallenberg AI, Autonomous Systems and Software Program – Humanity and Society (WASP-HS)
Available from: 2025-05-08 Created: 2025-05-08 Last updated: 2025-05-12 Bibliographically approved
Widegren, J. (2025). Automatic subject indexing of oral history interviews with Whisper and Claude. Paper presented at Digital Dreams and Practices, Digital Humanities in Nordic and Baltic Countries 9th Conference, Tartu, Estonia, 5-7 March 2025.
2025 (English). Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

In the archival media trinity of text, image and sound, the latter presents particular challenges for users’ searching and browsing. While a user can manually or digitally browse through texts and images to locate items of interest, sound needs to be played more or less in real time to do the same. Audio files can naturally be described and transcribed – digital or digitized audio files of speech automatically so – thus facilitating full-text search. However, the familiar Achilles’ heel of full-text search, namely the ambiguity of natural language, remains.

Enter the hallmark trade of librarianship: subject indexing. When subject index terms have been assigned to individual sections of an audio file, a user searching for these or similar terms can locate precisely where in the retrieved audio files the subject is discussed. With state-of-the-art AI systems, even this can be done automatically, thereby decimating the amount of time needed to index audio files, from real time to as fast as the system can process them. This heralds a brave new future for the accessibility and searchability of oral history archives.

This poster presents a pilot study on automatically transcribing interviews in Swedish from oral history archives using OpenAI’s Whisper, describing the content and assigning subject index terms to sections using Claude from Anthropic, and visualizing the results. The accuracy of the results depends on many factors, including sound quality, accents of the speakers, the amount of language mixing etc. The results are very promising, however, suggesting automatic subject indexing of interviews to be a worthwhile research direction going forward.

Keywords
artificial intelligence, archives, oral history, indexing
National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-138419 (URN)
Conference
Digital Dreams and Practices, Digital Humanities in Nordic and Baltic Countries 9th Conference, Tartu, Estonia, 5-7 March 2025
Funder
Wallenberg AI, Autonomous Systems and Software Program – Humanity and Society (WASP-HS)
Available from: 2025-05-08 Created: 2025-05-08 Last updated: 2025-05-08 Bibliographically approved
Widegren, J. (2025). Automatic subject indexing of Sámi oral history interviews with an LLM and thesaurus. Paper presented at The 29th International Conference on Theory and Practice of Digital Libraries (TPDL), Tampere, Finland, 23-26 September, 2025.
2025 (English). In: 29th International Conference on Theory and Practice of Digital Libraries, Tampere, 23-26 September, 2025. Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

1 Aims

This pilot project investigates the possibility of using the free-to-use large language model (LLM) Gemini 2.5 Pro by Google with a thesaurus jointly developed by the Norwegian SAMLA project and the Swedish Institute for Language and Folklore [1] for automatic subject indexing of oral history interviews. Considering the difficulty of finding sections of interest in interviews from oral history databases, subject indexing for audio files has been implemented in many institutions [3]. These indexes are very time-consuming to construct manually, suggesting a need for automatic or semi-automatic approaches that implement controlled vocabularies to facilitate information retrieval. This study gauges the feasibility of a fully automated workflow combining automatic transcription, thematic segmentation and subject indexing with modern AI tools.

2 Background

It is notoriously difficult to find segments of interest in an un-indexed oral history recording, given that one needs to listen through the whole interview in more-or-less real time to determine its content. Transcripts are naturally immensely helpful in this regard, but time-consuming to create [3]. To overcome the issues presented by searching for topics of interest in natural language documents, transcripts can be combined with subject indices that make use of controlled vocabulary terms [2].

While modern speech-to-text tools have been used to generate transcripts for oral history interviews, the opportunity of using LLMs to generate the subject indices has not, to the author’s knowledge, been leveraged to date.

3 Methods

A Google Colab notebook was designed to process a collection of 14 digitized recordings in Swedish mixed with South Sámi from the Swedish Institute for Language and Folklore. The notebook transcribes the interview using a version of OpenAI’s Whisper finetuned for Swedish (KB-Whisper), whereafter the transcript is passed to Google’s free-to-use Gemini API with a prompt containing instructions for subject indexing. The LLM is asked to perform the following tasks:

1. Split the interview into thematic segments
2. Specify the start time of each segment
3. Mark segments which are unclear or seem to be in a language other than Swedish
4. Give each segment a brief title
5. Summarize each segment
6. Describe each segment with free keywords
7. Subject index each segment using terms from the controlled vocabulary

The output is formatted as a CSV file, which can be opened in e.g. Google Sheets. Due to the lack of a gold standard, the accuracy of the LLM output for each individual task and the assigned index terms, as well as the overall usefulness of the pipeline, was analyzed manually by the author.
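
As an illustration only, the per-segment output described above could be represented and written to CSV roughly as follows. This is a minimal sketch, not the notebook used in the study: the `Segment` fields mirror tasks 2 to 7, but the field names, the column order, the sample vocabulary terms and the sample segment are all assumptions invented for the example, and no transcription or LLM call is shown.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical stand-in for the controlled vocabulary; the actual
# thesaurus terms are not listed in the abstract.
VOCABULARY = {"reindeer husbandry", "handicraft", "migration"}


@dataclass
class Segment:
    start: str           # task 2: start time of the segment
    unclear: bool        # task 3: unclear or not in Swedish
    title: str           # task 4: brief title
    summary: str         # task 5: summary
    keywords: list       # task 6: free keywords
    subject_terms: list  # task 7: controlled vocabulary terms


def terms_outside_vocabulary(segment, vocabulary):
    """Return assigned subject terms that are not in the vocabulary."""
    return [t for t in segment.subject_terms if t not in vocabulary]


def segments_to_csv(segments):
    """Serialize segments to CSV text, one row per thematic segment."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(
        ["start", "unclear", "title", "summary", "keywords", "subject_terms"]
    )
    for s in segments:
        writer.writerow([
            s.start,
            "yes" if s.unclear else "no",
            s.title,
            s.summary,
            "; ".join(s.keywords),
            "; ".join(s.subject_terms),
        ])
    return buffer.getvalue()


# Invented sample segment, for illustration only.
example = Segment(
    start="00:03:15",
    unclear=False,
    title="Moving the herd",
    summary="The speaker describes seasonal herd movements.",
    keywords=["seasons", "herding"],
    subject_terms=["reindeer husbandry", "fishing"],
)
csv_text = segments_to_csv([example])
unknown = terms_outside_vocabulary(example, VOCABULARY)
```

A check like `terms_outside_vocabulary` is one way to catch index terms that an LLM produces but that do not exist in the thesaurus; whether the study applied such a check is not stated in the abstract.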

4 Main findings

For the 14 interviews processed in this pilot study, Gemini 2.5 Pro divided them into thematic segments very successfully. The start time for each segment was off only by a few seconds at most. Out of a total of 41 interview segments in Sámi language, 14 were correctly marked by Gemini as [unclear/other language], whereas 27 were not. The titles given to segments were descriptive and accurate, except for misclassified segments in Sámi. Summarizations were very good, except where subject expertise was needed. For all the segments in the 14 interviews, only one of the free keywords assigned was probably inaccurate. The subject terms assigned by Gemini were only analyzed for their relevance to the topic of the interview segment; possible alternative subject terms were not identified by the author due to time constraints. Excluding the misclassified segments which were in fact in Sámi language, the proportion of relevant subject terms was 97%; including the segments in Sámi the proportion was 89%.
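
The Sámi-language detection counts above can be restated as a proportion. The short sketch below uses only the counts reported in the abstract (41 Sámi segments, 14 flagged, 27 not flagged); the variable names are illustrative.

```python
# Counts reported in the abstract.
sami_segments = 41
flagged = 14
missed = 27

# The two groups should account for every Sámi segment.
assert flagged + missed == sami_segments

# Share of Sámi segments correctly marked as [unclear/other language].
flag_rate = flagged / sami_segments
print(f"{flag_rate:.1%}")  # prints 34.1%
```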

5 Relevance to workshop themes

This contribution connects to topic 10 of the workshop: Use of KOS in Artificial Intelligence (AI) applications - methods and practical experience.

References

[1] Hans-Jakob Ågotnes. “Norske folkeminnesamlingar tilgjengelege i eit felles digitalt arkiv: SAMLA”. In: Heimen 61.3 (Sept. 30, 2024), pp. 257–261. issn: 0017-9841, 1894-3195. doi: 10.18261/heimen.61.3.7. url: https://www.scup.com/doi/10.18261/heimen.61.3.7 (visited on 06/11/2025).

[2] Koraljka Golub. “Automated subject classification of textual web documents”. In: Journal of Documentation 62.3 (May 1, 2006), pp. 350–371. issn: 0022-0418. doi: 10.1108/00220410610666501. url: https://www.emerald.com/insight/content/doi/10.1108/00220410610666501/full/html (visited on 06/11/2025).

[3] Douglas Lambert. “Oral History Indexing”. In: The Oral History Review 50.2 (2023), pp. 169–192. doi: https://doi.org/10.1080/00940798.2023.2235000.

National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-142251 (URN)
Conference
The 29th International Conference on Theory and Practice of Digital Libraries (TPDL), Tampere, Finland, 23-26 September, 2025
Funder
Wallenberg AI, Autonomous Systems and Software Program – Humanity and Society (WASP-HS)
Available from: 2025-10-30 Created: 2025-10-30 Last updated: 2026-05-08 Bibliographically approved
von Bychelberg, L. & Widegren, J. (2024). A qualitative survey of archivist and technologist perspectives on the use of AI in archives. Paper presented at DHNB 2024: Digital Humanities in the Nordic and Baltic Countries 8th Conference, Reykjavik, Iceland, May 27-31, 2024.
2024 (English). In: DHNB 2024: Digital Humanities in the Nordic and Baltic Countries 8th Conference, Reykjavik, Iceland, May 27-31, 2024. Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

 The opportunities offered by artificial intelligence (AI) and machine learning (ML) for the archive sector have been addressed in several scholarly articles in recent years. Some of the articles describe projects (e.g. Carter et al., 2022; Han et al., 2022) featuring a collaboration between archivists, generally defined for this paper as persons working in the archival sector, and technologists, defined as persons with a professional technical background. Some other archival projects have been undertaken by computer scientists or digital humanities scholars without the involvement of archivists (e.g. Luthra et al., 2022). Providing a qualitative angle, some articles have looked at digital archivists’ opinions on the implementations of AI techniques in archives (e.g. Cushing & Osti, 2023). Finally, more general articles have been published, both by archivists and technologists on how AI may be implemented in the archival sector (e.g. Colavizza et al., 2022; Hutchinson, 2020; Sabharwal, 2017). 

This paper presents a qualitative analysis of how the perspectives of archivists and technologists on AI in archives are presented in a selection of recent articles. The selection includes both articles describing case studies, such as project descriptions and interview studies, as well as literature reviews. The articles are categorized according to the type of project described or suggested: articles written from an archival perspective, articles written from a technologist perspective, and articles that describe or propose joint projects run by archivists and technologists in cooperation. Differences can be observed in these research articles regarding 1) how archival expertise is valued, 2) the proposed importance of archival theory for successful AI implementation and 3) the degree of influence ascribed to the archivists in collaborations.

The results indicate that viewpoints clearly differ depending on the professional role and background of the contributors to the articles. Articles written from a technologist perspective are more likely to criticize archivist work, and in some cases even blame them for obstructing access and “perpetuating silences” in archives (Luthra et al., 2022). Randby & Marciano (2020) describe the goals of AIC (Advanced Information Collaboratory), a project in which Randby is involved, as aiming towards information professionals learning “to think computationally and rapidly adapt new technologies”; the article goes through a computational workflow without further mentioning archivists on a larger scale. 

Articles written by archivists, on the other hand, emphasize the importance of incorporating archival principles and taking advantage of archivists’ knowledge in the AI implementation process (Hutchinson, 2020). Cushing & Osti (2023) highlight the expertise of archivists and stress how their professional background is a requisite to control AI decision-making. An important distinction is also that articles published in archivist journals such as Murphy et al.’s (2015) highlight the perspective of archivists by portraying them as subjects (“archivists”) which is contrasted with a passive portrayal of the technological aspects (“technology” instead of “technologists” or “machine learning experts”). Other articles point out the challenges of using AI in archives; for example, Jaillant & Caputo (2022) mention “ethical challenges” and problems with bias in AI. Cushing & Osti (2023) describe how participants in their study (archival experts) are confident about the possibilities of AI, but also skeptical about practical integration into their work. Lee (2018) argues that certain AI tools are not supportive of the “holistic view” that archivists have of their work.

Those articles that present a mutual collaboration between archivists and technologists highlight the importance of combining the expertise of both fields for successful AI implementation (e.g. Murphy et al., 2015; Carter et al., 2022; Han et al., 2022). Poole & Garwood (2018), an information scientist and computer and informatics researcher respectively, call for the involvement of archivists and librarians in Digital Humanities projects. Tsabedze (2023) offers an important perspective on archival professionals in Eswatini, highlighting the need for digital education; Tsabedze argues that the interview participants in Eswatini are afraid to lose their jobs without proper training. Nevertheless, Marciano et al. (2018) are positive about the disciplines achieving more together than each would have on their own. As a contrasting addition, Jo & Gebru (2020) argue not for implementing AI technologies in archives but for implementing archival expertise in AI development. They propose that archival document collection practices can inform data collection in sociocultural ML, because archivists possess the language and procedures to address issues of consent, transparency, inclusivity etc.

Finally, Marciano et al. (2018) believe that archival expertise of the future will involve knowledge of digital systems. Sabharwal (2017) suggests that in the future, the distinction between archivists and technologists will not be as clear anymore. Jaillant & Caputo (2022) also highly encourage collaboration between archivists and technologists in order to address future challenges. Cushing & Osti (2023) suggest that AI technology in archives can even change how archivists describe “digital archival expertise”. In conclusion, our findings suggest that there is a great diversity of opinions; as Poole & Garwood (2018) suggest, more research is necessary on how the collaboration between professions, as well as implementation of AI in archives, can be improved.

Keywords
artificial intelligence, archives
National Category
Information Studies
Research subject
Humanities, Library and Information Science
Identifiers
urn:nbn:se:lnu:diva-130054 (URN)
Conference
DHNB 2024: Digital Humanities in the Nordic and Baltic Countries 8th Conference, Reykjavik, Iceland, May 27-31, 2024
Funder
Marianne and Marcus Wallenberg Foundation; Marcus and Amalia Wallenberg Foundation
Available from: 2024-06-07 Created: 2024-06-07 Last updated: 2024-06-24 Bibliographically approved
Widegren, J. (2024). AI for improving access to archives pertaining to the Sámi: An overview of current approaches and future possibilities. Paper presented at DHNB 2024: Digital Humanities in the Nordic and Baltic Countries 8th Conference, Reykjavik, Iceland, May 27-31, 2024.
2024 (English). In: DHNB 2024: Digital Humanities in the Nordic and Baltic Countries 8th Conference, Reykjavik, Iceland, May 27-31, 2024. Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Facilitating access to archives via metadata creation and enrichment can be a monumental task for large archival collections. The past decade has witnessed an increasing use of artificial intelligence (AI) and machine learning (ML) to assist in these tasks in automatic or semi-automatic workflows (Colavizza et al., 2022). While technologies such as named entity recognition and topic modeling are useful in many different archival contexts, they have found special relevance for colonial archives and archives pertaining to underrepresented communities. Recent projects have for example explored the possibilities of using AI and ML to optimize information discovery in under‑utilized, Holocaust‑related records (Carter et al., 2022), extract mentions of underrepresented people in Dutch colonial records (Luthra et al., 2023), and transform Indigenous and Spanish colonial archives originating from Mexico into Linked Open Data repositories (Candela et al., 2023).

This paper presents, firstly, an overview of current state-of-the-art approaches for AI in archives, and secondly, a project in progress intended to align with the first three goals of the ongoing InterPARES Trust AI project (Duranti et al., 2021): to identify specific AI technologies that can address critical records and archives challenges; determine the benefits and risks of using AI technologies on records and archives; and ensure that archival concepts and principles inform the development of responsible AI. The opportunities offered by these technologies are contrasted with the risks of using automated approaches in general and AI in particular for improving access to archives. The paper also discusses a related approach for increasing discoverability in collections, i.e. semantic search, and compares the pros and cons of these approaches. Furthermore, the potential uses of generative pre-trained transformers (GPTs) for both indexing and retrieval, and the risks associated with these, are addressed.

Bringing the overview to a Swedish context, the paper describes an ongoing project aiming to explore the risks and possibilities of using AI and ML to provide up-to-date, enriched metadata for Swedish archives pertaining to the Sámi, an Indigenous population of the Nordic countries. Managing Indigenous heritage material demands distinct sensitivities that acknowledge the colonial past and the voice of the community in creating and maintaining the cultural record. While AI technologies can be a means for promoting cultural heritage from a Sámi perspective, using them without properly addressing colonial aspects runs the risk of reifying and perpetuating the colonial dynamics of past history writing. When properly applied, however, the hope is that these technologies may be of assistance in the remediation of the digital cultural record to counter the colonial dynamics of the analog cultural record (see Risam, 2019).

The project aims to gauge the potential of these technologies for improving the searchability and usability of records pertaining to the Sámi, of both colonial and Indigenous origin. The research is intended to follow an iterative approach, with continuous evaluation and feedback from experts and end users ensuring the suitability of the metadata generated by the technologies and its usefulness in facilitating search. The expected outcome of the project as a whole is a framework for assessing how to safely and effectively implement selected AI techniques in archival and related institutions while maintaining authenticity.

Keywords
artificial intelligence, machine learning, archives, Sámi, Indigenous archives, participatory research
National Category
Information Studies
Research subject
Humanities, Library and Information Science
Identifiers
urn:nbn:se:lnu:diva-129846 (URN)
Conference
DHNB 2024: Digital Humanities in the Nordic and Baltic Countries 8th Conference, Reykjavik, Iceland, May 27-31, 2024
Funder
Wallenberg Foundations
Available from: 2024-06-03 Created: 2024-06-03 Last updated: 2024-06-24 Bibliographically approved
Widegren, J. (2024). AI-powered participatory approaches for improved information discoverability in Sámi archival collections.
2024 (English). Other (Other academic)
Keywords
dissertation plan, seminar, artificial intelligence, archives, indigenous knowledge organization, some
National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-128568 (URN)
Funder
Wallenberg Foundations
Note

iSchools PhD Seminar Series April 3 2024

Available from: 2024-04-04 Created: 2024-04-04 Last updated: 2024-05-16 Bibliographically approved
Widegren, J. (2024). Aktuell forskning om AI och arkiv: Hur förhåller vi oss till utmaningar och skapar möjligheter?. Paper presented at Framtidens arkiv och informationsförvaltning, Stockholm, 6 February 2024.
2024 (Swedish). Conference paper, Oral presentation with published abstract (Other academic)
National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-128151 (URN)
Conference
Framtidens arkiv och informationsförvaltning, Stockholm, 6 February 2024
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2024-03-06 Created: 2024-03-06 Last updated: 2024-03-18 Bibliographically approved
Widegren, J. (2024). Embracing Critical Curiosity: Navigating the Societal Challenges of AI with WASP-HS.
2024 (English). Other (Other (popular science, discussion, etc.))
National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-128160 (URN)
Note

WASP-HS

Blog Posts

Available from: 2024-03-07 Created: 2024-03-07 Last updated: 2024-03-18 Bibliographically approved
Widegren, J. (2024). Hur kan vi förbättra tillgången till samiska arkiv med hjälp av AI?. Paper presented at Bokmässan, Göteborg, Sweden, 29 September 2024.
2024 (Swedish). Conference paper, Oral presentation only (Other (popular science, discussion, etc.))
National Category
Information Studies
Identifiers
urn:nbn:se:lnu:diva-132922 (URN)
Conference
Bokmässan, Göteborg, Sweden, 29 September 2024
Funder
Wallenberg Foundations
Available from: 2024-10-08 Created: 2024-10-08 Last updated: 2024-11-08 Bibliographically approved
Principal Investigator: Golub, Koraljka
Principal Investigator: Foka, Anna
Co-Investigator: Widegren, Johannes
Co-Investigator: Kamal, Ahmad M.
Co-Investigator: Milrad, Marcelo
Co-Investigator: Taalas, Saara L.
Co-Investigator: Huvila, Isto
Co-Investigator: von Bychelberg, Larissa
Coordinating organisation
Linnaeus University, Faculty of Arts and Humanities, Department of Cultural Sciences
Funder
Period
2023-08-15 - 2028-08-14
National Category
Information Studies
Identifiers
DiVA, id: project:8246