1 Aims
This pilot project investigates the possibility of using the free-to-use large language
model (LLM) Gemini 2.5 Pro by Google with a thesaurus jointly developed
by the Norwegian SAMLA project and the Swedish Institute for Language and
Folklore [1] for automatic subject indexing of oral history interviews. Considering
the difficulty of finding sections of interest in interviews from oral history
databases, subject indexing for audio files has been implemented in many institutions
[3]. These indexes are very time-consuming to construct manually,
suggesting a need for automatic or semi-automatic approaches that implement
controlled vocabularies to facilitate information retrieval. This study gauges the
feasibility of a fully automated workflow that combines automatic transcription,
thematic segmentation and subject indexing with modern AI tools.
2 Background
It is notoriously difficult to find segments of interest in an un-indexed oral history
recording, given that one needs to listen through the whole interview in
more-or-less real time to determine its content. Transcripts are naturally immensely
helpful in this regard, but time-consuming to create [3]. To overcome the issues
presented by searching for topics of interest in natural language documents,
transcripts can be combined with subject indices that make use of controlled
vocabulary terms [2].
While modern speech-to-text tools have been used to generate transcripts for
oral history interviews, the opportunity of using LLMs to generate the subject
indices has not, to the author’s knowledge, been leveraged to date.
3 Methods
A Google Colab notebook was designed to process a collection of 14 digitized
recordings in Swedish mixed with South Sámi from the Swedish Institute for
Language and Folklore. The notebook transcribes each interview using a version
of OpenAI’s Whisper fine-tuned for Swedish (KB-Whisper), after which the
transcript is passed to Google’s free-to-use Gemini API together with a prompt
containing instructions for subject indexing. The LLM is asked to perform the
following tasks:
1. Split the interview into thematic segments
2. Specify the start time of each segment
3. Mark segments which are unclear or seem to be in a language other than
Swedish
4. Give each segment a brief title
5. Summarize each segment
6. Describe each segment with free keywords
7. Subject index each segment using terms from the controlled vocabulary
The output is formatted as a CSV file, which can be opened in e.g. Google
Sheets. Due to the lack of a gold standard, the accuracy of the LLM output
for each individual task and the assigned index terms, as well as the overall
usefulness of the pipeline, were assessed manually by the author.
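The seven tasks and the CSV output described above could be wired together roughly as follows. This is a minimal sketch, not the notebook's actual code: the column names, the `;` separator, the helper names `build_prompt` and `parse_segments`, and the example vocabulary terms are all assumptions made for illustration, and the call to the Gemini API itself is deliberately left out.

```python
import csv
import io

# Hypothetical column layout; the source does not specify the CSV header.
COLUMNS = ["start_time", "unclear_or_other_language", "title",
           "summary", "free_keywords", "subject_terms"]

TASKS = [
    "Split the interview into thematic segments.",
    "Specify the start time of each segment.",
    "Mark segments that are unclear or seem to be in a language other than Swedish.",
    "Give each segment a brief title.",
    "Summarize each segment.",
    "Describe each segment with free keywords.",
    "Index each segment using only terms from the controlled vocabulary.",
]


def build_prompt(transcript: str, vocabulary: list[str]) -> str:
    """Assemble one prompt containing the tasks, the vocabulary and the transcript."""
    numbered = "\n".join(f"{i}. {task}" for i, task in enumerate(TASKS, start=1))
    return (
        "You are indexing an oral history interview.\n"
        f"Tasks:\n{numbered}\n\n"
        f"Return CSV with the header: {','.join(COLUMNS)}\n"
        "Separate multiple keywords or terms with ';'.\n\n"
        "Controlled vocabulary:\n" + "\n".join(vocabulary) + "\n\n"
        "Transcript:\n" + transcript
    )


def parse_segments(csv_text: str, vocabulary: list[str]) -> list[dict]:
    """Parse the model's CSV reply and list subject terms missing from the vocabulary."""
    allowed = {term.casefold() for term in vocabulary}
    segments = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        terms = [t.strip() for t in row["subject_terms"].split(";") if t.strip()]
        row["subject_terms"] = terms
        # Terms the model invented or misspelled are kept aside for manual review.
        row["unknown_terms"] = [t for t in terms if t.casefold() not in allowed]
        segments.append(row)
    return segments


if __name__ == "__main__":
    vocab = ["Reindeer husbandry", "Migration"]  # illustrative terms only
    reply = (  # made-up reply standing in for the LLM output
        ",".join(COLUMNS) + "\n"
        "00:00:05,no,Childhood,Talks about childhood,family;school,Migration;Fishing\n"
    )
    print(build_prompt("[00:00:05] ...", vocab)[:60])
    print(parse_segments(reply, vocab)[0]["unknown_terms"])
```

Checking the returned subject terms against the vocabulary in code is one cheap safeguard: a term that is not in the thesaurus can be set aside for review before the manual relevance assessment begins.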
4 Main findings
Gemini 2.5 Pro divided the 14 interviews processed in this pilot study into
thematic segments very successfully. Segment start times were off by a few
seconds at most. Of the 41 interview segments in Sámi, 14 were correctly
marked by Gemini as [unclear/other language], whereas 27 were not. The titles
given to segments were descriptive and accurate, except for misclassified
segments in Sámi. Summaries were very good, except where subject expertise was
needed. Across all segments in the 14 interviews, only one of the assigned free
keywords was judged probably inaccurate.
The subject terms assigned by Gemini were only analyzed for their relevance
to the topic of the interview segment; possible alternative subject terms were
not identified by the author due to time constraints. Excluding the misclassified
segments that were in fact in Sámi, the proportion of relevant subject
terms was 97%; including the segments in Sámi, the proportion was 89%.
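The proportions reported in this section can be computed along the following lines. The sketch below is illustrative only: the Sámi segment counts (14 of 41) are taken from the text, whereas the `TermJudgement` records and the helper `relevance_share` are hypothetical, and the toy judgements are not the study's data, since the underlying term counts are not reported here.

```python
from dataclasses import dataclass


@dataclass
class TermJudgement:
    relevant: bool               # manual verdict on one assigned subject term
    in_misclassified_sami: bool  # term belongs to a Sámi segment the model did not flag


def relevance_share(judgements: list[TermJudgement], include_misclassified: bool) -> float:
    """Share of relevant subject terms, with or without the misclassified Sámi segments."""
    pool = [j for j in judgements
            if include_misclassified or not j.in_misclassified_sami]
    if not pool:
        raise ValueError("no judgements to evaluate")
    return sum(j.relevant for j in pool) / len(pool)


# Counts reported in the text: 41 Sámi segments, 14 flagged, 27 missed.
sami_total, sami_flagged, sami_missed = 41, 14, 27
flag_rate = sami_flagged / sami_total
print(f"Sámi segments flagged: {flag_rate:.1%}")  # 34.1%

# Toy judgements (NOT the study's data) showing the two ways of counting.
toy = [
    TermJudgement(relevant=True, in_misclassified_sami=False),
    TermJudgement(relevant=True, in_misclassified_sami=False),
    TermJudgement(relevant=True, in_misclassified_sami=False),
    TermJudgement(relevant=False, in_misclassified_sami=True),
]
print(relevance_share(toy, include_misclassified=False))  # 1.0
print(relevance_share(toy, include_misclassified=True))   # 0.75
```

Reporting both figures separates errors caused by undetected Sámi passages from errors in the indexing of correctly transcribed Swedish.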
5 Relevance to workshop themes
This contribution connects to topic 10 of the workshop: Use of KOS in Artificial
Intelligence (AI) applications - methods and practical experience.
References
[1] Hans-Jakob Ågotnes. “Norske folkeminnesamlingar tilgjengelege i eit felles
digitalt arkiv: SAMLA”. In: Heimen 61.3 (Sept. 30, 2024), pp. 257–261.
issn: 0017-9841, 1894-3195. doi: 10.18261/heimen.61.3.7.
url: https://www.scup.com/doi/10.18261/heimen.61.3.7 (visited on 06/11/2025).
[2] Koraljka Golub. “Automated subject classification of textual web documents”.
In: Journal of Documentation 62.3 (May 1, 2006), pp. 350–371.
issn: 0022-0418. doi: 10.1108/00220410610666501.
url: https://www.emerald.com/insight/content/doi/10.1108/00220410610666501/full/html
(visited on 06/11/2025).
[3] Douglas Lambert. “Oral History Indexing”. In: The Oral History Review
50.2 (2023), pp. 169–192. doi: 10.1080/00940798.2023.2235000.