lnu.se Publications
Low-code web scraping and text analysis with Octoparse and KNIME: An example from the CICuW project
Linnaeus University, Faculty of Arts and Humanities, Department of Languages; Linnaeus University, Faculty of Arts and Humanities, Department of Cultural Sciences. ORCID iD: 0000-0002-0930-644X
Linnaeus University, Faculty of Arts and Humanities, Department of Cultural Sciences. ORCID iD: 0000-0001-9938-4785
Linnaeus University, Faculty of Arts and Humanities, Department of Cultural Sciences. ORCID iD: 0000-0002-6184-6603
2025 (English). In: Huminfra Handbook: Empowering digital and experimental humanities / [ed] Bouma, Gerlof; Dannélls, Dana; Kokkinakis, Dimitrios; Volodina, Elena. Tartu: University of Tartu, 2025, p. 505-540. Chapter in book (Refereed)
Abstract [en]

Low-code tools play an important role in making data analysis and visualization accessible to researchers and students with limited experience, or interest, in programming. While low-code tools do introduce closed-box issues, they can still be considered important stepping stones toward computational approaches. This chapter draws on two such tools, Octoparse and KNIME (Konstanz Information Miner), to present a workflow that runs from data collection from online sources, through text pre-processing, to text classification. The workflow is set in the context of the ongoing project Cultural Institutions and the Culture War (CICuW), which investigates the democratic implications of the pervasiveness of far-right digital discourse. The chapter introduces web scraping, topic modeling, and sentiment analysis in an accessible way, while also showcasing state-of-the-art approaches to the analysis components through the use of BERT (Bidirectional Encoder Representations from Transformers) models and zero-shot classification. It takes a critical perspective on the described methods by discussing how they contribute to creating methodological closed boxes and how quantitative techniques can be fruitfully combined with qualitative approaches.

Place, publisher, year, edition, pages
Tartu: University of Tartu, 2025. p. 505-540
Series
NEALT Proceedings Series, ISSN 1736-8197, E-ISSN 1736-6305 ; 59
Keywords [en]
web scraping, topic modelling, sentiment analysis, low code tools, digital humanities
National Category
Cultural Studies
Research subject
Humanities, Library and Information Science; Humanities, Linguistics
Identifiers
URN: urn:nbn:se:lnu:diva-142479
DOI: 10.58009/aere-perennius0184
ISBN: 9789153170778 (print)
ISBN: 9789908536125 (electronic)
OAI: oai:DiVA.org:lnu-142479
DiVA, id: diva2:2013970
Available from: 2025-11-14 Created: 2025-11-14 Last updated: 2026-01-19 Bibliographically approved

Open Access in DiVA

fulltext (2054 kB), 35 downloads
File information
File name: FULLTEXT01.pdf
File size: 2054 kB
Checksum SHA-512:
523e7a36cbf473556229ca8713861fbf1c2d9cd08adde6d5b7db778db9ac77801d14b0b0c9db57deb90df731f6619cf6bd401b4d07bd0bd931cbee29ce121750
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text: https://hdl.handle.net/10062/117354

Authority records

Ihrmark, Daniel; Carlsson, Hanna; Hanell, Fredrik

The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 819 hits