Challenges of combining structured and unstructured data in corpus development
University of Helsinki, Finland.
Linnaeus University, Faculty of Arts and Humanities, Department of Languages. (DISA-DH) ORCID iD: 0000-0001-5251-5338
2021 (English) In: Research in Corpus Linguistics (RiCL), ISSN 1064-4857, E-ISSN 2243-4712, Vol. 9, no 1, p. I-viii
Article in journal (Other academic) Published
Abstract [en]

Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video, audio as well as structured metadata poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.

Place, publisher, year, edition, pages
Spanish Association for Corpus Linguistics, 2021. Vol. 9, no 1, p. I-viii
Keywords [en]
corpora, corpus compiling, annotation
National Category
General Language Studies and Linguistics
Research subject
Humanities, Linguistics
Identifiers
URN: urn:nbn:se:lnu:diva-106122
DOI: 10.32714/ricl.09.01.01
Scopus ID: 2-s2.0-85150893885
OAI: oai:DiVA.org:lnu-106122
DiVA, id: diva2:1584034
Available from: 2021-08-10 Created: 2021-08-10 Last updated: 2023-05-11 Bibliographically approved


Authority records

Tyrkkö, Jukka
