PDF Parsing, Unveiling the Most Efficient Method
Linnaeus University, Faculty of Technology, Department of computer science and media technology (CM).
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
Abstract [en]

This report investigates the challenges and methods associated with parsing PDF documents for use with Large Language Models (LLMs). The variability and complexity of PDF formats pose significant challenges in ensuring accurate data extraction and interpretation. We evaluate several parsing techniques, including rule-based, deep learning-based, and multimodal methods, to determine their effectiveness in handling diverse PDF content. Our study reveals that multimodal methods generally outperform the others, particularly in managing mixed content formats, whereas a deep learning approach is more effective for PDFs containing complex images. This research offers insights into optimizing PDF parsing strategies and balancing accuracy against cost-efficiency, thereby advancing the utility of LLMs in document processing and contributing to improved data management practices in various industries.

Place, publisher, year, edition, pages
2024. p. 63
Keywords [en]
PDF parsing, Large Language Models, data extraction, Multimodal method, deep learning, document processing
National Category
Computer Engineering
Identifiers
URN: urn:nbn:se:lnu:diva-133931
OAI: oai:DiVA.org:lnu-133931
DiVA, id: diva2:1920553
External cooperation
tietoevry
Subject / course
Computer Engineering
Educational program
Software Engineering Programme, 180 credits
Supervisors
Examiners
Available from: 2025-03-13 Created: 2024-12-11 Last updated: 2025-03-13 Bibliographically approved

Open Access in DiVA

fulltext (1209 kB), 152 downloads
File information
File name: FULLTEXT01.pdf
File size: 1209 kB
Checksum (SHA-512): 141437893141720b8705acfc596be5f9bc1626fa56c7d4cfcfdda3bb120caa3014439534d408428048e305e22bd020aeb21932fe778833ae85946a95d520e108
Type: fulltext
Mimetype: application/pdf

Authors
Ingemarsson, Per; Daniel, Persson