PDF Parsing, Unveiling the Most Efficient Method
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Abstract [en]
This report investigates the challenges and methods associated with parsing PDF documents for use with Large Language Models (LLMs). The variability and complexity of PDF formats make accurate data extraction and interpretation difficult. We evaluate several parsing techniques, including rule-based, deep learning-based, and multimodal methods, to determine their effectiveness in handling diverse PDF content. Our study reveals that multimodal methods generally outperform the others, particularly in managing mixed content formats, while a deep learning approach is more effective for PDFs containing complex images. This research offers insights into optimizing PDF parsing strategies and balancing accuracy against cost-efficiency, thereby advancing the utility of LLMs in document processing and contributing to improved data management practices in various industries.
Place, publisher, year, edition, pages
2024, p. 63
Keywords [en]
PDF parsing, Large Language Models, data extraction, Multimodal method, deep learning, document processing
National Category
Computer Engineering
Identifiers
URN: urn:nbn:se:lnu:diva-133931
OAI: oai:DiVA.org:lnu-133931
DiVA, id: diva2:1920553
External cooperation
tietoevry
Subject / course
Computer Engineering
Educational program
Software Engineering Programme, 180 credits
Supervisors
Examiners
2025-03-13, 2024-12-11, 2025-03-13. Bibliographically approved