Content extraction from PDF invoices on business document archives

Bandara RMCV

dc.contributor.advisor	Perera I
dc.contributor.author	Bandara RMCV
dc.date.accessioned	2020
dc.date.available	2020
dc.date.issued	2020
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/16780
dc.description.abstract	organization better control over their information processes. As a business expands, it produces more documents, which must be carefully handled and tracked to be put to good use. Output management systems that work with ERP systems contain thousands of business documents, and Portable Document Format (PDF) is the common output format for these documents. These systems need to execute document search operations frequently, so indexing PDF documents is a critical part of this context: it boosts search engine efficiency by reducing the search space. Content extraction from PDF documents goes a step further and allows more structured search queries. Extracting the content of a PDF file is therefore very important, but it is a challenging task because PDF is a layout-based format that defines the fonts and positions of individual characters rather than the semantic units of the text and their roles within the document. In this research I have developed a technique to extract content from a PDF file. It can be used to allow more structured search queries on large document archives in output management systems that typically work with world-leading ERP systems. The research mainly considered four aspects: correctly identifying words, word order within a paragraph, clear separation of paragraph boundaries, and the semantic role of each word. After the content is extracted from the PDF file, the extracted text is written to an XML document. The XML file contains tags that identify the pages, the rotation angle, and the number of images on each page. To evaluate the accuracy of this technique, content was extracted from a sample set of PDF invoices and the percentage of extracted words was calculated. According to the results, the tool achieves an accuracy rate of 94.27%.	en_US
dc.language.iso	en	en_US
dc.subject	COMPUTER SCIENCE AND ENGINEERING-Dissertations	en_US
dc.subject	COMPUTER SCIENCE-Dissertations	en_US
dc.subject	DATA PROCESSING, BUSINESS	en_US
dc.subject	BUSINESS COMMUNICATION-Portable Document Format	en_US
dc.subject	AUTOMATIC CONTENT EXTRACTION	en_US
dc.title	Content extraction from PDF invoices on business document archives	en_US
dc.type	Thesis-Abstract	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.degree	MSc in Computer Science	en_US
dc.identifier.department	Department of Computer Science & Engineering	en_US
dc.date.accept	2020
dc.identifier.accno	TH4255	en_US
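
The abstract describes two concrete steps: writing the extracted text to an XML document that records pages, rotation angle and image count, and scoring accuracy as the percentage of words extracted. The sketch below illustrates one way these steps might look. It is not the thesis implementation: the tag and attribute names (document, page, paragraph, number, rotation, images), the input dictionary layout, and both function names are assumptions made for illustration, and the word-matching rule (whitespace tokens compared as a multiset) is only one possible reading of "extracted word percentage".

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_xml(pages):
    """Serialize extracted pages to an XML string.

    Hypothetical schema: the abstract only says the XML has tags for
    pages, rotation angle and number of images; the names are assumed.
    """
    root = ET.Element("document")
    for number, page in enumerate(pages, start=1):
        page_el = ET.SubElement(
            root,
            "page",
            number=str(number),
            rotation=str(page["rotation"]),
            images=str(page["image_count"]),
        )
        # One element per paragraph keeps paragraph boundaries explicit.
        for paragraph in page["paragraphs"]:
            ET.SubElement(page_el, "paragraph").text = paragraph
    return ET.tostring(root, encoding="unicode")


def extracted_word_percentage(expected_text, extracted_text):
    """Percentage of expected words that also appear in the extraction.

    Assumed metric: words are whitespace-separated tokens and repeated
    words are counted as many times as they occur (multiset overlap).
    """
    expected = Counter(expected_text.split())
    extracted = Counter(extracted_text.split())
    if not expected:
        return 0.0
    matched = sum((expected & extracted).values())
    return 100.0 * matched / sum(expected.values())
```

For example, a single upright page with one image and two paragraphs could be passed as build_xml([{"rotation": 0, "image_count": 1, "paragraphs": ["Invoice No 1001", "Total 250.00"]}]), and the extraction could then be scored against a hand-checked reference text with extracted_word_percentage.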