Automated scanned receipt processing with optical character recognition and machine learning / Hor Zhang Neng

Bibliographic Details
Main Author: Hor, Zhang Neng
Format: Thesis
Published: 2022
Online Access: http://studentsrepo.um.edu.my/14443/1/Hor_Zhang_Neng.pdf
http://studentsrepo.um.edu.my/14443/2/Hor_Zhang_Neng.pdf
http://studentsrepo.um.edu.my/14443/
Description
Summary: Text detection and recognition for receipts are less studied than other popular optical character recognition (OCR) tasks, and work on post-OCR parsing of receipts is scarce, which opens an opportunity to explore extracting key information from receipts and classifying them. This dissertation explores how OCR and machine learning (ML) techniques can optimize and automate receipt handling for reimbursement purposes. Automating the reimbursement process keeps faulty expense reporting behaviour to a minimum and speeds up employee claims. The dataset prepared for this work consists of one hundred receipts of the kind commonly found in Malaysian employee expense reimbursement reports. The receipts are organized into six categories: meals, groceries, petrol, accommodation, telecommunication, and transportation fares. All receipts are of Malaysian origin and restricted to English text. This work neither parses handwriting on receipts nor addresses text ambiguity, and the text processing accuracy depends on the accuracy of the OCR tool selected. The dissertation pursues three objectives: developing an image processing framework to improve receipt quality before parsing; recognizing text and extracting key information from receipts using OCR; and evaluating ML classifiers for receipt classification after parsing. Overall text extraction is 90.72% accurate at the character level and 78.51% at the word level, with F1 scores (the harmonic mean of precision and recall) of 0.89 and 0.78, respectively. Overall accuracy for key information extraction is 74.33%, with an F1 score of 0.74. Seven ML classifiers were compared: Naive Bayes, maximum entropy, Support Vector Machine (SVM), linear Support Vector Classifier (SVC), k-nearest neighbours (KNN), decision tree, and random forest. Their overall accuracy ranges from 52% to 80%, with F1 scores between 0.55 and 0.79. The linear SVC achieves the highest accuracy and F1 score, which is attributed to its ability to find the best separating hyperplane dividing high-dimensional text data into classes.
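
The summary reports precision, recall, and F1 at both character and word level but does not say how they were computed. The following is a minimal sketch of one plausible scoring scheme, not the dissertation's actual procedure: it assumes a bag-of-tokens overlap between the OCR output and a ground-truth transcription. The function names (`f1_score`, `token_prf`, `word_level`, `char_level`) and the sample strings are hypothetical.

```python
from collections import Counter


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_prf(predicted, reference):
    """Return (precision, recall, F1) from the multiset overlap of two token lists.

    Assumption: token order is ignored; only token counts are compared.
    """
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall, f1_score(precision, recall)


def word_level(ocr_text, truth_text):
    """Score whitespace-separated words."""
    return token_prf(ocr_text.split(), truth_text.split())


def char_level(ocr_text, truth_text):
    """Score individual characters, ignoring whitespace."""
    return token_prf(list("".join(ocr_text.split())),
                     list("".join(truth_text.split())))


# Hypothetical receipt line: the OCR misreads the final "0" as the letter "O".
truth = "TOTAL RM 12.50"
ocr = "TOTAL RM 12.5O"
word_scores = word_level(ocr, truth)  # 2 of 3 words match
char_scores = char_level(ocr, truth)  # 11 of 12 characters match
```

A single misread character costs one whole word at word level but only one character at character level, which is consistent with the character-level figures in the summary being higher than the word-level ones.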