Super fast processing different formats of PDFs using ChatGPT
Are you already automating with ChatGPT? We put it to the test. Can ChatGPT contribute to automatically extracting information from transport documents (PDFs) with different formats and translating it into a software-readable JSON format? And what was the answer? Yes, it can! Thanks to the technology behind ChatGPT, we delivered the Proof of Concept at an unprecedented pace, exceeding our expectations and those of our clients for whom we built the solution. The results were significantly more accurate than anticipated, without any training time and easily extendable for future purposes.
Read about the process in the blog below.
In this blog
Author of the blog:
Harmjan Oonk
The Challenge
In this blog post, we’re going to talk about how we solved a tricky problem. Imagine you have a bunch of PDF documents, but they’re not organized in a standard way. We needed to turn them into a format that computer programs can understand.
There were two main challenges:
- We had to create a structure for the documents where there wasn’t one already. This is tricky because computers can struggle with messy data.
- We also needed to translate terms and ideas from the documents. The problem was that each document might use different words for the same thing.
Components contributing to the solution
We quickly resolved this problem using innovative utilisation and a combination of several components that were already familiar and available to us:
We solved these challenges by using a combination of AWS services;
AWS Textract:
- A ready-made solution from Amazon Web Services for Optical Character Recognition (OCR) to extract text from documents.
Open AI GPT4:
- A breakthrough in artificial intelligence, GPT4 is the most advanced AI model to date, enabling applications like ChatGPT.
AWS S3, SNS, SQS, and Lambda:
- More traditional cloud components for storage, task distribution, and computing power that bring the solution together.
Image 1: the IT architecture
The Initial Solution Direction
Based on “Automate Document Processing in Logistics using AI,” the initial idea was straightforward: Upload documents to S3, detect them, and trigger a serverless Lambda function via Simple Notification Service (SNS) to initiate the processing. This function would first extract (potentially handwritten) text from the document using AWS Textract. Then, it would merge this into a text file given to AWS Comprehend. AWS Comprehend would identify the elements in the text file thanks to its understanding of human language. This includes loading and unloading dates, locations, carriers, products, etc. AWS Comprehend learns to recognise custom entities by training the system beforehand. This involves providing a training set to AWS Comprehend containing input and expected output.
However, the initial results were disappointing. Despite the possibility of better results with more data, we re-evaluated the direction of the chosen solution. It became apparent that with this approach, we were losing (a lot of) information between AWS Textract and AWS Comprehend. It seemed we were throwing all words (and, in a sense, sentences) together into a pile and giving them to AWS Comprehend. Is AWS Comprehend the suitable component for this solution?
New ChatGPT Solution Direction
It was time to find a more suitable tool for this step. As avid users of ChatGPT, we found it interesting to test ChatGPT, the paid version 4 variant. We gave ChatGPT4 a markdown table representation of the data (see the image alongside) and asked it to rewrite it into Open Trip Model JSON. This first attempt was without too many prompt engineering tricks, but the result was surprisingly good for such a quick test. Of course, there were still some challenges, and this was not an immediate solution without further drawbacks. But the core seemed to be working. So, we continued experimenting with the use of GPT4 in the solution.
To experiment more effectively, we switched from ChatGPT to the API for GPT4, which offers advanced options for configuration and control, making it better suited for applications. Recent research into better applying prompt engineering quickly led to consistent JSON output in our experiments. We moved away from the idea of generating OTM JSON immediately; that was resolved later. The result: the solution worked to convert this semi-structured data into JSON output. An automation step followed, a lambda function to translate the Textract result to markdown, a Lambda function to call the OpenAI API, and even an utter ChatGPT-built benchmark tool followed to assess the result better.
Image 2: Markdown table
The first real test with ChatGTP
The first set of PDFs were loaded, the expected result was manually checked and corrected for the first time, and we achieved an initial accuracy score above 80% (the success threshold agreed upon with our client). The markdown representation was improved, and the prompt was refined until we felt we had addressed most opportunities to improve the result. Despite still seeing room for improvement, we were delighted with this initial result: an accuracy score of 97.45%. It’s almost entirely flawless.
Looking back at the two challenges we foresaw, we assert that we have tackled both the unstructured data and the translation of different concepts and terms from diverse documents with this solution. We structured the unstructured data by converting the PDF documents into markdown format. By applying custom instructions per document type, we returned various document types to one clear format.
Reflecting on the initial challenges we anticipated, we confidently declare that we’ve successfully addressed both the disorganised data and the translation of diverse concepts and terms from various documents through this solution. We organised the unstructured data by converting PDF documents into markdown format. By applying custom instructions for each document type, we returned various document types to one unified format.
Image 3: GPT-4 97% accuracy score
Ongoing Challenges
Currently, we are still working to complete this cutting-edge solution and facing some challenges, with several ideas to approach them:
- The solution currently works reliably for documents up to 2 pages long. At the time of writing, GPT-4 Turbo was unavailable, but after the release of this new version of GPT-4, we are confident that even larger documents can be translated.
- While no training is involved, a small part of the solution is still document-specific, requiring an additional step in the process.
- The limitations imposed by OpenAI limits the overall throughput.
A comforting thought is that we expect this solution only to get better, faster, easier, and more extendable. When GPT-5 or a similar model becomes available, adjusting the model used is just a matter of adjusting.
One could say that this solution is too innovative.
Conclusion: Is ChatGPT valuable in the digitisation of business processes?
This case study demostrates the remarkable power of recent advancements in AI, particularly with ChatGPT. Coupled with readily available cloud-native components and an entirely new form of ‘low-code’ development, makes this a fascinating journey which is both intriguing and demanding. Unlike the conventional method that could take days to train a model effectively, our approach showed positive outcomes in just one working day. While more time is needed to turn this proof of concept into a high-quality solution for our client’s process, we are excited to continue this journey by applying our expertise further.
Do you have a challenging case for our Innovation team? We invite you to contact our innovation team to explore the boundaries of (AI) possibilities together.