Beyond Regular Expressions: Using LLMs for Intelligent Receipt Processing
Mikko Harju

When it comes to extracting structured data from unstructured text, developers have traditionally reached for regular expressions, pattern matching, and rule-based parsers. But what happens when you're dealing with real-world documents that don't follow neat, predictable formats, or that vary just enough to be hard to capture with pre-defined rules?

We needed to build a system to automatically process taxi receipts from sources all over Finland. Each company has its own receipt format, with different layouts, varying languages, and inconsistent data placement. Traditional approaches would have required:

- Dozens of regular expressions for different formats
- Complex parsing logic for each receipt type
- Constant maintenance as formats change
- Brittle code that breaks with minor layout variations

This is exactly the kind of problem where we expected LLMs to be genuinely useful.

The LLM-powered solution

First we tried the multimodal capabilities of GPT-4o, but found its performance a bit lacking on our test set. So we instead handled the OCR separately and passed the parsed text to the LLM.

Our approach combines Google Cloud Vision API for optical character recognition (OCR) with OpenAI's GPT-4 for intelligent data extraction.

Image to text with Google Vision

The Vision API doesn't just extract text – it preserves spatial relationships, allowing us to reconstruct the logical flow of information as it appears on the receipt. We only need to combine the location information with the raw text to build a reading order the LLM can work with.
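
To make that concrete, here is a minimal sketch of the OCR step (not our production code; the helper name and the line-grouping tolerance are illustrative assumptions):

```python
from google.cloud import vision

def receipt_to_text(image_bytes: bytes) -> str:
    """OCR a receipt image and return its text in rough reading order."""
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=image_bytes))

    # Collect every word together with the y/x coordinate of its top-left corner.
    words = []
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(symbol.text for symbol in word.symbols)
                    top_left = word.bounding_box.vertices[0]
                    words.append((top_left.y, top_left.x, text))

    # Sort top-to-bottom, left-to-right, and group words with nearby vertical
    # positions into the same line. The 12 px tolerance is an assumption;
    # real receipts need tuning per image resolution.
    words.sort()
    lines, current, last_y = [], [], None
    for y, x, text in words:
        if last_y is not None and abs(y - last_y) > 12:
            lines.append(" ".join(current))
            current = []
        current.append(text)
        last_y = y
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)
```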

LLM-Powered Intelligent Extraction

Instead of writing complex parsing rules, we leverage GPT-4's natural language understanding (the prompt below is not the one we use in production; it is simplified for demonstration purposes):

```python
PROMPTS = {
    # The Finnish prompt roughly translates to:
    # "You act as an expert in interpreting taxi receipt data, and your task
    #  is to determine as many of the following as possible from the
    #  machine-read text you are given:
    #  - Name of the transport company
    #  - Business ID of the transport company
    #  - Vehicle registration number
    #  - Total amount of the trip in euros
    #  - Tax portion of the trip
    #  ..."
    "fi": """Toimit taksikuittitietojen tulkinnan asiantuntijana ja tehtäväsi on selvittää
    mahdollisimman monta sinulle annetusta koneluetusta tekstistä:
    - Kuljetusyhtiön nimi
    - Kuljetusyhtiön y-tunnus
    - Auton rekisteritunnus
    - Matkan kokonaissumma euroina
    - Veron osuus matkasta
    ...
    """
}
```

The prompt makes GPT-4 act as a data extraction expert, instructing the LLM to:

- Understand the context
- Identify specific data points
- Handle Finnish business terminology
- Return structured JSON output
- Gracefully handle missing information
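
As a hedged sketch of what the extraction call might look like with the OpenAI Python SDK (the model name, helper name, and message layout are illustrative assumptions, not our production setup):

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_receipt_fields(ocr_text: str, lang: str = "fi") -> dict:
    """Send the OCR'd receipt text to the model and parse its JSON reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any JSON-mode-capable model works
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": PROMPTS[lang]},
            {"role": "user", "content": ocr_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Note that JSON mode only guarantees syntactically valid JSON – validating the shape of the data is a separate step, which is where Pydantic comes in.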

Structured Data with Pydantic

We define a clear data model that ensures type safety and validation:

```python
from decimal import Decimal

import pydantic

class Receipt(pydantic.BaseModel):
    company_name: str | None = None
    company_vat_number: str | None = None
    ...
    distance: Decimal | None = None
```

This lets us ingest the LLM's output in a type-safe, validated form.
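
For example, a thin wrapper around validation (a sketch; the helper name is ours, not a library API) turns a malformed extraction into an explicit failure path instead of silently propagating bad data:

```python
import pydantic

def parse_receipt(raw: dict) -> Receipt | None:
    """Validate the LLM's output; reject anything that doesn't fit the model."""
    try:
        return Receipt.model_validate(raw)
    except pydantic.ValidationError:
        return None  # e.g. queue the receipt for manual review
```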

What did we learn?

By replacing traditional regex-based parsing with LLM-powered extraction, we've built a system that has (at least so far) proven more robust, maintainable, and adaptable than conventional approaches. The combination of OCR for text extraction and LLMs for intelligent parsing creates a powerful pipeline that handles real-world complexity with ease.
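
Tied together with the illustrative helpers sketched above, the whole pipeline fits in a few lines:

```python
# Sketch: OCR -> LLM extraction -> validated model.
with open("receipt.jpg", "rb") as f:
    text = receipt_to_text(f.read())

receipt = parse_receipt(extract_receipt_fields(text))
```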

The future of document processing isn't about writing better regular expressions – it's about leveraging AI that understands language, context, and meaning just like humans do.

Ready to implement LLM-powered document processing in your own projects? Gather up some test data to experiment with, build a clear data model, craft specific prompts, and let the AI handle the complexity of real-world text extraction.

Or, get in touch with us and we'll build the tool you want, together!

About the author

Mikko Harju

With deep expertise in software development and emerging technologies, our Technology Director Mikko shares practical insights and concrete examples from real-world projects. Passionate about scalable tech, AI and emerging trends.