Invoice Data Extraction – PDF to Excel, JSON and DB
A full-stack application for extracting structured data from invoices using OpenAI’s GPT models. It is built with a Node.js backend and a React frontend, and features multi-language support and automated database storage.
DISCLAIMER:
This item uses third-party AI services (such as OpenAI) which are not included in the purchase price.
Buyers are responsible for providing their own API keys and covering any usage costs charged by these services.
No AI credits, subscriptions, or usage fees are included with this item.
Features
- AI-Powered Extraction: Uses OpenAI GPT-4o-mini for accurate invoice data extraction with structured JSON output
- Multi-Format Support: Processes text, PDF (using pdf-parse), and image files (using Tesseract.js OCR)
- Database Automation: Automatically stores extracted data in MongoDB with Mongoose ODM
- Multi-Language Support: Built-in internationalization with English and Spanish translations
- RESTful API: Clean API endpoints for invoice management with proper error handling
- Data Validation: Comprehensive validation service with fallbacks and data cleaning
- File Upload: Multi-file upload support (up to 5 files) with drag-and-drop interface
- Export Functionality: Export extracted data to Excel format
- Unit Testing: Comprehensive test coverage with Jest and Supertest
- Modern UI: React-based frontend with responsive design and Tailwind CSS
- Text Preprocessing: Intelligent text preprocessing to handle OCR artifacts and formatting issues
Tech Stack
Backend
- Node.js with Express.js
- MongoDB with Mongoose ODM
- OpenAI API for data extraction
- JWT for authentication (optional)
- Jest & Supertest for testing
Frontend
- React with modern hooks
- Axios for API communication
- i18next for internationalization
- React Router for navigation
- Testing Library for component testing
- Tailwind CSS for styling
- XLSX for Excel export functionality
Architecture Overview
Backend Architecture
Data Flow
- File Upload: User uploads invoice files (PDF, image, text)
- Text Extraction: Files are processed using pdf-parse or Tesseract.js OCR
- AI Processing: Extracted text is sent to OpenAI with structured prompts
- Data Validation: AI response is validated and cleaned
- Database Storage: Structured data is saved to MongoDB
- Frontend Display: Data is displayed in a responsive table with export options
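The backend part of the flow above can be sketched as one async function that chains the stages. This is a minimal illustration under stated assumptions, not the project's actual code: processInvoice and the stages object (extractText, extractFields, validate, save) are hypothetical names, and the stubs in the usage example stand in for pdf-parse or Tesseract.js, the OpenAI call, the validation service, and the Mongoose model.

```javascript
// Hypothetical orchestration of the backend stages; all names are illustrative.
async function processInvoice(file, stages) {
  // 1. Text extraction (pdf-parse, Tesseract.js OCR, or a plain UTF-8 read).
  const text = await stages.extractText(file);
  // 2. AI processing: send the text to OpenAI and get structured fields back.
  const rawFields = await stages.extractFields(text);
  // 3. Validation and cleaning of the AI response.
  const invoice = stages.validate(rawFields);
  // 4. Persistence (MongoDB in the real application).
  return stages.save(invoice);
}

// In-memory stubs so the sketch runs without any external service.
const stubStages = {
  extractText: async (file) => file.content,
  extractFields: async (text) => ({ invoiceNumber: text.match(/INV-\d+/)?.[0] ?? null }),
  validate: (fields) => ({ ...fields, status: "processed" }),
  save: async (invoice) => ({ id: 1, ...invoice }),
};

processInvoice({ name: "sample.txt", content: "Invoice INV-001" }, stubStages)
  .then((saved) => console.log(saved));
```

Passing the stages in as an object keeps each step replaceable, which is also convenient when unit-testing the flow with Jest.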
Project Structure
invoice-extraction/
├── backend/ # Node.js backend
│ ├── models/ # Mongoose models
│ ├── routes/ # API routes
│ ├── services/ # Business logic services
│ ├── __tests__/ # Unit tests
│ ├── db.js # Database connection
│ └── index.js # Server entry point
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── i18n/ # Internationalization setup
│ │ └── __tests__/ # Component tests
├── prompts/ # OpenAI prompt templates
├── translations/ # Language files
├── docs/ # Documentation
├── .env # Environment variables
└── README.md # This file
Prerequisites
- Node.js (v16 or higher)
- MongoDB (local or cloud instance)
- OpenAI API key
Installation
Clone the repository
git clone <repository-url>
cd invoice-extraction
Install backend dependencies
cd backend
npm install
Install frontend dependencies
cd ../frontend
npm install
cd ..
Environment Setup
- Copy the .env file and update the values:
cp .env .env.local
- Update the following variables:
OPENAI_API_KEY: Your OpenAI API key
MONGO_URI: MongoDB connection string
PORT: Server port (default: 5000)
Start MongoDB
Make sure MongoDB is running on your system, or update MONGO_URI to point to a cloud instance.
Usage
Development
Start the backend server
cd backend
npm run dev
Start the frontend
cd frontend
npm start
Access the application
Production
Build the frontend
cd frontend
npm run build
Start the backend
cd backend
npm start
API Documentation
Invoice Endpoints
Upload Invoice
POST /api/invoices/upload
Content-Type: multipart/form-data
Form Data:
- invoice: File (text, PDF, or image)
Response:
{
  "message": "Invoice processed successfully",
  "invoice": {
    "vendor": "Vendor Name",
    "invoiceNumber": "INV-001",
    "date": "2023-01-01T00:00:00.000Z",
    "totalAmount": 100.50,
    "currency": "USD",
    "items": [...],
    "status": "processed"
  }
}
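A client could call this endpoint as sketched below. The example assumes Node.js 18 or later (for the global fetch, FormData, and Blob) and a backend listening on http://localhost:5000, the default PORT in this README. The helper names buildUploadForm and uploadInvoice are hypothetical; only the invoice field name and the endpoint path come from the documentation above.

```javascript
// Build the multipart body; "invoice" is the form field name documented above.
function buildUploadForm(bytes, filename, mimeType) {
  const form = new FormData();
  form.append("invoice", new Blob([bytes], { type: mimeType }), filename);
  return form;
}

// POST the form; fetch sets the multipart/form-data boundary header itself.
async function uploadInvoice(baseUrl, form) {
  const response = await fetch(`${baseUrl}/api/invoices/upload`, {
    method: "POST",
    body: form,
  });
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
  return response.json(); // expected shape: { message, invoice }
}

// Example call (requires a running backend):
// const form = buildUploadForm("Invoice INV-001 ...", "invoice.txt", "text/plain");
// uploadInvoice("http://localhost:5000", form).then((body) => console.log(body.invoice));
```

Leaving the Content-Type header unset is deliberate: setting it by hand would omit the multipart boundary and the server could not parse the body.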
Get All Invoices
GET /api/invoices
Get Invoice by ID
GET /api/invoices/:id
Delete Invoice
DELETE /api/invoices/:id
OpenAI Integration
The system uses OpenAI’s GPT-4o-mini model with structured JSON output to ensure consistent data extraction. The AI is prompted with:
- System Prompt: Defines the AI’s role as an invoice data extraction expert
- User Prompt: Provides the extracted text and specifies the exact JSON format required
- JSON Schema: Enforces structured output with validation rules
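The three prompt parts listed above could be combined into a Chat Completions request body roughly as follows. This is a sketch under assumptions: the prompt wording, the buildExtractionRequest helper, and the reduced invoiceSchema are illustrative, not the project's actual prompts or schema (those live in prompts/).

```javascript
// Illustrative subset of fields; the project's real schema is likely larger.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    invoiceNumber: { type: "string" },
    date: { type: "string", description: "ISO 8601 date" },
    totalAmount: { type: "number" },
    currency: { type: "string", description: "ISO 4217 code, e.g. USD" },
  },
  required: ["vendor", "invoiceNumber", "date", "totalAmount", "currency"],
  additionalProperties: false,
};

// Build the body for a Chat Completions request with structured output.
function buildExtractionRequest(invoiceText) {
  return {
    model: "gpt-4o-mini",
    messages: [
      // System prompt: defines the model's role (wording is an assumption).
      { role: "system", content: "You are an invoice data extraction expert." },
      // User prompt: carries the extracted text.
      { role: "user", content: `Extract the invoice fields as JSON from this text:\n\n${invoiceText}` },
    ],
    // JSON Schema: constrains the response to the structure above.
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", strict: true, schema: invoiceSchema },
    },
  };
}

// With the official openai package this body would typically be passed to
// client.chat.completions.create(...); the call is omitted here so the sketch
// runs without an API key.
const extractionRequest = buildExtractionRequest("Invoice INV-001 from Vendor Name, total 100.50 USD");
console.log(JSON.stringify(extractionRequest.response_format, null, 2));
```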
- PDF Files: Processed using the pdf-parse library to extract text content
- Image Files: OCR processing using Tesseract.js with optimized parameters
- Text Files: Direct UTF-8 text extraction
- Preprocessing: Text cleaning to handle OCR artifacts and formatting issues
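The preprocessing step can be illustrated with a small cleaning function. The specific rules below (normalizing line endings, replacing tabs and non-breaking spaces, collapsing repeated spaces and blank lines) are assumptions about what handling OCR artifacts might involve; preprocessText is a hypothetical name, not the project's actual service.

```javascript
// Clean raw OCR or PDF text before it is sent to the model.
function preprocessText(raw) {
  return raw
    .replace(/\r\n?/g, "\n")        // normalize Windows and old Mac line endings
    .replace(/[\u00A0\t]+/g, " ")   // tabs and non-breaking spaces become one space
    .replace(/[^\S\n]{2,}/g, " ")   // collapse runs of spaces within a line
    .split("\n")
    .map((line) => line.trim())
    // Keep at most one blank line in a row, and no leading blank line.
    .filter((line, i, lines) => line !== "" || (i > 0 && lines[i - 1] !== ""))
    .join("\n")
    .trim();
}

const rawText = "Invoice   No:\tINV-001\r\n\r\n\r\n  Total:\u00A0 100.50  ";
console.log(JSON.stringify(preprocessText(rawText)));
// prints "Invoice No: INV-001\n\nTotal: 100.50"
```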
Data Validation & Cleaning
- Schema Validation: Ensures all required fields are present and properly formatted
- Fallback Values: Provides sensible defaults for missing data
- Type Conversion: Validates dates, amounts, and other data types
- Duplicate Prevention: Uses invoice number as unique identifier for upsert operations
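A sketch of how these rules might look in code follows. The function names, the fallback values ("Unknown Vendor", "USD", 0), and the choice of fields are assumptions for illustration. The upsert helper only builds the arguments for a Mongoose-style findOneAndUpdate call, so the example runs without a database.

```javascript
// Convert strings such as "$1,234.50" to numbers, with a fallback for bad input.
// Assumes a dot as the decimal separator.
function toNumber(value, fallback = 0) {
  if (typeof value === "number") return Number.isFinite(value) ? value : fallback;
  const parsed = Number.parseFloat(String(value ?? "").replace(/[^0-9.-]/g, ""));
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Return an ISO 8601 timestamp, or null when the value is not a valid date.
function toIsoDate(value) {
  if (value === null || value === undefined || value === "") return null;
  const date = new Date(value);
  return Number.isNaN(date.getTime()) ? null : date.toISOString();
}

// Normalize a raw AI response; the default values here are assumptions.
function cleanInvoice(raw) {
  const input = raw && typeof raw === "object" ? raw : {};
  return {
    vendor: String(input.vendor ?? "").trim() || "Unknown Vendor",
    invoiceNumber: String(input.invoiceNumber ?? "").trim() || null,
    date: toIsoDate(input.date),
    totalAmount: toNumber(input.totalAmount),
    currency: String(input.currency ?? "").trim().toUpperCase() || "USD",
    items: Array.isArray(input.items) ? input.items : [],
  };
}

// Arguments for Model.findOneAndUpdate(filter, update, options), which inserts
// a new document or updates the existing one with the same invoice number.
function buildUpsert(invoice) {
  return {
    filter: { invoiceNumber: invoice.invoiceNumber },
    update: { $set: invoice },
    options: { upsert: true, new: true },
  };
}

const cleaned = cleanInvoice({
  vendor: "  Acme ",
  invoiceNumber: "INV-001",
  date: "2023-01-01",
  totalAmount: "$1,234.50",
  currency: "usd",
});
console.log(cleaned, buildUpsert(cleaned).filter);
```

An invoice whose number could not be extracted (null here) should probably be rejected or flagged before the upsert, since it cannot serve as a unique key.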
Supported Invoice Fields
- Vendor/Supplier information
- Invoice number and dates
- Financial amounts (total, subtotal, tax, discounts, shipping)
- Customer and shipping details
- Line items with descriptions, quantities, and pricing
- Payment terms and currency information
Testing
Backend Tests
cd backend
npm test
Frontend Tests
cd frontend
npm test
Multi-Language Support
The application supports multiple languages through JSON-based translations.
Adding a New Language
- Create a new translation file in the translations/ directory
- Update the i18n configuration in frontend/src/i18n/index.js
- Add language option in the UI
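As a rough illustration of the second step, the translation files can be wrapped into the resources shape that i18next expects. The translation keys, the French entry, and the buildI18nOptions helper are hypothetical; the actual contents of frontend/src/i18n/index.js may differ.

```javascript
// Hypothetical translation contents; real files would live in translations/.
const translations = {
  en: { uploadTitle: "Upload invoice" },
  es: { uploadTitle: "Subir factura" },
  fr: { uploadTitle: "Téléverser une facture" }, // the newly added language
};

// Wrap each language in i18next's default "translation" namespace.
function buildI18nOptions(files, defaultLanguage = "en") {
  const resources = {};
  for (const [code, messages] of Object.entries(files)) {
    resources[code] = { translation: messages };
  }
  return { resources, lng: defaultLanguage, fallbackLng: "en" };
}

// In the frontend these options would typically be passed to i18next's init();
// that call is omitted so the sketch runs without the package installed.
const i18nOptions = buildI18nOptions(translations);
console.log(Object.keys(i18nOptions.resources));
```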
Current Languages
- English (en)
- Spanish (es)
Configuration
All configuration is managed through environment variables in the .env file:
PORT: Server port
MONGO_URI: MongoDB connection string
OPENAI_API_KEY: OpenAI API key
JWT_SECRET: JWT secret for authentication
MAX_FILE_SIZE: Maximum file upload size
DEFAULT_LANGUAGE: Default application language
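A backend might read these variables once at startup, along the lines of the sketch below. It assumes the values are already present in process.env (for example, loaded from .env by a package such as dotenv). The loadConfig name, the decision to treat OPENAI_API_KEY and MONGO_URI as required, and the defaults for MAX_FILE_SIZE and DEFAULT_LANGUAGE are assumptions; only the PORT default of 5000 is stated in this README.

```javascript
// Read settings from an env-like object, failing fast on missing required values.
function loadConfig(env = process.env) {
  for (const name of ["OPENAI_API_KEY", "MONGO_URI"]) {
    if (!env[name]) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
  }
  return {
    port: Number.parseInt(env.PORT ?? "5000", 10),
    mongoUri: env.MONGO_URI,
    openaiApiKey: env.OPENAI_API_KEY,
    jwtSecret: env.JWT_SECRET ?? null, // JWT authentication is optional
    maxFileSize: Number.parseInt(env.MAX_FILE_SIZE ?? "10485760", 10), // 10 MiB
    defaultLanguage: env.DEFAULT_LANGUAGE ?? "en",
  };
}

// Usage with an explicit object instead of the real environment.
const config = loadConfig({
  OPENAI_API_KEY: "your_openai_api_key_here",
  MONGO_URI: "mongodb://localhost:27017/invoice-extraction",
});
console.log(config.port, config.maxFileSize, config.defaultLanguage);
```

Accepting the environment as a parameter keeps the function easy to test without mutating process.env.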
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
Development Guidelines
Code Style
- Backend: Follow Node.js best practices with async/await patterns
- Frontend: Use React functional components with hooks
- Naming: Use camelCase for variables/functions, PascalCase for components
- Error Handling: Implement proper try-catch blocks and error responses
- Comments: Add JSDoc comments for functions and complex logic
Environment Variables
Create a .env file in the root directory with:
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
# Database Configuration
MONGO_URI=mongodb://localhost:27017/invoice-extraction
# Server Configuration
PORT=5000
JWT_SECRET=your_jwt_secret_here
# File Upload Configuration
MAX_FILE_SIZE=10485760
DEFAULT_LANGUAGE=en
Running Tests
# Backend tests
cd backend && npm test
# Frontend tests
cd frontend && npm test
Adding New Features
- Create a feature branch from main
- Implement the feature with proper error handling
- Add unit tests for new functionality
- Update documentation if needed
- Submit a pull request
Deployment
Backend Deployment
- Environment Setup: Configure production environment variables
- Database: Set up MongoDB instance (local or cloud)
- Build: Run npm install --production for dependencies
- Start: Use npm start or a process manager like PM2
Frontend Deployment
- Build: Run npm run build in the frontend directory
- Serve: Deploy the built files to a web server (nginx, Apache, etc.)
- API Configuration: Update API endpoints for production
Docker Deployment (Optional)
# Example Dockerfile for backend
FROM node:16-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 5000
CMD ["npm", "start"]
Production Considerations
- Security: Use HTTPS, validate inputs, and apply rate limiting
- Monitoring: Implement logging and error tracking
- Scalability: Consider load balancing for high traffic
- Backup: Regular database backups
- Updates: Keep dependencies updated and monitor for vulnerabilities
Troubleshooting
Common Issues
- OpenAI API Errors: Check API key and quota limits
- MongoDB Connection: Verify connection string and network access
- File Upload Issues: Check file size limits and supported formats
- OCR Problems: Ensure Tesseract.js is properly installed
Debug Mode
Set NODE_ENV=development for detailed error logging and debugging information.