Domain-Based Auto-Archiving System

University Project

August, 2020


Project Aim

The aim of this project was to develop a domain-based auto-archiving system for Arabic documents that automates data entry, improving both accuracy and efficiency. The system takes scanned documents, extracts their information, and stores it for easy access and reuse. It was built as a junior project at the Arab International University (AIU).

[Figure: Domain-Based Auto-Archiving System diagram]

My Role

I was responsible for the core logic of the Domain-Based Auto-Archiving System: image preprocessing with OpenCV, text extraction with Tesseract OCR, and improvement of the extracted results with Solr's fuzzy search. I worked closely with three other team members, who focused on frontend development, user interface design, and system integration. Together, we made sure the system automated data entry while keeping accuracy high on Arabic documents.

Description & Technologies

[Figure: Domain-Based Auto-Archiving System UI]

The Domain-Based Auto-Archiving System combines several technologies to manage Arabic documents. The key ones are:

  • OpenCV: Used for image preprocessing tasks such as binarization, skew detection, and noise removal.
  • Marvin Framework: Utilized for image segmentation to identify text regions.
  • Tesseract OCR: Employed for optical character recognition, converting images of text into machine-encoded text.
  • Solr: Implemented for result enhancement using fuzzy search, improving the accuracy of the data extraction process.
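
To make the binarization step more concrete, here is a minimal sketch in plain Java of Otsu's method, one common way to pick a global black/white threshold. It is an illustration only: the project used OpenCV for this step, the write-up does not say which thresholding method was chosen, and the names `OtsuBinarizer`, `otsuThreshold`, and `binarize` are hypothetical. Pixels are assumed to be 8-bit grayscale values (0 to 255) in a flat array.

```java
import java.util.Arrays;

final class OtsuBinarizer {

    /** Returns the Otsu threshold for 8-bit grayscale pixels (values 0..255). */
    static int otsuThreshold(int[] gray) {
        // Histogram of intensities.
        int[] hist = new int[256];
        for (int p : gray) hist[p]++;

        long total = gray.length;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (long) i * hist[i];

        long sumBg = 0;     // sum of intensities at or below the candidate threshold
        long weightBg = 0;  // number of pixels at or below the candidate threshold
        double bestVariance = -1;
        int best = 0;

        for (int t = 0; t < 256; t++) {
            weightBg += hist[t];
            sumBg += (long) t * hist[t];
            if (weightBg == 0) continue;       // no dark pixels yet
            long weightFg = total - weightBg;
            if (weightFg == 0) break;          // no bright pixels left

            double meanBg = (double) sumBg / weightBg;
            double meanFg = (double) (sumAll - sumBg) / weightFg;
            double diff = meanBg - meanFg;
            // Between-class variance (up to a constant factor); Otsu maximizes it.
            double between = (double) weightBg * weightFg * diff * diff;
            if (between > bestVariance) {
                bestVariance = between;
                best = t;
            }
        }
        return best;
    }

    /** Maps every pixel above the Otsu threshold to 255 and the rest to 0. */
    static int[] binarize(int[] gray) {
        int t = otsuThreshold(gray);
        int[] out = new int[gray.length];
        for (int i = 0; i < gray.length; i++) out[i] = gray[i] > t ? 255 : 0;
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample: three dark "ink" pixels and three bright "paper" pixels.
        int[] scan = {10, 12, 11, 200, 210, 205};
        System.out.println("threshold = " + otsuThreshold(scan));
        System.out.println(Arrays.toString(binarize(scan)));
    }
}
```

With OpenCV's Java bindings, the equivalent operation is normally a single call such as `Imgproc.threshold(src, dst, 0, 255, Imgproc.THRESH_BINARY | Imgproc.THRESH_OTSU)` rather than hand-written code.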

Outcome

[Figure: Domain-Based Auto-Archiving System UI]

The system automates the archiving process and handles variations in document layout through predefined templates. It reduces manual data entry errors and makes archived data easier to access and use. Combining OCR with fuzzy search improves the accuracy and efficiency of processing Arabic documents.
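
The idea behind using fuzzy search to improve OCR results can be sketched as follows. This is a plain Java illustration, not the project's Solr setup: it assumes a list of known valid values for a field and picks the one closest to the OCR output by Levenshtein edit distance. The names `FuzzyCorrector`, `editDistance`, and `closest`, and the sample city list, are hypothetical.

```java
import java.util.List;
import java.util.Optional;

final class FuzzyCorrector {

    /** Levenshtein distance: the minimum number of single-character edits between a and b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the known value nearest to the OCR text, if one lies within maxEdits. */
    static Optional<String> closest(String ocrText, List<String> knownValues, int maxEdits) {
        String best = null;
        int bestDistance = maxEdits + 1;
        for (String candidate : knownValues) {
            int d = editDistance(ocrText, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Hypothetical field values; "Darnascus" imitates an OCR confusion of "m" with "rn".
        List<String> cities = List.of("Damascus", "Aleppo", "Homs");
        System.out.println(closest("Darnascus", cities, 2));
        System.out.println(closest("xyz", cities, 2));
    }
}
```

In Solr itself, fuzzy matching is requested in the query rather than coded by hand: a term followed by a tilde and an optional edit distance, for example `Damascus~2`, matches indexed terms within that many edits.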

Key Aspects

[Figure: Domain-Based Auto-Archiving System UI]

  • Image Preprocessing: Utilizes OpenCV for binarization, skew correction, and noise removal to enhance image quality.
  • Template Definition: Allows for flexible template creation to handle different document layouts.
  • OCR Integration: Leverages Tesseract OCR for accurate text extraction from images.
  • Result Enhancement: Uses Solr for fuzzy search to improve the accuracy of extracted data.
  • User Management: Supports role-based access control for secure data management.
  • Data Integrity: Ensures consistency and reliability of archived data.
  • Accessibility: Provides a user-friendly interface for defining templates and managing archived documents.
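
As a rough illustration of template definition, the sketch below models a template as a set of named rectangular regions and cuts a scanned page into one sub-image per field, which could then be passed to an OCR step. The write-up does not describe how templates were actually stored, so every name here (`TemplateSketch`, `Region`, `Template`, `crop`, `split`) and the choice of pixel-coordinate rectangles are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class TemplateSketch {

    /** A rectangular area of the page, in pixel coordinates. */
    static final class Region {
        final int x, y, width, height;

        Region(int x, int y, int width, int height) {
            if (x < 0 || y < 0 || width <= 0 || height <= 0) {
                throw new IllegalArgumentException("invalid region");
            }
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** A named document layout: field name mapped to where that field sits on the page. */
    static final class Template {
        final String name;
        final Map<String, Region> fields = new LinkedHashMap<>();

        Template(String name) {
            this.name = name;
        }

        Template field(String fieldName, Region region) {
            fields.put(fieldName, region);
            return this;
        }
    }

    /** Copies one region out of a grayscale page stored as page[row][column]. */
    static int[][] crop(int[][] page, Region r) {
        if (r.y + r.height > page.length || r.x + r.width > page[0].length) {
            throw new IllegalArgumentException("region outside page");
        }
        int[][] out = new int[r.height][];
        for (int row = 0; row < r.height; row++) {
            out[row] = Arrays.copyOfRange(page[r.y + row], r.x, r.x + r.width);
        }
        return out;
    }

    /** Cuts the page into one sub-image per template field, in definition order. */
    static Map<String, int[][]> split(int[][] page, Template template) {
        Map<String, int[][]> parts = new LinkedHashMap<>();
        for (Map.Entry<String, Region> e : template.fields.entrySet()) {
            parts.put(e.getKey(), crop(page, e.getValue()));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical 4x6 "page" whose pixel value encodes its position (row * 10 + column).
        int[][] page = new int[4][6];
        for (int r = 0; r < 4; r++) {
            for (int c = 0; c < 6; c++) page[r][c] = r * 10 + c;
        }
        Template invoice = new Template("invoice")
                .field("number", new Region(1, 0, 2, 1))
                .field("date", new Region(3, 2, 3, 2));
        split(page, invoice).forEach(
                (field, pixels) -> System.out.println(field + " = " + Arrays.deepToString(pixels)));
    }
}
```

Keeping the layout as data rather than code is what lets a new document type be supported by defining another template instead of changing the extraction logic.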

Technologies Used

  • OpenCV: For image preprocessing tasks.
  • Marvin Framework: For image segmentation.
  • Tesseract OCR: For optical character recognition.
  • Solr: For result enhancement through fuzzy search.
  • Java: For backend development.

Final Thoughts

This project addresses the challenges of archiving Arabic documents with a reliable and efficient auto-archiving system. Combining image preprocessing, template-based layout handling, OCR, and fuzzy search keeps the extracted data accurate and usable, and the same approach could serve as a reference point for similar systems.