k5n EDocIAS is an open source, PHP-based electronic document index and search system. It is not a document management system. With the proper 3rd party text extraction tools, you can easily search for text embedded in files of many formats (PDF, DOC, XLS, HTML, etc.). With Tesseract installed, any documents you have scanned (and saved as TIFF, JPEG or PNG) can also be searched.
- PHP 5 or later
- MySQL (other databases will likely work but have not been tested)
- 3rd party tools for extracting text (Tesseract, etc.) (See the README.txt file for a list of possible tools.)
It’s important to keep in mind not only what this tool was designed for but also what it is not.
- Open source: GPL v2
- Multiplatform (Mac, Windows, Linux): Achieved by using PHP and MySQL.
- Simple: This is not a document management system with file management, user access control, etc. This is intended for users who already have a bunch of documents of various formats and just need a way to quickly search through the files from any machine on their network.
- Lean: Since this app may not be used for days at a time (depending on how it’s being used), it should not require additional memory or CPU time while not in use. So, for example, there is no standalone Java server.
- Accessible from the network: Even though all the files are centrally located on a single machine, you can search for and download/view the documents using the simple web interface. This makes all the documents instantly accessible to the other machines on your network. And no mounting of network drives is required.
- Don’t reinvent the wheel: There are plenty of existing tools out there that will extract the plain text from binary data. So, there is no need to rewrite that functionality again. The document index process invokes 3rd party tools.
- Expandable: If you have files of a different type that you want to include, all you need to do is find or build a tool to extract the text you want to index. For example, find a tool to pull the metadata out of photos and index your pictures.
- Intranet-focused: Easily integrated into other PHP-based apps for use in an intranet. Custom header, trailer and CSS are easily configured.
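The "don't reinvent the wheel" and "expandable" goals above come down to a lookup from file type to an external command that prints plain text. The following is a minimal sketch of that idea, not EDocIAS's actual code: EDocIAS is written in PHP, and Python is used here only for illustration. The `EXTRACTORS` table, the `{path}` placeholder, and both function names are hypothetical, and the `pdftotext` and `tesseract` command lines are assumed examples of tools that write text to standard output (see README.txt for the tools the project actually suggests).

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to an extractor command line.
# "{path}" is replaced with the document path. The tool choices are
# assumptions for illustration; each must print plain text to stdout.
EXTRACTORS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".tif": ["tesseract", "{path}", "stdout"],
    ".tiff": ["tesseract", "{path}", "stdout"],
    ".png": ["tesseract", "{path}", "stdout"],
    ".jpg": ["tesseract", "{path}", "stdout"],
}


def build_command(path, extractors=EXTRACTORS):
    """Return the command list for this file, or None if no tool is registered."""
    template = extractors.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [arg.replace("{path}", str(path)) for arg in template]


def extract_text(path, extractors=EXTRACTORS):
    """Run the registered external tool and return its stdout as text."""
    command = build_command(path, extractors)
    if command is None:
        return None
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

With this shape, supporting a new file type means adding one more entry to the mapping; the indexing code itself does not change.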
All software is licensed under the GNU General Public License, version 2.
For more information about this license:
- General discussion forum on Sourceforge.net.
- Better documentation for configuring the various text extraction tools (check the wiki).
- Does not yet re-index a document that is updated.
- Does not yet remove a document from the search index if it is deleted.
- Allow database table name to be configurable so that more than one data index can be stored in a single database.
- Support scanned images within a PDF file. Currently only text is extracted from PDF files. However, many scanners generate PDFs directly, making this a useful feature.
- Full support for additional databases like Oracle, SQL Server, PostgreSQL, etc. (if there is a need).
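The two missing behaviours noted above (re-indexing updated documents and dropping deleted ones) could be approached by comparing each file's modification time with the time recorded when it was indexed. The sketch below is an assumption about one possible approach, not a description of planned EDocIAS code, and it uses Python purely for illustration. The function name and the `indexed` dictionary (path mapped to stored modification time) are hypothetical.

```python
import os


def plan_index_updates(indexed, root):
    """Compare stored modification times with the files currently on disk.

    indexed: dict mapping path -> modification time recorded at index time.
    Returns (to_add, to_reindex, to_remove) as sorted lists of paths.
    """
    on_disk = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            on_disk[full] = os.path.getmtime(full)

    # New files: present on disk but never indexed.
    to_add = sorted(p for p in on_disk if p not in indexed)
    # Updated files: modified after the time stored in the index.
    to_reindex = sorted(
        p for p in on_disk if p in indexed and on_disk[p] > indexed[p]
    )
    # Deleted files: still in the index but gone from disk.
    to_remove = sorted(p for p in indexed if p not in on_disk)
    return to_add, to_reindex, to_remove
```

The function only produces a plan; running the extractors and updating the database would be separate steps.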
The 1.0 release is available at SourceForge.net:
- edocias-1.0.tar.gz (33 KB)
Below is a list of some similar tools. There are surely others not listed here. What is missing from most apps in the list below is a simple way to get the text from scanned images (via OCR) into the search index. EDocIAS can use Tesseract to extract text from scanned images. (You will need to install Tesseract though.)
- Solr: Enterprise-quality document indexing and searching written in Java. Fairly complicated to set up and manage. Designed for indexing websites rather than local storage.
- IntraDex: PHP/MySQL app for indexing and searching. Last updated in 2010. No support for scanned documents. Windows only.
- DocSearcher: Java-based desktop (not web-based) tool (uses Lucene) for indexing documents.