PrimeOCR Product Evaluation Report
Provided by: Aspen Systems Corporation
Aspen's Director of Systems & Technology reported on a test he performed comparing a leading conventional OCR software product with the "Voting" OCR software solution from Prime Recognition (PrimeOCR). This highly detailed and well-executed pilot test shows that:
Notes from the director's report are presented below. These notes have been edited for brevity.
About Aspen Systems
Aspen Systems Corporation was founded in 1958 and has since grown into a 1,400-employee Information Management Services company whose core competencies include:
Tester Notes
Recently I devised and performed a test designed to evaluate a number of parameters that might affect the quality and throughput of the OCR process. The motivation for this testing was obtaining a 45-day evaluation copy of the Prime Recognition OCR software, which employs a voting engine methodology to improve accuracy. The test was constructed to examine the following variables:
The PrimeOCR software incorporates from 3 to 5 commercially available OCR engines as part of its process; five were used in this test. The Prime software provides an interface for controlling these engines, interprets the output from each, and intelligently arrives at what it considers the best possible solution for each page converted by each engine. The Prime software can be tuned by means of Prime "string commands", which set attributes for the engine. Using these string commands, the operator can specify acceptable levels and thresholds for turning the successive OCR engines on and off.
Under normal operation, the OCR engines are invoked in sequence on a page-by-page basis. With the appropriate string command settings, successive engines will not be invoked if the measured accuracy level of the current engine either falls below a lower threshold (the quality of the material is so poor as to preclude accurate conversion) or rises above an upper threshold (the first engine or engines did such a good job that additional engines would yield no appreciable improvement). Because five engines are used as part of the OCR conversion process, the time to process a given page is approximately 5x longer than it would be using a single engine alone. However, this can be mitigated somewhat by the use of string commands as described above.
For the purposes of this test, I used a small sample of source hardcopy material. The sample was kept small to keep the test manageable and to allow manual inspection of the test results. There were two initial batches: Good and Bad source. The Good source material was first-generation photocopy; some pages had mixed fonts, some had indented material, some had graphics, and some had numerical data. The Bad source material consisted of two parts: one half of the pages were taken from the Good batch and successively photocopied until the text was very light (but still readable). The remaining pages were photocopies that included large amounts of tabular data, mixed fonts, and graphic data.
Each of these pages was scanned on both the Fujitsu and Kodak scanners, once at 200 DPI and once at 300 DPI. One half of the images were then image enhanced. Finally, the TIF files were passed through the (conventional OCR leader) and Prime OCR engines, generating text files.
Both the Prime and (conventional OCR leader) OCR engines were run on similarly configured systems. These systems were Compaq DeskPro 400s running the Windows 95 operating system, with 64 MB of RAM and more than 500 MB of free hard disk space. Image data was copied to the local hard drive prior to OCR conversion, and the conversion generated text files on the local hard drive. As part of the conversion process, the following parameters were captured for each page:
Adjustments to Process and Measurements
When the test began, no string commands had been entered to modify the operation of the Prime software. As it became apparent that there would not be time to complete the test without some modification to reduce processing time, a string command was entered to establish thresholds for continuing to invoke OCR engines.
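The stopping rule that such a command sets up can be illustrated with a minimal Python sketch. This is not the PrimeOCR string command syntax, which is not reproduced in these notes, and it is not how the product computes confidence internally. The low cutoff of 600 and the upper range of 850 to 895 are the values given in this report; the intermediate ladder steps (865 and 880), the `Engine` callable, and the `run_engines` function are hypothetical names and values introduced only for illustration.

```python
from typing import Callable, List, Sequence

# From the report: page confidence below 600 is treated as very poor.
LOW_CUTOFF = 600
# From the report: the upper threshold is "laddered" from 850 to 895.
# Assumption: one threshold per engine step, rising evenly.
HIGH_LADDER = (850, 865, 880, 895)

# Hypothetical engine interface: takes a page image path and
# returns a page-level confidence value.
Engine = Callable[[str], int]


def run_engines(page: str,
                engines: Sequence[Engine],
                low: int = LOW_CUTOFF,
                high_ladder: Sequence[int] = HIGH_LADDER) -> List[int]:
    """Invoke engines in sequence and return the confidences of those that ran.

    The next engine is skipped when the current confidence is below `low`
    (source too poor to benefit) or above the upper threshold for this step
    (result already good enough).
    """
    confidences: List[int] = []
    for step, engine in enumerate(engines):
        confidence = engine(page)
        confidences.append(confidence)
        # Reuse the last ladder value if there are more engines than steps.
        high = high_ladder[min(step, len(high_ladder) - 1)]
        if confidence < low or confidence > high:
            break
    return confidences


if __name__ == "__main__":
    def fixed(value: int) -> Engine:
        """Stub engine that always reports the same confidence."""
        return lambda page: value

    # A mid-quality page runs through all five stub engines.
    print(len(run_engines("page.tif", [fixed(700)] * 5)))
    # A very poor page stops after the first engine.
    print(len(run_engines("page.tif", [fixed(550)] * 5)))
```

With stub engines like these, pages that score between the two thresholds pay the full multi-engine cost, while very poor or very clean pages stop early, which is where the time saving comes from.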
The numbers reflect overall page-level confidence levels for the conversion, on a scale of 0 to 900, with 900 being the highest confidence. This command would not invoke the following engine in sequence if the output of the current engine was either below 600 (very poor) or above a laddered threshold ranging from 850 to 895. As a result of this string command, the overall processing time was reduced.
The (conventional OCR leader) engine was determined to have a different problem. While it generated converted text as expected, the engine inserted a significant number of additional spaces into that text stream. While these extra spaces did not affect the final makeup of the resultant text document, they did inflate the total number of characters on the page, which in turn inflated the reported OCR accuracy. When the character counts reported by Prime and (conventional OCR leader) were manually compared with those actually counted, it was determined that the PrimeOCR engine was very accurate in its total page character count. The Prime-generated character counts were used in the calculations of the statistics in this report.
The testing generated the following conclusions:
1) The quality of source material has the greatest impact on the resultant OCR accuracy. Between good and bad source material, the test showed a 3.6% difference in OCR accuracy with the PrimeOCR engine and a 4.2% difference with the (conventional OCR leader) engine.
2) There does not appear to be any significant difference in the resultant OCR accuracy based upon DPI, scanner, or image enhancements. The overall differences in the average OCR accuracy for the same data sets with only one variable changed are:
3) There is a significant improvement in OCR accuracy when using the PrimeOCR engine vs. the (conventional OCR leader) engine. Prime delivers an average accuracy improvement of 2.64%. While 2.64% may not sound like a lot, given that our average number of characters per page in this test was 2774, this translates to over 66 more characters per page converted correctly by the Prime OCR engine vs. the (conventional OCR leader) engine.
When comparing the one-time costs for the PrimeOCR licenses against the potential recurring costs for OCR correction, the Prime solution may be cost-effective. For example, assume that the Prime solution leaves 66 fewer characters per page to be corrected, and that a clerk can correct at the rate of 200 characters/hr. This means that you would receive a benefit of $3 per page OCRed by not having to correct these additional characters. The software cost would be paid for, under this scenario, after 4000 pages were OCRed.
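The arithmetic behind this scenario can be laid out as a short Python sketch. The four input figures are the ones quoted above. The clerk's hourly cost and the software cost are not stated in these notes, so the values computed here are only what the quoted figures imply, and the function and constant names are hypothetical.

```python
import math

# Figures quoted in the report.
EXTRA_CHARS_PER_PAGE = 66     # characters per page that need no correction
CLERK_CHARS_PER_HOUR = 200    # clerk correction rate
BENEFIT_PER_PAGE = 3.00       # dollars saved per page OCRed
BREAK_EVEN_PAGES = 4000       # pages after which the software is paid for

# Correction time avoided on each page.
hours_saved_per_page = EXTRA_CHARS_PER_PAGE / CLERK_CHARS_PER_HOUR

# Implied by the figures above, not stated in the report.
implied_hourly_cost = BENEFIT_PER_PAGE / hours_saved_per_page
implied_software_cost = BENEFIT_PER_PAGE * BREAK_EVEN_PAGES


def break_even_pages(software_cost: float, benefit_per_page: float) -> int:
    """Number of pages after which per-page savings cover a one-time cost."""
    return math.ceil(software_cost / benefit_per_page)


if __name__ == "__main__":
    print(f"Hours saved per page: {hours_saved_per_page:.2f}")
    print(f"Implied clerk cost per hour: ${implied_hourly_cost:.2f}")
    print(f"Implied software cost: ${implied_software_cost:,.0f}")
    print(f"Break-even pages: {break_even_pages(implied_software_cost, BENEFIT_PER_PAGE)}")
```

Under these figures each page saves about a third of an hour of correction time, which implies a clerk cost of roughly $9 per hour and a one-time software cost of about $12,000. Substituting a site's own labor rate and license price into `break_even_pages` gives the corresponding payback point.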