Sunday, July 11, 2010

OCR in Concept Framework

I've included Tesseract in Concept Framework. It's a really nice OCR library, and I'm impressed by the quality of its character recognition. I've tried to keep the API really simple: just two functions, one for debugging and one for the actual OCR.

The prototype looks like this:

number OCR(szImage_filename, refOutText, szLanguage="eng", szDataPath="", szConfigFile="");

It returns OCR_E_CANT_OPEN or OCR_E_CANT_READ on error, or 0 on success.

A basic example:

import standard.lib.ocr

class Main {
    function Main() {
        // if debug file is not set, nul or /dev/null is assumed
        if (!OCR("cap.bmp", var data))
            echo data;
    }
}

I hope it is as straightforward as it can be.
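Since OCR() returns a distinct code for each failure, the result can also be checked explicitly instead of only testing for zero. The following is a minimal sketch, not a definitive usage pattern: it relies only on the prototype and the two error constants given above, passes the default language "eng" explicitly just to show the third parameter, and the messages are purely illustrative.

```
import standard.lib.ocr

class Main {
    function Main() {
        // "eng" is already the default; it is spelled out here only for illustration
        var res = OCR("cap.bmp", var data, "eng");
        if (res == OCR_E_CANT_OPEN)
            echo "Could not open the image file\n";
        else if (res == OCR_E_CANT_READ)
            echo "Could not read the image data\n";
        else
            // 0 means success, so data holds the recognized text
            echo data;
    }
}
```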

This morning I realized that although the framework has many PDF writing APIs (like libharu and PDFlib), it has no PDF reading APIs. So I've integrated poppler (already used by the Concept Client). Now you can convert PDF pages to images or extract their text.

The APIs are pretty basic:

import standard.lib.poppler

class Main {
    function Main() {
        var pdf=PDFLoadBuffer(ReadFile("test.pdf"), "", var err);
        if (pdf) {
            var pages=PDFPageCount(pdf);
            echo "Document has $pages pages\n";
            for (var i=0;i<pages;i++) {
                echo "Page ${i+1}:\n";
                echo "==============";
                // extract the text
                echo PDFPageText(pdf, i);
                // extract page as an image
                echo PDFPageImage(pdf, i, "page_$i.png");
                echo "==============";
            }
            // Don't forget to close !
        }
    }
}

(sorry for the indentation, blogspot didn't handle it very well)
