.

Timestamp | Brief Name | URL | What kind of resource is it? | Language and Document Types | Description of the Resource | Other Comments | Submitted By (optional)

.

5/25/2008 18:13:44 | Baird Group | http://www.cse.lehigh.edu/~baird/research.html | academic OCR research group | Latin script, predominantly English; mathematics | The group is at Lehigh University.

1. Document image content extraction (DICE);
2. Human Interactive Proofs and CAPTCHAs;
3. Full integration of document images into digital libraries; and
4. High-performance image understanding systems.

.

5/25/2008 18:16:09 | Govindaraju Group | http://cubs.buffalo.edu/govind/ | academic OCR research group | Latin, Arabic, Devanagari script; handwriting; English, Arabic, Sanskrit, Hindi | Dr. Govindaraju is currently working on extending his expertise in the automated recognition of both machine-printed and hand-written text in Latin script to Arabic and Indic scripts for indexing and searching documents. Dr. Govindaraju's foray into the field of biometrics began with his dissertation on the automated recognition of faces over two decades ago. His recent multi-disciplinary efforts include novel applications such as use of facial expressions for remote biometric authentication.

.

5/25/2008 19:05:28 | PRImA | http://www.cse.salford.ac.uk/prima/ | academic OCR research group | | PRImA is a group of researchers aiming at developing world-class Pattern Recognition and Image Analysis techniques for real-world problems.

Techniques developed by PRImA members and their associates have gained international academic standing and are currently in use in Industry.

.

5/26/2008 0:44:54 | IUPR | http://www.iupr.org/ | academic OCR research group | | Our research group conducts basic and applied research in pattern recognition, machine learning, image understanding, and artificial intelligence, with practical applications to digital libraries, network security, bioinformatics, historical document analysis, and scientific data analysis. To learn more about us, have a look at our Research Themes, Projects, and Publications. | | tmbdev

.

5/25/2008 18:11:18 | Abbyy Finereader | http://finereader.abbyy.com/ | commercial OCR system | Latin script, 39 languages supported "with dictionaries" | Widely used commercial OCR system.

.

5/25/2008 18:12:09 | Omnipage | http://www.nuance.com/omnipage/ | commercial OCR system | Latin script | A widely used commercial OCR system.

.

5/25/2008 18:56:11 | Kadmos | http://www.rerecognition.com/ | commercial OCR system | KADMOS recognizes hand print, machine print, fraktur, norm fonts e.g. OCRA, OCRB, F7B, CMC7, E13B ..., and marks. | KADMOS is an easily integrated character recognition software component for professional use and is incorporated in complete packages for recognition solutions of every kind. The latest proprietary mathematical algorithms are used.

KADMOS is available in three versions:
- REC for isolated single character
- REL for separated text lines
- REP for multiline recognition.

.

5/25/2008 19:18:06 | TOPOCR | http://www.topocr.com/index.html | commercial OCR system | camera or smartphone | TopOCR is designed to be simple and user-friendly for use with your digital camera or smartphone.

.

5/25/2008 19:22:19 | SimpleOCR | http://www.simpleocr.com/ | commercial OCR system | | SimpleOCR is the popular freeware OCR software with hundreds of thousands of users worldwide. SimpleOCR is also a royalty-free OCR SDK for developers to use in their custom applications.

.

5/26/2008 0:34:01 | Automatic Reader (Arabic) | http://www.amazon.com/Automatic-Reader-Multilingual-OCR-Gold/dp/B0002A5D5U | commercial OCR system | Arabic

* A trainable OCR. It recognizes Arabic text
* Arabic Natural Language Processing, NLP
* Supports both OMNI & Learning technologies
* Works with any type of scanner
* OCR of bilingual documents (Arabic/English and other Latin-based languages)
* Optionally available: features within Sakhr's Automatic Reader package for accuracy enhancement using NLP tools, PDF support, support for all common new image formats, and support for other scripts with shapes similar to Arabic, such as Farsi, Urdu, Pashto, and Jawi.

tmbdev

.

5/26/2008 0:35:13 | ReadIris Middle East | http://www.irislink.com/c2-561-189/Readiris-Pro-11-Middle-East---Arabic-OCR-Software.aspx | commercial OCR system | Arabic, Farsi, English, and Hebrew | A mature commercial OCR system for desktop usage. | | tmbdev

.

5/26/2008 0:36:01 | Verus | http://www.novodynamics.com/verus_stand.htm | commercial OCR system | Arabic | Sales blurb: "An extraordinarily advanced OCR solution, VERUS™ Standard provides the most accurate Middle Eastern language optical character recognition in the world. It recognizes Arabic, Farsi (Persian), Dari, and Pashto languages, including embedded English and French. It automatically detects and cleans degraded and skewed documents, automatically identifies a page's primary language, and recognizes a page's fonts without manual intervention. VERUS'™ intuitive user interface allows users to quickly review and edit recognized tex" | | tmbdev

.

5/25/2008 15:51:58 | MARG | http://marg.nlm.nih.gov/index2.asp | data set for OCR training or testing | English | directly downloadable from the site, contains > 1000 pages of scanned document images, dataset designed for OCR training or evaluation, flatbed scanned, contains a lot of alphabetic scripts (Latin, Greek, Hebrew, Russian, ...), contains many academic journals; Scanned images of biomedical journals and their ground truth data.

.

5/25/2008 15:51:58 | UW3 | http://documents.cfar.umd.edu/resources/database/3UWCdRom.html | data set for OCR training or testing | English | This dataset has a number of problems: it is hard to obtain, and skew correction was carried out after ground truthing, making the bounding boxes for page elements somewhat inaccurate.
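
To illustrate why skew correction applied after ground truthing makes bounding boxes inaccurate, here is a minimal sketch. It is not the UW3 procedure: the page size, the example box, and the 1 degree skew angle are hypothetical values chosen only to show the size of the effect.

```python
import math


def rotate_point(x, y, cx, cy, degrees):
    """Rotate (x, y) about the center (cx, cy) by the given angle."""
    theta = math.radians(degrees)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(theta) - dy * math.sin(theta),
            cy + dx * math.sin(theta) + dy * math.cos(theta))


def deskewed_box(box, page_size, degrees):
    """Rotate the corners of an axis-aligned box (x0, y0, x1, y1) about the
    page center and return the axis-aligned box enclosing the result."""
    x0, y0, x1, y1 = box
    cx, cy = page_size[0] / 2, page_size[1] / 2
    corners = [rotate_point(x, y, cx, cy, degrees)
               for x, y in ((x0, y0), (x1, y0), (x1, y1), (x0, y1))]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


page = (2550, 3300)          # hypothetical page size in pixels
box = (300, 200, 2250, 260)  # hypothetical ground-truth box for one text line
new_box = deskewed_box(box, page, 1.0)  # hypothetical 1 degree correction

# Largest change in any box coordinate caused by the rotation.
shift = max(abs(a - b) for a, b in zip(box, new_box))
```

Boxes recorded before the rotation keep their old coordinates, so every page element is off by roughly `shift` pixels unless the ground truth is transformed along with the image.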

.

5/25/2008 15:51:58 | ETL | | data set for OCR training or testing | Japanese | contains > 1000 pages of scanned document images, contains a lot of CJK (Chinese, Japanese, Korean), contains many academic journals; Electrotechnical Lab database of printed and handwritten documents. Mostly Japanese. It's unclear where to get it now. IUPR has a copy on disk. Documentation is in Japanese.

.

5/25/2008 15:51:58 | Tobacco Corpus | | data set for OCR training or testing | English | flatbed scanned, many scans are bitonal; Scanned legal documents from the US tobacco lawsuit.

.

5/25/2008 15:51:58 | Google 1000 | | data set for OCR training or testing | English, some other languages | dataset designed for OCR training or evaluation, contains a lot of historical (pre-1930) documents; A release of 1000 books from Google for the purpose of training and testing OCR systems. The distribution is about 120 GB in size and is shipped on disk from Google.

.

5/26/2008 0:37:45 | | http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/8850/27985/01249846.pdf?arnumber=1249846 | data set for OCR training or testing | Devanagari, Hindi | Creation of data resources and design of an evaluation test bed for Devanagari script recognition
Setlur, S.; Kompalli, S.; Ramanaprasad, V.; Govindaraju, V.
Research Issues in Data Engineering: Multi-lingual Information Management, 2003. RIDE-MLIM 2003. Proceedings. 13th International Workshop on
10-11 March 2003, pages 55-61
Digital Object Identifier 10.1109/RIDE.2003.1249846
Summary: The Indian subcontinent has a large number of languages, dialects, and scripts with the Devanagari script being the primary and most widely used of all the scripts. To date, much of the Devanagari optical character recognition (OCR) research has been restricted to a handful of groups. So, techniques have not yet been widely disseminated or evaluated independently and automated evaluation tools are currently not available for lack of a standard representation of ground-truth and result data. A key reason for the absence of sustained research efforts in off-line Devanagari OCR appears to be the paucity of data resources. Ground truthed data for words and characters, on-line dictionaries, corpora of text documents and reliable, standardized statistical analyses and evaluation tools are currently lacking. So, the creation of such data resources will undoubtedly provide a much needed fillip to researchers working on Devanagari OCR. This paper describes a National Science Foundation sponsored project under the International Digital Libraries program to create data resources that will facilitate development of Devanagari OCR technology and provide a standardized test bed and evaluation tools for Devanagari script recognition.
tmbdev

.

5/26/2008 0:38:58 | IFN/ENIT Arabic Database | http://www.ifnenit.com/ | data set for OCR training or testing | Arabic handwriting | The IFN/ENIT-database contains material for training and testing of Arabic handwriting recognition software. There are more than 2200 binary images of handwriting sample forms from 411 writers; about 26,000 binary word images have been isolated from the forms and saved individually for ease of access. A ground truth file for each word in the database has been compiled. This file contains information about the word, such as the position of the word's baseline, and information on the individual characters used in the word. | | tmbdev

.

5/25/2008 15:51:58 | Internet Archive | http://www.archive.org | digital library site | mostly English | not-for-profit book scanning and archiving effort; raw scans, Finereader output, PDFs, etc.

.

5/25/2008 15:51:58 | Arxiv | http://www.arxiv.org/ | digital library site | mostly English | digital library site (intended for readers and end users); Large collection of scientific papers obtained by self-submission, often with ground truth in LaTeX format.

.

5/25/2008 15:51:58 | Citeseer | http://citeseer.ist.psu.edu/ | digital library site | mostly English | digital library site (intended for readers and end users); Large collection of scientific papers obtained by web crawling, often with ground truth in PDF or other formats.

.

5/25/2008 18:03:24 | IMPACT | http://www.impact-project.eu/ | funded OCR-related research grant (EU, NSF, etc.) | European languages and scripts | IMPACT is a project funded by the European Commission. It aims to significantly improve access to historical text and to take away the barriers that stand in the way of the mass digitization of the European cultural heritage.
* Koninklijke Bibliotheek
* The British Library
* Österreichische Nationalbibliothek
* Universität Innsbruck
* Deutsche Nationalbibliothek
* Bayerische Staatsbibliothek
* Staats- und Universitätsbibliothek Göttingen
* ABBYY Production
* IBM Israel – Science and Technology Ltd
* Instituut voor Nederlandse Lexicologie
* National Centre for Scientific Research "Demokritos"
* Centrum für Informations- und Sprachverarbeitung, University of Munich
* University of Bath
* University of Salford
* Bibliothèque Nationale de France

.

5/25/2008 18:08:32 | PAPYRUS | http://www.ict-papyrus.eu/ | funded OCR-related research grant (EU, NSF, etc.) | European languages and scripts | Past and existing work for digital recapturing and preservation of European cultural and scientific heritage has consumed significant effort and resources for the digitisation, characterisation, and classification of content. Digital libraries have thus emerged providing electronic access for many communities of users to available information of their discipline. What has never been targeted, however, is a digital library that draws content from one domain and makes it available to the users of another.
Our project approaches this need by introducing the concept of a cross-discipline digital library engine. Papyrus intends to be a dynamic digital library which will understand user queries in the context of a specific discipline, look for content in a domain alien to that discipline and return the results presented in a way useful and comprehensive to the user. The consortium intends to showcase this approach with a specific pair of disciplines which can be illustrated as an apparent need and may prove to be an immediate exploitation opportunity even on its own. This proposed use case is the recovery of history from news digital content.

.

5/25/2008 18:33:37 | TextGrid | http://www.textgrid.de/ | funded OCR-related research grant (EU, NSF, etc.) | European scripts and languages | Modular platform for distributed and cooperative scholarly text data processing: a community grid for the humanities.

TextGrid is building a grid-enabled workbench for the collaborative philological processing, analysis, annotation, editing, and publication of text data for philology, linguistics, and neighboring disciplines.

Its interfaces, which are open to further projects, ensure synergies in scholarly text data processing and streamline scholarly work, among other things through optimized access to primary sources and tools.

By creating an interdisciplinary, international, and networked virtual research platform, TextGrid, together with other first-generation e-Humanities initiatives, is taking part in the "gridification" of the humanities.

.

5/25/2008 18:34:31 | JISC | http://www.jisc.ac.uk/ | funded OCR-related research grant (EU, NSF, etc.) | European languages and scripts | The mission of the Joint Information Systems Committee (JISC) is to provide world-class leadership in the innovative use of ICT to support education and research.
JISC funds a national services portfolio (e.g. JANET) and a range of programmes (e.g. Cross-institutional use of e-learning to support lifelong learners) and projects (e.g. British Cartoon Archive digitisation project).

.

5/25/2008 18:51:19 | ocrad | http://www.gnu.org/software/ocrad/ocrad.html | funded OCR-related research grant (EU, NSF, etc.) | | GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats.

Also includes a layout analyser able to separate the columns or blocks of text normally found on printed pages.

Ocrad can be used as a stand-alone console application, or as a backend to other programs.
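
As a rough illustration of using Ocrad as a backend, the sketch below writes a greyscale page in the pgm format mentioned above and, if an `ocrad` executable happens to be installed, passes the file to it. The helper names `write_pgm` and `run_ocrad` and the blank test page are hypothetical; the sketch assumes only that `ocrad` accepts an image file name as an argument and prints the recognized text to standard output.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_pgm(path, pixels):
    """Write rows of 0-255 grey values as a binary (P5) PGM file."""
    height = len(pixels)
    width = len(pixels[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    body = bytes(value for row in pixels for value in row)
    Path(path).write_bytes(header + body)


def run_ocrad(image_path):
    """Return Ocrad's raw output bytes, or None if ocrad is not installed."""
    if shutil.which("ocrad") is None:
        return None
    result = subprocess.run(["ocrad", str(image_path)], capture_output=True)
    return result.stdout


# A blank white 8x4 page; a real page would come from a scanner or converter.
page = [[255] * 8 for _ in range(4)]
pgm_path = Path(tempfile.gettempdir()) / "ocrad_sketch_page.pgm"
write_pgm(pgm_path, page)
```

Calling `run_ocrad(pgm_path)` on a real scanned page would return the recognized text as bytes, which the caller can decode according to the output format Ocrad was asked to produce.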

.

5/25/2008 19:10:28 | InftyProject | http://www.inftyproject.org/en/index.html | multiple: provides datasets and software | scientific documents (mathematical formulas) | InftyProject is a voluntary R&D organization consisting of researchers from different universities and research institutes to investigate and develop new systems to process scientific information by computer. A distinguishing policy of InftyProject is to give overriding priority to applying its research results to practical system development that is usable in research, education, or the welfare of science and technology. InftyReader, an OCR system for mathematics produced by the project, is available only as a commercial product for US$900.

.

5/25/2008 19:00:37 | unpaper | http://unpaper.berlios.de/ | open source image processing or machine learning software | | unpaper is a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. The main purpose is to make scanned book pages better readable on screen after conversion to PDF. Additionally, unpaper might be useful to enhance the quality of scanned pages before performing optical character recognition (OCR).

.

5/25/2008 15:51:58 | ocropus | http://www.ocropus.org/ | open source OCR project | omnifont, omniscript, omni-language | OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

.

5/25/2008 18:53:32 | OCRE | http://lem.eui.upm.es/ocre.html | open source OCR project | Spanish | An open source project; there's fairly little documentation and benchmarking.

.

5/25/2008 18:54:35 | ClaraOCR | http://www.geocities.com/claraocr/ | open source OCR project | English | Current status: the claraocr interface is being reworked to provide a solid basis for future development. For now, only the source tarball clara-20031214.tar.gz is available. Please wait for the new versions.

.

5/25/2008 18:57:00 | Kognition | http://sourceforge.net/projects/kognition/ | open source OCR project
An omnifont OCR software for KDE. Due to the fact that each step of the OCR process can be visualized you can get a quick idea of how OCR works and where the problems lie. However the program may be of minor/no use for end users in its current state.

.

5/25/2008 18:57:47 | GOCR | http://jocr.sourceforge.net/ | open source OCR project | English | GOCR is an OCR (Optical Character Recognition) program, developed under the GNU Public License. It converts scanned images of text back to text files. Joerg Schulenburg started the program, and now leads a team of developers.
GOCR can be used with different front-ends, which makes it very easy to port to different OSes and architectures. It can open many different image formats, and its quality has been improving on a daily basis.

.

5/25/2008 19:01:17 | Conjecture | http://www.corollarium.com/conjecture/ | open source OCR project | | Conjecture is a modular, extensible, open-source C++ framework for Optical Character Recognition (OCR). Conjecture is not a single OCR, but rather is an extensible collection of OCRs that can be explored, analyzed, compared, extended, modified, and merged within a unified environment. Seems to be dead (last SVN activity 9 months ago).

.

5/25/2008 19:02:03 | hOCR (Hebrew OCR) | http://hocr.berlios.de/ | open source OCR project | | LibHocr is a GNU Hebrew optical character recognition library. It scans document images, improves the image, analyses the page layout, recognises the characters and outputs the text. The output is editable text, ready for your blog, word processor or any other use.

.

5/25/2008 19:03:12 | Tesseract | http://code.google.com/p/tesseract-ocr/ | open source OCR project | alphabetic languages; dictionary language model | The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A TIFF reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images. | Tesseract OCR is already integrated into OCRopus.

.

5/25/2008 19:03:51 | Cuneiform OCR | http://www.cuneiform.ru/eng/index.html | open source OCR project | | CUNEIFORM V.12. In our modern world you can hardly think about simply retyping documents into a computer, because they usually have images and other elements of decoration which you will not be able to copy into your computer without scanning. And only such a modern and perfect system as CuneiForm V.12 can reproduce an original document form, including images, tables, columns, paragraphs, indentations, font styles and sizes.

CUNEIFORM V.12 is a famous Russian software product with established traditions. It combines broad experience acquired by Russian scientists with the most advanced achievements in the field of optical recognition, such as cognitive analysis algorithms, adaptive recognition of characters, meridian segmentation of tables, neural networks, etc.

.

5/25/2008 19:04:56 | Gamera | http://ldp.library.jhu.edu/projects/gamera/ | open source OCR project | | Gamera is a framework for the creation of structured document analysis applications by domain experts.

.

5/25/2008 19:06:05 | Cuneiform Linux | https://code.launchpad.net/~jpakkane/+junk/cuneiform-linux | open source OCR project | | Linux port of the Cuneiform OCR system. There is very little readable documentation on the system.

.

5/26/2008 0:41:04 | SIRAGI | http://siragi.sourceforge.net/ | open source OCR project | | SIRAGI is open source software designed to help blind and partially sighted people work with their computers. Visually impaired people can use this program to "listen" to the content of their screen under Windows or Linux/KDE. The main advantage of SIRAGI is its support for Arabic in Braille and in speech synthesis. | | tmbdev

.

5/26/2008 0:42:30 | BOCRA | http://bocra.sourceforge.net/doc/ | open source OCR project | Bengali | Word-based Bengali OCR. | | tmbdev

.

5/26/2008 4:59:57 | BanglaOCR | http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=215908 | open source OCR project | Bangla/Bengali printed document | The development of this project is still ongoing.

Center for Research on Bangla Language Processing (CRBLP) is developing an OCR for machine printed Bangla documents.
Md. Abul Hasnat

.

5/25/2008 18:59:53 | WeOCR | http://weocr.ocrgrid.org/ | other document processing software | | WeOCR is a platform for Web-enabled OCR (Optical Character Reader/Recognition) systems that enables people to use character recognition over networks. A WeOCR server receives document images from users, recognizes the text in the images, and returns the recognition results to the users. WeOCR does not have its own character recognition engine. Instead, it is intended to accommodate various character recognition engines. WeOCR provides a simplified user interface so that more people can benefit from OCR easily.

.

5/26/2008 0:40:22 | ArabEyes | http://www.arabeyes.org/ | resource page / directory | | Arabeyes is a Meta project that is aimed at fully supporting the Arabic language in the Unix/Linux environment. It is designed to be a central location to standardize the Arabization process. Arabeyes relies on voluntary contributions by computer professionals and enthusiasts all over the world. | | tmbdev

.

5/25/2008 18:54:21 | FBK (formerly ITC) | http://tev.itc.it/OCR/ResearchProjects.html | resource page / directory | | An extensive overview of research groups, projects and products. | Might be considered as an additional source for this document.

.

New Entries Below

.

6/1/2008 21:04:17 | THDL | http://www.thdl.org/ | digital library site | Tibetan language and script; related scripts. | The Tibetan and Himalayan Digital Library is an international community using Web-based technologies to integrate diverse knowledge about Tibet and the Himalayas for free access from around the world.

Serving a wide range of communities, we publish multilingual studies, multimedia learning resources, and creative works concerned with the area's environments, cultures, and histories.
tmbdev

.

6/15/2008 10:44:56 | Nepali OCR | http://www.mpp.org.np/pannepal/activities.php | funded OCR-related research grant (EU, NSF, etc.) | Nepali | | | Rajesh Pandey

.

6/19/2008 12:03:42 | Quixate | http://www.quixate.com | commercial OCR system | Latin script. Text in photographs of natural scenes.
Quixate develops technologies to read the text present in natural photographic images. Applications include indexing and searching collections of photographs and text reading using mobile devices.
enquiries@quixate.com

.

4/9/2009 20:21:48