Searching in ALTO files with context
Tags: context, alto, search
Last Updated: Dec 03, 2009 13:54
Description
This application searches for terms in ALTO files, which are XML files that store OCR output. A term can consist of several words, and multiple terms can be searched for at the same time. The output is XML and contains the coordinates of the words that were found, together with the textual context around the hits.
The search is case-insensitive: all text is converted to lowercase before it is compared. In addition, punctuation marks, brackets and similar characters at the beginning or end of terms are ignored for comparison purposes. Punctuation detection and lowercase conversion are based on Unicode 5.2 data.
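The comparison rule described above can be illustrated with a minimal sketch. It is not the author's implementation: the function names normalizeTerm and sameWord are hypothetical, and the sketch handles ASCII characters only, whereas the program itself uses Unicode 5.2 data. Bytes of multi-byte UTF-8 characters are passed through unchanged.

```cpp
#include <cctype>
#include <string>

// ASCII-only illustration (assumption): the real program relies on
// Unicode 5.2 tables, which this sketch does not reproduce.
std::string normalizeTerm(const std::string& word) {
    std::size_t begin = 0;
    std::size_t end = word.size();
    // Skip punctuation and brackets at the start of the word.
    while (begin < end && std::ispunct(static_cast<unsigned char>(word[begin]))) {
        ++begin;
    }
    // Skip punctuation and brackets at the end of the word.
    while (end > begin && std::ispunct(static_cast<unsigned char>(word[end - 1]))) {
        --end;
    }
    // Lowercase what remains; punctuation inside the word is kept.
    std::string result = word.substr(begin, end - begin);
    for (char& c : result) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return result;
}

// Two words are considered equal when their normalized forms are equal.
bool sameWord(const std::string& a, const std::string& b) {
    return normalizeTerm(a) == normalizeTerm(b);
}
```

With this sketch, "(Thron)," and "thron" compare as equal, because the outer punctuation is dropped and the remaining text is lowercased on both sides.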
- Author: Yves Maurer
- Additional author(s):
- Institution: Centre Informatique de l'Etat
- Year: 2009
- License: GPL v2
- Short description: Use, modification and distribution of the code are permitted provided the copyright notice, list of conditions and disclaimer appear in all related material.
- Link to terms: http://www.gnu.org/licenses/gpl-2.0.txt
- Skill required for using this code: advanced
Usage
alto_search -alto filename -term XXXX [-term XXXX] [-help] [-block YYYY]
- alto : "filename" gives the full pathname of the ALTO file in which to search (must be UTF-8)
- term : "XXXX" is a search term in UTF-8 (one term can be several words enclosed in quotation marks). There can be several -term parameters.
- block: "YYYY" is the ID of a TextBlock in the ALTO file. As soon as 1 block ID is given, the search is restricted to only those blocks. There can be several -block parameters.
- help : prints a help text in XML
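To make the semantics of the repeatable parameters concrete, here is a rough sketch of how such a command line could be interpreted. The SearchOptions structure and the parseOptions function are hypothetical and are not taken from the alto_search source; the sketch also assumes that -help may be given on its own.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the documented options.
struct SearchOptions {
    std::string altoFile;
    std::vector<std::string> terms;
    std::vector<std::string> blocks;  // empty means "search all blocks"
    bool help = false;
};

// Returns false when the arguments do not form a valid call.
bool parseOptions(const std::vector<std::string>& args, SearchOptions& out) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        if (flag == "-help") {
            out.help = true;
            continue;
        }
        // Every other option takes exactly one value.
        if (i + 1 >= args.size()) return false;
        const std::string& value = args[++i];
        if (flag == "-alto") out.altoFile = value;
        else if (flag == "-term") out.terms.push_back(value);    // repeatable
        else if (flag == "-block") out.blocks.push_back(value);  // repeatable
        else return false;
    }
    // Assumption: -help alone is valid; otherwise -alto and at least one
    // -term are required, as in the usage line.
    return out.help || (!out.altoFile.empty() && !out.terms.empty());
}
```

An empty blocks list corresponds to the documented default of searching every TextBlock; as soon as one ID is present, only the listed blocks would be searched.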
Examples
- alto_search -alto C:\data\1908\1908-01-01_01\alto\1908-01-01_01-00001.xml -term "der Thron"
Searches for the two words "der" and "Thron" when they appear consecutively in the same block.
Since no -block parameter is given, all blocks in the ALTO file are searched.
- alto_search -alto /exlibris4/storage/2009/11/01/363562 -term "DIE BISCHÖFE" -term "KIRCHE" -block P2_TB00046 -block P2_TB00047
Searches for "DIE BISCHÖFE" as two consecutive words and, separately, for the single word "KIRCHE". The search is restricted to the two blocks P2_TB00046 and P2_TB00047. Other blocks are skipped.
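The "consecutive words in the same block" rule used in these examples can be sketched as follows. This is only an illustration under stated assumptions, not the program's actual algorithm: findTerm is a hypothetical name, the words of a block and of the term are assumed to be already lowercased and stripped of outer punctuation, and hyphenated words are not handled.

```cpp
#include <string>
#include <vector>

// Returns the start index of every place where termWords occurs as a run of
// consecutive words inside blockWords. Both inputs are assumed to be
// normalized already (lowercase, outer punctuation removed).
std::vector<std::size_t> findTerm(const std::vector<std::string>& blockWords,
                                  const std::vector<std::string>& termWords) {
    std::vector<std::size_t> starts;
    if (termWords.empty() || termWords.size() > blockWords.size()) {
        return starts;
    }
    for (std::size_t i = 0; i + termWords.size() <= blockWords.size(); ++i) {
        bool match = true;
        for (std::size_t j = 0; j < termWords.size(); ++j) {
            if (blockWords[i + j] != termWords[j]) {
                match = false;
                break;
            }
        }
        if (match) starts.push_back(i);
    }
    return starts;
}
```

Because the function only ever looks at the words of one block, a term whose words are spread over two blocks is never reported, which matches the behaviour described in the first example.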
Explanation of result XML
MeasurementUnit
The measurement unit used in the ALTO file. This is usually mm10 (tenths of a millimetre). To convert a value in mm10 to pixels, use the formula pixel = mm10 * dpi / 254, where dpi is the resolution of the image that the ALTO file corresponds to.
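The conversion can be written as a small helper. The function name mm10ToPixels is hypothetical, and rounding to the nearest whole pixel is an assumption of this sketch; the formula itself is the one given above.

```cpp
#include <cmath>

// Converts a value in tenths of a millimetre (mm10) to pixels.
// 254 is the number of tenths of a millimetre in one inch.
// Rounding to the nearest pixel is an assumption of this sketch.
long mm10ToPixels(double mm10, double dpi) {
    return std::lround(mm10 * dpi / 254.0);
}
```

For example, a horizontal position of 3291 mm10 on an image scanned at 300 dpi (a resolution chosen here only for illustration) corresponds to roughly pixel 3887.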
ResultSet
The container for the whole XML.
result BLOCKID="P1_TB00029" NHITS="1" CHARS_HIT="3"
A result with its context. The BLOCKID is the ID of the TextBlock in which the hit was found. NHITS is the number of terms found (the same term can be counted several times). CHARS_HIT is the number of characters in the terms that were found.
context
The context around a hit. The context can appear before, after and in between hits. It consists of regular text.
hit HPOS="3291" VPOS="5211" WIDTH="157" HEIGHT="33"
A hit. This is always one single word (as defined by the segmentation in the ALTO). One word can be split into two hits if it was hyphenated on the original page. If a search term consists of several words, they will be adjacent hits. HPOS is the horizontal position, VPOS is the vertical position, WIDTH is the width and HEIGHT is the height of the bounding box that delimits the word on the image. These values are in the unit defined in MeasurementUnit.
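A consumer of the result XML needs to read these attributes. The following sketch extracts one attribute value from the text of a single tag. attributeValue is a hypothetical helper, not part of alto_search, and it is not a general XML parser: it assumes double-quoted values without escaped quotes. For anything beyond quick experiments, a real XML library is the safer choice.

```cpp
#include <string>

// Returns the value of attribute `name` in a single tag's text, or an empty
// string when the attribute is absent. Assumes double-quoted values and no
// escaped quotes inside them.
std::string attributeValue(const std::string& tag, const std::string& name) {
    // The leading space avoids matching a longer attribute name that merely
    // ends with `name`.
    const std::string key = " " + name + "=\"";
    const std::size_t start = tag.find(key);
    if (start == std::string::npos) return "";
    const std::size_t valueStart = start + key.size();
    const std::size_t valueEnd = tag.find('"', valueStart);
    if (valueEnd == std::string::npos) return "";
    return tag.substr(valueStart, valueEnd - valueStart);
}
```

Applied to the hit shown above, asking for HPOS yields the string 3291, which can then be converted to a number and scaled to pixels using the MeasurementUnit formula.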
info
Information about the time needed to perform the search.
State
Stable
Programming language
C++
Software requirements
GCC compiler
Author(s) homepage
http://www.bnl.lu
http://www.eluxemburgensia.lu
Download
http://www.exlibrisgroup.org/download/attachments/26019311/alto_search.zip (Source code)
Changes
Version 1.0
Initial Release
Release notes
Initial Release
Only works with UTF-8 encoded ALTO files
Installation instructions
Unzip the archive into a directory of your choice. Edit the Makefile so that it points to your GCC compiler (its location can usually be found with the command "which gcc"). Then run "make", which creates the executable alto_search.
Known issues
Only works with UTF-8 encoded ALTO files. Only takes UTF-8 encoded strings as input.