
Searching in ALTO files and getting context

Last Updated: Dec 16, 2009 18:25


Description

This application searches for terms in ALTO files, which are XML files that store OCR output. A term can consist of several words, and multiple terms can be searched at the same time. The output is XML and contains the coordinates of the words that were found as well as textual context around the hits. The search is case-insensitive: everything is converted to lowercase before comparison. In addition, punctuation marks (brackets and the like included) at the beginning or end of each word of a term are ignored for comparison purposes. The detection of punctuation marks and the conversion to lowercase are based on Unicode 5.2 data.
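
The comparison rule described above can be sketched in Python as follows. This is only an illustration, not the program's C++ code: the name `normalize_word` is made up here, and Python's `unicodedata` module ships a newer Unicode version than the 5.2 data mentioned above, so edge cases may differ.

```python
import unicodedata


def normalize_word(word: str) -> str:
    """Lowercase a word and strip punctuation from both ends (hypothetical helper)."""

    def is_punct(ch: str) -> bool:
        # Unicode general categories starting with "P" cover punctuation,
        # including opening and closing brackets (Ps, Pe).
        return unicodedata.category(ch).startswith("P")

    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    # Punctuation inside the word is kept; only the edges are trimmed.
    return word[start:end].lower()
```

Under this sketch, "6." and "6" normalize to the same string, which is the behaviour the description implies for words that differ only in surrounding punctuation.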


  • Author: Yves Maurer
  • Additional author(s):
  • Institution: Centre Informatique de l'Etat
  • Year: 2009
  • License: GPL v2
  • Short description: Use, modification and distribution of the code are permitted provided the copyright notice, list of conditions and disclaimer appear in all related material.
  • Link to terms: http://www.gnu.org/licenses/gpl-2.0.txt
  • Skill required for using this code: advanced

Usage

alto_search -alto filename -term XXXX [-term XXXX] [-utf8term XX,YY,ZZ] [-help] [-block YYYY]

  • alto : "filename" gives the full pathname of the ALTO file in which to search (must be UTF-8)
  • term : "XXXX" is a search term in UTF-8 (one term can be several words enclosed in quotation marks). There can be several -term parameters.
  • utf8term : "XX,YY,ZZ" is meant for the case where you cannot pass non-ANSI characters to the program. Pass the bytes that make up the UTF-8 string as decimal numbers separated by commas. The example below writes bytes above 127 as negative numbers, that is, as signed 8-bit values.
  • block : "YYYY" is the ID of a TextBlock in the ALTO file. As soon as one block ID is given, the search is restricted to the listed blocks. There can be several -block parameters.
  • help : prints a help text in XML
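
When the program is called from a script, the parameters above can be assembled as an argument list, which avoids shell quoting problems for multi-word terms. The following Python sketch only builds the list; the function name `build_alto_search_args` is hypothetical, and it assumes the executable is reachable as `alto_search`.

```python
from typing import List, Sequence


def build_alto_search_args(alto_path: str,
                           terms: Sequence[str],
                           blocks: Sequence[str] = ()) -> List[str]:
    """Build an argument list for alto_search (hypothetical helper)."""
    # Assumption: the executable is on the PATH under the name "alto_search".
    args = ["alto_search", "-alto", alto_path]
    for term in terms:
        # A multi-word term stays one list element, so no extra quotes are needed.
        args += ["-term", term]
    for block_id in blocks:
        args += ["-block", block_id]
    return args
```

The resulting list can be handed to `subprocess.run(args, capture_output=True)`. That the result XML is written to standard output is an assumption here; the page does not state it.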

Examples

  • alto_search -alto C:\data\1908\1908-01-01_01\alto\1908-01-01_01-00001.xml -term "der Thron"
    Searches for the two words "der" and "Thron" where they appear consecutively in the same block.
    All blocks in the ALTO file are considered.
  • alto_search -alto /exlibris4/storage/2009/11/01/363562 -term "DIE BISCHÖFE" -term "KIRCHE" -block P2_TB00046 -block P2_TB00047
    Searches for "DIE BISCHÖFE" as two consecutive words and for the separate word "KIRCHE". The search is restricted to the two blocks P2_TB00046 and P2_TB00047. Other blocks are skipped.
  • alto_search -alto /exlibris4/storage/2009/11/01/363562 -utf8term "68,65,-61,-97,32,100,105,101"
    This searches for "DAß die" as two consecutive words. The bytes passed correspond to the UTF-8 bytes for the character string.
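
A value for -utf8term can be generated rather than typed by hand. The sketch below is an illustration in Python with a made-up function name, `to_utf8term`. It assumes that bytes above 127 must be written as negative signed 8-bit numbers, as in the last example; whether the program also accepts unsigned values such as 195 is not stated here.

```python
def to_utf8term(text: str) -> str:
    """Encode text as comma-separated signed UTF-8 byte values (hypothetical helper)."""
    signed = []
    for byte in text.encode("utf-8"):
        # Assumption: bytes 128..255 are passed as -128..-1 (signed 8-bit).
        signed.append(byte - 256 if byte > 127 else byte)
    return ",".join(str(value) for value in signed)


print(to_utf8term("DAß die"))  # 68,65,-61,-97,32,100,105,101
```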

Example output

The following is the result of running the program with these parameters:
alto_search -alto alto.xml -term "VON DEN" -term "in ihren" -term "vom nächsten 6 januar an"
Note that it finds "Vom nächsten 6. Januar an", with the dot after the number 6, because leading and trailing punctuation is removed per word, not only per term.

<?xml version="1.0" encoding="UTF-8"?>
<ResultSet>
	<Description>
		<MeasurementUnit>mm10</MeasurementUnit>
	</Description>
	<result BLOCKID="P2_TB00001" NHITS="1" CHARS_HIT="6">
		<context>Eutfer» iiKng Batiffols bewirken. Die Abberufung wurde lediglich </context>
		<hit HPOS="63" VPOS="894" WIDTH="51" HEIGHT="23">von</hit>
		<hit HPOS="131" VPOS="890" WIDTH="48" HEIGHT="27">den</hit>
		<context>Bischöfen, denen die Oberleitung des Institut Ca* tholique </context>
	</result>
	<result BLOCKID="P2_TB00003" NHITS="1" CHARS_HIT="6">
		<context>mache mir nichts aus der Uniform. Vorsitzender: Aber ich. Einige </context>
		<hit HPOS="222" VPOS="2332" WIDTH="50" HEIGHT="23">von</hit>
		<hit HPOS="286" VPOS="2329" WIDTH="48" HEIGHT="26">den</hit>
		<context>Angeklagten, die bei der Musterung aus gehoben worden </context>
	</result>
	<result BLOCKID="P2_TB00006" NHITS="1" CHARS_HIT="6">
		<context>zu allererst auf dem Plane erscheinen, ein Zeichen daß sie </context>
		<hit HPOS="746" VPOS="4080" WIDTH="50" HEIGHT="23">von</hit>
		<hit HPOS="812" VPOS="4077" WIDTH="49" HEIGHT="26">den</hit>
		<context>Neu- Wahlen entscheidende Dinge erhoffen. In vielen </context>
	</result>
	<result BLOCKID="P2_TB00036" NHITS="1" CHARS_HIT="6">
		<context>106, Waldbredimus 85, Wellen stein 526 Franken. Art. 2. </context>
		<hit HPOS="2109" VPOS="1539" WIDTH="60" HEIGHT="27">Von</hit>
		<hit HPOS="2182" VPOS="1539" WIDTH="47" HEIGHT="27">den</hit>
		<context>vorbenanntenSnbsidten werden die nach» </context>
	</result>
	<result BLOCKID="P2_TB00040" NHITS="1" CHARS_HIT="7">
		<context>Kinderschar zu wissen. Diese speisen im Lokal der Küche, jene tragen </context>
		<hit HPOS="2309" VPOS="3148" WIDTH="30" HEIGHT="27">in</hit>
		<hit HPOS="2359" VPOS="3149" WIDTH="76" HEIGHT="31">ihren</hit>
		<context>gutgefüllten Speisetöpfen die kräftige und </context>
	</result>
	<result BLOCKID="P2_TB00040" NHITS="1" CHARS_HIT="22">
		<context>Regierung den Dank der städtischen Armen verdient. </context>
		<hit HPOS="2009" VPOS="3817" WIDTH="67" HEIGHT="26">Vom</hit>
		<hit HPOS="2098" VPOS="3815" WIDTH="113" HEIGHT="33">nächsten</hit>
		<hit HPOS="2242" VPOS="3816" WIDTH="23" HEIGHT="27">6.</hit>
		<hit HPOS="2286" VPOS="3817" WIDTH="110" HEIGHT="32">Januar</hit>
		<hit HPOS="2418" VPOS="3826" WIDTH="34" HEIGHT="19">an</hit>
		<context>wird die Küche täglich, mit Ausnahme der Sonn» und </context>
	</result>
	<info>
		<time_elapsed>93</time_elapsed>
	</info>
</ResultSet>

Explanation of result XML

MeasurementUnit
The measurement unit used in the ALTO file. This is usually mm10 (tenths of a millimetre), which means that coordinates are converted to pixels with the formula pixel = mm10 * dpi / 254. The dpi value is that of the image the ALTO file corresponds to.
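
As a small worked example of the formula, the Python sketch below converts a value and a whole hit rectangle from mm10 to pixels. The function names are illustrative only, and rounding to whole pixels is a choice made for this sketch.

```python
from typing import Tuple


def mm10_to_pixels(value_mm10: float, dpi: float) -> float:
    """Apply pixel = mm10 * dpi / 254 (one inch is 254 tenths of a millimetre)."""
    return value_mm10 * dpi / 254


def hit_to_pixel_box(hpos: float, vpos: float, width: float, height: float,
                     dpi: float) -> Tuple[int, int, int, int]:
    """Convert HPOS, VPOS, WIDTH and HEIGHT to rounded pixel values."""
    return tuple(round(mm10_to_pixels(v, dpi)) for v in (hpos, vpos, width, height))
```

For an image scanned at 300 dpi, a value of 254 mm10 (one inch) corresponds to 300 pixels.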

ResultSet
The container for the whole XML document.

result BLOCKID="P1_TB00029" NHITS="1" CHARS_HIT="3"
A result with its context. The BLOCKID comes from the TextBlock's ID where the hit was found. NHITS tells you how many terms have been found (can be the same term several times). CHARS_HIT says how many characters are in the terms that were found.

context
The context around a hit. The context can appear before, after and in between hits. It consists of regular text.

hit HPOS="3291" VPOS="5211" WIDTH="157" HEIGHT="33"
A hit. This is always one single word (as defined by the segmentation in the ALTO). One word can be split into two hits if it was hyphenated on the original page. If a search term consists of several words, they will be adjacent hits. HPOS is the horizontal position, VPOS is the vertical position, WIDTH is the width and HEIGHT is the height of the block that delimits the word on the image. These values are in the unit defined in MeasurementUnit.

info
Information about the time needed to perform the search.
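
The elements described above can be read with any XML parser. The following Python sketch uses the standard `xml.etree.ElementTree` module; the function name `parse_result_set` and the dictionary layout are choices made for this illustration, not part of the program.

```python
import xml.etree.ElementTree as ET


def parse_result_set(xml_text: str):
    """Return (measurement_unit, results) from a ResultSet document (hypothetical helper)."""
    # Strip surrounding whitespace so the XML declaration is at the very start.
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    unit = root.findtext("Description/MeasurementUnit")
    results = []
    for result in root.findall("result"):
        hits = []
        for hit in result.findall("hit"):
            hits.append({
                "text": hit.text or "",
                # float() is used in case an ALTO file carries fractional coordinates.
                "hpos": float(hit.get("HPOS")),
                "vpos": float(hit.get("VPOS")),
                "width": float(hit.get("WIDTH")),
                "height": float(hit.get("HEIGHT")),
            })
        results.append({
            "block_id": result.get("BLOCKID"),
            "nhits": int(result.get("NHITS")),
            "chars_hit": int(result.get("CHARS_HIT")),
            "context": [c.text or "" for c in result.findall("context")],
            "hits": hits,
        })
    return unit, results
```

Each entry in `results` corresponds to one result element, with its hits in document order, so the words of a multi-word term appear as adjacent entries in `hits`.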

State

Stable

Programming language

C++

Software requirements

GCC compiler

Author(s) homepage

http://www.bnl.lu
http://www.eluxemburgensia.lu

Download

http://www.exlibrisgroup.org/download/attachments/26019533/alto_search.zip (Source code)

Changes

Version 1.1

Added the -utf8term parameter so that you can safely pass UTF-8 even if your programming language or console doesn't understand it.

Version 1.0

Initial Release

Release notes

Initial Release
Only works with UTF-8 encoded ALTO files

Installation instructions

Unzip the archive into a directory of your choice. Then edit the Makefile to point to your GCC compiler (its path can usually be found with the command "which gcc"). Finally, run "make" to create the executable alto_search.

Known issues

Only works with UTF-8 encoded ALTO files. Only takes UTF-8 encoded strings as input.

Page Attachments

File Name | Comment | Size | Number of Downloads
alto_search.zip | alto search | 26 kB | 47

Added by Yves Maurer on Dec 03, 2009 19:07, last edited by Yves Maurer on Dec 16, 2009 18:25
