Discussioni utente:Alex brollo/OCR.js

The starting point

We start from an identical copy of mul:MediaWiki.OCR.js; this is where the adventure begins. I am writing in English so that Phe, the author of the original script, can read this and step in if he wishes.

How the original OCR.js works

OCR.js tries to extract the hOCR of the page. hOCR is a standard XHTML format for OCR text in which each word is wrapped in a span tag. It is obtained by transforming the usual Lisp-like or XML text structure of the djvu text layer.

hOCR from Pagina:Sacre rappresentazioni I.djvu/42 will be used as an example.

This is the code for word 1 of the example page, "RAPPRESENTAZIONE":

 
<span class='ocrx_word' id='word_1' title="bbox 123 55 269 66">
    RAPPRESENTAZIONE</span> <span ....

The id contains a progressive number for the word; the title contains the keyword bbox followed by four numbers, which are the x1, y1, x2, y2 coordinates of the rectangle enclosing the word in the original djvu image, with the origin at the top-left corner of the image.
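As an illustration of how the title attribute can be read, here is a minimal sketch. The function name parseBbox is hypothetical and is not taken from OCR.js; it only assumes the "bbox x1 y1 x2 y2" layout described above.

```javascript
// Hypothetical helper (not part of OCR.js): read the "bbox x1 y1 x2 y2"
// portion of an hOCR title attribute and return it as numbers.
function parseBbox(title) {
  const m = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.exec(title);
  if (!m) return null; // no bbox found in this title
  const [x1, y1, x2, y2] = m.slice(1).map(Number);
  return { x1, y1, x2, y2, width: x2 - x1, height: y2 - y1 };
}

// Word 1 of the example page.
const word1 = parseBbox("bbox 123 55 269 66");
console.log(word1.width, word1.height); // 146 11
```

The same function also works on titles that carry extra fields, since it only looks for the bbox keyword.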

Two other pieces of data are especially interesting:

<div class='ocr_page' id='page_1' title='image "page0042.djvu"; 
   bbox 0 0 627 996; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 29 52 524 918">

The first one gives the dimensions of the whole page; the second one gives the coordinates of the whole text area within the image. As you can see, there is a 29px left margin and a 627 - 524 = 103px right margin; you can also see that "RAPPRESENTAZIONE" has a wide left indent, since it starts at x = 123, that is 94px to the right of the left edge of the text area.
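The margin arithmetic can be sketched as follows. Both function names (bboxOf, margins) are hypothetical; the input strings are the two title attributes quoted above.

```javascript
// Hypothetical helper (not part of OCR.js): extract [x1, y1, x2, y2]
// from a title attribute containing "bbox x1 y1 x2 y2".
function bboxOf(title) {
  const m = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.exec(title);
  return m ? m.slice(1).map(Number) : null;
}

// Margins of the text area inside the page, in pixels.
function margins(pageTitle, areaTitle) {
  const [px1, py1, px2, py2] = bboxOf(pageTitle);
  const [ax1, ay1, ax2, ay2] = bboxOf(areaTitle);
  return { left: ax1 - px1, top: ay1 - py1, right: px2 - ax2, bottom: py2 - ay2 };
}

const pageTitle = 'image "page0042.djvu"; bbox 0 0 627 996; ppageno 0';
const areaTitle = "bbox 29 52 524 918";
console.log(margins(pageTitle, areaTitle));
// { left: 29, top: 52, right: 103, bottom: 78 }

// Indent of word 1 relative to the left edge of the text area.
console.log(bboxOf("bbox 123 55 269 66")[0] - bboxOf(areaTitle)[0]); // 94
```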

As soon as the OCR button is clicked, a copy of the hOCR is saved into localStorage.ws_hOCR, where it persists until the next OCR click that successfully retrieves the hOCR for the current page.
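Because the saved copy persists between clicks, it may belong to a page viewed earlier, so it seems prudent to check what is stored before using it. The sketch below is only an assumption about how that could be done: loadSavedHocr is a hypothetical name, and the storage object is passed in as a parameter so the example can also run outside a browser.

```javascript
// Hypothetical helper (not part of OCR.js). `storage` is localStorage in the
// browser; any object with a ws_hOCR property works for a dry run.
function loadSavedHocr(storage) {
  const text = storage.ws_hOCR;
  // Treat anything that does not look like hOCR as "nothing saved".
  if (typeof text !== "string" || !text.includes("ocr_page")) return null;
  // The image name in the ocr_page title hints at which page the copy came from.
  const m = /image "([^"]+)"/.exec(text);
  return { hocr: text, image: m ? m[1] : null };
}

// Dry run with a stand-in object; in the browser, call loadSavedHocr(localStorage).
const fakeStorage = {
  ws_hOCR: `<div class='ocr_page' id='page_1' title='image "page0042.djvu"; bbox 0 0 627 996; ppageno 0'>`
};
console.log(loadSavedHocr(fakeStorage).image); // page0042.djvu
console.log(loadSavedHocr({}));                // null
```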

Going deeper into hOCR data

As you can see by browsing the text stored in localStorage.ws_hOCR, there are five levels of detail, each with its own class:

.ocr_page
.ocr_carea
.ocr_par
.ocr_line
.ocrx_word

<div class='ocr_page' id='page_1' title='image "page0042.djvu"; bbox 0 0 627 996; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 29 52 524 918">
    <p class='ocr_par' dir='ltr' id='par_1' title="bbox 29 52 524 918">
     <span class='ocr_line' id='line_1' title="bbox 29 52 524 918">
       <span class='ocrx_word' id='word_1' title="bbox 123 55 269 66">RAPPRESENTAZIONE</span> 
       <span class='ocrx_word' id='word_2' title="bbox 278 56 292 66">DI</span> 
       <span class='ocrx_word' id='word_3' title="bbox 301 56 359 66">ABRAMO</span>
...
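A minimal sketch of pulling the words and their coordinates out of such a string follows. The name extractWords is hypothetical, and the regular expression is tuned to the exact markup shown above (single-quoted class and id, double-quoted title); inside the browser, DOMParser with querySelectorAll(".ocrx_word") would probably be a sturdier choice.

```javascript
// Hypothetical helper (not part of OCR.js): collect every ocrx_word span
// from an hOCR string, with its progressive number, bbox and text.
function extractWords(hocr) {
  const re = /<span class='ocrx_word' id='word_(\d+)' title="bbox (\d+) (\d+) (\d+) (\d+)">\s*([^<]*?)\s*<\/span>/g;
  const words = [];
  let m;
  while ((m = re.exec(hocr)) !== null) {
    words.push({ n: +m[1], x1: +m[2], y1: +m[3], x2: +m[4], y2: +m[5], text: m[6] });
  }
  return words;
}

// The first three words of the example page.
const sample = `
  <span class='ocrx_word' id='word_1' title="bbox 123 55 269 66">RAPPRESENTAZIONE</span>
  <span class='ocrx_word' id='word_2' title="bbox 278 56 292 66">DI</span>
  <span class='ocrx_word' id='word_3' title="bbox 301 56 359 66">ABRAMO</span>`;

console.log(extractWords(sample).map(w => w.text).join(" "));
// RAPPRESENTAZIONE DI ABRAMO
```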

In fact, in Phe's implementation ocr_carea, ocr_par and ocr_line each occur only once per page, with identical coordinates, so that the text is "a single long line of text in a single paragraph in a single area"; this is why the OCR button simply loads a list of words without any line or paragraph breaks. Nevertheless, many details can be recovered by making appropriate use of the word coordinates, and the present it.source work focuses on splitting lines and computing line coordinates from the data wrapped in the ocr_page, ocr_carea and ocrx_word tags.
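One possible way to split lines from word coordinates alone is sketched below. This is an assumed heuristic, not the method of OCR.js: words are taken in reading order, and a word whose top edge lies at or below the bottom of the current line is treated as the start of a new line. The name splitIntoLines is hypothetical, and the last two words of the sample ("alpha", "beta") are invented purely to show a line break.

```javascript
// Hypothetical helper (not part of OCR.js): group words into lines and
// compute each line's bounding box. Words are {text, x1, y1, x2, y2}.
function splitIntoLines(words) {
  const lines = [];
  let cur = null;
  for (const w of words) {
    // Assumption: y grows downwards from the top-left origin, so a word that
    // starts at or below the bottom of the current line opens a new line.
    if (cur === null || w.y1 >= cur.y2) {
      cur = { x1: w.x1, y1: w.y1, x2: w.x2, y2: w.y2, words: [w.text] };
      lines.push(cur);
    } else {
      // Same line: widen the line's bounding box to include this word.
      cur.x1 = Math.min(cur.x1, w.x1);
      cur.y1 = Math.min(cur.y1, w.y1);
      cur.x2 = Math.max(cur.x2, w.x2);
      cur.y2 = Math.max(cur.y2, w.y2);
      cur.words.push(w.text);
    }
  }
  return lines.map(l => ({ text: l.words.join(" "), bbox: [l.x1, l.y1, l.x2, l.y2] }));
}

const words = [
  { text: "RAPPRESENTAZIONE", x1: 123, y1: 55, x2: 269, y2: 66 },
  { text: "DI", x1: 278, y1: 56, x2: 292, y2: 66 },
  { text: "ABRAMO", x1: 301, y1: 56, x2: 359, y2: 66 },
  // Invented words, only to illustrate a second line.
  { text: "alpha", x1: 29, y1: 80, x2: 45, y2: 91 },
  { text: "beta", x1: 54, y1: 80, x2: 90, y2: 91 }
];

for (const line of splitIntoLines(words)) console.log(line.bbox.join(" "), line.text);
// 123 55 359 66 RAPPRESENTAZIONE DI ABRAMO
// 29 80 90 91 alpha beta
```

This simple test would fail on tightly set text where descenders of one line overlap ascenders of the next; comparing each word's x1 with that of the previous word (a jump back to the left) is another signal that could be combined with it.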

(to be continued)