Discussioni utente:Alex brollo/OCR.js
The starting point
We start from an identical copy of mul:MediaWiki.OCR.js; from here the adventure begins. I am writing in English so that Phe, the author of the original script, can read along and contribute if desired.
How the original OCR.js works
OCR.js tries to extract the hOCR of the page. hOCR is a standard XHTML format for OCR text, in which words are wrapped in span tags. It is produced by transforming the usual Lisp-like or XML text structure of the djvu text layer.
hOCR from Pagina:Sacre rappresentazioni I.djvu/42 will be used as an example.
This is the code for word 1 of the example page, "RAPPRESENTAZIONE":
<span class='ocrx_word' id='word_1' title="bbox 123 55 269 66"> RAPPRESENTAZIONE</span> <span ....
The id contains a progressive number for the word; the title contains the keyword bbox followed by four numbers, the x1, y1, x2, y2 coordinates of the rectangle enclosing the word in the original djvu image, with the origin at the top-left corner of the image.
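As a minimal sketch of how such a title could be read, the snippet below pulls the four numbers out of a title string. The function name parseBbox and the width/height fields are illustrative assumptions, not part of OCR.js.

```javascript
// Hypothetical helper (not part of OCR.js): read the bbox out of an hOCR title.
function parseBbox(title) {
  // Match the keyword "bbox" followed by four integers, wherever it sits in the
  // title (page titles also carry other fields, such as image and ppageno).
  const m = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.exec(title);
  if (!m) return null; // no bbox in this title
  const [x1, y1, x2, y2] = m.slice(1).map(Number);
  return { x1, y1, x2, y2, width: x2 - x1, height: y2 - y1 };
}

const box = parseBbox("bbox 123 55 269 66"); // title of word_1
console.log(box.width, box.height); // 146 11
```

The same function works unchanged on the longer title of an ocr_page element, since it only looks for the bbox keyword.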
Two other pieces of data are especially interesting:
<div class='ocr_page' id='page_1' title='image "page0042.djvu"; bbox 0 0 627 996; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 29 52 524 918">
The first shows the dimensions of the whole page; the second shows the coordinates of the whole text area within the image. As you can see, there is a 29px left margin and a 627 - 524 = 103px right margin; you can also see that "RAPPRESENTAZIONE" has wide spacing on its left (its x1 is 123, against 29 for the text area).
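The arithmetic can be checked with a few lines of JavaScript. This is only a sketch: the object layout and the margins function are assumptions made for illustration, while the numbers are those of the example page.

```javascript
// Values copied from the example page; the names are illustrative only.
const page = { x1: 0, y1: 0, x2: 627, y2: 996 };    // ocr_page bbox
const area = { x1: 29, y1: 52, x2: 524, y2: 918 };  // ocr_carea bbox
const word1 = { x1: 123, y1: 55, x2: 269, y2: 66 }; // word_1 bbox

// Distance between each side of an inner box and the same side of an outer box.
function margins(outer, inner) {
  return {
    left: inner.x1 - outer.x1,
    right: outer.x2 - inner.x2,
    top: inner.y1 - outer.y1,
    bottom: outer.y2 - inner.y2,
  };
}

console.log(margins(page, area)); // { left: 29, right: 103, top: 52, bottom: 78 }
console.log(word1.x1 - area.x1);  // 94, the indent of word_1 inside the text area
```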
As soon as the OCR button is clicked, a copy of the hOCR is saved into localStorage.ws_hOCR, where it persists until the next OCR click that successfully fetches the hOCR for the current page.
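For anyone who wants to look at the saved copy, a small sketch follows. Only the key name ws_hOCR comes from the description above; the function readCachedHocr and its storage parameter are assumptions, chosen so that the same code can be tried outside the browser with a stand-in object.

```javascript
// Hypothetical helper, not part of OCR.js. "storage" is any object that offers
// the Web Storage getItem method, such as window.localStorage in a browser.
function readCachedHocr(storage) {
  const raw = storage.getItem("ws_hOCR");
  // getItem returns null when the key has never been written.
  return raw === null ? "" : raw;
}

// In the browser console, after the OCR button has been used at least once:
//   readCachedHocr(window.localStorage).length
```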
Going deeper into hOCR data
As you can see by browsing the text in localStorage.ws_hOCR, there are five levels of detail and five corresponding classes:
- .ocr_page
- .ocr_carea
- .ocr_par
- .ocr_line
- .ocrx_word
<div class='ocr_page' id='page_1' title='image "page0042.djvu"; bbox 0 0 627 996; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 29 52 524 918">
<p class='ocr_par' dir='ltr' id='par_1' title="bbox 29 52 524 918">
<span class='ocr_line' id='line_1' title="bbox 29 52 524 918">
<span class='ocrx_word' id='word_1' title="bbox 123 55 269 66">RAPPRESENTAZIONE</span>
<span class='ocrx_word' id='word_2' title="bbox 278 56 292 66">DI</span>
<span class='ocrx_word' id='word_3' title="bbox 301 56 359 66">ABRAMO</span> ...
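One possible way to turn the ocrx_word spans into plain objects is sketched below. It assumes the markup always has exactly the attribute order and quoting shown in the example; extractWords is an illustrative name, not a function of OCR.js, and in a browser a DOM parser would be the more robust choice.

```javascript
// Hypothetical helper: collect every ocrx_word of an hOCR string.
// The pattern relies on the exact attribute layout of the example above.
function extractWords(hocr) {
  const re = /<span class='ocrx_word' id='word_(\d+)' title="bbox (\d+) (\d+) (\d+) (\d+)">\s*([^<]*)<\/span>/g;
  const words = [];
  for (const m of hocr.matchAll(re)) {
    words.push({
      n: Number(m[1]), // progressive word number taken from the id
      x1: Number(m[2]), y1: Number(m[3]), x2: Number(m[4]), y2: Number(m[5]),
      text: m[6].trim(),
    });
  }
  return words;
}

// The three words quoted from the example page.
const sample =
  `<span class='ocrx_word' id='word_1' title="bbox 123 55 269 66">RAPPRESENTAZIONE</span> ` +
  `<span class='ocrx_word' id='word_2' title="bbox 278 56 292 66">DI</span> ` +
  `<span class='ocrx_word' id='word_3' title="bbox 301 56 359 66">ABRAMO</span>`;

console.log(extractWords(sample).map((w) => w.text).join(" ")); // RAPPRESENTAZIONE DI ABRAMO
```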
In fact, in Phe's implementation ocr_carea, ocr_par and ocr_line each occur only once per page and carry the same bbox, so that the text is "a single long line of text in a single paragraph in a single area"; this is why the OCR button simply loads a list of words without any line or paragraph breaks. Nevertheless, many details can be recovered by making appropriate use of the word coordinates, and the present work on it.source focuses on splitting lines and computing line coordinates from the data wrapped in the ocr_page, ocr_carea and ocrx_word tags.
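To make the idea of line splitting concrete, here is one possible heuristic. It is only a sketch and not necessarily the approach this work will take. It assumes the words arrive in reading order, and it starts a new line whenever a word begins to the left of the previous word's right edge or lies entirely below the current line. The function name splitIntoLines is an assumption, and the last two words of the sample (word4, word5) have invented coordinates, added only so that a second line exists.

```javascript
// Heuristic sketch: split words (in reading order) into lines and compute
// the bounding box of each line as the union of its word boxes.
function splitIntoLines(words) {
  const lines = [];
  let current = null; // line being filled
  let prev = null;    // previous word
  for (const w of words) {
    // A new line starts when the word jumps back to the left of the previous
    // word's right edge, or when it sits entirely below the current line.
    const wraps = prev !== null && (w.x1 < prev.x2 || w.y1 >= current.y2);
    if (current === null || wraps) {
      current = { x1: w.x1, y1: w.y1, x2: w.x2, y2: w.y2, words: [] };
      lines.push(current);
    } else {
      current.x1 = Math.min(current.x1, w.x1);
      current.y1 = Math.min(current.y1, w.y1);
      current.x2 = Math.max(current.x2, w.x2);
      current.y2 = Math.max(current.y2, w.y2);
    }
    current.words.push(w.text);
    prev = w;
  }
  return lines;
}

const words = [
  { text: "RAPPRESENTAZIONE", x1: 123, y1: 55, x2: 269, y2: 66 }, // from the example page
  { text: "DI", x1: 278, y1: 56, x2: 292, y2: 66 },               // from the example page
  { text: "ABRAMO", x1: 301, y1: 56, x2: 359, y2: 66 },           // from the example page
  { text: "word4", x1: 29, y1: 80, x2: 90, y2: 91 },              // invented coordinates
  { text: "word5", x1: 98, y1: 80, x2: 150, y2: 91 },             // invented coordinates
];

const lines = splitIntoLines(words);
console.log(lines.length);               // 2
console.log(lines[0].words.join(" "));   // RAPPRESENTAZIONE DI ABRAMO
console.log(lines[0].x1, lines[0].x2);   // 123 359
```

A rule this simple can be fooled, for instance by multi-column pages or by a short line followed by a line that starts further to the right on almost the same height, so it should be read as a starting point rather than a finished method.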
(to be continued)