A Two-stage Approach for Word Searching in Handwritten Document Images
DOI:
https://doi.org/10.56294/dm202554
Keywords:
Feature extraction, Antlion Algorithm for feature selection, comparative study with existing algorithm
Abstract
Introduction: Despite the rise of electronic documents, handwritten paper documents remain important. Current technologies make document digitization, storage, compression, and transmission easy and affordable. However, semi-automatic document image processing needs specialized techniques to extract document information accurately. Typed text queries are commonly used to retrieve information from digital libraries.
Objective: The words in a document generally contain varying numbers of characters. Searching for a word across the whole document can therefore introduce mismatched word images into the retrieved results, and it also increases the time needed to complete the task.
Method: With this in mind, words whose number of characters differs from that of the search word are discarded at the outset, as a preprocessing step.
Result: To confirm which of the remaining words on the document page are probable matches for the search word, a voting-based approach is used. A modified HOG feature descriptor is extracted from each word image, and five distance-matching metrics are then calculated and fed to a voting scheme that uses a threshold value for each metric, calculated beforehand.
Conclusion: Three types of voting are performed. In the first two, varying numbers of metrics vote on whether a word is the search word. In the last, three distance metrics are used, and if more than one of them votes positive, the model marks the word as a match for the search word.
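The third voting scheme described above (three distance metrics, with a match declared when more than one votes positive) can be illustrated with a minimal sketch. The abstract does not name the metrics or give threshold values, so the Euclidean, Manhattan, and cosine distances below, the function names, and every threshold are assumptions chosen for illustration only; the feature vectors stand in for the modified HOG descriptors.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a: Vector, b: Vector) -> float:
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Assumed metric set; the paper's actual metrics are not listed in the abstract.
METRICS: Dict[str, Callable[[Vector, Vector], float]] = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "cosine": cosine_distance,
}


def count_votes(query: Vector, candidate: Vector,
                thresholds: Dict[str, float]) -> int:
    """Count the metrics whose distance is within that metric's threshold."""
    return sum(
        1 for name, metric in METRICS.items()
        if metric(query, candidate) <= thresholds[name]
    )


def is_match(query: Vector, candidate: Vector,
             thresholds: Dict[str, float], min_votes: int = 2) -> bool:
    """Majority vote: a match needs more than one positive vote out of three."""
    return count_votes(query, candidate, thresholds) >= min_votes
```

In this sketch each metric votes independently against its own precomputed threshold, so a candidate can still be accepted when one metric disagrees. Raising or lowering `min_votes` gives a stricter or looser scheme under the same assumptions.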
License
Copyright (c) 2025 Ankur Goyal, Pronita Mukherjee, Dipra Mitra, Shiv Kant, Khalid Almalki, Suliman Mohamed Fati (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
The article is distributed under the Creative Commons Attribution 4.0 License. Unless otherwise stated, associated published material is distributed under the same licence.