Improved Algorithm for Blob Detection in Document Images

Abstract: As the field of image processing has evolved, many application areas have been identified, including Character Recognition, Pattern Matching, Finger Print Recognition, Document Analysis and Verification, and many more. All these applications are based on extracting useful information from an image, which requires image objects to be detected and extracted. In a binary image, these objects are called blobs. Blobs may be 4-connected or 8-connected. Various algorithms have been proposed for detecting blobs [5]. Because blobs have complex shapes [1], blob detection is more time consuming than other operations such as edge detection or threshold calculation. As it is the backbone of many fundamental operations, it needs to be fast and accurate. This paper proposes a fast algorithm for blob detection. The proposed algorithm detects 8-connected blobs within an image. It is a two-scan detection algorithm, optimized so that the number of comparisons with neighboring pixels is reduced. The reduced number of comparisons makes it nearly 1.5 times faster than the other algorithms discussed in this paper.
Keywords: Blob, Label, Optimization, Two pass, 8-connected
Introduction
In today's digital world, paper based documents are scanned and stored as digital images. To retrieve data from these images, blobs need to be identified. Blob detection is the backbone of many applications such as Character Recognition, Finger Print Recognition, and Document Analysis and Recognition. Most of the processing time is consumed by blob labeling and extraction rather than by other image processing steps. A lot of work has been done on blob labeling algorithms. The algorithms studied here are:
1. Stack Algorithm [7]
2. Two pass scanning algorithm [2], [5]
3. Union Find Algorithm [2], [5]
4. Run Based Algorithm [1]
5. Contour Tracing Algorithm [7]
Stack Algorithm
The image is scanned until an unlabeled data pixel is found, and a new label is assigned to it. All neighbors of the current data pixel are assigned the same label and pushed onto a stack. The value at the top of the stack is then popped, and the neighbors of the pixel it refers to are given the same label and pushed onto the stack in turn. This continues while the stack is not empty. Once the stack is empty, all the pixels belonging to the same blob carry the same label. The next unlabeled data pixel is then searched for and assigned a new label. This process is repeated until all the blobs in the image are labeled [7]. The basic disadvantage of the stack based algorithm is that the stack can become very large.
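The stack procedure described above can be sketched as follows. This is a minimal illustration, not the implementation from [7]; the function name `label_blobs_stack` and the representation of the image as a list of rows of 0/1 values are assumptions made for the example.

```python
def label_blobs_stack(image):
    """Label 8-connected blobs in a binary image (1 = data pixel, 0 = background).

    Returns the label grid and the number of blobs found.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] == 0 or labels[y][x] != 0:
                continue
            # Unlabeled data pixel found: start a new blob.
            next_label += 1
            labels[y][x] = next_label
            stack = [(x, y)]
            while stack:
                cx, cy = stack.pop()
                # Label and push every unlabeled 8-neighbor of the popped pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = next_label
                            stack.append((nx, ny))
    return labels, next_label
```

In the worst case the stack holds a large share of a blob's pixels at once, which is the memory concern noted above.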
Two Pass Scanning Algorithm
In the two pass scanning algorithm, the image is scanned from left to right twice. During the first pass, a mask containing all previously labeled neighbor pixels is used. The minimum label in the mask is assigned to the current pixel, and label equivalence information is stored in an equivalence table. During the second pass, each label is replaced with its root label from the equivalence table [2]. In the two pass algorithm, each new label has an entry in the equivalence table, so at least NL * NL memory space is required, where NL is the number of provisional labels assigned during the first pass. Also, during the first pass, four neighbors have to be compared for each object pixel to find the minimum label, and the equivalence table is modified for each of them. For large images, this leads to a significant increase in processing time.
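A compact sketch of the two passes is given below. It is illustrative only: the equivalence table is modeled as a dictionary of shared label sets rather than an NL * NL table, which is an assumption made for brevity, and the name `two_pass_label` is hypothetical.

```python
def two_pass_label(image):
    """Classical two-pass labeling of 8-connected blobs (1 = object, 0 = background)."""
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    groups = {}      # provisional label -> shared set of equivalent labels
    next_label = 0
    # First pass: assign provisional labels and record equivalences.
    for y in range(height):
        for x in range(width):
            if image[y][x] == 0:
                continue
            # Mask of already scanned neighbors: left, upper-left, upper, upper-right.
            mask = []
            for dx, dy in ((-1, 0), (-1, -1), (0, -1), (1, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and ny >= 0 and labels[ny][nx]:
                    mask.append(labels[ny][nx])
            if not mask:
                next_label += 1
                labels[y][x] = next_label
                groups[next_label] = {next_label}
                continue
            labels[y][x] = min(mask)
            # All labels seen in the mask are equivalent: merge their sets.
            merged = set()
            for lab in mask:
                merged |= groups[lab]
            for lab in merged:
                groups[lab] = merged
    # Second pass: replace each provisional label with the smallest equivalent one.
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                labels[y][x] = min(groups[labels[y][x]])
    return labels
```

The four mask lookups per object pixel are the comparisons that the optimized variants try to avoid.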
Union Find Algorithm
The Union Find Algorithm is an optimization of the two pass scanning algorithm. As a first optimization, a decision tree is used to minimize the number of neighbor comparisons. As a second optimization, equivalent labels are resolved using a union find mechanism. The union find data structure is implemented as a single dimensional array, and array access is fast, which saves a lot of time [2], [5].
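A union find structure stored in a single array can be sketched as below. The helper names `find_root` and `union`, the path compression step, and the choice of the smaller label as root are assumptions of this example, not details taken from [2] or [5].

```python
def find_root(parent, label):
    """Return the root of `label`, compressing the path on the way back."""
    root = label
    while parent[root] != root:
        root = parent[root]
    # Point every label on the visited path directly at the root.
    while parent[label] != root:
        next_label = parent[label]
        parent[label] = root
        label = next_label
    return root


def union(parent, a, b):
    """Merge the sets containing labels a and b; the smaller root is kept."""
    root_a = find_root(parent, a)
    root_b = find_root(parent, b)
    if root_a != root_b:
        parent[max(root_a, root_b)] = min(root_a, root_b)
```

Here `parent[i] == i` marks label `i` as a root, so a fresh structure for n labels is simply `list(range(n))`.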
Contour Tracing Algorithm
This is also a two pass algorithm in a sense. In the contour tracing technique, a scan line is moved row by row, and as soon as it encounters the boundary of an unlabeled object, all boundary pixels of that blob are traced and assigned the same label. The algorithm is based on the fact that a blob is determined by its border. This process is repeated until all border pixels are labeled [4]. The same process is repeated to find internal contours. During the second pass, all pixels between two contours are given the same label.
Run Based Two Scan Labeling Algorithm
During the first scan, run data is obtained. If the current run is not connected to any run in the row above the scan row, it is assigned a new provisional label. If it is connected to runs in the row above, it is assigned the same provisional label as the leftmost connected run, the provisional label sets corresponding to all connected runs are merged, and the smallest label among them is used as the representative label. During the second scan, each provisional label is replaced with its representative label [1]. In this algorithm, however, a merging operation has to be performed whenever the current run is connected to a previous run, and the minimum label has to be found in the label set.
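The run data and the connectivity test between runs in adjacent rows can be illustrated with the short sketch below. The representation of a run as an inclusive `(start, end)` pair and the function names are assumptions made for this example; the merging of provisional label sets from [1] is not shown.

```python
def extract_runs(row):
    """Return the runs of consecutive object pixels in a row as (start, end) pairs."""
    runs = []
    start = None
    for x, value in enumerate(row):
        if value and start is None:
            start = x                      # a run begins
        elif not value and start is not None:
            runs.append((start, x - 1))    # the run ended at the previous pixel
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def runs_connected(current, above):
    """True if two runs in adjacent rows touch under 8-connectivity."""
    return above[0] <= current[1] + 1 and above[1] >= current[0] - 1
```

The one-pixel margin in `runs_connected` accounts for diagonal neighbors; for 4-connectivity the margin would be dropped.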
In document images, the background usually occupies a larger share of the area than the foreground. Processing such images unit by unit, where a unit is a pixel, is time consuming. Our algorithm works on 8*1 blocks. If a complete block represents background data, we simply proceed to the next block. For large document images, processing 8 units at a time saves a lot of time.
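The benefit of 8*1 blocks can be seen in a small sketch in which each block is packed into one integer, so that an all-background block is recognized with a single comparison against zero. The packing layout (leftmost unit in the most significant bit) and the function names are assumptions of this example; the paper does not specify how blocks are stored.

```python
def pack_row(row):
    """Pack a row of 0/1 units into 8*1 blocks, one integer (0..255) per block."""
    blocks = []
    for start in range(0, len(row), 8):
        chunk = row[start:start + 8]
        value = 0
        for bit in chunk:
            value = (value << 1) | bit
        value <<= 8 - len(chunk)   # pad a short final block with background
        blocks.append(value)
    return blocks


def object_blocks(blocks):
    """Return the indices of blocks that contain at least one object unit."""
    return [index for index, block in enumerate(blocks) if block != 0]
```

Only the blocks returned by `object_blocks` need unit-level processing; every other block is skipped after one test.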
OUR PROPOSED ALGORITHM
Our proposed algorithm can be divided into four phases, as shown in Fig. 1: the Analysis Phase, the Scanning and Provisional Labeling Phase, the Root Label Generation Phase, and the Final Label Assignment Phase.
Fig 1: Various phases in our proposed algorithm.
Analysis Phase
In an N*M image, N represents the width of the image and M represents its height. Our algorithm scans each row from left to right until the first block containing an object unit is encountered, and the location of that block in the grid is stored in a vector. This phase generates a vector of size M, which is used in the other phases as a start index vector.
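The analysis phase can be sketched as follows, assuming the image is a list of M rows of 0/1 units. The function name `start_index_vector` and the use of -1 for rows without any object unit are assumptions of this example.

```python
BLOCK = 8  # units per 8*1 block


def start_index_vector(image):
    """For each row, return the index of the first block holding an object unit.

    Rows that contain only background get -1 (a convention chosen for this sketch).
    """
    starts = []
    for row in image:
        first = -1
        block_count = (len(row) + BLOCK - 1) // BLOCK
        for block_index in range(block_count):
            if any(row[block_index * BLOCK:(block_index + 1) * BLOCK]):
                first = block_index
                break
        starts.append(first)
    return starts
```

Later phases can begin each row at `starts[y] * BLOCK` instead of at column 0.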
Scanning and Provisional Labeling Phase
The following notation is used in this phase.
$u_{x,y}$: Unit currently being processed, with coordinates $(x, y)$.
$L(z)$: Label of unit $z$.
$n_p$: Count of provisional labels assigned so far.
$B_C$: Block currently being processed.
$B_A$: Block just above $B_C$.
$V_b$: The set of background units.
$V_o$: The set of object units.
$T$: The connected pair table.
$T_i$: The $i$th pair in the connected pair table.
In this phase, the image is scanned left to right, starting from the location where the first object unit appears. This information is retrieved from the vector generated during the analysis phase. Provisional labels are assigned to each object unit and their connectivity information is recorded in a table. Unlike other two pass algorithms, our algorithm does not use a mask to determine the provisional label. It assigns a provisional label using the procedure given in equation (1).
$$L(u_{x,y}) = \begin{cases} L(u_{x-1,y}) & \text{if } u_{x-1,y} \in V_o \\ \text{a new label} & \text{otherwise} \end{cases} \qquad (1)$$
If the unit under processing is $u_{x,y}$ and $u_{x-1,y}$ is an object unit, then $u_{x,y}$ is given the same label as $u_{x-1,y}$. Otherwise, it is assigned a new label. The proposed algorithm processes the image in 8*1 blocks. Let $B_C$ represent the block currently being processed and $B_A$ the block just above $B_C$. Our algorithm takes advantage of the following optimizations.
Case 1: If $u_{x,y} \in V_o$, assign a label to $u_{x,y}$ according to (1). If $\forall k \in B_A : k \in V_b$, the only units in the row above that can connect to $B_C$ lie just outside $B_A$. For the first unit of the block, if $u_{x-1,y-1}$ is an object unit as shown in Fig. 2(a), $L(u_{x-1,y-1})$ and $L(u_{x,y})$ are added to $T$ as shown in Fig. 2(b). Similarly, for the last unit of the block, $u_{x+1,y-1}$ needs to be examined. If it is an object unit, $L(u_{x+1,y-1})$ and $L(u_{x,y})$ are added to $T$ as shown in Fig. 2(b).
Case 2: If $u_{x,y} \in V_o$, assign a label to $u_{x,y}$ according to (1). If $\forall k \in B_A : k \in V_o$, or $B_A = B_C$, then for every unit $k \in B_C$ whose upper neighbor $u_{x,y-1}$ is an object unit, $L(u_{x,y-1})$ and $L(u_{x,y})$ are added to $T$ and $i$ is incremented. Refer to Fig. 3(a) and 3(b).
Default: If $u_{x,y} \in V_o$, assign a label to $u_{x,y}$ according to (1). If $u_{x-1,y} \in V_o$, $u_{x,y-1} \in V_b$ and $u_{x+1,y-1} \in V_o$, then $L(u_{x+1,y-1})$ and $L(u_{x,y})$ are added to $T$. If $u_{x-1,y} \in V_b$, the units $u_{x,y-1}$, $u_{x-1,y-1}$ and $u_{x+1,y-1}$ need to be checked in that order. As soon as a labeled unit is found, its label and $L(u_{x,y})$ are added to $T$. Refer to Fig. 4(a) and 4(b).
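The labeling rule in (1) and the construction of the connected pair table T can be illustrated with a deliberately simplified sketch. It is not the proposed algorithm: the block-level shortcuts of Case 1, Case 2 and the default case are left out, and every object unit in the three upper neighbor positions is simply paired with the current unit. The function name and data layout are assumptions of this example.

```python
def provisional_labels(image):
    """Assign provisional labels by rule (1) and collect connected label pairs.

    Returns (labels, pairs, count) where `pairs` plays the role of table T.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    pairs = []        # connected pair table T, in order of discovery
    seen = set()      # avoids storing the same pair twice
    count = 0         # provisional labels assigned so far
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            if x > 0 and image[y][x - 1]:
                labels[y][x] = labels[y][x - 1]   # rule (1): copy the left label
            else:
                count += 1
                labels[y][x] = count              # rule (1): new label
            if y == 0:
                continue
            # Simplification: check all three upper neighbors for every unit.
            for nx in (x - 1, x, x + 1):
                if 0 <= nx < width and image[y - 1][nx]:
                    pair = (labels[y - 1][nx], labels[y][x])
                    if pair[0] != pair[1] and pair not in seen:
                        seen.add(pair)
                        pairs.append(pair)
    return labels, pairs, count
```

Because rule (1) looks only at the left neighbor, vertically adjacent units receive different provisional labels, and their relationship is carried entirely by the pair table.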
After the scanning and provisional labeling phase is complete, a connected pair list is obtained that holds the connectivity information of all labels in the image. This connected pair list is used in the root label generation phase.
Root Label Generation Phase
Using the connected pair table, the root label corresponding to each provisional label is generated. The root label is found with a union find data structure, which is very efficient in terms of processing time. This phase produces a mapping vector that maps each provisional label to its root label. Refer to Fig. 6.
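A minimal sketch of this phase is shown below, assuming provisional labels run from 1 to `count` and the connected pair table is a list of label pairs. The function name, the path halving step, and the choice of the smaller label as root are assumptions of this example.

```python
def root_label_mapping(pairs, count):
    """Build the mapping vector: mapping[label] is the root label of `label`.

    Index 0 is reserved for background and maps to itself.
    """
    parent = list(range(count + 1))

    def find(label):
        # Path halving: each visited label is re-pointed to its grandparent.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(label) for label in range(count + 1)]
```

Root labels produced this way are not necessarily consecutive; renumbering them would be a separate, optional step.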
Final Label Assignment Phase
Once the mapping vector is obtained, it is used to assign final labels to each blob. After this phase, all units of a blob carry the same label, which is unique to that blob. Refer to Fig. 7.
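The final assignment amounts to one lookup per unit. A minimal sketch, assuming a grid of provisional labels with 0 for background and a mapping vector indexed by provisional label (both names are hypothetical):

```python
def assign_final_labels(labels, mapping):
    """Replace every provisional label with its root label from the mapping vector."""
    return [[mapping[label] for label in row] for row in labels]
```

For example, with provisional labels `[[1, 0, 2], [3, 0, 4], [5, 5, 5]]` and a mapping vector in which labels 1 to 5 all map to 1, every object unit ends up with label 1, since the units form a single blob.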
Let us look at how the various phases interact with each other. Consider the image represented in Fig. 5(a). The image is scanned left to right and provisional labels are assigned. After provisional labels have been assigned, the connected pair list is generated as shown in Fig. 5(b). During the root label generation phase, each label is resolved to its parent label, higher up in the hierarchy, as represented in Fig. 6. During this phase, the mapping between root labels and provisional labels is created.
In the final phase, provisional labels are replaced with their root labels.
After the final phase, every blob is labeled with a unique label, and units within the same blob get the same label. Refer to Fig. 7.
EXPERIMENTAL RESULTS
To test the performance of the proposed algorithm, all of the algorithms mentioned above were implemented in Java. Our testing environment was a PC running Windows XP with 3 GB RAM and a 2.2 GHz Core 2 Duo processor. All the algorithms were tested on the same set, consisting of four different kinds of images: invoices, check images, images containing a lot of textual data, and images with a lot of pictorial data.
Table I shows the average time taken by the various algorithms on different kinds of images of various sizes over 2000 runs. The statistics show that for invoices, our algorithm runs four times faster than the two scan algorithm and nearly 1.5 times faster than all the other algorithms.
Fig. 8 plots execution time against the number of blobs in the image. For any number of blobs, the execution time of the proposed algorithm is significantly lower than that of the other algorithms.
Fig. 9 shows how execution time varies with image size for the various algorithms. The figure shows that, for any number of pixels, the proposed algorithm runs faster than the conventional algorithms.
Conclusion
In this paper, we have presented a new blob detection algorithm that uses the two pass algorithm as its base and employs a number of optimizations to reduce its execution time. Experiments were performed on four different kinds of images of different sizes. The results show that, for invoices, the proposed algorithm is nearly 1.5 times faster than the algorithms compared in Table I. For check images, execution time is reduced to half by the proposed algorithm. For pictorial and textual images, it is 1.5 times faster.
References
[1] L. He, Y. Chao, and K. Suzuki, "A run-based two-scan labeling algorithm," IEEE Trans. Image Process., vol. 17, no. 5, pp. 749-756, 2008.
[2] K. Wu, E. Otoo, and A. Shoshani, "Optimizing connected component labeling algorithms," Proc. SPIE, vol. 5747, pp. 1965-1976, 2005.
[3] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Prentice Hall, 2008.
[4] F. Chang, C. J. Chen, and C. J. Lu, "A linear-time component-labeling algorithm using contour tracing technique," Comput. Vis. Image Understand., vol. 93, pp. 206-220, 2004.
[5] C. Fiorio and J. Gustedt, "Two linear time Union-Find strategies for image processing," Theor. Comput. Sci., vol. 154, pp. 165-181, 1996.
[6] L. He, Y. Chao, K. Suzuki, and K. Wu, "Fast connected-component labeling," Pattern Recogn., vol. 42, no. 9, pp. 1977-1987, 2009.
[7] M. Jankowski and J. P. Kuska, "Connected components labeling - algorithms in Mathematica, Java, C++ and C#."
