Final Paper CS332
This paper analyses the system presented in "Reading Text in the Wild with Convolutional Neural Networks" by Jaderberg et al. (2016). In that article, the authors propose an end-to-end method for text spotting. The first task, text detection, is performed by weak detectors, and the resulting proposals are then filtered and refined. The second task, word recognition, is accomplished by a deep convolutional neural network. Before discussing the system, I present an introduction to word processing in the human brain; understanding how the brain recognizes text is fundamental to appreciating why the system developed by Jaderberg and colleagues outperforms previous text spotting methods. In the subsequent sections, I explain the details of the two stages of the pipeline, as well as its results relative to other text spotting methods.
Text Recognition in the Human Brain
Text is a relatively recent invention and, given this short time scale, a brain region devoted to text recognition is unlikely to have evolved. Modern Homo sapiens is thought to be about 200,000 years old, whereas writing was first invented around 5,000 years ago (Reinhardt, 2005). In addition, most human populations had little exposure to text until relatively recently, after the invention of the printing press. Studies using fMRI, however, have reported a stronger response to alphabetic strings than to other visual stimuli (e.g., faces, houses, checkerboards) in the left occipitotemporal cortex, a region adjacent to the fusiform gyrus known as the visual word form area (VWFA; Figure 1) (Cohen, Dehaene et al. 2000, Cohen, Lehericy et al. 2002, Hasson, Levy et al. 2002). This activation is present in literate subjects across different languages and writing systems (Bolger, Perfetti et al. 2005). Moreover, the VWFA is equally activated by real words and readable pseudowords, suggesting that this area is tuned to the orthographic regularities of the language (Cohen, Lehericy et al. 2002). These results in the Roman alphabet have also been replicated with real and pseudo characters in Chinese (Liu, Zhang et al. 2008).
Figure 1. The visual word form area (Cohen, Lehericy et al. 2002). The left panel shows the VWFA, a region of the left-hemispheric fusiform cortex more responsive to letters and words than to control stimuli in fMRI studies. The green squares come from individual subjects, whereas the yellow squares represent group analysis. The right panel shows the average BOLD signal for words and consonants versus checkerboards in both visual fields.
If the visual word form area is not encoded in our genome, how can we explain its consistency across individuals and cultures? Recent studies suggest that, within the structural constraints of cortical organization in the visual system, experience-driven plasticity can lead to a specialization process. For instance, Baker, Liu et al. (2007) tested the experience dependency of the VWFA by showing that this region is activated more strongly by Hebrew words in readers than in nonreaders of that language. In addition, subjects' orthographic familiarity seems to be correlated with a stronger blood-oxygen-level-dependent (BOLD) response in the VWFA (Binder, Medler et al. 2006). Other studies using fMRI rapid adaptation techniques suggest that neurons in the VWFA respond selectively to individual real words (i.e., words known by the subjects) (Glezer, Jiang et al. 2009).
Because the VWFA results from a functional reorganization of the visual system driven by experience, it is not surprising that word recognition follows the postulates of object recognition in the visual cortex (Riesenhuber and Poggio 2002). Consider, for example, how the VWFA recognizes words at different levels, from characters (Baker, Liu et al. 2007) and syllables or letter combinations (Binder, Medler et al. 2006) up to whole real words (Glezer, Jiang et al. 2009). Furthermore, in fMRI studies, the response of the VWFA to words shows invariance across visual features such as letter case, size, orientation, and font. For instance, the VWFA is equally activated (versus fixation) by words whether they are presented as “pure-case words” (e.g., hello world) or as “alternating-case words” (e.g., hElLo WoRlD) (Polk and Farah 2002). The case insensitivity of the VWFA has also been reported in word masking and unconscious repetition priming: fMRI and event-related potential (ERP) responses in the VWFA were sensitive to word repetition, independently of changes in letter case (Dehaene, Naccache et al. 2001).
Taken together, this evidence from neuroimaging reveals three key ideas about word recognition in the human brain. First, the specialization of the visual word form area results from functional reorganization in the visual cortex driven by experience. Second, the response of the VWFA is sensitive to spelling and to the reader's experience with a particular writing system, but it is invariant across other visual features such as letter case. Third, there is evidence of a hierarchical organization in the process of word recognition (i.e., recognizing letters, combinations of letters, and whole words). The third idea will be especially important when discussing Jaderberg and colleagues’ system for text recognition in computer vision: I will use these concepts to argue that their text recognition system resembles word processing in the human brain and is superior to previous methods.
Text Recognition Using Convolutional Neural Networks
In computer vision, the detection and recognition of words in natural scene images is a problem with important applications. In the modern urban world, text is present almost everywhere: traffic signs, labels, digital screens, and billboards, just to mention a few examples. In this context, an automatic text spotting system could have relevant applications for visually impaired people, for translating text from images, and for analyzing or retrieving textual content from video or image databases. Recognizing text in the wild, however, is not an easy task. Unlike text in black-and-white documents, text in scene images varies greatly in visual features such as lighting, occlusion, size, alignment, orientation, and noise. This is why the challenges for text spotting methods in the wild are greater than those addressed by standard text recognition techniques for documents (e.g., OCR). In the next section, I explain Jaderberg and colleagues’ end-to-end system for text spotting (Figure 2). First, I describe their method for text detection, then the convolutional neural network (CNN) for text recognition, and finally, I discuss their results.
Figure 2. An overview of the text spotting system (Jaderberg, Simonyan et al. 2016). A) The first step of text detection uses weak detectors to generate proposals. B) Those proposals are filtered with a stronger classifier. C) The bounding box of the word proposals is refined using regression in a CNN. D) Word recognition is performed using a CNN. E) The outputs of the CNN are merged and ranked in order to eliminate false positives and duplicates. F) The proposals that pass the threshold are taken as the final results.
Before performing the laborious task of text recognition, the system identifies, filters, and refines the text proposals that will go into the CNN (Figure 2a-c). Early in text detection, a tradeoff between precision and recall (the fraction of true positives found) was necessary to reduce the complexity of and time devoted to this task. That is, the authors chose a fast detection method with high recall and low precision (many false positives). To achieve this tradeoff, they selected two weak classifiers whose proposals were then filtered and refined before text recognition.
Weak classifiers for proposal generation
Edge Boxes is a weak detector developed by Zitnick and Dollár (2014). The idea behind this method is simple: the number of edges wholly enclosed in a box indicates how likely that box is to contain an object (Figure 3). Because words are combinations of letters with clearly defined contours, they are detected well by this method. Following Zitnick and Dollár's approach, Jaderberg et al. used a sliding window at different scales to evaluate the probability that each box b contains text. The boxes are then scored, ranked, and removed if they overlap with a higher-ranked box. Those boxes with scores above a threshold are taken as the candidate bounding boxes Be.
Figure 3. Edge boxes: a weak detector for text. The wholly enclosed edges of an image indicate how likely that box is to contain an object, in this case, text.
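The scoring-and-ranking step described above can be sketched in a few lines. This is a deliberately simplified, hypothetical version (the real Edge Boxes method works on edge groups and affinities, not raw points): score each box by the edge points it wholly encloses, then greedily keep high-scoring boxes that do not overlap an already-kept, higher-ranked box.

```python
# Hypothetical simplification of edge-box scoring and ranking:
# a box's score is the number of edge points strictly inside it.
def box_score(box, edge_points):
    x1, y1, x2, y2 = box
    return sum(1 for (x, y) in edge_points if x1 < x < x2 and y1 < y < y2)

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def rank_boxes(boxes, edge_points, score_thresh=2, overlap=0.5):
    """Greedy ranking: drop boxes below the score threshold or
    overlapping an already-kept, higher-scoring box."""
    scored = sorted(boxes, key=lambda b: box_score(b, edge_points), reverse=True)
    kept = []
    for b in scored:
        if box_score(b, edge_points) < score_thresh:
            continue
        if all(iou(b, k) < overlap for k in kept):
            kept.append(b)
    return kept
```

The surviving boxes play the role of the candidate set Be; the thresholds here are placeholders, not the paper's values.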
The second cheap classifier is a trained detector called the aggregate channel features (ACF) detector. Jaderberg and coworkers adopted the efficient ACF framework presented by Dollár et al. (2014). The ACF detector is a sliding-window method that applies an adaptive boosting (AdaBoost) classifier to a collection of channel features (Figure 4). The channel features are designed to extract and reduce information from a given image; in this case, Jaderberg et al. computed the normalized gradient magnitude, a histogram of oriented gradients (HOG), and a grayscale version of the image. To reduce the information in these channels, the authors divided each channel into blocks, smoothed them, summed the pixels in each block, and smoothed the result again, yielding the aggregate channel features. Next, the trained AdaBoost classifier was applied. As in the Viola and Jones face detector (2004), AdaBoost builds an accurate classification rule for detecting text by combining weak, simple features. The detector was applied with a sliding window at multiple scales to account for words of different lengths, and the proposals above a threshold were taken as the final box proposals Bd. Finally, the candidate bounding boxes identified by both weak classifiers (Edge Boxes and ACF) were passed to the next stage.
Figure 4. Overview of the aggregate channel features detector (Dollár et al., 2014). The ACF classifier takes an image and extracts channels such as HOG, normalized gradient magnitude, and grayscale. Each channel is then divided into blocks, and the pixels in each block are summed. The result is flattened into a vector, to which the boosted classifier is applied.
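The channel-aggregation step can be illustrated with a minimal sketch. This is a hypothetical reduction of ACF: only two channels (grayscale and gradient magnitude, omitting HOG and the smoothing passes), with each channel summed over fixed-size blocks and flattened into a feature vector.

```python
import numpy as np

def gradient_magnitude(gray):
    """Per-pixel gradient magnitude via finite differences."""
    gy, gx = np.gradient(gray.astype(float))
    return np.hypot(gx, gy)

def aggregate_channels(gray, block=4):
    """Sum each channel over block x block cells and flatten.
    Simplified stand-in for ACF: real ACF also uses HOG channels
    and smoothing before and after aggregation."""
    channels = [gray.astype(float), gradient_magnitude(gray)]
    feats = []
    for ch in channels:
        h, w = ch.shape
        h, w = h - h % block, w - w % block          # crop to block grid
        blocks = ch[:h, :w].reshape(h // block, block, w // block, block)
        feats.append(blocks.sum(axis=(1, 3)).ravel())  # one sum per block
    return np.concatenate(feats)
```

The resulting vector is what a boosted classifier such as AdaBoost would consume at each sliding-window position.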
Filtering and Refinement of Proposals
In Jaderberg and coworkers’ text spotting pipeline, filtering and refinement occur in two stages. First, a classifier stronger than Edge Boxes and ACF is used to reject false positives among the candidate bounding boxes. For this, Jaderberg et al. opted for a random forest classifier (Breiman, 2001). This method is a binary classifier (i.e., word/no-word) acting on the HOG features of each bounding box (Figure 5a, b). The forest classifier simply rejects proposals that fall below a certain threshold, keeping the candidates that are most likely to be words. Once most of the false positives have been rejected, the bounding box of each proposal must be refined to produce the whole-word images that will be fed to the text recognition CNN. Because the classifiers used an overlap ratio of 0.5, a proposal may overlap with only half of the groundtruth (Figure 5c); that is, a bounding box can be accurate in width but not in height, or vice versa. The solution proposed by the authors was a CNN that regresses the groundtruth bounding box from each candidate bounding box (Figure 5d). In short, the input to this network is an image of fixed width and height containing the bounding box at its center. The bounding box is inflated by a factor of two, and its coordinates are encoded relative to the cropped image, providing the CNN with enough context to predict the refined proposal. The network is trained on example pairs of input image and groundtruth bounding box. After filtering with the random forest classifier and refining with the CNN bounding box regression, the proposals are finally ready for the most computationally expensive task: text recognition.
Figure 5. Filtering and refinement of proposals. A) An example of the HOG features used in the stronger classifier. B) The random forest classifier separates words from no-words according to the HOG features. C) Image showing the problem with the word's bounding box caused by the 0.5 overlap ratio used in the weaker detectors. D) Before and after the CNN bounding box regression. Green shows the groundtruth bounding box; red is the regressed box.
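The sample preparation for the regression CNN can be sketched as follows. This is a hypothetical encoding consistent with the description above (inflate the proposal by a factor of two around its center, crop that region, and express the box coordinates relative to the crop); the paper's exact parameterization may differ.

```python
def inflate_and_encode(box, factor=2.0):
    """Inflate a proposal (x1, y1, x2, y2) around its center and
    encode its coordinates relative to the inflated crop, so a CNN
    can regress a refined box from local context."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    crop = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    # box coordinates normalized to the crop's frame of reference
    encoded = ((x1 - crop[0]) / w, (y1 - crop[1]) / h,
               (x2 - crop[0]) / w, (y2 - crop[1]) / h)
    return crop, encoded
```

With a factor of two, the proposal always maps to the center quarter of the crop, so the network's regression target is the groundtruth box expressed in the same normalized frame.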
The CNN for text recognition is composed of five convolutional layers and three fully connected layers. Each output neuron corresponds to a word from an English dictionary of 90,000 words, and each input is one of the generated proposals. The network recognizes which word w in the dictionary corresponds to the input image by ranking the probability of each word given the input image x (Figure 6); the dictionary word with the highest probability is the best match. A minor limitation of using a CNN is that the input image must have a predefined size. This condition, however, did not hurt the performance of the network, since the horizontal distortion itself provided information about the length of the word.
Figure 6. Overview of the text recognition CNN (Jaderberg, Simonyan et al. 2016). The network is composed of five convolutional layers and three fully connected layers. The final layer corresponds to the words of the dictionary used for recognition. The input is a whole-word image. The bounding box proposal is associated with the dictionary word that has the highest probability of being its match.
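The final classification step amounts to a softmax over the lexicon followed by an argmax. The sketch below assumes hypothetical raw scores from the network's last layer and a toy three-word dictionary; the real system uses 90,000 output neurons.

```python
import numpy as np

def recognize(scores, dictionary):
    """Turn one raw score per dictionary word into probabilities
    (numerically stable softmax) and return the best match."""
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # softmax over the lexicon
    best = int(np.argmax(probs))
    return dictionary[best], float(probs[best])
```

Ranking the full probability vector (rather than taking only the argmax) is what later allows the merging and ranking stage to compare overlapping candidates.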
Given that the CNN takes a whole-word image as its input, it had to be trained on word images. Although some databases of street view text are available (Wang et al., 2011), the size and variety of the word images in these datasets are considerably limited; because of this constraint, other text recognition methods generally approach the problem by developing character classifiers. The authors' solution was a synthetic data set of their own. Their premise was that most text in natural scenes is rendered in fonts available on computers, and that other text features such as alignment, texture, and lighting effects could be imitated. Varying these features, Jaderberg et al. created single-word image samples, each composed of three image layers: background, foreground, and border/shadow. The generation of synthetic data consisted of six steps (Figure 7):
- Font rendering: choosing a random font from a catalog of 1400 fonts
- Border/shadow rendering: altering the border size and the shadow
- Base coloring: changing the color of the layers in the context of natural images
- Projective distortion: distorting the view of the sample to simulate the 3D world
- Natural data blending: mixing the samples with textures from natural scenes
- Noise: introducing Gaussian noise and other artefacts to the image
Overall, the large synthetic data set created in this process produced a diverse range of samples without the need for real-world data. In addition, the authors had the flexibility to choose the words from the dictionary used in the CNN. This rich synthetic data set allowed the authors to train the CNN on whole-word image samples.
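The six steps above can be sketched as a pipeline of composable stages. This is a hypothetical skeleton only: each stage here records a placeholder parameter on a sample dictionary instead of performing actual image rendering, and the font names and parameter ranges are invented for illustration.

```python
import random

FONTS = ["Serif-A", "Sans-B"]          # stand-ins for the 1400-font catalog

# Each stage mimics one generation step by attaching its parameters.
def render_font(s):    s["font"] = random.choice(FONTS); return s
def render_border(s):  s["border"] = random.randint(0, 3); return s
def base_color(s):     s["colors"] = ("fg", "bg", "shadow"); return s
def project(s):        s["warp"] = random.uniform(-0.1, 0.1); return s
def blend(s):          s["texture"] = "natural-crop"; return s
def add_noise(s):      s["noise_sigma"] = random.uniform(0, 4); return s

PIPELINE = [render_font, render_border, base_color, project, blend, add_noise]

def synth_sample(word):
    """Run one dictionary word through all six generation stages."""
    sample = {"word": word}
    for step in PIPELINE:
        sample = step(sample)
    return sample
```

Because every stage draws its parameters at random, running the pipeline repeatedly over the dictionary yields an arbitrarily large and varied training set, which is the key property the authors exploit.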
Figure 7. Synthetic Training Data (Jaderberg, Simonyan et al. 2016). A) The process of creating word image samples from words in the dictionary. B) Examples of images used in the synthetic training data set generated by the authors.
Merging and Ranking
At this point, I have described how Jaderberg and coworkers’ system generates, filters, and refines word bounding boxes, and how these word images are matched to their most likely words by the CNN. However, duplicates and false positives must still be eliminated before a final answer is produced. The authors performed merging and ranking according to the requirements of the text recognition task, that is, whether the task is text spotting (general word search) or image retrieval (specific word search). In the case of text spotting, there were two major problems: multiple candidate outputs for the same word (duplicates), and different words that partially overlap. To reject duplicates and find the actual word among overlapping candidates, the authors performed non-maximum suppression (NMS). The key idea is that this method works as “positional voting” for a specific word: the candidate with the best score is taken as the real output. In the case of image retrieval, the system computes the probability that an image contains the query word, and the images with the best scores are processed. This classification allows the system to retrieve images rapidly from large databases.
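The per-word suppression idea can be sketched as follows. This is a minimal, hypothetical version of NMS over recognized candidates: among overlapping detections of the same word, only the highest-scoring one survives (the paper also describes a cross-word variant, omitted here).

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def word_nms(detections, thresh=0.3):
    """detections: list of (box, word, score). Greedily keep the
    best-scoring detection; suppress same-word overlaps."""
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        box, word, _ = det
        if all(w != word or iou(box, k) < thresh for k, w, _ in kept):
            kept.append(det)
    return kept
```

The overlap threshold of 0.3 here is illustrative; the essential behavior is the greedy, score-ordered "positional voting" described above.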
Now that I have explained the processes of proposal generation, filtering, bounding box regression, CNN recognition, and merging and ranking of candidate words, it is time to evaluate the performance of the system in text spotting. According to the standards in the field (Wang et al., 2011), text spotting algorithms should disregard words containing non-alphanumeric characters and words shorter than three characters. In addition, a result is considered valid only if the bounding box has at least 0.5 overlap with the groundtruth. Following these rules, Jaderberg and colleagues compared their system with previous end-to-end text spotting methods on several databases. Across all datasets, their pipeline was far superior to all previous methods (Figure 8). Furthermore, the performance of their system was slightly better when the required overlap with the groundtruth was reduced to 0.3.
Figure 8. Text recognition using a CNN on whole-word images is superior to previous text spotting methods in natural scenes. The proposed method is superior across all the datasets; most of the previous methods were focused on character recognition. In addition, a decrease in the required overlap with the groundtruth box improves the measured performance of the system.
When reviewing the previous methods discussed by Jaderberg and coworkers, I noticed that they mostly focused on character recognition as a route to identifying words. For instance, Jaderberg et al. (2014) had essentially the same pipeline using a CNN, but for character classification instead of whole-word recognition. Similarly, Neumann and Matas (2013) developed a method of character detection and recognition by combining a sliding window with an algorithm that operates on “strokes of specific orientations”, convolving the image gradient field with a set of oriented bar filters. Alsharif and Pineau (2014) developed an end-to-end text recognition method with hybrid HMM maxout models, which combines the character and word recognition problems by starting with character recognition and then proceeding to word recognition. However, none of these methods comes close to performing as well as the proposed whole-word CNN method.
Jaderberg and colleagues’ whole-word text spotting system has proven superior to character-based or hierarchically dependent models. I think this draws an interesting parallel between this computer vision system and word processing in the brain. As mentioned before, there are neurons that respond preferentially to whole words, particularly real words. Although the VWFA is also responsive to characters, behavioral studies suggest that people with more reading experience tend to recognize words as entities instead of recognizing each letter individually (Grainger, Lete et al. 2012). As in humans, the whole-word approach with a CNN may lead to more efficient computer systems for text recognition. This approach was a particular advantage when detecting disjoint, occluded, and blurry word images (Figure 9). On the other hand, the system usually failed when it encountered slanted or vertical text, which makes sense because the authors did not model such instances in their framework. In addition, sub-words or multiple adjacent words tended to generate false-positive results.
Figure 9. Text spotting results (Jaderberg, Simonyan et al. 2016). Some examples of the text recognition results in the proposed method. The red bounding boxes are the groundtruth, whereas the green boxes represent the bounding boxes that the algorithm predicted. Notice the small and blurry text recognized by the system in the first image.
In this paper, I had the opportunity to study word processing both in the brain and in a computer vision system. I learned that, despite the relatively recent invention of writing, literate humans have a brain region (the VWFA) that responds preferentially to characters and words, particularly whole real words. With respect to the end-to-end text reading pipeline of Jaderberg et al., I learned that it is possible to detect and recognize whole words in natural scenes using a CNN and synthetic training data. I was also able to appreciate how the complexity of the pipeline increased as it moved from text detection to word recognition, and how the same simple weak detectors that we studied in class (e.g., for face and object detection) were also useful for text detection. It was also interesting to see the CNN employed in multiple tasks, such as text recognition and bounding box regression. Finally, this text recognition pipeline could improve in recognizing unknown words, words in the same alphabet but a different language, and even arbitrary strings. New models could fall back to a lower level of character recognition when a whole word is not recognized by the CNN; by combining these methods, the problem of failing on new words or vertical text might be solved.
Alsharif, O., & Pineau, J. (2014). End-to-end text recognition with hybrid HMM maxout models. In International conference on learning representations.
Baker, C. I., J. Liu, L. L. Wald, K. K. Kwong, T. Benner and N. Kanwisher (2007). “Visual word processing and experiential origins of functional selectivity in human extrastriate cortex.” Proc Natl Acad Sci U S A 104(21): 9087-9092.
Binder, J. R., D. A. Medler, C. F. Westbury, E. Liebenthal and L. Buchanan (2006). “Tuning of the human left fusiform gyrus to sublexical orthographic structure.” Neuroimage 33(2): 739-748.
Bolger, D. J., C. A. Perfetti and W. Schneider (2005). “Cross-cultural effect on the brain revisited: universal structures plus writing system variation.” Hum Brain Mapp 25(1): 92-104.
Cohen, L., S. Dehaene, L. Naccache, S. Lehericy, G. Dehaene-Lambertz, M. A. Henaff and F. Michel (2000). “The visual word form area: spatial and temporal characterization of an initial stage of reading in normal subjects and posterior split-brain patients.” Brain 123 ( Pt 2): 291-307.
Cohen, L., S. Lehericy, F. Chochon, C. Lemer, S. Rivaud and S. Dehaene (2002). “Language-specific tuning of visual cortex? Functional properties of the Visual Word Form Area.” Brain 125(Pt 5): 1054-1069.
Dehaene, S., L. Naccache, L. Cohen, D. L. Bihan, J. F. Mangin, J. B. Poline and D. Riviere (2001). “Cerebral mechanisms of word masking and unconscious repetition priming.” Nat Neurosci 4(7): 752-758.
Glezer, L. S., X. Jiang and M. Riesenhuber (2009). “Evidence for highly selective neuronal tuning to whole words in the “visual word form area”.” Neuron 62(2): 199-204.
Grainger, J., B. Lete, D. Bertand, S. Dufau and J. C. Ziegler (2012). “Evidence for multiple routes in learning to read.” Cognition 123(2): 280-292.
Hasson, U., I. Levy, M. Behrmann, T. Hendler and R. Malach (2002). “Eccentricity Bias as an Organizing Principle for Human High-Order Object Areas.” Neuron 34(3): 479-490.
Jaderberg, M., K. Simonyan, A. Vedaldi and A. Zisserman (2016). “Reading Text in the Wild with Convolutional Neural Networks.” International Journal of Computer Vision 116(1): 1-20.
Liu, C., W. T. Zhang, Y. Y. Tang, X. Q. Mai, H. C. Chen, T. Tardif and Y. J. Luo (2008). “The Visual Word Form Area: evidence from an fMRI study of implicit processing of Chinese characters.” Neuroimage 40(3): 1350-1361.
Polk, T. A. and M. J. Farah (2002). “Functional MRI evidence for an abstract, not perceptual, word-form area.” J Exp Psychol Gen 131(1): 65-72.
Riesenhuber, M. and T. Poggio (2002). “Neural mechanisms of object recognition.” Curr Opin Neurobiol 12(2): 162-168.
Viola, P. and M. J. Jones (2004). “Robust real-time face detection.” International journal of computer vision 57(2): 137-154.