Home Research Publications

 

 


Bayesian Integrate And Shift (BIAS) model

 

The objective of this project is to develop a model for learning visual object categories that is inspired by salient properties of human vision. Among the requirements is the ability to:

  • integrate information from a large number of features

  • learn new object categories from only a few examples

  • recognize an object within a scene without "pre-segmentation"

  • improve recognition confidence as more information (e.g. from previous fixations) becomes available

Although some current computer vision systems incorporate one or two of the above properties, none of them meets all of these requirements. Typically, state-of-the-art systems require thousands of training examples, which usually have to be manually segmented. Similarly, in order to recognize an object within a scene, these systems have to be supplied with a pre-segmented, and often normalized, section of the image and cannot handle the whole image. In some situations, instead of processing a section of the image, a recognition system requires a collection of local regions (e.g. in feature-based approaches). However, an optimal procedure for finding either a section of the image or a collection of local regions does not yet exist, and various heuristics are used instead. Note that the number of features supplied to the recognition system has to be small, in some cases due to computational complexity, but in general due to the curse of dimensionality. Although numerous methods for reducing the dimensionality of the feature space exist, they come at a price: excluding some features can also remove important information. Finally, conventional recognition systems treat recognition as a one-time matching (using a conveniently defined objective function). This should be contrasted with the human visual system, where recognition is a process of integrating information, refining recognition, and improving confidence.

We have recently introduced a biologically inspired algorithm for learning object categories that uses Bayesian inference to: a) integrate information from different local regions of the scene, given a fixation point, and b) integrate information from different fixations [1]. In our model, an object is represented as a collection of features of specific classes arranged at specific locations with respect to the location of the fixation point (FP). Even though the number of feature detectors that we use is large, we show that learning does not require a large number of training examples. This is because we introduce an intermediate representation, object views, between an object and its features, and thus reduce the dependence among the feature detectors. In order to capture variations in feature locations, we introduce a grid of receptive fields whose sizes increase with the distance from the FP and therefore resemble a fovea-like distribution. As a consequence, the accuracy of estimating feature locations is high only for the features that are close to the FP. To improve location estimates for the features that are further away from the FP, the recognition system has to select a new fixation point and make a "saccade".
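The idea of integrating information across fixations can be illustrated with a minimal sketch of sequential Bayesian updating. This is not the model from [1]: it assumes, for illustration only, that the evidence gathered at each fixation is conditionally independent given the hypothesis, and it ignores object views, feature locations, and receptive fields. The function names and the hypothesis labels "face" and "background" are hypothetical.

```python
def update_posterior(prior, likelihoods):
    """One Bayesian update: posterior is proportional to prior * likelihood.

    prior       -- dict mapping each hypothesis to its current probability
    likelihoods -- dict mapping each hypothesis to P(evidence | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


def integrate_fixations(prior, fixation_likelihoods):
    """Fold in the evidence from each fixation in turn.

    Assumption (illustrative): fixations are conditionally independent
    given the hypothesis, so the posterior after one fixation serves as
    the prior for the next. Returns the posterior after every fixation.
    """
    posterior = dict(prior)
    history = []
    for likelihoods in fixation_likelihoods:
        posterior = update_posterior(posterior, likelihoods)
        history.append(posterior)
    return history


if __name__ == "__main__":
    prior = {"face": 0.5, "background": 0.5}
    # Three fixations, each weakly favouring "face" (made-up numbers).
    evidence = [{"face": 0.6, "background": 0.4}] * 3
    for i, post in enumerate(integrate_fixations(prior, evidence), start=1):
        print(f"after fixation {i}: P(face) = {post['face']:.3f}")
```

In this toy setting each fixation only weakly favours one hypothesis, yet the posterior for that hypothesis grows with every additional fixation, which mirrors the requirement that confidence improve as more information becomes available.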

The training is done in a supervised way, and the procedure that we use draws inspiration from the way humans often learn to recognize objects. As opposed to the training procedure used in most supervised learning algorithms, where the whole object is assigned to one class, our training procedure requires more detailed labeling of objects. The environment that we constructed and implemented in Matlab (see example) allows the user to mark a section of an object as a region that contains points that constitute the same view. Once the user marks a specific region (e.g. the region around the right eye below), the system samples the points within the region (makes "fixations") and updates various learning parameters. Note that during the training procedure the input to the system is the whole image, and the system learns object categories by learning different object views. During the recognition process, the system outputs the probability that a fixation point represents the center of a specific object view. Therefore, the system provides much more detailed information about the object in addition to its class.

The following few images illustrate the point. The system was trained to recognize a face from 7 different view-points or views (e.g. right eye, left eye, tip of the nose, etc.) using different numbers of training examples. The system was then tested on faces of people that were not used for training by randomly choosing fixation points within a new image. In the images below, the blue crosses denote the locations of the FPs that the system associated with correct views, and the red crosses are the FPs that the system associated with the "background". Note that in this case the background consists of all the points that were not part of the view regions.
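Both training (sampling points within a marked view region) and testing (randomly choosing fixation points within a new image) rely on drawing fixation points from an image area. A minimal sketch of that step follows. It assumes, purely for illustration, that the area is an axis-aligned rectangle in pixel coordinates and that points are drawn uniformly; the original Matlab environment may represent regions and sample them differently. The function name and the example coordinates are hypothetical.

```python
import random


def sample_fixations(region, n_points, seed=None):
    """Draw fixation points uniformly from a rectangular region.

    region   -- (x_min, y_min, x_max, y_max) in pixels, bounds inclusive
                (assumed rectangular for this sketch)
    n_points -- number of fixation points to draw
    seed     -- optional seed, so that a run can be reproduced
    """
    x_min, y_min, x_max, y_max = region
    if x_min > x_max or y_min > y_max:
        raise ValueError("region bounds are inverted")
    rng = random.Random(seed)
    return [
        (rng.randint(x_min, x_max), rng.randint(y_min, y_max))
        for _ in range(n_points)
    ]


if __name__ == "__main__":
    # Hypothetical region marked around one view (e.g. an eye).
    view_region = (120, 80, 150, 100)
    for x, y in sample_fixations(view_region, n_points=5, seed=0):
        print(f"fixation at ({x}, {y})")
```

For testing on a whole image, the same function can be called with the region set to the full image bounds.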

 

The following two figures illustrate the performance of the system using the equilibrium point measure. In contrast to many conventional algorithms that often require thousands of training examples, our system can learn a new object category from only a few examples (e.g. as few as 3), while at the same time it can utilize the outputs of a large number of feature detectors (e.g. over 1,000). The system can easily learn other object categories, not just faces, and the performance always improves as the system makes more fixations; see [1] for more details. All the tests were done on images from the Caltech database (www.vision.caltech.edu).
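This page does not define the equilibrium point measure. The sketch below assumes one common reading, namely the operating point at which recall equals precision, and should be treated as an illustration of that assumption rather than as the evaluation code used in [1]. The function name and the example scores are hypothetical.

```python
def equilibrium_point(scores, labels):
    """Find the threshold at which recall and precision are closest.

    scores -- classifier outputs, higher means more likely positive
    labels -- True for positive examples, False for negatives
    Returns (threshold, recall, precision).

    Assumption: "equilibrium point" is read here as the point where
    recall == precision; the source page does not spell this out.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    best = None
    # Try every observed score as a decision threshold.
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        recall = tp / n_pos
        # tp + fp >= 1 because t is itself one of the scores.
        precision = tp / (tp + fp)
        gap = abs(recall - precision)
        if best is None or gap < best[0]:
            best = (gap, t, recall, precision)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]       # made-up outputs
    labels = [True, True, False, True, False, False]
    t, recall, precision = equilibrium_point(scores, labels)
    print(f"threshold={t}, recall={recall:.3f}, precision={precision:.3f}")
```

Under this reading, a higher recall (and precision) at the equilibrium point indicates better performance.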

 

 

 

  •  [1] P. Neskovic, L. Wu and L. N. Cooper. Learning by Integrating Information Within and Across Fixations. Lecture Notes in Computer Science: Artificial Neural Networks - ICANN, Vol. 4132, pp. 488-497, 2006. Extended version: IBNS Technical Report 2006-01.


 
