Bayesian Integrate And Shift (BIAS) model
The objective of this project is to
develop a model for learning visual object categories that is
inspired by salient properties of human vision. Among the
requirements are the ability to:
- integrate information from a large number of features
- learn new object categories from only a few examples
- recognize an object within a scene without "pre-segmentation"
- improve recognition confidence as more information (e.g. from previous fixations) becomes available
Although some current computer vision systems incorporate one or two of
the above properties, none of them meets all of these
requirements. Typically, state-of-the-art systems require
thousands of training examples that usually have to be manually
segmented. Similarly, in order to recognize an object within a
scene, these systems have to be supplied with a pre-segmented, and often
normalized, section of the image and cannot handle the whole image.
In some situations, instead of processing a section of the image, a
recognition system requires a collection of local regions (e.g. in
feature-based approaches). However, an optimal procedure for
finding either a section of the image or a collection of local
regions does not yet exist and instead various heuristics are in
use. Note that the number of features that are supplied to the
recognition system has to be small, in some cases due to
computational complexity, but in general due to the curse of
dimensionality. Although numerous methods for reducing the
dimensionality of the feature space exist, they come at a
price: excluding a feature can also remove important
information. Finally, conventional recognition systems treat
recognition as a one-time matching (using a conveniently defined
objective function). This contrasts with the human visual
system, where recognition is a process of integrating
information, refining recognition, and improving confidence.
We have recently
introduced a biologically inspired algorithm for learning object
categories that uses Bayesian inference to: a) integrate information
from different local regions of the scene, given a fixation point,
and b) integrate information from different fixations [1]. In our
model, an object is represented as a collection of features of
specific classes arranged at specific locations with respect to the
location of the fixation point (FP). Even though the number of
feature detectors that we use is large, we show that learning does
not require a large number of training examples. This is because
we introduce an intermediate representation, object views, between
an object and its features, and thus reduce the
dependence among the feature detectors. In order to capture
variations in feature locations, we introduce a grid of receptive
fields whose sizes increase with the distance from the FP and
therefore resemble a fovea-like distribution. As a
consequence, the accuracy of estimating feature locations is high
only for the features that are close to the FP. To improve location
estimates for the features that are further away from the FP, the
recognition system has to select a new fixation point and make a
"saccade".
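The fovea-like grid described above can be sketched as follows. This is only an illustration under stated assumptions, not the layout used in [1]: the ring arrangement, the linear growth of field size with distance, and the names `receptive_field_size`, `build_grid`, and `assign_field` are all hypothetical.

```python
import math


def receptive_field_size(distance, base_size=4.0, growth=0.5):
    """Side length of a field whose center lies `distance` pixels from the FP.

    Linear growth is an assumption made for this sketch.
    """
    return base_size + growth * distance


def build_grid(n_rings=3, fields_per_ring=8, ring_spacing=10.0):
    """Return (dx, dy, size) tuples giving field centers relative to the FP."""
    grid = [(0.0, 0.0, receptive_field_size(0.0))]  # central field on the FP
    for ring in range(1, n_rings + 1):
        radius = ring * ring_spacing
        for k in range(fields_per_ring):
            angle = 2.0 * math.pi * k / fields_per_ring
            grid.append((radius * math.cos(angle),
                         radius * math.sin(angle),
                         receptive_field_size(radius)))
    return grid


def assign_field(grid, fp, feature_xy):
    """Index of the field whose center is nearest to the feature location."""
    dx = feature_xy[0] - fp[0]
    dy = feature_xy[1] - fp[1]
    return min(range(len(grid)),
               key=lambda i: (grid[i][0] - dx) ** 2 + (grid[i][1] - dy) ** 2)


grid = build_grid()
near = assign_field(grid, fp=(100, 100), feature_xy=(101, 100))
far = assign_field(grid, fp=(100, 100), feature_xy=(130, 100))
# A feature is only known to lie somewhere inside its field, so the larger
# field far from the FP gives a coarser location estimate.
print(grid[near][2], grid[far][2])
```

Because a feature's position is resolved only to the size of the field it falls in, moving the FP closer to a distant feature (a "saccade") places it in a smaller field and tightens the estimate.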
Training is supervised, and the procedure that we use draws inspiration from the
way humans often learn to recognize objects. Unlike the training
procedure used in most supervised learning algorithms, where the
whole object is assigned to one class, our training procedure
requires more detailed labeling of objects. The environment that we
constructed and implemented in Matlab (see
example) allows the user
to mark a section of an object as a region that contains points that
constitute the same view. Once the user marks a specific
region (e.g. the region around the right eye below), the system
samples the points within the region (makes "fixations") and updates
various learning parameters. Note that during the training
procedure the input to the system is the whole image and the system
learns object categories by learning different object views.
During the recognition process, the system outputs the probability
that a fixation point represents the center of a specific object
view. Therefore, the system provides much more detailed
information about the object in addition to its class.
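The training and recognition steps described above can be illustrated with a small sketch. It is not the model from [1]: it assumes a rectangular marked region, one observed feature class per receptive field per fixation, conditional independence between fields given the view (naive Bayes), a uniform prior over views, and Laplace smoothing. The names `sample_fixations` and `ViewModel` are hypothetical.

```python
import math
import random


def sample_fixations(region, n, rng):
    """Draw n fixation points uniformly from a marked region (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    return [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in range(n)]


class ViewModel:
    """Counts of (view, field, feature class), with a naive Bayes posterior."""

    def __init__(self, views, n_feature_classes, alpha=1.0):
        self.views = list(views)
        self.n_feature_classes = n_feature_classes
        self.alpha = alpha      # Laplace smoothing constant (assumption)
        self.fixations = {}     # view -> number of training fixations
        self.counts = {}        # (view, field, feature_class) -> count

    def update(self, view, observations):
        """Record one training fixation; observations are (field, class) pairs."""
        self.fixations[view] = self.fixations.get(view, 0) + 1
        for field, feature_class in observations:
            key = (view, field, feature_class)
            self.counts[key] = self.counts.get(key, 0) + 1

    def _log_likelihood(self, view, observations):
        total = 0.0
        den = self.fixations.get(view, 0) + self.alpha * self.n_feature_classes
        for field, feature_class in observations:
            num = self.counts.get((view, field, feature_class), 0) + self.alpha
            total += math.log(num / den)
        return total

    def posterior(self, observations):
        """P(view | observations) under a uniform prior over views."""
        logs = {v: self._log_likelihood(v, observations) for v in self.views}
        top = max(logs.values())
        weights = {v: math.exp(l - top) for v, l in logs.items()}
        z = sum(weights.values())
        return {v: w / z for v, w in weights.items()}


rng = random.Random(0)
model = ViewModel(["right_eye", "background"], n_feature_classes=3)
# Toy observations stand in for real feature-detector outputs.
for _ in sample_fixations((40, 30, 60, 45), n=3, rng=rng):
    model.update("right_eye", [(0, 1), (1, 2)])
    model.update("background", [(0, 0), (1, 0)])
print(model.posterior([(0, 1), (1, 2)]))
```

The returned dictionary plays the role of the per-view probabilities mentioned above: for each fixation point it reports how likely that point is to be the center of each learned view, rather than a single class label.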
The following few
images illustrate the point. The system was trained to
recognize a face from 7 different viewpoints or views (e.g.
right eye, left eye, tip of the nose) using different numbers
of training examples. The system was then tested on faces of
people that were not used for training by randomly choosing fixation
points within a new image. In the images below, the blue crosses
denote the locations of the FPs that the system associated with
correct views and the red crosses are the FPs that the system
associated with the "background". Note that in this case the
background consists of all the points that were not part of the view
regions.
The following two figures illustrate
the performance of the system using the equilibrium point measure.
In contrast to many conventional algorithms that often require
thousands of training examples, our system can learn a new object
category from only a few examples (e.g. as few as 3) while at the
same time using the outputs of a large number of feature
detectors (e.g. over 1,000). The system can easily learn other
object categories, not just faces, and the performance always
improves as the system makes more fixations; see [1] for more
details. All the tests were done on images from the Caltech database (www.vision.caltech.edu).
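The effect of additional fixations on confidence can be illustrated with a sequential Bayesian update. This is a minimal sketch, assuming that the evidence from different fixations is conditionally independent given the hypothesis; the function `integrate_fixations` and the likelihood values are hypothetical and do not come from [1].

```python
def integrate_fixations(prior, likelihoods_per_fixation):
    """Return the posterior over hypotheses after each successive fixation.

    prior: dict mapping hypothesis -> probability.
    likelihoods_per_fixation: list of dicts mapping
        hypothesis -> P(evidence from that fixation | hypothesis).
    """
    posterior = dict(prior)
    history = []
    for likelihood in likelihoods_per_fixation:
        # Bayes' rule: the previous posterior acts as the new prior.
        unnormalized = {h: posterior[h] * likelihood[h] for h in posterior}
        z = sum(unnormalized.values())
        posterior = {h: p / z for h, p in unnormalized.items()}
        history.append(posterior)
    return history


prior = {"face": 0.5, "background": 0.5}
# Three fixations, each giving weak evidence in favor of "face".
evidence = [{"face": 0.6, "background": 0.4}] * 3
for step, post in enumerate(integrate_fixations(prior, evidence), start=1):
    print(step, round(post["face"], 3))
```

In this toy setting each fixation is only weakly informative, yet the posterior for the favored hypothesis grows with every update. The growth depends on the evidence consistently favoring one hypothesis, which is a property of the example rather than a guarantee.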
[1] P. Neskovic, L. Wu and L. N. Cooper. Learning by Integrating Information Within and Across Fixations. Lecture Notes in Computer Science: Artificial Neural Networks - ICANN, Vol. 4132, pp. 488-497, 2006. (Extended version: IBNS Technical Report 2006-01.)