Issues regarding invariant recognition in deep hierarchical networks

The montage below shows the mac activations at the first three levels of a Sparsey model for an original snippet (top row) and a 20% noisy version of that snippet (bottom row). Columns 1-3 correspond to levels L1-L3, respectively. Note that for simplicity we used an old-style model here, i.e., one with a rectangular, rather than hexagonal, topology at each level and with non-overlapping L1 U-RFs. This makes the issue easier to describe, but the same issue exists for the hexagonal, overlapping L1 U-RF architecture as well. The input is 64x64 pixels and the snippets are 8 frames long. At each level, the pixels actually represented by the active macs (generally a subset of all active pixels) are shown in black. L1 is a 16x16 array of macs, each having a 4x4 pixel aperture onto the input surface.
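To make the geometry concrete, here is a minimal Python/NumPy sketch (our own illustration, not Sparsey code; the binary 0/1 frame representation and all names are assumptions) that maps a 64x64 frame onto the 16x16 grid of non-overlapping 4x4 L1 apertures and counts the active pixels seen by each mac.

```python
import numpy as np

INPUT_SIZE = 64   # 64x64 binary input frame
L1_GRID = 16      # 16x16 array of L1 macs
APERTURE = 4      # each L1 mac sees a non-overlapping 4x4 pixel aperture

def aperture_counts(frame):
    """Number of active pixels in each L1 mac's U-RF (a 16x16 array of
    counts), for a 64x64 binary (0/1) frame."""
    assert frame.shape == (INPUT_SIZE, INPUT_SIZE)
    # Axes 1 and 3 index the pixels inside each 4x4 aperture.
    blocks = frame.reshape(L1_GRID, APERTURE, L1_GRID, APERTURE)
    return blocks.sum(axis=(1, 3))
```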

The main thing to see is that even though we applied a very simple noise transform to the original training snippet, i.e., randomly choosing 10% of the active pixels and moving each by up to 1 pixel in x and y, the set of active L1 macs (at any given frame) for the original snippet differs significantly from the active set for the noisy snippet. The same is true at the higher levels as well. In this simulation, the L1 U-RFs were 4x4 pixels and the L1 π bounds were set so that an L1 mac becomes active only if it has exactly 4 active pixels in its U-RF. Applying the noise generally changes the number of active pixels in many of the L1 U-RFs. Thus, on any given frame, some L1 macs activate for the learning snippet but not for the noisy test snippet, and vice versa. Furthermore, many of the L1 U-RFs that have four active pixels in both snippets on a given frame nevertheless have different exact patterns of active pixels. Thus, there are two levels of variation that Sparsey (or any model) must address in order to recognize the noisy snippet as an instance of the same class as the learning snippet (see the sketch following the list below).

  1. Variation in which macs are active at a learning moment versus the analogous testing moment.  Note that since the code in an active mac generally represents a particular maximally active feature, we can think of this as variation in which features are present between the learning and testing moments.
  2. Variation in the exact pattern present in the U-RF of a given mac between the learning moment and the testing moment.  Again, since we can think of the code activated in a mac as representing a feature (though we emphasize that a crucial differentiator between Sparsey and localist models such as HMAX is that the single active code in a mac represents not only a particular maximally active feature, but in fact the likelihood distribution over all features stored in the mac), we can think of this as variation in the precise appearance of a feature between the learning and testing moments.

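Continuing the sketch above, the following illustrates the noise transform and how the two types of variation can be measured on a single frame. The jitter function is our paraphrase of the transform described in the text, the exactly-4-active-pixels rule stands in for the π-bound setting used in this simulation, and the function names are purely illustrative.

```python
def jitter(frame, fraction=0.10, max_shift=1, rng=None):
    """Move a randomly chosen `fraction` of the active pixels by up to
    `max_shift` pixels in x and y (the noise transform described above)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = frame.copy()
    ys, xs = np.nonzero(noisy)
    picks = rng.choice(len(ys), size=int(round(fraction * len(ys))), replace=False)
    moves = []
    for i in picks:
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        ny = int(np.clip(ys[i] + dy, 0, INPUT_SIZE - 1))
        nx = int(np.clip(xs[i] + dx, 0, INPUT_SIZE - 1))
        moves.append((ys[i], xs[i], ny, nx))
    for y, x, _, _ in moves:
        noisy[y, x] = 0        # erase the chosen pixels...
    for _, _, ny, nx in moves:
        noisy[ny, nx] = 1      # ...and redraw them at their jittered positions
    return noisy

def active_l1_macs(frame):
    """L1 macs that activate under the exactly-4-active-pixels rule."""
    return aperture_counts(frame) == APERTURE

def variation_types(learn_frame, test_frame):
    """Count the two kinds of variation between a learning frame and the
    corresponding (noisy) test frame."""
    a, b = active_l1_macs(learn_frame), active_l1_macs(test_frame)
    # Type 1: macs active for one frame but not the other.
    type1 = np.logical_xor(a, b).sum()
    # Type 2: macs active for both frames whose U-RFs contain different
    # exact 4x4 pixel patterns.
    blocks_a = learn_frame.reshape(L1_GRID, APERTURE, L1_GRID, APERTURE)
    blocks_b = test_frame.reshape(L1_GRID, APERTURE, L1_GRID, APERTURE)
    differs = (blocks_a != blocks_b).any(axis=(1, 3))
    type2 = (a & b & differs).sum()
    return type1, type2
```

Run on each frame of an 8-frame snippet, this yields exactly the two counts discussed above: macs whose activation state differs between learning and test (type 1), and macs that are active in both cases but see a different precise input pattern (type 2).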
And if the pattern of active L1 macs on a given frame differs between the learning and test snippets, this generally causes the numbers of active L1 macs in the L2 mac U-RFs to differ as well, leading to different sets of L2 macs being active for the learning and test snippets.  This same principle operates all the way up the hierarchy (the model in this simulation actually had 9 levels).
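To make the propagation concrete, the hypothetical continuation below assumes, purely for illustration (the actual L2 geometry and π bounds of this simulation are not given here), that each L2 mac pools a 4x4 block of L1 macs and activates when at least a threshold number of them are active.

```python
L2_GRID = 4     # assumption: a 4x4 array of L2 macs...
L2_POOL = 4     # ...each pooling a non-overlapping 4x4 block of L1 macs
L2_PI_MIN = 8   # illustrative lower pi bound on active L1 macs in an L2 U-RF

def active_l2_macs(l1_active):
    """Illustrative L2 rule: an L2 mac activates if at least L2_PI_MIN of
    the L1 macs in its (assumed 4x4) U-RF are active."""
    blocks = l1_active.reshape(L2_GRID, L2_POOL, L2_GRID, L2_POOL)
    return blocks.sum(axis=(1, 3)) >= L2_PI_MIN

# Because the jitter changes which L1 macs are active, the counts inside the
# L2 U-RFs change too, so active_l2_macs(active_l1_macs(learn_frame)) and
# active_l2_macs(active_l1_macs(jitter(learn_frame))) generally differ, and
# the same effect repeats at every higher level.
```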

Sparsey's existing mechanisms address the second type of variation listed above. We are currently investigating general solutions to the first.  As suggested above, the first kind of variation can be thought of as variation in which features are present across instances of a given class.  This is the level at which purely symbolic "AI" systems have focused in addressing the problem of recognition: such systems typically represent classes (of objects, events, concepts, etc.) as lists of features.  Although Wittgenstein showed that exhaustive lists of necessary and sufficient features cannot be a formal basis for defining classes, much of routine intelligent processing, i.e., of one's thought stream, and certainly of one's language stream, seems to heavily involve features. We are therefore happy to be able to define, in Sparsey, a formal mechanistic/representational difference between the two types of variation. While the feature scales present in the levels portrayed above are smaller than the scales typically treated in cognitive models, the same dichotomy of variation types applies all the way up Sparsey's hierarchy. Thus, we anticipate this dichotomy will be very helpful in demonstrating much higher-level invariances, e.g., the ability of a model to recognize moving objects/people as whole limbs or parts of limbs (features) become occluded and visible over time, with arbitrary schedules of occlusion across features.