Episodic Memory of Weizmann HOF Snippets

This page describes a 2-level (non-hierarchical) version of Sparsey that is able to learn detailed episodic memory traces of all 90 Weizmann snippets that have been preprocessed using dense-grid HOF filtering (applied within a bounding box (BB) around the human actor).  Episodic memory is memory for the detailed (spatiotemporal) pattern of features comprising individual instances; it contrasts with semantic memory, i.e., memory for the statistical regularities (class structure) of the whole training set.  The video below shows our dense grid (8x8) of HOF centers superimposed on the actor for one of the 90 Weizmann snippets (ido_bend.avi).
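As a concrete illustration of the preprocessing geometry, here is a minimal sketch of placing a dense 8x8 grid of HOF centers within the BB. This is assumed geometry for illustration only (the function name and cell-center convention are ours), not the actual preprocessing code.

```python
# Hypothetical sketch: evenly place an 8x8 grid of HOF centers inside a
# bounding box (x0, y0, x1, y1) around the actor. Illustrative only.
def hof_grid_centers(x0, y0, x1, y1, rows=8, cols=8):
    """Return (x, y) pixel coordinates of the rows x cols grid-cell centers."""
    centers = []
    for r in range(rows):
        for c in range(cols):
            x = x0 + (c + 0.5) * (x1 - x0) / cols  # center of column c
            y = y0 + (r + 0.5) * (y1 - y0) / rows  # center of row r
            centers.append((x, y))
    return centers
```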

The video below shows the 7-frame HOF feature vector produced from the data within the BB in the above 13-frame video (which was itself produced from an original 65-frame Weizmann snippet).  The HOF frames are 24x120 and correspond to an 8x8 array of 3x15 HOF grids.  Each of these 3x15 grids is a 3x3 array of 5-element histograms, where the 5 binary elements represent either no motion, or the presence/absence of motion in four canonical directions: NE, SE, SW, and NW.  The motion values are computed from a window of successive frames of the original video (in the region corresponding to the histogram in question).
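The dimensional bookkeeping above can be summarized in a few lines of code. The layout below (how the 3x15 blocks tile the 24x120 frame) is our reading of the description, not a specification of the actual file format.

```python
import numpy as np

FRAMES = 7                     # HOF frames per snippet
GRID_ROWS, GRID_COLS = 8, 8    # dense grid of HOF centers
HIST_ROWS, HIST_COLS = 3, 3    # 3x3 array of histograms per grid center
BINS = 5                       # no motion + motion toward NE, SE, SW, NW

# One 3x15 block per grid center: 3 rows x (3 histograms x 5 bins) columns.
block = np.zeros((HIST_ROWS, HIST_COLS * BINS), dtype=np.uint8)
assert block.shape == (3, 15)

# Tiling the 8x8 grid of blocks yields one 24x120 binary HOF frame.
frame = np.zeros((GRID_ROWS * HIST_ROWS, GRID_COLS * HIST_COLS * BINS),
                 dtype=np.uint8)
assert frame.shape == (24, 120)

# The full feature vector for one snippet is 7 such frames.
snippet = np.zeros((FRAMES,) + frame.shape, dtype=np.uint8)
assert snippet.shape == (7, 24, 120)
```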

Below, we see the Sparsey model (with ~2.5 million binary wts) being reactivated during a test presentation of the above HOF vector sequence, which was presented once during learning (as were the other 89 HOF sequences, i.e., single-trial learning). There are many things to note in this video. At the most summary level, note that the model has assigned a sequence of 7 sparse codes, which are chained together via Hebbian learning (green "horizontal" (H) weights). Each sparse code occurs in one of the 20 macrocolumns (macs); a mac can be active two (or more) times in a row, though that does not occur in this example. Each code (the set of colored cells in the mac [black: correct, red: incorrectly active, green: incorrectly inactive]) is chosen as a function of the simultaneous input via the H wts and the bottom-up (U) information from the active input "pixels" (blue wts), i.e., as a function of a spatiotemporal similarity computation.
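In skeletal form, the code-selection step just described might look like the following. Sparsey's actual code selection also normalizes these signals and modulates the choice as a function of input familiarity; the hard winner-take-all below is a deliberate simplification, and all names and shapes are our assumptions.

```python
import numpy as np

def choose_code(u_input, h_input):
    """Pick one winner cell per competitive module of a mac.

    u_input, h_input: summed bottom-up (U) and horizontal (H) signals per
    cell, each of shape (n_modules, cells_per_module).
    Returns the winning cell index in each module, i.e., the sparse code.
    """
    combined = u_input + h_input        # spatiotemporal evidence per cell
    return np.argmax(combined, axis=1)  # hard winner-take-all per module
```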

The supervised learning is accomplished by increasing the weights from the last code active in the sequence onto a field of class nodes (not shown in this figure).
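A hedged sketch of that readout step, with hypothetical names and binary weights as in the rest of the model:

```python
import numpy as np

def learn_class(class_weights, last_code_cells, class_index):
    """Set the weights from the cells active in the trace's final code to
    the node for the snippet's class (single-trial, binary Hebbian step).

    class_weights: (n_classes, n_internal_cells) binary matrix.
    last_code_cells: indices of the cells active in the last code.
    """
    class_weights[class_index, last_code_cells] = 1
```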

RESULT: We presented 90 snippets, totaling several hundred frames, once each, resulting in 90 spatiotemporal sparse distributed code (SDC) memory traces like the one shown. Training time is ~25 seconds, BUT at least a 10x-100x speedup is likely possible with standard code optimization, and much more by finding better input features...AND this is all without even talking about any type of machine parallelism.  We verified that the traces are recapitulated almost perfectly in all cases and that the correct class is output ~98% of the time. Note that while the model has 2.5 million wts (U and H wts combined), only a relatively small percentage of the wts afferent to most internal units (of which there are 20 x 5 x 7 = 700) is actually increased during learning, e.g., ~2-5%. This suggests that this particular model could probably store many more such input sequences while still being able to recognize (or recall, though we haven't implemented that recall testing yet) them all with high accuracy.
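A back-of-the-envelope version of that capacity argument (averages only; the true per-unit fan-in depends on the model's connectivity, so these figures are rough):

```python
n_units = 20 * 5 * 7               # 700 internal units
n_weights = 2_500_000              # U and H wts combined (approximate)
avg_fan_in = n_weights / n_units   # ~3,571 afferent wts per unit on average

# If only ~2-5% of a unit's afferent wts were increased across all 90
# snippets, that is roughly 70-180 wts per unit, leaving most unused.
print(avg_fan_in, 0.02 * avg_fan_in, 0.05 * avg_fan_in)
```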

However, precisely because of this particular model's non-hierarchical structure and parameter settings optimized for episodic memory, it does not generalize well, i.e., it does not exhibit proper semantic memory. Consequently, we do not show that it can do well on the standard Weizmann benchmark classification task, i.e., training on a subset of the 90 snippets and testing on the others.

We are working toward that goal now.  Specifically, we are experimenting with deep hierarchical models (examples of which can be seen throughout this web site).  The overall concept of operations is that macs at higher levels come to represent (embed) higher-level statistical structure (a.k.a. invariances) of the input space, which can then support classification.  In parallel with this, we are investigating alternative preprocessing, i.e., input features, that makes the classifications easier to learn.  In particular, our dense-grid HOF input sequences seem to have as much intra-class variation as inter-class variation, which makes classification quite hard. I believe that if we spent even a modest effort on finding/creating better featural representations of these snippets, even this 2-level model could get high classification scores on the standard benchmark task.

However, as many in the pattern recognition and machine intelligence fields know, it is easy to spend huge amounts of time and money engineering the features to yield state-of-the-art (SOA) performance on this or that task.  Ultimately, the brain receives quite primitive signals from the sensory organs and discovers for itself (perhaps with some relatively small amount of supervision) all higher-order features.  This includes features of the sort that most SOA models have used as their input, e.g., SIFT, HOG, HOF.  But I don't see any fundamental distinction between features such as these and arbitrarily more complex (and nonlinear) features, or in other words, any higher-level concept/class, e.g., the concept of a meal, of a house, of driving a car, or of being a doctor.  All concepts can also be viewed as features of still higher-level concepts.  My primary goal has always been to discover the core algorithm by which the brain automatically discovers (extremely quickly) concepts of any scale, and therefore I have personally spent almost no time/energy on feature engineering/optimization.