Hierarchical Sparsey® Network Results on Weizmann Event Recognition

Our goal is to develop general-purpose hierarchical models capable of spatiotemporal pattern recognition.  As one special case, we are testing 3-level Sparsey networks on the Weizmann benchmark.  Weizmann consists of 90 snippets, 9 from each of 10 classes.  We emphasize at the outset that we are aware of no other hierarchical model based on sparse distributed representations that has even been applied to video event recognition (this includes Numenta's HTM), let alone achieved any particular level of performance.

To avoid misunderstanding, we must also emphasize immediately that sparse distributed representations (SDR), a.k.a. sparse distributed codes (SDC), are a different concept from "sparse coding" (cf. Olshausen & Field).  The latter concept, sparse coding, is of course widely present in optimization-centric ML methods.  Moreover, the two concepts are completely compatible with each other and in fact co-exist in Sparsey.  But SDR has been much less widely applied or even researched.  Only a handful of models use SDR.  The most prominent is probably Numenta's HTM.  Hecht-Nielsen's Confabulation model is also very clearly an SDR model, though he does not use that terminology.  There have been several others, e.g., Kanerva's sparse distributed memory (SDM), Rachkovskij & Kussul's model, and Moll & Miikkulainen's convergence-zone model, but there is little information regarding large-scale instantiations composed of hierarchies of canonical modules and, in particular, applied to spatiotemporal (sequential) problem domains.

Thus, we believe the results reported here are the first results of a truly hierarchical model using SDR at all levels applied to a benchmark video event recognition problem.  Note that Sparsey is used in conjunction with an SVM here.  Specifically, we take the internal representations (SDR codes) active at Sparsey's top level on the last frame of each video snippet and use those codes as inputs to train an SVM.  The use of the SVM is partly pragmatic, i.e., the overall supervised training paradigm is largely automated by the SVM package.  However, we anticipate being able to remove the back-end SVM shortly, and thus drive the classification decisions directly from the learned SDR codes, as is the case for our MNIST results.  We used the original 90 Weizmann snippets as well as 5 slightly noisy instances of each, so that the total training set contained 540 snippets.  That said, our results are still modest.  We currently achieve only 67% accuracy on Weizmann (chance = 10%), but the model trains in 3.5 minutes on a single CPU core (no machine parallelism whatsoever).  In addition, this result must be understood in light of the following.
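To make the back-end step concrete, the following is a minimal Python sketch of training the SVM on the top-level codes, assuming those codes have already been exported as a binary feature matrix.  The file names, the use of scikit-learn, and the linear kernel are illustrative assumptions, not a description of the actual tooling used.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Assumed precomputed: one top-level SDR code (flattened to a binary vector)
    # per snippet, taken on that snippet's last frame, plus the class labels.
    top_level_codes = np.load("top_level_codes.npy")  # hypothetical file, shape (540, n_units)
    labels = np.load("labels.npy")                    # hypothetical file, shape (540,)

    clf = SVC(kernel="linear")  # linear kernel: a common default for
                                # high-dimensional sparse binary features
    scores = cross_val_score(clf, top_level_codes, labels, cv=5)
    print("mean cross-validated accuracy: %.3f" % scores.mean())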

  1. All of the model's speed comes from the exploitation of algorithmic parallelism.  In fact, "algorithmic parallelism" is essentially synonymous with using "distributed representation", which is also synonymous with storing information in "physical superposition".
  2. The model uses extremely simple input features, i.e., just binary pixels.  Specifically, each frame is converted into an edge-filtered, binary 42x60 image (see the preprocessing sketch after this list).  We also decimate in time, so that the total input set consists of just 883 such frames.  Crucially, no higher-level features, e.g., HOF or HOG, are used.  Indeed, our goal is for all motion features to be learned from scratch at higher levels of the network by experiencing only the raw data.
  3. Each snippet is presented only once (cf. single-trial, or one-shot, learning).
  4. All cells are binary, all weights are effectively binary, and a very simple Hebbian learning rule is used; there is no gradient-based learning in the model. By "effectively binary" we mean the following: the weights are actually represented as bytes, but when a pre-post coincidence occurs, a weight is set to 127. Passive decay is modeled, as is another synaptic property that we call permanence (see the weight-dynamics sketch after this list). In any case, by the end of learning, the weight distribution has two narrow peaks, one at 0 and one at 127, at which point recognition (inference) is essentially using binary weights.
  5. The Sparsey algorithm, called the Code Selection Algorithm (CSA), is very fast. During both learning and retrieval (recognition, inference), it iterates only over weights (whose number is fixed for the life of the network), not over stored items. Thus, both learning and retrieval time remain constant for the life of the network.
  6. An individual Sparsey module, which we call a macrocolumn, or "mac", since we view it as analogous to the cortical macrocolumn, has hundreds of parameters (probably reducible to a few tens of meta-parameters). Thus far, we have done very little parameter searching, and in fact we still have a great deal to learn about the optimal relations between parameters, both within any individual mac and between macs across levels. Thus, we are quite confident that we can boost performance into the state-of-the-art (SOA) range, i.e., 90-100% classification accuracy, relatively quickly.  In addition, one can readily see in the animations below that many macs activate at each level on each frame.  In our search through hyper-parameters, we believe we will find settings that decrease the average number of active macs per frame by at least 2x, thus at least doubling speed.
  7. Given the model's extremely simple representation of unit state and weights, which implies that very low precision is sufficient, and the fact that its operations, learning and retrieval, run in constant time regardless of how much information is stored, we believe that Sparsey will be a strong platform on which to base extremely low-power, extremely fast, and extremely scalable applications in spatiotemporal (and, as a special case, spatial) pattern recognition.  This includes any purely sequential pattern recognition task, e.g., any type of language processing application.
  8. As is true for all models attempting to solve problems like this, we expect that huge improvements could be made by introducing information from other modalities.  There will also likely be huge amortization of learning (reduced learning time across data sets and tasks) due to transfer learning.
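Regarding item 2 above, the following is a minimal preprocessing sketch, assuming OpenCV.  The Canny thresholds, the decimation factor, and the exact resize convention are illustrative guesses, not the settings used for the reported results.

    import cv2
    import numpy as np

    def preprocess_snippet(path, decimate=2, height=42, width=60):
        """Return a list of edge-filtered, binary 42x60 frames from one video snippet."""
        cap = cv2.VideoCapture(path)
        frames, i = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if i % decimate == 0:                          # simple temporal decimation
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                small = cv2.resize(gray, (width, height))  # cv2.resize takes (width, height)
                edges = cv2.Canny(small, 50, 150)          # edge filter (thresholds are guesses)
                frames.append((edges > 0).astype(np.uint8))  # binarize to {0, 1}
            i += 1
        cap.release()
        return frames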
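Regarding item 4 above, the following toy sketch illustrates the "effectively binary" weight dynamics.  The decay schedule and the permanence mechanism shown here are illustrative stand-ins, not Sparsey's actual update equations.

    import numpy as np

    W_MAX = 127  # weights are stored as bytes; "set" weights sit at this value

    def hebbian_update(W, pre, post):
        """Set a weight to W_MAX wherever a pre-post coincidence occurs (binary units)."""
        coincidence = np.outer(pre, post).astype(bool)
        W[coincidence] = W_MAX
        return W

    def passive_decay(W, permanent, decay=1):
        """Decay non-permanent weights toward 0; 'permanent' weights stay at W_MAX."""
        W[~permanent] = np.maximum(W[~permanent].astype(int) - decay, 0)
        return W

    # Toy example: 5 presynaptic and 4 postsynaptic binary cells.
    W = np.zeros((5, 4), dtype=np.uint8)
    permanent = np.zeros_like(W, dtype=bool)
    pre = np.array([1, 0, 1, 0, 0])
    post = np.array([0, 1, 1, 0])
    W = hebbian_update(W, pre, post)
    W = passive_decay(W, permanent)
    # After many presentations the weights cluster into two narrow peaks, near 0 and 127.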

Fig. 1: A 3-level Sparsey model responding to a preprocessed Weizmann snippet

The figure above shows a 3-level model with a 12x18 array of macs at L1 and a 6x9 array (54 macs) at L2. Active macs are shown in rose. The next figure is the same but highlights the receptive fields (RFs) of four L1 macs (green borders).  Small representative samples of active afferent weights within each mac's bottom-up (U), horizontal (H), and top-down (D) RFs are shown.  The flashing weights represent only a tiny fraction of all signals sent while processing the snippet.

Fig. 2: The same model, highlighting the RFs of four L1 macs

The U-RFs of the four L1 macs are highlighted in cyan. They each contain about 100 input pixels (binary features). When a sufficient number of pixels is active in an L1 mac's U-RF (the threshold being one of many model parameters), the mac activates, meaning an SDR code is activated within it; these codes are not shown here because the scale is too small.
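As a minimal illustration of this activation rule, the sketch below checks whether the count of active pixels in a mac's U-RF meets a threshold; the parameter name and value are hypothetical, not Sparsey's actual parameter.

    import numpy as np

    def mac_is_active(frame, rf_indices, pi_min=20):
        """frame: flattened binary pixel array for one input frame.
        rf_indices: indices of the ~100 pixels in this mac's U-RF.
        pi_min: hypothetical activation threshold (a model parameter)."""
        return int(frame[rf_indices].sum()) >= pi_min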

Fig. 3: The same model, highlighting the RFs of two L2 macs

The U-RFs of the two L2 macs (green borders) are highlighted in cyan; they slightly overlap (the overlap appears as several slightly darker cyan L1 macs). Here we show recursive RFs, meaning that in addition to seeing the set of L1 macs comprising the immediate U-RF of each L2 mac, we also see each L2 mac's input-level U-RF. The input-level U-RF of an L2 mac is the union of the pixels comprising the immediate U-RFs of all the L1 macs in the L2 mac's immediate U-RF. Within an L2 mac's immediate U-RF, the active L1 macs are highlighted in purple.  The smaller U-RFs of three L1 macs are also shown.  We also show the actual codes that become active in the L2 macs.
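A small sketch of the input-level U-RF construction just described, assuming simple dictionary lookup tables (not Sparsey's internal format):

    def input_level_urf(l2_mac, l2_to_l1_rf, l1_to_pixel_rf):
        """Union of the pixel-level U-RFs of the L1 macs in an L2 mac's immediate U-RF."""
        pixels = set()
        for l1_mac in l2_to_l1_rf[l2_mac]:         # L1 macs in the L2 mac's immediate U-RF
            pixels |= set(l1_to_pixel_rf[l1_mac])  # add that L1 mac's input pixels
        return pixels

    # Example with toy RF tables:
    l2_to_l1_rf = {"L2_0": ["L1_0", "L1_1"]}
    l1_to_pixel_rf = {"L1_0": [0, 1, 2], "L1_1": [2, 3, 4]}
    print(input_level_urf("L2_0", l2_to_l1_rf, l1_to_pixel_rf))  # {0, 1, 2, 3, 4}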

Again, these are still preliminary investigations of a complex model with many parameters. In particular, though we have recently been focusing on models with just three levels (two internal representation levels), it is likely that more optimal solutions will have more levels (though likely with the same number of weights or fewer, overall). We will increase the number of levels as we gain sufficient understanding of the inter-level parameters.