Convolution is Not Biological
While it is true that ConvNets, a particular flavor of Deep Learning, are achieving great success at an increasingly wide array of pattern recognition tasks, it is unlikely that they capture the essence of biological intelligent computation. Conceptually, the convolution operation consists of iterating, i.e., systematically translating, a kernel (filter) over an input space. In the case of vision, we can take the input space to be the whole 2D visual field and the kernel to be a small 2D patch.
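To make the operation concrete, here is a minimal sketch, in plain NumPy, of a single kernel being translated across a 2D input. The 5x5 input, the 3x3 kernel, and the "valid" border handling are arbitrary illustrative choices, not taken from any particular network.

```python
import numpy as np

def slide_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Translate one kernel over every patch of the image ("valid" borders).

    The very same kernel weights are reused at every location; this reuse is
    the parameter sharing discussed below. (As in ConvNets, the kernel is not
    flipped, so strictly speaking this is cross-correlation.)
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # one aperture of the visual field
            out[i, j] = np.sum(patch * kernel)  # one dot product per aperture
    return out

# Toy example: a 5x5 "visual field" and a 3x3 kernel (filter).
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0.,  1., 0.],
                   [1., -4., 1.],
                   [0.,  1., 0.]])              # Laplacian-like edge detector
print(slide_kernel(image, kernel))              # 3x3 feature map
```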
It is often asserted that a key advantage of a ConvNet is that only one copy of the parameters specifying the kernel needs to be learned/stored. This is because the long-term statistics of the visual field vary little across small patches (apertures). This means that a single basis (lexicon) of features can represent all apertures approximately equally well. It also means that, for training the basis, the inputs to all apertures can be pooled into a single training set, greatly increasing the number of samples from which the basis is learned and mitigating undersampling. Indeed, that is the standard justification for the shared (tied) parameter method.
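The storage argument can be made concrete by counting parameters. The sketch below uses a hypothetical layer configuration (all sizes are illustrative, not drawn from any specific model) to compare a convolutional layer, which stores one copy of each kernel, against a locally connected layer that stores a separate kernel per aperture, which, as argued below, is closer to what local cortical wiring would actually implement.

```python
# Hypothetical layer configuration; all sizes chosen only for illustration.
field_h, field_w = 64, 64      # "visual field" in pixels
k = 7                          # each kernel (aperture) is k x k
n_features = 32                # number of kernels / basis features

out_h = field_h - k + 1        # output grid with "valid" borders
out_w = field_w - k + 1
n_apertures = out_h * out_w

# Convolutional (tied) layer: one copy of each kernel, reused everywhere.
tied_params = n_features * k * k

# Locally connected (untied) layer: a separate kernel for every aperture,
# analogous to each cortical patch having its own disjoint set of synapses.
untied_params = n_features * k * k * n_apertures

print(f"apertures:              {n_apertures}")    # 3,364
print(f"tied (ConvNet) params:  {tied_params}")    # 1,568
print(f"untied params:          {untied_params}")  # ~5.3 million
```

The roughly 3,000-fold difference in stored parameters is the saving that weight sharing buys; the question raised below is whether the brain needs, or could even implement, that saving.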
But:
- Is it at all likely that the brain, or more precisely any single map in the brain, uses the same kernel at all locations? Almost surely not. Think about what that would entail. First, recall that the brain is, of course, locally connected. Thus, the physical parameters instantiating the kernel for any particular patch of the visual field are the synaptic weights leading from the relevant patch of retina to the relevant patch of LGN and then on to the V1 cell implementing the kernel in question. In particular, the LGN-V1 synapses instantiating any one patch's kernel are necessarily disjoint from the synapses instantiating any other patch's kernel. For the same kernel to be used at all patches, there would have to be some mechanism for communicating / copying each parameter (synaptic weight) to the synapse representing that same topologically registered parameter in every other kernel. On its face, this seems extremely implausible. And the use of shared parameters and averaged gradients relies even more strongly on this implausible biological scenario.
- Is there really any need to use the same kernel at all locations? No, for two reasons.
  - Under a set of reasonable assumptions, the number of possible unique inputs to an aperture of the scale seen by a V1 cell could be "small". For example, in Sparsey those assumptions are implicit in that we edge-filter, binarize, and skeletonize the inputs, which vastly reduces the space of possible inputs to an aperture of that scale. (See this page, especially Fig. 2 and below; a rough arithmetic sketch of the reduction appears after this list.) This, in turn, mitigates the undersampling problem and increases the likelihood that the inputs experienced (in some early, e.g., "critical", period) in any single aperture will be sufficient to train up a basis capable of representing ALL future inputs to that aperture. In other words, nonstationarity is probably NOT a problem at the lower spatiotemporal scales handled by coding fields in the lower cortical levels.
  - Rapidly accumulating evidence from compressive sensing (e.g., Pitkow, 2012) and random projections (Arriaga et al., 2015) suggests that even completely random bases, i.e., requiring no learning at all, can suffice to represent natural input spaces (a small demonstration appears after this list). If completely random bases can often suffice, then surely bases learned from few samples could suffice. The bases (lexicons) learned for different apertures will be highly redundant (due to the aforementioned fact that the underlying statistics of the input space are highly similar across apertures). But this redundancy is no problem in actual biology because, as noted above, the physical synapses needed to represent all the highly redundant kernels do in fact exist.
- Is there really any advantage to using the same kernel everywhere? In fact, might it not be a disadvantage? Clearly, we think the above arguments imply that there is no advantage in the biological brain. And for the reasons stated on this page, i.e., the increased burden of moving data entailed by shared parameters and by using the same kernel everywhere, which costs both power and computation time, we think it is clearly a disadvantage.
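As a rough illustration of the first sub-point above, the arithmetic below counts how many binary patterns a small aperture can take on once inputs have been edge-filtered, binarized, and skeletonized so that only a few pixels are active. The 8x8 aperture and the 10% sparsity cap are assumptions made for this example, not figures taken from Sparsey.

```python
from math import comb

# Hypothetical aperture: 8x8 binary pixels after edge-filtering / binarization.
n_pixels = 8 * 8

# Unconstrained binary inputs: every pixel free to be on or off.
unconstrained = 2 ** n_pixels

# Skeletonized edge inputs are sparse; assume at most ~10% of pixels active.
max_active = n_pixels // 10
sparse = sum(comb(n_pixels, a) for a in range(max_active + 1))

print(f"unconstrained binary patterns:         {unconstrained:.2e}")  # ~1.8e19
print(f"sparse (<= {max_active} active) patterns:  {sparse:.2e}")     # ~8.3e7
```

Preprocessing of this kind collapses the aperture's input space by many orders of magnitude, which is what makes it plausible that the inputs seen by a single aperture during an early period could train a basis adequate for all of its future inputs.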
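The claim in the second sub-point, that completely random, unlearned bases can represent an input space, can be checked in a few lines of NumPy: a Gaussian random projection approximately preserves the pairwise distances among a set of inputs (the Johnson-Lindenstrauss effect exploited in the random-projection and compressive-sensing literature). The dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, input_dim, proj_dim = 200, 1024, 128    # arbitrary sizes
X = rng.standard_normal((n_inputs, input_dim))    # stand-in inputs

# A completely random, unlearned basis (Gaussian random projection).
R = rng.standard_normal((input_dim, proj_dim)) / np.sqrt(proj_dim)
Y = X @ R

def pairwise_dists(Z: np.ndarray) -> np.ndarray:
    """Euclidean distances between all pairs of rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.sqrt(np.maximum(d2, 0.0))

mask = ~np.eye(n_inputs, dtype=bool)              # ignore zero self-distances
ratios = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]

print(f"projected/original distance ratios: "
      f"mean={ratios.mean():.3f}, std={ratios.std():.3f}")
# The ratios cluster tightly around 1.0: the random basis preserves the
# geometry of the inputs without any learning, which is the sense in which
# an unlearned (or lightly trained, per-aperture) basis can already suffice.
```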
Despite the clear and continuing technological success of deep ConvNets, which has been driven by the rise of machine parallelism and massive amounts of data, it is exceedingly unlikely that they operate on information in essentially the same way that the brain does.