Might reliance on machine parallelism be
garden-pathing the Deep Learning community?
It is readily acknowledged that Deep Learning (DL) training times can be very long and that DL's tractability depends on massive machine parallelism (both model and data parallelism) using GPUs, together with the availability of massive amounts of training data. I suggest that these dependencies point to a deep dissimilarity between the essential computational natures of Deep Learning models and of cortex, and that the availability of cheap machine parallelization and of massive amounts of training data may be leading researchers away from the true nature of the neural/cortical computation that gives rise to intelligence.
The problem with relying on shared (tied) parameters in Convolutional Nets
We know that the brain consumes several orders of magnitude less power than existing computer technology. We also know that most of the power consumed by computers goes to moving data between processor and memory and, in a GPU setting, between/amongst the CPU, memory, and GPU cores. Thus, models for which substantial movement of data is intrinsic to the computation may have great difficulty realizing power efficiency commensurate with the brain's. In particular, the technique of shared (tied) parameters, on which ConvNets critically depend for acceptable computation time, intrinsically involves massive data movement, i.e., the accumulation of gradient information across multiple spatial locales, from which an average gradient is then computed.
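To make that concrete, here is a minimal sketch (in NumPy, with hypothetical toy sizes) of how the gradient of a single shared 1-D kernel is formed: per-position contributions from every spatial locale must be gathered into one shared gradient (summed, or equivalently averaged) before the tied weights can be updated.

```python
# Minimal sketch (NumPy, hypothetical toy sizes): the gradient of a shared
# 1-D convolution kernel is accumulated over every spatial position at which
# the kernel is applied, so those per-position contributions must all be
# brought together before the single shared weight vector can be updated.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)            # toy input signal
w = rng.standard_normal(3)             # shared (tied) kernel
n_out = x.size - w.size + 1            # "valid" output length

# Forward pass: the same w is reused at every position.
y = np.array([x[i:i + w.size] @ w for i in range(n_out)])

# Backward pass: suppose the upstream gradient dL/dy is given.
dy = rng.standard_normal(n_out)
dw = np.zeros_like(w)
for i in range(n_out):                 # contributions from every locale...
    dw += dy[i] * x[i:i + w.size]      # ...are gathered into ONE shared gradient
# (dw could equally be divided by n_out to form an average gradient.)
```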
In fact, in a Deep Learning Tutorial (Hinton, Bengio & LeCun) given at NIPS 2016, Dr. LeCun acknowledges that sharing parameters entails a large amount of data movement (at ~minute 46 of the talk, Slide 66 “Distributed Learning”). Thus, the current situation is that:
- Acceptable learning (training) time in ConvNets, which many already consider long even with massive machine parallelism, depends on parameter sharing.
- By its algorithmic nature, parameter sharing entails massive data movement.
- Moving data accounts for most of the power consumed in computation (see the rough arithmetic sketch after this list).
- Thus, ConvNets will have difficulty attaining brain-like power efficiency unless they can move away from parameter sharing.
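As a rough illustration of the energy point in the list above, the back-of-the-envelope arithmetic below uses assumed, illustrative per-operation energy figures (not values taken from the tutorial or from any particular chip); the only point is the order-of-magnitude gap between on-chip arithmetic and off-chip data movement.

```python
# Rough, order-of-magnitude arithmetic only. The per-operation energy figures
# are ASSUMED illustrative values; the point is the large gap between
# arithmetic and off-chip data movement, which makes data motion dominate.
FLOP_PJ = 4.0            # ASSUMED energy of one 32-bit arithmetic op (picojoules)
DRAM_ACCESS_PJ = 600.0   # ASSUMED energy of one 32-bit off-chip access (picojoules)

n_values = 1_000_000     # hypothetical number of gradient values to accumulate
compute_energy = n_values * FLOP_PJ
movement_energy = n_values * DRAM_ACCESS_PJ   # one off-chip round trip per value

print(f"arithmetic:    {compute_energy / 1e6:.0f} microjoules")
print(f"data movement: {movement_energy / 1e6:.0f} microjoules "
      f"(~{movement_energy / compute_energy:.0f}x)")
```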
Also on Slide 66, Dr. LeCun references efforts to address this issue, e.g., asynchronous stochastic gradient descent, but indicates that substantial challenges remain. There has been some exploration of “locally connected” ConvNets (Gregor, Szlam, LeCun, 2011), i.e., ConvNets which do not use parameter sharing. But these are nascent efforts, and the fact remains that without parameter sharing, the scalability of ConvNets to massive problems has to be considered an open question. What would the training time be on a larger benchmark like ImageNet without parameter sharing?
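A simple, hypothetical parameter count suggests why scalability without sharing is in question: a locally connected (untied) layer needs a separate kernel at every output position, so its weight count grows with the spatial extent of the input. The layer sizes below are assumptions chosen only for illustration.

```python
# Back-of-the-envelope sketch (hypothetical layer sizes) of the cost of
# dropping parameter sharing: an untied ("locally connected") layer stores
# a separate kernel bank per output position.
H = W = 224              # input spatial size (e.g., ImageNet-scale images)
C_in, C_out = 3, 64      # channels
k = 7                    # assumed kernel size, stride 1, no padding
H_out = W_out = H - k + 1

shared = C_out * C_in * k * k                  # one kernel bank, reused everywhere
untied = H_out * W_out * C_out * C_in * k * k  # a kernel bank per output location

print(f"shared: {shared:,} weights")           # ~9.4e3
print(f"untied: {untied:,} weights")           # ~4.5e8
```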
Stepping back, the brain is, of course, locally connected. Therefore, the most parsimonious view is that the actual algorithm/mechanism of intelligence, having evolved in a locally connected system, requires local connectivity. More directly, one is extremely hard-pressed to imagine any plausible biological instantiation of parameter sharing. Thus, the existence proof that is the brain constitutes another serious challenge to ConvNets.
Despite the near-term practical success of ConvNets, the above arguments call into question whether they capture the true principles/mechanisms of biological intelligence, and raise the possibility that other technological advances, i.e., the availability of cheap massive parallelism and of massive amounts of training data, are diverting research away from understanding/uncovering those principles/mechanisms.
The problem with non-local learning methods
Of course, scientists have been seeking the secret of intelligence for many decades. However, the majority of that research has not specifically attempted to emulate the cortical realization of intelligence. This is true even for the majority of neural network models, up to and including the most recent state-of-the-art DL models. It is true that DL emulates some cortical characteristics, most notably cortex's multi-level, hierarchical architecture. However, both main classes of DL, those based on Restricted Boltzmann Machines (RBMs) and those based on ConvNets, continue to use non-local learning methods that depend on computing gradients and/or generating samples of large vector quantities, e.g., the state of all the cells of a hidden level or of the input level. The lack of plausible biological instantiations of, and neural evidence for, such methods is a long-standing criticism. There are continuing efforts to ground such methods biologically, e.g., the Bengio et al. (2015) paper, "Towards Biologically Plausible Deep Learning", but many hurdles remain. Furthermore, the incremental nature of the learning (large numbers of small weight adjustments) seems at odds with the generally much faster, often single-shot, nature of human learning (as discussed, e.g., in Lake, Ullman, Tenenbaum, & Gershman, 2016).
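To make the "large vector quantities" point concrete, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) step for a toy RBM (sizes and learning rate are arbitrary assumptions, and biases are omitted for brevity): the negative phase requires reconstructing and resampling entire layer-wide state vectors before any weight can be updated.

```python
# Minimal sketch (NumPy) of one CD-1 step for a toy RBM. The update requires
# generating samples of whole layer-wide state vectors (the negative phase),
# which is the kind of operation discussed above. Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, lr = 8, 4, 0.01
W = 0.01 * rng.standard_normal((n_v, n_h))
v0 = (rng.random(n_v) < 0.5).astype(float)      # toy binary "data" vector

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Positive phase: hidden probabilities and a hidden sample given the data.
ph0 = sigmoid(v0 @ W)
h0 = (rng.random(n_h) < ph0).astype(float)

# Negative phase: reconstruct the whole visible vector, then resample hiddens.
pv1 = sigmoid(W @ h0)
v1 = (rng.random(n_v) < pv1).astype(float)
ph1 = sigmoid(v1 @ W)

# Gradient estimate contrasts outer products of full layer states.
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
```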
As these computationally demanding operations, i.e., computing gradients and generating samples, are at the root of DL's long learning times, we again suggest that the recent massive increases in computing power (most notably, massive machine parallelism via GPUs) and in training data may be diverting researchers' attention away from discovering the true principles/mechanisms of biological intelligence.