8.3362 Interdisciplinary Course: Understanding Learning in Deep Neural Networks with the Help of Information Theory (Part II) (KOGW-MPM-IDK)

Leugering, Nieters

Type  Language  Semester  Credits  Hours  Room  Time  Term  Year
Pr    e         –         8        4      –     –     W     2018

MSc Interdisciplinary Course


Large, hierarchical artificial neural networks trained with stochastic gradient descent and backpropagation achieve impressive results on a wide range of machine learning problems in computer vision and natural language processing. However, a comprehensive theory explaining the power of deep networks, and the principles that allow them to be trained efficiently, has so far eluded researchers. The common working hypothesis, that deeper networks simply allow for more complex abstractions, seems at odds with earlier theoretical results proving that even shallow networks are universal function approximators. On the contrary, the large number of parameters in typical deep networks poses a puzzle for traditional machine learning theory, which would predict insurmountable problems with overfitting, even with the vast training corpora in use today. The depth of the networks also leads to the problem of vanishing gradients, which should render learning in the lower layers impractically slow. So how can deep networks perform so well in practice? One hypothesis has been proposed by the computational neuroscientist Naftali Tishby:

In his experiments, the behavior of networks trained with gradient descent is well explained by an information-theoretic concept: each layer of a deep network compresses its inputs (lossily, and in a task-related way) and thus reduces the complexity of the data seen by the next layer. Stacking many such layers, all simultaneously learning to compress their inputs, could thus allow a deep network to quickly learn even complex problems without overfitting. Is this a general principle that all training algorithms for layered neural networks should satisfy? Do networks that adapt to their input via biologically plausible plasticity rules show similar patterns? When does the principle break down? Is it possible to obtain similar results for recurrent neural networks?
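The compression idea can be made precise with the information bottleneck trade-off (a standard formulation of the concept; here T denotes a layer's internal representation of the input X, and Y the target):

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

Small I(X;T) means strong compression of the input, while large I(T;Y) means the representation remains informative about the task; the parameter β controls the balance between the two.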

To us, these questions are both interesting and promising research avenues. We are looking for a small group of motivated students eager to join us in this investigation within the context of a study project. We aim to establish a diverse group whose members have a strong background in mathematics, in computer science and software engineering, or in both. Creative students with a knack for management, writing, and developing ideas are also very welcome!

Our first milestone will be for every group member to get a grip on the relevant portions of information theory and the information bottleneck method. We will then reproduce the experiments of Shwartz-Ziv and Tishby [2017] in a robust, open-source software environment of our own development. This will serve as the basis for our own investigations, as we design equivalent experimental setups for neural networks whose connections are adapted by (local, unsupervised) plasticity mechanisms. Our goal is to publish the results found during the project and to open up follow-up research opportunities!
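To give a flavor of the tooling involved: Shwartz-Ziv and Tishby estimate the mutual information between layer activations and inputs or labels by discretizing the activations into bins and counting co-occurrences. A minimal, self-contained sketch of such a binned estimator (function names and bin settings are our own illustration, not the project's actual code) could look like this:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired samples of two discrete variables."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint * n * n / (px * py) == p(x,y) / (p(x) * p(y))
        mi += p_joint * math.log2(p_joint * n * n / (px[x] * py[y]))
    return mi

def binned(values, n_bins=30, lo=-1.0, hi=1.0):
    """Discretize real-valued activations (e.g. tanh outputs) into equal-width bins."""
    width = (hi - lo) / n_bins
    return tuple(min(n_bins - 1, max(0, int((v - lo) / width))) for v in values)

# Two perfectly correlated fair bits share exactly one bit of information:
mutual_information([0, 1, 0, 1], [0, 1, 0, 1])  # → 1.0
```

In the actual experiments one would collect the activation vector of a layer for every training sample, bin each vector with `binned`, and feed the resulting tuples together with the inputs (or labels) into `mutual_information` to place the layer in the information plane.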

Link: http://www