Structural degeneracy in neural networks
Minor Thesis
submitted in partial fulfilment of the requirements for the degree of
Master of Computer Science
at The University of Melbourne
Matthew Farrugia-Roberts
Supervised by Daniel Murfet and Nic Geard
Submitted: October 2022. Minor revision: December 2022.
Thesis
Download: Full text PDF (3.2MB).
Abstract
Neural networks learn to implement input–output functions based on data. Their ability to do so has driven applications in many task domains. However, a solid theoretical understanding of deep learning remains elusive. For example, classical theoretical frameworks fail to account for the performance of large neural networks despite their capacity to implement overly complex functions. There is a need for basic theoretical research clarifying the relationship between structure, function, and complexity in neural networks.
For almost all neural networks, the relationship between structure and function is well understood: all parameters implementing a given function are related by simple operations such as exchanging the weights of units. However, for the remaining degenerate neural networks, there are additional parameters implementing the same function. Since some of these additional parameters correspond to smaller, less complex neural networks, degenerate neural networks are positioned to play a crucial role in understanding neural network learning. However, prior work investigating neural network structures has largely emphasised the generic non-degenerate case.
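To make these structural operations concrete, here is a minimal numpy sketch of a single-hidden-layer biased tanh network (the sizes, variable names, and random weights below are illustrative only, not taken from the thesis). It checks that exchanging hidden units, and negating a unit's incoming weights, bias, and outgoing weight, both leave the implemented function unchanged.

import numpy as np

rng = np.random.default_rng(0)

# A single-hidden-layer biased tanh network with h hidden units and n inputs:
#   f(x) = sum_i c[i] * tanh(A[i] @ x + b[i]) + d
h, n = 4, 3
A = rng.normal(size=(h, n))   # incoming weights
b = rng.normal(size=h)        # hidden-unit biases
c = rng.normal(size=h)        # outgoing weights
d = rng.normal()              # output bias

def f(x, A, b, c, d):
    return c @ np.tanh(A @ x + b) + d

x = rng.normal(size=n)
y = f(x, A, b, c, d)

# Exchanging the weights of units (permuting the hidden units) preserves f.
perm = rng.permutation(h)
assert np.isclose(y, f(x, A[perm], b[perm], c[perm], d))

# Since tanh is odd, negating a unit's incoming weights, bias, and
# outgoing weight also preserves f.
s = np.array([1.0, -1.0, 1.0, -1.0])
assert np.isclose(y, f(x, s[:, None] * A, s * b, s * c, d))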
In light of this situation, I present a comprehensive theoretical investigation of structural degeneracy in a simple family of neural networks (single-hidden-layer biased hyperbolic tangent networks). I develop an algorithmic framework for analysing degeneracy, including an ideal measure of degeneracy called the rank of a neural network (the minimum number of hidden units required to implement an identical function). Using this framework, I characterise the class of functionally equivalent parameters, including in the degenerate case, extending prior work that has only considered the non-degenerate case. I show that in the degenerate case, the functional equivalence class is piecewise linear path connected. Moreover, I study a measure of approximate degeneracy, the parametric approximate rank, based on proximity to low-rank parameters. Drawing on computational complexity theory, I show that determining such proximity is an NP-complete problem.
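As an illustrative sketch of degeneracy and rank (again with made-up weights, not an example from the thesis): the three-unit tanh network below has two units with identical incoming weights and bias, so merging them by summing their outgoing weights yields a two-unit network implementing an identical function, witnessing that this parameter has rank at most two.

import numpy as np

rng = np.random.default_rng(1)
n = 3

def f(X, A, b, c, d):
    # f(x) = sum_i c[i] * tanh(A[i] @ x + b[i]) + d, evaluated for each row x of X
    return np.tanh(X @ A.T + b) @ c + d

# A degenerate three-unit network: units 0 and 1 have identical
# incoming weights and biases.
a0 = rng.normal(size=n)
A = np.stack([a0, a0, rng.normal(size=n)])
b = np.array([0.3, 0.3, -1.1])
c = np.array([0.7, -0.2, 1.3])
d = 0.5

# Merge units 0 and 1 by summing their outgoing weights: a two-unit
# network implementing an identical function (so the rank is at most 2).
A2 = A[[0, 2]]
b2 = b[[0, 2]]
c2 = np.array([c[0] + c[1], c[2]])

X = rng.normal(size=(100, n))  # random test inputs
assert np.allclose(f(X, A, b, c, d), f(X, A2, b2, c2, d))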
The insights into structural degeneracy developed in this thesis have the potential to clarify topics of current interest in deep learning. More broadly, this thesis lays a foundation for future work developing efficient measures of degeneracy and empirically investigating the role of degeneracy in deep learning.
Citation
BibTeX database entry:
@mastersthesis{Farrugia-Roberts2022,
  title  = {Structural Degeneracy in Neural Networks},
  author = {Farrugia-Roberts, Matthew},
  month  = dec,
  year   = 2022,
  school = {School of Computing and Information Systems,
            The University of Melbourne},
  url    = {https://far.in.net/mthesis},
}
Recorded talks
A talk at the Singular Learning Theory & Alignment Summit, 2023
- “Hidden unit acrobatics: An introduction to swaps, flips, and other symmetries of singular networks and their applications”. An intuitive-level overview of reducibility and main thesis results, plus important directions for future work. 45 minutes. [youtube] [interactive notebook]
A series of three ~90-120 minute improvised talks at metauni’s singular learning theory seminar, presenting some of the key results in the thesis.
- Part 1: Overview of thesis. Three perspectives on degeneracy. Reducibility and reduction algorithms. [youtube]
- Part 2: Rank and algorithms. Parametric approximate rank. Statement of hardness theorem. [youtube]
- Part 3: Bounding parametric approximate rank (PAR). PAR is in NP. Reduction from SAT to PAR (sketch of half of the proof). [youtube]
An earlier talk in the same series.
- Part 0: Preliminary results on piecewise linear path connectivity of the functional equivalence class of reducible parameters. [youtube]
TODO: Upload and link a couple of other related talks.
Colophon
Typeset with LaTeX.
Fonts include Computer Modern, BOONDOX fraktur, and Ralph Smith Formal Script (\usepackage[frak=boondox,scr=rsfs]{mathalfa}).
Most figures drawn with TikZ, and some with pgfplots.
Errata
Please contact me if you find any errors.
Known errors:
- Page 38 (pdf page 43). After the list, “so w is reducible” should read “so w is non-minimal”. Oops!
- Page 42 (pdf page 47). The final paragraph of the proof of theorem 3.39 is incorrect. The subparameters to which I apply Sussmann’s result are not necessarily irreducible: they may satisfy reducibility condition (i) (this happens when the outgoing weight vector of some unit has some zero components). However, the proof can be recovered, since no unit can have an outgoing weight vector with all components zero, and the required transformation can be constructed by piecing together those from the different subnetworks. A corrected proof is provided in a forthcoming conference paper (preprint on arXiv).
- Page 90 (pdf page 94). In step (3) of the list, “approximately zero incoming weight” should read “approximately zero outgoing weight”. Oops!
- Page 91 (pdf page 95). Algorithm 6.10, line 10. The less-than-or-equal comparison operation should be a greater-than operation. Oops!
- Reference to Amari et al. (2017): that should actually be 2018. There are probably some other typos in the reference list, as many reference details were completed by hand.