
On large deviations in probabilistic deep learning and generative modeling

Time: Thu 2025-12-11 14.00

Location: Kollegiesalen, Brinellvägen 8, Stockholm

Language: English

Subject area: Mathematics

Doctoral student: Viktor Nilsson, Probability Theory, Mathematical Physics and Statistics

Opponent: Professor Luc Rey-Bellet

Supervisor: Pierre Nyquist, Probability Theory, Mathematical Physics and Statistics



Abstract

The topic of this thesis is the use of probabilistic methods in machine learning. These methods play a foundational role in motivating and defining machine learning algorithms, as well as in explaining why, and how well, the algorithms work. During the years in which this work was conducted (2020-2025), machine learning has gone from delivering a handful of impressive demonstrable results to becoming a staple of modern developed society, with products such as ChatGPT having capacity and applicability far beyond what anyone in the field expected ten years ago. Explaining why the modern methods work so well, despite their conceptual simplicity and elegance, requires both empirical and theoretical studies. This thesis contains both, with the emphasis on theory. The first part of the thesis, Papers A-C, concerns the implementation and analysis of novel methodologies in deep learning, whereas Papers D-F contain purely theoretical large deviations results for models adjacent to machine learning. The main thread is the application of mathematical tools from probability theory and statistics, such as the theory of large deviations and empirical process theory, to the understanding and improvement of machine learning methodology.

In Paper A, which presents the most applied direction pursued in the thesis, a deep probabilistic network model is applied to a task from clinical radiation therapy, namely dose prediction, where a target radiation dose value is to be assigned to each pixel/voxel of human tissue based on a CT image. The developed probabilistic model is based on mixture density networks. It is empirically demonstrated that a convolutional U-net can learn a satisfactory mixture distribution in each pixel. To the best of our knowledge, this is the first implementation of mixture density networks operating on images with a convolutional architecture.
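
As a brief illustration (the standard mixture density network formulation, not notation taken from the paper), the model outputs, for each pixel of the input image x, a Gaussian mixture over the dose value d,

    p(d \mid x) = \sum_{k=1}^{K} \pi_k(x) \, \mathcal{N}\big(d ; \mu_k(x), \sigma_k^2(x)\big), \qquad \pi_k(x) \ge 0, \quad \sum_{k=1}^{K} \pi_k(x) = 1,

where the mixture weights \pi_k, means \mu_k and variances \sigma_k^2 are produced by the convolutional U-net, and the parameters are fitted by maximizing the corresponding log-likelihood over the training data.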

In Paper B, we construct and implement a new method, called REMEDI, for entropy estimation of continuous distributions using deep neural networks. Further, using empirical process theory, it is shown that the estimator has a consistency property, ensuring that it has the theoretical capacity to estimate the entropy to arbitrary precision. The method is based on the celebrated Donsker-Varadhan lemma, a well-known result from the theory of large deviations. The applicability of the method is demonstrated on distributions in moderate dimension, as well as on the task of training a model in the information bottleneck framework, with satisfactory performance.
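
For reference, the Donsker-Varadhan lemma gives the variational representation of the relative entropy (Kullback-Leibler divergence) between probability measures \mu and \nu,

    D_{KL}(\mu \,\|\, \nu) = \sup_{g} \big\{ \mathbb{E}_{\mu}[g] - \log \mathbb{E}_{\nu}[e^{g}] \big\},

where the supremum runs over bounded measurable functions g. An estimator in this spirit restricts the supremum to a class of neural networks and optimizes it from samples; the precise way REMEDI builds on this representation is detailed in the paper.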

In Paper C, we explore the integration of non-parametric model components into the flow-matching framework. This is done by learning heavily compressed latent representations of the images in the training dataset, which are then used as conditioning variables for the vector field network. Effectively, these can be seen as synthetic, continuous labels. The gain is a more efficient learning process, compared to baseline models, and more interpretable sampling. It is demonstrated that with sufficient compression, overfitting can be avoided and diversity among samples attained, despite conditioning on training samples.
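
As a schematic sketch (the standard conditional flow-matching objective with linear interpolation paths; the conditioning variable follows the description above, but the notation is not taken from the paper), the vector field network v_\theta is trained to minimize

    \mathcal{L}(\theta) = \mathbb{E}_{t, x_0, x_1} \big[ \, \| v_\theta(x_t, t, c(x_1)) - (x_1 - x_0) \|^2 \, \big], \qquad x_t = (1 - t) x_0 + t x_1,

where x_0 is noise, x_1 is a training image, and c(x_1) is its heavily compressed latent representation, acting as a synthetic, continuous label during both training and sampling.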

Arguably, the most impactful area in machine learning is generative modeling. One part of the thesis applies the theory of large deviations to two of its major methods: generative adversarial networks (GANs) and diffusion models, in particular Schrödinger bridges. The goal is to prove large deviation principles for certain sequences of probability measures associated with the models. In both cases, this gives a strong characterization of how these models converge, as certain model parameters are varied, toward an idealized description of their behavior, often carrying a well-understood mathematical structure. For Schrödinger bridges, this idealized limit model is a dynamical optimal transport plan. This tells us that as the parameter in question is varied, namely the reference noise level or, often equivalently, the level of entropic regularization, the plans converge rapidly toward optimal transport behavior, justifying the interpretation of weakly regularized Schrödinger bridges as approximate optimal transport plans. Since Schrödinger bridges (or entropically regularized optimal transport plans) have nicer computational properties than optimal transport, they are often used in its place, and it is therefore important to understand how close this connection is. The large deviation principles derived here, applicable to several popular deep generative models, thus contribute to this understanding. Stating and proving such large deviation results is the content of Papers E and F.
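
In schematic terms (standard formulations, not the precise setting of Papers E and F), the static entropically regularized optimal transport problem with regularization level \varepsilon > 0 reads

    \pi_\varepsilon \in \arg\min_{\pi \in \Pi(\mu, \nu)} \Big\{ \int c \, d\pi + \varepsilon \, D_{KL}(\pi \,\|\, \mu \otimes \nu) \Big\},

where \Pi(\mu, \nu) denotes the set of couplings of the marginals \mu and \nu. A large deviation principle in this small-regularization regime quantifies, on an exponential scale in 1/\varepsilon, how fast such plans concentrate on optimal transport plans as \varepsilon \to 0, which is the sense in which weakly regularized Schrödinger bridges approximate optimal transport.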

In Paper D, for GANs, we consider recently developed particle systems consisting of cohorts of networks for the generative task. Here, the relevant parameter is the number of particles in the system: we show that the training dynamics of these networks converge toward a McKean-Vlasov process, and a large deviation principle is established. This enables the study of the convergence of such particle systems, which have recently been proposed as a new generative model in the GAN literature, toward their mean-field behavior using the theory of large deviations.
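
Schematically (a generic interacting particle system and its mean-field limit, not the specific training dynamics of Paper D), the N-particle system takes the form

    dX_t^{i,N} = b\big(X_t^{i,N}, \mu_t^N\big) \, dt + \sigma \, dW_t^i, \qquad \mu_t^N = \frac{1}{N} \sum_{j=1}^{N} \delta_{X_t^{j,N}}, \quad i = 1, \dots, N,

and as N \to \infty the empirical measure \mu_t^N converges to the law \mu_t of the McKean-Vlasov process dX_t = b(X_t, \mu_t) \, dt + \sigma \, dW_t. A large deviation principle in N then quantifies the probability of deviations of the particle system from this mean-field limit.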

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-373185