
WCOM Autumn 2024: 20–21 September


 Abstracts of Invited Talks

  1. Ahmet Alacaoglu (UBCV Math): Revisiting Inexact Fixed-Point Iterations for Min-Max Problems: Stochasticity and Structured Nonconvexity

    In this talk, we revisit the analysis of inexact Halpern and Krasnosel'skii-Mann (KM) iterations for solving constrained and stochastic min-max problems. First, we relax the inexactness requirement on the computation of the resolvent in Halpern iteration. Second, we analyze the stochastic KM iteration with the multilevel Monte-Carlo estimator, which allows us to obtain nearly unbiased samples of the resolvent. We present the consequences of these results in the case of constrained and stochastic convex-concave min-max problems, such as last-iterate convergence guarantees. Then, we apply our developments to solve constrained nonconvex-nonconcave min-max problems satisfying a cohypomonotonicity assumption. Within this class of problems, we show how our developments help in extending the range of problems that first-order methods can solve.
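
    As background, the exact forms of the two iterations named in the title, for a nonexpansive operator T, can be sketched as follows. This is a minimal illustration with exact operator evaluations and a toy contraction standing in for T; it is not the inexact or stochastic variants analyzed in the talk.

      import numpy as np

      def halpern(T, x0, num_iters=1000):
          """Halpern iteration: x_{k+1} = lam_k*x0 + (1 - lam_k)*T(x_k),
          with the common anchor weights lam_k = 1/(k + 2)."""
          x = x0.copy()
          for k in range(num_iters):
              lam = 1.0 / (k + 2)
              x = lam * x0 + (1.0 - lam) * T(x)
          return x

      def krasnoselskii_mann(T, x0, alpha=0.5, num_iters=1000):
          """KM iteration: x_{k+1} = (1 - alpha)*x_k + alpha*T(x_k)."""
          x = x0.copy()
          for _ in range(num_iters):
              x = (1.0 - alpha) * x + alpha * T(x)
          return x

      # Toy usage: T is a contraction with fixed point (1, 2).
      T = lambda x: 0.5 * (x + np.array([1.0, 2.0]))
      print(halpern(T, np.zeros(2)))             # both approach [1, 2]
      print(krasnoselskii_mann(T, np.zeros(2)))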

  2. Robert Baraldi (Sandia Labs): An Inexact Trust-Region Algorithm for Nonsmooth Nonconvex Regularized Problems

    We develop a novel trust-region method to minimize the sum of a smooth nonconvex function and a nonsmooth convex function. Our method permits and systematically controls the use of inexact objective function and derivative evaluations inherent in large-scale system solves and compression techniques. We prove global convergence of our method in Hilbert space and elaborate on potential nonsmooth subproblem solvers based on ideas taken from their smooth counterparts. Under additional assumptions, we can also prove superlinear, or even quadratic, convergence to local minima. We demonstrate its efficacy on examples from PDE-constrained optimization.
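
    For orientation, the classical exact, smooth trust-region loop that such methods extend is sketched below, using the Cauchy point as a crude subproblem solver. This is a generic textbook sketch; it does not handle the nonsmooth term or the inexact evaluations that the talk's method controls.

      import numpy as np

      def trust_region(f, grad, hess, x0, delta0=1.0, delta_max=10.0,
                       eta=0.1, tol=1e-8, max_iters=200):
          """Basic trust-region method with the Cauchy point as the (very rough)
          solver of the model subproblem m(p) = f + g.p + 0.5 p.B p, ||p|| <= delta."""
          x, delta = np.asarray(x0, dtype=float), delta0
          for _ in range(max_iters):
              g, B = grad(x), hess(x)
              if np.linalg.norm(g) < tol:
                  break
              # Cauchy point: minimize the model along -g inside the ball.
              gBg = g @ B @ g
              tau = 1.0 if gBg <= 0 else min(1.0, np.linalg.norm(g)**3 / (delta * gBg))
              p = -tau * (delta / np.linalg.norm(g)) * g
              pred = -(g @ p + 0.5 * p @ B @ p)          # predicted decrease
              rho = (f(x) - f(x + p)) / pred             # actual / predicted
              if rho < 0.25:
                  delta *= 0.25                          # shrink: model was poor
              elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
                  delta = min(2.0 * delta, delta_max)    # expand: model was good
              if rho > eta:
                  x = x + p                              # accept the step
          return x

      # Usage on a toy quadratic f(x) = 0.5 ||x - 1||^2:
      x_star = trust_region(lambda x: 0.5 * np.sum((x - 1.0)**2),
                            lambda x: x - 1.0,
                            lambda x: np.eye(len(x)),
                            x0=np.zeros(3))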

  3. Marco Caoduro (UBCV Sauder): A characterization of unimodular hypergraphs with disjoint hyperedges

    Grossman et al. demonstrate that the determinants of a graph's incidence matrix can be characterized using the graph's odd cycle packing number. This characterization can be used to model problems as integer programs with bounded determinants, e.g., the bipartite matching problem, which has a totally unimodular constraint matrix, and the stable set problem. In our work, we look to extend Grossman et al.'s characterization to disjoint hypergraphs, i.e., hypergraphs whose hyperedges of size at least three are pairwise disjoint.

    This is joint work with Meike Neuwohner and Joseph Paat.

  4. Yiwen Chen (UBCO Math): Characterizing the inscribability of polytopes using slack matrices

    The inscribability of polytopes describes whether polytopes have a realization where all vertices lie on the same sphere. In this work, we characterize the problem of checking the inscribability of polytopes as a min-rank optimization problem based on slack matrices. We provide an SDP relaxation of the problem and prove that it is tight for certain classes of polytopes. For general polytopes, we apply the alternating projection method to the min-rank problem and design numerical experiments to demonstrate its accuracy.
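
    To illustrate the last step in generic terms, alternating projection between a bounded-rank set (projected via truncated SVD) and an affine set can be sketched as follows. The affine projection fix_diagonal below is a made-up stand-in, not the slack-matrix constraints from the talk.

      import numpy as np

      def alternating_projection(A0, rank, project_affine, num_iters=500):
          """Alternate between the set of matrices of rank <= `rank` and an
          affine set given by the user-supplied projection `project_affine`."""
          A = A0.copy()
          for _ in range(num_iters):
              # Project onto the low-rank set: keep the top `rank` singular values.
              U, s, Vt = np.linalg.svd(A, full_matrices=False)
              A = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
              # Project back onto the affine set (here: fix the diagonal).
              A = project_affine(A)
          return A

      # Toy usage: seek a rank-1 matrix whose diagonal entries all equal 1.
      def fix_diagonal(A):
          A = A.copy()
          np.fill_diagonal(A, 1.0)
          return A

      X = alternating_projection(np.random.default_rng(0).standard_normal((4, 4)),
                                 rank=1, project_affine=fix_diagonal)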

  5. Dima Drusvyatskiy (UW Math): The radius of statistical efficiency

    Classical results in asymptotic statistics show that the Fisher information matrix controls the difficulty of estimating a statistical model from observed data. In this work, we introduce a companion measure of robustness of an estimation problem: the radius of statistical efficiency (RSE) is the size of the smallest perturbation to the problem data that renders the Fisher information matrix singular. We compute RSE up to numerical constants for a variety of test bed problems, including principal component analysis, generalized linear models, phase retrieval, bilinear sensing, and matrix completion. In all cases, the RSE quantifies the compatibility between the covariance of the population data and the latent model parameter. Interestingly, we observe a precise reciprocal relationship between RSE and the intrinsic complexity/sensitivity of the problem instance, paralleling the classical Eckart–Young theorem in numerical analysis.
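
    The Eckart-Young parallel can be made concrete with the classical distance-to-singularity formula below, where A is an invertible n x n matrix and the norm is the spectral norm; this is the textbook fact being echoed, not the paper's definition of the RSE itself.

      \min\bigl\{ \|E\|_2 : A + E \ \text{is singular} \bigr\}
        \;=\; \sigma_{\min}(A)
        \;=\; \frac{1}{\|A^{-1}\|_2}.

    In words, a small distance to singularity corresponds exactly to high sensitivity, mirroring the reciprocal relationship between RSE and problem sensitivity described above.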

  6. Yuan Gao (UBCO Math): On a result by Baillon, Bruck and Reich

    It is well known that the iterates of an averaged nonexpansive mapping may converge only weakly to a fixed point. A celebrated result by Baillon, Bruck, and Reich from 1978 yields strong convergence in the presence of linearity. In this talk, we extend this result to allow for flexible relaxation parameters. Some variants of the extended result for averaged operators and affine mappings are also discussed. Examples are provided to illustrate the results.
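
    For reference, the relaxed iteration in question has the standard form below, where T is a nonexpansive mapping and the lambda_n are the relaxation parameters; the extension concerns how flexibly these parameters may be chosen while retaining strong convergence in the linear setting.

      x_{n+1} = (1 - \lambda_n)\, x_n + \lambda_n T x_n, \qquad \lambda_n \in (0, 1).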

  7. Begoña García Malaxechebarría (UW Math): The High Line: SGD dynamics of adaptive learning rate algorithms

    We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates, an idealized exact line search and AdaGrad-Norm, on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, the idealized exact line search strategy can exhibit arbitrarily slower convergence than SGD with the optimal fixed learning rate. Moreover, we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and we identify a phase transition when the covariance density of eigenvalues follows a power-law distribution.
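
    As a point of reference, one-pass SGD with the AdaGrad-Norm step size on a least squares problem can be sketched as follows; this is a plain simulation with made-up data, not the deterministic ODE analysis from the talk.

      import numpy as np

      def adagrad_norm_sgd(X, y, eta=1.0, b0=1.0):
          """One-pass SGD on 0.5*(x_k.w - y_k)^2 with the AdaGrad-Norm step size
          eta / b_k, where b_k^2 accumulates the squared gradient norms."""
          n, d = X.shape
          w = np.zeros(d)
          b_sq = b0**2
          for k in range(n):                      # one pass: each sample used once
              x_k, y_k = X[k], y[k]
              g = (x_k @ w - y_k) * x_k           # stochastic gradient at sample k
              b_sq += g @ g
              w -= (eta / np.sqrt(b_sq)) * g      # adaptive learning rate eta / b_k
          return w

      # Toy usage with noiseless targets y = X w_star.
      rng = np.random.default_rng(0)
      X = rng.standard_normal((5000, 20))
      w_star = rng.standard_normal(20)
      w_hat = adagrad_norm_sgd(X, X @ w_star)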

  8. Richard Hoshino (Northeastern University (Vancouver)): Cohort-Based Timetabling with Integer Linear Programming

    At some educational institutions, students are divided into cohorts, where they complete the same set of courses as everyone else in their cohort. In this paper, we describe an Integer Linear Programming (ILP) solution to the School Timetabling Problem (STP) for schools that are cohort-based. Our Master Timetable is generated from the input data, a user-friendly Excel document that lists all of the course/cohort/teacher/classroom constraints.

    Our model, which is coded in Python and solved using the MPSolver from Google OR-Tools, generated the 2023–2024 Master Timetable for three schools in Canada: an elementary school, a middle school, and a post-secondary institute. All three timetables satisfied 100% of each school's hard constraints and were computed in less than 60 seconds.
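
    To give a flavor of the modeling layer, a toy cohort-to-timeslot assignment written against OR-Tools' MPSolver interface might look as follows; the data, variable names, and constraints are illustrative stand-ins, not the actual Excel-driven model described in the talk.

      from ortools.linear_solver import pywraplp

      # Hypothetical toy data: two cohorts sharing one teacher, three timeslots.
      cohorts = ["C1", "C2"]
      timeslots = ["Mon-AM", "Mon-PM", "Tue-AM"]
      teacher_of = {"C1": "T1", "C2": "T1"}

      solver = pywraplp.Solver.CreateSolver("SCIP")
      x = {(c, t): solver.BoolVar(f"x_{c}_{t}") for c in cohorts for t in timeslots}

      # Hard constraint: each cohort is scheduled exactly once.
      for c in cohorts:
          solver.Add(solver.Sum([x[c, t] for t in timeslots]) == 1)

      # Hard constraint: a teacher teaches at most one cohort per timeslot.
      for t in timeslots:
          for teacher in set(teacher_of.values()):
              solver.Add(solver.Sum([x[c, t] for c in cohorts
                                     if teacher_of[c] == teacher]) <= 1)

      if solver.Solve() == pywraplp.Solver.OPTIMAL:
          for (c, t), var in x.items():
              if var.solution_value() > 0.5:
                  print(f"{c} -> {t}")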

  9. Brian Irwin: EnKSGD: A Class of Preconditioned Black Box Optimization and Inversion Algorithms

    We introduce the ensemble Kalman–Stein gradient descent (EnKSGD) class of algorithms. The EnKSGD class of algorithms builds on the ensemble Kalman filter (EnKF) line of work, applying techniques from sequential data assimilation to unconstrained optimization and parameter estimation problems. An essential idea is to exploit the EnKF as a black box (i.e., derivative-free, zeroth order) optimization tool if iterated to convergence. In this paper, we return to the foundations of the EnKF as a sequential data assimilation technique, including its continuous-time and mean-field limits, with the goal of developing faster optimization algorithms suited to noisy black box optimization and inverse problems. The resulting EnKSGD class of algorithms can be designed to both maintain the desirable property of affine-invariance and employ the well-known backtracking line search. Furthermore, EnKSGD algorithms are designed to not necessitate the subspace restriction property and to avoid the variance collapse property of previous iterated EnKF approaches to optimization, as both these properties can be undesirable in an optimization context. EnKSGD also generalizes beyond the L2 loss and is thus applicable to a wider class of problems than the standard EnKF. Numerical experiments with empirical risk minimization type problems, including both linear and nonlinear least squares problems, as well as maximum likelihood estimation, demonstrate the faster empirical convergence of EnKSGD relative to alternative EnKF approaches to optimization.

  10. Tanmaya Karmarkar (UBCO CS): Computing the convex envelope of nonconvex piecewise bivariate linear-quadratic functions

    Computing the convex envelope of a piecewise function is not always the same as computing the convex envelope of each piece. Our algorithm starts by computing the convex envelope of each piece, obtaining a rational function defined over a polyhedral subdivision. Then we compute the conjugate of each of those pieces and obtain a fractional form defined over a parabolic subdivision. This is followed by computing the maximum of all those functions to obtain a piecewise function defined on a parabolic subdivision. We finally compute the biconjugate to obtain the convex envelope of the original piecewise linear-quadratic function.
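
    Two standard conjugacy facts underlie this pipeline: the conjugate of a pointwise minimum of the pieces f_i is the pointwise maximum of their conjugates, and (under the usual properness and closedness assumptions) the biconjugate recovers the closed convex envelope. The identities below are generic convex-analysis background, not the talk's specific linear-quadratic computations.

      \Bigl(\min_{i} f_i\Bigr)^{*} \;=\; \max_{i} f_i^{*},
      \qquad
      f^{**} \;=\; \overline{\operatorname{conv}}\, f .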

  11. Jiajin Li (UBCV Sauder): Spurious stationarity and hardness results for mirror descent

    Despite the considerable success of Bregman proximal-type algorithms, such as mirror descent, in machine learning, a critical question remains: Can existing stationarity measures, often based on Bregman divergence, reliably distinguish between stationary and non-stationary points? In this talk, we present an important finding that is frequently overlooked: all existing stationarity measures necessarily imply the existence of spurious stationary points. We further establish an algorithm-independent hardness result: Bregman proximal-type algorithms are unable to escape from a spurious stationary point in finitely many steps when the initial point is unfavorable, even for convex problems. Our hardness result highlights the inherent distinction between Euclidean and Bregman geometries and poses fundamental theoretical and numerical challenges to both the machine learning and optimization communities.
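
    For reference, the Bregman proximal (mirror descent) update behind these algorithms is shown below, for a step size eta > 0, a constraint set C, and a kernel h with Bregman divergence D_h; the notation is generic and not specific to the talk.

      D_h(x, y) = h(x) - h(y) - \langle \nabla h(y),\, x - y \rangle,
      \qquad
      x_{k+1} \in \operatorname*{argmin}_{x \in C}
        \Bigl\{ \langle \nabla f(x_k),\, x \rangle + \tfrac{1}{\eta}\, D_h(x, x_k) \Bigr\}.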

  12. Max Neyra-Nesterenko (SFU): Parameter-free and optimal restart schemes for first-order methods via approximate sharpness

    Sharpness is a generic assumption in continuous optimization that bounds the distance to the set of minimizers in terms of the suboptimality in the objective function. It leads to the acceleration of first-order optimization methods via restarting schemes. However, previous restart schemes often rely on typically unknown constants, yield suboptimal convergence rates, and are difficult to apply in the presence of noise or approximate model classes. In this talk, I introduce the notion of approximate sharpness, a generalization of sharpness that incorporates an unknown constant perturbation to the objective function error. By employing a new type of search over the unknown sharpness constants, I will then describe a restart scheme that applies to general first-order methods. This scheme maintains the same convergence rate as when the constants are assumed known. Moreover, for broad classes of problems, our scheme gives rates of convergence that either match known optimal rates or improve on previously established rates. Finally, I will demonstrate the efficacy of this scheme on applications including sparse recovery, compressive imaging, and feature selection in machine learning.

    This is joint work with Matthew J. Colbrook (Cambridge) and Ben Adcock (SFU). The corresponding paper can be found here: https://arxiv.org/abs/2301.02268
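
    For orientation, a common way to write the exact sharpness condition is the bound below, with minimizer set X*, optimal value f*, and constants c > 0 and beta >= 1; approximate sharpness, as introduced in the talk, additionally allows an unknown perturbation of the objective error (see the linked paper for the precise form).

      \operatorname{dist}(x, X^{*}) \;\le\; c\,\bigl(f(x) - f^{*}\bigr)^{1/\beta}
      \qquad \text{for all feasible } x.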

  13. Alan Milligan (UBCV CS): Towards understanding the performance gap between Adam and SGD in deep learning

    The Adam optimizer has become ubiquitous in modern machine learning. In some tasks, such as training large language models, Adam outperforms stochastic gradient descent (SGD) by such a large margin that it has become the default option. Despite this, we do not have a good explanation of where the performance gap comes from. While several competing theories for the gap have been proposed, there is no clear understanding of what attributes of an optimization problem cause SGD to fail while Adam remains effective. In this work, we identify a cause of this gap on language tasks: the power-law nature of word frequencies leads to heavy-tailed class imbalance. We show that class imbalance makes this gap appear across problems, including on vision tasks and linear models. Our results provide evidence for an existing theory for the improved performance of Adam, which relies on a correlation between the magnitudes of the gradient and Hessian across coordinates, by showing that this behavior appears naturally under class imbalance.
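
    For readers who want the two updates side by side, minimal SGD and Adam steps are sketched below in their standard textbook forms; this is generic code, not tied to the experiments in the talk.

      import numpy as np

      def sgd_step(w, g, lr=0.1):
          """Plain (stochastic) gradient descent step."""
          return w - lr * g

      def adam_step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
          """One Adam step; `state` holds the moment estimates m, v and step count t."""
          state["t"] += 1
          state["m"] = beta1 * state["m"] + (1 - beta1) * g          # first moment
          state["v"] = beta2 * state["v"] + (1 - beta2) * g**2       # second moment
          m_hat = state["m"] / (1 - beta1**state["t"])               # bias correction
          v_hat = state["v"] / (1 - beta2**state["t"])
          return w - lr * m_hat / (np.sqrt(v_hat) + eps)             # per-coordinate scaling

      # Usage: state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}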

  14. Nicholas Richardson (UBCV Math): Density Separation with Tensor Factorization

    Nonnegative tensor factorization is used to separate mixtures of probability densities. Kernel density estimation transforms raw samples into compressed and discretized densities. A specialized block coordinate descent algorithm that enforces the required simplex constraints is guaranteed to converge to Nash equilibria. Numerical experiments on real geological and spatial transcriptomics data, using our open-source Julia implementation, illustrate the model's effectiveness at separating the density mixtures.
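
    As an illustration of how simplex constraints are typically enforced inside a block coordinate scheme, a standard Euclidean projection onto the probability simplex is sketched below; this is a generic routine, not the talk's specialized algorithm or its Julia implementation.

      import numpy as np

      def project_simplex(v):
          """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}
          via the classical sort-based algorithm."""
          u = np.sort(v)[::-1]                      # sort descending
          css = np.cumsum(u)
          idx = np.arange(1, len(v) + 1)
          rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
          theta = (1.0 - css[rho]) / (rho + 1)
          return np.maximum(v + theta, 0.0)

      print(project_simplex(np.array([2.0, 0.0, -1.0])))   # -> [1., 0., 0.]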

  15. Betty Shea (UBCV CS): Why Line Search when you can Plane Search?

    The practical performance of an optimization method depends on details such as the choice of step size. Strategies for setting step sizes are generally limited to tuning a fixed step size as a hyperparameter, step size schedules, and inexact line searches such as backtracking Armijo. This talk discusses a lesser-known step size strategy, the plane search, which could be used for many problem classes, including some neural networks.
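
    To make the idea concrete in the simplest setting, for a quadratic objective f(x) = (1/2) x^T A x - b^T x an exact plane search over two directions reduces to a 2x2 linear system. The sketch below is a generic illustration under that assumption (A symmetric positive definite, d1 and d2 linearly independent), not the method developed in the talk.

      import numpy as np

      def plane_search_quadratic(A, b, x, d1, d2):
          """Exactly minimize f(x + s*d1 + t*d2) for f(x) = 0.5 x.A.x - b.x
          by solving the 2x2 system in the step sizes (s, t)."""
          D = np.column_stack([d1, d2])           # search plane spanned by d1, d2
          g = A @ x - b                           # gradient at x
          H = D.T @ A @ D                         # 2x2 reduced Hessian
          s, t = np.linalg.solve(H, -D.T @ g)     # exact minimizer within the plane
          return x + s * d1 + t * d2

      # Usage idea: d1 = negative gradient, d2 = previous step (momentum direction).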

  16. Ziyuan Wang (UBCO Math): Level proximal subdifferential, variational convexity, and pointwise Lipschitz smoothness

    The level proximal subdifferential was introduced by Rockafellar in 2021. Every proximal operator is the resolvent of a level proximal subdifferential. However, to the best of our knowledge, the level proximal subdifferential has been largely left unexplored in the literature. In this talk, we present a systematic study of the level proximal subdifferential, revealing its remarkable connections to proximal hulls and to the variational convexity of proper, lsc, and prox-bounded functions. A pointwise version of Lipschitz smoothness will also be investigated through the lens of the level proximal subdifferential.
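
    For reference, the two objects related in the second sentence are the proximal operator of a function f and the resolvent of an operator A, defined for a parameter lambda > 0 by the standard formulas below.

      \operatorname{prox}_{\lambda f}(x)
        = \operatorname*{argmin}_{y} \Bigl\{ f(y) + \tfrac{1}{2\lambda}\,\|y - x\|^{2} \Bigr\},
      \qquad
      J_{\lambda A} = (\operatorname{Id} + \lambda A)^{-1}.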