
WIP: Multisecant Accelerated Descent (MAD) solver for TAO

Alp Dener requested to merge denera/tao-multisecant-accelerated-descent into main

MAD is an experimental solver for generally constrained nonlinear optimization problems with noisy gradients. This implementation supports bound, equality, and double-sided inequality constraints, but it makes no distinction between linear and nonlinear constraints. This PDF file contains more details on the formulation of the reduced KKT system and its solution.
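
For reference, a generic statement of the problem class described above (my paraphrase of the constraint types listed, not necessarily the exact formulation in the PDF):

```math
\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{s.t.} \quad g(x) = 0, \quad c_l \le h(x) \le c_u, \quad x_l \le x \le x_u
```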

MAD with a scalar "preconditioner" is supposed to be equivalent to Anderson mixing, but the primal-dual vector space manipulations in the optimization application make it difficult to reuse the existing Anderson/NGMRES code from SNES for the heavy lifting. Specifically, solving the least-squares subproblem with LAPACK_dgelss_ requires putting the accumulated G matrix into a sequential 1D array, but this matrix is built from VECNEST columns, and VecScatterCreateToAll() is not supported for VECNEST.

Previously, while working on the TAOALMM solver (augmented Lagrangian method), I implemented a VecConcatenate() call that converts an array of vectors into a single contiguous vector. For this MR I added a new VecNestConcatenate() call that internally applies VecConcatenate() to the VECNEST's internal array of sub-vectors to produce the contiguous version. That contiguous vector does work with VecScatterCreateToAll(), which generates sequential copies of the matrix columns, and from those I can assemble the 1D array needed for the LAPACK call (see the sketch below). I'm not sure this is the best way to do it, but that's what's implemented for now (not tested yet).
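
For concreteness, here is a rough sketch of that flattening path using only existing PETSc calls; the helper name `FlattenNestColumn` is illustrative, not the actual MR code:

```c
#include <petscvec.h>

/* Sketch (not the MR implementation): flatten one VECNEST column of G into a
   sequential array usable as a column of the 1D LAPACK work matrix. */
static PetscErrorCode FlattenNestColumn(Vec nest_col, PetscScalar *lapack_col)
{
  PetscInt           nsub, n;
  Vec               *subvecs, contiguous, seq;
  VecScatter         scatter;
  const PetscScalar *a;

  PetscFunctionBegin;
  /* Grab the sub-vectors inside the VECNEST and glue them into one contiguous
     parallel vector (roughly what VecNestConcatenate() does internally, per
     the description above). */
  PetscCall(VecNestGetSubVecs(nest_col, &nsub, &subvecs));
  PetscCall(VecConcatenate(nsub, subvecs, &contiguous, NULL));
  /* Scatter the contiguous vector to a sequential copy on every rank. */
  PetscCall(VecScatterCreateToAll(contiguous, &scatter, &seq));
  PetscCall(VecScatterBegin(scatter, contiguous, seq, INSERT_VALUES, SCATTER_FORWARD));
  PetscCall(VecScatterEnd(scatter, contiguous, seq, INSERT_VALUES, SCATTER_FORWARD));
  /* Copy the sequential values into the column of the LAPACK array. */
  PetscCall(VecGetLocalSize(seq, &n));
  PetscCall(VecGetArrayRead(seq, &a));
  PetscCall(PetscArraycpy(lapack_col, a, n));
  PetscCall(VecRestoreArrayRead(seq, &a));
  PetscCall(VecScatterDestroy(&scatter));
  PetscCall(VecDestroy(&seq));
  PetscCall(VecDestroy(&contiguous));
  PetscFunctionReturn(PETSC_SUCCESS);
}
```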

I also implemented a secondary solution method for the least-squares problem using the normal equations and KSPCG. Just like in Anderson mixing, this is not ideal because the columns of G can become nearly linearly dependent. SVD would be ideal, but there are circumstances where it is not available. The Python version of MAD suffers from this because PyTorch does not provide a truncated SVD on GPUs; there I perturb the diagonal and do a Cholesky solve, with a failover option that copies everything to the CPU and uses SciPy's SVD. This seems to work well enough, at least for the ML training problem, with few failover triggers. Introducing a dependence on MUMPS for the TAO version might be helpful; I think inertia information could make the diagonal correction relatively fool-proof.
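
A minimal sketch of the normal-equations fallback, assuming G is stored as a dense Mat and using a small diagonal shift `mu` as the regularization; the function name and the way the shift is chosen are illustrative, not the MR code:

```c
#include <petscksp.h>

/* Sketch: solve min ||G*alpha - f|| via the regularized normal equations
   (G^T G + mu I) alpha = G^T f with KSPCG. The shift mu guards against
   nearly linearly dependent columns of G. */
static PetscErrorCode LeastSquaresNormalCG(Mat G, Vec f, PetscReal mu, Vec alpha)
{
  Mat N;
  Vec rhs;
  KSP ksp;

  PetscFunctionBegin;
  /* Form the small k-by-k normal matrix N = G^T G and shift its diagonal. */
  PetscCall(MatTransposeMatMult(G, G, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &N));
  PetscCall(MatShift(N, mu));
  /* rhs = G^T f */
  PetscCall(VecDuplicate(alpha, &rhs));
  PetscCall(MatMultTranspose(G, f, rhs));
  /* Solve the shifted normal equations with CG. */
  PetscCall(KSPCreate(PetscObjectComm((PetscObject)G), &ksp));
  PetscCall(KSPSetType(ksp, KSPCG));
  PetscCall(KSPSetOperators(ksp, N, N));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, rhs, alpha));
  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&rhs));
  PetscCall(MatDestroy(&N));
  PetscFunctionReturn(PETSC_SUCCESS);
}
```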

I would really appreciate any and all feedback, both on the theory and on the practical implementation! It's highly likely that some of these operations are done in inefficient ways...
