Sort by year:

2017
arXiv Preprint arXiv

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to the finite-sum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; when comparing to SAG/SAGA, SARAH does not require a storage of past gradients. The linear convergence rate of SARAH is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, the property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm.

2017
Journal Paper Optimization Methods and Software

A novel rank-constrained re-formulation of alternating-current optimal power flow problem makes it possible to derive novel semidefinite programming (SDP) relaxations. For those, we develop a solver, which is often as fast as Matpower's interior point method, within the same accuracy.

Detection of the deficiencies affecting the performance of the structures has been studied over the past few decades. How- ever, with the long-term data collection from dense sensor arrays, accurate damage diagnosis has become computationally challenging task. To address such problem, this paper introduces convolutional neural network (CNN), which has led to break- through results in computer vision, to the damage detection challenge. CNN technique has the ability to discover abstract features which are able to discriminate various aspect of interest. In our case, these features are used to classify “damaged” and “healthy” cases modeled through the finite element simulations. CNN is performed by using a Python library called Theano with the graphics processing unit (GPU) to achieve higher performance of these data-intensive calculations. The accuracy and sensitivity of the proposed technique are assessed with a cracked steel gusset connection model with multiplicative noise. Dur- ing the training procedure, strain distributions generated from different crack and loading scenarios are adopted. Completely unseen damage setups are introduced to the simulations while testing. Based on the findings of the proposed study, high accu- racy, robustness and computational efficiency are succeeded for the damage diagnosis.

2016
arXiv Preprint arXiv

We propose a projected semi-stochastic gradient descent method with mini-batch for improving both the theoretical complexity and practical performance of the general stochastic gradient descent method (SGD). We are able to prove linear convergence under weak strong convexity assumption. This requires no strong convexity assumption for minimizing the sum of smooth convex functions subject to a compact polyhedral set, which remains popular across machine learning community. Our PS2GD preserves the low-cost per iteration and high optimization accuracy via stochastic gradient variance-reduced technique, and admits a simple parallel implementation with mini-batches. Moreover, PS2GD is also applicable to dual problem of SVM with hinge loss.

Many steady-state problems in power systems, including rectangular power-voltage formulations of optimal power flows in the alternating-current model (ACOPF), can be cast as polynomial optimisation problems (POP). For a POP, one can derive strong convex relaxations, or rather hierarchies of ever stronger, but ever larger relaxations. We study means of switching from solving the convex relaxation to Newton method working on a non-convex Lagrangian of the POP.

2016
arXiv Preprint arXiv

The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for the distributed environment, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.

2017
Journal Paper Journal of Machine Learning Research (to appear)

In this paper we generalize the framework of the feasible descent method (FDM) to a randomized (R-FDM) and a coordinate-wise random feasible descent method (RC-FDM) framework. We show that the famous SDCA algorithm for optimizing the SVM dual problem, or the stochastic coordinate descent method for the LASSO problem, fits into the framework of RC-FDM. We prove linear convergence for both R-FDM and RC-FDM under the weak strong convexity assumption. Moreover, we show that the duality gap converges linearly for RC-FDM, which implies that the duality gap also converges linearly for SDCA applied to the SVM dual problem.

The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This inherently gives the algorithm a stochastic flavor that can cause instability in L-BFGS, a popular batch method in machine learning. These difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations; when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

2016
Journal Paper Optimization Letters, 10(6), 1233-1243

We propose and analyze a new parallel coordinate descent method—NSync—in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen using an arbitrary probability law. This is the first method of this type. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. The complexity and practical performance of the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iteration—with optimal probabilities—may require less iterations, both in theory and practice, than the strategy of updating all coordinates at every iteration.

2016
Journal Paper European Journal of Operational Research (to appear)

Matrix completion under interval uncertainty can be cast as a matrix completion problem with element-wise box constraints. We present an efficient alternating-direction parallel coordinate-descent method for the problem. We show that the method outperforms any other known method on a benchmark in image in-painting in terms of signal-to-noise ratio, and that it provides high-quality solutions for an instance of collaborative filtering with 100,198,805 recommendations within 5 minutes on a single personal computer.

The newsvendor problem is one of the most basic and widely applied inventory models. There are numerous extensions of this problem. One important extension is the multi-item newsvendor problem, in which the demand of each item may be correlated with that of other items. If the joint probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may be not optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately as is typically done. The algorithm does not require the knowledge of the probability distributions of the demand. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and SVM approaches, especially for demands with high volatility.

Training deep neural network is a high dimensional and a highly non-convex optimization problem. Stochastic gradient descent (SGD) algorithm and it's variations are the current state-of-the-art solvers for this task. However, due to non-covexity nature of the problem, it was observed that SGD slows down near saddle point. Recent empirical work claim that by detecting and escaping saddle point efficiently, it's more likely to improve training performance. With this objective, we revisit Hessian-free optimization method for deep networks. We also develop its distributed variant and demonstrate superior scaling potential to SGD, which allows more efficiently utilizing larger computing resources thus enabling large models and faster time to obtain desired solution. Furthermore, unlike truncated Newton method (Marten's HF) that ignores negative curvature information by using na"ive conjugate gradient method and Gauss-Newton Hessian approximation information - we propose a novel algorithm to explore negative curvature direction by solving the sub-problem with stabilized bi-conjugate method involving possible indefinite stochastic Hessian information. We show that these techniques accelerate the training process for both the standard MNIST dataset and also the TIMIT speech recognition problem, demonstrating robust performance with upto an order of magnitude larger batch sizes. This increased scaling potential is illustrated with near linear speed-up on upto 16 CPU nodes for a simple 4-layer network.

2016
Conference Paper ICML 2016 (33rd International Conference on Machine Learning)

We propose a new algorithm for minimizing regularized empirical loss: Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each iteration we update a random subset of the dual variables. However, unlike existing methods such as stochastic dual coordinate ascent, SDNA is capable of utilizing all curvature information contained in the examples, which leads to striking improvements in both theory and practice - sometimes by orders of magnitude. In the special case when an L2-regularizer is used in the primal, the dual problem is a concave quadratic maximization problem plus a separable term. In this regime, SDNA in each step solves a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function; whence the name of the method. If, in addition, the loss functions are quadratic, our method can be interpreted as a novel variant of the recently introduced Iterative Hessian Sketch.

2016
arXiv Preprint arXiv

In this paper we study inexact dumped Newton method implemented in a distributed environment. We start with an original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015]. We will show that this algorithm may not scale well and propose an algorithmic modifications which will lead to less communications, better load-balancing and more efficient computation. We perform numerical experiments with an regularized empirical loss minimization instance described by a 273GB dataset.

2016
Conference Paper ICML 2016 (33rd International Conference on Machine Learning)

We propose an algorithm-independent framework to equip existing optimization methods with primal-dual certificates. Such certificates and corresponding rate of convergence guarantees are important for practitioners to diagnose progress, in particular in machine learning applications. We obtain new primal-dual convergence rates e.g. for the Lasso as well as many L1, Elastic-Net and group-lasso-regularized problems. The theory applies to any norm-regularized generalized linear model. Our approach provides efficiently computable duality gaps which are globally defined, without modifying the original problems in the region of interest.

2016
Journal Paper Journal of Machine Learning Research

In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix.

2016
Journal Paper IEEE Journal of Selected Topics in Signal Processing (to appear)

We propose mS2GD: a method incorporating a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent (S2GD). We consider the problem of minimizing a strongly convex function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex regularizer. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample $ functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that it benefits from two speedup effects. First, we prove that as long as $ is below a certain threshold, we can reach any predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

With the growth of data and necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.

2015
Conference Paper OptML@NIPS 2015

In this paper we study the effect of the way that the data is partitioned in distributed optimization. The original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015] partitions the input data based on samples. We describe how the original algorithm has to be modified to allow partitioning on features and show its efficiency both in theory and also in practice.

2015
Conference Paper OptML@NIPS 2015

In this paper we develop dual free SDCA with adaptive probabilities for regularized empirical risk minimization. This extends recent work of Shai Shalev-Shwartz [SDCA without Duality, arXiv:1502.06177] to allow non-uniform selection of "dual" coordinate in SDCA. Moreover, the probability can change over time, making it more efficient than uniform selection. Our work focuses on generating adaptive probabilities through iterative process, preferring to choose coordinate with highest potential to decrease sub-optimality. We also propose a practical variant Algorithm adfSDCA+ which is more aggressive. The work is concluded with multiple experiments which shows efficiency of proposed algorithms.

Optimisation problems in power systems employing alternating-current models of power flows have driven much recently interest in (convergent hierarchies of) convex relaxations for polynomial optimisation problems. Readily available second-order methods for solving the convex relaxations on real-world large-scale power systems often fail to perform even a single iteration within reasonable run-times. First-order methods have much lower per-iteration computational and memory requirements, but require many more iterations than second-order methods to converge within the same accuracy. We hence study means of switching from first-order methods for solving the convex relaxation to Newton method working on the original non-convex problem, which would allow for convergence under the same conditions as in solvers for the convex relaxation, but with an improved rate of convergence. We illustrate our approach on the alternating current power flows (ACPF) and alternating current optimal power flows (ACOPF).

We present an improved analysis of mini-batched stochastic dual coordinate ascent for regularized empirical loss minimization (i.e. SVM and SVM-type objectives). Our analysis allows for flexible sampling schemes, including where data is distribute across machines, and combines a dependence on the smoothness of the loss and/or the data spread (measured through the spectral norm).

2015
Conference Paper ICML 2015 (32nd International Conference on Machine Learning)

Distributed optimization algorithms for large-scale machine learning suffer from a communication bottleneck. Reducing communication makes the efficient aggregation of partial work from different machines more challenging. In this paper we present a novel generalization of the recent communication efficient primal-dual coordinate ascent framework (CoCoA). Our framework, CoCoA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes only allowed conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both CoCoA as well as our new variants, and generalize the theory for both methods to also cover non-smooth convex loss functions. We provide an extensive experimental comparison on several real-world distributed datasets, showing markedly improved performance, especially when scaling up the number of machines.

2015
Journal Paper Mathematical Programming

In this work we show that
randomized (block) coordinate descent methods can be accelerated by
parallelization when applied to the problem of minimizing the sum of
a *partially separable* smooth convex function and a simple
separable convex function. The theoretical speedup, as compared to
the serial method, and referring to the number of iterations needed
to approximately solve the problem with high probability, is a simple
expression depending on the number of parallel processors and a
natural and easily computable measure of separability of the smooth
component of the objective function. In the worst case, when no
degree of separability is present, there may be no speedup; in the
best case, when the problem is separable, the speedup is equal to the
number of processors.

Our analysis also works in the mode when the number of blocks being
updated at each iteration is random, which allows for modeling
situations with busy or unreliable processors. We show that our
algorithm is able to solve a LASSO problem involving a matrix with 20
billion nonzeros in 2 hours on a large memory node with 24 cores.

In this work we study the parallel coordinate descent method (PCDM) proposed by Richtarik and Takac [26] for minimizing a regularized convex function. We adopt elements from the work of Xiao and Lu [39], and combine them with several new insights, to obtain sharper iteration complexity results for PCDM than those presented in [26]. Moreover, we show that PCDM is monotonic in expectation, which was not confirmed in [26], and we also derive the first high probability iteration complexity result where the initial levelset is unbounded.

2014
Conference Paper OPT 2014: Optimization for Machine Learning @NIPS 2014

We propose a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent applied to the problem of minimizing a strongly convex composite function represented as the sum of an average of a large number of smooth convex functions, and simple nonsmooth convex function. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample b functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that the method benefits from two speedup effects. First, we prove that as long as b is below a certain threshold, we can reach predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same .001-accurate solution quality on average 25x as quickly.

2014
PhD thesis University of Edinburgh

2014
Journal Paper Numerical Analysis and Optimization 2014, Springer Proceedings in Mathematics and Statistics

In this work we propose a distributed randomized block coordinate descent method for minimizing a convex function with a huge number of variables/coordinates. We analyze its complexity under the assumption that the smooth part of the objective function is partially block separable, and show that the degree of separability directly influences the complexity. This extends the results in [22] to a distributed environment. We first show that partially block separable functions admit an expected separable overapproximation (ESO) with respect to a distributed sampling, compute the ESO parameters, and then specialize complexity results from recent literature that hold under the generic ESO assumption. We describe several approaches to distribution and synchronization of the computation across a cluster of multi-core computer and provide promising computational results.

2014
Conference Paper MLSP2014: IEEE International Workshop on Machine Learning for Signal Processing

We propose an efficient distributed randomized coordinate descent
method for minimizing regularized non-strongly convex loss functions.
The method attains the optimal *O(1/k^2)* convergence rate, where
*k* is the iteration counter. The core of the work is the
theoretical study of stepsize parameters. We have implemented the method
on Archer - the largest supercomputer in the UK - and show that the
method is capable of solving a (synthetic) LASSO optimization problem
with 50 billion variables.

We propose a novel topic discovery algorithm for unlabeled images based on the bag-of-words (BoW) framework. We first extract a dictionary of visual words and subsequently for each image compute a visual word occurrence histogram. We view these histograms as rows of a large matrix from which we extract sparse principal components (PCs). Each PC identifies a sparse combination of visual words which co-occur frequently in some images but seldom appear in others. Each sparse PC corresponds to a topic, and images whose interference with the PC is high belong to that topic, revealing the common parts possessed by the images. We propose to solve the associated sparse PCA problems using an Alternating Maximization (AM) method, which we modify for purpose of efficiently extracting multiple PCs in a deflation scheme. Our approach attacks the maximization problem in sparse PCA directly and is scalable to high-dimensional data. Experiments on automatic topic discovery and category prediction demonstrate encouraging performance of our approach.

2013
Conference Paper ICML 2013 (30th International Conference on
Machine Learning)

We address the issue of using
mini-batches in stochastic optimization of SVMs. We show that the
same quantity, the *spectral norm of the data*, controls the
parallelization speedup obtained for both primal stochastic
subgradient descent (SGD) and stochastic dual coordinate ascent
(SCDA) methods and use it to derive novel variants of mini-batched
SDCA. Our guarantees for both methods are expressed in terms of the
original nonsmooth primal problem based on the hinge-loss.

2012
arXiv Preprint

Given a multivariate data set,
sparse principal component analysis (SPCA) aims to extract several
linear combinations of the variables that together explain the
variance in the data as much as possible, while controlling the
number of nonzero loadings in these combinations.

In this paper we consider 8 different optimization formulations for
computing a single sparse loading vector; these are obtained by
combining the following factors: we employ *two* norms for
measuring variance (L2, L1) and *two* sparsity-inducing norms
(L0, L1), which are used in *two* different ways (constraint,
penalty). Three of our formulations, notably the one with L0
constraint and L1 variance, have not been considered in the
literature.

We give a unifying reformulation which we propose to solve via a
natural alternating maximization (AM) method. We show the the AM
method is nontrivially equivalent to GPower (Journee et al; JMLR **11**:517--553,
2010) for all our formulations. Besides this, we provide 24 efficient
parallel SPCA implementations: 3 codes (multi-core, GPU and cluster)
for each of the 8 problems.

Parallelism in the methods is aimed at

- speeding up computations (our GPU code can be 100 times faster than an efficient serial code written in C++),
- obtaining solutions explaining more variance and
- dealing with big data problems (our cluster code is able to solve a 357 GB problem in about a minute).

2011
Conference Paper Operations Research Proceedings 2011, pp. 27-32,
Springer-Verlag 2012

In this work we propose solving huge-scale instances of the truss topology design problem with coordinate descent methods. We develop four efficient codes: serial and parallel implementations of randomized and greedy rules for the selection of the variable (potential bar) to be updated in the next iteration. Both serial methods enjoy an O(n/k) iteration complexity guarantee, where n is the number of potential bars and k the iteration counter. Our parallel implementations, written in CUDA and running on a graphical processing unit (GPU), are capable of speedups of up to two orders of magnitude when compared to their serial counterparts. Numerical experiments were performed on instances with up to 30 million potential bars.

2011
Journal Paper Mathematical Programming, Series A, 38 pages,
2012

In this paper we develop a
randomized block-coordinate descent method for minimizing the sum of
a smooth and a simple nonsmooth block-separable convex function and
prove that it obtains an $\epsilon$-accurate solution with
probability at least $1-
ho$ in at most $O( frac{n}{\epsilon} \log
frac{1}{
ho})$ iterations, where $n$ is the number of blocks. For
strongly convex functions the method converges linearly. This extends
recent results of Nesterov *[Efficiency of coordinate descent
methods on huge-scale optimization problems, CORE Discussion Paper
\#2010/2]*, which cover the smooth case, to composite
minimization, while at the same time improving the complexity by the
factor of 4 and removing $\epsilon$ from the logarithmic term. More
importantly, in contrast with the aforementioned work in which the
author achieves the results by applying the method to a regularized
version of the objective function with an unknown scaling factor, we
show that this is not necessary, thus achieving true iteration
complexity bounds. In the smooth case we also allow for arbitrary
probability vectors and non-Euclidean norms. Finally, we demonstrate
numerically that the algorithm is able to solve huge-scale
$\ell_1$-regularized least squares and support vector machine
problems with a billion variables.

2011
Conference Paper Proceedings of SPARS11 (4th Workshop on Signal
Processing with Adaptive Sparse Structured Representations), June
27-30, 2011

2011
Journal Paper International Journal of Numerical Analysis and
Modeling, Ser. B, 2(2-3), 2011 231-247

In this paper we analyze
American style of floating strike Asian call options belonging to the
class of financial derivatives whose payoff diagram depends not only
on the underlying asset price but also on the path average of
underlying asset prices over some predetermined time interval.

The mathematical model for the option price leads to a free boundary
problem for a parabolic partial differential equation. Applying fixed
domain transformation and transformation of variables we develop an
efficient numerical algorithm based on a solution to a non-local
parabolic partial differential equation for the transformed variable
representing the synthesized portfolio. For various types of
averaging methods we investigate the dependence of the early exercise
boundary on model parameters.