Sort by year:

The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is violated for cases where the objective function is strongly convex. In Bottou et al. (2016), a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. We show that for stochastic problems arising in machine learning such bound always holds; and we also propose an alternative convergence analysis of SGD with diminishing learning rate regime, which results in more relaxed conditions than those in Bottou et al. (2016). We then move on the asynchronous parallel setting, and prove convergence of Hogwild! algorithm in the same regime in the case of diminished learning rate. It is well-known that SGD converges if a sequence of learning rates {ηt} satisfies ∑∞t=0ηt→∞ and ∑∞t=0η2t<∞. We show the convergence of SGD for strongly convex objective function without using bounded gradient assumption when {ηt} is a diminishing sequence and ∑∞t=0ηt→∞. In other words, we extend the current state-of-the-art class of learning rates satisfying the convergence of SGD.

We develop and analyze a variant of variance reducing stochastic gradient algorithm, known as SARAH, which does not require computation of the exact gradient. Thus this new method can be applied to general expectation minimization problems rather than only finite sum problems. While the original SARAH algorithm, as well as its predecessor, SVRG, require an exact gradient computation on each outer iteration, the inexact variant of SARAH (iSARAH), which we develop here, requires only stochastic gradient computed on a mini-batch of sufficient size. The proposed method combines variance reduction via sample size selection and iterative stochastic gradient updates. We analyze the convergence rate of the algorithms for strongly convex, convex, and nonconvex cases with appropriate mini-batch size selected for each case. We show that with an additional, reasonable, assumption iSARAH achieves the best known complexity among stochastic methods in the case of general convex case stochastic value functions.

2018
arXiv Preprint arXiv

In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which sample size is gradually increasing to quickly obtain a solution whose empirical loss is under satisfactory statistical accuracy. Our proposed method is multistage in which the solution of a stage serves as a warm start for the next stage which contains more samples (including the samples in the previous stage). The proposed multistage algorithm reduces the number of passes over data to achieve the statistical accuracy of the full training set. Moreover, our algorithm in nature is easy to be distributed and shares the strong scaling property indicating that acceleration is always expected by using more computing nodes. Various iteration complexity results regarding descent direction computation, communication efficiency and stopping criteria are analyzed under convex setting. Our numerical results illustrate that the proposed method outperforms other comparable methods for solving learning problems including neural networks.

2018
Journal Paper Journal of Machine Learning Research (JMLR)

The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for the distributed environment, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.

2018
Journal Paper Frontiers in Applied Mathematics and Statistics, section Optimization

In this paper we develop an adaptive dual free Stochastic Dual Coordinate Ascent (adfSDCA) algorithm for regularized empirical risk minimization problems. This is motivated by the recent work on dual free SDCA of Shalev-Shwartz [1]. The novelty of our approach is that the coordinates to update at each iteration are selected non-uniformly from an adaptive probability distribution, and this extends the previously mentioned work which only allowed for a uniform selection of “dual” coordinates from a fixed probability distribution. We describe an efficient iterative procedure for generating the non-uniform samples, where the scheme selects the coordinate with the greatest potential to decrease the sub-optimality of the current iterate. We also propose a heuristic variant of adfSDCA that is more aggressive than the standard approach. Furthermore, in order to utilize multi-core machines we consider a mini-batch adfSDCA algorithm and develop complexity results that guarantee the algorithm's convergence. The work is concluded with several numerical experiments to demonstrate the practical benefits of the proposed approach.

2018
Conference Paper The 13th World Congress on
Intelligent Control and Automation
(WCICA 2018)

This paper proposes innovative anomaly detection technologies for manufacturing systems. We combine the event ordering relationship based structuring technique and the deep neural networks to develop the structured neural networks for anomaly detection. The event ordering relationship based neural network structuring process is performed before neural network training process and determines important neuron connections and weight initialization. It reduces the complexity of the neural networks and can improve anomaly detection accuracy. The structured time delay neural network (TDNN) is introduced for anomaly detection via supervised learning. To detect anomaly through unsupervised learning, we propose the structured autoencoder. The proposed structured neural networks outperform the unstructured neural networks in terms of anomaly detection accuracy and can reduce test error by 20%. Compared with popular methods such as one-class SVM, decision trees, and distance-based algorithms, our structured neural networks can reduce anomaly detection misclassification error by as much as 64%.

2018
arXiv Preprint arXiv

This paper proposes a framework of L-BFGS based on the (approximate) second-order information with stochastic batches, as a novel approach to the finite-sum minimization problems. Different from the classical L-BFGS where stochastic batches lead to instability, we use a smooth estimate for the evaluations of the gradient differences while achieving acceleration by well-scaling the initial Hessians. We provide theoretical analyses for both convex and nonconvex cases. In addition, we demonstrate that within the popular applications of least-square and cross-entropy losses, the algorithm admits a simple implementation in the distributed environment. Numerical experiments support the efficiency of our algorithms.

2018
Conference Paper In Dynamics of Civil Structures

Structures experience large vibrations and stress variations during their life cycles. This causes reduction in their load-carrying capacity which is the main design criteria for many structures. Therefore, it is important to accurately establish the performance of structures after construction that often needs full-field strain or stress measurements. Many traditional inspection methods collect strain measurements by using wired strain gauges. These strain gauges carry a high installation cost and have high power demand. In contrast, this paper introduces a new methodology to replace this high cost with utilizing inexpensive data coming from wireless sensor networks. The study proposes to collect acceleration responses coming from a structure and give them as an input to deep learning framework to estimate the stress or strain responses. The obtained stress or strain time series then can be used in many applications to better understand the conditions of the structures. In this paper, designed deep learning architecture consists of multi-layer neural networks and Long Short-Term Memory (LSTM). The network achieves to learn the relationship between input and output by exploiting the temporal dependencies of them. In the evaluation of the method, a three-story steel building is simulated by using various dynamic wind and earthquake loading scenarios. The acceleration time histories under these loading cases are utilized to predict the stress time series. The learned architecture is tested on acceleration time series that the structure has never experienced.

Clustering and classification critically rely on distance metrics that provide meaningful comparisons between data points. We present mixed-integer optimization approaches to find optimal distance metrics that generalize the Mahalanobis metric extensively studied in the literature. Additionally, we generalize and improve upon leading methods by removing reliance on pre-designated "target neighbors," "triplets," and "similarity pairs." Another salient feature of our method is its ability to enable active learning by recommending precise regions to sample after an optimal metric is computed to improve classification performance. This targeted acquisition can significantly reduce computational burden by ensuring training data completeness, representativeness, and economy. We demonstrate classification and computational performance of the algorithms through several simple and intuitive examples, followed by results on real image and medical datasets.

2018
Journal Paper Journal of Machine Learning Research (JMLR) (to appear)

We present an improved analysis of mini-batched stochastic dual coordinate ascent for regularized empirical loss minimization (i.e. SVM and SVM-type objectives). Our analysis allows for flexible sampling schemes, including where data is distribute across machines, and combines a dependence on the smoothness of the loss and/or the data spread (measured through the spectral norm).

Low-rank methods for semidefinite programming (SDP) have gained considerable popularity, especially in machine learning applications. Their analyses often assume the use of determinant-based regularisers, which are rarely implemented, due to the run-time cubic in the dimension in conventional implementations of the computation of their gradient. We extend the convergence analyses of low-rank methods to a wide class of regularisers. Further, we show that the gradient of a well-known regulariser can be computed in time linear in the dimension, which makes the regularisation practical. Our results are illustrated on the Max-Cut SDP relaxation.

We present an end-to-end framework for solving Vehicle Routing Problem (VRP) using deep reinforcement learning. In this approach, we train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. Our model represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. Our method is faster in both training and inference than a recent method that solves the Traveling Salesman Problem (TSP), with nearly identical solution quality. On the more general VRP, our approach outperforms classical heuristics on medium-sized instances in both solution quality and computation time (after training). Our proposed framework can be applied to variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems.

2018
Conference Paper ICML 2018 (35th International Conference on Machine Learning)

Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical analysis of convergence of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al.,2016) a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such bound always holds. Moreover, we propose an alternative convergence analysis of SGD with diminishing learning rate regime, which is results in more relaxed conditions that those in (Bottou et al.,2016). We then move on the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method in the case of diminished learning rate.

2017
Conference Paper Deep Reinforcement Learning Symposium @ Neural Information Processing Systems (NIPS) 2017

The beer game is a decentralized, multi-agent, cooperative problem that can be modeled as a serial supply chain network in which agents cooperatively attempt to minimize the total cost of the network even though each agent can only observe its own local information. We develop a variant of the Deep Q-Network algorithm to solve this problem. Extensive numerical experiment show the effectiveness of our algorithm. Unlike most algorithms in literature, our algorithm does not have any limits on the parameter values, and it provides good solutions even if the agents do not follow a rational policy. The algorithm can be extended to other decentralized multi-agent cooperative games with partially observed information, which is a common type of situation in supply chain problems.

2017
arXiv Preprint arXiv

Distributed optimization algorithms are essential for training machine learning models on very large-scale datasets. However, they often suffer from communication bottlenecks. Confronting this issue, a communication-efficient primal-dual coordinate ascent framework (CoCoA) and its improved variant CoCoA+ have been proposed, achieving a convergence rate of (1/t) for solving empirical risk minimization problems with Lipschitz continuous losses. In this paper, an accelerated variant of CoCoA+ is proposed and shown to possess a convergence rate of (1/t2) in terms of reducing suboptimality. The analysis of this rate is also notable in that the convergence rate bounds involve constants that, except in extreme cases, are significantly reduced compared to those previously provided for CoCoA+. The results of numerical experiments are provided to show that acceleration can lead to significant performance gains.

2017
Conference Paper In Model Validation and Uncertainty Quantification, Volume 3 (pp. 331-337). Springer, Cham.

Detection of the deficiencies affecting the performance of the structures has been studied over the past few decades. How- ever, with the long-term data collection from dense sensor arrays, accurate damage diagnosis has become computationally challenging task. To address such problem, this paper introduces convolutional neural network (CNN), which has led to break- through results in computer vision, to the damage detection challenge. CNN technique has the ability to discover abstract features which are able to discriminate various aspect of interest. In our case, these features are used to classify “damaged” and “healthy” cases modeled through the finite element simulations. CNN is performed by using a Python library called Theano with the graphics processing unit (GPU) to achieve higher performance of these data-intensive calculations. The accuracy and sensitivity of the proposed technique are assessed with a cracked steel gusset connection model with multiplicative noise. Dur- ing the training procedure, strain distributions generated from different crack and loading scenarios are adopted. Completely unseen damage setups are introduced to the simulations while testing. Based on the findings of the proposed study, high accu- racy, robustness and computational efficiency are succeeded for the damage diagnosis.

2017
Conference Paper SMAR 2017 (the fourth International Conference on Smart Monitoring, Assessment and Rehabilitation of Civil Structures)

Damage diagnosis of structures subjected to time-varying environmental and operational conditions has been a challenging task. This task involves a damage indicator selection to characterize the unknown relation between the measurements and damage patterns. The majority of the conventional methods adopt hand designed damage indicators that can be inefficient for some damage patterns and require manual effort to design. To address these challenges, this work uses a special kind of deep learning method called convolutional neural network (CNN) to learn complex damage features and create complex classifier boundaries. In the evaluation of the proposed methodology, multi-dimensional input samples are used — each dimension has an individual strain field resulting from a different force applied to the structure. The network is trained with several crack scenarios and the learned architecture is tested on completely unseen damage setups. Based on the findings of the paper, CNNs fed through multidimensional inputs improve the accuracy of the damage diagnosis. Furthermore, they give the opportunity to capture the behavior of the structures under variations in the loading conditions.

2017
Springer Proceedings in Mathematics & Statistics Springer Proceedings in Mathematics & Statistics

This volume contains a selection of contributions that were presented at the Modeling and Optimization: Theory and Applications Conference (MOPTA) held at Lehigh University in Bethlehem, Pennsylvania, USA on August 17-19, 2016. The conference brought together a diverse group of researchers and practitioners, working on both theoretical and practical aspects of continuous or discrete optimization. Topics presented included algorithms for solving convex, network, mixed-integer, nonlinear, and global optimization problems, and addressed the application of deterministic and stochastic optimization techniques in energy, finance, logistics, analytics, health, and other important fields. The contributions contained in this volume represent a sample of these topics and applications and illustrate the broad diversity of ideas discussed at the meeting.

2017
Conference Paper AAAI Workshop on Distributed Machine Learning

2017
Conference Paper AAAI Workshop on Distributed Machine Learning

In multi-echelon inventory systems the performance of a given node is affected by events that occur at many other nodes and at many other time periods. For example, a supply disruption upstream will have an effect on downstream, customer-facing nodes several periods later as the disruption "cascades" through the system. There is very little research on stock-out prediction in single-echelon systems and (to the best of our knowledge) none on multi-echelon systems. However, in real the world, it is clear that there is significant interest in techniques for this sort of stock-out prediction. Therefore, our research aims to fill this gap by using DNN to predict stock-outs in multi-echelon supply chains.

2017
arXiv Preprint arXiv

Consider a polynomial optimisation problem, whose instances vary continuously over time. We propose to use a coordinate-descent algorithm for solving such time-varying optimisation problems. In particular, we focus on relaxations of transmission-constrained problems in power systems. On the example of the alternating-current optimal power flows (ACOPF), we bound the difference between the current approximate optimal cost generated by our algorithm and the optimal cost for a relaxation using the most recent data from above by a function of the properties of the instance and the rate of change to the instance over time. We also bound the number of floating-point operations that need to be performed between two updates in order to guarantee the error is bounded from above by a given constant.

In this work we introduce the concept of an Underestimate Sequence (UES), which is a natural extension of Nesterov's estimate sequence. Our definition of a UES utilizes three sequences, one of which is a lower bound (or under-estimator) of the objective function. The question of how to construct an appropriate sequence of lower bounds is also addressed, and we present lower bounds for strongly convex smooth functions and for strongly convex composite functions, which adhere to the UES framework. Further, we propose several first order methods for minimizing strongly convex functions in both the smooth and composite cases. The algorithms, based on efficiently updating lower bounds on the objective functions, have natural stopping conditions, which provides the user with a certificate of optimality. Convergence of all algorithms is guaranteed through the UES framework, and we show that all presented algorithms converge linearly, with the accelerated variants enjoying the optimal linear rate of convergence.

2017
Journal Paper Optimization Methods and Software

In this work we study the parallel coordinate descent method (PCDM) proposed by Richtarik and Takac [26] for minimizing a regularized convex function. We adopt elements from the work of Xiao and Lu [39], and combine them with several new insights, to obtain sharper iteration complexity results for PCDM than those presented in [26]. Moreover, we show that PCDM is monotonic in expectation, which was not confirmed in [26], and we also derive the first high probability iteration complexity result where the initial levelset is unbounded.

2017
Journal Paper IEEE Transactions on Smart Grid

Many steady-state problems in power systems, including rectangular power-voltage formulations of optimal power flows in the alternating-current model (ACOPF), can be cast as polynomial optimisation problems (POP). For a POP, one can derive strong convex relaxations, or rather hierarchies of ever stronger, but ever larger relaxations. We study means of switching from solving the convex relaxation to Newton method working on a non-convex Lagrangian of the POP.

This paper describes an implementation of the L-BFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multi-batch approach in which the data points used to compute function and gradients are purposely changed at each iteration to accelerate the learning process. Difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the updating process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and nonconvex functions, and illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.

2017
Journal Paper Optimization Methods and Software

With the growth of data and necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.

2017
Journal Paper Optimization Methods and Software

A novel rank-constrained re-formulation of alternating-current optimal power flow problem makes it possible to derive novel semidefinite programming (SDP) relaxations. For those, we develop a solver, which is often as fast as Matpower's interior point method, within the same accuracy.

2017
arXiv Preprint arXiv

We develop a family of reformulations of an arbitrary consistent linear system into a stochastic problem. The reformulations are governed by two user-defined parameters: a positive definite matrix defining a norm, and an arbitrary discrete or continuous distribution over random matrices. Our reformulation has several equivalent interpretations, allowing for researchers from various communities to leverage their domain specific insights. In particular, our reformulation can be equivalently seen as a stochastic optimization problem, stochastic linear system, stochastic fixed point problem and a probabilistic intersection problem. We prove sufficient, and necessary and sufficient conditions for the reformulation to be exact. Further, we propose and analyze three stochastic algorithms for solving the reformulated problem---basic, parallel and accelerated methods---with global linear convergence rates. The rates can be interpreted as condition numbers of a matrix which depends on the system matrix and on the reformulation parameters. This gives rise to a new phenomenon which we call stochastic preconditioning, and which refers to the problem of finding parameters (matrix and distribution) leading to a sufficiently small condition number. Our basic method can be equivalently interpreted as stochastic gradient descent, stochastic Newton method, stochastic proximal point method, stochastic fixed point method, and stochastic projection method, with fixed stepsize (relaxation parameter), applied to the reformulations.

In this paper, we study and analyze the mini-batch version of StochAstic Recursive grAdient algoritHm (SARAH), a method employing the stochastic recursive gradient, for solving empirical loss minimization for the case of nonconvex losses. We provide a sublinear convergence rate (to stationary points) for general nonconvex functions and a linear convergence rate for gradient dominated functions, both of which have some advantages compared to other modern stochastic gradient algorithms for nonconvex losses.

2017
Conference Paper ICML 2017 (34th International Conference on Machine Learning)

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to the finite-sum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; when comparing to SAG/SAGA, SARAH does not require a storage of past gradients. The linear convergence rate of SARAH is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, the property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm.

2017
Conference Paper Proceedings of MOPTA 2016

We propose a projected semi-stochastic gradient descent method with mini-batch for improving both the theoretical complexity and practical performance of the general stochastic gradient descent method (SGD). We are able to prove linear convergence under weak strong convexity assumption. This requires no strong convexity assumption for minimizing the sum of smooth convex functions subject to a compact polyhedral set, which remains popular across machine learning community. Our PS2GD preserves the low-cost per iteration and high optimization accuracy via stochastic gradient variance-reduced technique, and admits a simple parallel implementation with mini-batches. Moreover, PS2GD is also applicable to dual problem of SVM with hinge loss.

2016
Journal Paper Journal of Machine Learning Research

In this paper we generalize the framework of the feasible descent method (FDM) to a randomized (R-FDM) and a coordinate-wise random feasible descent method (RC-FDM) framework. We show that the famous SDCA algorithm for optimizing the SVM dual problem, or the stochastic coordinate descent method for the LASSO problem, fits into the framework of RC-FDM. We prove linear convergence for both R-FDM and RC-FDM under the weak strong convexity assumption. Moreover, we show that the duality gap converges linearly for RC-FDM, which implies that the duality gap also converges linearly for SDCA applied to the SVM dual problem.

The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This inherently gives the algorithm a stochastic flavor that can cause instability in L-BFGS, a popular batch method in machine learning. These difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations; when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

2016
Journal Paper Optimization Letters, 10(6), 1233-1243

We propose and analyze a new parallel coordinate descent method—NSync—in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen using an arbitrary probability law. This is the first method of this type. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. The complexity and practical performance of the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iteration—with optimal probabilities—may require less iterations, both in theory and practice, than the strategy of updating all coordinates at every iteration.

2016
Journal Paper European Journal of Operational Research (to appear)

Matrix completion under interval uncertainty can be cast as a matrix completion problem with element-wise box constraints. We present an efficient alternating-direction parallel coordinate-descent method for the problem. We show that the method outperforms any other known method on a benchmark in image in-painting in terms of signal-to-noise ratio, and that it provides high-quality solutions for an instance of collaborative filtering with 100,198,805 recommendations within 5 minutes on a single personal computer.

The newsvendor problem is one of the most basic and widely applied inventory models. There are numerous extensions of this problem. One important extension is the multi-item newsvendor problem, in which the demand of each item may be correlated with that of other items. If the joint probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may be not optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately as is typically done. The algorithm does not require the knowledge of the probability distributions of the demand. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and SVM approaches, especially for demands with high volatility.

2016
arXiv Preprint arXiv

Training deep neural network is a high dimensional and a highly non-convex optimization problem. Stochastic gradient descent (SGD) algorithm and it's variations are the current state-of-the-art solvers for this task. However, due to non-covexity nature of the problem, it was observed that SGD slows down near saddle point. Recent empirical work claim that by detecting and escaping saddle point efficiently, it's more likely to improve training performance. With this objective, we revisit Hessian-free optimization method for deep networks. We also develop its distributed variant and demonstrate superior scaling potential to SGD, which allows more efficiently utilizing larger computing resources thus enabling large models and faster time to obtain desired solution. Furthermore, unlike truncated Newton method (Marten's HF) that ignores negative curvature information by using na"ive conjugate gradient method and Gauss-Newton Hessian approximation information - we propose a novel algorithm to explore negative curvature direction by solving the sub-problem with stabilized bi-conjugate method involving possible indefinite stochastic Hessian information. We show that these techniques accelerate the training process for both the standard MNIST dataset and also the TIMIT speech recognition problem, demonstrating robust performance with upto an order of magnitude larger batch sizes. This increased scaling potential is illustrated with near linear speed-up on upto 16 CPU nodes for a simple 4-layer network.

2016
Conference Paper ICML 2016 (33rd International Conference on Machine Learning)

We propose a new algorithm for minimizing regularized empirical loss: Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each iteration we update a random subset of the dual variables. However, unlike existing methods such as stochastic dual coordinate ascent, SDNA is capable of utilizing all curvature information contained in the examples, which leads to striking improvements in both theory and practice - sometimes by orders of magnitude. In the special case when an L2-regularizer is used in the primal, the dual problem is a concave quadratic maximization problem plus a separable term. In this regime, SDNA in each step solves a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function; whence the name of the method. If, in addition, the loss functions are quadratic, our method can be interpreted as a novel variant of the recently introduced Iterative Hessian Sketch.

2016
arXiv Preprint arXiv

In this paper we study inexact dumped Newton method implemented in a distributed environment. We start with an original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015]. We will show that this algorithm may not scale well and propose an algorithmic modifications which will lead to less communications, better load-balancing and more efficient computation. We perform numerical experiments with an regularized empirical loss minimization instance described by a 273GB dataset.

2016
Conference Paper ICML 2016 (33rd International Conference on Machine Learning)

We propose an algorithm-independent framework to equip existing optimization methods with primal-dual certificates. Such certificates and corresponding rate of convergence guarantees are important for practitioners to diagnose progress, in particular in machine learning applications. We obtain new primal-dual convergence rates e.g. for the Lasso as well as many L1, Elastic-Net and group-lasso-regularized problems. The theory applies to any norm-regularized generalized linear model. Our approach provides efficiently computable duality gaps which are globally defined, without modifying the original problems in the region of interest.

2016
Journal Paper Journal of Machine Learning Research

In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix.

2016
Journal Paper IEEE Journal of Selected Topics in Signal Processing (to appear)

We propose mS2GD: a method incorporating a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent (S2GD). We consider the problem of minimizing a strongly convex function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex regularizer. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample $ functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that it benefits from two speedup effects. First, we prove that as long as $ is below a certain threshold, we can reach any predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

Optimisation problems in power systems employing alternating-current models of power flows have driven much recently interest in (convergent hierarchies of) convex relaxations for polynomial optimisation problems. Readily available second-order methods for solving the convex relaxations on real-world large-scale power systems often fail to perform even a single iteration within reasonable run-times. First-order methods have much lower per-iteration computational and memory requirements, but require many more iterations than second-order methods to converge within the same accuracy. We hence study means of switching from first-order methods for solving the convex relaxation to Newton method working on the original non-convex problem, which would allow for convergence under the same conditions as in solvers for the convex relaxation, but with an improved rate of convergence. We illustrate our approach on the alternating current power flows (ACPF) and alternating current optimal power flows (ACOPF).

2015
Conference Paper OptML@NIPS 2015

In this paper we study the effect of the way that the data is partitioned in distributed optimization. The original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015] partitions the input data based on samples. We describe how the original algorithm has to be modified to allow partitioning on features and show its efficiency both in theory and also in practice.

2015
Conference Paper OptML@NIPS 2015

In this paper we develop dual free SDCA with adaptive probabilities for regularized empirical risk minimization. This extends recent work of Shai Shalev-Shwartz [SDCA without Duality, arXiv:1502.06177] to allow non-uniform selection of "dual" coordinate in SDCA. Moreover, the probability can change over time, making it more efficient than uniform selection. Our work focuses on generating adaptive probabilities through iterative process, preferring to choose coordinate with highest potential to decrease sub-optimality. We also propose a practical variant Algorithm adfSDCA+ which is more aggressive. The work is concluded with multiple experiments which shows efficiency of proposed algorithms.

2015
Conference Paper ICML 2015 (32nd International Conference on Machine Learning)

Distributed optimization algorithms for large-scale machine learning suffer from a communication bottleneck. Reducing communication makes the efficient aggregation of partial work from different machines more challenging. In this paper we present a novel generalization of the recent communication efficient primal-dual coordinate ascent framework (CoCoA). Our framework, CoCoA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes only allowed conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both CoCoA as well as our new variants, and generalize the theory for both methods to also cover non-smooth convex loss functions. We provide an extensive experimental comparison on several real-world distributed datasets, showing markedly improved performance, especially when scaling up the number of machines.

2015
Journal Paper Mathematical Programming

In this work we show that
randomized (block) coordinate descent methods can be accelerated by
parallelization when applied to the problem of minimizing the sum of
a *partially separable* smooth convex function and a simple
separable convex function. The theoretical speedup, as compared to
the serial method, and referring to the number of iterations needed
to approximately solve the problem with high probability, is a simple
expression depending on the number of parallel processors and a
natural and easily computable measure of separability of the smooth
component of the objective function. In the worst case, when no
degree of separability is present, there may be no speedup; in the
best case, when the problem is separable, the speedup is equal to the
number of processors.

Our analysis also works in the mode when the number of blocks being
updated at each iteration is random, which allows for modeling
situations with busy or unreliable processors. We show that our
algorithm is able to solve a LASSO problem involving a matrix with 20
billion nonzeros in 2 hours on a large memory node with 24 cores.

2014
Conference Paper OPT 2014: Optimization for Machine Learning @NIPS 2014

We propose a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent applied to the problem of minimizing a strongly convex composite function represented as the sum of an average of a large number of smooth convex functions, and simple nonsmooth convex function. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample b functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that the method benefits from two speedup effects. First, we prove that as long as b is below a certain threshold, we can reach predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same .001-accurate solution quality on average 25x as quickly.

2014
PhD thesis University of Edinburgh

2014
Journal Paper Numerical Analysis and Optimization 2014, Springer Proceedings in Mathematics and Statistics

In this work we propose a distributed randomized block coordinate descent method for minimizing a convex function with a huge number of variables/coordinates. We analyze its complexity under the assumption that the smooth part of the objective function is partially block separable, and show that the degree of separability directly influences the complexity. This extends the results in [22] to a distributed environment. We first show that partially block separable functions admit an expected separable overapproximation (ESO) with respect to a distributed sampling, compute the ESO parameters, and then specialize complexity results from recent literature that hold under the generic ESO assumption. We describe several approaches to distribution and synchronization of the computation across a cluster of multi-core computer and provide promising computational results.

2014
Conference Paper MLSP2014: IEEE International Workshop on Machine Learning for Signal Processing

We propose an efficient distributed randomized coordinate descent
method for minimizing regularized non-strongly convex loss functions.
The method attains the optimal *O(1/k^2)* convergence rate, where
*k* is the iteration counter. The core of the work is the
theoretical study of stepsize parameters. We have implemented the method
on Archer - the largest supercomputer in the UK - and show that the
method is capable of solving a (synthetic) LASSO optimization problem
with 50 billion variables.

We propose a novel topic discovery algorithm for unlabeled images based on the bag-of-words (BoW) framework. We first extract a dictionary of visual words and subsequently for each image compute a visual word occurrence histogram. We view these histograms as rows of a large matrix from which we extract sparse principal components (PCs). Each PC identifies a sparse combination of visual words which co-occur frequently in some images but seldom appear in others. Each sparse PC corresponds to a topic, and images whose interference with the PC is high belong to that topic, revealing the common parts possessed by the images. We propose to solve the associated sparse PCA problems using an Alternating Maximization (AM) method, which we modify for purpose of efficiently extracting multiple PCs in a deflation scheme. Our approach attacks the maximization problem in sparse PCA directly and is scalable to high-dimensional data. Experiments on automatic topic discovery and category prediction demonstrate encouraging performance of our approach.

2013
Conference Paper ICML 2013 (30th International Conference on
Machine Learning)

We address the issue of using
mini-batches in stochastic optimization of SVMs. We show that the
same quantity, the *spectral norm of the data*, controls the
parallelization speedup obtained for both primal stochastic
subgradient descent (SGD) and stochastic dual coordinate ascent
(SCDA) methods and use it to derive novel variants of mini-batched
SDCA. Our guarantees for both methods are expressed in terms of the
original nonsmooth primal problem based on the hinge-loss.

2012
arXiv Preprint

Given a multivariate data set,
sparse principal component analysis (SPCA) aims to extract several
linear combinations of the variables that together explain the
variance in the data as much as possible, while controlling the
number of nonzero loadings in these combinations.

In this paper we consider 8 different optimization formulations for
computing a single sparse loading vector; these are obtained by
combining the following factors: we employ *two* norms for
measuring variance (L2, L1) and *two* sparsity-inducing norms
(L0, L1), which are used in *two* different ways (constraint,
penalty). Three of our formulations, notably the one with L0
constraint and L1 variance, have not been considered in the
literature.

We give a unifying reformulation which we propose to solve via a
natural alternating maximization (AM) method. We show the the AM
method is nontrivially equivalent to GPower (Journee et al; JMLR **11**:517--553,
2010) for all our formulations. Besides this, we provide 24 efficient
parallel SPCA implementations: 3 codes (multi-core, GPU and cluster)
for each of the 8 problems.

Parallelism in the methods is aimed at

- speeding up computations (our GPU code can be 100 times faster than an efficient serial code written in C++),
- obtaining solutions explaining more variance and
- dealing with big data problems (our cluster code is able to solve a 357 GB problem in about a minute).

2011
Conference Paper Operations Research Proceedings 2011, pp. 27-32,
Springer-Verlag 2012

In this work we propose solving huge-scale instances of the truss topology design problem with coordinate descent methods. We develop four efficient codes: serial and parallel implementations of randomized and greedy rules for the selection of the variable (potential bar) to be updated in the next iteration. Both serial methods enjoy an O(n/k) iteration complexity guarantee, where n is the number of potential bars and k the iteration counter. Our parallel implementations, written in CUDA and running on a graphical processing unit (GPU), are capable of speedups of up to two orders of magnitude when compared to their serial counterparts. Numerical experiments were performed on instances with up to 30 million potential bars.

2011
Journal Paper Mathematical Programming, Series A, 38 pages,
2012

In this paper we develop a
randomized block-coordinate descent method for minimizing the sum of
a smooth and a simple nonsmooth block-separable convex function and
prove that it obtains an $\epsilon$-accurate solution with
probability at least $1-
ho$ in at most $O( frac{n}{\epsilon} \log
frac{1}{
ho})$ iterations, where $n$ is the number of blocks. For
strongly convex functions the method converges linearly. This extends
recent results of Nesterov *[Efficiency of coordinate descent
methods on huge-scale optimization problems, CORE Discussion Paper
\#2010/2]*, which cover the smooth case, to composite
minimization, while at the same time improving the complexity by the
factor of 4 and removing $\epsilon$ from the logarithmic term. More
importantly, in contrast with the aforementioned work in which the
author achieves the results by applying the method to a regularized
version of the objective function with an unknown scaling factor, we
show that this is not necessary, thus achieving true iteration
complexity bounds. In the smooth case we also allow for arbitrary
probability vectors and non-Euclidean norms. Finally, we demonstrate
numerically that the algorithm is able to solve huge-scale
$\ell_1$-regularized least squares and support vector machine
problems with a billion variables.

2011
Conference Paper Proceedings of SPARS11 (4th Workshop on Signal
Processing with Adaptive Sparse Structured Representations), June
27-30, 2011

2011
Journal Paper International Journal of Numerical Analysis and
Modeling, Ser. B, 2(2-3), 2011 231-247

In this paper we analyze
American style of floating strike Asian call options belonging to the
class of financial derivatives whose payoff diagram depends not only
on the underlying asset price but also on the path average of
underlying asset prices over some predetermined time interval.

The mathematical model for the option price leads to a free boundary
problem for a parabolic partial differential equation. Applying fixed
domain transformation and transformation of variables we develop an
efficient numerical algorithm based on a solution to a non-local
parabolic partial differential equation for the transformed variable
representing the synthesized portfolio. For various types of
averaging methods we investigate the dependence of the early exercise
boundary on model parameters.