Sort by year:

2023
Conference Paper
IJCAI 2023

This paper introduces a Reinforcement Learning approach to better generalize heuristic dispatching rules on the Job-shop Scheduling Problem (JSP). Current models on the JSP do not focus on generalization, although, as we show in this work, this is key to learning better heuristics on the problem. A well-known technique to improve generalization is to learn on increasingly complex instances using Curriculum Learning (CL). However, as many works in the literature indicate, this technique might suffer from catastrophic forgetting when transferring the learned skills between different problem sizes. To address this issue, we introduce a novel Adversarial Curriculum Learning (ACL) strategy, which dynamically adjusts the difficulty level during the learning process to revisit the worst-performing instances. This work also presents a deep learning model to solve the JSP, which is equivariant w.r.t. the job definition and size-agnostic. Conducted experiments on Taillard's and Demirkol's instances show that the presented approach significantly improves the current state-of-the-art models on the JSP. It reduces the average optimality gap from 19.35% to 10.46% on Taillard's instances and from 38.43% to 18.85% on Demirkol's instances. Our implementation is available online.

2023
Conference Paper
ICLR 2023 Workshop: Tackling Climate Change with Machine Learning

Integrating variable renewable energy into the grid has posed challenges to system operators in achieving optimal trade-offs among energy availability, cost affordability, and pollution controllability. This paper proposes a multi-agent reinforcement learning framework for managing energy transactions in microgrids. The framework addresses the challenges above: it seeks to optimize the usage of available resources by minimizing the carbon footprint while benefiting all stakeholders. The proposed architecture consists of three layers of agents, each pursuing different objectives. The first layer, comprised of prosumers and consumers, minimizes the total energy cost. The other two layers control the energy price to decrease the carbon impact while balancing the consumption and production of both renewable and conventional energy. This framework also takes into account fluctuations in energy demand and supply.

2023
Journal Paper
Transactions on Machine Learning Research (TMLR)

We present an adaptive stochastic variance reduced method with an implicit approach for adaptivity. As a variant of SARAH, our method employs the stochastic recursive gradient yet adjusts step-size based on local geometry. We provide convergence guarantees for finite-sum minimization problems and show a faster convergence than SARAH can be achieved if local geometry permits. Furthermore, we propose a practical, fully adaptive variant, which does not require any knowledge of local geometry and any effort of tuning the hyper-parameters. This algorithm implicitly computes step-size and efficiently estimates local Lipschitz smoothness of stochastic functions. The numerical experiments demonstrate the algorithm's strong performance compared to its classical counterparts and other state-of-the-art first-order methods.

2023
Conference Paper
26th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2023)

This work studies non-cooperative Multi-Agent Reinforce- ment Learning (MARL) where multiple agents interact in the same environment and whose goal is to maximize the individual returns. Chal- lenges arise when scaling up the number of agents due to the resultant non-stationary that the many agents introduce. In order to address this issue, Mean Field Games (MFG) rely on the symmetry and homogeneity assumptions to approximate games with very large populations. Recently, deep Reinforcement Learning has been used to scale MFG to games with larger number of states. Current methods rely on smoothing tech- niques such as averaging the q-values or the updates on the mean-field distribution. This work presents a different approach to stabilize the learning based on proximal updates on the mean-field policy. We name our algorithm Mean Field Proximal Policy Optimization (MF-PPO), and we empirically show the effectiveness of our method in the OpenSpiel framework.

In this paper, we propose Cubic Regularized Quasi-Newton Methods for (strongly) star-convex and Accelerated Cubic Regularized Quasi-Newton for convex optimization. The proposed class of algorithms combines the global convergence of the Cubic Newton with the affordability of constructing the Hessian approximation via a Quasi-Newton update. To construct these methods we introduce two novel concepts of Hessian inexactness and propose the Adaptive Inexact Cubic Regularized Newton algorithm and its accelerated version, which adjust to the approximation error. We show that different Quasi-Newton updates satisfy these approximation conditions and provide a complexity-efficient way to solve the cubic subproblem. We provide global convergence rates with explicit dependence on Hessian approximation error. By combining cubic regularization, acceleration technique, and adaptivity, our algorithms show good practical performance, especially from the points where classical Newton diverges.

Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equation by using local linearizations of the model. We take a step further and develop a method for solving the interpolation equations that uses the local second-order approximation of the model. Our resulting method SP2 uses Hessian-vector products to speed-up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is competitive both in experiments and in theory. We show SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory on sums-of-quadratics.

2023
arXiv Preprint

The performance of the Deep Learning (DL) models depends on the quality of labels. In some areas, the involvement of human annotators may lead to noise in the data. When these corrupted labels are blindly regarded as the ground truth (GT), DL models suffer from performance deficiency. This paper presents a method that aims to learn a confident model in the presence of noisy labels. This is done in conjunction with estimating the uncertainty of multiple annotators. We robustly estimate the predictions given only the noisy labels by adding entropy or information-based regularizer to the classifier network. We conduct our experiments on a noisy version of MNIST, CIFAR-10, and FMNIST datasets. Our empirical results demonstrate the robustness of our method as it outperforms or performs comparably to other state-of-the-art (SOTA) methods. In addition, we evaluated the proposed method on the curated dataset, where the noise type and level of various annotators depend on the input image style. We show that our approach performs well and is adept at learning annotators' confusion. Moreover, we demonstrate how our model is more confident in predicting GT than other baselines. Finally, we assess our approach for segmentation problem and showcase its effectiveness with experiments.

2022
Journal Paper
Journal of Physics: Energy

In this work, we develop a deep neural network model for the reaction rate of oxidative coupling of methane from published high-throughput experimental catalysis data. A neural network is formulated so that the rate model satisfies the plug flow reactor design equation. The model is then employed to understand the variation of reactant and product composition within the reactor for the reference catalyst Mn–Na2WO4/SiO2 at different temperatures and to identify new catalysts and combinations of known catalysts that would increase yield and selectivity relative to the reference catalyst. The model revealed that methane is converted in the first half of the catalyst bed, while the second part largely consolidates the products (i.e. increases ethylene to ethane ratio). A screening study of >=3400 combinations of pairs of previously studied catalysts of the form M1(M2)$_{1-2}$M3Ox/support (where M1, M2 and M3 are metals) revealed that a reactor configuration comprising two sequential catalyst beds leads to synergistic effects resulting in increased yield of C2 compared to the reference catalyst at identical conditions and contact time. Finally, an expanded screening study of 7400 combinations (comprising previously studied metals but with several new permutations) revealed multiple catalyst choices with enhanced yields of C2 products. This study demonstrates the value of learning a deep neural network model for the instantaneous reaction rate directly from high-throughput data and represents a first step in constraining a data-driven reaction model to satisfy domain information.

Federated learning has become a popular machine learning paradigm with many potential real-life applications, including recommendation systems, the Internet of Things (IoT), healthcare, and self-driving cars. Though most current applications focus on classification-based tasks, learning personalized generative models remains largely unexplored, and their benefits in the heterogeneous setting still need to be better understood. This work proposes a novel architecture combining global client-agnostic and local client-specific generative models. We show that using standard techniques for training federated models, our proposed model achieves privacy and personalization that is achieved by implicitly disentangling the globally-consistent representation (i.e. content) from the client-dependent variations (i.e. style). Using such decomposition, personalized models can generate locally unseen labels while preserving the given style of the client and can predict the labels for all clients with high accuracy by training a simple linear classifier on the global content features. Furthermore, disentanglement enables other essential applications, such as data anonymization, by sharing only content. Extensive experimental evaluation corroborates our findings, and we also provide partial theoretical justifications for the proposed approach.

2022
Conference Paper
Distributed Autonomous Robotic Systems,
15th International Symposium

We propose a classification-aware path planning architecture for a team of robots in order to traverse along the most informative paths with the objective of completing map classification tasks using localized (partial) observations from the environment. In this method, the neural network layers with parallel structure utilize each agent’s memorized history and solve the path planning problem to achieve classification. The objective is to avoid visiting less informative regions and significantly reduce the total energy cost (e.g., battery life) when solving the classification problem. Moreover, the parallel design of the path planning structure reduces the training complexity drastically. The efficacy of our approach has been validated by a map classification problem in the simulation environment of satellite campus maps using quadcopters with onboard cameras.

Transmission-constrained problems in power systems can be cast as polynomial optimization problems whose coefficients vary over time. We consider the complications therein and suggest several approaches. On the example of the alternating-current optimal power flows (ACOPFs), we illustrate one of the approaches in detail. For the time-varying ACOPF, we provide an upper bound for the difference between the optimal cost for a relaxation using the most recent data and the current approximate optimal cost generated by our algorithm. This bound is a function of the properties of the instance and the rate of change of the coefficients over time. Moreover, we also bound the number of floating-point operations to perform between two subsequent updates to ensure a bounded error.

2022
arXiv Preprint

This work shows that applying Gradient Descent (GD) with a fixed step size to minimize a (possibly nonconvex) quadratic function is equivalent to running the Power Method (PM) on the gradients. The connection between GD with a fixed step size and the PM, both with and without fixed momentum, is thus established. Consequently, valuable eigen-information is available via GD. Recent examples show that GD with a fixed step size, applied to locally quadratic nonconvex functions, can take exponential time to escape saddle points (Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabas Poczos: "Gradient descent can take exponential time to escape saddle points"; S. Paternain, A. Mokhtari, and A. Ribeiro: "A newton-based method for nonconvex optimization with fast evasion of saddle points"). Here, those examples are revisited and it is shown that eigenvalue information was missing, so that the examples may not provide a complete picture of the potential practical behaviour of GD. Thus, ongoing investigation of the behaviour of GD on nonconvex functions, possibly with an \emph{adaptive} or \emph{variable} step size, is warranted. It is shown that, in the special case of a quadratic in R2, if an eigenvalue is known, then GD with a fixed step size will converge in two iterations, and a complete eigen-decomposition is available. By considering the dynamics of the gradients and iterates, new step size strategies are proposed to improve the practical performance of GD. Several numerical examples are presented, which demonstrate the advantages of exploiting the GD--PM connection.

2022
Conference Paper
Order up! The Benefits of Higher-Order Optimization in Machine Learning Workshop @ NeurIPS 2022

The family of Stochastic Gradient Methods with Polyak Step-size offers an update rule that alleviates the need of fine-tuning the learning rate of an optimizer. Recent work (Robert M Gower, Mathieu Blondel, Nidham Gazagnadou, and Fabian Pedregosa: Cutting some slack for SGD with adaptive polyak stepsizes) has been proposed to introduce a slack variable, which makes these methods applicable outside of the interpolation regime. In this paper, we combine preconditioning and slack in an updated optimization algorithm to show its performance on badly scaled and/or ill-conditioned datasets. We use Hutchinson's method to obtain an estimate of a Hessian which is used as the preconditioner.

2022
Conference Paper
Order up! The Benefits of Higher-Order Optimization in Machine Learning Workshop @ NeurIPS 2022

In this paper, we propose a Cubic Regularized L-BFGS. Cubic Regularized Newton outperforms the classical Newton method in terms of global performance. In classics, L-BFGS approximation is applied for the Newton method. We propose a new variant of inexact Cubic Regularized Newton. Then, we use L-BFGS approximation as an inexact Hessian for Cubic Regularized Newton. It allows us to get better theoretical convergence rates and good practical performance, especially from the points where classical Newton is diverging.

2022
Conference Paper
Order up! The Benefits of Higher-Order Optimization in Machine Learning Workshop @ NeurIPS 2022

In the recent paper FLECS (Agafonov et al, FLECS: A Federated Learning Second-Order Framework via Compression and Sketching), the second-order framework FLECS was proposed for the Federated Learning problem. This method utilize compression of sketched Hessians to make communication costs low. However, the main bottleneck of FLECS is gradient communication without compression. In this paper, we propose the modification of FLECS with compressed gradient differences, which we call FLECS-CGD (FLECS with Compressed Gradient Differences) and make it applicable for stochastic optimization. Convergence guarantees are provided in strongly convex and nonconvex cases. Experiments show the practical benefit of proposed approach.

The paper studies the properties of stochastic gradient methods with preconditioning. We focus on momentum updated preconditioners with momentum coefficient $\beta$. Seeking to explain practical efficiency of scaled methods, we provide convergence analysis in a norm associated with preconditioner, and demonstrate that scaling allows one to get rid of gradients Lipschitz constant in convergence rates. Along the way, we emphasize important role of $\beta$, undeservedly set to constant $0.99...9$ at the arbitrariness of various authors. Finally, we propose the explicit constructive formulas for adaptive $\beta$ and step size values.

Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equation by using local linearizations of the model. We take a step further and develop a method for solving the interpolation equations that uses the local second-order approximation of the model. Our resulting method SP2 uses Hessian-vector products to speed-up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is competitive both in experiments and in theory. We show SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory on sums-of-quadratics.

2022
Conference Paper
NeurIPS

In this paper, we present the first stepsize schedule for Newton method resulting in fast global and local convergence guarantees. In particular, we a) prove an $\mathcal O \left( 1/{k^2} \right)$ global rate, which matches the state-of-the-art global rate of cubically regularized Newton method of Polyak and Nesterov (2006) and of regularized Newton method of Mishchenko (2021), and the later variant of Doikov and Nesterov (2021), b) prove a local quadratic rate, which matches the best-known local rate of second-order methods, and c) our stepsize formula is simple, explicit, and does not require solving any subproblem. Our convergence proofs hold under affine-invariant assumptions closely related to the notion of self-concordance. Finally, our method has competitive performance when compared to existing baselines which share the same fast global convergence guarantees.

2022
Conference Paper
31st ACM International Conference on Information and Knowledge Management, CIKM2022

2022
Journal Paper
EURO Journal on Computational Optimization

This paper considers the problem of decentralized, personalized federated learning. For centralized personalized federated learning, a penalty that measures the deviation from the local model and its average, is often added to the objective function. However, in a decentralized setting this penalty is expensive in terms of communication costs, so here, a different penalty -- one that is built to respect the structure of the underlying computational network -- is used instead. We present lower bounds on the communication and local computation costs for this problem formulation and we also present provably optimal methods for decentralized personalized federated learning. Numerical experiments are presented to demonstrate the practical performance of our methods.

Exploiting higher-order derivatives in convex optimization is known at least since 1970's. In each iteration higher-order (also called tensor) methods minimize a regularized Taylor expansion of the objective function, which leads to faster convergence rates if the corresponding higher-order derivative is Lipschitz-continuous. Recently a series of lower iteration complexity bounds for such methods were proved, and a gap between upper an lower complexity bounds was revealed. Moreover, it was shown that such methods can be implementable since the appropriately regularized Taylor expansion of a convex function is also convex and, thus, can be minimized in polynomial time. Only very recently an algorithm with optimal convergence rate 1/k(3p+1)/2 was proposed for minimizing convex functions with Lipschitz p-th derivative. For convex functions with Lipschitz third derivative, these developments allowed to propose a second-order method with convergence rate 1/k5, which is faster than the rate 1/k3.5 of existing second-order methods.

2022
Conference Paper
AISTATS

The problem of constrained Markov decision process is considered. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed with the integration of two ingredients: entropy regularized policy optimizer and Vayda’s dual optimizer, both of which are critical to achieve faster convergence. The finite-time error bound of the proposed approach is provided. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (with linear rate) to the global optimum. The complexity expressed in terms of the optimality gap and the constraint violation significantly improves upon the existing primal-dual approaches.

The assumption of bounded sharpness/smoothness, that is bounded maximum eigenvalue of the Hessian of the objective function, plays an important role in the analysis of optimization algorithms in the non-convex setting. However, recent empirical findings have shown that neural networks have unbounded sharpness, even worse the sharpness of the obtained solution seems to depend on the used learning rate. This finding questions the value of this very commonly used assumption and suggests that many theoretical analyses are based on false premises. Here, focusing on the case of deep linear networks, i.e., no activation function, we present a number of results to characterize the sharpness of global minima. First, we lower bound the sharpness in the minima, showing that, while unbounded, its minimal value is an increasing function of the number of layers. Then, we analyze the effect of L2 regularization, proving that it constrains the weights to be "balanced" at the minima, that in turn constraints the possible sharpness to a unique value, completely independent of the learning rate used. Finally, we show that GD with a specific stepsize converges to a critical point of a L2 regularized deep linear network from any initialization, because it will only visit points in a set with bounded sharpness. We also empirically validate our theoretical findings.

2022
arXiv Preprint

Inspired by the recent work FedNL (Safaryan et al, FedNL: Making Newton-Type Methods Applicable to Federated Learning), we propose a new communication efficient second-order framework for Federated learning, namely FLECS. The proposed method reduces the high-memory requirements of FedNL by the usage of an L-SR1 type update for the Hessian approximation which is stored on the central server. A low dimensional `sketch' of the Hessian is all that is needed by each device to generate an update, so that memory costs as well as number of Hessian-vector products for the agent are low. Biased and unbiased compressions are utilized to make communication costs also low. Convergence guarantees for FLECS are provided in both the strongly convex, and nonconvex cases, and local linear convergence is also established under strong convexity. Numerical experiments confirm the practical benefits of this new FLECS algorithm.

This paper investigates the problem of regret minimization in linear time-varying (LTV) dynamical systems. Due to the simultaneous presence of uncertainty and non-stationarity, designing online control algorithms for unknown LTV systems remains a challenging task. At a cost of NP-hard offline planning, prior works have introduced online convex optimization algorithms, although they suffer from nonparametric rate of regret. In this paper, we propose the first computationally tractable online algorithm with regret guarantees that avoids offline planning over the state linear feedback policies. Our algorithm is based on the optimism in the face of uncertainty (OFU) principle in which we optimistically select the best model in a high confidence region. Our algorithm is then more explorative when compared to previous approaches. To overcome non-stationarity, we propose either a restarting strategy (R-OFU) or a sliding window (SW-OFU) strategy. With proper configuration, our algorithm is attains sublinear regret $O(T^{5/6})$. These algorithms utilize data from the current phase for tracking variations on the system dynamics. We corroborate our theoretical findings with numerical experiments, which highlight the effectiveness of our methods. To the best of our knowledge, our study establishes the first model-based online algorithm with regret guarantees under LTV dynamical systems.

Methods with adaptive scaling of different features play a key role in solving saddle point problems, primarily due to Adam's popularity for solving adversarial machine learning problems, including GANS training. This paper carries out a theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RmsProp scaling and the newer AdaHessian and OASIS based on Hutchison approximation. We use the Extra Gradient and its improved version with negative momentum as the basic method. Experimental studies on GANs show good applicability not only for Adam, but also for other less popular methods.

This work considers non-convex finite sum minimization. There are a number of algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner that is based upon Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient based methods to give new `scaled' algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL-condition is assumed. Because our adaptively scaled methods use approximate partial second order curvature information, they are better able to mitigate the impact of badly scaled problems, and this improved practical performance is demonstrated in the numerical experiments that are also presented in this work.

2022
Conference Paper
MICCAI2022

Collaboration among multiple data-owning entities (e.g., hospitals) can accelerate the training process and yield better machine learning models due to the availability and diversity of data. However, privacy concerns make it challenging to exchange data while preserving confidentiality. Federated Learning (FL) is a promising solution that enables collaborative training through exchange of model parameters instead of raw data. However, most existing FL solutions work under the assumption that participating clients are honest and thus can fail against poisoning attacks from malicious parties, whose goal is to deteriorate the global model performance. In this work, we propose a robust aggregation rule called Distance-based Outlier Suppression (DOS) that is resilient to byzantine failures. The proposed method computes the distance between local parameter updates of different clients and obtains an outlier score for each client using Copula-based Outlier Detection (COPOD). The resulting outlier scores are converted into normalized weights using a softmax function, and a weighted average of the local parameters is used for updating the global model. DOS aggregation can effectively suppress parameter updates from malicious clients without the need for any hyperparameter selection, even when the data distributions are heterogeneous. Evaluation on two medical imaging datasets (CheXpert and HAM10000) demonstrates the higher robustness of DOS method against a variety of poisoning attacks in comparison to other state-of-the-art methods.

2022
arXiv Preprint

We propose general non-accelerated and accelerated tensor methods under inexact information on the derivatives of the objective, analyze their convergence rate. Further, we provide conditions for the inexactness in each derivative that is sufficient for each algorithm to achieve the desired accuracy. As a corollary, we propose stochastic tensor methods for convex optimization and obtain sufficient mini-batch sizes for each derivative.

2022
Journal Paper
Journal of Selected Topics in Signal Processing

A very large number of communications are typically required to solve distributed learning tasks, and this critically limits scalability and convergence speed in wireless communications applications. In this paper, we devise a Gradient Descent method with Sparsification and Error Correction (GD-SEC) to improve the communications efficiency in a general worker-server architecture. Motivated by a variety of wireless communications learning scenarios, GD-SEC reduces the number of bits per communication from worker to server with no degradation in the order of the convergence rate. This enables larger-scale model learning without sacrificing convergence or accuracy. At each iteration of GD-SEC, instead of directly transmitting the entire gradient vector, each worker computes the difference between its current gradient and a linear combination of its previously transmitted gradients, and then transmits the sparsified gradient difference to the server. A key feature of GD-SEC is that any given component of the gradient difference vector will not be transmitted if its magnitude is not sufficiently large. An error correction technique is used at each worker to compensate for the error resulting from sparsification. We prove that GD-SEC is guaranteed to converge for strongly convex, convex, and nonconvex optimization problems with the same order of convergence rate as GD. Furthermore, if the objective function is strongly convex, GD-SEC has a fast linear convergence rate. Numerical results not only validate the convergence rate of GD-SEC but also explore the communication bit savings it provides. Given a target accuracy, GD-SEC can significantly reduce the communications load compared to the best existing algorithms without slowing down the optimization process.

2022
Conference Paper
ICML

Gradient-free/zeroth-order methods for black-box convex optimization have been extensively studied in the last decade with the main focus on oracle calls complexity. In this paper, besides the oracle complexity, we focus also on iteration complexity, and propose a generic approach that, based on optimal first-order methods, allows to obtain in a black-box fashion new zeroth-order algorithms for non-smooth convex optimization problems. Our approach not only leads to optimal oracle complexity, but also allows to obtain iteration complexity similar to first-order methods, which, in turn, allows to exploit parallel computations to accelerate the convergence of our algorithms. We also elaborate on extensions for stochastic optimization problems, saddle-point problems, and distributed optimization.

2022
arXiv Preprint

In this study, we develop a limited memory nonconvex Quasi-Newton (QN) method, tailored to deep learning (DL) applications. Since the stochastic nature of (sampled) function information in minibatch processing can affect the performance of QN methods, three strategies are utilized to overcome this issue. These involve a novel progressive trust-region radius update (suitable for stochastic models), batched evaluation instead of the entire data set, for selecting gradient batch-size and a restart strategy when quasi-Netwon approximation accuracy deteriorates. We analyze the convergence properties of our proposed method and provide the required theoretical analysis for different components of our algorithm. The numerical results illustrate that our proposed methodology with the new adjustments outperforms the previous similar methods, and is competitive with the best tuned stochastic first-order methods, in cases where large batch-size is required. Finally, we empirically show that our method is robust to the choices of hyper-parameters, thus, requiring less tuning compared to Stochastic Gradient Descent (SGD) method.

2021
Conference Paper
International Conference on Learning Representations (ICLR)

We present a novel adaptive optimization algorithm for large-scale machine learning problems. Equipped with a low-cost estimate of local curvature and Lipschitz smoothness, our method dynamically adapts the search direction and step-size. The search direction contains gradient information preconditioned by a well-scaled diagonal preconditioning matrix that captures the local curvature information. Our methodology does not require the tedious task of learning rate tuning, as the learning rate is updated automatically without adding an extra hyperparameter. We provide convergence guarantees on a comprehensive collection of optimization problems, including convex, strongly convex, and nonconvex problems, in both deterministic and stochastic regimes. We also conduct an extensive empirical evaluation on standard machine learning problems, justifying our algorithm's versatility and demonstrating its strong performance compared to other start-of-the-art first-order and second-order methods.

2021
Conference Paper
Optimization for Machine Learning Workshop @ NeurIPS 2021

The StochAstic Recursive grAdient algoritHm (SARAH) algorithm is a variance reduced variant of the Stochastic Gradient Descent (SGD) algorithm that needs a gradient of the objective function from time to time. In this paper, we remove the necessity of a full gradient computation. This is achieved by using a randomized reshuffling strategy and aggregating stochastic gradients obtained in each epoch. The aggregated stochastic gradients serve as an estimate of a full gradient in the SARAH algorithm. We provide a theoretical analysis of the proposed approach and conclude the paper with numerical experiments that demonstrate the efficiency of this approach.

Personalized Federated Learning (PFL) has recently seen tremendous progress, allowing the design of novel machine learning applications to preserve the privacy of the training data. Existing theoretical results in this field mainly focus on distributed optimization for minimization problems. This paper is the first to study PFL for saddle point problems (which cover a broader class of optimization problems), allowing for a more rich class of applications requiring more than just solving minimization problems. In this work, we consider a recently proposed PFL setting with the mixing objective function, an approach combining the learning of a global model together with locally distributed learners. Unlike most previous work, which considered only the centralized setting, we work in a more general and decentralized setup that allows us to design and analyze more practical and federated ways to connect devices to the network. We proposed new algorithms to address this problem and provide a theoretical analysis of the smooth (strongly-)convex-(strongly-)concave saddle point problems in stochastic and deterministic cases. Numerical experiments for bilinear problems and neural networks with adversarial noise demonstrate the effectiveness of the proposed methods.

2021
Journal Paper
Optimization Methods and Software (OMS)

We present two sampled quasi-Newton methods for deep learning: sampled LBFGS (S-LBFGS) and sampled LSR1 (S-LSR1). Contrary to the classical variants of these methods that sequentially build Hessian or inverse Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate at every iteration to produce these approximations. As a result, the approximations constructed make use of more reliable (recent and local) information, and do not depend on past iterate information that could be significantly stale. Our proposed algorithms are efficient in terms of accessed data points (epochs) and have enough concurrency to take advantage of parallel/distributed computing environments. We provide convergence guarantees for our proposed methods. Numerical tests on a toy classification problem as well as on popular benchmarking neural network training tasks reveal that the methods outperform their classical variants.

2021
Conference Paper
The British Machine Vision Conference (BMVC 2021)

The goal of text-to-image synthesis is to generate a visually realistic image that matches a given text description. In practice, the captions annotated by humans for the same image have large variance in terms of contents and the choice of words. The linguistic discrepancy between the captions of the identical image leads to the synthetic images deviating from the ground truth. To address this issue, we propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images. In the pre-training stage, we utilize the contrastive learning approach to learn the consistent textual representations for the captions corresponding to the same image. Furthermore, in the following stage of GAN training, we employ the contrastive learning method to enhance the consistency between the generated images from the captions related to the same image. We evaluate our approach over two popular text-to-image synthesis models, AttnGAN and DM-GAN, on datasets CUB and COCO, respectively. Experimental results have shown that our approach can effectively improve the quality of synthetic images in terms of three metrics: IS, FID and R-precision. Especially, on the challenging COCO dataset, our approach boosts the FID significantly by 29.60% over AttnGAn and by 21.96% over DM-GAN.

2021
Journal Paper
Computational Optimization and Applications

In this work we introduce the concept of an Underestimate Sequence (UES), which is a natural extension of Nesterov's estimate sequence. Our definition of a UES utilizes three sequences, one of which is a lower bound (or under-estimator) of the objective function. The question of how to construct an appropriate sequence of lower bounds is also addressed, and we present lower bounds for strongly convex smooth functions and for strongly convex composite functions, which adhere to the UES framework. Further, we propose several first order methods for minimizing strongly convex functions in both the smooth and composite cases. The algorithms, based on efficiently updating lower bounds on the objective functions, have natural stopping conditions, which provides the user with a certificate of optimality. Convergence of all algorithms is guaranteed through the UES framework, and we show that all presented algorithms converge linearly, with the accelerated variants enjoying the optimal linear rate of convergence.

2021
Conference Paper
AISTATS '21

This work presents a new algorithm for empirical risk minimization. The algorithm bridges the gap between first- and second-order methods by computing a search direction that uses a second-order-type update in one subspace, coupled with a scaled steepest descent step in the orthogonal complement. To this end, partial curvature information is incorporated to help with ill-conditioning, while simultaneously allowing the algorithm to scale to the large problem dimensions often encountered in machine learning applications. Theoretical results are presented to confirm that the algorithm converges to a stationary point in both the strongly convex and nonconvex cases. A stochastic variant of the algorithm is also presented, along with corresponding theoretical guarantees. Numerical results confirm the strengths of the new approach on standard machine learning problems.

2021
Journal Paper
Manufacturing and Service Operations Management (accepted)

Problem definition: The beer game is widely used in supply chain management classes to demonstrate the bullwhip effect and the importance of supply chain coordination. The game is a decentralized, multiagent, cooperative problem that can be modeled as a serial supply chain network in which agents choose order quantities while cooperatively attempting to minimize the network’s total cost, although each agent only observes local information. Academic/practical relevance: Under some conditions, a base-stock replenishment policy is optimal. However, in a decentralized supply chain in which some agents act irrationally, there is no known optimal policy for an agent wishing to act optimally. Methodology: We propose a deep reinforcement learning (RL) algorithm to play the beer game. Our algorithm makes no assumptions about costs or other settings. As with any deep RL algorithm, training is computationally intensive, but once trained, the algorithm executes in real time. We propose a transfer-learning approach so that training performed for one agent can be adapted quickly for other agents and settings. Results: When playing with teammates who follow a base-stock policy, our algorithm obtains near-optimal order quantities. More important, it performs significantly better than a base-stock policy when other agents use a more realistic model of human ordering behavior. We observe similar results using a real-world data set. Sensitivity analysis shows that a trained model is robust to changes in the cost coefficients. Finally, applying transfer learning reduces the training time by one order of magnitude. Managerial implications: This paper shows how artificial intelligence can be applied to inventory optimization. Our approach can be extended to other supply chain optimization problems, especially those in which supply chain partners act in irrational or unpredictable ways. Our RL agent has been integrated into a new online beer game, which has been played more than 17,000 times by more than 4,000 people.

2021
Journal Paper
Computers and Chemical Engineering

Clustering and classification critically rely on distance metrics that provide meaningful comparisons between data points. To this end, learning optimal distance functions from data, known as metric learning, aims to facilitate supervised classification, particularly in high-dimensional spaces where visualization is challenging or infeasible. In particular, the Mahalanobis metric is the default choice due to simplicity and interpretability as a transformation of the simple Euclidean metric using a combination of rotation and scaling. In this work, we present several novel contributions to metric learning, both by way of formulation as well as solution methods. Our approach is motivated by agglomerative clustering with certain novel modifications that enable natural interpretation of the user-defined classes as clusters with the optimal metric. Our approach generalizes and improves upon leading methods by removing reliance on pre-designated “target neighbors,” “triplets,” and “similarity pairs.” Starting with the definition of a generalized metric that has the Mahalanobis metric as the second order term, we propose an objective function for metric selection that does not aim to isolate classes from each other like most previous work, but tries to distort the space minimally by aggregating co-class members into local clusters. Further, we formulate the problem as a mixed-integer optimization that can be solved efficiently for small/medium datasets and approximated for larger datasets. Another salient feature of our method is that it facilitates active learning by recommending precise regions to sample using the optimal metric to improve classification performance. These regions are indicated by boundary and outlier points of the dataset as defined by the metric. This targeted acquisition can significantly reduce computation and data acquisition by ensuring training data completeness, representativeness, and economy, which could also provide advantages in training data selection for other established methods like Deep Learning and Random Forests. We demonstrate classification and computational performance of our approach through several simple and intuitive examples, followed by results on real image and benchmark datasets.

2020
Conference Paper
6th Annual Conference on Machine Learning, Optimization and Data Science (LOD), 2020

In this paper, we present a scalable distributed implementation of the sampled LSR1 (S-LSR1) algorithm. First, we show that a naive distributed implementation of S-LSR1 requires multiple rounds of expensive communications at every iteration and thus is inefficient. We then propose DS-LSR1, a communication-efficient variant of the S-LSR1 method, that drastically reduces the amount of data communicated at every iteration, that has favorable work-load balancing across nodes and that is matrix-free and inverse-free. The proposed method scales well in terms of both the dimension of the problem and the number of data points. Finally, we illustrate the performance of DS-LSR1 on standard neural network training tasks.

We consider the problem of classifying a map using a team of communicating robots. It is assumed that all robots have localized visual sensing capabilities and can exchange their information with neighboring robots. Using a graph decomposition technique, we proposed an offline learning structure that makes every robot capable of communicating with and fusing information from its neighbors to plan its next move towards the most informative parts of the environment for map classification purposes. The main idea is to decompose a given undirected graph into a union of directed star graphs and train robots w.r.t a bounded number of star graphs. This will significantly reduce the computational cost of offline training and makes learning scalable (independent of the number of robots). Our approach is particularly useful for fast map classification in large environments using a large number of communicating robots. We validate the usefulness of our proposed methodology through extensive simulations.

2020
Journal Paper
Engineering Structures

Data-driven models for predicting dynamic responses of linear and nonlinear systems are of great importance due to their wide application from probabilistic analysis to inverse problems such as system identification and damage diagnosis. In this study, a physics-based recurrent neural network model is designed that is able to learn the dynamics of linear and nonlinear multiple degrees of freedom systems given a ground motion. The model is able to estimate a complete set of responses, including displacement, velocity, acceleration, and internal forces. Compared to the most advanced counterparts, this model requires a smaller number of trainable variables while the accuracy of predictions is higher for long trajectories. In addition, the architecture of the recurrent block is inspired by differential equation solver algorithms and it is expected that this approach yields more generalized solutions. In the training phase, we propose multiple novel techniques to dramatically accelerate the learning process using smaller datasets, such as hardsampling, utilization of trajectory loss function, and implementation of a trust-region approach. Numerical case studies are conducted to examine the strength of the network to learn different nonlinear behaviors. It is shown that the network is able to capture different nonlinear behaviors of dynamic systems with very high accuracy and with no need for prior information or very large datasets.

2020
Journal Paper
Optimization and Engineering (OPTE)

Given a multivariate data set, sparse principal component analysis (SPCA) aims to extract several linear combinations of the variables that together explain the variance in the data as much as possible, while controlling the number of nonzero loadings in these combinations. In this paper we consider 8 different optimization formulations for computing a single sparse loading vector; these are obtained by combining the following factors: we employ norms for measuring variance (L2, L1) and sparsity-inducing norms (L0, L1), which are used in different ways (constraint, penalty). Three of our formulations, notably the one with L0 constraint and L1 variance, have not been considered in the literature. We give a unifying reformulation which we propose to solve via a natural alternating maximization (AM) method. We show the the AM method is nontrivially equivalent to GPower (Journee et al; JMLR :517--553, 2010) for all our formulations. Besides this, we provide 24 efficient parallel SPCA implementations: 3 codes (multi-core, GPU and cluster) for each of the 8 problems. Parallelism in the methods is aimed at

2020
Journal Paper
Optimization Methods and Software (GOMS)

We develop and analyze a variant of variance reducing stochastic gradient algorithm, known as SARAH, which does not require computation of the exact gradient. Thus this new method can be applied to general expectation minimization problems rather than only finite sum problems. While the original SARAH algorithm, as well as its predecessor, SVRG, require an exact gradient computation on each outer iteration, the inexact variant of SARAH (iSARAH), which we develop here, requires only stochastic gradient computed on a mini-batch of sufficient size. The proposed method combines variance reduction via sample size selection and iterative stochastic gradient updates. We analyze the convergence rate of the algorithms for strongly convex, convex, and nonconvex cases with appropriate mini-batch size selected for each case. We show that with an additional, reasonable, assumption iSARAH achieves the best known complexity among stochastic methods in the case of general convex case stochastic value functions.

2020
Journal Paper
Journal of Machine Learning Research

We consider learning problems over training sets in which both, the number of training examples and the dimension of the feature vectors, are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple parallel processors to operate on a randomly chosen subset of blocks of the feature vector. RAPSA is doubly stochastic since each processor utilizes a random set of functions to compute the stochastic gradient associated with a randomly chosen sets of variable coordinates. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is strongly convex. Moreover, we present an accelerated version of RAPSA (ARAPSA) that incorporates the objective function curvature information by premultiplying the descent direction by a Hessian approximation matrix. We further extend the results for asynchronous settings and show that if the processors perform their updates without any coordination the algorithms are still convergent to the optimal argument. RAPSA and its extensions are then numerically evaluated on a linear estimation problem and a binary image classification task using the MNIST handwritten digit dataset.

2020
Journal Paper
Structure and Infrastructure Engineering

This paper presents a framework to tackle constrained combinatorial optimization problems using deep Reinforcement Learning (RL). To this end, we extend the Neural Combinatorial Optimization (NCO) theory in order to deal with constraints in its formulation. Notably, we propose defining constrained combinatorial problems as fully observable Constrained Markov Decision Processes (CMDP). In that context, the solution is iteratively constructed based on interactions with the environment. The model, in addition to the reward signal, relies on penalty signals generated from constraint dissatisfaction to infer a policy that acts as a heuristic algorithm. Moreover, having access to the complete state representation during the optimization process allows us to rely on memory-less architectures, enhancing the results obtained in previous sequence-to-sequence approaches. Conducted experiments on the constrained Job Shop and Resource Allocation problems prove the superiority of the proposal for computing rapid solutions when compared to classical heuristic, metaheuristic, and Constraint Programming (CP) solvers.

2020
Conference Paper
19th IEEE International Conference on Machine Learning and Applications

Discovering the underlying behavior of complex systems is an important topic in many science and engineering disciplines. In this paper, we propose a novel neural network framework, finite difference neural networks (FDNet), to learn partial differential equations from data. Specifically, our proposed finite difference inspired network is designed to learn the underlying governing partial differential equations from trajectory data, and to iteratively estimate the future dynamical behavior using only a few trainable parameters. We illustrate the performance (predictive power) of our framework on the heat equation, with and without noise and/or forcing, and compare our results to the Forward Euler method. Moreover, we show the advantages of using a Hessian-Free Trust Region method to train the network.

2020
Journal Paper
SIAM Journal on Matrix Analysis and Applications (SIMAX)

We develop a family of reformulations of an arbitrary consistent linear system into a stochastic problem. The reformulations are governed by two user-defined parameters: a positive definite matrix defining a norm, and an arbitrary discrete or continuous distribution over random matrices. Our reformulation has several equivalent interpretations, allowing for researchers from various communities to leverage their domain specific insights. In particular, our reformulation can be equivalently seen as a stochastic optimization problem, stochastic linear system, stochastic fixed point problem and a probabilistic intersection problem. We prove sufficient, and necessary and sufficient conditions for the reformulation to be exact. Further, we propose and analyze three stochastic algorithms for solving the reformulated problem---basic, parallel and accelerated methods---with global linear convergence rates. The rates can be interpreted as condition numbers of a matrix which depends on the system matrix and on the reformulation parameters. This gives rise to a new phenomenon which we call stochastic preconditioning, and which refers to the problem of finding parameters (matrix and distribution) leading to a sufficiently small condition number. Our basic method can be equivalently interpreted as stochastic gradient descent, stochastic Newton method, stochastic proximal point method, stochastic fixed point method, and stochastic projection method, with fixed stepsize (relaxation parameter), applied to the reformulations.

2020
Journal Paper
Computer-Aided Civil and Infrastructure Engineering

Many of the civil structures experience significant vibrations and repeated stress cycles during their life span. These conditions are the bases for fatigue analysis to accurately establish the remaining fatigue life of the structures that ideally requires a full-field strain assessment of the structures over years of data collection. Traditional inspection methods collect strain measurements by using strain gauges for a short time span and extrapolate the measurements in time; nevertheless, large-scale deployment of strain gauges is expensive and laborious as more spatial information is desired. This paper introduces a deep learning-based approach to replace this high cost by employing inexpensive data coming from acceleration sensors. The proposed approach utilizes collected acceleration responses as inputs to a multistage deep neural network based on long short-term memory and fully connected layers to estimate the strain responses. The memory requirement of training long acceleration sequences is reduced by proposing a novel training strategy. In the evaluation of the method, a laboratory-scale horizontally curved girder subjected to various loading scenarios is tested.

2020
Journal Paper
IMA Journal of Numerical Analysis

In this paper we consider large-scale smooth optimization problems with multiple linear coupled constraints. Due to the non-separability of the constraints, arbitrary random sketching would not be guaranteed to work. Thus, we first investigate necessary and sufficient conditions for the sketch sampling to have well-defined algorithms. Based on these sampling conditions we developed new sketch descent methods for solving general smooth linearly constrained problems, in particular, random sketch descent and accelerated random sketch descent methods. From our knowledge, this is the first convergence analysis of random sketch descent algorithms for optimization problems with multiple non-separable linear constraints. For the general case, when the objective function is smooth and non-convex, we prove for the non-accelerated variant sublinear rate in expectation for an appropriate optimality measure. In the smooth convex case, we derive for both algorithms, non-accelerated and accelerated random sketch descent, sublinear convergence rates in the expected values of the objective function. Additionally, if the objective function satisfies a strong convexity type condition, both algorithms converge linearly in expectation. In special cases, where complexity bounds are known for some particular sketching algorithms, such as coordinate descent methods for optimization problems with a single linear coupled constraint, our theory recovers the best-known bounds. We also show that when random sketch is sketching the coordinate directions randomly produces better results than the fixed selection rule. Finally, we present some numerical examples to illustrate the performances of our new algorithms.

2020
Journal Paper
ASCE's Journal of Engineering Mechanics

2020
Conference Paper
AISTATS 2020

In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which sample size is gradually increasing to quickly obtain a solution whose empirical loss is under satisfactory statistical accuracy. Our proposed method is multistage in which the solution of a stage serves as a warm start for the next stage which contains more samples (including the samples in the previous stage). The proposed multistage algorithm reduces the number of passes over data to achieve the statistical accuracy of the full training set. Moreover, our algorithm in nature is easy to be distributed and shares the strong scaling property indicating that acceleration is always expected by using more computing nodes. Various iteration complexity results regarding descent direction computation, communication efficiency and stopping criteria are analyzed under convex setting. Our numerical results illustrate that the proposed method outperforms other comparable methods for solving learning problems including neural networks.

2020
Conference Paper
Dynamics of Civil Structures, Volume 2 pp. 205-210

Large quantities of data which contain detailed condition information over an extended period of time should be utilized to prioritize infrastructure repairs. As the temporal and spatial resolution of monitoring data drastically increase by advances in sensing technology, structural health monitoring applications reach the thresholds of big data. Deep neural networks are ideally suited to use large representative training datasets to learn complex damage features. In the previous study of authors, a real-time deep learning platform was developed to solve damage detection and localization challenge. The network was trained by using simulated structural connection mimicking the real test object with a variety of loading cases, damage scenarios, and measurement noise levels for successful and robust diagnosis of damage. In this study, the proposed damage diagnosis platform is validated by using temporally and spatially dense data collected by Digital Image Correlation (DIC) from the specimen. Laboratory testing of the specimen with induced damage condition is performed to evaluate the performance and efficiency of damage detection and localization approach.

We propose basic and natural assumptions under which iterative optimization methods with compressed iterates can be analyzed. This problem is motivated by the practice of federated learning, where a large model stored in the cloud is compressed before it is sent to a mobile device, which then proceeds with training based on local data. We develop standard and variance reduced methods, and establish communication complexity bounds. Our algorithms are the first distributed methods with compressed iterates, and the first fixed point methods with compressed iterates.

2019
Conference Paper
Structural Health Monitoring 2019

Mathematical platforms that are able to estimate modal characteristics from mobile sensors are not much investigated. Mobile sensors collect spatially dense data compared to limited spatial density of fixed sensor networks. This feature potentially enables to refine the identified natural mode shapes as well as more robust estimations of other modal characteristics, e.g., natural frequencies and damping ratios. In this paper, highresolution natural mode shape identification of a simple-span bridge using mobile data is investigated. A recent methodology developed by authors is used to reconstruct a full bridge response matrix from mobile data. Matrix completion technique approximates unobserved signals at many virtual stationary locations via a convex optimization procedure. This reconstructed data is then fed in batches into available output-only system identification algorithms to extract modal properties. Mode shape refinement then is performed by superimposing identified results of all considered batches. The accuracy of the matrix completion for signal reconstruction was shown before, however, the performance of the estimated signal for modal identification has not been demonstrated yet. In this study, a numerical case study is examined to compare identification results from this procedure compared to a conventional sensing network consists of fixed sensors.

2019
Conference Paper
Beyond First Order Methods in ML Workshop @ NeurIPS 2019

2019
Conference Paper
Beyond First Order Methods in ML Workshop @ NeurIPS 2019

Discovering the underlying physical behavior of complex systems is a crucial, but less well-understood topic in many engineering disciplines. This study proposes a finite-difference inspired convolutional neural network framework to learn hidden partial differential equations from given data and iteratively estimate future dynamical behavior. The methodology designs the filter sizes such that they mimic the finite difference between the neighboring points. By learning the governing equation, the network predicts the future evolution of the solution by using only a few trainable parameters. In this paper, we provide numerical results to compare the efficiency of the second-order Trust-Region Conjugate Gradient (TRCG) method with the first-order ADAM optimizer.

2019
Conference Paper
Optimization and Machine Learning @ NeurIPS 2019

We present two sampled quasi-Newton methods: sampled LBFGS and sampled LSR1. Contrary to the classical variants that sequentially build Hessian approximations, our proposed methods sample points randomly around the current iterate to produce these approximations. As a result, the approximations constructed make use of more reliable (recent and local) information, and do not depend on past information that could be significantly stale. We provide convergence guarantees for our proposed methods, and illustrate their performance in practice.

2019
Conference Paper
Beyond First Order Methods in ML Workshop @ NeurIPS 2019

2019
arXiv Preprint

We propose a planning and perception mechanism for a robot (agent), that can only observe the underlying environment partially, in order to solve an image classification problem. A three-layer architecture is suggested that consists of a meta-layer that decides the intermediate goals, an action-layer that selects local actions as the agent navigates towards a goal, and a classification-layer that evaluates the reward and makes a prediction. We design and implement these layers using deep reinforcement learning. A generalized policy gradient algorithm is utilized to learn the parameters of these layers to maximize the expected reward. Our proposed methodology is tested on the MNIST dataset of handwritten digits, which provides us with a level of explainability while interpreting the agent's intermediate goals and course of action.

2019
Journal Paper
Optimization Methods and Software

This paper describes an implementation of the L-BFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multi-batch approach in which the data points used to compute function and gradients are purposely changed at each iteration to accelerate the learning process. Difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the updating process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and nonconvex functions, and illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.

2019
Journal Paper
Optimization Methods and Software

Distributed optimization algorithms are essential for training machine learning models on very large-scale datasets. However, they often suffer from communication bottlenecks. Confronting this issue, a communication-efficient primal-dual coordinate ascent framework (CoCoA) and its improved variant CoCoA+ have been proposed, achieving a convergence rate of (1/t) for solving empirical risk minimization problems with Lipschitz continuous losses. In this paper, an accelerated variant of CoCoA+ is proposed and shown to possess a convergence rate of (1/t2) in terms of reducing suboptimality. The analysis of this rate is also notable in that the convergence rate bounds involve constants that, except in extreme cases, are significantly reduced compared to those previously provided for CoCoA+. The results of numerical experiments are provided to show that acceleration can lead to significant performance gains.

2019
Journal Paper
minor revision in Journal of Machine Learning Research (JMLR)

The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is violated for cases where the objective function is strongly convex. In Bottou et al. (2016), a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. We show that for stochastic problems arising in machine learning such bound always holds; and we also propose an alternative convergence analysis of SGD with diminishing learning rate regime, which results in more relaxed conditions than those in Bottou et al. (2016). We then move on the asynchronous parallel setting, and prove convergence of Hogwild! algorithm in the same regime in the case of diminished learning rate. It is well-known that SGD converges if a sequence of learning rates $\{η_t\}$ satisfies $\sum_{t=0}^\infty η_t \to \infty$ and $\sum_{t=0}^\infty η^2_t < \infty$. We show the convergence of SGD for strongly convex objective function without using bounded gradient assumption when {ηt} is a diminishing sequence and ∑∞t=0ηt→∞. In other words, we extend the current state-of-the-art class of learning rates satisfying the convergence of SGD.

2019
Conference Paper
Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019)

We investigate a classification problem using multiple mobile agents that are capable of collecting (partial) pose-dependent observations of an unknown environment. The objective is to classify an image (e.g, map of a large area) over a finite time horizon. We propose a network architecture on how agents should form a local belief, take local actions, extract relevant features and specification from their raw partial observations. Agents are allowed to exchange information with their neighboring agents and run a decentralized consensus protocol to update their own beliefs. It is shown how reinforcement learning techniques can be utilized to achieve decentralized implementation of the classification problem. Our experimental results on MNIST handwritten digit dataset demonstrates the effectiveness of our proposed framework.

The newsvendor problem is one of the most basic and widely applied inventory models. There are numerous extensions of this problem. One important extension is the multi-item newsvendor problem, in which the demand of each item may be correlated with that of other items. If the joint probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may be not optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately as is typically done. The algorithm does not require the knowledge of the probability distributions of the demand. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and SVM approaches, especially for demands with high volatility.

2019
arXiv Preprint
arXiv

Although reinforcement learning (RL) can provide reliable solutions in many settings, practitioners are often wary of the discrepancies between the RL solution and their status quo procedures. Therefore, they may be reluctant to adapt to the novel way of executing tasks proposed by RL. On the other hand, many real-world problems require relatively small adjustments from the status quo policies to achieve improved performance. Therefore, we propose a student-teacher RL mechanism in which the RL (the ``student'') learns to maximize its reward, subject to a constraint that bounds the difference between the RL policy and the ``teacher'' policy. The teacher can be another RL policy (e.g., trained under a slightly different setting), the status quo policy, or any other exogenous policy. We formulate this problem using a stochastic optimization model and solve it using a primal-dual policy gradient algorithm. We prove that the policy is asymptotically optimal. However, a naive implementation suffers from high variance and convergence to a stochastic optimal policy. With a few practical adjustments to address these issues, our numerical experiments confirm the effectiveness of our proposed method in multiple GridWorld scenarios.

Low-rank methods for semidefinite programming (SDP) have gained considerable popularity, especially in machine learning applications. Their analyses often assume the use of determinant-based regularisers, which are rarely implemented, due to the run-time cubic in the dimension in conventional implementations of the computation of their gradient. We extend the convergence analyses of low-rank methods to a wide class of regularisers. Further, we show that the gradient of a well-known regulariser can be computed in time linear in the dimension, which makes the regularisation practical. Our results are illustrated on the Max-Cut SDP relaxation.

2019
Conference Paper
Springer Proceedings in Mathematics & Statistics (MOPTA)

We propose a novel topic discovery algorithm for unlabeled images based on the bag-of-words (BoW) framework. We first extract a dictionary of visual words and subsequently for each image compute a visual word occurrence histogram. We view these histograms as rows of a large matrix from which we extract sparse principal components (PCs). Each PC identifies a sparse combination of visual words which co-occur frequently in some images but seldom appear in others. Each sparse PC corresponds to a topic, and images whose interference with the PC is high belong to that topic, revealing the common parts possessed by the images. We propose to solve the associated sparse PCA problems using an Alternating Maximization (AM) method, which we modify for purpose of efficiently extracting multiple PCs in a deflation scheme. Our approach attacks the maximization problem in sparse PCA directly and is scalable to high-dimensional data. Experiments on automatic topic discovery and category prediction demonstrate encouraging performance of our approach.

2019
Journal Paper
Journal of Computing in Civil Engineering (Volume 33 Issue 3 - May 2019)

Damage diagnosis has been a challenging inverse problem in structural health monitoring. The main difficulty is characterizing the unknown relation between the measurements and damage patterns (i.e., damage indicator selection). Such damage indicators would ideally be able to identify the existence, location, and severity of damage. Therefore, this procedure requires complex data processing algorithms and dense sensor arrays, which brings computational intensity with it. To address this limitation, this paper introduces convolutional neural network (CNN), which is one of the major breakthroughs in image recognition, to the damage detection and localization problem. The CNN technique has the ability to discover abstract features and complex classifier boundaries that are able to distinguish various attributes of the problem. In this paper, a CNN topology was designed to classify simulated damaged and healthy cases and localize the damage when it exists. The performance of the proposed technique was evaluated through the finite-element simulations of undamaged and damaged structural connections. Samples were trained by using strain distributions as a consequence of various loads with several different crack scenarios. Completely new damage setups were introduced to the model during the testing process. Based on the findings of the proposed study, the damage diagnosis and localization were achieved with high accuracy, robustness, and computational efficiency.

Training very large machine learning models requires a distributed computing approach, with communication of the model updates often being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of the updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which means that they necessarily suffer from several issues, such as the inability to converge to the true optimum in the batch mode, inability to work with a nonsmooth regularizer, and slow convergence rates. In this work we propose a new distributed learning method---DIANA---which resolves these issues via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are vastly superior to existing rates. Our analysis of block-quantization and differences between ℓ2 and ℓ∞ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.

2018
arXiv Preprint

This paper proposes a framework of L-BFGS based on the (approximate) second-order information with stochastic batches, as a novel approach to the finite-sum minimization problems. Different from the classical L-BFGS where stochastic batches lead to instability, we use a smooth estimate for the evaluations of the gradient differences while achieving acceleration by well-scaling the initial Hessians. We provide theoretical analyses for both convex and nonconvex cases. In addition, we demonstrate that within the popular applications of least-square and cross-entropy losses, the algorithm admits a simple implementation in the distributed environment. Numerical experiments support the efficiency of our algorithms.

2018
Journal Paper
Journal of Machine Learning Research (JMLR)

The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for the distributed environment, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.

2018
Conference Paper
The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

2018
Journal Paper
Frontiers in Applied Mathematics and Statistics, section Optimization

In this paper we develop an adaptive dual free Stochastic Dual Coordinate Ascent (adfSDCA) algorithm for regularized empirical risk minimization problems. This is motivated by the recent work on dual free SDCA of Shalev-Shwartz [1]. The novelty of our approach is that the coordinates to update at each iteration are selected non-uniformly from an adaptive probability distribution, and this extends the previously mentioned work which only allowed for a uniform selection of “dual” coordinates from a fixed probability distribution. We describe an efficient iterative procedure for generating the non-uniform samples, where the scheme selects the coordinate with the greatest potential to decrease the sub-optimality of the current iterate. We also propose a heuristic variant of adfSDCA that is more aggressive than the standard approach. Furthermore, in order to utilize multi-core machines we consider a mini-batch adfSDCA algorithm and develop complexity results that guarantee the algorithm's convergence. The work is concluded with several numerical experiments to demonstrate the practical benefits of the proposed approach.

2018
Conference Paper
The 13th World Congress on Intelligent Control and Automation (WCICA 2018)

This paper proposes innovative anomaly detection technologies for manufacturing systems. We combine the event ordering relationship based structuring technique and the deep neural networks to develop the structured neural networks for anomaly detection. The event ordering relationship based neural network structuring process is performed before neural network training process and determines important neuron connections and weight initialization. It reduces the complexity of the neural networks and can improve anomaly detection accuracy. The structured time delay neural network (TDNN) is introduced for anomaly detection via supervised learning. To detect anomaly through unsupervised learning, we propose the structured autoencoder. The proposed structured neural networks outperform the unstructured neural networks in terms of anomaly detection accuracy and can reduce test error by 20%. Compared with popular methods such as one-class SVM, decision trees, and distance-based algorithms, our structured neural networks can reduce anomaly detection misclassification error by as much as 64%.

2018
Conference Paper
In Dynamics of Civil Structures

Structures experience large vibrations and stress variations during their life cycles. This causes reduction in their load-carrying capacity which is the main design criteria for many structures. Therefore, it is important to accurately establish the performance of structures after construction that often needs full-field strain or stress measurements. Many traditional inspection methods collect strain measurements by using wired strain gauges. These strain gauges carry a high installation cost and have high power demand. In contrast, this paper introduces a new methodology to replace this high cost with utilizing inexpensive data coming from wireless sensor networks. The study proposes to collect acceleration responses coming from a structure and give them as an input to deep learning framework to estimate the stress or strain responses. The obtained stress or strain time series then can be used in many applications to better understand the conditions of the structures. In this paper, designed deep learning architecture consists of multi-layer neural networks and Long Short-Term Memory (LSTM). The network achieves to learn the relationship between input and output by exploiting the temporal dependencies of them. In the evaluation of the method, a three-story steel building is simulated by using various dynamic wind and earthquake loading scenarios. The acceleration time histories under these loading cases are utilized to predict the stress time series. The learned architecture is tested on acceleration time series that the structure has never experienced.

2019
Journal Paper
Journal of Machine Learning Research (JMLR) (to appear)

We present an improved analysis of mini-batched stochastic dual coordinate ascent for regularized empirical loss minimization (i.e. SVM and SVM-type objectives). Our analysis allows for flexible sampling schemes, including where data is distribute across machines, and combines a dependence on the smoothness of the loss and/or the data spread (measured through the spectral norm).

2018
Conference Paper
Neural Information Processing Systems (NeurIPS) 2018

We present an end-to-end framework for solving Vehicle Routing Problem (VRP) using deep reinforcement learning. In this approach, we train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. Our model represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. Our method is faster in both training and inference than a recent method that solves the Traveling Salesman Problem (TSP), with nearly identical solution quality. On the more general VRP, our approach outperforms classical heuristics on medium-sized instances in both solution quality and computation time (after training). Our proposed framework can be applied to variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems.

2018
Conference Paper
ICML 2018 (35th International Conference on Machine Learning)

Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical analysis of convergence of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al.,2016) a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such bound always holds. Moreover, we propose an alternative convergence analysis of SGD with diminishing learning rate regime, which is results in more relaxed conditions that those in (Bottou et al.,2016). We then move on the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method in the case of diminished learning rate.

2017
Conference Paper
Deep Reinforcement Learning Symposium @ Neural Information Processing Systems (NeurIPS) 2017

The beer game is a decentralized, multi-agent, cooperative problem that can be modeled as a serial supply chain network in which agents cooperatively attempt to minimize the total cost of the network even though each agent can only observe its own local information. We develop a variant of the Deep Q-Network algorithm to solve this problem. Extensive numerical experiment show the effectiveness of our algorithm. Unlike most algorithms in literature, our algorithm does not have any limits on the parameter values, and it provides good solutions even if the agents do not follow a rational policy. The algorithm can be extended to other decentralized multi-agent cooperative games with partially observed information, which is a common type of situation in supply chain problems.

2017
Conference Paper
In Model Validation and Uncertainty Quantification, Volume 3 (pp. 331-337). Springer, Cham.

Detection of the deficiencies affecting the performance of the structures has been studied over the past few decades. How- ever, with the long-term data collection from dense sensor arrays, accurate damage diagnosis has become computationally challenging task. To address such problem, this paper introduces convolutional neural network (CNN), which has led to break- through results in computer vision, to the damage detection challenge. CNN technique has the ability to discover abstract features which are able to discriminate various aspect of interest. In our case, these features are used to classify “damaged” and “healthy” cases modeled through the finite element simulations. CNN is performed by using a Python library called Theano with the graphics processing unit (GPU) to achieve higher performance of these data-intensive calculations. The accuracy and sensitivity of the proposed technique are assessed with a cracked steel gusset connection model with multiplicative noise. Dur- ing the training procedure, strain distributions generated from different crack and loading scenarios are adopted. Completely unseen damage setups are introduced to the simulations while testing. Based on the findings of the proposed study, high accu- racy, robustness and computational efficiency are succeeded for the damage diagnosis.

2017
Conference Paper
SMAR 2017 (the fourth International Conference on Smart Monitoring, Assessment and Rehabilitation of Civil Structures)

Damage diagnosis of structures subjected to time-varying environmental and operational conditions has been a challenging task. This task involves a damage indicator selection to characterize the unknown relation between the measurements and damage patterns. The majority of the conventional methods adopt hand designed damage indicators that can be inefficient for some damage patterns and require manual effort to design. To address these challenges, this work uses a special kind of deep learning method called convolutional neural network (CNN) to learn complex damage features and create complex classifier boundaries. In the evaluation of the proposed methodology, multi-dimensional input samples are used — each dimension has an individual strain field resulting from a different force applied to the structure. The network is trained with several crack scenarios and the learned architecture is tested on completely unseen damage setups. Based on the findings of the paper, CNNs fed through multidimensional inputs improve the accuracy of the damage diagnosis. Furthermore, they give the opportunity to capture the behavior of the structures under variations in the loading conditions.

2017
Springer Proceedings in Mathematics & Statistics
Springer Proceedings in Mathematics & Statistics

This volume contains a selection of contributions that were presented at the Modeling and Optimization: Theory and Applications Conference (MOPTA) held at Lehigh University in Bethlehem, Pennsylvania, USA on August 17-19, 2016. The conference brought together a diverse group of researchers and practitioners, working on both theoretical and practical aspects of continuous or discrete optimization. Topics presented included algorithms for solving convex, network, mixed-integer, nonlinear, and global optimization problems, and addressed the application of deterministic and stochastic optimization techniques in energy, finance, logistics, analytics, health, and other important fields. The contributions contained in this volume represent a sample of these topics and applications and illustrate the broad diversity of ideas discussed at the meeting.

2016
Conference Paper
AAAI Workshop on Distributed Machine Learning

Training deep neural network is a high dimensional and a highly non-convex optimization problem. Stochastic gradient descent (SGD) algorithm and it's variations are the current state-of-the-art solvers for this task. However, due to non-covexity nature of the problem, it was observed that SGD slows down near saddle point. Recent empirical work claim that by detecting and escaping saddle point efficiently, it's more likely to improve training performance. With this objective, we revisit Hessian-free optimization method for deep networks. We also develop its distributed variant and demonstrate superior scaling potential to SGD, which allows more efficiently utilizing larger computing resources thus enabling large models and faster time to obtain desired solution. Furthermore, unlike truncated Newton method (Marten's HF) that ignores negative curvature information by using naive conjugate gradient method and Gauss-Newton Hessian approximation information - we propose a novel algorithm to explore negative curvature direction by solving the sub-problem with stabilized bi-conjugate method involving possible indefinite stochastic Hessian information. We show that these techniques accelerate the training process for both the standard MNIST dataset and also the TIMIT speech recognition problem, demonstrating robust performance with upto an order of magnitude larger batch sizes. This increased scaling potential is illustrated with near linear speed-up on upto 16 CPU nodes for a simple 4-layer network.

2017
Conference Paper
AAAI Workshop on Distributed Machine Learning

In this paper we study inexact dumped Newton method implemented in a distributed environment. We start with an original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015]. We will show that this algorithm may not scale well and propose an algorithmic modifications which will lead to less communications, better load-balancing and more efficient computation. We perform numerical experiments with an regularized empirical loss minimization instance described by a 273GB dataset.

In multi-echelon inventory systems the performance of a given node is affected by events that occur at many other nodes and at many other time periods. For example, a supply disruption upstream will have an effect on downstream, customer-facing nodes several periods later as the disruption \"cascades\" through the system. There is very little research on stock-out prediction in single-echelon systems and (to the best of our knowledge) none on multi-echelon systems. However, in real the world, it is clear that there is significant interest in techniques for this sort of stock-out prediction. Therefore, our research aims to fill this gap by using DNN to predict stock-outs in multi-echelon supply chains.

2018
Conference Paper
20th Power Systems Computation Conference

Consider a polynomial optimisation problem, whose instances vary continuously over time. We propose to use a coordinate-descent algorithm for solving such time-varying optimisation problems. In particular, we focus on relaxations of transmission-constrained problems in power systems. On the example of the alternating-current optimal power flows (ACOPF), we bound the difference between the current approximate optimal cost generated by our algorithm and the optimal cost for a relaxation using the most recent data from above by a function of the properties of the instance and the rate of change to the instance over time. We also bound the number of floating-point operations that need to be performed between two updates in order to guarantee the error is bounded from above by a given constant.

2017
Journal Paper
Optimization Methods and Software

In this work we study the parallel coordinate descent method (PCDM) proposed by Richtárik and Takac [26] for minimizing a regularized convex function. We adopt elements from the work of Xiao and Lu [39], and combine them with several new insights, to obtain sharper iteration complexity results for PCDM than those presented in [26]. Moreover, we show that PCDM is monotonic in expectation, which was not confirmed in [26], and we also derive the first high probability iteration complexity result where the initial levelset is unbounded.

2017
Journal Paper
IEEE Transactions on Smart Grid

Many steady-state problems in power systems, including rectangular power-voltage formulations of optimal power flows in the alternating-current model (ACOPF), can be cast as polynomial optimisation problems (POP). For a POP, one can derive strong convex relaxations, or rather hierarchies of ever stronger, but ever larger relaxations. We study means of switching from solving the convex relaxation to Newton method working on a non-convex Lagrangian of the POP.

2017
Journal Paper
Optimization Methods and Software

With the growth of data and necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.

2017
Journal Paper
Optimization Methods and Software

A novel rank-constrained re-formulation of alternating-current optimal power flow problem makes it possible to derive novel semidefinite programming (SDP) relaxations. For those, we develop a solver, which is often as fast as Matpower's interior point method, within the same accuracy.

In this paper, we study and analyze the mini-batch version of StochAstic Recursive grAdient algoritHm (SARAH), a method employing the stochastic recursive gradient, for solving empirical loss minimization for the case of nonconvex losses. We provide a sublinear convergence rate (to stationary points) for general nonconvex functions and a linear convergence rate for gradient dominated functions, both of which have some advantages compared to other modern stochastic gradient algorithms for nonconvex losses.

2017
Conference Paper
ICML 2017 (34th International Conference on Machine Learning)

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to the finite-sum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; when comparing to SAG/SAGA, SARAH does not require a storage of past gradients. The linear convergence rate of SARAH is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, the property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm.

2017
Conference Paper
Proceedings of MOPTA 2016

We propose a projected semi-stochastic gradient descent method with mini-batch for improving both the theoretical complexity and practical performance of the general stochastic gradient descent method (SGD). We are able to prove linear convergence under weak strong convexity assumption. This requires no strong convexity assumption for minimizing the sum of smooth convex functions subject to a compact polyhedral set, which remains popular across machine learning community. Our PS2GD preserves the low-cost per iteration and high optimization accuracy via stochastic gradient variance-reduced technique, and admits a simple parallel implementation with mini-batches. Moreover, PS2GD is also applicable to dual problem of SVM with hinge loss.

2016
Journal Paper
Journal of Machine Learning Research

In this paper we generalize the framework of the feasible descent method (FDM) to a randomized (R-FDM) and a coordinate-wise random feasible descent method (RC-FDM) framework. We show that the famous SDCA algorithm for optimizing the SVM dual problem, or the stochastic coordinate descent method for the LASSO problem, fits into the framework of RC-FDM. We prove linear convergence for both R-FDM and RC-FDM under the weak strong convexity assumption. Moreover, we show that the duality gap converges linearly for RC-FDM, which implies that the duality gap also converges linearly for SDCA applied to the SVM dual problem.

The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This inherently gives the algorithm a stochastic flavor that can cause instability in L-BFGS, a popular batch method in machine learning. These difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations; when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

2016
Journal Paper
Optimization Letters, 10(6), 1233-1243

We propose and analyze a new parallel coordinate descent method—NSync—in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen using an arbitrary probability law. This is the first method of this type. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. The complexity and practical performance of the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iteration—with optimal probabilities—may require less iterations, both in theory and practice, than the strategy of updating all coordinates at every iteration.

2016
Journal Paper
European Journal of Operational Research

Matrix completion under interval uncertainty can be cast as a matrix completion problem with element-wise box constraints. We present an efficient alternating-direction parallel coordinate-descent method for the problem. We show that the method outperforms any other known method on a benchmark in image in-painting in terms of signal-to-noise ratio, and that it provides high-quality solutions for an instance of collaborative filtering with 100,198,805 recommendations within 5 minutes on a single personal computer.

2016
Conference Paper
ICML 2016 (33rd International Conference on Machine Learning)

We propose a new algorithm for minimizing regularized empirical loss: Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each iteration we update a random subset of the dual variables. However, unlike existing methods such as stochastic dual coordinate ascent, SDNA is capable of utilizing all curvature information contained in the examples, which leads to striking improvements in both theory and practice - sometimes by orders of magnitude. In the special case when an L2-regularizer is used in the primal, the dual problem is a concave quadratic maximization problem plus a separable term. In this regime, SDNA in each step solves a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function; whence the name of the method. If, in addition, the loss functions are quadratic, our method can be interpreted as a novel variant of the recently introduced Iterative Hessian Sketch.

2016
Conference Paper
ICML 2016 (33rd International Conference on Machine Learning)

We propose an algorithm-independent framework to equip existing optimization methods with primal-dual certificates. Such certificates and corresponding rate of convergence guarantees are important for practitioners to diagnose progress, in particular in machine learning applications. We obtain new primal-dual convergence rates e.g. for the Lasso as well as many L1, Elastic-Net and group-lasso-regularized problems. The theory applies to any norm-regularized generalized linear model. Our approach provides efficiently computable duality gaps which are globally defined, without modifying the original problems in the region of interest.

2016
Journal Paper
Journal of Machine Learning Research

In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix.

2016
Journal Paper
IEEE Journal of Selected Topics in Signal Processing

We propose mS2GD: a method incorporating a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent (S2GD). We consider the problem of minimizing a strongly convex function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex regularizer. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample $b$ functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that it benefits from two speedup effects. First, we prove that as long as $b$ is below a certain threshold, we can reach any predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

2015
Conference Paper
OptML@NeurIPS 2015

In this paper we study the effect of the way that the data is partitioned in distributed optimization. The original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015] partitions the input data based on samples. We describe how the original algorithm has to be modified to allow partitioning on features and show its efficiency both in theory and also in practice.

2015
Conference Paper
OptML@NeurIPS 2015

In this paper we develop dual free SDCA with adaptive probabilities for regularized empirical risk minimization. This extends recent work of Shai Shalev-Shwartz [SDCA without Duality, arXiv:1502.06177] to allow non-uniform selection of \"dual\" coordinate in SDCA. Moreover, the probability can change over time, making it more efficient than uniform selection. Our work focuses on generating adaptive probabilities through iterative process, preferring to choose coordinate with highest potential to decrease sub-optimality. We also propose a practical variant Algorithm adfSDCA+ which is more aggressive. The work is concluded with multiple experiments which shows efficiency of proposed algorithms.

2015
Conference Paper
ICML 2015 (32nd International Conference on Machine Learning)

Distributed optimization algorithms for large-scale machine learning suffer from a communication bottleneck. Reducing communication makes the efficient aggregation of partial work from different machines more challenging. In this paper we present a novel generalization of the recent communication efficient primal-dual coordinate ascent framework (CoCoA). Our framework, CoCoA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes only allowed conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both CoCoA as well as our new variants, and generalize the theory for both methods to also cover non-smooth convex loss functions. We provide an extensive experimental comparison on several real-world distributed datasets, showing markedly improved performance, especially when scaling up the number of machines.

2015
Journal Paper
Mathematical Programming

In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a smooth convex function and a simple separable convex function. The theoretical speedup, as compared to the serial method, and referring to the number of iterations needed to approximately solve the problem with high probability, is a simple expression depending on the number of parallel processors and a natural and easily computable measure of separability of the smooth component of the objective function. In the worst case, when no degree of separability is present, there may be no speedup; in the best case, when the problem is separable, the speedup is equal to the number of processors. Our analysis also works in the mode when the number of blocks being updated at each iteration is random, which allows for modeling situations with busy or unreliable processors. We show that our algorithm is able to solve a LASSO problem involving a matrix with 20 billion nonzeros in 2 hours on a large memory node with 24 cores.

2014
Conference Paper
OPT 2014: Optimization for Machine Learning @NeurIPS 2014

We propose a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent applied to the problem of minimizing a strongly convex composite function represented as the sum of an average of a large number of smooth convex functions, and simple nonsmooth convex function. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample b functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that the method benefits from two speedup effects. First, we prove that as long as b is below a certain threshold, we can reach predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same .001-accurate solution quality on average 25x as quickly.

2014
PhD thesis
(PhD thesis), University of Edinburgh

2014
Journal Paper
Numerical Analysis and Optimization 2014, Springer Proceedings in Mathematics and Statistics

In this work we propose a distributed randomized block coordinate descent method for minimizing a convex function with a huge number of variables/coordinates. We analyze its complexity under the assumption that the smooth part of the objective function is partially block separable, and show that the degree of separability directly influences the complexity. This extends the results in [22] to a distributed environment. We first show that partially block separable functions admit an expected separable overapproximation (ESO) with respect to a distributed sampling, compute the ESO parameters, and then specialize complexity results from recent literature that hold under the generic ESO assumption. We describe several approaches to distribution and synchronization of the computation across a cluster of multi-core computer and provide promising computational results.

2014
Conference Paper
MLSP2014: IEEE International Workshop on Machine Learning for Signal Processing

We propose an efficient distributed randomized coordinate descent method for minimizing regularized non-strongly convex loss functions. The method attains the optimal convergence rate, where is the iteration counter. The core of the work is the theoretical study of stepsize parameters. We have implemented the method on Archer - the largest supercomputer in the UK - and show that the method is capable of solving a (synthetic) LASSO optimization problem with 50 billion variables.

2013
Conference Paper
ICML 2013 (30th International Conference on Machine Learning)

We address the issue of using mini-batches in stochastic optimization of SVMs. We show that the same quantity, the , controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and use it to derive novel variants of mini-batched SDCA. Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hinge-loss.

2011
Conference Paper
Operations Research Proceedings 2011, pp. 27-32, Springer-Verlag 2012

In this work we propose solving huge-scale instances of the truss topology design problem with coordinate descent methods. We develop four efficient codes: serial and parallel implementations of randomized and greedy rules for the selection of the variable (potential bar) to be updated in the next iteration. Both serial methods enjoy an O(n/k) iteration complexity guarantee, where n is the number of potential bars and k the iteration counter. Our parallel implementations, written in CUDA and running on a graphical processing unit (GPU), are capable of speedups of up to two orders of magnitude when compared to their serial counterparts. Numerical experiments were performed on instances with up to 30 million potential bars.

2011
Journal Paper
Mathematical Programming, Series A, 38 pages, 2012

In this paper we develop a randomized block-coordinate descent method for minimizing the sum of a smooth and a simple nonsmooth block-separable convex function and prove that it obtains an $\epsilon$-accurate solution with probability at least $1- ho$ in at most \$O( frac{n}{\epsilon} \log frac{1}{ ho})$ iterations, where \$n\$ is the number of blocks. For strongly convex functions the method converges linearly. This extends recent results of Nesterov , which cover the smooth case, to composite minimization, while at the same time improving the complexity by the factor of 4 and removing $\epsilon$ from the logarithmic term. More importantly, in contrast with the aforementioned work in which the author achieves the results by applying the method to a regularized version of the objective function with an unknown scaling factor, we show that this is not necessary, thus achieving true iteration complexity bounds. In the smooth case we also allow for arbitrary probability vectors and non-Euclidean norms. Finally, we demonstrate numerically that the algorithm is able to solve huge-scale $\ell_1$-regularized least squares and support vector machine problems with a billion variables.

2011
Conference Paper
Proceedings of SPARS11 (4th Workshop on Signal Processing with Adaptive Sparse Structured Representations), June 27-30, 2011

2011
Journal Paper
International Journal of Numerical Analysis and Modeling, Ser. B, 2(2-3), 2011 231-247

In this paper we analyze American style of floating strike Asian call options belonging to the class of financial derivatives whose payoff diagram depends not only on the underlying asset price but also on the path average of underlying asset prices over some predetermined time interval. The mathematical model for the option price leads to a free boundary problem for a parabolic partial differential equation. Applying fixed domain transformation and transformation of variables we develop an efficient numerical algorithm based on a solution to a non-local parabolic partial differential equation for the transformed variable representing the synthesized portfolio. For various types of averaging methods we investigate the dependence of the early exercise boundary on model parameters.