Most machine learning tasks require solving non-convex optimization problems. For these problems, as the data dimensions grow, the number of critical points can grow exponentially. Local search methods such as gradient descent can get stuck in one of these critical points. Finding the globally optimal solution is computationally hard in the worst case for nonconvex optimization. Instead, over the last few years, focus has shifted in characterizing transparent conditions for non-convex problems which are tractable. In many instances, these conditions turn out to be mild and natural for machine learning applications. This tutorial will provide an overview of the recent theoretical success stories in nonconvex optimization. This includes learning latent variable models, dictionary learning, robust principal component analysis, and so on. Simple iterative methods such as spectral methods, alternating projections, and so on, can be proven to learn consistent models with polynomial sample and computational complexity. This tutorial will present main ingredients towards establishing these results. For instance, one approach is to first obtain an approximate solution which lands in the basin of the globally optimal solution, and then run gradient descent to reach the global optimum. Another approach is to discard the original objective function, and to instead optimize over different objective functions which are better behaved. For example, the use of moment fitting objective instead of maximum likelihood for learning latent variable models allows us to establish consistency in estimation. For problems where finding the globally optimal solution is not tractable, it is desirable to at least reach a local optimum. It turns out that even this is computationally hard in the worst case. In this tutorial, we will cover the important problem of escaping saddle points. Saddle points are critical points which are not local minima, meaning there exist directions where the objective value decreases (for minimization problems). Saddle points can slow down gradient descent arbitrarily. Alternatively, if Newton's method is run, it converges to an arbitrary critical point, and does not distinguish between a local minimum and a saddle 1 point. Solutions such as Trust region methods and recent work on noisy stochastic gradient descent will be covered. There are certainly many challenging open problems in the area of non-convex optimization. While guarantees have been established in individual instances, there is no common unifying theme of what makes non-convex problem tractable. Many challenging instances such as optimization for training multi-layer neural networks or analyzing novel regularization techniques such as dropout for non-convex optimization still remain wide open. On the practical side, conversations between theorists and practitioners can help identify what kind of conditions are reasonable for specific applications, and thus lead to the design of practically motivated algorithms for non-convex optimization with rigorous guarantees. The tutorial will cover these issues as well.

By Majid Janzamin.

By Furong Huang.

By Anima Anandkumar, Rong Ge.

By Rong Ge, Furong Huang, Chi Jin, Yang Yuan.