A Google Engineer on Machine Learning: Expressivity, Trainability, and Generalization

This blog post, written by Eric Jang of Google Brain, presents a framework for evaluating machine learning research along three key dimensions: expressivity, trainability, and generalization. The author argues that these concepts provide a useful lens for assessing the contributions of papers, especially in theoretical AI research. It is one of the more insightful summaries of machine learning research from the year it was published, and it is recommended for anyone interested in understanding the broader landscape of AI.

When I read a paper, I often ask myself whether its contribution falls under one or more of these categories: expressivity, trainability, or generalization. This classification was introduced to me by my colleague Jascha Sohl-Dickstein, and I have found it enormously helpful for connecting different subfields of AI, such as robotics, generative models, and NLP, to a larger, unified perspective. In this post I discuss how these concepts relate to current (as of November 2017) research in supervised, unsupervised, and reinforcement learning. I also split generalization into two types, "weak generalization" and "strong generalization," which I explore separately. Here is a summary of my thoughts:

[Image: summary table of expressivity, trainability, and generalization across supervised, unsupervised, and reinforcement learning]

I would like to thank Jascha Sohl-Dickstein and Ben Poole for their feedback and editing, and Marcin Moczulski for valuable discussions on the trainability of RNNs. This article covers a wide range of topics and is largely based on my personal opinions, so I want to emphasize that any factual errors are my own and do not reflect the views of my colleagues or editors. If you have questions or suggestions, feel free to share them in the comments or send me an email; I am writing this to learn and grow.

**Expressivity: What can this model compute?**

Expressivity describes the complexity of the functions that a parametric model, such as a neural network, can represent. The expressive power of deep networks grows exponentially with depth [2], which means that most of the problems we study today—supervised, unsupervised, and reinforcement learning alike—can be fully expressed by medium-sized neural networks. One piece of evidence is that deep networks can memorize very large datasets.

Neural networks can represent a wide variety of data: continuous, complex, discrete, and even random variables. With advances in generative modeling and Bayesian deep learning, they have been used to build probabilistic neural networks that produce impressive results on generation tasks.

Recent breakthroughs in generative modeling highlight this expressive power. GANs, for instance, can model data manifolds (such as images or audio) well enough that their samples are almost indistinguishable from real data. The image below shows the output of a recent NVIDIA GAN architecture:

[Image: samples from a recent NVIDIA GAN architecture]

While not perfect, these results show great progress. Similarly, WaveNet has achieved remarkable success in audio synthesis, producing samples that sound like real human speech.

Unsupervised learning is not only about generating data. Some researchers, like Yann LeCun, prefer the term "predictive learning": the goal is to infer the past, estimate the present, or predict the future.
However, since many unsupervised methods aim to predict extremely complex joint distributions, I believe generative modeling remains a strong benchmark for expressive power in this field.

Neural networks also appear expressive enough for reinforcement learning. A small network (two convolutional layers and two fully connected layers) is powerful enough to solve Atari and MuJoCo control tasks. Training such models reliably is still challenging, though, and I discuss this later.

On its own, expressivity is not a particularly interesting problem: we can always increase it by adding more layers or more connections. The real challenge is to make the network perform well on test data while keeping training tractable. For example, even though a deep fully connected network could memorize the training set, a two-dimensional convolutional structure is probably needed to make an image classifier generalize.

Expressivity is the easiest of the three problems to make progress on (just add more layers!), but it is also the most mysterious. We have no good way of measuring how much expressivity a given task requires. What kinds of problems need networks that are orders of magnitude larger than the ones we currently use? Why do those problems demand so much computation? Are our current networks expressive enough to reach human-like intelligence? Does solving generalization require vastly more expressive models?

The brain has many orders of magnitude more neurons (around 1e11) than a large neural network has units (Inception-ResNet-V2 has roughly 25 million ReLU units), and biological neurons are far more complex than their artificial counterparts: a single neuron in a fruit fly can implement a collision-detection system that outperforms those used in drones. Where does this expressive power come from, and how expressive does a neural network need to be to reach human-level intelligence?

**Trainability: Can we find a good model in the space of all possible models?**

A machine learning model is any program that learns functions from data. During "learning," we search a potentially huge space of models for a good one—one that uses the knowledge in the data to make decisions. This search is usually framed as an optimization problem.

There are several types of optimization. A common approach in deep learning is to define a scalar metric that measures how "good" a model is, and then maximize (or minimize) it with numerical optimization techniques. A concrete example: minimizing the average cross-entropy error is the standard way to train neural networks for image classification. The hope is that a model that achieves the lowest cross-entropy loss on the training set will also perform the desired task—correctly classifying images, with high precision and recall, on a held-out test set. The evaluation metric usually cannot be optimized directly (we do not have access to the test set), so a surrogate such as cross-entropy on the training set is optimized instead. Finding a good model—training—is thus equivalent to solving an optimization problem; a minimal sketch of this setup follows.
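As a concrete illustration of this framing, here is a minimal sketch, assuming a toy synthetic dataset and a linear softmax classifier (both invented here for illustration, not taken from the original post): the model is trained by SGD to minimize average cross-entropy on the training set, and the test set is only used for measurement, never for optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-class problem: the "true" labels depend linearly on the inputs.
n_train, n_test, d, k = 1000, 200, 20, 3
W_true = rng.normal(size=(d, k))
X_train, X_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_train = (X_train @ W_true).argmax(axis=1)
y_test = (X_test @ W_true).argmax(axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, y):
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

# Train a linear softmax classifier by SGD on the training cross-entropy.
W = np.zeros((d, k))
lr, batch = 0.5, 32
for step in range(2000):
    idx = rng.integers(0, n_train, size=batch)
    Xb, yb = X_train[idx], y_train[idx]
    p = softmax(Xb @ W)
    p[np.arange(batch), yb] -= 1.0                # d(loss)/d(logits) = p - onehot(y)
    W -= lr * (Xb.T @ p) / batch                  # gradient step on the surrogate loss

print("train cross-entropy:", cross_entropy(W, X_train, y_train))
print("test accuracy:      ", ((X_test @ W).argmax(axis=1) == y_test).mean())
```

The point of the sketch is only the structure: the thing we actually optimize (training cross-entropy) is a stand-in for the thing we actually care about (test accuracy).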
Sometimes, however, the optimization objective itself is hard to specify. In supervised tasks such as image super-resolution and image synthesis, for instance, it is difficult to define a single scalar quantity that captures perceptual quality, so there is no obvious "goodness" to maximize directly. To address this, co-adaptation techniques have become popular: they recast the optimization problem as finding an equilibrium between two non-stationary distributions that evolve together, much like predator-prey dynamics in nature. The generator and discriminator in a GAN work this way, sidestepping the need for an explicit perceptual-loss objective [3].

Evolutionary strategies treat optimization as a simulation: a dynamical system is defined over a population of models, and the population is updated according to the system's rules at each time step. Over time, the system hopefully converges toward "good models." In reinforcement learning, evolutionary strategies have been applied with some success; David Ha's "A Visual Guide to Evolution Strategies" explains the idea in detail.

**The current state of research**

For feedforward networks and supervised learning, trainability has largely been solved (empirically, not theoretically). Breakthroughs from around 2015—Batch Norm, ResNets, Good Init—make feedforward networks much easier to train; today, networks hundreds of layers deep can drive training error on large-scale datasets close to zero.

Recurrent neural networks remain trickier. The research community has made significant progress, though: dropping an LSTM into a complex robotic policy and expecting it to "just work" is no longer unrealistic, which would have been unthinkable only a few years ago.

In unsupervised learning, models often have much larger outputs, such as high-resolution images or long sequences of speech and text, which makes training harder. A major breakthrough of 2017 was making GANs much easier to train; improvements such as using the Wasserstein distance instead of the Jensen-Shannon divergence have made GANs stable and practical for industrial applications. Inference over discrete latent variables was also difficult because of high-variance Monte Carlo gradient estimators, but recent architectural advances have made it more reliable.

Deep reinforcement learning, however, still lags behind in pure trainability. For environments with long horizons, we are effectively nesting optimization problems: an inner optimization solves for optimal control, while an outer loop learns from the agent's experience. Recently, I added one extra dimension to a continuous robot control task, and my reinforcement learning algorithm's performance dropped from over 80% to 10%.

Reinforcement learning is not only hard to train but also unstable. Because the optimization is so noisy, different random seeds can lead to vastly different results, which is why RL papers report reward curves averaged over multiple trials with different seeds. The trainability problem in RL remains unsolved: we cannot enlarge a problem slightly and expect the same learning procedure to keep performing consistently.

If we think of RL as a pure optimization problem, the situation is still very challenging. Consider an environment where a sparse reward is only given at the end of an episode. The number of possible action sequences grows exponentially with the horizon, yet only a few of them lead to success. Estimating the policy gradient at any point in model space therefore requires a huge number of samples before a useful learning signal appears. This is essentially Monte Carlo estimation with the reward mass concentrated on something close to a delta distribution: when the proposal distribution barely overlaps the reward distribution, finite-sample Monte Carlo estimates fail entirely, no matter how many samples are collected (the toy simulation below makes this concrete).
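Here is a toy numerical sketch of that failure mode (the horizon, policy parameterization, and batch size are all invented assumptions, not any particular benchmark): a Bernoulli policy over T binary actions, with reward 1 only for one specific action sequence. The REINFORCE score-function gradient is exactly zero whenever no sampled trajectory hits the reward, which at this horizon is almost every batch.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 20                                  # horizon: 2**20 possible action sequences
target = rng.integers(0, 2, size=T)     # the single rewarded sequence
theta = np.zeros(T)                     # per-step logits of a Bernoulli policy

def sample_batch(theta, batch):
    p = 1.0 / (1.0 + np.exp(-theta))                         # P(action = 1) per step
    actions = (rng.random((batch, T)) < p).astype(float)
    rewards = (actions == target).all(axis=1).astype(float)  # sparse terminal reward
    # REINFORCE / score-function estimator: reward * d log pi / d theta, with
    # d log Bernoulli(a; sigmoid(theta)) / d theta = a - p.
    grad_logp = actions - p
    return (rewards[:, None] * grad_logp).mean(axis=0), rewards

zero_grad_batches = 0
for _ in range(1000):
    g, r = sample_batch(theta, batch=256)
    zero_grad_batches += (r.sum() == 0)   # no rewarded trajectory -> zero gradient

print("batches with exactly zero learning signal:", zero_grad_batches, "/ 1000")
```

At initialization, the chance of sampling the rewarded sequence is roughly one in a million per trajectory, so a batch of 256 almost never carries any gradient information at all.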
Moreover, if the data distribution is non-stationary (as in off-policy learning with a replay buffer), collecting "bad data" introduces an unstable feedback loop into the outer optimization process.

From an optimization perspective, the landscape looks like "Swiss cheese": convex pockets of good optima surrounded by wide plateaus where the policy gradient is useless. Because most of model space carries so little information, learning is extremely difficult; without a good representation, we might as well re-seed randomly and run policy rollouts until we stumble upon a good model. In fact, the strength of simple RL baselines suggests that our optimization landscapes may really look like this.

I believe RL benchmarks such as Atari and MuJoCo do not truly push the limits of machine learning: they ask for a single monolithic policy in a fairly sterile setting, with minimal selection pressure for "generalization," so they are more optimization challenges than real machine learning problems. It is not that I want to entangle the trainability of RL with generalization (it certainly does not make debugging easier!), but I believe that making agents that understand their environment and tasks is the only way to obtain learning signal efficiently enough to tackle real-world robotics problems.

Contrast this with supervised and unsupervised learning, where a learning signal can be obtained cheaply anywhere in the model search space: the proposal distribution of minibatch gradients overlaps the true gradient distribution with non-zero mass. Even with SGD at minibatch size 1, the probability of sampling a useful learning signal is 1/N, where N is the dataset size, so we can make progress simply by throwing more computation and data at the problem. Better still, improving perceptual generalization in the lower layers, by "bootstrapping" on low-level features, may actually reduce variance for the layers above.

To solve high-dimensional and complex RL problems, generalization and perceptual understanding must be addressed before the numerical optimization. We need to reach a point where each data point provides a non-zero number of bits to the RL algorithm, and where importance-sampled gradients remain feasible even when the task is so complex that collecting exponentially more data is not an option. Only then is it reasonable to assume that the problem can be solved by brute-force computation.

Learning from demonstration, imitation learning, inverse reinforcement learning, and interaction through natural-language instructions may provide ways to quickly bring the starting policy to a point where some learning signal exists, or to shape the search space. For example, even if the environment gives a reward of 0, the observations can still feed inductive biases into the model's planning module.

Overall, on trainability: supervised learning is easy; unsupervised learning is still difficult, but we are making progress; reinforcement learning is in very bad shape.

**Generalization: The Core of Machine Learning**

Generalization is the most profound of the three issues and the core of machine learning itself. Simply put, it is how well a model trained on a training dataset performs on a test dataset. There are two scenarios: 1) the training and test data come from the same distribution (we only need to learn that distribution from the training data), or 2) the training and test data come from different distributions (we must extrapolate from the training distribution to the test distribution). I will call these two cases "weak generalization" and "strong generalization" below.
These two cases are also sometimes called "interpolation vs. extrapolation," or "robustness vs. understanding."

**Weak Generalization: How well does the model handle small perturbations of the data distribution?**

In "weak generalization," we usually assume that the training and test data are sampled from the same distribution. In the real world, though, there is almost always some difference between the training and test distributions, even with large sample sizes. The difference may come from sensor noise, changes in lighting, gradual wear and tear of objects, or adversarial attacks. Since adversarial perturbations are almost imperceptible to human vision, we can still regard adversarial examples as "drawn from the same distribution." It is therefore useful, in practice, to think of "weak generalization" as evaluating the model on a "perturbed" version of the training distribution.

Perturbing the test distribution also perturbs the optimization landscape (lower is better). The trouble is that we do not know ahead of time what the test-time perturbation will be. If we optimize too aggressively on the training objective—settling into a sharp global minimum of the training landscape—we obtain a model that sits in a sharp, suboptimal local minimum of the perturbed test landscape. In that case we have overfit to the training distribution (or the training sample) and fail to generalize to the perturbed test distribution.

"Regularization" is whatever we do to prevent overfitting. Since we have no prior information about the test-time perturbation, the best we can usually do is inject random perturbations during training and hope that they cover whatever the test distribution will throw at us. Stochastic gradient descent, dropout, weight noise, activation noise, and data augmentation are regularization techniques commonly used in deep learning; in reinforcement learning, randomizing the simulation parameters makes training more robust. In his ICLR 2017 talk, Chiyuan Zhang characterized regularization as "anything that makes training harder" (as opposed to the traditional view of "restricting model capacity"). Essentially, making things harder to optimize can improve generalization.

This is rather disturbing: our approach to "generalization" is crude, amounting to an "optimizer lobotomy." We basically throw obstacles at the optimizer and hope its difficulties with the training process happen to prevent overfitting. Worse, improving the trainability of a model can cost you generalization! Looking at (weak) generalization this way certainly complicates research on trainability. But if better optimizers overfit more easily, how do we explain why some optimizers reduce both training and test error? In reality, any combination of optimizer and objective strikes a balance between 1) finding a better region of model space and 2) overfitting to a specific solution, and we have no good way of controlling that balance.

The most challenging test of weak generalization is probably the adversarial attack, in which the perturbation comes from an adversary that applies the worst-case disturbance to a data point so that your model performs as badly as possible. We still do not have deep learning methods that handle adversarial attacks well, but my instinct is that they will eventually be solved [5]. A minimal sketch of how such a worst-case perturbation can be computed for a simple model appears below.
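To make the "worst-case perturbation" concrete, here is a minimal sketch (my own toy example, using a logistic-regression model with invented weights rather than a deep network) of a fast-gradient-sign style attack: the input is nudged in the direction that most increases the loss, under a small L-infinity budget.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "trained" logistic-regression model: w, b are assumed already fit.
d = 50
w, b = rng.normal(size=d), 0.0
x = rng.normal(size=d)
y = 1.0                                  # true label of this example

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x):
    p = sigmoid(x @ w + b)
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Gradient of the loss with respect to the *input* (not the weights):
# for logistic regression, dL/dx = (p - y) * w.
grad_x = (sigmoid(x @ w + b) - y) * w

eps = 0.1                                # L-infinity perturbation budget
x_adv = x + eps * np.sign(grad_x)        # fast gradient sign step

print("clean loss:      ", float(loss(x)))
print("adversarial loss:", float(loss(x_adv)))
```

The same one-step recipe applied to a deep network (with the gradient taken by backpropagation) is what produces the imperceptible perturbations discussed above.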
On the theoretical side, some researchers are using information theory to argue that during training, neural networks transition from "memorizing" data to "compressing" it. This line of work is still emerging, and there is ongoing debate about whether the theory holds up, but the intuition connecting "memorization" and "compression" is compelling, so it deserves attention.

**Strong Generalization: The Natural Manifold**

In a test of "strong generalization," the test data are drawn from a completely different distribution than the training data, but one produced by the same underlying manifold (or generating process). The space of observable data in the world can be described as a very high-dimensional, ever-changing "natural manifold": vast, but highly structured. For example, all the data we observe obeys gravity, and objects do not appear out of nowhere. Strong generalization asks how much of this "super manifold" is captured by a model trained only on a small sample of it. Note that an image classifier does not need to discover Maxwell's equations; it only needs an understanding of reality that is consistent with the data on the manifold.

Modern classification models trained on ImageNet do reasonably well at strong generalization: a model trained on ImageNet genuinely understands something about edges, contours, and objects, which is why ImageNet pre-training is so popular. There is still a lot of room for improvement, though: classifiers trained on ImageNet are not universally applicable, few-shot learning remains unsolved, and the models are still vulnerable to adversarial attacks. Clearly our models do not yet understand what they are looking at, but it is a start.

As with weak generalization, the test distribution can be sampled adversarially so that it differs from the training distribution as much as possible. AlphaGo Zero is my favorite example: at test time it observes data from human players that is completely unlike its training distribution (it had never "seen" a human before), and those humans are using all of their intelligence to push AlphaGo into regions of the game it never encountered during training. Although AlphaGo does not explicitly understand abstract mathematics, opponent psychology, or what the color green means, it clearly understands enough about the world—within its narrow domain—to outplay human players. If an AI system remains robust against a skilled human adversary, I would say it has strong generalization ability.

It is a pity that reinforcement learning research largely ignores strong generalization. Most benchmarks are static environments with little perceptual richness (for example, the humanoid robot does not understand the world around it or what its own body looks like, beyond a few joint positions tied to the reward). I truly believe that solving generalization is the key to solving the trainability of reinforcement learning: the more a learning system "knows" about the world, the better it is at extracting learning signal, and perhaps the fewer samples it needs. That is why few-shot learning, imitation learning, and learning-to-learn matter: they move us away from brute-force solutions with huge variance and little information per sample.
I believe that to achieve stronger generalization, two things are needed.

First, we need models that actively deduce the basic laws of the world from observation and experimentation. Symbolic reasoning and causal inference seem like maturing research topics, but any kind of unsupervised learning might help. This reminds me of how humans came to understand the motion of celestial bodies by using a logical inference system (mathematics) to derive the laws of the universe. Interestingly, before the Copernican revolution humans probably relied on Bayesian heuristics ("superstitions"), and once classical mechanics was discovered those "Bayesian" models were abandoned. Our model-based machine learning approaches (which try to "predict" aspects of the environment) are still in a "pre-Copernican" era: they merely interpolate based on shallow statistical superstition, rather than proposing deep, general principles that can explain and extrapolate to data millions of light years away or many time steps into the future. Note that humans did not need a firm grasp of probability theory to derive deterministic celestial mechanics, which raises the question of whether machine learning and causal inference are possible without an explicitly statistical framework.

One way to dramatically reduce this complexity is to make our learning systems more adaptive. We need to go beyond optimizing models that predict or act in a static way, and instead optimize models that can think, remember, and learn in real time.

Second, we need to throw enough diverse data at the problem that models are pushed to develop abstract representations. Only when the environment is rich enough can the right representations emerge (although AlphaGo Zero raises the question of how much of the natural manifold an agent really needs to experience). Without such richness the problem is under-determined, and we would only find the right solution by chance. Perhaps the Trisolaran civilization (see the novel "The Three-Body Problem") evolved such advanced technology precisely because its survival depended on a physical understanding of complex celestial mechanics. Maybe we should introduce some celestial mechanics into our MuJoCo and Bullet environments :)

**Notes**

[1] Some research directions do not fit neatly into the framework of expressivity, trainability, and generalization. For example, interpretability research tries to understand why a model gives a particular answer. ML practitioners and policy-makers in high-stakes domains (such as medicine or law enforcement) need this, and it can also shed light on generalization: if we find that a model reaches its diagnoses through a process very different from the one a human medical professional would use, it may mean the model's reasoning will break down on edge cases and fail to generalize. Determining whether the model has learned the right thing can be more important than reducing test error! Differential privacy is another kind of constraint on ML models, but it is beyond the scope of this article.

[2] A simple explanation: a fully connected layer of size N, followed by a ReLU nonlinearity, can cut a vector space into N piecewise-linear pieces. A second ReLU layer subdivides the space further, producing on the order of N^2 piecewise-linear regions in the input space, and three layers give N^3. See Raghu et al. 2017 for a detailed analysis; a small empirical sketch follows.
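Here is a small empirical check of the claim in note [2] (my own sketch, not from the cited paper; network widths, weight scales, and the sampling scheme are all invented assumptions): sample points along a line through input space and count the distinct ReLU activation patterns, each of which corresponds to one linear region crossed. Deeper random networks typically cross many more regions along the same line.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_regions_along_line(widths, d_in=16, n_points=20000):
    """Count distinct ReLU activation patterns along a random 1-D line in input space."""
    # Random weights for a fully connected ReLU network with the given layer widths.
    layers = []
    prev = d_in
    for w in widths:
        layers.append((rng.normal(size=(prev, w)) / np.sqrt(prev),
                       rng.normal(size=w) * 0.1))
        prev = w
    # Points along a line x(t) = a + t * b.
    a, b = rng.normal(size=d_in), rng.normal(size=d_in)
    t = np.linspace(-10, 10, n_points)
    x = a + t[:, None] * b
    patterns = []
    for W, c in layers:
        pre = x @ W + c
        patterns.append(pre > 0)          # activation pattern of this layer
        x = np.maximum(pre, 0)            # ReLU
    codes = np.concatenate(patterns, axis=1)
    return len(np.unique(codes, axis=0))  # distinct patterns = regions crossed

for widths in [(64,), (64, 64), (64, 64, 64)]:
    print(widths, "->", count_regions_along_line(widths), "regions along one line")
```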
[3] This is sometimes called a multi-level optimization problem, but that term implies distinct "outer" and "inner" optimization loops, whereas co-adaptation can happen simultaneously—for example, parallel processes on a single machine communicating asynchronously, or species continuously co-evolving in an ecosystem. In those cases there is no clean separation into "outer" and "inner" loops.

[4] seq2seq with attention achieved state-of-the-art results when it was introduced, but I suspect its strength lies in trainability rather than expressivity or generalization; a seq2seq model without attention might do just as well given proper initialization.

[5] Here is one way to combat adversarial attacks. It does not solve strong generalization, but it makes computing an adversarial perturbation extremely expensive. The ensemble of models and the data are kept as a black box: each time the model is called at inference time, one model is randomly selected from a set of trained models and presented to the adversary without telling them which one they got. The models are trained independently of one another and can even use different architectures. This makes finite-difference gradients hard to estimate, because f(x + dx) - f(x) can have arbitrarily high variance, and gradients estimated across successive queries remain noisy because a different pair of models may be sampled each time. A further improvement is to use multimodal data (video, multi-view images, image plus sound), which makes it hard to perturb the input while keeping the modalities consistent. A minimal sketch of the randomized-ensemble idea appears below.
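As a sketch of the defense described in note [5] (assuming a list of independently trained models behind a shared prediction interface; the toy linear "models" and class name `RandomizedEnsemble` are invented here for illustration): every inference call is answered by a randomly chosen ensemble member, so an attacker probing the system with finite differences sees high-variance responses.

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomizedEnsemble:
    """At every call, answer with a randomly chosen member of the ensemble."""
    def __init__(self, models):
        self.models = models

    def predict(self, x):
        model = self.models[rng.integers(len(self.models))]
        return model(x)

# Toy "models": independently perturbed linear scorers that roughly agree on
# clean inputs but differ in their exact decision surfaces.
d = 30
w_shared = rng.normal(size=d)
models = [
    (lambda w: (lambda x: float(x @ w)))(w_shared + 0.3 * rng.normal(size=d))
    for _ in range(5)
]
ensemble = RandomizedEnsemble(models)

# An attacker estimating a finite-difference gradient gets a noisy signal,
# because f(x + dx) and f(x) may come from different models.
x = rng.normal(size=d)
dx = 1e-3 * np.eye(d)[0]
estimates = [(ensemble.predict(x + dx) - ensemble.predict(x)) / 1e-3 for _ in range(10)]
print("finite-difference estimates of one coordinate:", np.round(estimates, 2))
```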
