The new era of foundation models, summarized
AI research is advancing faster than expected thanks to the outstanding performance of “foundation models” — large neural networks (billions to trillions of parameters) trained on vast amounts of data, most notably autoregressive language models with transformer architectures (GPT-3, PaLM), as well as vision-language models (CLIP) and multimodal generative models (DALL-E 2, Imagen) that combine transformer and convolutional components. What led to this race to train bigger and bigger models? Beyond improvements in hardware (including a roughly 10x increase in GPU throughput over the past decade) and data availability, it was the gradual realization in the ML research community that, in the long run, throwing more compute at a problem works better than anything else (better architectures, better algorithms, incorporating domain knowledge, etc.). More concretely, as described in the “Scaling Laws” paper, test loss falls predictably — following a power law — as model size, dataset size, and compute increase, and that trend holds across many orders of magnitude.
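As a rough illustration, that power law can be written down in a few lines. The constants below are approximately the ones reported by Kaplan et al. and are meant only to show the shape of the curve, not to make precise predictions:

```python
# A minimal sketch of the parameter-count power law from "Scaling Laws for
# Neural Language Models" (Kaplan et al., 2020). Constants are approximate
# and purely illustrative.

def loss_from_params(n_params: float,
                     n_c: float = 8.8e13,    # critical parameter count (approx.)
                     alpha_n: float = 0.076  # power-law exponent (approx.)
                     ) -> float:
    """Predicted test loss (nats/token) for a model with n_params parameters,
    assuming data and compute are not the bottleneck."""
    return (n_c / n_params) ** alpha_n

# Each 10x increase in parameters shaves off a predictable slice of loss:
for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss ≈ {loss_from_params(n):.2f}")
```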
Research advancements in machine learning in the 2010s primarily involved coming up with new and better architectures and training protocols. Landmark papers like AlexNet, ResNet, and StyleGAN gradually refined inductive biases in model design without drastically changing the number of parameters. We’re now entering a new era where, if anything, architectures are getting simpler and most of the effort goes into better engineering, e.g. improving the efficiency of training models across thousands of GPUs. We’re also seeing a consolidation of the models that are used in production; here’s a recent example from Apple of the system powering computer-vision-based features on iOS and macOS, where dedicated per-task vision models were replaced by a unified architecture using a CLIP-based backbone. Hardware too is being increasingly optimized for specific architectures (the Transformer Engine in NVIDIA H100 GPUs); it’s conceivable that we’ll even see processors specialized for particular model weights in the near future, co-locating processing units and memory to minimize data transfer costs.
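To make the consolidation pattern concrete, here is a minimal sketch of one shared backbone feeding several lightweight task heads. The module and task names are hypothetical and do not reflect Apple’s actual implementation; any pretrained image encoder could play the role of the CLIP-style backbone:

```python
# Illustrative "one shared backbone, many lightweight heads" pattern.
import torch
import torch.nn as nn

class SharedBackboneModel(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: dict[str, int]):
        super().__init__()
        self.backbone = backbone              # e.g. a pretrained CLIP-style image encoder
        for p in self.backbone.parameters():  # keep the expensive part frozen
            p.requires_grad = False
        # One small head per task, all consuming the same embedding.
        self.heads = nn.ModuleDict({
            task: nn.Linear(embed_dim, n) for task, n in num_classes.items()
        })

    def forward(self, images: torch.Tensor) -> dict[str, torch.Tensor]:
        features = self.backbone(images)      # computed once per image
        return {task: head(features) for task, head in self.heads.items()}

# Hypothetical usage: three features share a single backbone forward pass.
# model = SharedBackboneModel(clip_image_encoder, embed_dim=768,
#                             num_classes={"scenes": 10, "pets": 2, "landmarks": 500})
```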
Despite the impressive performance of foundation models, their output can be unpredictable and difficult to control. For that reason, in the coming years, they’ll be primarily used to build human-in-the-loop systems where models act as “generators” and humans act as “critics,” rather than opaque APIs that produce one result and don’t afford further iteration. A prominent example here is GitHub Copilot, which uses a large language model to autocomplete code. Copilot is by no means perfect, often producing incorrect code or completely misunderstanding the problem. But users are free to accept or disregard the model’s suggestions, and to experiment with different ways of providing context to the model. Looking ahead, there is ongoing work on making model outputs more helpful and contextually relevant — through carefully constructed prompts and techniques like RLHF and Context Distillation — as well as on getting models to tell us when they don’t know how to answer a question.
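The generator/critic loop can be sketched in a few lines. `suggest_completions` below is a hypothetical placeholder for any code-completion model, not Copilot’s actual API:

```python
# A minimal sketch of the "model as generator, human as critic" loop.
def suggest_completions(prompt: str, n: int = 3) -> list[str]:
    """Placeholder: query a large language model for candidate completions."""
    return [f"# candidate completion {i}" for i in range(n)]

def human_in_the_loop(context: str) -> str:
    """Show candidates; let the user accept one or refine the context and retry."""
    while True:
        candidates = suggest_completions(context)
        for i, c in enumerate(candidates):
            print(f"[{i}] {c}")
        choice = input("Accept a candidate (number), or type extra context: ")
        if choice.isdigit() and int(choice) < len(candidates):
            return candidates[int(choice)]   # human accepts the suggestion
        context += "\n# " + choice           # human steers with more context
```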
On the one hand, following the progress in foundation models can make the field of ML feel increasingly alienating. These models require millions of dollars to train, and running them involves a high degree of operational complexity (the OPT training logbook provides a glimpse into how error-prone the process can be). Big advancements are coming more and more from large multidisciplinary teams rather than small teams of brilliant researchers. Fewer and fewer state-of-the-art models are being open-sourced; instead, they only become publicly available via closed APIs. On the other hand, there’s still so much to understand and discover about how to work with foundation models. The art of prompt engineering, i.e. figuring out ways of providing context to models at inference time, is being invented at a rapid pace as we speak. Unlike traditional machine learning engineering — which generally requires expensive trial and error, large amounts of data, and deep domain knowledge — prompt engineering allows cheap experimentation and rewards creativity, divergent thinking, and an intuitive understanding of language, cognition, and media. Because they use natural language as input, foundation models are remarkably easy to get started with, even with minimal technical knowledge. Some of the biggest findings on how to prompt these models are discovered by hobbyists years before they’re published in mainstream research. Foundation models contain a vast universe of knowledge and identities within them, and this is as good a time as ever to explore them.
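As a tiny example of the kind of experimentation prompt engineering enables, the same task can be steered with nothing more than a handful of in-context examples. The prompt below is illustrative, and `generate` stands in for any text-generation API:

```python
# An illustrative few-shot prompt: the model is steered purely by the
# examples placed in its context, with no training or fine-tuning involved.
few_shot_prompt = """Classify the sentiment of each review.

Review: "The battery died after two days."
Sentiment: negative

Review: "Setup took thirty seconds and it just works."
Sentiment: positive

Review: "Decent screen, but the speakers are tinny."
Sentiment:"""

# completion = generate(few_shot_prompt)  # hypothetical call to a language model
```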