As Google’s batch sizes for AI training continue to skyrocket, with some ranging from over 100,000 to one million, the company’s research arm is looking for ways to improve everything from efficiency to scalability and even privacy for those whose data is used in large-scale training runs.
This week, Google Research published a number of papers on new issues emerging at the scale of “mega-batch” training for some of its most-used models.
One of the most notable developments in large-scale training is batch active learning at batch sizes of up to a million. Essentially, this reduces the amount of labeled training data needed (and hence compute and time) by automating the choice of which examples get labeled, which is great for efficiency but has drawbacks in terms of flexibility and accuracy.
Google Research has developed its own batch active learning algorithm, called Cluster-Margin, which they claim can operate at batch sizes “orders of magnitude” larger than other approaches to active learning. Using the Open Images dataset, with ten million images and sixty million labels across 20,000 classes, they found that Cluster-Margin needed only 40% of the labels to reach the same targets.
In active learning, the labels of training examples are sampled selectively and adaptively to train the desired model more efficiently over multiple iterations. “The adaptive nature of active learning algorithms, which improves data efficiency, comes at the cost of frequent retraining of the model and calls to the labeling oracle. Both of these costs can be significant. For example, many modern deep networks can take days or weeks to train and require hundreds of CPU/GPU hours. At the same time, training human labelers to become proficient in potentially nuanced labeling tasks requires a significant investment from both the designers of the labeling task and the raters themselves. A sufficiently large set of queries must be queued to justify these costs,” explain the creators of Cluster-Margin.
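To make the selection step concrete, here is a minimal sketch of plain margin sampling, a common active learning heuristic that sends the examples the model is least sure about to the labelers. It is not Google’s Cluster-Margin algorithm, which, as the name suggests, also involves clustering that this sketch leaves out. The function names and the toy probabilities are hypothetical.

```python
def margin(probs):
    """Gap between the two highest class probabilities (small = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_batch(unlabeled_probs, batch_size):
    """Return indices of the `batch_size` examples with the smallest margins."""
    ranked = sorted(
        range(len(unlabeled_probs)),
        key=lambda i: margin(unlabeled_probs[i]),
    )
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabeled examples over three classes.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.50, 0.49, 0.01],  # most uncertain: top two classes nearly tied
    [0.70, 0.20, 0.10],  # fairly confident
]

# Indices of the two examples that would be queued for human labeling.
to_label = select_batch(pool, batch_size=2)
```

In a full loop, the selected examples would be labeled, added to the training set, and the model retrained before the next round, which is where the retraining and labeling costs described in the quote come from.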
The gain in efficiency, especially at this scale, is not hard to imagine, but as Google pushes training to ever larger scales, there are other, less tangible issues to address, especially when massive batches mean drawing on (possibly personal) data for training.
Scaling the giant language model BERT with huge batch sizes has been its own growing challenge for Google and the few others operating at batch sizes of over a million. Now, the push is to maintain efficient scalability while adding privacy measures that don’t hamper performance, scalability, or efficiency.
This week, another Google Research team showed they can scale BERT to batch sizes in the millions with an added layer of privacy, called differentially private SGD (DP-SGD), which is a heavyweight step during pre-training. Adding this layer sacrifices some accuracy: the masked language model accuracy of this BERT implementation is 60.5% at a batch size of two million, while the non-private BERT models used by Google achieve around 70%. The team adds that the batch size used for these results is 32 times larger than that of the non-private BERT model.
As the creators of the algorithm explain, “To mitigate these [privacy] concerns, the framework and properties of differential privacy (DP) [DMNS06, DKM+06] provide a compelling approach to rigorously control and prevent the leakage of sensitive user information present in the training dataset. Basically, DP ensures that the output distribution of a (randomized) algorithm does not change significantly if a single training example is added or removed; this change is parameterized by two numbers (ε, δ): the smaller these values, the more private the algorithm.”
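For readers who want to see the mechanics, below is a rough sketch of a single differentially private SGD update in its textbook form: clip each example’s gradient, sum, add Gaussian noise, then average. It is a generic illustration under assumed names and toy numbers, not the Google team’s implementation, and it does not compute the resulting privacy parameters.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy gradient step: clip each example, sum, add noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise is calibrated to the clipping bound, not to the batch size.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


# Hypothetical two-parameter model and three per-example gradients.
rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
new_params = dp_sgd_step(
    params, grads, lr=0.1, max_norm=1.0, noise_multiplier=1.1, rng=rng
)
```

Because the noise added to the summed gradient does not grow with the number of examples, its share of the averaged gradient shrinks as the batch gets larger, which is one intuition for why very large batches are attractive in this setting.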
Accuracy and privacy sit alongside other concerns in large-scale training at Google Research. Larger models and more massive batch sizes make it increasingly difficult to keep results consistent and to avoid underfitting or overfitting. Google is working on new calibration techniques that can keep up with the scale of growing training runs. Another Google Research team this week published results on soft calibration techniques that reduce the calibration error of existing approaches by 82%.
The team explains that a comparison of soft calibration objectives, used as secondary losses, against existing calibration-incentivizing losses reveals that “calibration-sensitive training objectives as a whole (not always the ones that we propose) give better estimates of uncertainty than the standard cross-entropy loss coupled with temperature scaling.” They also show that composite losses yield state-of-the-art single-model expected calibration error (ECE) in exchange for a reduction of less than 1% in accuracy on CIFAR-10, CIFAR-100 and ImageNet, the datasets used as benchmarks.
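As a point of reference for the metric involved, here is a minimal sketch of the standard binned estimate of expected calibration error: predictions are grouped by confidence, and the gap between average confidence and actual accuracy in each group is averaged, weighted by group size. This is the conventional hard-binned version, not the differentiable soft objectives the team proposes, and the function name and sample values are made up for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: top-class confidence and whether it was right.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

A lower value means the model’s stated confidence tracks how often it is actually right; a perfectly calibrated model would score zero.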
In the past, the sheer scalability of models was at the heart of what we saw coming out of Google Research on the training front. What we’re seeing more recently, including in the last few days, is evidence that scaling the model itself is giving way to more nuanced elements of training at scale, ranging from improving results to adding privacy. This suggests the models themselves have been shown to scale to batch sizes of over a million, leaving room for work on more efficient neural networks.