While it is exhilarating to see AI researchers pushing the performance of cutting-edge models to new heights, the cost of training those models is also rising at a dizzying rate.

Synced recently reported on XLNet, a new language model developed by CMU and Google Research that outperforms the previous SOTA model, BERT (Bidirectional Encoder Representations from Transformers), on 20 language tasks including SQuAD, GLUE, and RACE, and achieves SOTA results on 18 of them.

What may surprise many is the staggering cost of training an XLNet model. A recent tweet from Elliot Turner — the serial entrepreneur and AI expert who is now the CEO and Co-Founder of Hologram AI — has prompted heated discussion on social media. Turner wrote “it costs $245,000 to train the XLNet model (the one that’s beating BERT on NLP tasks).” His calculation is based on a resource breakdown provided in the paper: “We train XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimizer, linear learning rate decay and a batch size of 2048, which takes about 2.5 days.”

Reaction from researchers and academics included this comment on Reddit: “I think I’d just cry if I had to try and convince my boss to spend 250k on AWS for a single model that may or may not perform as well as needed.”

Synced has found, however, that Turner’s math may be off. A Cloud TPU v3 device, which costs US$8 per hour on Google Cloud Platform, contains four independent chips. Since the paper’s authors specified “TPU v3 chips”, the calculation should be 512 (chips) * US$8/4 (per chip-hour) * 24 (hours) * 2.5 (days) = US$61,440. Google researcher James Bradbury made the same point on Twitter: “512 TPU chips is 128 TPU devices, or $61,440 for 2.5 days. The authors could also have meant 512 cores, which is 64 devices or $30,720.”
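The competing estimates differ only in how the "512" in the paper is read. The short Python sketch below reruns the arithmetic under each reading, using the US$8 per device-hour price and the chip and core counts quoted above. Mapping Turner's figure to "512 full devices" is our inference from the numbers, not something his tweet states, and the function and variable names are illustrative only.

```python
# Prices and hardware counts as quoted in the article; names are illustrative.
DEVICE_PRICE_PER_HOUR = 8.0   # US$ per Cloud TPU v3 device-hour
CHIPS_PER_DEVICE = 4          # stated in the article
CORES_PER_DEVICE = 8          # implied by "512 cores, which is 64 devices"
TRAINING_HOURS = 2.5 * 24     # "about 2.5 days"


def training_cost(devices, hours=TRAINING_HOURS, price=DEVICE_PRICE_PER_HOUR):
    """Return the on-demand cost in US$ for a number of TPU devices."""
    return devices * hours * price


estimates = {
    # Assumption: reading 512 as whole devices lands near Turner's $245,000.
    "512 devices": training_cost(512),
    "512 chips": training_cost(512 / CHIPS_PER_DEVICE),
    "512 cores": training_cost(512 / CORES_PER_DEVICE),
}

for reading, cost in estimates.items():
    print(f"{reading}: US${cost:,.0f}")
```

Under these assumptions the three readings span a factor of eight, which is why the unit (device, chip, or core) matters more than any other term in the estimate.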

Even so, spending US$61,000 to train a single language model is pricey. Of course, since Google is leading the XLNet research, the company’s cloud division is unlikely to charge its own research team full price.

So why is it so expensive to train XLNet? For starters, the model is huge. From the paper: “Our largest model XLNet-Large has the same architecture hyperparameters as BERT-Large, which results in a similar model size.” XLNet-Large has 24 Transformer blocks, 1024 hidden units in each layer, and 16 attention heads. The researchers also collected a total of 32.89 billion subword pieces as pretraining data.

Synced took a look at cost estimates for training other large AI models:

University of Washington’s Grover-Mega — total training cost: US$25,000

Grover is a 1.5-billion-parameter neural net tailored for both the generation and detection of fake news. Grover can generate the rest of an article from any headline, and outperforms other fake news detectors when defending against Grover itself. It was developed by the University of Washington and the Allen Institute for Artificial Intelligence in May 2019 and recently open-sourced on GitHub.

Training the largest model, Grover-Mega, cost US$25,000 in total, according to the research paper: “training Grover-Mega is relatively inexpensive: at a cost of $0.30 per TPU v3 core-hour and two weeks of training.”
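The article does not say how many TPU cores Grover-Mega used, but the three quoted figures (total cost, price per core-hour, and duration) are enough to back out a rough number. The Python sketch below does that; it assumes the US$25,000 total is a rounded figure and that training ran continuously for exactly 14 days, so the result is an order-of-magnitude check rather than a reported fact.

```python
# Figures quoted in the article; the derived core count is an estimate only.
PRICE_PER_CORE_HOUR = 0.30   # US$ per TPU v3 core-hour
TRAINING_HOURS = 14 * 24     # "two weeks of training", assumed continuous
TOTAL_COST = 25_000          # US$, assumed to be rounded

core_hours = TOTAL_COST / PRICE_PER_CORE_HOUR
implied_cores = core_hours / TRAINING_HOURS

print(f"{core_hours:,.0f} core-hours, roughly {implied_cores:.0f} cores")
```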

Google BERT — estimated total training cost: US$6,912

Released last year by Google Research, BERT is a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks. Many language models today are built on top of the BERT architecture.

From the Google research paper: “training of BERT-Large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.” Assuming the training device was Cloud TPU v2, the total price of a single pretraining run should be 16 (devices) * 4 (days) * 24 (hours) * 4.5 (US$ per hour) = US$6,912. Google suggests that researchers with tight budgets could pretrain a smaller BERT-Base model on a single preemptible Cloud TPU v2, which takes about two weeks at a cost of about US$500.
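As a quick check, the BERT-Large estimate can be reproduced in a few lines of Python. The sketch below uses the device count, duration, and US$4.50 hourly price from the paragraph above. It also derives the hourly rate implied by the "two weeks for about US$500" BERT-Base figure; that rate is computed from the article's own numbers and is not a published price. The function name is illustrative.

```python
HOURS_PER_DAY = 24


def pretraining_cost(devices, days, price_per_device_hour):
    """Return the cost in US$ of running `devices` TPU devices for `days`."""
    return devices * days * HOURS_PER_DAY * price_per_device_hour


# BERT-Large: 16 Cloud TPU v2 devices for 4 days at US$4.50 per device-hour.
bert_large = pretraining_cost(devices=16, days=4, price_per_device_hour=4.5)
print(f"BERT-Large: US${bert_large:,.0f}")

# BERT-Base: hourly rate implied by ~US$500 over two weeks on one device.
implied_rate = 500 / (14 * HOURS_PER_DAY)
print(f"Implied preemptible rate: US${implied_rate:.2f} per hour")
```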

OpenAI GPT-2 — training cost US$256 per hour

GPT-2 is a large language model recently developed by OpenAI which can generate realistic paragraphs of text. Without any task-specific training data, the model still demonstrates compelling performance across a range of language tasks such as machine translation, question answering, reading comprehension and summarization.

The Register reports that the GPT-2 model was trained on 256 Google Cloud TPU v3 cores, at a cost of US$256 per hour. OpenAI didn’t specify the training duration.
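Because the training duration is unknown, only an hourly rate can be stated with any confidence. The Python sketch below first checks that rate against the TPU v3 figures quoted earlier in this article (eight cores per device, US$8 per device-hour), then shows how the total would scale. The durations in the loop are purely hypothetical placeholders and are not estimates of how long GPT-2 actually trained.

```python
# Hardware figures as quoted earlier in the article.
CORES = 256
CORES_PER_DEVICE = 8
DEVICE_PRICE_PER_HOUR = 8.0  # US$ per Cloud TPU v3 device-hour

hourly_cost = CORES / CORES_PER_DEVICE * DEVICE_PRICE_PER_HOUR
print(f"Hourly cost: US${hourly_cost:,.0f}")


def total_cost(hours):
    """Return the total cost in US$ for a given number of training hours."""
    return hourly_cost * hours


# Hypothetical durations, for illustration only; OpenAI gave no figure.
for label, hours in {"1 day": 24, "1 week": 7 * 24}.items():
    print(f"{label}: US${total_cost(hours):,.0f}")
```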

While the numbers may look scary, most machine learning models are nowhere near as demanding as these high-profile examples associated with tech giants. As Turing Award laureate Yoshua Bengio told Synced in a recent interview, “Some of the models are so big that even in MILA (Montreal Institute for Learning Algorithms) we can’t run them because we don’t have the infrastructure for that. Only a few companies can run these very big models they’re talking about.”

The cost of the compute used to train models is also expected to become significantly cheaper with the continuing advance of algorithms, computing devices, and engineering efforts. As a Reddit user commented: “Google’s cat neuron paper used days/tens of thousands of cores but now people are generating fake cats in real time. To take an example from progression of ImageNet models to 75% top-1, first DAWN benchmark submission cost $2k, then cost went down to $40 within couple of years.”

The paper XLNet: Generalized Autoregressive Pretraining for Language Understanding is on arXiv.