OpenGPT-2: We Replicated GPT-2 Because You Can Too

Vanya Cohen
7 min read · Aug 22, 2019

Aaron Gokaslan*, Vanya Cohen*, Ellie Pavlick, Stefanie Tellex | Brown University

Introduction

Recently, large language models like BERT¹, XLNet², GPT-2³, and Grover⁴ have demonstrated impressive results in generating text and on multiple NLP tasks. Since OpenAI has not released its largest model at this time (though it has released its 774M-parameter model), we seek to replicate the 1.5B-parameter model to allow others to build on our pretrained model and further improve it.

You can access the model and generate text using our Google Colab.

We’ve also made the model weights available separately.

Replication

Radford et al.’s³ security strategy of delaying the release of the model relies on these models being difficult to replicate and requiring a high degree of specialized domain knowledge. We demonstrate that many of the results of the paper can be replicated by two master’s students with no prior experience in language modeling. Because this model is relatively easy to replicate, a large number of interested parties could replicate GPT-2. Further, Zellers et al.⁴ show that large language models like GPT-2 are an invaluable tool for countering the use of the same models as text generators.

Because our replication efforts are not unique, and large language models are currently the most effective means of countering generated text, we believe releasing our model is a reasonable first step towards countering the potential future abuse of these kinds of models.

We base our implementation on the Grover model⁴ and modify their codebase to match the language-modeling training objective of GPT-2. Since their model was trained on a similarly large corpus, much of the code and hyperparameters proved readily reusable. We did not substantially change the hyperparameters from Grover.
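To make the change concrete, here is a minimal sketch of the autoregressive language-modeling objective GPT-2 trains with: each position predicts the next token, and the model minimizes token-level cross-entropy. The function and tensor names are illustrative, not the actual Grover or GPT-2 code.

```python
import tensorflow as tf

def lm_loss(logits, tokens, pad_id=0):
    """Illustrative GPT-2-style language-modeling loss.

    logits: [batch, seq_len, vocab] model outputs.
    tokens: [batch, seq_len] integer token ids.
    """
    # Position i predicts token i + 1, so shift logits and targets by one.
    shifted_logits = logits[:, :-1, :]
    targets = tokens[:, 1:]
    # Mask out padding so it does not contribute to the loss.
    mask = tf.cast(tf.not_equal(targets, pad_id), tf.float32)
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=shifted_logits)
    return tf.reduce_sum(ce * mask) / tf.reduce_sum(mask)
```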

The cost of training the model from scratch using our code is about $50k. It’s important to note that this figure is the estimated value of the cloud compute and does not reflect the much smaller intrinsic cost: training costs less on compute resources that are less time-efficient and less user-friendly.

There is a significant time-cost tradeoff, and slower training methods have considerably smaller costs, thus reducing the barrier to entry.

Dataset

The original paper provided minimal details on how the dataset was cleaned.

As in WebText³, we begin by parsing out all links from Reddit with more than 3 up-votes. We started with the Pushshift Reddit scrape⁵, a dataset containing a continuously updated collection of Reddit posts, comments, and related metadata. These links are then filtered to remove direct links to file types unlikely to contain usable text or HTML (e.g., video files, PDFs, and CSS style files).
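As an illustration of this filtering step (a sketch only, with hypothetical field names for the Pushshift records, not our exact pipeline):

```python
from urllib.parse import urlparse

# Extensions unlikely to contain usable text or HTML.
EXCLUDED_EXTENSIONS = {".mp4", ".avi", ".mkv", ".webm", ".pdf", ".css",
                       ".js", ".png", ".jpg", ".jpeg", ".gif", ".zip"}

def keep_link(submission, min_upvotes=3):
    """Keep Reddit submissions with more than `min_upvotes` up-votes whose
    linked file type is plausibly text or HTML."""
    if submission.get("score", 0) <= min_upvotes:
        return False
    path = urlparse(submission.get("url", "")).path.lower()
    return not any(path.endswith(ext) for ext in EXCLUDED_EXTENSIONS)
```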

We also filter webpages to remove Wikipedia, as it is used by various evaluation benchmarks and datasets. We were not able to determine whether our filtering criteria matched OpenAI’s, since this information was never released. Text was extracted from HTML pages using the Newspaper Python library, then filtered to keep only English text using the fastText library⁶; specifically, we use the WhatTheLang Python wrapper⁷. We deduplicate documents using locality-sensitive hashing (LSH)⁸ ⁹ ¹⁰: we hashed the documents into sets of 5-grams and removed all documents whose similarity exceeded a threshold of 0.5.
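A minimal sketch of that deduplication step, using MinHash LSH over word 5-grams with a 0.5 similarity threshold (the datasketch library shown here is an assumption for illustration, not necessarily the implementation we used):

```python
from datasketch import MinHash, MinHashLSH

def five_gram_minhash(text, num_perm=128):
    # Shingle the document into word 5-grams and hash them into a MinHash signature.
    tokens = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(documents, threshold=0.5, num_perm=128):
    # documents: dict mapping document id -> raw text.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = {}
    for doc_id, text in documents.items():
        signature = five_gram_minhash(text, num_perm)
        if lsh.query(signature):  # a near-duplicate is already kept, so drop this one
            continue
        lsh.insert(doc_id, signature)
        kept[doc_id] = text
    return kept
```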

As a cleaning heuristic, documents with fewer than 128 tokens were removed from the dataset. These shorter documents tended to be lower quality, as determined by text coherence. We release this dataset as the OpenWebTextCorpus¹¹.

For encoding the dataset, we used the byte-pair encoder (BPE)¹² released with the small models from Radford et al.³
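For example, the released encoding can be applied roughly as follows (shown here with the Hugging Face GPT2Tokenizer, which wraps the same released BPE vocabulary; this is an illustration rather than our exact code path):

```python
from transformers import GPT2Tokenizer

# Load the GPT-2 byte-pair encoding vocabulary.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Recycling is good for the world.")
print(ids)                    # BPE token ids for the document
print(tokenizer.decode(ids))  # decodes back to the original text
```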

We used a modified version of the OpenWebText web-scraping codebase¹³ as a starting point for our dataset collection.

Errata

From the publicly released collection of 260k documents from WebText³, we find that all documents have a minimum byte-pair encoding (BPE)¹² length of 40 and a maximum of 1024. OpenWebText differs in that we set a lower bound on document length of 128 tokens (rather than BPE codes) and do not restrict the maximum document length. Our OpenWebTextCorpus was released before these samples became available and therefore did not use this information when designing cleaning heuristics.
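The difference between the two length heuristics, with hypothetical helper names:

```python
def passes_webtext_length_filter(bpe_ids):
    # Bounds observed in the released WebText samples (measured in BPE codes).
    return 40 <= len(bpe_ids) <= 1024

def passes_openwebtext_length_filter(tokens):
    # Our heuristic: at least 128 tokens, no upper bound.
    return len(tokens) >= 128
```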

We made multiple attempts to contact Radford et al.³ to clarify evaluation and model details, but were ultimately unsuccessful.

Results

Despite the differences in our training distribution, we report similar perplexities across most datasets.
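As a reminder, perplexity here is the standard exponentiated average per-token negative log-likelihood, e.g.:

```python
import math

def perplexity(total_neg_log_likelihood, num_tokens):
    # Exponentiated average per-token negative log-likelihood (natural log).
    return math.exp(total_neg_log_likelihood / num_tokens)
```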

Samples

Prompt: “Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!”

Output:

Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!! That´s because recycling is all about selling to customers who cannot afford to recycle.” If you don’t have any interest in recycling, in whatever form, then you could well be doing your own global environmental benefit if you just got rid of it. Let me explain. Recycling takes time, money and energy, and it doesn’t really solve the problem. If you go to recycling drop-off shops and they just sort out your recyclables for you then it means there is no need to sell them again to the local recycling company; they just take it back from you and you can just dump it. There is nothing clever about this approach, in fact it just shuffles the problem around instead of solving it. If you put the recycling drop-off shops under your local council then you could actually find more recycling opportunities for local communities, as well as more time for recycling. Recycling does have some short term upsides, but only in cases where you don’t have an immediate need for the kind of plastic you are recycling; then at least the glass, paper, steel and plastic products are free to dispose of. But this is the case only in the short term. In the long term, in the age of climate change, you will need to take more care of what you do with your plastics or at least put it into our recycling system. I’m sure there are lots of companies out there that could invest money and effort in dealing with the issues of waste plastics. Instead of just shredding them, or leaving them in the ground, companies could turn them into good quality recycling products, or turn them into energy. There are other options as well, such as taking the material and using it to make a tree pulp or textiles. However, it doesn’t have to cost anything to divert waste plastics and plastics derived from energy, to something that actually has a long-term role in the world; that´s the whole idea. We are wasting tonnes of energy because we don’t recycle our waste plastics, and if we did then we could save that energy for our own needs. This is a far greater percentage of the energy budget of the world than just for producing energy, but it is still a lot. And the fuel used to make the plastics is costing more and more: Another short term gain from recycling comes when you remove disposable plastics from the landfills and recycling centres; those plastic bags, bottles, filters, rolls and clothes go straight into the world’s oceans, which are in big trouble right now, as poor diets contribute to pollution in the sea, sea life are starving and thousands of species are on the brink of extinction. Many companies are developing plastic “gifts” for the children of developing countries, but much more needs to be done. Companies can do a lot when they commit to the recycling of plastic goods. So, at this point, recycling is all about consumer habits. That´s really all it is. If people choose to separate and segregate their waste plastics, then recycling will definitely have more of an impact than if they don’t. Join the world’s largest recycling network: RecycleAmerica , featuring the largest collection of energy recovery and energy bond companies, designers, and manufacturers in the US. Don’t believe me? Find out more. Contribute to the global efforts to make plastics more sustan

Citations

*. Equal contribution.

  1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  2. Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  3. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  4. Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. arXiv preprint arXiv:1905.12616, 2019.
  5. Jason Baumgartner. Reddit posts dataset.
  6. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017.
  7. WhatTheLang. https://github.com/indix/whatthelang, 2019.
  8. Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, pages 271–280. ACM, 2007.
  9. Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150. ACM, 2007.
  10. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
  11. Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  12. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  13. OpenWebText. https://github.com/eukaryote31/openwebtext, 2019.

We would like to thank Google (TensorFlow Research Cloud) for providing the compute for this and related projects (and state that their providing compute is in no way an endorsement of our views).
