How to set up a GPT-2 server - ContentBot Blog

How to set up a GPT-2 server

Recently our team worked extensively on deploying a standalone GPT-2 server to allow ContentBot to offer long-form tools to content creators (this has since been replaced with OpenAI). This article covers the basic steps involved in setting up your own GPT-2 server, along with some of the things we learnt along the way.

Firstly, we needed to identify a model that would offer a solid foundation, along with the ability to finetune the system as and when necessary. We decided to use GPT-Neo, Hugging Face Transformers, and a virtual machine (including a GPU) on the Google Cloud Platform.

The rest of this article covers this process in a bit more detail. It is worth mentioning, however, that if we were to attempt this again, we would consider using the Hosted Inference API from Hugging Face instead of a virtual machine. For our needs at the time, a VM was the logical choice, but this may not be true for future projects.

Set up a virtual machine

To get started you will need a Google Cloud account to gain access to Compute Engine, where all your virtual machines will be managed.

Once your account has been created, navigate to Compute Engine > VM Instances and click on Create Instance to begin setting up your machine.

Note: You may be prompted to create a project, which acts as a container for all related Google Cloud resources. An added benefit is that you can add your team to this project, which can be helpful for larger projects.

We experimented with quite a few configurations for our VMs and settled on the following as a good balance between cost and performance:

  • Zone: us-central1-a
  • Machine family: GPU
  • GPU type: NVIDIA V100
  • Number of GPUs: 1
  • Machine type: Custom
  • Cores: 12
  • Memory: 78GB (Extended)
  • CPU Platform: Automatic (Intel Haswell)
  • Storage: 100GB 
  • Boot Image: Deep Learning on Linux
  • Boot Image Version: Deep Learning Image: PyTorch 1.8 m73 CUDA 110
  • Firewall: Allow traffic

It’s important to mention that this configuration assumes you plan on finetuning using the same machine, and on using the VM as a web API. Your needs may differ depending on those factors, and you may not need all the resources listed; for our purposes, however, this gave our VMs more versatility as and when it was required.

You may also want to make use of Preemptible Instances, as this can reduce costs quite substantially (by as much as 50%). With that said, a preemptible instance can be terminated at any time, which could cause instability if you rely on the machine.

Once your VM is set up, you should be able to SSH into the machine and begin setting up your server. We’d also recommend ensuring that you have root access at this stage as it will make automation easier later.


The model will be driven by Python, using Hugging Face Transformers; however, there are a few more packages we need to set up to allow the VM to be used as a server. Based on the tools used later in this article, these are:

  • Transformers (Hugging Face)
  • Flask (the routing layer)
  • Gunicorn (the WSGI server)

Due to the way we set up our VM, we already have Python and Pip installed, which means all of these packages can be installed using pip commands. 

An example of this can be seen below: 

pip3 install transformers

You can learn more here: Getting started with pip

It should also be noted that you could use alternative server and routing layers if preferred. We decided to make use of Flask and Gunicorn purely out of preference, but these could be swapped out if necessary.

Go ahead and install all of the dependencies listed before moving on. 

Developing the code

As you may have guessed, we will be writing the primary Flask application in Python, so we assume you have a basic understanding of the syntax; however, we will also provide a mostly usable example for you.

Go ahead and create a new file on your VM called ‘api.py’ (the module name must match the ‘api:app’ argument passed to Gunicorn later) and add the following to it:
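Below is a minimal sketch of what such a server might contain, assuming the Transformers text-generation pipeline and the GET parameters described later in this article (text, o, t, i). The exact model is an assumption: the 125M GPT-Neo variant is shown because it loads quickly, but the GPU configuration above suits larger variants such as 2.7B.

```python
# api.py -- minimal sketch of the inference server. The model name and
# parameter handling are assumptions; adjust them for your own setup.
from flask import Flask, jsonify, request
from transformers import pipeline

print("Starting Server")

print("Loading EleutherAI")
# The 125M variant loads quickly for testing; swap in a larger
# GPT-Neo model (e.g. 2.7B) on the GPU configuration listed above.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
print("EleutherAI Ready")

print("Creating Flask")
app = Flask(__name__)

@app.route("/")
def generate():
    # Read the GET parameters described in the "Making Requests" section.
    text = request.args.get("text", "")
    output_length = int(request.args.get("o", 50))   # output length (tokens)
    temperature = float(request.args.get("t", 0.9))  # sampling temperature
    num_outputs = int(request.args.get("i", 1))      # number of outputs

    results = generator(
        text,
        max_length=output_length,
        temperature=temperature,
        num_return_sequences=num_outputs,
        do_sample=True,
    )
    return jsonify([r["generated_text"] for r in results])

print("Server Ready")
```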

Once the file has been saved, you should be able to run a test instance of the server, using the following command: 

gunicorn --threads 8 -t 0 -b <host:port> api:app

We run the Flask application via Gunicorn, which initializes Flask as part of its execution. In the final setup we will look at starting the server automatically as part of the VM boot process, but this step is purely to test that things are working properly.

If you have done everything correctly you should see the following messages printed: 

Starting Server
Loading EleutherAI
EleutherAI Ready
Creating Flask
Server Ready

This may take some time as the transformers library initializes the model and loads its data into memory, so give it time to complete.

Finalizing your server set up

Before we can deploy the code and start using our API, we’ll need to set up a static IP address for the virtual machine.

This is well documented in Google Cloud’s guide on reserving a static external IP address.

The static IP can then be bound to a domain, which you can secure with an SSL certificate using Certbot; this is important if you want to use the server in production.

If you do decide to secure the server using Certbot (Let’s Encrypt), you should do that next by following Certbot’s official instructions.

Once Certbot (Let’s Encrypt) has issued your certificate, you should have the necessary certificate files on your VM, ready to be used as part of your Gunicorn startup command.

Your complete startup command should look something like this: 

gunicorn --certfile=fullchain.pem --keyfile=privkey.pem --threads 8 -t 0 -b <host:port> api:app --daemon

The primary differences are the added certificate files and the --daemon flag, which allows the process to run in the background.

This is a great time to dial in your Gunicorn configuration to find a good balance for your thread counts and timeouts if required, although I’d recommend starting with the values seen above, as these were tested on our machine.
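If you’d rather keep these settings out of the startup command, Gunicorn can also read them from a Python config file. A sketch mirroring the flags used above (the file name below is the default Gunicorn looks for in its working directory; bind address omitted, as in the commands above):

```python
# gunicorn.conf.py -- picked up automatically when gunicorn starts
# from the same directory; values mirror the command-line flags above.
threads = 8              # --threads 8
timeout = 0              # -t 0 (disable the worker timeout)
daemon = True            # --daemon (run in the background)
certfile = "fullchain.pem"
keyfile = "privkey.pem"
```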

Automatic Startup

At this stage you should be more or less on track for deployment, but you probably want the server to start up automatically when you restart the VM, or in the event that a crash does occur.

To do this, go ahead and edit the following file on the VM: /etc/rc.local

Add the following command to your list of startup commands: 

/opt/conda/bin/python3.7 /opt/conda/bin/gunicorn --certfile=fullchain.pem --keyfile=privkey.pem --threads 8 -t 0 -b <host:port> api:app --daemon

After that you can run the reboot command to restart the VM. Once it has restarted, use the htop command to filter for gunicorn processes and confirm that the server does start automatically.

Making Requests

At this point your GPT-2 server is set up, and if you have a domain configured you should be able to generate outputs using simple GET parameters.

Only the ‘text’ parameter is required; the following parameters are supported:

  • text: Your input prompt
  • o: The output length (Tokens)
  • t: Temperature 
  • i: The number of outputs to generate
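As a quick sketch, a request URL could be assembled like this (the domain is a placeholder for your own, bound to the static IP set up earlier):

```python
# Build a request URL for the server; gpt.example.com is a placeholder
# for the domain bound to your static IP.
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "text": "The future of content marketing is",  # input prompt
    "o": 60,   # output length (tokens)
    "t": 0.9,  # temperature
    "i": 1,    # number of outputs to generate
})
url = "https://gpt.example.com/?" + params
print(url)

# Uncomment to send the request once your server is live:
# with urlopen(url) as resp:
#     print(resp.read().decode())
```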

I hope this article gets you started. We look forward to bringing you more on this in the future.
