Overview

writeup.ai is an open-sourced text-bot that writes with you. It's (mostly) powered by OpenAI's GPT-2 and has additional fine-tuned models:

Legal

Copywriting and Mission Statements

Lyrics

Harry Potter

Game of Thrones

Academic Research Abstracts

The main technical challenges were creating an app that could deliver OpenAI's GPT-2 Medium (an ML model that generates text) quickly and simultaneously support 10-20 heavy users.

Inception

Began as an excuse to learn about training ML models in NLP (Natural Language Processing). I ended up learning mostly about deploying models.

Estimated a month to build. Wrong. Consumed my life for three months.

Hard for engineers to estimate. Estimation is even harder for overconfident idiots (cough).

Unfortunately, I did not learn much about training models (lol). Still don't know anything.

Lots of open-sourced training scripts (nsheppard) did the heavy lifting. Found gwern's GPT2 guide invaluable for a training tutorial. Another great quick-start is Max's gpt-2-simple repo.

Most of writeup.ai is open-sourced. I've added corresponding links so you can learn from my mistakes and failures, along with direct links to the code on GitHub.

Background

Too many years making web apps in React, Django, Flask.

New to machine-learning (ML) and MLOps (machine learning devops), so read any advice with healthy skepticism.

Reader Expectations

Some background in web development is necessary, but I aggressively link to help with jargon.

Basic knowledge of machine-learning is helpful.

Caveats

Bullet points for concision.

Full phrases first, then abbreviations: i.e. machine learning (ML), then ML.

In most circumstances, model means machine-learning model. Writing "ML model" was redundant.

Vendor lock-in is real. Enjoyed Google Cloud Platform (GCP) so much, never intend to leave. Some advice is GCP-centric.

GCP makes deploying and scaling ML resources easier than my previous AWS experiences.

Email, tweet, comment anything you'd like clarified.

Technical Architecture

Frontend (ReactJS) joins a WebSocket on the backend (Django) and communicates with the backend via WebSockets. Frontend Code | Backend Code

Backend parses and serializes frontend requests. Packages message (text, algorithm, settings) and a WebSocket channel to a Google Load Balancer. Backend Code

Load balancer relays to the proper microservices (small, medium, large, harry potter, legal, etc).

Microservices periodically update the WebSocket with suggested words in real-time. This creates the "streaming" effect.

Frontend receives updated WebSocket messages from microservices.

Each ML model (small, medium, large, harry-potter, legal, research) is a microservice. Auto-scale on utilization.

Tried countless iterations to make it fast(er).

I generally dislike microservice architectures (adds additional complexity). Despite best efforts, a microservice architecture was necessary for performance.

The requests and computational cost of the microservices are fundamentally different from the backend server. Traditional web servers easily handle 500-5000+ requests/second (see C10K). However, just 50 requests/second to an instance running a 1gb model generating 50-100 words can crush a machine. (*)

Backend and microservices are written in Python 3.6. Django (DRF) powers the backend. A separate instance of a heavily-stripped version of Django is used for microservices.

All of the microservices instances have an attached GPU or a Cascade Lake CPU to run ML models. Details below.

Backend and microservices are hosted on Google Cloud Platform.

A Google Load Balancer routes all the traffic to microservices. It routes based on the URL postfix ("/gpt2-medium", "/gpt2-medium-hp", etc.). The load balancer also runs health-checks to detect CUDA crashes.

(*) - Whenever you have to justify your use case for microservices, it probably means it wasn't worth the complexity.
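The backend's packaging step above can be sketched roughly as follows. This is a hypothetical illustration: the field names (`text`, `algorithm`, `settings`, `websocket_channel`) and the helper are made up, not the actual backend code, but the shape of the message is what the architecture describes.

```python
import json

def package_request(text, algorithm, settings, websocket_channel):
    """Bundle a frontend request into the payload relayed to the load balancer.

    Hypothetical sketch: the real backend serializes with DRF, but the message
    contents (text + algorithm + settings + a channel to stream back on) match
    the flow described above.
    """
    payload = {
        "text": text,
        "algorithm": algorithm,                   # e.g. "gpt2-medium", "gpt2-medium-hp"
        "settings": settings,                     # length, temperature, etc.
        "websocket_channel": websocket_channel,   # where the microservice streams results
    }
    return json.dumps(payload)

# The load balancer then routes on the model's URL postfix,
# e.g. "/gpt2-medium" vs "/gpt2-medium-hp".
```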

Three Weeks in Lima, Peru

Started serious coding at the beginning of a three-week trip in Lima, Peru. Trip served as a catalyst.

Some friends started beta testing near the end. Slow and often failed.

Spent 80% of my trip coding in a coworking space.

Two weeks on backend and DevOps, last week on the frontend.

Rewrote DevOps as complexity grew.

At trip end, the frontend communicated via POST requests to the backend which relayed to the microservices.

It wasn't pretty or fast, but seeing the first message go end-to-end from frontend --> backend --> microservice made me stupidly excited.

MVP Version

Accomplishments in Lima

A reasonable amount of the frontend. It had a simple text editor and decision options populated from microservices.

Backend could create WebSockets to communicate with the frontend. In the first iteration, the backend communicated with the microservice via a POST request and then relayed the message onto a WebSocket. I desperately wanted to keep the microservices dumb and not handle WebSockets.

Automated Deployments via Ansible (later refactored/removed into Google Startup Scripts)

Mistakes: Launch earlier! In hindsight, I should have launched after building for 4-5 weeks. By then it was a sorta-working MVP, but I was afraid that without all the bells-and-whistles it would be a mockery.

Random: There's something magical about flow, 2:00 AM and an empty coworking space. Anything feels possible.

The 90/90 Rule and Microservices Difficulties

The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time. - Tom Cargill, Bell Labs

Engineers are bad at estimates.

Severely underestimated Machine Learning DevOps (what kids call "MLOps") difficulty.

Second major underestimation was managing my own feature creep.

Making the microservices work in Docker containers, scale, and install CUDA drivers with the proper models was unexpectedly difficult.

To make the ml-microservices work in a Docker container, I had to:

Use a custom TensorFlow boot-image that had CUDA installed

Pass a special flag to Google to install nvidia-drivers (not always necessary for certain images)

Install Docker on the GCP instance

Override docker's default runtime to nvidia (makes using docker and docker-compose easier).

Make sure Docker didn't delete or revert my configuration changes.

Sync GitHub with GCP; Build Docker images upon pushing.

Sync ml-models with Google Cloud Storage. Read speeds are insanely fast from GCP Storage and Instances. Faster than AWS.

Pull the prebuilt Docker image built from Google Cloud Build.

Pull ml-models from Cloud Bucket and --mount in Docker with separate proper folders.

Pray that all the requirements (TensorFlow / PyTorch) were installed properly.

Save the HD image / snapshots so that instances can cold-boot quickly from an image.

The rest of the traditional DevOps (git, monitoring, start docker container, etc).
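Strung together, the steps above end up as a Google startup-script. A heavily simplified sketch follows; the bucket name, image tag, port, and paths are made up for illustration, and it assumes a TensorFlow boot-image with CUDA preinstalled plus the nvidia container runtime:

```shell
#!/bin/bash
# Hypothetical startup-script sketch (names are illustrative, not the real deploy).

# Install Docker -- many ML boot-images don't ship with it
curl -fsSL https://get.docker.com | sh

# Override Docker's default runtime to nvidia so docker/docker-compose "just work"
# (assumes nvidia-container-runtime is already installed on the boot-image)
cat > /etc/docker/daemon.json <<'EOF'
{"default-runtime": "nvidia"}
EOF
systemctl restart docker

# Pull the prebuilt image produced by Google Cloud Build
docker pull gcr.io/my-project/gpt2-microservice:latest

# rsync the model weights from a Cloud Storage bucket (reads are fast within GCP)
mkdir -p /models
gsutil -m rsync -r gs://my-model-bucket/gpt2-medium /models/gpt2-medium

# Run the microservice with models mounted at the same path as on the host
docker run -d -v /models:/models -p 8000:8000 \
    gcr.io/my-project/gpt2-microservice:latest
```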

It's much easier to remedy these issues when reflecting, but at the time, I had no idea what I was getting myself into.

Above steps had to be fully automated or scaling failed. It feels dirty (in 2019) to write bash scripts as part of an automated deployment, but it's necessary when using Google startup-scripts for autoscaling. Kubernetes is another option, but I'm not smart enough for K8s. Google startup-scripts run a shell script when the machine starts; it's hard to use Ansible when auto-scaling instances.

TIP: Use startup-script-url! This tells the instance to run a script from a custom bucket URL, which is much better than copying/pasting your script into GCP's CLI/UI. You will make many little changes to your startup-script.
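For example, the metadata flag can point an instance at a script in a bucket (instance and bucket names below are illustrative):

```shell
# Point the instance at a startup script living in a Cloud Storage bucket
# instead of pasting the script inline into the CLI/UI.
gcloud compute instances create gpt2-medium-1 \
    --metadata startup-script-url=gs://my-scripts-bucket/startup.sh

# Editing gs://my-scripts-bucket/startup.sh now changes what runs on every
# future boot, without touching instance metadata again.
```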

Setting up the backend was straightforward. This was my first experience using Django Channels, which configures WebSockets. Props to Django-Channels.

The frontend took additional time because of feature creep. I kept on adding one more feature because I was afraid it wasn't good enough.

The microservices were originally written in Flask (because that's what everyone suggests). Then I looked at benchmarks and realized I could get the same performance from django-rest-framework if I stripped it down. Having everything in django-rest-framework was much easier for me (my background is Django).

Optimizing the microservices took a bit of time. I experimented with different video cards, CPUs, memory configurations, images. More on that later.

Things that came as a shock

Until two months ago, the default python on TensorFlow images was 2.7

PyTorch's Docker images use conda.

The overrides necessary to make nvidia runtimes work on Docker.

Examples of ML open-sourced code were all over the place. Lots of spaghetti and duct-tape.

Google's TensorFlow Docker images are so optimized (!) they run PyTorch faster than PyTorch runs in official PyTorch images. This might have been a bug in the upstream PyTorch images, didn't investigate.

Builds can be broken upstream when pulling Docker containers (TensorFlow / PyTorch). Everything in ML is moving so quickly, you kinda just get used to it.

TIPS

Avoid installing CUDA manually. Use a Google pre-installed boot-image.

Write down your CUDA version and other configuration details. It's helpful when Googling issues with a CUDA version + some other requirement. There are lots of graveyards of CUDA version + framework version bugs.

Overall: Once you know which configuration (boot-images, Docker images, Docker configurations, CUDA) works, it's straightforward. The hard part is knowing it ahead of time ...

Learning Machine Learning

Inference And Other Scary Jargon Explained

The ML community uses a lot of terms that can be intimidating. After a few months, they become much less intimidating. Until I see math, that is. Then I hide.

Overall feeling when I saw talented people doing maths: Crap. What am I doing here?

16-bit: makes models train faster by requiring less precision.

algorithm: generic word meaning anything. The more frequent in your research paper, the higher the probability it gets published in Science/Nature. Ideal sentences include "We introduce a novel algorithm technique that ..."

APEX: something that will install in a Docker container, but won't always run. Used for nvidia's 16-bit training.

conda: something that installs all your Python dependencies. After a production release, changes your office nickname to "that's odd, the code worked on my machine ..."

converged: finished; further training doesn't improve results.

CUDA: used as a walkie-talkie to your nvidia video card to tell it to run your buggy code. Something that rarely installs correctly.

embedding: the model's internal representation of an input.

inference: fancy word for run-time, or running in real-time. "Our inference run-times converged around 5 seconds" means it takes 5 seconds to run.

latent space: the model thinks these variables are correlated and have meaning, but we have no idea why.

nvidia-smi: a CLI to monitor how little of your expensive graphics card is being used.

SOTA: state-of-the-art. Often a fancy phrase meaning really-really-large model, or "I took what you did 1.5 months ago but ran it with more data, more iterations and one more layer."

transformers: I'm not smart enough to explain this. But Jay is.

TPU: Google's custom video card. Not exactly a plug-and-play replacement for nvidia video cards; requires code modifications to run. Works well for architectures with large memory requirements like transformers.

Python 2.7: a version that's supposed to be sunset, but apparently used in 70%+ of ML research.

sudo pip install: something you should never do, but frequently seen as step 2 in many repos.

unit tests: something rarely found in ML source code.

nvidia's last ten cards: write them down in terms of performance on a reference sheet. Circle the ones that can be used for 16-bit. Cross out the ones that sound like they starred in Terminator movies.

TIP: Jokes aside, ML has a LOT of new terminology / jargon to grasp. Keeping and writing a cheat-sheet of terms is helpful. I recommend Spaced Repetition and Anki.

Inference Profiling: GPU Names Sound Intimidating!

When running machine learning models on the web, you have two options for hardware : GPU (video-card) or CPU.

PROS: GPUs are faster; performance is normally 5-15x that of CPUs.

CONS: GPUs cost more and add deployment complexity.

Many machine learning tasks take only a fraction of a second (e.g. image classification). For those, it's perfectly valid to use a CPU.

Most users won't notice the difference between .05 second and .5 second on an async task. Your web page should load FAST, but lazy load the task result.

Running gpt-2 medium models (1.2-1.5 GB) isn't fast on CPUs. An average CPU generates about 3-7 words per second: not ideal UX.

I had to decide between Cascade Lakes (latest generation Xeon CPUs, optimized for ML), K80s, V100s, or P100s on Google Cloud.

These benchmarks weren't a scientific baseline. It was more of a quick ranking heuristic ... written on a napkin.

Chip                     Speed (words/sec) (1)    Memory
Cascade Lake (CPU) (2)   8-11                     26 GB (instance)
K80 (GPU)                12-24                    11 GB
P100 (GPU)               32-64                    16 GB
V100 (GPU)               32-64                    16 GB

(1) - This is when running multiple PyTorch instances. I did this to smooth out utilization across CPU / GPU blocking operations. For instance, on the same machine with a GPU, two PyTorch instances could generate 1.5x more than a single instance by smoothing out CPU/GPU bottlenecks: a single PyTorch app might generate 15 words/sec, but two Python apps could each generate 10 words/sec.

(2) - Huge blunder on my part: I didn't try with the latest MKL-DNN drivers installed. You may see a nice performance jump. Or you might not.

Higher memory was helpful as the text inputs increased.

In terms of cost per cycle, Cascade Lakes are cost-effective compared to GPUs, but I felt they were just below the cutoff of fast-enough for UX. They didn't generate prompts as fast as I wanted.

I found the UX trade-off of K80s vs. P100s acceptable when generating <50 words at once.

Ended up using mostly Cascade Lakes and K80s except for GPT-2 Large. Cost.

TIP: You can run most of these as preemptible instances, which cost half as much. I used preemptible most of the time, except during product launches.

TIP: If using preemptible instances, Google will force a restart once every 24 hours. Create them at an odd hour like 2AM so the restart impacts the fewest visitors.

TIP: Cascade Lakes are a perfectly reasonable tradeoff.

CAVEAT: These "benchmarks" are only for inference (running a model in real-time). Most training should be done on GPUs.

Other Difficulties Encountered

Thomson's Rule for First-Time Telescope Makers

"It is faster to make a four-inch mirror then a six-inch mirror than to make a six-inch mirror." -- Programming Pearls, Communications of the ACM, September 1985

Started with simple: API endpoint generating words from gpt2-medium. Slow. Sync task. Used Flask. Single endpoint.

Added frontend. Would query API endpoint. Slow. Duplicate requests could crush API.

Added backend as gatekeeper to API endpoints.

Rewrote Flask endpoints into Django-DRF.

Integrated django-channels in backend to handle Websockets. Added redis-cache to check for duplicate requests before relaying to microservices.

Changed frontend to communicate via WebSockets.

Rewrote deployment scripts from Ansible to handle Google Cloud's startup scripts paradigm.

Integrated microservices to communicate via WebSockets, aka allow "streaming".

Trained and added additional microservices (small, medium, large, legal, writing, harry potter, lyrics, companies, xlnet)

Complexity gradually grew from original idea of simple endpoint.

PRO: I've drastically improved at deploying ML after all these shenanigans.

CON: Tight coupling with GCP's core products (specifically Storage, Cloud Build, Autoscaling, Images). Tight coupling on single service vendor is not always ideal (technically or strategically).

TIP: If you're okay with tight coupling with GCP Products, you can build faster. Once I accepted using startup-scripts, everything got easier.

Overall: I probably would have been discouraged/intimidated if I had known the complexity of the final architecture (and my own ignorance of DevOps). I attribute that to a lack of planning and "don't know what I don't know" risk. Amongst my many mistakes, building the app from a simple architecture and gradually refactoring in complexity was something I did right.

Docker! Where’s my video card?! And other deployment difficulties.

Note: Both GCP and Docker have a concept of images. To avoid confusion, I'll always refer to GCP's as boot-images.

Generally, using Docker containers helps streamline deploys, service configuration, and code reproducibility ("iuno, worked on my machine problems").

Using Docker in ML is harder. Issues:

Images can become grotesquely large. Official TensorFlow Docker images are easily 500mb-1.5gb in size.

Most GCP machine-learning boot-images do not come with Docker/Compose.

Counter: Many boot-images that include Docker don't have CUDA.

If you have the courage to install TensorFlow and CUDA from scratch, I applaud you.

Trick is to find a good-enough boot-image and install the less difficult of the two (CUDA, Docker). Majority of time, Docker + Docker Tools is easier to install than CUDA.

Many models are frequently 1+ gb, prohibitively large for source control. Need scripts that rsync large models upon startup/deployment.

Easy to forget passing nvidia runtimes to Docker on commands.

Feedback loops in DevOps are much slower than programming. You can make a change, realize you had a typo and take another 10 minutes to deploy. If using Google rolling deploys, can take even longer.

PRO: Once containers are setup, surprisingly robust.

CON: Docker adds complexity on deployment. Reasonable counterargument: if you're doing this much, why not add Kubernetes? Answer: I'm not smart enough for Kubernetes.

TIP: Be meticulous and put every shell command you run in a Quiver journal (or some type of record keeping). You will likely copy and paste your commands dozens of times. You'll automate a huge portion of it later, and that's harder if you only "sorta" remember the command order.

TIP: Run/save your commands with absolute paths to avoid overwriting the wrong directories. ie. "rsync /path1 /path2" instead of "rsync path1 path2", oh f f.

TIP: If you know Ansible, use Ansible to rerun Google's startup-scripts on your inventory. Much faster than GCP rolling-deploys.

- name: Deploy to Open
  # startup scripts does most of the hard work, but make sure
  # you're only deploying to things that finished from startup scripts
  hosts: open_django:&finished_startup
  gather_facts: true
  become: true
  post_tasks:
    - name: Run Startup Script
      shell: |
        google_metadata_script_runner --script-type startup --debug
      args:
        chdir: /
      become: yes

TIP: Spend the additional hours and outline:

where models should be stored and in which bucket. Separating cloud buckets for training and production is recommended.

where the bucket/directory should be synced on the instances. If possible, make the instance path match the mount directory of your Docker container, ie. an instance's /models mounts to the Docker container's /models path.

proper rsync commands to the bucket. Use rsync (not cp)! It's more efficient on reboots than pulling the same files via cp.

TIP: Quick automated checks for PyTorch (torch.cuda.is_available()) or TensorFlow (tf.test.is_gpu_available()) save headaches by ensuring Docker is using nvidia.
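A minimal sketch of such a check, written so it degrades gracefully when a framework isn't installed (the function name is mine, not from the project):

```python
def framework_gpu_checks():
    """Report whether each framework can see a GPU.

    Returns True/False per framework, or None if the framework isn't
    installed (or the check itself failed). Run this at container start
    to catch a missing nvidia runtime early.
    """
    status = {}
    try:
        import torch
        status["pytorch_cuda"] = bool(torch.cuda.is_available())
    except Exception:
        status["pytorch_cuda"] = None
    try:
        import tensorflow as tf
        status["tensorflow_gpu"] = bool(tf.test.is_gpu_available())
    except Exception:
        status["tensorflow_gpu"] = None
    return status
```

If either check comes back False inside a container that should have GPU access, the nvidia runtime was almost certainly not passed to Docker.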

Overall: This area is probably where many web engineers struggle when deploying pre-trained ML applications.

Finding Bottlenecks. What do you mean I'm out of memory?

Monitoring traditional web-server load is generally straightforward: CPU usage % is listed on all GCP pages, and for memory, the command top quickly tells you how much memory programs are using. Google's StackDriver auto-forwards memory usage to Google Cloud.

DevOps has cared about monitoring cpu, memory, disk usage, network for DECADES.

However, until recently, the only people who cared about GPU usage were overclocking gamers (aka crysis-99-fps-watercooled-noobmaster). Production GPU monitoring tools haven't quite caught up in the years since AlexNet (when the community learned to use GPUs for ML).

To properly monitor GPU usage, you have to run nvidia-smi, output the results on a set interval, write a script for Prometheus to read them, and then pass that to StackDriver. In short, you have to write a microservice to monitor a microservice.
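The polling half of that monitor can be sketched like this. The CLI flags are real nvidia-smi query options; the function and dict keys are my own illustration, and the exporter/StackDriver side is omitted:

```python
import subprocess

def gpu_stats(smi_output=None):
    """Poll nvidia-smi for utilization and memory, parsing its CSV output.

    Pass `smi_output` to parse already-captured text (handy for testing);
    otherwise the real nvidia-smi binary is invoked. Returns one dict per GPU.
    """
    if smi_output is None:
        smi_output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    stats = []
    for line in smi_output.strip().splitlines():
        util, mem = (int(field) for field in line.split(","))
        stats.append({"gpu_util_pct": util, "memory_used_mb": mem})
    return stats
```

A cron job or small loop would call this on an interval and write the numbers somewhere Prometheus can scrape.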

During usage, CPU and GPU usage spike somewhat linearly together. As a hack, I found the lowest number of vCPUs that could still spike to 80-100% and auto-scaled based on CPU usage. With too many vCPUs, CPU usage % won't budge while the GPU is hammered.

Problems can arise when the GPU runs out of memory. This happened when users passed longer prompts (> 200 words). PyTorch raises an exception, but unfortunately it contains a large memory leak. To handle this, I caught the PyTorch exceptions and forced a release of unused memory. nvidia-smi isn't useful here since its memory usage stats aren't real-time precise (IIRC, it only shows the peak memory usage of a process).

Preparing For Launch Day

Training Models

Fine-tuned additional models on a P100 from gpt2-medium. Training iterations (cycles) ranged from 60k on Game of Thrones (GoT) and Harry Potter (HP) ... to 600k (academic research, trained on 200k paper abstracts).

Used TensorFlow 1.13 to train.

Training time ranged from a few hours (60k) to a few days (600k).

Cross-entropy loss was between ~2-3. Metric wasn't useful when overtraining.

Forked nsheppard's gpt2 repo, made minor modifications to speed startup for larger datasets.

Following gwern's tutorial is straight-forward once you understand ML jargon (although maybe that's the hard part).

Used gradient checkpointing to handle memory issues. Fine-tuning gpt2-large (774M parameters, 1.5gb) isn't possible on single GPUs without memory issues.

Finding and cleaning datasets varied from slightly-numbing pain to tedious frustration.

Once again, data scrubbing is 80% of the work.

Grabbed datasets from Kaggle, Google, and misc. free data-sets. Issues like dataset quirks, newlines (\r, \n, carriage returns), unicode detection, and language detection were the most time consuming during scrubbing.

Gwern used a lot of bash / command-line tools to clean his Shakespeare corpus. I recommend using Python instead; it's easier to reuse the code on different datasets.
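A minimal Python cleaning sketch covering the newline issues mentioned above. This is my illustration of the kind of reusable helper I mean, not the project's actual scrubbing code, and whether to drop blank lines depends on your corpus:

```python
def clean_corpus(text):
    """Normalize line endings and strip common scraping artifacts
    before feeding text to a fine-tuning script."""
    # Normalize CRLF and bare CR (carriage returns) to plain \n
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace non-breaking spaces, a frequent unicode quirk in scraped data
    text = text.replace("\u00a0", " ")
    # Trim whitespace and drop empty lines (skip this if paragraph
    # breaks are meaningful for your dataset)
    lines = (line.strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)
```

Because it's a plain function, the same helper runs unchanged against the next dataset, which is the advantage over one-off bash pipelines.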

Could not make 16-bit training (apex) work correctly in Docker. Nvidia benchmarks (marketing, though ...) show 16-bit can shorten training cycles by 2x (or more). Didn't try too hard (tired) to make 16-bit work.

After training, converted models to PyTorch using huggingface script. Deploying on pytorch-transformers is straight-forward.

Wanted to avoid overtraining on the Harry Potter corpus, but in hindsight, it feels like it would have been better to overtrain than undertrain. Your results may vary when balancing over/under-training risk for small datasets.

TIP: After you have your raw training dataset, make a copy of it. Never modify your raw dataset. Copy modified output to a separate folder. Keep modified and raw datasets in separate folders to avoid mistakes/confusion.

TIP: If you find yourself cleaning a particular dataset for a while, take a step back and look for a similar dataset w/out issues. This happened with Harry Potter datasets.

TIP: Learn tmux! Using tmux makes it much easier to start training on a remote-machine and you can exit without worry.

TIP: Use Quiver to hold all your commands. Very easy to make typos.

Running Models

Used PyTorch. pytorch-transformers provides convenient API call sites to the models. I mimicked the examples in huggingface's run_gpt2.py, then applied a massive refactor.

Loading GPT-2 models in PyTorch is slow (1-2 minutes).

To shorten loading time, when a microservice starts, WSGI loads the appropriate model (gpt2-small, medium, large, etc.) and stores the PyTorch instance as a singleton. (This is in response to the emails asking how the responses are so quick.)

All subsequent requests use singleton PyTorch instance.
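The singleton pattern can be sketched as below. The cache and `loader` argument are my illustration; in the real service the loader role is played by something like pytorch-transformers' `from_pretrained`:

```python
_MODEL_CACHE = {}

def get_model(model_name, loader):
    """Load a model once per WSGI process and reuse the instance afterwards.

    `loader` stands in for e.g. GPT2LMHeadModel.from_pretrained; loading a
    GPT-2 model takes 1-2 minutes, so it must only ever happen once, at
    process start. Every later request gets the cached instance for free.
    """
    if model_name not in _MODEL_CACHE:
        _MODEL_CACHE[model_name] = loader(model_name)  # the slow part
    return _MODEL_CACHE[model_name]
```

Each WSGI process pays the load cost once; request handlers then call `get_model` and go straight to inference.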

Configuration limits how many WSGI processes run based on the model size. Too many WSGI processes and CUDA runs out of memory; too few and the GPU is underutilized.

Catch exceptions when PyTorch runs out of memory; release memory-leak.

def get_process_prompt_response(request, validated_data):
    try:
        output = generate_sequences_from_prompt(**validated_data)
    except RuntimeError as exc:
        if "out of memory" in str(exc):
            logger.exception(
                f"Ran Out of Memory When Running {validated_data}. Clearing Cache."
            )
            torch.cuda.empty_cache()
            oom_response = get_oom_response(validated_data)
            return oom_response

    response = serialize_sequences_to_response(
        output,
        validated_data["prompt"],
        validated_data["cache_key"],
        WebsocketMessageTypes.COMPLETED_RESPONSE,
        completed=validated_data["length"],
        length=validated_data["length"],
    )

    # clear cache on all responses (maybe this is overkill)
    torch.cuda.empty_cache()
    return response

95% of request time is spent predicting logits. The rest comes from routing and de/serialization across frontend -> backend -> load balancer.

On every fifth word, microservice updates WebSocket with updated text.
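The every-fifth-word streaming above can be sketched as a generator. The function name and word-based chunking are my illustration of the idea:

```python
def stream_updates(words, every=5):
    """Yield the cumulative text after every `every` words, mimicking how the
    microservice pushes partial results over the WebSocket (the "streaming"
    effect the frontend renders)."""
    for i in range(every, len(words) + 1, every):
        yield " ".join(words[:i])
    if len(words) % every:          # flush the final, completed response
        yield " ".join(words)
```

Each yielded string would be sent as a WebSocket message on the request's channel, so the user sees text appear in chunks instead of waiting for the full generation.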

Adding cache to backend to prevent duplicate requests helped.

To ensure the same responses from different instances, I use a seed of 42 on all requests.

Deployment Improvements, Distillation, Thoughts

TensorFlow has TensorFlow Serve and PyTorch has TorchScript for converting models into production-grade formats. Benefits include a reasonable speed improvement (a redditor quoted 30%) and easier deployment on devices without Python. I traced (PyTorch's conversion process) a few models, but found the speed benefit wasn't obvious while adding much more complexity.

In the past few months, distillation of models (extracting 90-95%+ of the model at <50% of the size and runtime) has picked up traction. Huggingface's distillation of gpt2-small is 33% smaller and 2x faster.

There's a recently published paper about Extreme Language Model Compression that compressed BERT by 60x. Lot of implications if it can be applied on GPT2!

A bit of an anti-pattern, but having PyTorch and TensorFlow on the same Docker image was tremendously useful. I was able to diagnose and try out potential solutions much faster.

I originally integrated XLNet, but didn't find its generative output as strong as GPT2's. I also attempted to have it suggest individual words (similar to its masked language model), but couldn't find a good writing use-case / UI.

Some repeats from above.

Sentry is invaluable for error reporting. Using it with ASGI (Django-Channels) was slightly harder than normal.

Quiver - a programmer's notebook. Used it to hold all my commands and notes.

tmux - Use it to keep remote sessions open. An alternative is screen.

Using django-rest-framework is a joy. Feels like a cheat code.

Netlify has been fantastic to deploy with.

Conclusion

Dealing With Burnout

How I Treated My Mental State Until It Was Too Late ...

Hit a wall nearing the finish line.

Started burning out hard about 2-2.5 months in.

Mental anguish from feeling that I should have launched, and that it wasn't good enough to launch. Obsessed over missing features.

Really helpful to call a close friend to talk through (thanks James C).

Self-imposed stress of "launch!" made me avoid calling family. That was a mistake. Found calling my mother just to ask about her life made me breathe again.

Proud that I finished this. Learned a lot of unexpected things about ML deployments. Useful for my next project.

Huge Thanks To

OpenAI for GPT2; HuggingFace for pytorch-transformers.

GCP for credits, otherwise could not afford. Biased, but I found GCP metrics better across the board (Bucket, Networks, ease-of-use) than AWS.

My friends who helped me beta-test and gave invaluable feedback. Thank you (in random order): Christine Li, James Stewart, Kate Aksay, Zoltan Szalas, Steffan Linssen, and Harini Babu.

The many Redditors / ProductHunters who really pushed and played around with the product, along with great feedback and hilarious writing prompts. Reddit Writing Research | Reddit Writing Harry Potter | ProductHunters | Writing Business Descriptions

