By Dmitry Mukhin, CTO at Uploadcare.

Uploadcare is a file infrastructure as a service solution. We offer building blocks for handling files that provide simple controls for managing complex technologies. These controls include our widget, Upload API, REST API, and CDN API. All together these APIs handle 350M requests per day.

With only a few lines of code you get the ability to upload, store, process, cache, and deliver files. We support uploads from Dropbox, Facebook, and many other external sources. We also allow our users to upload files directly to their storage.









Yes, you can handle files on your own and you can get the basic system up and running pretty fast. What about storage? Uptime? A clear and friendly UI? Fast delivery to remote regions? For most of the use cases we analyzed, there is no sense in investing in developing your own file infrastructure.

Setting up Uploadcare is quick and solves many of the issues users traditionally experience when handling both large files and batches of smaller ones. Additionally you no longer need to test your system in every browser and maintain the infrastructure.

Uploadcare has built an infinitely scalable infrastructure by leveraging AWS. Building on top of AWS allows us to process 350M daily requests for file uploads, manipulations, and deliveries. When we started in 2011 the only cloud alternative to AWS was Google App Engine which was a no-go for a rather complex solution we wanted to build. We also didn’t want to buy any hardware or use co-locations.

Our stack handles receiving files, communicating with external file sources, managing file storage, managing user and file data, processing files, file caching and delivery, and managing user interface dashboards.

From the beginning we built Uploadcare with a microservice based architecture.

These are the challenges we faced in each layer of our stack.

Backend

At its core, Uploadcare runs on Python. The Europython 2011 conference in Florence really inspired us, coupled with the fact that it was general enough to solve all of our challenges informed this decision. Additionally we had prior experience working in Python.

We chose to build the main application with Django because of its feature completeness and large footprint within the Python ecosystem.

All the communications within our ecosystem occur via several HTTP APIs, Redis, Amazon S3, and Amazon DynamoDB. We decided on this architecture so that our our system could be scalable in terms of storage and database throughput. This way we only need Django running on top of our database cluster. We use PostgreSQL as our database because it is considered an industry standard when it comes to clustering and scaling.

Uploads, External sources

Uploadcare lets users upload files using our widget. We support multiple upload sources including APIs that only require URLs.

Uploaded files are received by the Django app where the majority of the the heavy lifting is done by Celery. It’s great for handling queues and it’s got a great community with tons of tutorials and examples to leverage. Celery handles uploading large files, retrieving files from different upload sources, storing files, and pushes files to Amazon S3. All the communications with external sources are handled by separate Amazon EC2 instances where load balancing is handled by AWS Elastic Load Balancer. The EC2 instances responsible for uploads are kept separate from the rest of the application.









The only two issues we have experienced with AWS are inaccurate reports from the AWS status page and failing to plan ahead when reserving resources to reduce costs and inefficiencies.

File storage, User and file data

We use Amazon S3 for storage. The EC2 upload instances, REST API, and processing layer all communicate with S3 directly. S3 gives us the ability to store customer files forever if they desire it.

File and user data are managed with a heavily customized Django REST framework. At first we used the out of the box Django REST framework as it helped us to rapidly deploy features. However, as our vision of how a REST API should work evolved we implemented customizations to fit our use cases. The footprint of our custom additions has grown large enough that updating the framework is a pain point. We're looking to modify this part of our stack to avoid adding further customizations that would compound this problem.

We use the micro framework Flask to handle sensitive data and OAuth communications. It is lightweight and efficient and doesn’t include any features that we don’t need such as queues, an ORM layer, or caches.

We explore this topic in more detail in an article on cloud-security on our blog explaining how Uploadcare gets content from social media and how we treat end user privacy.

Processing

The 350M API requests we handle daily include many processing tasks such as image enhancements, resizing, filtering, face recognition, and GIF to video conversions.

Our file processing requirements necessitate using asynchronous frameworks for IO-bound tasks. Tornado is the one we currently use and aiohttp is the one we intend to implement in production in the near future Both tools support handling huge amounts of requests but aiohttp is preferable as it uses asyncio which is Python-native.









Our real-time image processing is a CPU-bound task. Since Python is in the heart of our service, we initially used PIL followed by Pillow. We kind of still do. When we figured resizing was the most taxing processing operation, Alex, our engineer, created the fork named Pillow-SIMD and implemented a good number of optimizations into it to make it 15 times faster than ImageMagick. Thanks to the optimizations, Uploadcare now needs six times fewer servers to process images. Here, by servers I also mean separate EC2 instances handling processing and the first layer of caching. The processing instances are also paired with ELB which helps ingest files to the CDN.

Caching, Delivery

There are three layers of caching which help improve the overall performance:

Caching in the processing engine so that same operations are not run many times

Caching inside CDN-Shield, for CDN edges not to hammer origins and cache things more effectively

Cache on the CDN edges as the frontier closest to consumer devices

For delivery, files are then pushed to Akamai CDN with the help of nginx and AWS Elastic Load Balancer. We also use Amazon CloudFront but due to the lack of coverage, we moved to Akamai as default CDN. Also, Akamai has many cool features, for instance, it allows us to automatically adjust imagea formats to user browsers.

It's also worth adding that our file receive/deliver ratio is strongly biased toward delivery.

Front-end

Simple controls over complex technologies, as we put it, wouldn't be possible without neat UIs for our user areas including start page, dashboard, settings, and docs.

Initially, there was Django. Back in 2011, considering our Python-centric approach, that was the best choice. Later, we realized we needed to iterate on our website more quickly. And this led us to detaching Django from our front end. That was when we decided to build an SPA.

Building SPA for our front page, docs, and other site sections from scratch is an ongoing process. It's done with Node.js which is asynchronous and provides isomorphic rendering. In order to communicate with our older Django-based front end, it uses JSON API through nginx. And that's a rule of thumb we stick to: once separated, the communications between our front and back end will still be carried out via JSON API.

For building user interfaces, we're currently using React as it provided the fastest rendering back when we were building our toolkit. And it’s not just it: React has a great community which helps you code less and build more. It’s worth mentioning Uploadcare is not a front-end-focused SPA: we aren’t running at high levels of complexity. If it were, we’d go with Ember.

However, there's a chance we will shift to the faster Preact, with its motto of using as little code as possible, and because it makes more use of browser APIs.

On the client, we work with data using Redux.js and do routing via React Router. The latter is a great example of the awesome React community.

One of our future tasks for our front end is to configure our Webpack bundler to split up the code for different site sections. Currently, when you load a site page, you also get the code for all the other pages. Webpack is also a code-less-build-more tool with vast amounts of community examples to use as starter kits. We were thinking of using Browserify or Rollup, but they did not have a runtime and either worked slower than Webpack or required us to do way more coding to provide similar functionality. Plus, async chunks are easier to handle with Webpack.

As for the static site pages, many of them are Markdown-formatted files sitting in a GitHub repo. Overall, we’re using a GitHub Pull Request Model for deployment. The files from a repo then go through jinja2-inspired nunjucks followed by markdown-it and posthtml. Our docs are built this way, for instance.

For styles, we use PostCSS along with its plugins such as cssnano which minifies all the code.

As you can tell, we like the idea of post-processors. posthtml, for instance is a parser and a stringifier providing an Abstract Syntax Tree which is easy to work with.

All that allows us to provide a great user experience and quickly implement changes where they are needed with as little code as possible.

Deployment

As I mentioned above, we use the GitHub Pull Request Model. Certain parts of the process are automatic while others require manual intervention. We use the following steps:

We create a branch in a repo with one of our features

Commits are made to the branch. That’s followed by automatic tests in TeamCity

If tests pass, a PR is created followed by both auto tests and code review by the dev team

If tests pass/review ok, we merge changes to the staging branch which is then automatically deployed via TeamCity and Chef

When deployed, we run integration tests via TeamCity

If everything is green, we merge changes to the production branch and run one more battery of tests automatically via TeamCity

If tests pass and we ensure it’s not some Friday night, we deploy to production. Chef scripts here are run by devops.









We get notified about deployment status via a Slack channel. Regarding reports, we get a lot of input from our monitoring stack which includes Rollbar for reporting errors and exceptions, LogDNA for logging, Pingdom for testing our services externally, Amazon CloudWatch for monitoring AWS stats, and Datadog for consolidating reports and monitoring health status of our servers and services.

Team administration, tasks, communication

Along with Slack, there's also G Suite for emails, Trello for planning, HelpScout and Intercom for customer success communications and client relations, and more.

Vision

Since we provide the complete set of building blocks for handling files, we encourage everyone out there to use similar building blocks for different parts of their products. Our stack is a great example of that— it's a collection of pre-built components. And we can always go on and add more to the list. For instance, we're using Segment to send data for analyses, which, in turn, are carried out by Kissmetrics,Keen IO, Intercom, and others. We’re using Stripe for processing payments.

We practice what we preach: focus on what you want to create and let those building blocks handle the specific tasks they were made for.