Posted by Matt Cowger, Global Head of Startup Architects

Building a GCS Uploader

In my time since joining Google, more than one startup has asked about receiving files into Google Cloud Storage (GCS) so that those files can kick off a Pub/Sub-based pipeline for all sorts of use cases: image recognition, data ingestion, and so on.

To me, this seems like a natural fit for the product. However, there's no available 'shim' layer for doing this: there's no way to treat GCS like an FTP server or a direct (simple) web upload target. There are workarounds with gcsfuse or gsutil, but those require end users to install and use Google-specific command line tools, and even then they can't be used without direct Google credentials in the project.

However, even a naive implementation of such a system would be suboptimal (I suppose that's why it's called naive). By placing a single FTP or web upload shim in the middle (likely hosted out of a single GCP zone), we negate much of the power and performance of Google's global network and GCS's distributed nature, and potentially limit throughput to the network interface on which our shim runs. Optimally, we'd want something with the following characteristics:

1. Uses exclusively standard web technologies and runs on modern browsers. This means supporting HTTP/1.1, HTTP/2, TLS, etc.
2. Avoids custom plugins or tools.
3. Can be used in an anonymous or semi-anonymous way; at minimum we need to avoid the use of project credentials on the client side.
4. Maximizes the value of Google's network along with the user's network, so that we get the best upload performance possible.
5. Has the lowest price to maintain; optimally we don't want to end up with a daemon running somewhere that costs much more than GCS object storage in general.

Using these ideas as my guide, I developed a prototype to do exactly that, using only two Google Cloud products: Cloud Run and Google Cloud Storage.

Ultimately, the solution to the problem was to use the fact that GCS supports signed URLs: URLs that have all relevant (time-limited) authentication information built into them. We can use those to avoid the need to deliver standard GCP credentials. By using signed URLs we also satisfy requirements #1, #2 and #4: all of the bulk upload traffic from the user to GCS is carried over HTTPS directly to the nearest Google Cloud edge point, not through a single system somewhere.
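To make "authentication built into the URL" concrete, here is a small sketch that parses the query string of a hypothetical V4 signed URL. The bucket, object, credential and signature values are made up for illustration, but the parameter names are the ones GCS uses for V4 signing:

```python
from urllib.parse import urlsplit, parse_qs

# A hypothetical V4 signed URL (credential and signature values are fake).
signed_url = (
    "https://storage.googleapis.com/my-bucket/photo.jpg"
    "?X-Goog-Algorithm=GOOG4-RSA-SHA256"
    "&X-Goog-Credential=uploader%40my-project.iam.gserviceaccount.com"
    "%2F20240101%2Fauto%2Fstorage%2Fgoog4_request"
    "&X-Goog-Date=20240101T000000Z"
    "&X-Goog-Expires=3600"          # validity window in seconds (60 minutes)
    "&X-Goog-SignedHeaders=host"
    "&X-Goog-Signature=deadbeef"    # fake signature for illustration
)

params = parse_qs(urlsplit(signed_url).query)
# Everything a client needs is in the URL itself: no Authorization header,
# and no GCP credentials on the client side.
print(sorted(params))
print(params["X-Goog-Expires"][0])  # → 3600
```

Anyone holding this URL can perform exactly one operation on exactly one object until the window expires, which is what makes it safe to hand to an anonymous browser.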

The two biggest challenges are to:

1. Generate those signed URLs
2. Design a frontend UI

Generate Signed URLs

Generating signed URLs within GCP is a fairly simple process, supported by a wide range of languages and their associated GCP SDKs. Because Python is superior to all other languages in every way, I chose to use it as my base, but this could be done in nearly any language you prefer.

After a few setup procedures detailed in the GitHub repo for this project, I centered on a key function to generate the URLs:



```python
@app.route('/getSignedURL')
def getSignedURL():
    filename = request.args.get('filename')
    action = request.args.get('action')
    blob = bucket.blob(filename)
    url = blob.generate_signed_url(
        expiration=datetime.timedelta(minutes=60),
        method=action, version="v4")
    return url
```

It's very simple and very short; consider this just a backend API. After parsing some incoming parameters (note: there are gaping security holes here, since this function uses client-generated values with absolutely no sanity checking), we ask the API to sign a URL for this path in the bucket that's good for 60 minutes.
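As a sketch of what that missing sanity checking might look like, the handler could reject unexpected actions and suspicious filenames before signing anything. The allowed actions and the filename pattern below are my own assumptions for illustration, not rules from the project:

```python
import re

# Hypothetical validation rules (assumptions, not from the original repo):
# only GET/PUT, and flat filenames with no slashes or parent-dir tricks.
ALLOWED_ACTIONS = {"GET", "PUT"}
FILENAME_RE = re.compile(r"^[A-Za-z0-9._-]{1,255}$")

def validate_upload_params(filename, action):
    """Return (filename, action) if they look sane, else raise ValueError."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unsupported action: %r" % action)
    if not filename or not FILENAME_RE.match(filename):
        raise ValueError("unacceptable filename: %r" % filename)
    return filename, action
```

getSignedURL() could run its request arguments through a check like this and return a 400 on ValueError instead of signing whatever path the client asked for.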

It's worth remembering that the Python itself is just an API; it does not handle any of the client side code. However, for ease of testing and deployment, I also serve the static part of the site (HTML, CSS and JavaScript) from the same container, though that could be replaced with GCS website serving as a later optimization.

Running this small Python script raises its own interesting question: do we run it on a Compute Engine instance, as an App Engine app, as a Cloud Function, in Cloud Run, or in Kubernetes (GKE)? All of these would work, but in my case I wanted as little management as possible, making Cloud Functions and Cloud Run the top contenders. Both support scale-to-zero and pay-per-use billing, meeting requirement #5. I'm most comfortable testing, deploying and managing containers, so I went with Cloud Run.

The last important component is the client side work: the user selects the file they wish to upload with a standard form, and then the request for the signed URL is made:

```javascript
async function generateSignedURL() {
    file = getFilename();
    action = "PUT";
    const response = await fetch('/getSignedURL?filename=' + file + "&action=" + action);
    if (!response.ok) {
        throw new Error('Network response for fetch was not ok.');
    }
    // The signed URL comes back as quoted text; strip the quotes and stash
    // it where upload() can find it.
    c = await response.text();
    c = c.replace(/\"/g, "");
    console.log("Got signedURL: " + c);
    console.log("Trying to upload " + file);
    upload();
    console.log("Complete");
    return false;
}
```

And then, lastly, the form itself is submitted on button click by the user. This is special because the target of that form is the GCS signed URL directly, rather than the Python service, meaning the upload is limited only by client bandwidth and we maximize the performance benefit of Google's network.
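Because the signed URL carries its own validity window, a client or a test can work out when it will stop being accepted without ever calling GCS. A minimal sketch, assuming a V4-style URL with `X-Goog-Date` and `X-Goog-Expires` query parameters (the URL below is made up):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, parse_qs

def signed_url_expiry(url):
    """Compute when a V4 signed URL expires from its own query parameters."""
    params = parse_qs(urlsplit(url).query)
    issued = datetime.strptime(
        params["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return issued + timedelta(seconds=int(params["X-Goog-Expires"][0]))

# Hypothetical URL: a 60-minute window starting 2024-01-01 00:00:00 UTC.
expiry = signed_url_expiry(
    "https://storage.googleapis.com/b/o"
    "?X-Goog-Date=20240101T000000Z&X-Goog-Expires=3600"
)
print(expiry.isoformat())  # → 2024-01-01T01:00:00+00:00
```

A frontend could use the same arithmetic to re-request a fresh URL when an upload is attempted after the window has closed.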

Once the code has been pushed to Cloud Run ( gcloud run deploy ), it's ready to go (note: I'm skipping the process of building a container; that is left to the reader, but the Dockerfile is in the repo)!

You can find the full repo on my GitHub: https://github.com/mcowger/gcs-file-uploader