Let …

A = the amount of analysis, in TB, performed each month.

T = the total amount stored, in TB.

p = the fraction of T that's in long-term storage (and 50% off!).

l = the amount GCP charges you to store one TB long-term. That's currently one penny per GB, so $10.24 per TB. (As we'll see, all other charges can be expressed as multiples of 'l'. Short-term storage is '2l', analysis charges run roughly '0.5l', and so on.)

Before we go on, let's express 'A' as a multiple of 'T'. Intuitively, we can picture some projects running lots of queries on a single small table just as easily as we can picture the opposite: a few queries on lots of large tables. For this, I introduce an index of sorts, with the tortuous name of 'High Operation Financial Feasibility Assessment' Index. From here on, we'll just use its acronym, 'HOFFA'. So:

HOFFA = A/T

The HOFFA index is simply how many times you churn through your total storage in analysis each month.
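As a quick sketch (the figures here are made up, not from any real bill), the definitions above are just a few lines of arithmetic:

```python
# Hypothetical monthly figures -- substitute your own from the GCP billing export.
A = 30.0   # TB analyzed per month
T = 10.0   # TB stored in total
p = 0.8    # fraction of storage billed at the long-term rate

# HOFFA: how many times you churn through your total storage each month.
hoffa = A / T
print(hoffa)  # 3.0
```

A score of 3.0 would put this hypothetical project squarely in "power querier" territory.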

The good folks at Google might, one day, share the distribution of HOFFA scores from their customers, but you can imagine that:

Scores near zero imply users who store a lot of data in BQ but do little analysis on it. These folks probably misunderstood what BQ is all about, and might be better off just storing their data in GCS.

Scores higher than, say, 3 imply a lot of analysis being done — at least relative to the amount being stored. These users are power queriers, and should get cracking on their clustering efforts pronto!

Scores higher than 10 are users who, like their low-HOFFA counterparts, may not be using the right tool for the job. (Or need flat-rate pricing).

But, wait … there’s more!

When you analyze your bill, you'll see a lot of BQ SKUs listed. At its simplest, though, your BQ bill has three components: Analysis + Long-Term Storage + Short-Term Storage (also called "Active Storage"). Using the variables described above, we get:

Bill = (5*A) + (T*l*p) + (T*2*l*(1-p))

(Here I've used the exact on-demand analysis rate of $5 per TB scanned rather than the rougher 'l/2' approximation; that's where the odd-looking 4.096 below comes from.)

I won’t show all the algebra, but we can also express it using the HOFFA index like so:

Bill = 2*T*l*(HOFFA/4.096 + 1 - p/2)
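It's worth sanity-checking that the itemized bill and the factored HOFFA form really agree. Here's a small sketch, using the current rates ($5 per TB scanned for analysis, l = $10.24 per TB-month for long-term storage) and hypothetical usage figures:

```python
# Sanity check: the itemized bill equals the factored HOFFA form.
l = 10.24                  # $/TB-month, long-term storage rate
A, T, p = 30.0, 10.0, 0.8  # hypothetical monthly figures

# Itemized: analysis + long-term storage + active storage.
itemized = 5 * A + T * l * p + 2 * T * l * (1 - p)

# Factored form, using the HOFFA index.
hoffa = A / T
factored = 2 * T * l * (hoffa / 4.096 + 1 - p / 2)

print(round(itemized, 2), round(factored, 2))  # 272.88 272.88
```

The 4.096 is just 2*10.24/5, i.e. the ratio between the storage factor and the analysis rate.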

I realize this isn't exactly E=mc², but there's a little beauty here nevertheless. The three terms inside the brackets tell you that any attempt to lower your bill will hinge on your HOFFA score (since clustering benefits are inherently tied to it, as we'll see), and on the discount you're currently getting on long-term storage (the '-p/2' part, which you'll have to forgo for 90 days whenever you tinker with your old tables, since they become 'new' once again). Go ahead, I'll wait while you try this formula out on your own bill.

In our own case, we found that clustering on high-cardinality columns would reduce our query costs by 40%. That's impressive, but is it worth giving up the long-term storage savings? Here, some more algebra to the rescue. Let's call 'C' the clustering benefit, which we'll peg at 0.6 to represent our 40% savings. Since clustering only reduces what you scan, C applies to the HOFFA term alone, so we'd enjoy a new bill of

NewBill = 2*T*l*(HOFFA*C/4.096 + 1 - p/2)

Now, the only thing left to figure out is whether it's worth giving up the '-p/2' part, which is the credit you get for keeping old data around in BQ. Any new tables you make today won't be 'old' again for 3 months, so we might as well think of this as your clustering penalty, or rather your initial investment, and express it like so:

penalty = 3*(2*T*l*(p/2)) = 3*T*l*p
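In other words: three months of the per-month long-term discount, now forgone. A sketch, using the same hypothetical figures as before:

```python
# Sketch of the clustering 'penalty': ~3 months of forgone long-term
# discount while rewritten tables count as 'new'. Figures are hypothetical.
l = 10.24         # $/TB-month, long-term storage rate
T, p = 10.0, 0.8  # TB stored; fraction that was long-term

monthly_discount = 2 * T * l * (p / 2)  # the 'p/2' term of the bill formula
penalty = 3 * monthly_discount          # lost for ~90 days after the rewrite
print(round(penalty, 2))                # 245.76
```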

Here's the 'penalty' in action! The red line indicates the surge in Short-Term (Active) storage in mid-December, when we ran a special Dataflow job to recreate all the daily tables as clustered partitions. Suddenly years of 'old' data became 'new' again! Fortunately, our analysis charges (blue) went way down, and the red will eventually recede into the cheaper green (Long-Term) rates, so we'll come out way ahead.

In our case, we went into this with a HOFFA index of 2.98, and determined our penalty to be about $1,700 on one of our projects. But since our 'NewBill' would be $400 less each month, the investment would pay for itself in just 4.25 months, which sounds even more impressive when stated as a 282% annual return.
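The payback arithmetic can be checked in a couple of lines, using the figures above:

```python
# Payback math for the clustering migration, using the article's figures.
penalty = 1700.0         # one-time cost: forgone long-term discount ($)
monthly_savings = 400.0  # drop in the monthly bill after clustering ($)

payback_months = penalty / monthly_savings           # months to break even
annual_return = monthly_savings * 12 / penalty       # yearly savings / investment
print(payback_months, round(annual_return * 100))    # 4.25 282
```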

Warning: Minor rant ahead…

That's clearly worth it, and we're delighted to share this finding. But I'll veer off the BQ-fanboy page for a minute here and point out that it would've been really nice if GCP had handled this for us. The company that can autocomplete my search query almost certainly recognizes that we're performing routine queries on high-cardinality columns on their managed service. What would've been the harm in taking a guess at cluster keys and passing their own savings along as an across-the-board drop in analysis charges, all under the hood? It's not as if guessing wrong would have violated any Service Level Agreement with respect to speed: there isn't one! Maybe they optimize wrong and my queries run a little slower the next day. The promise with BQ has always been "we'll get the results back as fast as we can, and on average you won't find anything faster around. But that doesn't mean you won't occasionally have slower queries."

And by now product managers know this is what's expected. Why is long-term data cheaper than active storage anyhow? 90-day-old data doesn't seem to return any slower than recent data, so why does Google pass on a 50% discount for keeping old data around? GCP has said on various occasions, and I'll paraphrase, "after 90 days, we've pretty much crunched and optimized the storage anyhow, so we figured we'd pass on the savings so users don't feel they need to delete data all the time." That's in keeping with the long tradition of cloud vendors: they pass on discounts when it's in line with their own cost structure. (AWS's S3, right out of the gate, said "If you don't need your object storage all that often, we can put it near the back of the fridge, which will be cheaper for both of us.") This is why it's a bit jarring to stumble on these kinds of ROIs in 'managed services'; it almost feels like these opportunities shouldn't exist in the first place, like finding out you can trade stock indexes cheaper than Jack Bogle (RIP).

(In an alternate universe, GCP would’ve developed an AI-model that ‘optimized’ tables through clustering techniques, and they’d have passed on the savings in reduced analysis charges to the rest of us. In the meantime, maybe that’s an opportunity for all those ‘understand your cloud bill’ people.)

To conclude, we really are happy we did all of this. It feels a bit like refinancing a mortgage, and paying various ‘up front’ costs to do so, but we can see the benefits are clearly there. Know your HOFFA number, and run your own Clustering benchmarks to determine if it’s worth it for you. And let me know how it goes!

You can follow me on Twitter @pmueller, and my company (www.atso.com) is available to help with your heavy analytics work, especially in the telecom and energy space.