When building software systems, we usually deal with data from external sources. This can be user input, data coming from other systems, etc. My basic assumption on any external data is: don’t trust it!

Any data that we don’t know completely ahead of time can and will behave differently than what we expected. A classic example for this is user input, say a text field. If we don’t limit the length and contents, somebody will eventually enter a book lengths worth of data or try to use it to attack a system.

But the same problem extends to data from systems we control and that we might have faith in. At Mozilla we have a variety of teams and products which are deployed to a large and diverse population of users across the world. We may think we know how those products and the data they generate behave, but practically we always find surprises.

A case study

Let’s consider a hypothetical mobile application. The mobile application has a basic system to collect product data, and makes it easy to collect new string values. To make it easier and more flexible for our teams to add something, we don’t impose a hard limit on the length of the string. We have documentation on the various instrumentation options available, making it easy to choose the best for each use-case.

Now, this works great and everybody can add their instrumentation easily. Then one day a team needs data on a specific view in the application to better understand how it gets used. Specifically they need to know how long it was visible to the user, which buttons were interacted with in which order and what the screen size of the device was. This doesn’t seem to directly fit into our existing instrumentation options, but our string recording is flexible enough to accommodate different needs.

So they put that data together in a string, making it structured so it’s reasonable to use later and we start sending it out in our JSON data packages:

...

"view_data": "{\"visible_ms\": 35953, \"buttons_used\": [\"change_name\", \"confirm\", \"help\"], \"screen_size\": \"960×540\"}",

...

The change gets tested and works fine so it gets shipped. Some time later we get a report that our product dashboards are not updated. An investigation shows that the jobs to update the dashboards were timing out, due to unusually large strings being submitted. It turns out that some users click buttons in the view 100 times or more.

What’s more, a detailed review shows that the user churn rate in our dashboard started to increase slightly, but permanently, around the time the change shipped. The favored hypothesis is that the increased data size for some users leads to lower chances of the data getting uploaded.

So, what went wrong?

To be clear, this is built as a bad example. There is a whole bunch of things that could be learnt from the above example; from getting expert review to adding instrumentation options to building the data system for robustness on both ends. However, here i want to highlight how the lack of a limit for the string length propagated through the system.

No software component exists in isolation. Looking at a high-level data flow through a product analytics system, any component in this system has a set of important parameters with resulting trade-offs from our design choices. The flexibility of a component in an early stage puts fewer constraints on the data that flows through, which propagates through the system and enlarges the problem space for each component after it.

The unbound string length of the data collection system here means that we know less about the shape of data we collect, which impacts all components in the later stages. Choosing explicit limits on incoming data is critical and allows us to reason about the behavior of the whole system.

Find the right limit

Choosing a limit is important, but that doesn’t mean we should restrict our data input as much as we can. If we pick limits that are too strict, we end up blocking use-cases that are legitimate but not anticipated. For each system that we build, we have to make a design decision on the scale from most strict to arbitrary values and weigh the trade-offs.

For me, my take-away is: Have a limit. Reason about it. Document it. The right limit will come out of conversations and lessons learnt — as long as we have one.