There has been a marked uptick in attention around machine learning, but with relatively few large-scale systems in production (and even fewer public stories about progress and roadblocks), the wider story is all about the potential and far less about what can go wrong.

As we have covered here, building machine learning systems on the hardware front, while teeming with options, is not necessarily complex, at least in a relative sense. New tools, frameworks, and platforms are emerging constantly, reminiscent of the days when every company and startup had to have a hook that included the word “cloud” or “big data”. In many ways, machine learning, independent of its algorithmic developments over the years, is the product of both trends; mass attention can only be paid to something once the infrastructure and the data are in place as a foundation, after all. And machine learning, whether or not the assumption is correct, is widely seen as the natural next step in the evolution of analytics and data science over the last few years.

However, there is a caveat. Actually, there are many, but for these purposes, the big one is a silent lurker rather than a screaming, pinpointed flaw, and a problem that isn’t as sexy to debate as, say, cloud security, privacy, or the other mainstream issues that draw wide press attention. It has to do with building systems that are outwardly simple but are in fact laden with a cascading, compounding set of potential problems, many of which cannot be easily or immediately spotted and which threaten the still unsullied reputation of machine learning as the next essential phase of evolution beyond mere data analysis.

The software frameworks and various algorithms are also not in themselves cobbled together from wholly unknown parts. So what is it, then, that makes deploying and using machine learning systems in production a challenge? The answer cannot be extracted from either the hardware or the software exclusively; it is a system-level one.

While we have already described some problems with machine learning systems in the enterprise in particular, including the notorious black box problem, the real challenge, according to Google Research, can be summarized as technical debt—a concept common in software engineering, which refers to the code and system maintenance burden that comes with any big software project. In other words, there’s no free lunch in computing—and machine learning is certainly no exception. In the Google experience, “developing machine learning systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.” Such systems have “all the maintenance problems of traditional code plus an additional set of machine learning-specific issues.”

“At a system level, a machine learning model may silently erode abstraction boundaries. The tempting re-use or chaining of input signals may unintentionally couple otherwise disjointed systems. Machine learning packages may be treated as black boxes, resulting in large masses of ‘glue code’ or calibration layers that can lock in assumptions. Changes in the external world may influence system behavior in unintended ways—even monitoring system behavior may prove difficult without careful design.”
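To make the monitoring point concrete, here is a minimal sketch of one way a team might watch a single input signal for drift caused by changes in the external world. It is an illustration only and is not taken from the Google work: the function name `drift_alert`, the three-sigma threshold, and the sample values are all assumptions made for the example.

```python
from statistics import mean, stdev


def drift_alert(baseline, live, max_sigma=3.0):
    """Return True when the live mean of a signal has moved more than
    max_sigma baseline standard deviations away from the baseline mean.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    spread = stdev(baseline)  # sample standard deviation of the training-time values
    shift = abs(mean(live) - mean(baseline))
    if spread == 0:
        # A constant baseline: any movement at all counts as drift.
        return shift > 0
    return shift / spread > max_sigma


# Made-up values of one input signal, as seen at training time and in production.
training_values = [10, 11, 9, 10, 10, 11, 9, 10]
steady_traffic = [10, 10.5, 9.5]
shifted_traffic = [20, 21, 19]

print(drift_alert(training_values, steady_traffic))
print(drift_alert(training_values, shifted_traffic))
```

A check this crude would miss many kinds of change (a shift in variance, for instance), which is part of why the authors say monitoring is difficult without careful design.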

The authors argue that the extent of this debt might be hidden away for some time because it rests at the system level rather than the code level. As the technical debt concept goes, small problems compound when they are not addressed immediately, which is especially troublesome when it is not clear they are problems until it is too late. Ultimately, “typical methods for paying down code level technical debt are not sufficient to address machine learning-specific technical debt at the system level.”

Data dependencies create dependency debt, feedback loops create analysis debt, and configuration brings a debt of its own, alongside similar compounding complications in data testing, process management, and even culture. The Google authors argue for awareness of all of these at the outset and offer some insight about contending with them before they bankrupt the system. Further, with so little of any machine learning system’s code being dedicated to actual intelligence (most of it is “plumbing”), rerouting to pay down such debt is incredibly difficult. Between the glue code, pipeline jungles, and dead experimental codepaths, among other sinkholes, rooting out the trouble is harder still.
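As a rough illustration of how data dependencies couple systems, the sketch below keeps an explicit record of which input signals each model consumes, then asks which models are exposed when one signal changes. The model names, signal names, and helper functions are hypothetical and invented for this example; this is not the authors’ method, just one simple way to make such dependencies visible.

```python
from collections import defaultdict

# Hypothetical registry: each model and the input signals it reads.
MODEL_INPUTS = {
    "ad_ranker": {"click_rate", "user_region"},
    "spam_filter": {"click_rate", "message_length"},
}


def consumers_by_signal(model_inputs):
    """Invert the registry so each signal maps to the models that consume it."""
    consumers = defaultdict(set)
    for model, signals in model_inputs.items():
        for signal in signals:
            consumers[signal].add(model)
    return dict(consumers)


def affected_models(model_inputs, changed_signal):
    """List, in sorted order, the models that read the signal being changed."""
    return sorted(consumers_by_signal(model_inputs).get(changed_signal, set()))


# Changing one shared signal touches two otherwise unrelated models.
print(affected_models(MODEL_INPUTS, "click_rate"))
```

Even a toy registry like this shows why one or two extra dependencies are not free: every shared signal is a place where a change made for one model can quietly alter another.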

“Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice. Even the addition of one or two seemingly innocuous data dependencies can further slow progress.”

All systems at scale contend with similar issues, whether it’s pipelining, job scheduling problems, or multiple versions of code. And while there are tools built to work around many of these in established areas that already have a large number of large-scale systems in full production (HPC sites, for instance), machine learning systems, with so much glue code and so many moving parts, are especially vulnerable to falling down a technical debt hole.

While the Google authors provide notes about how such debt begins to accumulate, the purpose is more of a warning—and a well-timed one as more companies are moving beyond their traditional “big data” tools and looking for the next level of intelligent analysis in machine learning (which is arguably, along with deep learning, going to be the next hyped thing among the data analytics set). But as the authors warn, “paying down machine learning related technical debt requires a specific commitment, which can often only be achieved by a shift in culture. Recognizing, prioritizing, and rewarding this effort is important for the long term health of successful machine learning teams.”