Data isn't the only thing that needs to be governed in big data systems. The queries run by data scientists and other users also have to be watched to make sure they don't bog down processing in Hadoop and Spark clusters.

Hadoop performance problems became an issue at BT Group PLC after use of its data lake environment started rising rapidly in early 2016 as production applications began proliferating. "We had a bow wave of demand from users," said Jason Perkins, head of business insight and analytics architecture at the London-based company.

Eventually, the communications and TV services provider had to "close the doors" to new users for a few months while it added more compute nodes to the Hadoop system, Perkins said. Properly balancing the "very mixed workload" of big data processing jobs remains a challenge, he added. And it could become a greater challenge -- BT plans to expand the number of applications in the cluster from about 100 as of April to 500 by year's end.

A fix for what ails Hadoop queries

LinkedIn Corp. ran into similar issues in its Hadoop and Spark environment, which has grown to more than 10,000 nodes across multiple clusters accessed by thousands of users. In particular, the company found that overall processing performance would suffer if individual jobs weren't tuned properly, said Carl Steinbach, a senior staff engineer at LinkedIn who heads its Hadoop development team.

At first, the Hadoop team tried to avoid such problems by meeting with users to review proposed queries and suggest changes. But that could take weeks -- and then the users had to "get back in the queue for another meeting," Steinbach said. "It wasted a lot of time for both them and my team."

To accelerate the process, LinkedIn developed a tool called Dr. Elephant that monitors Hadoop performance and identifies problematic big data queries. The web-based tool runs on its own cluster node, continually analyzing system logs to find "sick jobs" and then offering advice on how to "treat" them, Steinbach explained.

The Mountain View, Calif., company, now owned by Microsoft, began using Dr. Elephant in 2015 and open sourced it last year. In tracking queries, Dr. Elephant provides "sort of a soft-governance model," Steinbach said. "It does shine a light on what's happening in the cluster. Everyone has a view of what everyone else is doing, and that motivates people to do the right thing."

Software vendor Pepperdata this year added a product based on Dr. Elephant to its set of tools for managing Hadoop clusters and governing their use. A variety of other commercial and open source cluster management tools are also available from big data platform vendors such as Cloudera and Hortonworks as well as third-party software developers akin to Pepperdata.