The Wild West days of big data may not be totally gone, but it sort of felt that way this week on the Strata Data Conference’s Expo floor. Yes, there were vendors hawking the latest machine learning technology, but there were an equal number of vendors targeting basic tasks around data management, such as finding it, cleaning it, governing it, and making sure it’s not going to get your company fined $1 billion or get your CIO fired.

The lure of AI is strong, and many companies want to utilize machine learning to automate decision-making on data. And they should seek that goal, because technological breakthroughs and the rapid pace of data generation are making that possible. But for many companies, their data desires simply are more aspirational at this point. The percentage of big data projects that end in success is below where many executives thought it would be at this point.

The culprit, in many cases, is the data itself. To sum it up, the data is often a mess.

Big data poses all sorts of problems for companies. It can be hard to find, hidden in a data lake or squirreled away across a dozen databases or SaaS apps. It can be untrustworthy without a reliable history or lineage. The data can contain unconsented personally identifiable information (PII) that’s protected by GDPR and, soon, the California Consumer Privacy Act (CCPA). Or it could be sitting in a misconfigured S3 bucket, a sitting duck for clever hackers.

You’ve undoubtedly heard about the difficulties and dangers of data before. If you haven’t, then you may want to subscribe to the Datanami newsletter.

Enterprise Data Security

One of the younger companies exhibiting at the Javits Center is Privacera. The data governance and security company was founded by Don Bosco Durai and Balaji Ganesan, the original developers of the open source Apache Ranger and Apache Atlas projects at Hortonworks.

Privacera essentially is using those two open source technologies as the basis for a solution that helps customers discover and catalog sensitive data across all of their data stores, both cloud and on prem (Apache Atlas), and then lets users reach that data through fine-grained access control (Apache Ranger).

“We see this not as a Hadoop problem,” says Jeff Kelly, Privacera’s vice president of marketing. “This is a horizontal enterprise-wide data problem. How do you govern and secure your data across all these heterogeneous systems, whether they’re in the cloud or premises? So that’s basically why Privacera exists, to help solve that problem.”

The technology for hooking into Hadoop data stores and then providing fine-grained access control already exists in the form of Ranger and Atlas. Privacera’s value-add is extending the reach into all the other systems that are in play today. That includes public cloud offerings like Amazon Redshift, Microsoft Azure SQL Data Warehouse, and Google BigQuery. It includes third-party cloud offerings like Snowflake and Databricks. And it includes existing on-prem solutions like Teradata and Vertica.

“A lot of customers will come to us and say, in the past data governance and security was important, but it wasn’t necessarily holding us back from doing some things, depending on the industry,” says Kelly, who previously worked as an analyst at Wikibon.

“What we’re finding now is customers are saying they’re so concerned about privacy and governance, we’re holding data back from users, and in order to open it up to them, we have to be able to govern them,” he continues. “We see this directly tied to maximizing the value of your data. It’s not just governing just to check a box. ‘Oh we comply with GDPR.’ Oh great. The more fine-grained access control you have, the more data you can open up to people.”

In Search of Pristine Data

Another company in the space worth keeping an eye on is Ataccama. The firm was founded in 2008 with a goal of developing data quality and master data management (MDM) tools designed for the modern enterprise.

Those two offerings were adopted by around 200 customers, such as Starbucks, American Airlines, the Toronto Public Library, and GlaxoSmithKline, as well as another 100 customers through its OEM partner, Information Builders. Ataccama’s offerings also drew the attention of Gartner, which has included them in the Leaders quadrant of the Magic Quadrants for data quality and MDM for the past several years.

About three years ago, as the big data boom continued, Ataccama expanded into adjacent areas of data management and rolled out two new products: a data discovery and data profiling tool, and a data catalog and business glossary tool.

Customers can use these four offerings separately. But the real advantage comes in using them together, according to Vladimir Emelianov, big data practice lead for Ataccama. That’s where the tight integration and cohesion really allow users to get the most out of their data.

“We are good at covering different data management use cases,” Emelianov says. One of the big advantages is “how easy it is to actually move on from one use case to another. We hear complaints that [other] tools don’t easily integrate with each other. So to do data quality and MDM, it’s completely different software and tools.”

Bolstering its story is the fact that Collibra, with which Ataccama competes on the data catalog front, is a partner when it comes to data quality. “We have been in the data quality market for 10 years and we are doing it in a much better, more powerful way,” Emelianov says. “They realize that Collibra is not enough for data quality.”

While Ataccama is named after the Atacama Desert in northern Chile — one of the driest and most pristine places in the world — the company is based in Prague, the Czech Republic. Sixty percent of its customers are based in North America, and another 30% hail from the UK, which seems fitting.

A Model for Data Governance Success

One of the newer faces on the big data management scene is erwin. You might recognize the company’s name from the popular data modeling tool formerly sold by CA Technologies. About three years ago, Adam Famularo and his partner Jim McGarry, in concert with Parallax Capital Partners, acquired erwin from CA with plans to turn it into a full-stack data governance provider.

Famularo and McGarry orchestrated the acquisition of several other technology companies to flesh out the fledgling company’s portfolio. In addition to the core data modeler, erwin bought or developed a data catalog, a business process modeling tool, a metadata management and governance company, a data harvesting tool, and a data governance consulting firm.

“Over the course of three years, we bought five total assets,” Famularo told Datanami at Strata. “We built and innovated new technology on our own, and at the end of the day, today we go into large enterprise and provide a comprehensive approach to data governance.”

Big data projects are often stymied by bad data. By helping to ensure the data is well governed, erwin aims to let downstream data analysts and data scientists get more value from it.

“If you actually put our technologies into play, we will help you understand where all the data is, how it’s all structured today, create similar names and terms, and create all the lineage, so that when you’re doing data science and data analytics work, you’re using higher quality data than you currently have today,” Famularo said. “I always had a bigger vision, while working at Verizon and seeing how much IoT data was being created, that there was a bigger play here around data governance.”

Dirty Data Dancing

Trifacta CEO Adam Wilson understands the problems around data cleaning and data transformation like no other. In fact, because Trifacta is a leader in using ML to prepare data for more ML (or advanced analytics or BI), his company just completed a $100-million round of financing that cements its position as the undisputed leader in data prep.

“I think people have really understood this is where bottlenecks are, where a lot of cost is, frankly where, worst case scenario, if you don’t get your data quality right, you’re going to start automating bad decisions faster based on bad data, and that scares CIOs and CEOs and COOs to death right now,” Wilson said during an interview at Strata this week. “So this has become a burning platform for people to really fix their data challenges, and Trifacta plays a central role in that.”

As Wilson points out, Trifacta has been beating this drum for the past decade, ever since the technology underlying the company’s popular software emerged from Cal Berkeley and Stanford. But now, it seems as though the market is finally starting to get the message that succeeding with big data is hard, and it’s often due to the data itself. There’s no single technology or magic button you can press to make this stuff work.

“People tried that five, six, seven years ago. They pressed the button and the magic didn’t come out,” Wilson said. “And all of a sudden, people started scratching their heads, saying why is this so hard? Why are we not getting users and use cases adopting these modern architectures? Why are we not able to handle the explosion of the diversity of data that we’re dealing with? Why is data quality still crippling us? Those persistent problems didn’t go away. And I think to some extent this trend around AI and ML has now created a desire to say, OK guys let’s get this right.”

Trifacta’s message seems to be resonating, in light of the $100 million financing round it recorded two weeks ago. In any case, these were not the only software firms at Strata this week that are trying to solve these tough security, governance, and cleansing challenges. Cloudera itself talked a lot about how it’s planning to span security and governance for its data platform on prem and in the cloud.

Many other firms at Strata this week were beating the governance, security, and quality drum. They include Io-Tahoe, which has a unique take on using machine learning for data discovery and cataloging; Alation, which largely defined the data catalog space in the first place and continues to build innovative solutions; Collibra, which combines data cataloging with other aspects of governance; Okera, which functions as a control point for giving users access to data; and Immuta, which takes a data-centric approach to governance.

Keep an eye out, because we’ll be covering the offerings from these firms, and others, on these Datanami pages in the future.
