Introduction

Enterprise Knowledge Graphs (EKGs) have been on the rise and are incredibly valuable tools for harmonizing internal and external data relevant to an organization to improve operational efficiency for the enterprise and competitive advantage for the business units. On the other hand, EKGs can be difficult to develop and sustain, suffer from scalability issues, and can be difficult for business units to consume. This article describes some of these challenges and how a flexible data representation of a native multi-model database can address them (see Figure 1).





Figure 1: The multi-model knowledge graph blends multiple data representations in one system.

What Is an Enterprise Knowledge Graph?

Knowledge graphs have been instrumental in creating trillions of dollars in wealth for companies like Google, Apple, Facebook, Twitter, Microsoft, LinkedIn, eBay, and Alibaba, which developed their own technology stacks to support knowledge graphs. By contrast, EKGs are developed on open-source and commercial graph database products to harmonize an organization’s content, data, and information assets in terms of industry or enterprise-specific knowledge models.

An EKG is a representation of an organization’s knowledge domain and artifacts that is understood by both humans and machines. It is a collection of references to your organization’s knowledge assets, content, and data that leverages a data model to describe the people, places, and things and how they are related.


Not all graphs are EKGs. Enterprises may have many business knowledge graph (BKG) solutions deployed, and an important distinction is that bespoke knowledge graphs built to address a specific business need (for example, next best action, recommendation, or impact analysis) are not EKGs. BKGs are built to support narrow business use cases, whereas EKGs are developed to supply high-quality harmonized data to multiple business units and address multiple use cases. In the next section, we will talk about the challenges and opportunities in leveraging EKGs to support business use cases.

EKG Challenges and Opportunities

EKGs contain valuable, high-quality data harmonized from multiple data sources. The advantage to business units is that an EKG eliminates the time and effort of integrating data sources to support business use cases. Current EKG solutions harmonize multiple disparate, heterogeneous source systems in terms of an enterprise conceptual model or ontology. The raw data is usually staged on distributed storage (Hadoop/HDFS, S3), and then a middleware cluster is used to extract, transform, and load (ETL) the data into a graph database cluster.

EKGs then support enterprise applications such as enterprise search. They also need to extract and transform EKG data into a variety of formats (documents, tables, key-value, and graph) to support business applications.





Figure 2: Impedance mismatches when harmonizing to graph and supplying data from the graph

EKGs often fail to realize their full potential for two reasons. First, enterprises struggle with the complex multi-source data logistics needed to harmonize data into graphs for an EKG. Second, business users struggle with the complex and unfamiliar knowledge graph representations and the lack of tooling needed to consume them. Organizations can expend massive effort to harmonize dozens to hundreds of data sources into an EKG, while solving data governance issues like data provenance and preservation of entitlements, only to face challenges in the last few hundred feet: getting their business units to leverage the high-quality, curated EKG data.

The essence of the problem is that the “all or nothing” conversion of data to graph causes an impedance mismatch (see Figure 2) between source data representations and EKGs and between EKGs and the way business units would like to consume and process their data with their tools. Multi-model based EKGs reduce data impedances by allowing diversity of representation in the knowledge graph, which allows agile incremental harmonization to graph as well as minimal transformation to data when needed by the consuming business units.

The Challenge of Harmonizing Many Data Sources to Graph

Enterprises need to harmonize a large number of disparate data sources. In general, the more relevant data sources that are harmonized, the greater the potential value to the enterprise. However, the cost of harmonizing data to the graph can increase exponentially with the number of data sources. This is why enterprises are eager to find ways to automate data harmonization and to apply agile methodologies to provide data harmonization based on needs.

Figure 3: EKG Data Harmonization Effort Increases Exponentially with Number of Data Sources

Complex knowledge representations are needed to represent the nuances of disparate data and normalize to a graph structure. All relevant source data consumed and syndicated by the knowledge graph needs to be transformed to graph structure in a single model graph database. Mapping source data to these complex knowledge graph representations requires time, effort, and knowledge.

The resulting EKGs can stress the ability of graph databases to perform at scale and can require huge amounts of resources. The truth is that there will always be more data than graph databases are able to scale to, particularly when you consider the practical scale of data housed in key-value and document stores (see Figure 4).

Figure 4: Graphs handle data complexity, whereas document and key-value handle scale.

Multi-model databases are able to blend key-value, document, joins, and graph data models in a way that allows them to scale and, at the same time, simplify the graph representations needed. For example, cybersecurity information in an enterprise grows at a rate of many trillions of edges per year when represented as a pure graph. The same enterprise cybersecurity graph could be represented in billions of edges when combining graph, documents, and joins.
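The difference in edge counts can be illustrated with a small sketch. The code below is not taken from any particular product. The event record, the field names, and both helper functions are hypothetical, and the sketch simply assumes that a pure graph stores one edge per attribute value while a multi-model graph keeps attributes inside the vertex document and creates edges only for true relationships.

```python
from typing import Any


def pure_graph_edges(entity_id: str, record: dict[str, Any]) -> list[tuple[str, str, Any]]:
    """Flatten a record into one edge per attribute value (triple style)."""
    edges = []
    for key, value in record.items():
        if isinstance(value, dict):
            # A nested object becomes its own node plus one edge per nested value.
            child_id = f"{entity_id}/{key}"
            edges.append((entity_id, key, child_id))
            edges.extend(pure_graph_edges(child_id, value))
        else:
            edges.append((entity_id, key, value))
    return edges


def multi_model_edges(entity_id: str, record: dict[str, Any], link_fields: tuple[str, ...]):
    """Keep attributes in the vertex document; create edges only for link fields."""
    document = {k: v for k, v in record.items() if k not in link_fields}
    document["_key"] = entity_id
    edges = [(entity_id, field, record[field]) for field in link_fields if field in record]
    return document, edges


# Hypothetical cybersecurity event, used only for illustration.
event = {
    "timestamp": "2019-08-01T12:00:00Z",
    "severity": "high",
    "source_host": "hosts/web-01",
    "target_host": "hosts/db-02",
    "details": {"protocol": "tcp", "port": 5432, "bytes": 1024},
}

flat = pure_graph_edges("events/1", event)
doc, links = multi_model_edges("events/1", event, ("source_host", "target_host"))

print(len(flat), "edges as a pure graph")       # 8
print(len(links), "edges as a multi-model graph")  # 2
```

In this toy case the pure graph needs eight edges for a single event, whereas the blended form needs two edges plus one document. The ratio in a real deployment depends entirely on how wide the source records are and how many fields represent real relationships.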

Enterprises looking for ways to reduce the effort needed to develop and maintain EKGs often ask questions like:

Can we automatically classify, map, and transform source data to knowledge graph?

Can we automatically refactor EKGs when the conceptual model changes?

Can we search over source, knowledge graph, and curated data?

No practical solutions exist yet for automating data harmonization to a graph. This article focuses on challenging the key assumptions underlying EKGs: that the EKG must be a monolithic graph model and that all data must be converted to a graph to be useful. EKG deployment and sustainment effort can be reduced, and the potential scale of EKGs increased, by relaxing these assumptions and allowing the EKG to contain other data models. This would make EKG development and sustainment more dynamic and agile. Knowledge graphs that permit other data models allow staging data and graphs to exist in the same database, so graph harmonization can be delayed until it is needed and then tackled in an agile and iterative way.
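As a rough illustration of staging data and graph edges side by side, the following sketch uses a toy in-memory store. The `MultiModelStore` class, its methods, and the collection names are all hypothetical and do not correspond to any real database API. They only show the idea that source documents are searchable as soon as they are staged, and that a single relationship can be promoted to graph edges later, when a use case calls for it.

```python
from collections import defaultdict


class MultiModelStore:
    """Toy in-memory stand-in for a multi-model database (hypothetical API)."""

    def __init__(self):
        self.collections = defaultdict(dict)  # collection name -> {key: document}
        self.edges = set()                    # (from_id, label, to_id)

    def stage(self, collection, key, document):
        """Load a source document as-is, with no graph modeling."""
        self.collections[collection][key] = document

    def harmonize(self, collection, field, label, target_collection):
        """Promote one field of a staged collection to edges; return edges added."""
        added = 0
        for key, doc in self.collections[collection].items():
            target = doc.get(field)
            if target is None:
                continue
            edge = (f"{collection}/{key}", label, f"{target_collection}/{target}")
            if edge not in self.edges:
                self.edges.add(edge)
                added += 1
        return added

    def search(self, text):
        """Naive substring search over every document, harmonized or not."""
        needle = text.lower()
        hits = []
        for collection, docs in self.collections.items():
            for key, doc in docs.items():
                if any(needle in str(value).lower() for value in doc.values()):
                    hits.append(f"{collection}/{key}")
        return sorted(hits)


store = MultiModelStore()
store.stage("crm_accounts", "a1", {"name": "Acme Corp", "owner": "u7"})
store.stage("support_tickets", "t1", {"subject": "Acme login issue", "account": "a1"})
store.stage("support_tickets", "t2", {"subject": "Billing question", "account": "a2"})

# Staged data is searchable immediately, before any graph modeling.
staged_hits = store.search("acme")

# Later, when a use case needs it, harmonize just one relationship.
added = store.harmonize("support_tickets", "account", "about_account", "crm_accounts")
```

Because `harmonize` is idempotent in this sketch, it can be rerun as new source data is staged, which is one way to read the agile, iterative approach described above.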

The Challenge of Making EKGs Easily Consumable

The complex knowledge representations that are needed to represent the nuances of the data and normalize to a graph structure are also an impediment to business users. Business users struggle with the complex representations and unfamiliar data formats used in knowledge graphs and the lack of tooling needed to consume them. Common EKG questions are:

Does it work with the tools I am using?

Will my developers know how to use it?

How do I find relevant data?

How do I bound the data I want?

How do I get the data in the format that I need?

The essence of the challenge is that there is an impedance mismatch between EKGs and the way business units would like to consume and process their data with their tools. It would be a perfect world if everyone worked with graph data, but graphs are the exception, not the rule.

For example, a business unit might need all of the trades from January 2017 to December 2019 for politically exposed customers and their direct family members, and require this data to be delivered as a JSON document collection with a particular document structure. They do not want to learn or use a graph query language to do this. What they want is a data shopping experience: they visit the EKG store and search the EKG shopping catalog for data using faceted filters, the EKG store recommends data sets as well as data that complements their own, and then they specify how and when they want it delivered.
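To make the trades request concrete, here is a minimal sketch of what the EKG store might do behind the scenes on the consumer's behalf. All of the data, field names, and the output document structure are invented for illustration. The sketch assumes customers and trades are held as documents, family relationships are held as edges, and the result is delivered as JSON.

```python
import json
from datetime import date

# Hypothetical sample data: customer documents, family edges, trade documents.
customers = {
    "c1": {"name": "Alice", "politically_exposed": True},
    "c2": {"name": "Bob", "politically_exposed": False},
    "c3": {"name": "Carol", "politically_exposed": False},
}
family_edges = [("c1", "c2")]  # undirected "direct family member" relationship
trades = [
    {"id": "t1", "customer": "c1", "date": "2017-03-01", "amount": 1000},
    {"id": "t2", "customer": "c2", "date": "2019-11-15", "amount": 250},
    {"id": "t3", "customer": "c3", "date": "2018-06-30", "amount": 75},
    {"id": "t4", "customer": "c1", "date": "2020-01-02", "amount": 500},
]


def customers_in_scope():
    """Politically exposed customers plus their direct family (one graph hop)."""
    exposed = {cid for cid, c in customers.items() if c["politically_exposed"]}
    family = set()
    for a, b in family_edges:
        if a in exposed:
            family.add(b)
        if b in exposed:
            family.add(a)
    return exposed | family


def trades_for_scope(start, end):
    """Filter trade documents by scope and date, then reshape for delivery."""
    scope = customers_in_scope()
    return [
        {
            "tradeId": t["id"],
            "customerName": customers[t["customer"]]["name"],
            "tradeDate": t["date"],
            "amount": t["amount"],
        }
        for t in trades
        if t["customer"] in scope and start <= date.fromisoformat(t["date"]) <= end
    ]


result = trades_for_scope(date(2017, 1, 1), date(2019, 12, 31))
print(json.dumps(result, indent=2))
```

The graph is used only for the one-hop family traversal, while filtering and delivery operate on documents, so the consumer never has to see a graph query language.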

Multi-Model Enterprise Knowledge Graph

Multi-model enterprise knowledge graphs (MMEKGs) can alleviate many of the issues described earlier by allowing users to blend and manage source, EKG, and curated data representations in one ecosystem.

Reduced Time and Cost

MMEKGs allow graph transformation to be delayed until needed. Multi-model graphs also tend to reduce the size of graphs because they allow edges and vertices to contain documents. This allows EKGs to be developed using agile, iterative processes.

Figure 5: Knowledge graph data harmonization is more efficient using multi-model graphs

Reduced Computing Resources

EKG solutions often require separate data systems for staging, graph ETL, graph management, and delivering data to consuming business units (see Figure 6). MMEKGs can eliminate the impedance mismatch between source data, knowledge graph, and curated business data allowing the data to be managed in one system, thus reducing transformation latencies and making all data searchable. This reduces the cost of having separate clusters for staging, transformation, graphs, and business applications (see Figure 7).



Figure 6: A typical EKG Ecosystem uses multiple systems for staging and transformation





Figure 7: Source data, EKG, and curated business data can be managed in the same multi-model database.

Ease of Use

A multi-model database makes source data, knowledge graph, and business application data searchable and findable in the same data system. Business users can consume the data in their own formats, without having to understand the complex enterprise graph models. Enterprise users can search for source data as well as curated data.

Data Lineage/Provenance

With data staged, transformed, and delivered in the same multi-model system, it is much easier to keep track of data lineage.

Enhance Existing EKGs

Enterprises that have RDF EKGs can preserve the effort put into them and leverage them in an MMEKG. Multi-model databases can ingest RDF ontologies and RDF EKGs because the multi-model graph is a superset of the labeled directed graphs that RDF is based on. Similarly, multi-model graphs subsume property graphs, making it easy to absorb property graph-based EKGs.
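One way to picture RDF ingestion is to fold literal-valued triples into vertex documents and keep resource-valued triples as edges. The sketch below is an assumption about how such a mapping could look, not a description of any product's importer. The `Triple` type, the `is_literal` flag, the `fold_triples` function, and the `source`/`target` field names are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    is_literal: bool  # True when the object is a literal value, not a resource


def fold_triples(triples):
    """Map RDF-style triples to vertex documents plus labeled edges."""
    vertices = {}
    edges = []
    for t in triples:
        doc = vertices.setdefault(t.subject, {"_id": t.subject})
        if t.is_literal:
            # Literal values become (multi-valued) properties of the vertex document.
            doc.setdefault(t.predicate, []).append(t.obj)
        else:
            # Resource-valued triples stay as labeled, directed edges.
            vertices.setdefault(t.obj, {"_id": t.obj})
            edges.append({"source": t.subject, "target": t.obj, "label": t.predicate})
    return vertices, edges


triples = [
    Triple("ex:acme", "rdf:type", "ex:Organization", False),
    Triple("ex:acme", "rdfs:label", "Acme Corp", True),
    Triple("ex:alice", "ex:worksFor", "ex:acme", False),
    Triple("ex:alice", "foaf:name", "Alice", True),
]

vertices, edges = fold_triples(triples)
print(len(vertices), "vertices,", len(edges), "edges")  # 3 vertices, 2 edges
```

Every triple survives the mapping, either as a property or as an edge, which is consistent with the claim that the labeled directed graph is a subset of what the multi-model graph can express.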





Figure 8: The multi-model EKG can ingest RDF and property graph-based EKGs

Conclusion

Multi-model is a useful enabling technology for EKGs. The benefits include streamlining multi-source data to EKGs, increasing the usability of EKG data for business use cases, enabling greater scale by blending models, and reducing EKG ecosystem footprint.
