By Anmol Rajpurohit.

Editor's note: This post was originally included as an answer to a question posed in our 17 More Must-Know Data Science Interview Questions and Answers series earlier this year. The answer was thorough enough that it was deemed to deserve its own dedicated post.



Disclaimer: Any views or opinions presented in this article are solely those of the author and do not reflect the views of his employer or other affiliated institutions.

The most common data quality issues observed when dealing with Big Data can be best understood in terms of the key characteristics of Big Data – Volume, Velocity, Variety, Veracity, and Value.

Volume:

In a traditional data warehouse environment, comprehensive data quality assessment and reporting was at least possible (if not ideal). In Big Data projects, however, the scale of the data makes it impossible. Thus, data quality measurements can at best be approximations (i.e. they need to be expressed as probabilities and confidence intervals, not as absolute values). We also need to redefine most data quality metrics around the specific characteristics of the Big Data project, so that each metric has a clear meaning, can be measured (to a good approximation), and can be used to evaluate alternative strategies for data quality improvement.
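To make the idea of an approximate measurement concrete, here is a minimal sketch that estimates the share of records missing a field from a random sample and reports a confidence interval rather than a single number. The Wilson score interval is one common choice, not a method prescribed by this article, and the function names, the `field` argument, and the record layout (a list of dicts) are assumptions made for illustration.

```python
import math
import random

def wilson_interval(defects, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n == 0:
        raise ValueError("sample must not be empty")
    p_hat = defects / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def estimate_missing_rate(records, field, sample_size, seed=0):
    """Estimate the missing-value rate of `field` from a simple random sample.

    `records` is assumed to be a list of dicts (a hypothetical layout).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    missing = sum(1 for r in sample if r.get(field) is None)
    low, high = wilson_interval(missing, len(sample))
    return missing / len(sample), low, high
```

A report built this way would state, for example, that the missing rate lies between the lower and upper bound with about 95% confidence, instead of claiming an exact value for the full data set.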

Despite the great volume of underlying data, it is not uncommon to find that some desired data was not captured or is unavailable for other reasons (such as high cost or delays in obtaining it). It is ironic but true that data availability continues to be a prominent data quality concern in the Big Data era.

Velocity:

The tremendous pace of data generation and collection makes it incredibly hard to monitor data quality with a reasonable overhead in time and resources (storage, compute, human effort, etc.). So, by the time a data quality assessment completes, the output might be outdated and of little use, particularly if the Big Data project serves real-time or near-real-time business needs. In such scenarios, you would need to redefine data quality metrics so that they are relevant as well as feasible in the real-time context.

Sampling can help you speed up data quality efforts, but this comes at the cost of some bias or sampling error (which eventually makes the end result less useful), because samples are rarely an accurate representation of the entire data. Smaller samples give higher speed, but with a larger error.

Another impact of velocity is that you might have to assess data quality on the fly, i.e. somewhere within the data collection, transfer, or storage processes, because the tight time constraints do not give you the luxury of copying a selected data subset, storing it elsewhere, and running data quality assessments on it.
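As a rough illustration of an on-the-fly assessment, the sketch below wraps a record stream in a generator that tallies simple quality counters while passing every record through unchanged, so no separate copy of the data is needed. The function name, the field names, and the sample records are hypothetical, and a real pipeline would plug such a check into whatever ingestion framework it already uses.

```python
from collections import Counter

def monitor_stream(records, required_fields, stats):
    """Yield each record unchanged while tallying quality counters in `stats`."""
    for record in records:
        stats["total"] += 1
        for field in required_fields:
            # Treat an absent key, None, or an empty string as missing.
            if record.get(field) in (None, ""):
                stats[f"missing:{field}"] += 1
        yield record

# Hypothetical incoming records, used only to show the mechanics.
stats = Counter()
incoming = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3},
]
# Downstream processing consumes the generator; here we simply collect it.
stored = list(monitor_stream(incoming, ["id", "amount"], stats))
```

The counters are updated as a side effect of normal consumption, so the cost of the check is a few comparisons per record rather than a second pass over stored data.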

Variety:

One of the biggest data quality issues in Big Data is that the data includes several data types (structured, semi-structured, and unstructured) coming in from different data sources. Thus, a single data quality metric will often not apply to the entire data, and you would need to define data quality metrics separately for each data type. Moreover, assessing and improving the data quality of unstructured or semi-structured data is far trickier and more complex than doing so for structured data. For example, when mining physician notes from medical records across the world (related to a particular medical condition), even if the language (and the grammar) is the same, the meaning might be very different due to local dialects and slang. This leads to low data interpretability, another data quality measure.

Data from different sources often has serious semantic differences. For example, “profit” can have widely varied definitions across the business units of an organization or external agencies. Thus, fields with identical names may not mean the same thing. This problem is made worse by the lack of adequate and consistent metadata from each data source. To make sense of data, you need reliable metadata (for example, to make sense of sales numbers from a store, you need other information such as date-time, items purchased, coupons used, etc.). Usually, many of these data sources are outside the organization, and thus it is very hard to ensure good metadata for such data.

Another common issue is syntactic inconsistencies. For example, “time-stamp” values from different sources would be incompatible unless they are captured along with the time zone information.
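The time zone point can be sketched in a few lines. Assuming the sources emit ISO 8601 strings with an explicit UTC offset (an assumption; real feeds vary), timestamps can be normalized to UTC before comparison, and values that carry no time zone can be rejected rather than guessed at. The helper name `to_utc` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timezone

def to_utc(timestamp_text):
    """Parse an ISO 8601 timestamp and convert it to UTC; reject naive values."""
    parsed = datetime.fromisoformat(timestamp_text)
    if parsed.tzinfo is None:
        # Without an offset we cannot know which instant this refers to.
        raise ValueError(f"timestamp has no time zone: {timestamp_text!r}")
    return parsed.astimezone(timezone.utc)

# Two readings that look six hours apart but describe the same instant.
a = to_utc("2016-03-01T09:00:00-05:00")
b = to_utc("2016-03-01T15:00:00+01:00")
same_instant = (a == b)
```

Here both values normalize to 14:00 UTC, which is only discoverable because each source recorded its offset.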

Veracity:

Veracity, one of the most overlooked Big Data characteristics, is directly related to data quality, as it refers to the inherent biases, noise and abnormality in data. Because of these veracity issues, the data values might not be exact real values; rather, they might be approximations. In other words, the data might have some inherent imprecision and uncertainty. Besides data inaccuracies, Veracity also includes data consistency (defined by the statistical reliability of data) and data trustworthiness (based on data origin, data collection and processing methods, security infrastructure, etc.). These data quality issues in turn impact data integrity and data accountability.

While the other V’s are relatively well-defined and can be easily measured, Veracity is a complex theoretical construct with no standard approach for measurement. In a way this reflects how complex the topic of “data quality” is within the Big Data context.

Data users and data providers are often different organizations with very different goals and operational procedures. Thus, it is no surprise that their notions of data quality are very different. In many cases, the data providers have no clue about the business use cases of data users (data providers might not even care about it, unless they are getting paid for the data). This disconnect between data source and data use is one of the prime reasons behind the data quality issues symbolized by Veracity.

Value:

The Value characteristic connects directly to the end purpose. Organizations are harnessing Big Data for many diverse business pursuits, and those pursuits are the real drivers of how data quality is defined, measured, and improved.

A common and long-standing definition of data quality is that it is the “fitness for use” for the data consumer. This means that data quality depends on what you plan to do with the data. Thus, for a given data set, two organizations with different business goals will most likely have widely different measurements of data quality. This nuance is often not well understood: data quality is a “relative” term. A Big Data project might involve incomplete and inconsistent data; however, it is possible that those data quality issues do not impact the utility of the data towards the business goal. In such a case, the business would say that the data quality is great (and will not be interested in investing in data quality improvements). For example, for a producer of mashed potato cans, a batch of small potatoes would be of the same quality as a batch of big potatoes. However, for a fast food restaurant making fries, the quality of the two batches would be radically different.

The Value aspect also brings in the “cost-benefit” perspective to data quality: whether it would be worth resolving a given data quality issue, which issues should be resolved first, etc.

Putting it all together:

Data quality in Big Data projects is a very complex topic, where the theory and practice often differ. I haven’t yet come across any standard theory that is widely accepted. Rather, I see little interest in the industry in working towards one. In practice, data quality does play an important role in the design of Big Data architecture. All data quality efforts must start from a solid understanding of the high-priority business use cases, and use that insight to navigate various trade-offs (samples given below) to optimize the quality of the final output.

Sample trade-offs related to data quality:

Is it worth improving the timeliness of data at the expense of data completeness and/or an adequate assessment of accuracy?

Should we select data for cleaning based on cost of cleaning effort or based on how frequently the data is used or based on its relative importance within the data models consuming it? Or, a combination of those factors? What sort of combination?

Is it a good idea to improve data accuracy by getting rid of incomplete or erroneous data? While removing some data, how do we ensure that no bias is introduced?

Given the enormous scope of work and very limited resources (relatively!), one common approach to data quality efforts on Big Data projects is the baseline approach, in which the data users are surveyed to identify and document the bare minimum data quality needed to ensure that the business processes they support are not disrupted. These minimum satisfactory levels of data quality are referred to as the baseline, and the data quality efforts focus on ensuring that the quality of each data set does not fall below its baseline level. This looks like a good starting point, and you may later move on to more advanced endeavors (based on business needs and available budget).
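One way the baseline approach might be operationalized is a simple check that compares measured quality scores against the documented minimums and flags anything that falls short. This is only a sketch: the data set names, the scores (taken here to be completeness ratios between 0 and 1), and the function name `below_baseline` are all hypothetical.

```python
def below_baseline(measured, baseline):
    """Return {dataset: (measured, required)} for data sets below their baseline.

    A data set with no measurement at all is also reported, since its
    quality cannot be shown to meet the baseline.
    """
    failures = {}
    for dataset, required in baseline.items():
        score = measured.get(dataset)
        if score is None or score < required:
            failures[dataset] = (score, required)
    return failures

# Hypothetical baselines gathered from data users, and the latest measurements.
baseline = {"orders": 0.98, "clickstream": 0.90, "inventory": 0.95}
measured = {"orders": 0.99, "clickstream": 0.85}
failures = below_baseline(measured, baseline)
# clickstream falls short of its baseline; inventory has not been measured.
```

Keeping the baselines in a plain mapping like this makes it easy to revisit them with the data users as business needs change.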

Summary of Recommendations to improve data quality in Big Data projects:

Identify and prioritize the business use cases (then, use them to define data quality metrics, measurement methodology, improvement goals, etc.)

Based on a strong understanding of the business use cases and the Big Data architecture implemented to achieve them, design and implement an optimal layer of data governance (data definitions, metadata requirements, data ownership, data flow diagrams, etc.)

Document baseline quality levels for key data (think of “critical-path” diagram and “throughput-bottleneck” assessment)

Define ROI for data quality efforts (in order to create feedback loop on the ROI metric to improve efficiency and to sustain funding for data quality efforts)

Integrate data quality efforts (to achieve efficiency through minimizing redundancy)

Automate data quality monitoring (to reduce cost as well as to let employees stay focused on complex tasks)

Do not rely on machine learning to automatically take care of poor data quality (machine learning is science and not magic!)
