By Vitaly Dubravin

Data Warehouse and Business Intelligence (DW/BI) implementation is frequently considered a black hole for the IT budget. It is hard (almost impossible) to run a business without one, but requires substantial investment to reshape your existing transactional systems to clean “dirty” data, fix existing (but broken) business processes and formalize analytical needs for different departments. Vast majority of these costs have nothing to do with the Data Warehouse, but become noticeable during DW/BI project.

My team at GRT Corporation has deployed tens of Data Warehouses for Fortune 50, as well as smaller size companies. One of the most common challenges for DW/BI implementation is loose definition of business requirements by the business. This is not something that can be fixed, for a simple reason – it’s very hard to tell what you need for the system until you start using it every day. Good Data Warehouses are built in phases not because the scope is huge, but because user requirements are changing while the business is getting hands-on experience analyzing real data.

Business requirements uncertainty often leads to an excessive data collection and processing in the Data Warehouse. It is easier to bring source data into the system right away and let the business later decide whether to use it or not, rather than redesign a data model to add yet another field from the source.

Problem starts when you realize that not all business people should have access to all records in the Data Warehouse. Marketing Analyst working on a new campaign should be able to access actual customer’s data for his region, except credit card numbers, but may see only customer’s occupation and city/state for all other regions for comparison analysis. Data privacy rules may become very complex, implementation takes time and drains IT budget.

Data privacy is not a wishful thinking, but is a regulatory requirement for certain data categories. Legislators in the United States, European Union and other countries have passed tough data privacy laws and the financial implications for non-compliance may take you out of business. HIPAA and PII rules are the most known ones, but this is only the tip of the regulatory iceberg.

There are two major ways to address this issue during Data Warehouse implementation:

Include access rights validation into every report generated by DW/BI system, add exceptions handing triggered by row-level security violations and keep explaining business analysts why their data request in Excel came back empty.

violations and keep explaining business analysts why their data request in Excel came back empty. Ignore data access rules in the Data Warehouse reports engine, but integrate them with the Dynamic Data Masking engine. Analysts will see all the data they want, but only the part they are authorized to see will contain real values. Other values will be masked, but trends and patterns may be preserved!

First approach (this is what everyone uses today) is very expensive, slow and causes a lot of frustration on the business side, sleepless nights and missed deadlines in IT. It is a high maintenance (new development) and a time consuming solution. It works, but you can achieve the same (at least) result without new development by adjusting Data Masking rules utilizing the second option.

Data Masking engine contains two subsystems vital for successful data privacy implementation in Data Warehouse solution – rules engine and masking engine. Both components, depending on the vendor, may operate in static or dynamic mode that gives us several implementation options:

Static Data Masking – static rules (driven by data type, not by data content), static (batch) data conversions. This mode is primarily used to generate a sample dataset for system developers and testers. It can also be used in the Data Warehouse project for ETL feeds. For example, one of the privacy rules may require you to hide/mask credit card information (like display only the first 5 and last 4 digits for any DW report). This transformation may happen before data comes to the Data Warehouse and the system, as a result, will not contain any sensitive data at all and will not be a subject of privacy auditing.

– static rules (driven by data type, not by data content), static (batch) data conversions. This mode is primarily used to generate a sample dataset for system developers and testers. It can also be used in the Data Warehouse project for ETL feeds. For example, one of the privacy rules may require you to hide/mask credit card information (like display only the first 5 and last 4 digits for any DW report). This transformation may happen before data comes to the Data Warehouse and the system, as a result, will not contain any sensitive data at all and will not be a subject of privacy auditing. Semi-Dynamic Data Masking – static rules, dynamic (on the fly) masking. This conversion requires extra horse power for real-time conversion and should work really well for near real-time and real-time Data Warehouse ETL processes, but can be applied to traditional DW systems as well. Same masking process may be used for some reports generated right from the ODS (Operational Data Store). This will let you reuse the same report template for all user categories without new development.

– static rules, dynamic (on the fly) masking. This conversion requires extra horse power for real-time conversion and should work really well for near real-time and real-time Data Warehouse ETL processes, but can be applied to traditional DW systems as well. Same masking process may be used for some reports generated right from the ODS (Operational Data Store). This will let you reuse the same report template for all user categories without new development. Dynamic Data Masking – dynamic (data driven) rules, dynamic masking. This is the most flexible masking option. It gets handy when OLAP cubes are added to the design. What if I’m allowed to see every department’s total salary, but can see real numbers for people in my team only? I can still use OLAP aggregates with a 100% accurate data, but dynamic data masking will be used for drill-down capabilities. I’ll never have access to the raw fact table and a sanitized version will be generated on the fly based on my access permissions. Dynamic rules/masking engine should be smart enough to adjust individual salaries to match OLAP’s total for each department.

All three Data Masking options, when included into the initial Data Warehouse design let you change data visibility on the fly without new and expensive development, testing and redeployment. This method can be integrated with both custom-made and off-the-shelf DW/BI products and give you a chance to become instantly compliant without any additional development investments. Regulation or internal business policy changes affect only Data Masking rules definitions and are immediately propagated to the existing production environment.

Data security/privacy violations (often involuntary) are happening all across the globe and may significantly damage business reputation. Data Masking limits real data exposure and secures personal information, while eliminating unnecessary development activities. It drives IT costs down and improves corporate IT security simultaneously.

(originally posted on CIO.com)