Do you work with data in Azure? How do you ensure GDPR compliance? This blog post covers a few privacy pattern implementation examples that you might use in your solutions.

In his talk Privacy Ethics – A Big Data Problem, Raghu Gollamudi covers best practices for applying data protection rules. From an engineer's perspective, there are three privacy categories to be aware of: security, design, and process automation.

Let’s look at privacy implementation options with Azure Blob Storage, Azure Databricks, Azure Data Factory, and other Azure services.

This will be a rather technical post, so if you are interested in GDPR compliance on a higher level, you might take a look at my earlier blog post: Privacy Design Methodologies in Big Data World.

1. Security

How to encrypt data (transit, rest, backups)?

How to authenticate users?

Where to store credentials?

Where do I find audit logs?

First, visit Microsoft's GDPR Blueprints.

There Microsoft keeps reference architectures, deployment guidance, GDPR article implementation mappings, customer responsibility matrices (CRM), and threat models. Not perfect, but a good place to start.

Copyright by Microsoft

Next, you pick the required components and add security elements (e.g. NSG, ExpressRoute). Here is an Azure Databricks - Bring Your Own VNet reference architecture.

Copyright by Databricks

Later, if you are interested in end-to-end security, you should spend some time thinking about potential security problems. Microsoft has a tool called Threat Modeling Tool to help you identify problematic areas. The tool is really helpful, but the user experience is awful.

Copyright by Microsoft


2. Design

How to authorize users?

How to ensure data integrity?

What are the design guidelines for privacy?

User management

Ideally, you would create Azure AD users, put them into an Azure AD group, and set access in one place. In reality, there are many places where you need to create secrets, manage groups, and control access.


Privacy protection patterns

Unfortunately, there are no built-in data integrity checks inside Azure Data Factory or Databricks. It's up to you to implement them, based on the privacy pattern that suits your scenario best:

Privacy protection at the ingress

Scramble on arrival

Simple to implement

Limits incoming data = limited value extraction

Example - create custom SQL views: instead of reading data directly from tables, fetch records from a custom view.

```sql
CREATE VIEW masked.dwh_customer AS
SELECT
  id,
  (CASE WHEN email IS NULL THEN 0 ELSE 1 END) AS hasEmail,
  postalCode,
  customerType
FROM dwh.customer
```
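The same ingress masking can be sketched in plain Python for a single record; the field names follow the view above, and `mask_customer` is a hypothetical helper, not part of any Azure API:

```python
def mask_customer(row):
    """Reduce a raw customer record to the masked view's columns:
    the email address itself never leaves the function, only a flag."""
    return {
        "id": row["id"],
        "hasEmail": 0 if row.get("email") is None else 1,
        "postalCode": row["postalCode"],
        "customerType": row["customerType"],
    }
```

The key design choice is the same as in the view: downstream consumers learn *whether* a customer has an email address, which is often all analytics needs, without ever seeing the address.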

Privacy protection at the egress

Processing in an opaque box

Enabling

Strict export operations required

Exploratory analytics need explicit egress/classification

Anonymisation

Discard all PII (e.g., user id)

No link between records or datasets

Example - drop sensitive columns with Apache Spark

```python
df = spark.read.parquet("/mnt/data/customer")
df = df.drop("nin", "address", "phoneNumber")
```

A further improvement would be to get a list of sensitive columns for each dataset and skip them automatically.
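A minimal sketch of that improvement, assuming a hypothetical classification registry (`SENSITIVE_COLUMNS` and `columns_to_drop` are illustrative names, not an Azure or Spark API):

```python
# Hypothetical classification registry: dataset name -> sensitive columns.
# In practice this could live in a config file or a data catalog.
SENSITIVE_COLUMNS = {
    "customer": ["nin", "address", "phoneNumber"],
    "orders": ["deliveryAddress"],
}

def columns_to_drop(dataset, available_columns):
    """Return the registered sensitive columns that actually exist
    in the dataset, so the drop never fails on a missing column."""
    sensitive = set(SENSITIVE_COLUMNS.get(dataset, []))
    return [c for c in available_columns if c in sensitive]

# With Spark you would then apply:
#   df = df.drop(*columns_to_drop("customer", df.columns))
```

This keeps the knowledge of what is sensitive in one place instead of scattered across every pipeline.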

Pseudonymization

Records and datasets are linked

Hash PII

Example - hash sensitive column with Apache Spark

```python
from pyspark.sql.functions import sha2

df = spark.read.parquet("/mnt/data/customer")
# Replace the national identity number with its SHA-256 hash.
df = df.withColumn("nin", sha2(df.nin, 256))
```
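One caveat: a plain SHA-256 of a short, low-entropy identifier such as a national identity number can be reversed by enumerating the input space. A keyed hash avoids that. A sketch in plain Python (`pseudonymize` is a hypothetical helper; in Spark you could apply it through a UDF, and the key should come from a secret store such as Azure Key Vault, not be hard-coded):

```python
import hashlib
import hmac

# Assumption: in production this key is fetched from Azure Key Vault.
SECRET_KEY = b"replace-with-a-key-from-key-vault"

def pseudonymize(value):
    """Keyed hash (HMAC-SHA256) of a PII value, hex-encoded.
    Deterministic, so records stay linkable, but not brute-forceable
    without the key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
```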

3. Process and automation

What steps should be taken before uploading data?

How to delete data you can no longer have?

How to ensure there is a business purpose for all data?

Task templates

Before you start building data pipelines for a new use case, there might be a few things you need to do first, e.g. fill out questionnaires, check user consent, get access to data. To make sure you complete everything, create task templates in your task management software.

In the case of Trello: create a task, save it in a Template list, then copy and edit it as needed. Jira, Azure Boards, and similar tools offer the same capability.

Implementing “right to be forgotten”

Complying with the GDPR "right to be forgotten" clause gets easier with Databricks Delta. You can set up a simple scheduled job with code like the example below to delete all users who have opted out of your service. Keep in mind that the MERGE only removes the rows logically; the underlying data files are physically removed only after you run VACUUM on the table.

```sql
MERGE INTO users
USING opted_out_users
ON opted_out_users.userId = users.userId
WHEN MATCHED THEN DELETE
```

Data retention

First, explore the native Blob Storage lifecycle rules. You can use rules to transition your data to the best access tier and to expire data at the end of its lifecycle. Unfortunately, lifecycle management has some serious limitations (strict quota, lack of support for Data Lake Storage Gen2), so most probably you'll end up writing custom logic for data retention anyway.

```json
{
  "rules": [
    {
      "name": "manageSensitive",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "customer" ]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete": { "daysAfterModificationGreaterThan": 180 }
          },
          "snapshot": {
            "delete": { "daysAfterCreationGreaterThan": 90 }
          }
        }
      }
    }
  ]
}
```
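If you do end up writing custom retention logic, the core of it is just a date comparison. A minimal sketch, assuming the same 180-day window as the lifecycle rule above (`is_expired` is an illustrative helper, not an Azure SDK function):

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # assumption: mirrors the "delete" action above

def is_expired(last_modified, now=None, retention_days=RETENTION_DAYS):
    """True when a blob's last-modified timestamp falls outside
    the retention window."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified > timedelta(days=retention_days)

# With the azure-storage-blob SDK you would iterate a container and
# delete whatever is expired, e.g.:
#   for blob in container_client.list_blobs(name_starts_with="customer"):
#       if is_expired(blob.last_modified):
#           container_client.delete_blob(blob.name)
```

Keeping the expiry decision in a pure function makes the retention policy easy to unit test, which matters when deleting data is irreversible.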


Purpose

I believe companies should only collect and store the data they need — and delete everything else. The value of data decreases very quickly, and storing it "just in case" is a dangerous path. But how do you ensure that all the data you have is needed?

One approach is to make sure that all data landing in Azure has a business purpose. You can use Azure Data Factory to fetch the source/destination mapping and enforce purpose, business owner, and business use case columns in the same place, e.g.:
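A minimal sketch of such a check; the field names (`purpose`, `businessOwner`, `businessUseCase`) and the `validate_mapping` helper are hypothetical, chosen to match the columns mentioned above:

```python
# Hypothetical required metadata for every source/destination mapping.
REQUIRED_FIELDS = ("source", "destination", "purpose",
                   "businessOwner", "businessUseCase")

def validate_mapping(entry):
    """Return the required metadata fields that are missing or empty,
    so a pipeline can refuse to ingest data without a stated purpose."""
    return [f for f in REQUIRED_FIELDS if not entry.get(f)]

mapping = {
    "source": "sftp://erp/customers.csv",
    "destination": "adls://datalake/raw/customer",
    "purpose": "billing",
    "businessOwner": "",
    "businessUseCase": "monthly invoicing",
}
```

Running `validate_mapping(mapping)` flags the empty `businessOwner` field, and the pipeline can fail fast instead of ingesting data nobody owns.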
