In today’s data-driven business landscape, unlocking insights from data is critical for an enterprise’s success. To achieve this, the data engineering team needs to give data scientists, developers, project managers, and others access to datasets.
These datasets are typically used to develop and train machine learning models and AI algorithms, and to power data analytics, advanced data visualizations, reporting, testing, and other data applications.
The healthcare sector is seeing a growing consensus that the future of the industry will be based on patient-centric personalized healthcare, for which data is the fundamental building block.
Datasets, however, may contain Personally Identifiable Information (PII) and electronic Protected Health Information (ePHI) fields that need to be protected, both for patient data privacy and for regulatory compliance. This protection is achieved through Data Masking and Anonymization (DMA).
Data Masking involves replacing data with special characters (e.g., ****) while Data Anonymization involves substituting the original data with fictitious data that looks very similar to the original record. Data Masking and Anonymization is typically done during the data cleansing and preparation steps before data sets are made available to the larger team.
These steps often involve considerable manual effort and time before the datasets can be handed to data scientists for deriving analytics and insights.
The key goal for any DMA solution would therefore be twofold: a) Identify and protect individuals’ information in healthcare datasets, and b) Ensure that even after DMA, the utility of the dataset is not compromised for its intended usage.
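The difference between masking and anonymization described above can be illustrated with a minimal Python sketch. The record, the `mask` and `anonymize_name` helpers, and the `FAKE_NAMES` list are all hypothetical and exist only for illustration; a real solution would draw replacements from a much larger generator.

```python
import random

def mask(value: str, visible: int = 0) -> str:
    """Replace characters with '*', optionally keeping the last `visible` ones."""
    if visible <= 0:
        return "*" * len(value)
    return "*" * max(len(value) - visible, 0) + value[-visible:]

# Hypothetical pool of fictitious names used as realistic-looking replacements.
FAKE_NAMES = ["Alex Morgan", "Sam Patel", "Jordan Lee", "Casey Kim"]

def anonymize_name(rng: random.Random) -> str:
    """Substitute a real name with a fictitious one that looks plausible."""
    return rng.choice(FAKE_NAMES)

record = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "asthma"}
rng = random.Random(7)

# Masking: the SSN is visibly redacted, keeping only the last four digits.
masked = {**record, "ssn": mask(record["ssn"], visible=4)}

# Anonymization: the name is swapped for a fictitious but realistic value.
anonymized = {**record, "name": anonymize_name(rng), "ssn": mask(record["ssn"])}

print(masked)
print(anonymized)
```

In both cases the non-sensitive `diagnosis` field is left untouched, which is what keeps the record useful for downstream analysis.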
Key techniques for Data Masking & Anonymization
Data Masking and Data Anonymization can be accomplished through several techniques, each offering different privacy guarantees and a different level of utility in the resulting data. In DMA, privacy and utility generally trade off against each other: strengthening one tends to weaken the other.
Some of the key techniques that could be leveraged in DMA solutions are:
Complete Random Substitution or CRS (aka Pseudonymization)
- Data in one or more columns of a table is randomly substituted with values from an appropriate list/generator
- This preserves the look and feel of data with high privacy
- It could alter the distribution of the data, affecting utility for certain use cases
- This technique does not offer fine-grained control of privacy vs utility
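As a rough sketch of random substitution, the snippet below replaces every value in one column with an independent random draw from a replacement pool. The `substitute_column` helper, the sample rows, and `CITY_POOL` are assumptions made for illustration only.

```python
import random
from collections import Counter

def substitute_column(rows, column, replacements, seed=None):
    """Return copies of `rows` with `column` replaced by random picks from `replacements`."""
    rng = random.Random(seed)
    return [{**row, column: rng.choice(replacements)} for row in rows]

rows = [
    {"city": "Boston", "age": 34},
    {"city": "Boston", "age": 51},
    {"city": "Boston", "age": 47},
    {"city": "Austin", "age": 29},
]

# Hypothetical pool of fictitious city names.
CITY_POOL = ["Springfield", "Riverton", "Lakeside", "Fairview"]

substituted = substitute_column(rows, "city", CITY_POOL, seed=42)

# Compare how often each value occurs before and after substitution.
print(Counter(r["city"] for r in rows))
print(Counter(r["city"] for r in substituted))
```

Because each row is drawn independently, the original 3:1 split between the two cities is generally not preserved, which is the distribution change noted above.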
Generalization (e.g., k-anonymization)
- Some data is excluded deliberately to make it less identifiable
- The data may be modified into a series of ranges
- This technique results in stronger privacy due to reduced data dimensions
- But data utility is compromised
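A minimal sketch of generalization, assuming age and ZIP code are the quasi-identifiers: ages are bucketed into ranges, ZIP codes are truncated, and a simple check confirms whether every combination of quasi-identifiers appears at least k times. The helper names and sample rows are hypothetical.

```python
from collections import Counter

def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the first `keep` characters of a ZIP code and star out the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def is_k_anonymous(rows, quasi_identifiers, k: int) -> bool:
    """True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"age": 34, "zip": "02139", "diagnosis": "asthma"},
    {"age": 37, "zip": "02134", "diagnosis": "diabetes"},
    {"age": 52, "zip": "73301", "diagnosis": "asthma"},
    {"age": 58, "zip": "73344", "diagnosis": "flu"},
]

generalized = [
    {**r, "age": generalize_age(r["age"]), "zip": generalize_zip(r["zip"])}
    for r in rows
]

print(is_k_anonymous(rows, ["age", "zip"], k=2))         # False: every row is unique
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True: two rows per group
```

The price of the stronger privacy is visible in the output: exact ages and ZIP codes are no longer available to anyone analyzing the generalized rows.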
Differential Privacy
- This technique introduces ‘noise’ in the data set but approximately preserves its statistical distribution
- Correct selection of its control parameters gives a good balance between utility and privacy
- Support for non-numeric data is not readily available
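One common way to add such noise is the Laplace mechanism, sketched below for a simple counting query. This is an illustrative example rather than a production implementation: the `private_count` helper and the sample ages are assumptions, and `epsilon` stands in for the control parameter that trades privacy against accuracy (smaller values mean more noise and stronger privacy).

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [30, 45, 70, 82, 66, 19]
rng = random.Random(0)

# The same query answered under a loose and a strict privacy setting.
print(private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng))
print(private_count(ages, lambda a: a >= 65, epsilon=0.1, rng=rng))
```

Individual answers are noisy, but averaged over many records or queries the statistics stay close to the true values, which is why the choice of control parameter matters so much.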
Synthetic Data
- In this technique, data engineers build new artificial data that is modeled on a real dataset
- The right models can ensure a balance between privacy and utility
- This technique works well across data types
- But it is complex to implement and requires creating new models in some cases
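To make the idea concrete, here is a deliberately simple sketch of synthetic data generation. It assumes each column can be modeled independently: numeric columns by a normal distribution and other columns by their observed frequencies. The `fit_model` and `sample_rows` helpers and the sample rows are hypothetical, and this naive model ignores correlations between columns, which is exactly where more sophisticated models are needed in practice.

```python
import random
import statistics
from collections import Counter

def fit_model(rows):
    """Learn a per-column model: mean/stdev for numbers, frequencies otherwise."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            counts = Counter(values)
            model[column] = ("categorical", list(counts.keys()), list(counts.values()))
    return model

def sample_rows(model, n, seed=None):
    """Generate n artificial rows by sampling each column from its fitted model."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for column, spec in model.items():
            if spec[0] == "numeric":
                _, mean, stdev = spec
                row[column] = round(rng.gauss(mean, stdev), 1)
            else:
                _, categories, weights = spec
                row[column] = rng.choices(categories, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic

real_rows = [
    {"age": 34, "diagnosis": "asthma"},
    {"age": 51, "diagnosis": "diabetes"},
    {"age": 47, "diagnosis": "asthma"},
    {"age": 29, "diagnosis": "flu"},
    {"age": 62, "diagnosis": "diabetes"},
    {"age": 38, "diagnosis": "asthma"},
]

model = fit_model(real_rows)
for row in sample_rows(model, 3, seed=1):
    print(row)
```

No generated row corresponds to a real patient, yet the schema and the overall shape of each column follow the source data.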
These methods can also be combined within a single DMA solution, and even applied together to a single data set.
8 features of an ideal Data Masking and Anonymization Solution
- Automated identification of PII and ePHI fields from a data set
- Automated masking and anonymization of PII and ePHI data
- Supports structured, semi-structured, and unstructured data
- Features a wide range of the latest anonymization techniques
- UI as well as API access for uploading data sets containing PII and ePHI
- Preservation of the original schema so that there is no impact on downstream work
- Leveraging the right algorithms to ensure that de-anonymization is difficult
- Support for standard healthcare industry formats such as FHIR
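Two of these features, automated identification of sensitive fields and preservation of the original schema, can be sketched in a few lines of Python. The regular expressions, the `detect_pii_columns` and `mask_detected` helpers, and the 80% match threshold are illustrative assumptions; pattern matching alone would miss names and free-text fields, which typically call for dictionary-based or NLP-based detection.

```python
import re

# Hypothetical patterns for a few common identifier formats (checked in order).
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{8,}\d$"),
}

def detect_pii_columns(rows, threshold=0.8):
    """Flag a column when at least `threshold` of its values match a PII pattern."""
    detected = {}
    for column in rows[0]:
        values = [str(row[column]) for row in rows]
        for label, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                detected[column] = label
                break
    return detected

def mask_detected(rows, detected):
    """Mask flagged columns while keeping every column name and its order."""
    return [
        {c: ("*" * len(str(v)) if c in detected else v) for c, v in row.items()}
        for row in rows
    ]

rows = [
    {"contact": "jane@example.com", "ssn": "123-45-6789", "phone": "617-555-0100", "diagnosis": "asthma"},
    {"contact": "raj@example.org", "ssn": "987-65-4321", "phone": "512-555-0199", "diagnosis": "flu"},
]

detected = detect_pii_columns(rows)
masked_rows = mask_detected(rows, detected)
print(detected)
print(masked_rows[0])
```

Since the masked rows keep the same columns in the same order, downstream code that reads `diagnosis` or iterates over the schema continues to work unchanged.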
A well-designed DMA solution provides multiple benefits. It helps protect sensitive patient data and can also support compliance with data privacy regulations and compliance frameworks such as HIPAA, SOC 2, and HITRUST.
It can also help improve team productivity and reduce data ops cycle times by automating the data cleansing and preparation processes for datasets containing PII and ePHI.
Looking for a Data Masking and Anonymization Solution?
Ideas2IT is a leading product engineering firm, having partnered with prominent US healthcare and health-tech clients such as Roche, Medtronic Lab, Mayo Clinic, uLab Systems, Grapefruit Health, and Dr. Agarwal’s.
Our elite Data Science team also engages in specialized projects for notable non-health enterprises including Facebook, Bloomberg, and Siemens. With our deep expertise in healthcare and data science, we have developed a bespoke Data Masking and Anonymization solution to meet your needs.
If you are looking to create an innovative product or service and anticipate technical challenges, we are here to help, leveraging our strengths in both engineering and domain knowledge. Schedule a free consultation with us today and let's explore how we can collaborate.