3 Top Tools to De-Identify PHI in Healthcare Datasets - Ideas2IT

Under US law, Protected Health Information (PHI) refers to any information about an individual’s health status, health care, or payments for that care. PHI is usually created or collected by healthcare service providers (clinics and hospitals) or payers (insurance companies).

Under the U.S. Health Insurance Portability and Accountability Act (HIPAA), the following 18 identifiers must be kept confidential:

  1. Names
  2. All geographical identifiers smaller than a state
  3. Dates (other than year) directly related to an individual
  4. Phone Numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers
  8. Medical record numbers
  9. Health insurance beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license plate numbers
  13. Device identifiers and serial numbers
  14. Web Uniform Resource Locators (URLs)
  15. Internet Protocol (IP) address numbers
  16. Biometric identifiers, including finger, retinal, and voiceprints
  17. Full face photographic images and any comparable images
  18. Any other unique identifying number, character, or code except the unique code assigned by the investigator to code the data

The Need for PHI De-identification

Safeguarding PHI and ePHI is essential to mitigating privacy risks. De-identifying personal information reduces privacy risks to individuals and also reduces the organization’s exposure to breach risk (e.g., reputational damage and remediation costs). Further, personal information should be retained only as long as necessary to fulfill the stated purposes or as required by law or regulations.

Any organization considering de-identification of personal information should look at the HIPAA Privacy Rule’s standard for the de-identification of protected health information, found in Section 164.514(a) of the rule. Under this standard, health information is not individually identifiable if it does not identify an individual and there is no reasonable basis to believe that it can be used to identify an individual.

EHR and EMR datasets usually contain PHI. Healthcare organizations and their business associates that want to share protected health information must do so in accordance with the HIPAA Privacy Rule, which limits the possible uses and disclosures of PHI. Once the data has been de-identified, however, the HIPAA Privacy Rule restrictions no longer apply. De-identified datasets can then be shared with data scientists like us for analysis, to unlock insights and trends.

At Ideas2IT – a Healthcare Software Development Company, we work with healthcare and health-tech clients like Roche, Netsmart, uLab Systems, Mayo Clinic, and Grapefruit Health. And we have come to use a few tools regularly to de-identify PHI data from healthcare datasets. Let’s take a brief look at them in this blog.

Methods of De-Identification 

No method of de-identification can ensure with certainty that all risk of re-identification is removed. Most methods try to reduce this risk as far as possible, or to within an acceptable range. HIPAA-compliant de-identification of protected health information is possible using two methods:

  1. Safe Harbor 
  2. Expert Determination

Safe Harbor

The first HIPAA-compliant way to de-identify protected health information is to remove specific identifiers from the data set. The identifiable data that must be removed are:

  • Names
  • Geographic subdivisions smaller than a state
  • All elements of dates (except year) related to an individual, including admission and discharge dates, birth date, and date of death; all ages over 89; and all elements of dates (including year) that are indicative of such an age
  • Telephone, cellphone, and fax numbers
  • Email addresses
  • IP addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Device identifiers and serial numbers
  • Certificate/license numbers
  • Account numbers
  • Vehicle identifiers and serial numbers including license plates
  • Website URLs
  • Full-face photos and comparable images
  • Biometric identifiers (including finger and voice prints)
  • Any unique identifying numbers, characteristics or code
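
The removal step above can be sketched for a few of the identifier types. This is a minimal illustration, not a complete Safe Harbor implementation: the regular expressions, the placeholder strings, and the function names `scrub_text` and `generalize_age` are our own assumptions, and the patterns only cover simple US-style formats and ISO dates. Names, addresses, and most other identifiers need more than pattern matching.

```python
import re

# Illustrative patterns only (assumptions): a real pipeline must handle all 18
# identifier types and many more formats than these.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
ISO_DATE_RE = re.compile(r"\b(\d{4})-\d{2}-\d{2}\b")


def scrub_text(text: str) -> str:
    """Replace a few identifier types with placeholders and keep only the year of ISO dates."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    text = ISO_DATE_RE.sub(r"\1", text)  # 2021-03-14 becomes 2021
    return text


def generalize_age(age: int) -> str:
    """Collapse all ages over 89 into a single bucket."""
    return "90+" if age > 89 else str(age)


if __name__ == "__main__":
    note = "Seen on 2021-03-14. Contact jane.doe@example.org or 555-123-4567. SSN 123-45-6789."
    print(scrub_text(note))
    print(generalize_age(92))
```

Placeholders such as `[EMAIL]` keep the text readable for downstream analysis while dropping the identifying value itself.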

Expert Determination

This method of de-identification of protected health information requires a HIPAA-covered entity or business associate to obtain an opinion from a qualified statistical expert that the risk of re-identifying an individual from the data set is very small. Expert Determination methodologies exist so that critical data can be used while still protecting patient privacy. In such cases, the methods used to make that determination and justification of the expert’s opinion must be documented and retained by the covered entity or business associate and made available to regulators in the event of an audit or investigation. HIPAA does not define the level of risk of re-identification other than to say it should be ‘very small’. The expert should define ‘very small’ in relation to the context of the data set.

While there is currently no single standard method for de-identification, four major organizations have adopted the Expert Determination standard. They are:

  1. The Institute of Medicine (IOM)
  2. The Health Information Trust Alliance (HITRUST)
  3. The Pharmaceutical Users Software Exchange (PhUSE)
  4. The Council of Canadian Academies

These standards help guide organizations through accessing, storing, and exchanging personal information. These frameworks are a major step in clarifying current methodologies.

Tools we recommend for PHI De-identification

Google Healthcare API

De-identification in Google Healthcare API works at the following levels:

  • At the Dataset Level: De-identification occurs on all data in DICOM stores and FHIR stores of the dataset. If a dataset contains both DICOM instances and FHIR resources, you can de-identify all of the instances and resources at the same time.
  • At the FHIR Store Level: De-identification occurs on all data in a specific FHIR store in a dataset.
  • At the DICOM Store Level: De-identification occurs on all data in a specific DICOM store in a dataset.

We suggest checking the documentation for how the APIs are called at the dataset, FHIR store, and DICOM store levels.

De-identification doesn’t impact the original dataset, FHIR store, DICOM store, or the original data. Depending on how you configure the de-identification, the operation behaves as follows:

  • If you are de-identifying data at the dataset level, de-identified copies of the original data are written to a new dataset called the destination dataset.
  • If you are de-identifying data at the DICOM or FHIR store level, de-identified copies of the original data are written to an existing DICOM or FHIR store in an existing dataset. The output DICOM store and FHIR store are called the destination DICOM store and destination FHIR store, respectively.

The source dataset, FHIR store, or DICOM store, and the destination dataset, FHIR store, or DICOM store must reside in the same Google Cloud location. De-identifying data across multiple Google Cloud locations is not supported.
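
As a rough sketch of a dataset-level call, the helper below builds the request and enforces the same-location constraint before anything is sent. The `:deidentify` endpoint path and the `destinationDataset` and `config` field names reflect our reading of the v1 REST reference and should be checked against the current documentation. The helper name and the project and dataset IDs are hypothetical, and the contents of `config` are left to the caller.

```python
import re
from typing import Any, Dict, Tuple

API_ROOT = "https://healthcare.googleapis.com/v1"  # assumed v1 REST root

_DATASET_RE = re.compile(r"^projects/([^/]+)/locations/([^/]+)/datasets/([^/]+)$")


def build_dataset_deidentify_request(
    source: str, destination: str, config: Dict[str, Any]
) -> Tuple[str, Dict[str, Any]]:
    """Return (url, json_body) for a dataset-level de-identify call without sending it."""
    src = _DATASET_RE.match(source)
    dst = _DATASET_RE.match(destination)
    if not src or not dst:
        raise ValueError("expected projects/<project>/locations/<location>/datasets/<dataset>")
    # Source and destination must live in the same Google Cloud location.
    if src.group(2) != dst.group(2):
        raise ValueError("source and destination must be in the same location")
    # Dataset-level de-identification writes to a new destination dataset.
    if source == destination:
        raise ValueError("destination must differ from the source dataset")
    url = f"{API_ROOT}/{source}:deidentify"
    body = {"destinationDataset": destination, "config": config}
    return url, body


if __name__ == "__main__":
    url, body = build_dataset_deidentify_request(
        "projects/my-project/locations/us-central1/datasets/raw",      # hypothetical IDs
        "projects/my-project/locations/us-central1/datasets/deid",
        config={},
    )
    print(url)
    print(body)
```

The resulting URL and body would then be sent as an authenticated POST request. Authentication and polling of the long-running operation are omitted here.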

BERT-based Clinical Deidentification

Several approaches can be used to identify PHI in biomedical corpora for Safe Harbor de-identification. Here we discuss one specific model. BERT models are based on Transformers, a deep learning architecture in which every output element is connected to every input element and the weightings between them are calculated dynamically based on their connection. BERT is bidirectional: it reads text both left to right and right to left.

BioBERT is a BERT-based pre-trained model trained on several biomedical corpora, such as journals, medical articles, and medical research publications. Its learned representations are fairly specific to biomedical jargon. The main features of the BioBERT model are:

  1. Simple architecture based on bidirectional transformers
  2. Single output layer based on the representations from its last layer to compute only token level BIO2 probabilities
  3. BioBERT directly learns WordPiece embeddings during pre-training and fine-tuning.

Here, we take BioBERT, a pre-trained model with context-aware word embeddings, and use it to classify PHI categories as a named entity recognition (NER) task. The model is fine-tuned on the I2B2 2014 dataset, a fully tagged dataset used in medical research.

The PHI entities identified by the model can then be removed or masked to hide the identifiers.
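
The masking step can be sketched independently of the model. The example below assumes the NER output has already been aligned to word-level tokens with BIO2 tags. The function name `mask_phi`, the tag labels, and the sample sentence are illustrative assumptions, not the output of our fine-tuned model.

```python
from typing import List


def mask_phi(tokens: List[str], tags: List[str]) -> str:
    """Replace each tagged PHI span with a single [CATEGORY] placeholder."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    out: List[str] = []
    current = None  # category of the PHI span currently being masked
    for token, tag in zip(tokens, tags):
        if tag == "O":
            out.append(token)  # non-PHI token is kept as-is
            current = None
            continue
        prefix, _, category = tag.partition("-")
        # Emit one placeholder when a span starts (B-) or the category changes.
        if prefix == "B" or category != current:
            out.append(f"[{category}]")
        current = category
    return " ".join(out)


if __name__ == "__main__":
    tokens = ["Patient", "is", "67", "and", "lives", "in", "San", "Diego",
              ",", "seen", "on", "March", "3", "."]
    tags = ["O", "O", "B-AGE", "O", "O", "O", "B-CITY", "I-CITY",
            "O", "O", "O", "B-DATE", "I-DATE", "O"]
    print(mask_phi(tokens, tags))
```

Collapsing a multi-token span into one placeholder avoids leaking how many words the original identifier contained.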

The following results were computed by us:

AGE 99%
CITY 82%
DATE 98%
FAX 0%

A category with 0% accuracy in the results above (here, FAX) implies that the model has no representation of that PHI type. All other PHI categories are detected fairly well. Beyond textual information such as patient names and addresses, the model does a good job of detecting specific dates, ages, and numeric data and classifying them as PHI. Specific pieces of information whose removal is mandated can therefore also be removed in the same process.

AWS Comprehend Medical

Amazon Comprehend Medical detects and returns useful information in unstructured clinical text such as physician’s notes, discharge summaries, test results, and case notes. It uses natural language processing (NLP) models to detect entities, which are textual references to medical information such as medical conditions, medications, or Protected Health Information (PHI).

Use the DetectPHI operation to detect Protected Health Information (PHI) data in the clinical text being examined.
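
A minimal sketch of calling DetectPHI through boto3 and masking the returned spans is shown below. `detect_phi` needs boto3 and valid AWS credentials. The region default, the `min_score` threshold, and the `redact` helper are our own illustrative choices, and `redact` only assumes the `Entities`, `BeginOffset`, `EndOffset`, `Score`, and `Type` fields of the response.

```python
from typing import Any, Dict


def detect_phi(text: str, region: str = "us-east-1") -> Dict[str, Any]:
    """Call Amazon Comprehend Medical DetectPHI (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so redact() works without the AWS SDK installed

    client = boto3.client("comprehendmedical", region_name=region)
    return client.detect_phi(Text=text)


def redact(text: str, response: Dict[str, Any], min_score: float = 0.5) -> str:
    """Replace each detected PHI span with its entity type, e.g. [NAME]."""
    entities = [e for e in response.get("Entities", []) if e.get("Score", 0.0) >= min_score]
    # Work from the end of the text so earlier character offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text


if __name__ == "__main__":
    note = "John Smith, 67, seen 03/14/2021."
    # Hand-written response in the DetectPHI shape, used here instead of a live call.
    fake_response = {
        "Entities": [
            {"BeginOffset": 0, "EndOffset": 10, "Score": 0.99, "Type": "NAME"},
            {"BeginOffset": 12, "EndOffset": 14, "Score": 0.98, "Type": "AGE"},
            {"BeginOffset": 21, "EndOffset": 31, "Score": 0.97, "Type": "DATE"},
        ]
    }
    print(redact(note, fake_response))
```

In practice the threshold trades missed PHI against over-redaction, so a low value is usually the safer starting point for de-identification.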

The following PHI entities are detected by DetectPHI in Amazon Comprehend Medical:

| Entity | Description | HIPAA Category |
| --- | --- | --- |
| AGE | All components of age, spans of age, and any age mentioned, be it patient or family member or others involved in the note. Default is in years unless otherwise noted. | 3. Dates related to an individual |
| DATE | Any date related to patient or patient care. | 3. Dates related to an individual |
| NAME | All names mentioned in the clinical note, typically belonging to patient, family, or provider. | 1. Name |
| PHONE_OR_FAX | Any phone, fax, pager; excludes named phone numbers such as 1-800-QUIT-NOW as well as 911. | 4. Phone number; 5. FAX number |
| EMAIL | Any email address. | 6. Email addresses |
| ID | Any sort of number associated with the identity of a patient. This includes their social security number, medical record number, facility identification number, clinical trial number, certificate or license number, and vehicle or device number. It also includes biometric numbers, and numbers identifying the place of care or provider. | 7. Social Security Number; 8. Medical Record number; 9. Health Plan number; 10. Account numbers; 11. Certificate/License numbers; 12. Vehicle identifiers; 13. Device numbers; 16. Biometric information; 18. Any other identifying characteristics |
| URL | Any web URL. | 14. URLs |
| ADDRESS | This includes all geographical subdivisions of an address of any facility, named medical facilities, or wards within a facility. | 2. Geographic location |
| PROFESSION | Includes any profession or employer mentioned in a note as it pertains to the patient or the patient’s family. | 18. Any other identifying characteristics |

The Amazon Comprehend Medical Protected Health Information Data Extraction and Identification (PHId) API is priced at $0.0014 per 100 characters of text in a request.
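
To get a feel for what that rate means for a corpus, here is a rough cost estimate. It assumes each request is billed in whole 100-character units, rounded up. That rounding rule is our assumption and should be checked against the current pricing page. The function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_UNIT = Decimal("0.0014")  # USD per 100 characters, as quoted above
UNIT_CHARS = 100


def estimate_phid_cost(char_count: int) -> Decimal:
    """Estimate the cost of one request, assuming whole units rounded up."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    units = -(-char_count // UNIT_CHARS)  # integer ceiling division
    return PRICE_PER_UNIT * units


if __name__ == "__main__":
    # A 10,000-character discharge summary is 100 units.
    print(estimate_phid_cost(10_000))
```

At this rate, a 10,000-character note is 100 units, or about $0.14 per request.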


Removing specific data elements from personal data sets helps ensure that the information retained cannot be used to identify an individual. In short, the de-identification of personal information is a very important component of protecting PII and mitigating privacy risks.

To discover more intel regarding the advancements in the de-identification of PHI in healthcare datasets, we invite you to connect with one of our specialists today.
