PHI De-Identification for HIPAA Compliance

Under US law, Protected Health Information (PHI) refers to any information about health status, health care, and associated payments. PHI is usually created or collected by healthcare service providers (clinics and hospitals) or payers (insurance companies). The U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that the following 18 identifiers be kept confidential:

  • Names
  • All geographical identifiers smaller than the name of a state
  • Dates (other than year) directly related to an individual
  • Phone numbers
  • Fax numbers
  • Email addresses
  • Social Security numbers
  • Medical record numbers
  • Health insurance beneficiary numbers
  • Account numbers
  • Certificate/license numbers
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Device identifiers and serial numbers
  • Web Uniform Resource Locators (URLs)
  • Internet Protocol (IP) address numbers
  • Biometric identifiers, including finger, retinal, and voiceprints
  • Full face photographic images and any comparable images
  • Any other unique identifying number, character, or code except the unique code assigned by the investigator to code the data

The Need for PHI De-identification

Safeguarding PHI and ePHI is important for mitigating privacy risks. De-identifying personal information reduces the privacy risk to individuals and also reduces the organization's exposure to breach risk (e.g., reputational damage and remediation costs). Further, personal information should be retained only as long as necessary to fulfill the stated purposes or as required by law or regulations.

Any organization considering the de-identification of personal information should look at the HIPAA Privacy Rule's standard for the de-identification of protected health information. Section 164.514(a) of the rule states that health information is not considered individually identifiable if it does not identify a specific individual. EHR and EMR datasets usually contain PHI.

Healthcare organizations and their business associates that want to share protected health information must do so in accordance with the HIPAA Privacy Rule, which limits the possible uses and disclosures of PHI. Once protected health information has been de-identified, however, the HIPAA Privacy Rule restrictions no longer apply. De-identified datasets can then be shared with data scientists like us for analysis, to unlock insights and trends.

Methods of De-Identification

No method of de-identifying PHI guarantees that all risk of re-identification is removed. Most methods try to reduce this risk as far as possible, or to within an acceptable range. HIPAA-compliant de-identification of protected health information is possible using two methods:

  1. Safe Harbor
  2. Expert Determination

Safe Harbor

The first HIPAA-compliant way to de-identify protected health information is to remove specific identifiers from the data set. The identifiable data that must be removed are:

  • Names
  • Geographic subdivisions smaller than a state
  • All elements of dates (except year) related to an individual, including:
    • admission and discharge dates
    • birthdate
    • date of death
    • all ages over 89 years old
    • elements of dates (including year) that are indicative of age
  • Telephone, cellphone, and fax numbers
  • Email addresses
  • IP addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Device identifiers and serial numbers
  • Certificate/license numbers
  • Account numbers
  • Vehicle identifiers and serial numbers including license plates
  • Website URLs
  • Full-face photos and comparable images
  • Biometric identifiers (including finger and voice prints)
  • Any unique identifying numbers, characteristics or code
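
As a rough illustration of the Safe Harbor idea, the sketch below masks a handful of the identifiers listed above in free text. The regular expressions, the placeholder labels, and the `redact_safe_harbor` name are our own assumptions for this example. They cover only a few simple formats and would not, on their own, make a dataset compliant.

```python
import re

# Illustrative patterns only: each covers one simple format of one identifier type.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
]
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")   # MM/DD/YYYY only
AGE = re.compile(r"\b(\d{1,3})[- ]year[- ]old\b")        # e.g. "92-year-old"


def _cap_age(match):
    # Ages over 89 are collapsed into a single category; other ages are kept.
    return "[AGE 90+]" if int(match.group(1)) > 89 else match.group(0)


def redact_safe_harbor(text):
    """Mask a few Safe Harbor identifiers in free text (hypothetical helper)."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    text = DATE.sub(lambda m: m.group(3), text)  # keep only the year
    return AGE.sub(_cap_age, text)


note = ("Pt seen 03/14/2019, 92-year-old, call 555-123-4567 "
        "or jane.doe@example.org, SSN 123-45-6789.")
print(redact_safe_harbor(note))
```

Names, addresses, and other free-form identifiers cannot be caught reliably with patterns like these, which is where the NER-based tools discussed later come in.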

Expert Determination

This method of de-identification of protected health information requires a HIPAA-covered entity or business associate to obtain an opinion from a qualified statistical expert that the risk of re-identifying an individual from the data set is very small. Expert Determination methodologies exist so that critical data can be used while still protecting patient privacy.

In such cases, the methods used to make that determination and justification of the expert’s opinion must be documented and retained by the covered entity or business associate and made available to regulators in the event of an audit or investigation.

HIPAA does not define the level of risk of re-identification other than to say it should be ‘very small’. The expert should define ‘very small’ in relation to the context of the data set. While there is not currently one standard method for de-identification, there are four major organizations that have adopted the Expert Determination standard. They are:

  • The Institute of Medicine (IOM)
  • The Health Information Trust Alliance (HITRUST)
  • The Pharmaceutical Users Software Exchange (PhUSE)
  • The Council of Canadian Academies

These standards help guide organizations through accessing, storing, and exchanging personal information. These frameworks are a major step in clarifying current methodologies.
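
To make "very small" more concrete, here is a minimal sketch of one measure an expert might look at: the size of the smallest group of records that share the same combination of quasi-identifiers (a k-anonymity-style check). HIPAA does not prescribe this measure. The function name, the field names, and the sample records are assumptions made purely for illustration.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (k, worst_case_risk) for the given quasi-identifier columns.

    k is the size of the smallest group of records that share the same
    quasi-identifier values; 1/k is a simple worst-case risk estimate.
    """
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    k = min(groups.values())
    return k, 1.0 / k


# Hypothetical records that have already been generalized (3-digit ZIP, birth year).
records = [
    {"zip3": "941", "birth_year": 1980, "sex": "F"},
    {"zip3": "941", "birth_year": 1980, "sex": "F"},
    {"zip3": "941", "birth_year": 1980, "sex": "F"},
    {"zip3": "100", "birth_year": 1975, "sex": "M"},
    {"zip3": "100", "birth_year": 1975, "sex": "M"},
]
k, risk = reidentification_risk(records, ["zip3", "birth_year", "sex"])
print(k, risk)
```

Whatever threshold is chosen for the resulting risk is the expert's judgment for the specific data set and release context, and that reasoning is what must be documented.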

Tools we recommend for PHI De-identification

Google Healthcare API

De-identification in Google Healthcare API works at the following levels:

  • At the Dataset Level: De-identification occurs on all data in DICOM stores and FHIR stores of the dataset. If a dataset contains both DICOM instances and FHIR resources, you can de-identify all of the instances and resources at the same time.
  • At the FHIR Store Level: De-identification occurs on all data in a specific FHIR store in a dataset.
  • At the DICOM Store Level: De-identification occurs on all data in a specific DICOM store in a dataset.

We suggest checking the documentation for how the API is called at the dataset, FHIR store, and DICOM store levels. De-identification does not modify the original dataset, FHIR store, or DICOM store, or the data they contain. Depending on how you configure the de-identification, the operation behaves as follows:

  • If you are de-identifying data at the dataset level, de-identified copies of the original data are written to a new dataset called the destination dataset.
  • If you are de-identifying data at the DICOM or FHIR store level, de-identified copies of the original data are written to an existing DICOM or FHIR store in an existing dataset. The output DICOM store and FHIR store are called the destination DICOM store and destination FHIR store, respectively.

The source dataset, FHIR store, or DICOM store, and the destination dataset, FHIR store, or DICOM store must reside in the same Google Cloud location. De-identifying data across multiple Google Cloud locations is not supported.
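
The sketch below shows roughly what a dataset-level request could look like. It only builds the URL and JSON body and does not send anything. The `:deidentify` method path and the `destinationDataset` and `config` field names reflect our understanding of the v1 REST API and should be verified against the current documentation. The helper name and the project, location, and dataset IDs are hypothetical.

```python
import json

API_ROOT = "https://healthcare.googleapis.com/v1"


def build_dataset_deidentify_request(project, location, source_dataset,
                                     destination_dataset, config=None):
    """Build (url, body) for a dataset-level de-identify call.

    A single `location` is used for both source and destination, so the
    same-location requirement is satisfied by construction.
    """
    parent = f"projects/{project}/locations/{location}/datasets"
    url = f"{API_ROOT}/{parent}/{source_dataset}:deidentify"
    body = {
        "destinationDataset": f"{parent}/{destination_dataset}",
        # De-identification options (DICOM, FHIR, text); see the API docs.
        "config": config or {},
    }
    return url, body


url, body = build_dataset_deidentify_request(
    "my-project", "us-central1", "raw-dataset", "deid-dataset")
print(url)
print(json.dumps(body, indent=2))
```

Sending the request additionally requires an authenticated HTTP POST, which is omitted here.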

BERT-based Clinical De-identification

There are several ways to identify PHI in biomedical corpora for the Safe Harbor method. Here we discuss one specific model that can be used for this purpose. BERT models are based on Transformers, a deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based on their connection.

BERT reads text in both directions, i.e. from left to right and from right to left. BioBERT is a BERT-based pre-trained model trained on several medical corpora such as journals, medical articles, and medical research publications. The vocabulary of the pre-trained model is fairly specific to biomedical jargon. The main features of the BioBERT model are:

  1. Simple architecture based on bidirectional transformers
  2. Single output layer based on the representations from its last layer to compute only token level BIO2 probabilities
  3. BioBERT directly learns WordPiece embeddings during pre-training and fine-tuning.

Here, let's take a BioBERT model, a pre-trained model with context-aware word embeddings, and use it to classify PHI categories in a named entity recognition (NER) task. The model is fine-tuned on the i2b2 2014 dataset, a fully tagged dataset used in medical research.

The PHI entities identified by the model can then be masked or removed. The following results are self-computed:

PHI Accuracy:

  • Age: 99%
  • City: 82%
  • Country: 66%
  • Date: 98%
  • Device: 0%
  • Doctor: 93%
  • Email: 0%
  • Fax: 0%
  • Hospital: 79%
  • ID Number: 85%
  • Medical Record Number: 99%
  • Organization: 40%
  • Patient Name: 89%
  • Phone: 96%
  • Profession: 79%
  • State: 84%
  • Street: 98%
  • Username: 96%
  • Zip Code: 99%
  • Unknown: 97%

A 0% accuracy in the results above implies that the model has no representation of that PHI category. All other PHI categories are detected fairly well. Beyond textual information such as patient names and addresses, the model also does a good job of detecting specific dates, ages, and numeric data and classifying them as PHI. Information whose removal is mandated by specific requirements can therefore be removed in the same process.
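
The masking step that follows NER can be sketched in a few lines. The example assumes the model's output has already been reduced to one BIO2 tag per word-level token. Merging WordPiece sub-tokens back into words is left out, and the `mask_phi` name, the tag labels, and the sample sentence are assumptions made for illustration.

```python
def mask_phi(tokens, tags):
    """Replace each tagged PHI span with a single [LABEL] placeholder.

    tokens: list of word-level tokens
    tags:   list of BIO2 tags, e.g. "B-PATIENT", "I-PATIENT", "O"
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    output = []
    previous_label = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            output.append(token)
            previous_label = None
            continue
        prefix, _, label = tag.partition("-")
        # Start a new placeholder on B- tags, or when an I- tag changes label.
        if prefix == "B" or label != previous_label:
            output.append(f"[{label}]")
        previous_label = label
    return " ".join(output)


tokens = ["Mr.", "John", "Smith", "was", "admitted", "on", "March", "3",
          "to", "Mercy", "Hospital"]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "O", "B-DATE", "I-DATE",
        "O", "B-HOSPITAL", "I-HOSPITAL"]
print(mask_phi(tokens, tags))
```

Keeping the category in the placeholder preserves some analytic value (for example, knowing that a date was mentioned) while removing the identifying value itself.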

Amazon Comprehend Medical

Amazon Comprehend Medical detects and returns useful information in unstructured clinical text such as physician's notes, discharge summaries, test results, and case notes. It uses natural language processing (NLP) models to detect entities, which are textual references to medical information such as medical conditions, medications, or Protected Health Information (PHI).

Use the DetectPHI operation to detect Protected Health Information (PHI) data in the clinical text being examined.
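
A minimal sketch of calling DetectPHI from Python with boto3 and masking the returned spans might look like the following. The `redact` helper, the `min_score` threshold, and the default region are assumptions of this example. The sample entities are hand-written to mimic the shape of the response (offsets, `Type`, `Score`) and are not real service output.

```python
def detect_phi(text, region_name="us-east-1"):
    """Call Amazon Comprehend Medical DetectPHI and return its entity list."""
    import boto3  # imported lazily so redact() can be used without AWS access

    client = boto3.client("comprehendmedical", region_name=region_name)
    return client.detect_phi(Text=text)["Entities"]


def redact(text, entities, min_score=0.5):
    """Replace each detected span with its entity type, e.g. [NAME]."""
    # Work from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] >= min_score:
            start, end = entity["BeginOffset"], entity["EndOffset"]
            text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text


# Hand-written entities in the response shape, for illustration only.
note = "Patient John Doe was seen on 2021-03-04."
name_start = note.index("John Doe")
date_start = note.index("2021-03-04")
sample_entities = [
    {"Type": "NAME", "Score": 0.99,
     "BeginOffset": name_start, "EndOffset": name_start + len("John Doe")},
    {"Type": "DATE", "Score": 0.98,
     "BeginOffset": date_start, "EndOffset": date_start + len("2021-03-04")},
]
print(redact(note, sample_entities))
```

In real use, `redact(note, detect_phi(note))` would combine the two steps. An appropriate confidence threshold depends on how much missed PHI the use case can tolerate.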

The following PHI entities are detected by DetectPHI in Amazon Comprehend Medical (entity: description):

  • AGE: All components of age, spans of age, and any age mentioned, whether related to the patient, family members, or others in the note. Default is in years unless otherwise noted.
  • DATE: Any date related to the patient or patient care.
  • NAME: All names mentioned in the clinical note, typically belonging to the patient, family, or provider.
  • PHONE_OR_FAX: Any phone, fax, pager number; excludes named phone numbers like 1-800-QUIT-NOW or emergency numbers like 911.
  • EMAIL: Any email address mentioned.
  • ID: Any number associated with the patient's identity. This includes:
    • Social Security Number
    • Medical Record Number
    • Facility Identification Number
    • Clinical Trial Number
    • Certificate/License Number
    • Vehicle or Device Number
    • Health Plan Number
    • Account Numbers
    • Biometric Numbers
    • Numbers identifying the place of care or provider
  • URL: Any web URL.
  • ADDRESS: Geographical subdivisions of an address, named medical facilities or wards within a facility, and geographic location.
  • PROFESSION: Any profession or employer mentioned in a note related to the patient or their family.

Additionally, the cost for using the Medical Protected Health Information Data Extraction and Identification (PHId) API of Amazon Comprehend is $0.0014 per 100 characters of text in a request.
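
For budgeting, the per-character price quoted above turns into a simple calculation. The sketch below ignores any rounding of partial units, minimum charges, or free tier, all of which should be checked against the current pricing page. The function name is our own.

```python
PRICE_PER_UNIT_USD = 0.0014  # price quoted above, per unit
UNIT_CHARS = 100             # one unit = 100 characters


def estimate_detect_phi_cost(total_chars):
    """Rough cost estimate in USD for processing `total_chars` characters."""
    return total_chars / UNIT_CHARS * PRICE_PER_UNIT_USD


# 1,000,000 characters -> 10,000 units -> about 14 USD
print(round(estimate_detect_phi_cost(1_000_000), 2))
```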


Removing specific data elements from personal data sets helps ensure that the information retained cannot be used to identify an individual. In short, de-identification is a very important component of protecting personal information and mitigating privacy risks. To learn more about advancements in the de-identification of PHI in healthcare datasets, please contact us today.

Ideas2IT Team
