Spark: A scalable replacement for Mirth?
Healthcare organizations today receive data from multiple sources such as EHRs, patient portals, billing systems, and more. To run data analytics, these organizations need to migrate, transform, and collate the data from all of these sources. Adopting current technologies can simplify this process considerably.
Mirth is a popular solution for integrating healthcare data. While Mirth provides a wealth of built-in functionality, it has problems with scalability and performance. Recently, we modernized a healthcare data ingestion platform for one of our customers, moving it from Mirth to a cloud-native solution built on Apache Spark. In this blog, we walk you through how we did this and the performance improvements that followed.
What is Mirth?
We faced the following challenges while processing large datasets with Mirth.
- With parsing, validation, and transformation functions in place for every input field, Mirth takes almost a day to process the data once the input size reaches gigabytes.
- Mirth is hard to scale when the input data size varies widely (say, from 100 MB to 100 GB). The solution should be able to scale its resources up or down based on the load.
- It is difficult to orchestrate and monitor the entire data pipeline using Mirth.
To overcome these challenges, we replaced Mirth with Spark for faster and more efficient data processing. We recommend a cloud-based infrastructure for solutions with such heavy data processing needs.
Before getting to the cloud solution, let's take a closer look at the existing solution to understand the whole picture.
- The healthcare payer-related information is generated as an Inbound Recipient File (space-delimited text files with fixed-width strings).
- The inbound files are transferred to the SFTP server.
- The processing engine parses and transforms the input feed. In the existing architecture, Mirth is the processing engine, and it performs the following steps.
- Parse the input fixed-length text files.
- Validate each field in the inbound recipient file (validating string length, string format, etc.).
- Apply transformations to the fields (converting date-time values to specific formats, generating unique IDs, trimming, etc.).
- Create success and failure flat files. The success flat file contains the rows that passed all validations, and the failure flat file contains the rows in which any field failed validation.
- Update EVV tables with the success and failure information.
- Update the Healthcare payer database tables with the success and failure rows.
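The validate, transform, and split steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Mirth channel logic itself: the field names, the validation rules, and the output date format are all assumptions made for the example, since the real rules come from the customer's data dictionary.

```python
import uuid
from datetime import datetime


def is_valid_date(value: str) -> bool:
    """Check that a value looks like MM/DD/YYYY (assumed input date format)."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False


# Hypothetical validation rules: field name -> predicate on the raw string.
VALIDATORS = {
    "source_system": lambda v: len(v) == 4 and v.isdigit(),
    "jurisdiction": lambda v: len(v) == 2 and v.isalpha(),
    "date_of_birth": is_valid_date,
}


def validate(record: dict) -> list:
    """Return the names of the fields that fail validation."""
    return [name for name, check in VALIDATORS.items()
            if not check(record.get(name, ""))]


def transform(record: dict) -> dict:
    """Trim every field, reformat the date, and attach a generated unique id."""
    out = {key: value.strip() for key, value in record.items()}
    parsed = datetime.strptime(out["date_of_birth"], "%m/%d/%Y")
    out["date_of_birth"] = parsed.strftime("%Y-%m-%d")  # assumed target format
    out["record_id"] = str(uuid.uuid4())
    return out


def process(records: list) -> tuple:
    """Split records into transformed successes and annotated failures."""
    success, failure = [], []
    for record in records:
        errors = validate(record)
        if errors:
            failure.append({**record, "failed_fields": ",".join(errors)})
        else:
            success.append(transform(record))
    return success, failure
```

A row fails as soon as any one of its fields fails, which matches the success and failure file definitions above. Only rows that pass are transformed, so the failure file keeps the original values for troubleshooting.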
Sample Input Format
The inbound recipient file is a text file made up of fixed-length records.
|2692 CA First Street California 11111111111 Fax: 1111111111 Female 04/04/1950 Monica Latte 4444 Coffee Ave Chocolate, California 90011 Carl Savem Female Divorced English DIABETES MELLITUS 0 0 0 0 0 CA4932 250|
A data dictionary has been provided to parse and identify the column information from the above-mentioned fixed-length strings.
For example, the first 4 characters correspond to Source System (2692), the next 2 characters correspond to Jurisdiction (CA), and so on.
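As a rough illustration of dictionary-driven parsing, the sketch below slices a line by offset and width. The layout is hypothetical: only the Source System and Jurisdiction widths come from the description above, while the third field, its width, and the single separator space between fields are assumptions made for the example.

```python
from typing import NamedTuple


class FieldSpec(NamedTuple):
    name: str
    start: int  # zero-based offset of the field within the line
    width: int  # number of characters the field occupies


# Hypothetical data dictionary; the real one defines every column in the feed.
DATA_DICTIONARY = [
    FieldSpec("source_system", 0, 4),
    FieldSpec("jurisdiction", 5, 2),
    FieldSpec("street", 8, 20),  # assumed field and width
]


def parse_line(line: str, layout=DATA_DICTIONARY) -> dict:
    """Slice one fixed-width line into named, whitespace-trimmed fields."""
    return {
        spec.name: line[spec.start:spec.start + spec.width].strip()
        for spec in layout
    }


# Build a sample line padded to the assumed widths.
line = " ".join(["2692", "CA", "First Street".ljust(20)])
record = parse_line(line)
```

In PySpark, the same slicing is typically expressed per column with `pyspark.sql.functions.substring`, which takes a 1-based start position rather than the zero-based offsets used here.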
To eliminate the data processing limitations due to the size of the database, here is the proposed infrastructure.
Following are the steps performed in the cloud-native solution.
- Using a batch loader, the input files are retrieved from the source system and fed into the Amazon S3 bucket.
- An event notification is set in the Amazon S3 bucket to look for incoming files and trigger a Lambda.
- The Lambda function reads the incoming file names from the S3 bucket and pushes them to an Amazon SQS queue.
- The service orchestration layer is implemented using AWS Step Functions. The Step Functions workflow uses the following four orchestration elements.
- AWS SQS
- AWS Glue
- AWS RDS
- AWS SNS
- AWS Step Functions takes the appropriate action on the success or failure of each of the above-mentioned steps.
- The main drawback of the Mirth-based architecture is scalability. In cloud-based architecture, we can scale up or scale down the processing job based on the input load.
- Whenever there is an input message in AWS SQS, the Step Functions workflow invokes a Glue job. The Glue job does the following.
- Parse the inbound recipient feed using a data dictionary and load it as a Spark Dataframe.
- Perform validations (which were earlier done by Mirth) on the input fields and segregate success and failure records.
- Apply transformation functions to the success records.
- Store the Success and Failure records in S3 as a flat file.
- Update the AWS RDS (MySQL tables) with the success and failure information.
- Once the data has been successfully written to the database, use AWS SNS to send an email notification.
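The S3-to-SQS hand-off performed by the Lambda function could look roughly like the sketch below. It is not the customer's actual function: the `QUEUE_URL` environment variable, the message body shape, and the optional `sqs_client` parameter (added so the handler can be exercised without AWS access) are assumptions for illustration. The event fields and the `send_message` call follow the standard S3 notification format and the boto3 SQS client.

```python
import json
import os
from urllib.parse import unquote_plus


def extract_files(event: dict) -> list:
    """Pull bucket names and object keys out of an S3 event notification."""
    files = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        files.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return files


def handler(event, context, sqs_client=None):
    """Queue one SQS message per incoming file."""
    if sqs_client is None:
        import boto3  # bundled with the AWS Lambda Python runtime
        sqs_client = boto3.client("sqs")
    queue_url = os.environ["QUEUE_URL"]  # hypothetical configuration name
    files = extract_files(event)
    for item in files:
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(item))
    return {"queued": len(files)}
```

Keeping the Lambda this thin leaves retries and failure handling to the orchestration layer, which is where the steps above place them.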
To establish the right benchmarks for the upgraded system, we conducted a dry run using an inbound file with the following specifications.
- File size – 1.5 GB
- Total number of records – 2 Million
- Success to Failure ratio – 80:20
- Number of DPUs used – 3 standard DPUs
Following are the metrics from the new design.
- File size: 1.5 GB
- Total number of records to be processed: 2 million
- Total time taken to process: 13 minutes
- Time taken for validation and transform: 6 minutes
- Time taken for updating DB: 7 minutes
With the cloud-based architecture (PySpark), the total time taken for processing a 1.5 GB file was around 13 minutes whereas with the older architecture (Mirth), it took almost a day to process those same records.
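To put these numbers in perspective, the short calculation below derives throughput and an approximate speedup from the reported figures. The Mirth duration is an assumption: "almost a day" is treated as a full 24 hours, so the speedup is an upper-bound estimate rather than a measured value.

```python
# Figures reported for the dry run.
records = 2_000_000
file_size_mb = 1.5 * 1024
total_minutes = 13
db_update_minutes = 7

# Assumption: "almost a day" with Mirth is taken as a full 24 hours.
mirth_minutes = 24 * 60

total_seconds = total_minutes * 60
records_per_second = records / total_seconds       # about 2,564 records/s
mb_per_second = file_size_mb / total_seconds       # about 1.97 MB/s
speedup = mirth_minutes / total_minutes            # about 110x
db_share = db_update_minutes / total_minutes       # about 54% of the run

print(f"{records_per_second:,.0f} records/s, {mb_per_second:.2f} MB/s")
print(f"~{speedup:.0f}x faster; DB update is {db_share:.0%} of total time")
```

The database update accounts for a little over half of the total run time, so the write path is the first place to look for further gains.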
Following are the advantages of using the cloud-based architecture.
- Enhanced Scalability: Since the processing and database layers are separated, better scalability is achieved. The number of DPUs can be adjusted dynamically based on the input load.
- AWS Glue Capabilities: It can process multi-gigabyte files in one shot and write the results to S3. There is no need to divide the inputs into chunks.
- Monitoring and Notification: Better monitoring and notification are achieved using AWS SNS and AWS Step Functions. Any error or failure in the orchestration unit is handled gracefully by AWS Step Functions (with retries and notifications).
Challenges Faced While Upgrading to the Cloud Solution
- Writing a huge amount of data into relational databases using PySpark leads to I/O throttling.
- Much of the transformation and validation logic has to be converted to PySpark manually.