AWS S3 Multipart File Upload: Our Recent Experience at Ideas2IT

This article describes our recent experience with multipart file upload to AWS S3. Our application builds line-delimited JSON files that contain patient information and uploads them to an AWS S3 bucket. Each file consists of a few hundred lines, and each line represents one patient record.

Since each line could be up to 10 MB, we adopted the S3 multipart upload approach. It lets us break the file into parts and upload them in parallel from a thread pool, which keeps performance good. With this approach, each part upload request must include the part's input stream and its size in bytes.
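The part-splitting step can be sketched as follows. This is a minimal illustration, not our production code: PartUploader is a hypothetical stand-in for the S3 client's part-upload call, and uploadInParts, the pool size of 4, and the part size are assumptions chosen for the example. The point it shows is that every part carries both a stream and a byte count taken from the same slice of data.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultipartSketch {

    /** Hypothetical stand-in for the real part-upload call; not an AWS SDK type. */
    interface PartUploader {
        String uploadPart(int partNumber, InputStream body, long sizeInBytes) throws Exception;
    }

    /** Splits data into parts, uploads them concurrently, and returns the results in part order. */
    static List<String> uploadInParts(byte[] data, int partSize, PartUploader uploader) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is an assumption
        try {
            List<Future<String>> pending = new ArrayList<>();
            int partNumber = 1; // part numbers start at 1
            for (int offset = 0; offset < data.length; offset += partSize) {
                final int start = offset;
                final int length = Math.min(partSize, data.length - offset);
                final int number = partNumber++;
                // The stream and the declared size both come from the same byte range.
                pending.add(pool.submit(() ->
                        uploader.uploadPart(number, new ByteArrayInputStream(data, start, length), length)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // waits for each part; rethrows upload failures
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real upload the uploader would wrap the S3 client, and every part except the last would need to be at least 5 MB.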


The multipart upload to S3 worked as anticipated. However, after going live, we encountered a production bug: an uploaded file had an incomplete line in the middle, with a few characters missing at the end of that line. The issue was sporadic and initially hard to reproduce.

Upon investigation, we discovered that the issue occurred when the patient record had a binary file, such as an image, attached. Further inspection pointed to the line that computes the byte size passed in the part upload request.

After reproducing the issue in our Dev environment with a suitable record, we resolved it with a one-line change: the byte size passed in the part upload request now comes from inputStream.available() instead of content.getBytes().length.

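We found that content.getBytes().length minus inputStream.available() was 0 for plain string content, but -1 or -2 for content containing binary data such as images. In other words, the size we declared was one or two bytes smaller than what the stream actually held, which is consistent with the few characters missing at the end of the line. The discrepancy appeared with some images but not all.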
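One possible way such a gap can arise, offered here as an assumption rather than a confirmed root cause, is that the size is computed with one character encoding while the stream is built with another. The sketch below uses a hypothetical helper, sizeGap, and two standard charsets chosen purely for illustration; it shows that ASCII text gives a gap of 0 while non-ASCII characters give a negative gap.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCountMismatch {

    /**
     * Returns (declared size) minus (bytes actually in the stream) when the two
     * are derived from the same string using different charsets.
     * Hypothetical helper for illustration only.
     */
    static int sizeGap(String content, Charset declaredWith, Charset streamedWith) {
        int declared = content.getBytes(declaredWith).length;
        ByteArrayInputStream stream = new ByteArrayInputStream(content.getBytes(streamedWith));
        // For a ByteArrayInputStream, available() is the number of bytes left to read.
        return declared - stream.available();
    }

    public static void main(String[] args) {
        // ASCII characters take one byte in both charsets.
        System.out.println(sizeGap("patient-001",
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
        // U+00E9 takes one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println(sizeGap("\u00e9",
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

Whatever the cause, a declared size that is smaller than the stream means the tail of the part is never sent.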

The Java documentation for InputStream.available() notes, "Note that while some implementations of InputStream will return the total number of bytes in the stream, many will not." InputStream.available() might therefore not be appropriate in all cases, which is why we initially decided against using it.
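Where the stream type is not guaranteed to report its full length, one alternative is to buffer the part first and take the size from the buffer itself. The following is a minimal sketch under that assumption; ExactPartSize, SizedPart, and buffer are hypothetical names, and the approach holds the whole part in memory, so it only suits parts of a manageable size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ExactPartSize {

    /** Hypothetical holder pairing a part body with a size taken from the same bytes. */
    static final class SizedPart {
        final InputStream body;
        final long sizeInBytes;

        SizedPart(InputStream body, long sizeInBytes) {
            this.body = body;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Reads the source fully so the declared size never depends on available(). */
    static SizedPart buffer(InputStream source) throws IOException {
        byte[] bytes = source.readAllBytes(); // Java 9+
        return new SizedPart(new ByteArrayInputStream(bytes), bytes.length);
    }
}
```

With this shape, the stream handed to the upload request and the size declared for it cannot drift apart, even for a stream whose available() returns 0 while data remains.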


Conclusion

Switching from content.getBytes().length to inputStream.available() fixed the incomplete lines in our S3 multipart uploads, which had appeared when records carried binary content such as images. The underlying problem was that the declared part size did not always match the number of bytes in the stream for such content. As the documentation cautions, available() does not return the total stream length for every InputStream implementation, so the fix depends on the stream type in use. This experience underscores the importance of testing with representative data, including binary payloads, and of choosing methods that suit the nature of the data to ensure reliable file uploads.

Ideas2IT Team