How to reuse custom Python libraries across AWS Glue jobs: A step-by-step guide
AWS Glue initially supported only a limited set of Python libraries. We ran into problems whenever a job needed anything else, such as pandas or Paramiko, and sharing or reusing our own custom libraries/modules across different Glue jobs was even harder.
But we solved it! Here’s how.
The solution lies in wheel files (Python package archives with a .whl extension). Glue recently started supporting custom-built wheel files, which lets us easily import external libraries, or even our own custom modules/libraries, into AWS Glue.
What are wheel files?
Wheels are a component of the Python ecosystem that help make package installations faster and provide more stability in the package distribution process. A ‘wheel’ file is basically a ZIP-format archive with a specially formatted filename and the .whl extension. It is designed to contain all the files for a PEP 376 compatible installation in a way that is very close to the on-disk format.
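To make the “ZIP archive with a specially formatted filename” description concrete, here is a minimal sketch using only the Python standard library. The function names are our own illustrations, not part of any packaging tool; the filename used in the demo is the Paramiko wheel that appears later in this post.

```python
import zipfile


def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag may sit between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    keys = ("name", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


def list_wheel_contents(wheel_path):
    """Return the files inside a wheel, read as an ordinary ZIP archive."""
    with zipfile.ZipFile(wheel_path) as archive:
        return sorted(archive.namelist())


if __name__ == "__main__":
    print(parse_wheel_filename("paramiko-2.7.2-py2.py3-none-any.whl"))
```

Calling list_wheel_contents on any .whl file from a dist folder should show the package modules next to a .dist-info directory that holds the metadata.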
How to create wheel files?
We can package our Python code as a wheel file. To do this, the code needs to follow a package folder structure with a “setup.py” file.
– __init__.py (empty file)
Here’s a sample setup.py
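As a minimal sketch of such a layout and setup.py, the script below writes both into a folder. The package name util_module is taken from the util_module.egg-info folder that the build produces; the version number, the helpers.py module, and its clean_column_name function are hypothetical placeholders rather than the original sample.

```python
import tempfile
from pathlib import Path

# Assumed setup.py contents: only the package name comes from the article.
SETUP_PY = '''\
from setuptools import setup, find_packages

setup(
    name="util_module",
    version="0.1.0",
    packages=find_packages(),
)
'''

# Hypothetical shared helper that several Glue jobs might reuse.
HELPERS_PY = '''\
def clean_column_name(name):
    """Normalise a column name, e.g. ' Order ID ' -> 'order_id'."""
    return name.strip().lower().replace(" ", "_")
'''


def scaffold_project(root):
    """Create setup.py and the util_module package under root."""
    root = Path(root)
    package_dir = root / "util_module"
    package_dir.mkdir(parents=True, exist_ok=True)
    (package_dir / "__init__.py").write_text("")  # empty file marks the package
    (package_dir / "helpers.py").write_text(HELPERS_PY)
    (root / "setup.py").write_text(SETUP_PY)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_project(tmp):
            print(path)
```

find_packages() picks up every folder that contains an __init__.py, which is why the empty file matters.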
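To build your code as a wheel file, run the command below from the folder that contains setup.py. If Python reports that bdist_wheel is an invalid command, install the wheel package first (pip install wheel).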
> python setup.py bdist_wheel
This creates the build, dist, and util_module.egg-info folders. The wheel file (“*.whl”) is in the dist folder. Once it is uploaded to an S3 bucket, we can add this wheel file to the Glue job.
Adding wheel files to a Glue Job
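Navigate to AWS Glue > Jobs and click the ‘Add Job’ button. In the job properties, point the Python library path (the “--extra-py-files” parameter) at the S3 location of the wheel file.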
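For teams that script their jobs instead of using the console, the same setup can be sketched with boto3. This is an illustration under assumptions, not the procedure used in this post: the job name, role ARN, bucket, and script path are hypothetical, and a Python shell job is assumed. The part that matters is the “--extra-py-files” default argument, which takes the S3 path of the wheel (several paths are comma-separated).

```python
def build_job_definition(job_name, role_arn, script_s3_uri, wheel_s3_uris):
    """Assemble keyword arguments for the Glue create_job API call."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # assumption: a Python shell job
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        # Glue reads additional Python libraries from this special parameter.
        "DefaultArguments": {"--extra-py-files": ",".join(wheel_s3_uris)},
    }


def create_job(definition):
    """Create the job; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without it

    return boto3.client("glue").create_job(**definition)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    definition = build_job_definition(
        job_name="reuse-util-module",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        script_s3_uri="s3://example-bucket/scripts/job.py",
        wheel_s3_uris=["s3://example-bucket/libs/util_module-0.1.0-py3-none-any.whl"],
    )
    print(definition["DefaultArguments"])
```

Keeping the definition in a plain dictionary makes it easy to review before anything is sent to AWS.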
Now, here’s how we import the reusable wheel file in a Glue job
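Inside the job script, the packaged code is imported like any other Python package. The sketch below imitates that on a local machine, without Glue: it writes a stand-in archive laid out like a pure-Python wheel, puts it on sys.path (Python can import from ZIP archives), and then runs an ordinary import. The module util_module.helpers and its clean_column_name function are hypothetical; in a real job, presumably only the import line and the call would be needed.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

# Hypothetical shared helper packaged inside the stand-in wheel.
HELPERS_SOURCE = (
    "def clean_column_name(name):\n"
    "    return name.strip().lower().replace(' ', '_')\n"
)


def make_stand_in_wheel(directory):
    """Write a tiny ZIP archive laid out like a pure-Python wheel."""
    wheel_path = Path(directory) / "util_module-0.1.0-py3-none-any.whl"
    with zipfile.ZipFile(wheel_path, "w") as archive:
        archive.writestr("util_module/__init__.py", "")
        archive.writestr("util_module/helpers.py", HELPERS_SOURCE)
    return wheel_path


def import_helpers_from(wheel_path):
    """Make the archive importable, then import the shared module."""
    sys.path.insert(0, str(wheel_path))
    importlib.invalidate_caches()
    # This is the part a job script would contain: a plain import.
    from util_module import helpers

    return helpers


if __name__ == "__main__":
    helpers = import_helpers_from(make_stand_in_wheel(tempfile.mkdtemp()))
    print(helpers.clean_column_name("  Order ID "))  # prints: order_id
```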
Now, let’s consider a different use case, where we need to use external packages across several Glue jobs.
I had a scenario where I wanted to use the ‘Paramiko’ library to connect to my SFTP server from my Glue Python job. To do so, I cloned the code from GitHub and used its “setup.py” to create a .whl file for the library. Here are the steps that I followed.
- git clone https://github.com/paramiko/paramiko/
- cd paramiko
- python setup.py bdist_wheel
After the build finishes, you can see the “paramiko-2.7.2-py2.py3-none-any.whl” file in the dist folder. Upload it to an S3 bucket; you can then reference it in your Glue job as the Python library path (“--extra-py-files”).
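The upload step can also be scripted. Here is a minimal sketch using boto3's upload_file, assuming a hypothetical bucket name and key prefix; the returned S3 URI is the value to supply as the Python library path.

```python
from pathlib import Path


def s3_location_for(wheel_path, bucket, prefix="glue/libs"):
    """Return the (key, s3_uri) pair under which the wheel will be stored."""
    key = f"{prefix.strip('/')}/{Path(wheel_path).name}"
    return key, f"s3://{bucket}/{key}"


def upload_wheel(wheel_path, bucket, prefix="glue/libs"):
    """Upload the wheel and return its S3 URI; needs boto3 and credentials."""
    import boto3  # imported lazily so the helper above works without it

    key, uri = s3_location_for(wheel_path, bucket, prefix)
    boto3.client("s3").upload_file(str(wheel_path), bucket, key)
    return uri


if __name__ == "__main__":
    # "example-bucket" is a hypothetical bucket name.
    _, uri = s3_location_for(
        "dist/paramiko-2.7.2-py2.py3-none-any.whl", "example-bucket"
    )
    print(uri)
```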
Now navigate to AWS Glue > Jobs and click the ‘Add Job’ button.
Here’s how we import the reusable wheel file in a Glue job
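With the wheel attached, Paramiko is imported in the job script like any other library. The sketch below lists the files in a directory on an SFTP server; the host, credentials, and directory are hypothetical, and the paramiko_module parameter is our own addition that only exists so the function can be exercised without a real server. This simplified version does not verify the server's host key, which a production job should do.

```python
def list_sftp_directory(host, username, password, remote_dir=".", port=22,
                        paramiko_module=None):
    """Return the sorted file names found in remote_dir on the SFTP server."""
    if paramiko_module is None:
        # Available in the job once the Paramiko wheel is on the library path.
        import paramiko as paramiko_module

    transport = paramiko_module.Transport((host, port))
    try:
        # Password authentication; the host key is not checked in this sketch.
        transport.connect(username=username, password=password)
        sftp = paramiko_module.SFTPClient.from_transport(transport)
        return sorted(sftp.listdir(remote_dir))
    finally:
        transport.close()


# Example call with hypothetical values:
# files = list_sftp_directory("sftp.example.com", "glue_user", "secret", "/inbound")
```

In practice the credentials would come from job parameters or a secrets store instead of being written into the script.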