AWS Glue initially supported only a limited set of Python libraries. We ran into problems whenever we needed other libraries such as pandas or Paramiko, and we had even more trouble sharing or reusing custom libraries and modules across different Glue jobs.
But we solved it! Here’s how.
The solution lies in wheel files (Python files with a .whl extension). Glue started supporting custom-built wheel files recently and this allowed us to import external libraries or even our own custom modules/libraries easily into AWS Glue.
What are wheel files?
Wheels are a component of the Python ecosystem that help make package installations faster and provide more stability in the package distribution process. A ‘wheel’ file is basically a ZIP-format archive with a specially formatted filename and the .whl extension. It is designed to contain all the files for a PEP 376 compatible installation in a way that is very close to the on-disk format.
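To make the "specially formatted filename" concrete, here is a minimal sketch that splits a wheel filename into its fields. It assumes the naming convention from the wheel specification (distribution, version, optional build tag, Python tag, ABI tag, platform tag); the helper name `parse_wheel_filename` is ours and only illustrative.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # No build tag present.
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


info = parse_wheel_filename("paramiko-2.7.2-py2.py3-none-any.whl")
print(info.distribution, info.version, info.python_tag, info.abi_tag, info.platform_tag)
# prints: paramiko 2.7.2 py2.py3 none any
```

Here "py2.py3" means the wheel works on both Python 2 and 3, while "none" and "any" mean it is not tied to a specific ABI or platform.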
How to create wheel files?
We can build our Python code as wheel-formatted files. To do this, we need a folder structure that contains a "setup.py" file and a package folder with an __init__.py (empty file).
Here’s a sample setup.py
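As a rough sketch of what such a project could look like, the script below writes a minimal layout to disk, including a small setup.py. The package name "util_module" is inferred from the egg-info folder mentioned in this post; the version number and the helper name `create_skeleton` are placeholders of ours, not the original project's values.

```python
from pathlib import Path

# Contents of a minimal setup.py. The name and version are assumptions.
SETUP_PY = '''\
from setuptools import setup, find_packages

setup(
    name="util_module",
    version="0.1.0",
    packages=find_packages(),
)
'''


def create_skeleton(root):
    """Write a minimal package layout under root and return the files created."""
    root = Path(root)
    package_dir = root / "util_module"
    package_dir.mkdir(parents=True, exist_ok=True)
    # An empty __init__.py marks the folder as a Python package.
    (package_dir / "__init__.py").write_text("")
    (root / "setup.py").write_text(SETUP_PY)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
```

Calling `create_skeleton("my_project")` produces `setup.py` next to a `util_module/` folder; your own modules then go inside that folder.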
To build your code as a wheel file, run the command below.
> python setup.py bdist_wheel
It will create build, dist, and util_module.egg-info folders. The dist folder will have the wheel file ("*.whl"). Now, we can add this wheel file to the Glue job.
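Before uploading, it can be worth checking what actually ended up in the wheel. Since a wheel is a ZIP archive, the standard library is enough for that. This is a small sketch; the function name `list_wheel_contents` is ours.

```python
import zipfile
from pathlib import Path


def list_wheel_contents(dist_dir):
    """Return {wheel filename: [archive member names]} for each wheel in dist_dir."""
    contents = {}
    for wheel in sorted(Path(dist_dir).glob("*.whl")):
        # A wheel is a regular ZIP archive, so zipfile can read it directly.
        with zipfile.ZipFile(wheel) as archive:
            contents[wheel.name] = archive.namelist()
    return contents
```

Running `list_wheel_contents("dist")` from the project folder should list your package files along with the metadata that the build step adds.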
Navigate to AWS Glue > Jobs > Click ‘Add Job’ button
Now, here’s how we import the reusable wheel file in a Glue job
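Inside the job script, a package from an attached wheel is imported like any other package. The sketch below simulates this locally rather than in Glue: it builds a stand-in archive, puts it on `sys.path`, and imports from it. The package name "util_module" is inferred from this post, and the `greet` function is purely hypothetical.

```python
import sys
import tempfile
import zipfile
from pathlib import Path

# Build a stand-in wheel locally (a real one would come from the dist folder).
workdir = Path(tempfile.mkdtemp())
wheel_path = workdir / "util_module-0.1.0-py3-none-any.whl"
with zipfile.ZipFile(wheel_path, "w") as archive:
    archive.writestr(
        "util_module/__init__.py",
        "def greet(name):\n    return f'Hello, {name}!'\n",
    )

# Local simulation only: putting the archive on sys.path lets Python's
# zip importer find the pure-Python package inside it.
sys.path.insert(0, str(wheel_path))

# From here on, the script imports the module like any other package.
from util_module import greet

message = greet("Glue")
print(message)  # prints: Hello, Glue!
```

In an actual Glue job, only the import and the calls that follow it belong in the script; the archive-building and `sys.path` lines are there just to make the example runnable on its own.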
Now, let’s consider a different use case, where we need to use external packages across several Glue jobs.
I had a scenario where I wanted to use the ‘Paramiko’ library to connect to my SFTP server from a Glue Python job. To do so, I cloned the code from GitHub and used its “setup.py” to create a .whl file for the library. Here are the steps I followed.
- git clone https://github.com/paramiko/paramiko/
- cd paramiko
- python setup.py bdist_wheel
After execution, you can see the “paramiko-2.7.2-py2.py3-none-any.whl” file in the dist folder. Upload it to an S3 bucket, and then reference it in the Glue job as the Python library path ("--extra-py-files").
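If you prefer scripting over the console, the upload and job setup can be sketched with boto3. This is a sketch under assumptions, not the exact setup we used: the job is assumed to be a Python shell job, and every bucket, key, role, and job name you pass in is a placeholder of your own.

```python
def build_job_definition(job_name, role_arn, script_s3_path, wheel_s3_paths):
    """Build keyword arguments for the Glue create_job call."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # assumption: a Python shell job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Comma-separated S3 paths of the wheel files to attach.
            "--extra-py-files": ",".join(wheel_s3_paths),
        },
    }


def upload_and_create_job(wheel_file, bucket, key, job_definition):
    """Upload the wheel to S3, then create the Glue job that references it."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("s3").upload_file(wheel_file, bucket, key)
    return boto3.client("glue").create_job(**job_definition)
```

Keeping the definition in a plain dictionary makes it easy to reuse the same wheel list across several jobs.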
Now navigate to AWS Glue > Jobs > Click ‘Add Job’ button.
Here’s how we import the reusable wheel file in a Glue job
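As an illustration, a job script could list the files on the SFTP server roughly like this. It is a minimal sketch that assumes Paramiko and its dependencies are importable in the job; the function name and the connection details you pass in are hypothetical, and the optional `client` parameter exists only so the function can be tried without a real server.

```python
def list_sftp_directory(host, username, password, remote_dir=".", port=22, client=None):
    """Return the file names found in remote_dir on an SFTP server."""
    if client is None:
        import paramiko  # provided by the wheel attached to the job

        client = paramiko.SSHClient()
        # Accepts unknown host keys automatically, which keeps the sketch short;
        # loading known host keys is the stricter option.
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, port=port, username=username, password=password)
    try:
        sftp = client.open_sftp()
        try:
            return sftp.listdir(remote_dir)
        finally:
            sftp.close()
    finally:
        # Always release the SSH connection, even if listing fails.
        client.close()
```

In the job, a call such as `list_sftp_directory(host, user, password, "/inbound")` returns the remote file names as a list of strings.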