
How to Reuse Custom Python Libraries Across AWS Glue Jobs: Complete Guide

AWS Glue initially supported a limited number of Python libraries. We ran into problems whenever we needed other Python libraries such as pandas or Paramiko, and into more trouble still when we tried to share or reuse custom libraries/modules across different Glue jobs.

But we solved it! Here’s how.

The solution lies in wheel files (Python package archives with a .whl extension). Glue recently added support for custom-built wheel files, which lets us import external libraries, or our own custom modules/libraries, into AWS Glue jobs easily.

What are wheel files?

Wheels are a component of the Python ecosystem that help make package installations faster and provide more stability in the package distribution process. A ‘wheel’ file is basically a ZIP-format archive with a specially formatted filename and the .whl extension. It is designed to contain all the files for a PEP 376 compatible installation in a way that is very close to the on-disk format.
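The "specially formatted filename" follows the pattern distribution-version[-build tag]-python tag-abi tag-platform tag.whl. As a rough illustration, the sketch below splits a wheel filename into those parts. The helper name parse_wheel_filename and the WheelName type are made up for this example and are not part of any library.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    """The components encoded in a wheel filename."""
    distribution: str
    version: str
    build_tag: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        # No optional build tag present.
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        distribution, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return WheelName(distribution, version, build_tag,
                     python_tag, abi_tag, platform_tag)


print(parse_wheel_filename("paramiko-2.7.2-py2.py3-none-any.whl"))
```

A wheel tagged py2.py3-none-any, like the Paramiko one used later in this post, is pure Python and not tied to a specific interpreter ABI or platform.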

How to create wheel files?

We can build our Python code as wheel-formatted files. To do this, we need to follow a folder structure with a "setup.py" file.

Folder Structure:

- module
  - module-named-folder
    - class.py
    - __init__.py (empty file)
  - setup.py

Example:

- util
  - util_module
    - __init__.py
    - common_util.py
    - date_util.py
  - setup.py

Here’s a sample setup.py

[Screenshot: sample setup.py]
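As a minimal sketch of the layout and the setup.py together, the script below writes the util example skeleton to disk. The folder and file names follow the example above; the version number, the use of find_packages(), and the helper name scaffold_package are assumptions for illustration, so treat the embedded setup.py as a starting point rather than the exact file used here.

```python
import tempfile
from pathlib import Path

# Contents of a minimal setup.py (assumed metadata; adjust name and version).
SETUP_PY = '''\
from setuptools import setup, find_packages

setup(
    name="util_module",
    version="0.1.0",
    packages=find_packages(),
)
'''


def scaffold_package(root: Path) -> Path:
    """Create util/ with util_module/ and setup.py under root; return util/."""
    project = root / "util"
    package = project / "util_module"
    package.mkdir(parents=True, exist_ok=True)
    # An empty __init__.py marks util_module as an importable package.
    (package / "__init__.py").touch()
    for module in ("common_util.py", "date_util.py"):
        (package / module).touch()
    (project / "setup.py").write_text(SETUP_PY)
    return project


if __name__ == "__main__":
    project = scaffold_package(Path(tempfile.mkdtemp()))
    for path in sorted(project.rglob("*")):
        print(path.relative_to(project))
```

The build command is then run from inside the util folder, next to setup.py.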

To build your code as a wheel file, run the following command.

> python setup.py bdist_wheel

It will create build, dist, and util_module.egg-info folders. The dist folder will have the wheel file ("*.whl"). Now, we can add this wheel file to the Glue job.


Navigate to AWS Glue > Jobs and click the ‘Add Job’ button.

[Screenshot: adding wheel files to a Glue job]

Now, here’s how we import the reusable wheel file in a Glue job

[Screenshot: importing a reusable wheel file in a Glue job]
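Once the wheel is attached to the job, the job script imports the package like any other library. The sketch below imitates that locally: it packs a stand-in util_module into a .whl archive and puts it on sys.path, which works for pure-Python wheels because they are ZIP archives. The function today_iso and the stand-in wheel's contents are hypothetical; in a real Glue job only the final import and call would appear in the script.

```python
import sys
import tempfile
import zipfile
from pathlib import Path


def build_stand_in_wheel(directory: Path) -> Path:
    """Write a tiny pure-Python archive that mimics the util_module wheel."""
    wheel_path = directory / "util_module-0.1.0-py3-none-any.whl"
    with zipfile.ZipFile(wheel_path, "w") as archive:
        archive.writestr("util_module/__init__.py", "")
        # Hypothetical helper; the real date_util.py contents are not shown here.
        archive.writestr(
            "util_module/date_util.py",
            "import datetime\n\n"
            "def today_iso():\n"
            "    return datetime.date.today().isoformat()\n",
        )
    return wheel_path


# Local stand-in for what Glue does when the wheel is attached to the job.
wheel = build_stand_in_wheel(Path(tempfile.mkdtemp()))
sys.path.insert(0, str(wheel))

# These are the lines a Glue job script would contain.
from util_module import date_util

print(date_util.today_iso())
```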

Now, let’s consider a different use case, where we need to use external packages across several Glue jobs.

I had a scenario where I wanted to use the ‘Paramiko’ library to connect to my SFTP server from my Glue Python job. To do that, I cloned the code from GitHub and used its “setup.py” to create a .whl file for the library. Here are the steps that I followed.

  1. git clone https://github.com/paramiko/paramiko/
  2. cd paramiko
  3. python setup.py bdist_wheel

After the build finishes, you can see the “paramiko-2.7.2-py2.py3-none-any.whl” file in the dist folder. Upload it to an S3 bucket, and you can then reference it in your Glue job through the Python library path ("--extra-py-files").
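The same setup can also be scripted instead of clicked through the console. The following is a minimal sketch using boto3, assuming a Python shell job; the bucket, key, job name, and role ARN are placeholders, and the helper names are made up for this example. boto3 is imported inside the functions that talk to AWS, so the job definition can be built and inspected without credentials.

```python
from typing import Dict, List


def build_job_definition(job_name: str, role_arn: str, script_s3_path: str,
                         wheel_s3_paths: List[str]) -> Dict:
    """Build create_job arguments that attach wheel files to a Glue job."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # assumption: a Python shell job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        # Several wheels are passed as one comma-separated string.
        "DefaultArguments": {"--extra-py-files": ",".join(wheel_s3_paths)},
    }


def upload_wheel(local_path: str, bucket: str, key: str) -> str:
    """Upload a built wheel to S3 and return its s3:// path."""
    import boto3  # imported lazily; needs AWS credentials at call time
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def create_job(definition: Dict) -> Dict:
    """Create the Glue job from a definition built above."""
    import boto3  # imported lazily; needs AWS credentials at call time
    return boto3.client("glue").create_job(**definition)


# Placeholder values; replace with your own bucket, role, and script.
definition = build_job_definition(
    job_name="sftp-export",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_s3_path="s3://my-glue-assets/scripts/sftp_export.py",
    wheel_s3_paths=[
        "s3://my-glue-assets/libs/paramiko-2.7.2-py2.py3-none-any.whl",
    ],
)
print(definition["DefaultArguments"])
```

Calling upload_wheel and then create_job(definition) would perform the actual upload and job creation.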

Now navigate to AWS Glue > Jobs and click the ‘Add Job’ button.

Here’s how we import the reusable wheel file in a Glue job

[Screenshot: importing the reusable wheel file in a Glue job]


Ideas2IT Team
