AWS Glue Python Library - User Guide | Ideas2IT

How to reuse custom Python libraries across AWS Glue jobs: A step-by-step guide


AWS Glue initially supported only a limited set of Python libraries, so we ran into problems whenever we needed others such as pandas or Paramiko. Sharing or reusing our own custom libraries and modules across different Glue jobs was even more troublesome.

But we solved it! Here’s how.

The solution lies in wheel files (Python package archives with a .whl extension). Glue added support for custom-built wheel files, which lets us import external libraries, or even our own custom modules and libraries, into AWS Glue easily.

What are wheel files?

Wheels are a component of the Python ecosystem that makes package installation faster and the package distribution process more stable. A ‘wheel’ file is basically a ZIP-format archive with a specially formatted filename and the .whl extension. It is designed to contain all the files for a PEP 376 compatible installation in a way that is very close to the on-disk format.
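Since the filename itself carries meaning, it can help to see one taken apart. The sketch below splits a wheel filename into the fields described by the wheel specification (PEP 427): distribution, version, optional build tag, Python tag, ABI tag, and platform tag. The name parse_wheel_filename is our own helper, not part of any library, and the sample filename is the Paramiko wheel that appears later in this guide.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated fields (hypothetical helper)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # No build tag present.
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename}")
    return WheelName(dist, version, build, py, abi, plat)


info = parse_wheel_filename("paramiko-2.7.2-py2.py3-none-any.whl")
print(info.distribution, info.version, info.python_tag, info.abi_tag, info.platform_tag)
```

A wheel tagged py2.py3, none, and any contains only Python code and is not tied to a particular interpreter build or operating system.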

How to create wheel files?

We can package our Python code as a wheel file. To do this, the code needs to follow a specific folder structure that includes a “setup.py” file.

Folder Structure:

    – module
        – module-named-folder
            – class.py
            – __init__.py (empty file)
        – setup.py

    Eg:

    – util
        – util_module
            – __init__.py
            – common_util.py
            – date_util.py
        – setup.py

Here’s a sample setup.py

[Image: sample setup.py]
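A minimal setup.py for the util example above might look like the following sketch. The version number and description are placeholder values we chose for illustration; the package name matches the util_module folder.

```python
# setup.py, placed in the util folder next to util_module/
from setuptools import find_packages, setup

setup(
    name="util_module",
    version="0.1",  # placeholder version, adjust as needed
    description="Shared utility code for Glue jobs",  # placeholder text
    # find_packages() picks up util_module because it contains __init__.py
    packages=find_packages(),
)
```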

To build your code as a wheel file, run the following command from the folder that contains setup.py.

> python setup.py bdist_wheel

It will create build, dist, and util_module.egg-info folders. The dist folder will have the wheel file (“*.whl”). Now, we can add this wheel file to the Glue job.

Adding wheel files to a Glue Job

Navigate to AWS Glue > Jobs and click the ‘Add Job’ button.

[Image: adding wheel files to a Glue job]
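The same job can also be defined from code instead of the console. The sketch below shows one possible way to do that with boto3, assuming a Python shell job. The job name, role ARN, bucket, and paths are placeholders, and build_job_request is a helper name we made up for illustration. The wheel is passed through the --extra-py-files job argument.

```python
def build_job_request(job_name, role_arn, script_s3_uri, wheel_s3_uris):
    """Build the keyword arguments for the Glue create_job call (hypothetical helper)."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # assumption: a Python shell job
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Several wheels can be listed as comma-separated S3 paths.
            "--extra-py-files": ",".join(wheel_s3_uris),
        },
    }


def create_job(request):
    """Send the request to AWS Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the request can be built without AWS access

    return boto3.client("glue").create_job(**request)


# All values below are placeholders.
request = build_job_request(
    job_name="my-glue-job",
    role_arn="arn:aws:iam::123456789012:role/MyGlueRole",
    script_s3_uri="s3://my-glue-libs/scripts/job.py",
    wheel_s3_uris=["s3://my-glue-libs/wheels/util_module-0.1-py3-none-any.whl"],
)
print(request["DefaultArguments"])
```

Calling create_job(request) would then register the job; it is left out of the run above because it needs real AWS credentials.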

Now, here’s how we import the reusable wheel file in a Glue job:

[Image: importing a reusable wheel file in a Glue job]
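Once the wheel is attached to the job, the script imports the package by name, just as it would on a local machine. To keep the sketch below runnable without Glue, it builds a tiny stand-in archive containing util_module and puts it on sys.path; in a real job only the import line is needed. The greet function is invented for the demo, and the stand-in archive is not a complete wheel (it has no metadata).

```python
import sys
import tempfile
import zipfile
from pathlib import Path

# Source for a stand-in common_util module; greet is a made-up function.
MODULE_SOURCE = '''
def greet(name):
    """Hypothetical helper, invented for this demo."""
    return "Hello, " + name
'''

# Build a minimal archive that mimics the layout of the util_module wheel.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "util_module-0.1-py3-none-any.whl"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("util_module/__init__.py", "")
    zf.writestr("util_module/common_util.py", MODULE_SOURCE)

# Local simulation only: make the archive importable.
sys.path.insert(0, str(archive))

# This is the line a Glue job script would actually contain.
from util_module import common_util

message = common_util.greet("Glue")
print(message)
```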

Now, let’s consider a different use case, where we need to use external packages across several Glue jobs.

I had a scenario where I wanted to use the ‘Paramiko’ library to connect to my SFTP server from my Glue Python job. To use it in the job, I cloned the code from GitHub and used its “setup.py” to create a .whl file for the library. Here are the steps that I followed.

  1. git clone https://github.com/paramiko/paramiko/
  2. cd paramiko
  3. python setup.py bdist_wheel

After execution, you can see the “paramiko-2.7.2-py2.py3-none-any.whl” file in the dist folder. Upload it to an S3 bucket, and you can then reference it in your Glue job as the Python library path (the “--extra-py-files” argument).
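The upload step can be scripted as well. The following is a minimal sketch using boto3, assuming AWS credentials are already configured; the bucket name and key prefix are placeholders, and both function names are our own.

```python
from pathlib import Path


def s3_location_for_wheel(wheel_path, bucket, prefix=""):
    """Return the (key, s3 URI) a wheel would get under the given prefix."""
    name = Path(wheel_path).name
    # Skip the prefix when it is empty so the key never starts with a slash.
    key = "/".join(part for part in (prefix.strip("/"), name) if part)
    return key, f"s3://{bucket}/{key}"


def upload_wheel(wheel_path, bucket, prefix=""):
    """Upload the wheel and return its S3 URI. Requires boto3 and credentials."""
    import boto3  # imported here so the path logic works without AWS access

    key, uri = s3_location_for_wheel(wheel_path, bucket, prefix)
    boto3.client("s3").upload_file(str(wheel_path), bucket, key)
    return uri


# Placeholder bucket and prefix; only the path logic runs here.
key, uri = s3_location_for_wheel(
    "dist/paramiko-2.7.2-py2.py3-none-any.whl", "my-glue-libs", "wheels/"
)
print(uri)
```

The URI returned by upload_wheel is the value to supply as the Python library path of the job.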

Now navigate to AWS Glue > Jobs and click the ‘Add Job’ button.

[Image: AWS Glue > Jobs > ‘Add Job’ button]

Here’s how we import the reusable wheel file in a Glue job:

[Image: importing a reusable wheel file in a Glue job]
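For the SFTP scenario, the job script might use Paramiko along the lines of the sketch below. It assumes Paramiko and its dependencies are available to the job; the host, credentials, and directory are placeholders, and list_sftp_directory is a name we chose for illustration. Accepting unknown host keys automatically keeps the sketch short but is not something to rely on for a production server.

```python
def list_sftp_directory(host, username, password, remote_dir=".", port=22):
    """Connect over SFTP and return the file names in remote_dir (illustrative helper)."""
    import paramiko  # supplied to the job by the attached wheel

    client = paramiko.SSHClient()
    # Sketch only: trusts any host key. Verify host keys in real deployments.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(hostname=host, port=port, username=username, password=password)
        sftp = client.open_sftp()
        try:
            return sftp.listdir(remote_dir)
        finally:
            sftp.close()
    finally:
        client.close()


# Example call with placeholder values (not executed here):
# files = list_sftp_directory("sftp.example.com", "user", "secret", "/incoming")
```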
