AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to transform and move data reliably between data stores and data streams.
Service limitations
https://docs.aws.amazon.com/general/latest/gr/glue.html#service-quotas
The documentation lists only soft limits, with no maximum adjustable values mentioned, so we do not expect hard limits to get in the way of potential use cases.
What to do when a soft limit is hit?
This depends on the limit being hit. For most of the limits listed, you can reach out to AWS support and request an increase of the default values. Furthermore, you can segregate your customers so that different customers use resources in different accounts, keeping each account below its limits.
Configuring the service
CDK script
Creating a glue database:
You need a glue database to store the tables that will be created from your job/crawler.
const glueDB = new Database(this, databaseName, {
  databaseName,
});
Creating a python script for a glue job
You can write python scripts in your repo to control the way your data is pipelined through glue. These scripts must be uploaded to S3 bucket in order to be used by your job. For this, you need to create an asset from your script file.
const scriptAsset = new Asset(this, withEnv(name), {
  path: path.join(__dirname, <path-to-glue-script>),
});
Creating a job role to provide necessary permissions
const jobRole = new Role(this, withEnv('glue-role'), {
  roleName: withEnv('glue-role'),
  assumedBy: new ServicePrincipal('glue.amazonaws.com'),
  inlinePolicies: {
    allowS3ReadAndWrite: PolicyDocument.fromJson({
      Version: '2012-10-17',
      Statement: [
        {
          Effect: 'Allow',
          Action: ['s3:ListBucket', 's3:GetObject', 's3:PutObject', 's3:DeleteObject'],
          Resource: 'arn:aws:s3:::*',
        },
      ],
    }),
    allowGlueActions: PolicyDocument.fromJson({
      Version: '2012-10-17',
      Statement: [
        {
          Effect: 'Allow',
          Action: 'glue:*',
          Resource: '*',
        },
      ],
    }),
    allowLogging: PolicyDocument.fromJson({
      Version: '2012-10-17',
      Statement: [
        {
          Effect: 'Allow',
          Action: 'logs:*',
          Resource: '*',
        },
      ],
    }),
  },
});
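The inline policies above are deliberately broad. As a possible tightening, sketched here under the assumption of CDK v1-style imports, you can attach the AWS-managed Glue service policy and grant the role explicit read access to the script asset created earlier:

import { ManagedPolicy } from '@aws-cdk/aws-iam';

jobRole.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'));
// Lets the job read the script object uploaded via the asset above.
scriptAsset.grantRead(jobRole);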
Creating a glue job
const glueJob = new CfnJob(this, withEnv('glue-job-name'), {
  name: withEnv('glue-job-name'),
  role: jobRole.roleName,
  allocatedCapacity: 3,
  command: {
    name: 'glueetl',
    pythonVersion: '3',
    scriptLocation: scriptAsset.s3ObjectUrl,
  },
  // Change the timeout (in minutes) based on the nature of the job;
  // set the minimum required to prevent unnecessary charges.
  timeout: 30,
  glueVersion: '1.0',
  defaultArguments: {
    '--UPLOAD_BUCKET_NAME': '<parameter-value>',
    '--PROCESSING_BUCKET_NAME': '<parameter-value>',
    '--DATA_BUCKET_NAME': '<parameter-value>',
    '--GLUE_DB_NAME': '<parameter-value>',
    '--enable-continuous-cloudwatch-log': 'true',
  },
  executionProperty: {
    maxConcurrentRuns: 10,
  },
});
Manual configuration
Populating the glue data catalog here
Creating jobs here
Deploy
Create your glue resources in a nested stack in your backend stack and then run
cdk deploy
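A minimal sketch of that nested stack, assuming a hypothetical GlueStack class and noting that the import location of NestedStack varies across CDK v1 versions:

import { Construct, NestedStack, NestedStackProps } from '@aws-cdk/core';

export class GlueStack extends NestedStack {
  constructor(scope: Construct, id: string, props?: NestedStackProps) {
    super(scope, id, props);
    // Create the glue database, script asset, job role and job
    // from the snippets above inside this nested stack.
  }
}

// In the parent backend stack:
// new GlueStack(this, withEnv('glue-stack'));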
Using the service
One of the most common use cases for a glue integration is running a glue job whenever data is uploaded to your s3 bucket. You typically set up a lambda file upload trigger like here, and have your lambda handler start glue job runs for the uploaded files as follows (a sketch of the CDK wiring for this trigger follows the handler below):
import { S3CreateEvent } from 'aws-lambda';
import * as AWS from 'aws-sdk';
import { Log } from '../util/logging';

const glue = new AWS.Glue();

export async function handler(event: S3CreateEvent): Promise<void> {
  await Promise.all(
    event.Records.map(async record => {
      const fileParts = record.s3.object.key.split('/');
      // In this case, the subdirectory corresponds to the customer id
      const customerId = fileParts[0];
      const fileName = fileParts[1];
      const startJobResponse = await glue
        .startJobRun({
          JobName: process.env.GLUE_JOB_NAME!,
          Arguments: {
            // Add whatever parameters are required by your job
            '--UPLOAD_BUCKET_NAME': process.env.UPLOAD_BUCKET_NAME!,
            '--GLUE_DB_NAME': process.env.GLUE_DB_NAME!,
            '--FILE_NAME': fileName,
            '--CUSTOMER_ID': customerId,
            '--enable-continuous-cloudwatch-log': 'true',
          },
        })
        .promise();
      const jobId = startJobResponse.JobRunId;
      Log.debug(`Triggered job '${jobId}' for file '${customerId}/${fileName}'`);
    })
  );
}
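The CDK wiring for this trigger is not shown above; the following is a hedged sketch, assuming a lambda Function uploadTriggerFn built from the handler above and an uploadBucket construct (both names are illustrative), reusing the glueDB construct from earlier and CDK v1-style imports:

import { PolicyStatement } from '@aws-cdk/aws-iam';
import { EventType } from '@aws-cdk/aws-s3';
import { LambdaDestination } from '@aws-cdk/aws-s3-notifications';

// Pass the names the handler reads from process.env.
uploadTriggerFn.addEnvironment('GLUE_JOB_NAME', withEnv('glue-job-name'));
uploadTriggerFn.addEnvironment('UPLOAD_BUCKET_NAME', uploadBucket.bucketName);
uploadTriggerFn.addEnvironment('GLUE_DB_NAME', glueDB.databaseName);

// Allow the handler to start job runs (scope to the job ARN if preferred).
uploadTriggerFn.addToRolePolicy(
  new PolicyStatement({
    actions: ['glue:StartJobRun'],
    resources: ['*'],
  })
);

// Invoke the lambda whenever an object is created in the upload bucket.
uploadBucket.addEventNotification(EventType.OBJECT_CREATED, new LambdaDestination(uploadTriggerFn));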
Testing
Unit tests for this service usually cover the lambda trigger, asserting that the correct AWS SDK method is called with the expected parameters. You can also cover the methods in your python script with unit tests.
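A minimal Jest sketch of such a test, assuming a hypothetical path to the handler above and a TypeScript-enabled jest setup; the Glue client is mocked so the assertion can inspect the startJobRun call:

import { S3CreateEvent } from 'aws-lambda';
import * as AWS from 'aws-sdk';

// Mock the Glue client; the factory keeps a single startJobRun mock in its closure.
jest.mock('aws-sdk', () => {
  const startJobRun = jest.fn().mockReturnValue({
    promise: jest.fn().mockResolvedValue({ JobRunId: 'jr_test' }),
  });
  return { Glue: jest.fn(() => ({ startJobRun })) };
});

import { handler } from '../src/upload-trigger'; // hypothetical path to the handler above

test('starts a glue job run for each uploaded file', async () => {
  // The handler builds its own Glue client, so retrieve the shared mock from the mocked module.
  const startJobRun = new AWS.Glue().startJobRun as unknown as jest.Mock;
  process.env.GLUE_JOB_NAME = 'test-glue-job';
  const event = {
    Records: [{ s3: { object: { key: 'customer-1/data.csv' } } }],
  } as unknown as S3CreateEvent;

  await handler(event);

  expect(startJobRun).toHaveBeenCalledWith(
    expect.objectContaining({
      JobName: 'test-glue-job',
      Arguments: expect.objectContaining({
        '--FILE_NAME': 'data.csv',
        '--CUSTOMER_ID': 'customer-1',
      }),
    })
  );
});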
For integration tests, you would typically have the test upload data to the s3 bucket (provided you have an upload trigger), wait for the job run to complete, and then inspect the processed data to ensure the transformation behaved as expected.
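A hedged Jest sketch of such an integration test, assuming UPLOAD_BUCKET_NAME, PROCESSING_BUCKET_NAME and GLUE_JOB_NAME environment variables, a deployed upload trigger, and no other concurrent runs of the job:

import * as AWS from 'aws-sdk';

const s3 = new AWS.S3();
const glue = new AWS.Glue();

jest.setTimeout(30 * 60 * 1000); // cluster start-up alone can take 10-30 minutes

test('uploading a file produces transformed output', async () => {
  await s3
    .putObject({
      Bucket: process.env.UPLOAD_BUCKET_NAME!,
      Key: 'customer-1/integration-test.csv',
      Body: 'id,value\n1,42\n',
    })
    .promise();

  // Poll the most recent run of the job until it reaches a terminal state.
  // Note: in practice, filter runs by StartedOn so an older run is not mistaken for this one.
  let state = 'STARTING';
  while (['STARTING', 'RUNNING', 'STOPPING'].includes(state)) {
    await new Promise(resolve => setTimeout(resolve, 60 * 1000));
    const runs = await glue
      .getJobRuns({ JobName: process.env.GLUE_JOB_NAME!, MaxResults: 1 })
      .promise();
    state = runs.JobRuns?.[0]?.JobRunState ?? 'STARTING';
  }
  expect(state).toBe('SUCCEEDED');

  // Inspect the processed data to confirm the transformation behaved as expected.
  const processed = await s3
    .listObjectsV2({ Bucket: process.env.PROCESSING_BUCKET_NAME!, Prefix: 'customer-1/' })
    .promise();
  expect(processed.KeyCount).toBeGreaterThan(0);
});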
For manual tests, simply upload your data to the bucket (provided you have an upload trigger), wait for the job to run, and then check the processed data in the output bucket to confirm it looks as expected (the process is similar to the integration test).
Logging
Ensure that the "enable-continuous-cloudwatch-log" option is set to true in the code that triggers your job run, as shown in the code sample above (under "Using the service"). In the AWS console, under the Glue dashboard, you can see the status of your runs under "ETL > Jobs".
If you click on the "Logs" option for your job run, you will be able to see CloudWatch logs for your job.
Debugging
Developing python/pyspark glue scripts can be challenging without a convenient way to debug your changes. Job runs are slow to start because creating the cluster can take anywhere between 10 and 30 minutes, and the cycle of deploying changes, running jobs and reading cloudwatch logs is detrimental to productivity.
The convenient way of doing this is using a development endpoint (preferably connected to your IDE).
Step by step guide to configure this here
The advantage of this setup is that your cluster is always available and run logs can be viewed in the PyCharm IDE itself. The disadvantage is that dev endpoints are expensive, since the cluster keeps running for as long as the endpoint is up.
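For completeness, a hedged CDK sketch of provisioning such a dev endpoint, reusing the jobRole from above and assuming an SSH public key for connecting from the IDE (CDK v1-style import):

import { CfnDevEndpoint } from '@aws-cdk/aws-glue';

new CfnDevEndpoint(this, withEnv('glue-dev-endpoint'), {
  endpointName: withEnv('glue-dev-endpoint'),
  roleArn: jobRole.roleArn,
  // Keep the node (DPU) count at the minimum to limit cost (see the cost section below).
  numberOfNodes: 2,
  publicKey: '<your-ssh-public-key>',
});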
Rolling back
If there are configuration issues with the CDK deploy code, you can destroy and redeploy the env. However, be wary of losing crawled data, or data extracted from your uploaded files and stored in your data catalog. If your script writes transformed data frames using a sink (like here), the transformed data is stored in an s3 bucket of your choosing. Ensure that this bucket does not get deleted when you roll back, or that you have a backup from which the table can be restored.
Backup
To back up your transformed catalog data, you can simply copy the contents into another s3 bucket, if required, before destroying any production env.
Restore
To restore backed-up data, create a crawler with your backup s3 bucket as the data source to repopulate your catalog table with the previously transformed data.
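A hedged CDK sketch of such a crawler, assuming a hypothetical backupBucket construct holding the backed-up data and reusing the jobRole and glueDB from the snippets above (CDK v1-style import):

import { CfnCrawler } from '@aws-cdk/aws-glue';

new CfnCrawler(this, withEnv('restore-crawler'), {
  name: withEnv('restore-crawler'),
  role: jobRole.roleArn,
  databaseName: glueDB.databaseName,
  targets: {
    s3Targets: [{ path: `s3://${backupBucket.bucketName}/` }],
  },
});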
Cost optimization techniques
You can see the pricing for glue here.
Glue jobs are metered and priced in DPU-hours, and the same pricing applies to dev endpoints. You are not charged for the time it takes to set up your cluster, but you incur charges for as long as you keep it running.
Glue jobs create and use clusters only when required, so they have some cost optimization built in. A good rule of thumb is to start with the minimum number of DPUs (which is 2), see how your job performs, and iterate to find the optimum number for your job. More DPUs means faster processing, but that is not required for most use cases.
Development endpoints are optional, and billing applies only if you choose to interactively develop your ETL code. By default, AWS Glue allocates 5 DPUs to each development endpoint. You are billed $0.44 per DPU-hour with a 10-minute minimum duration for each provisioned development endpoint. This pricing is similar to that of glue jobs; however, since the cluster runs continuously, dev endpoint costs can rise rapidly.
The optimal way of developing with endpoints is to set the minimum number of DPUs for your dev endpoint (2 is the minimum). Since most development use cases involve smaller data sets, 2 DPUs will usually be more than enough to transform your data. Delete the dev endpoint manually after use to avoid unnecessary costs. You can also create a scheduled lambda that automatically deletes dev endpoints at a set time of day, so that even if devs forget to delete endpoints, you have a fallback to clean up resources.
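A minimal sketch of such a cleanup handler, assuming it is scheduled (for example via an EventBridge rule) and that every dev endpoint in the account/region may safely be deleted:

import * as AWS from 'aws-sdk';

const glue = new AWS.Glue();

// Deletes every dev endpoint in the account/region; schedule with care.
export async function handler(): Promise<void> {
  const { DevEndpoints = [] } = await glue.getDevEndpoints().promise();
  await Promise.all(
    DevEndpoints.map(endpoint =>
      glue.deleteDevEndpoint({ EndpointName: endpoint.EndpointName! }).promise()
    )
  );
}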