-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(glue-alpha): glue L2 alpha construct #30833
base: main
Are you sure you want to change the base?
Conversation
…-glue-l2 to mjanardhan/aws-cdk-glue-l2 to resolve non-trivial merge situation
Workflow Triggers
Added Scala jobs unit tests
fixed the unit test for scala etl and flex etl
WorkerType and numberofWorkers defaults are enforced when not set
Integ tests snapshots
…s-cdk-glue-l2 into glue-alpha-refacor
## Database | ||
|
||
A `Database` is a logical grouping of `Tables` in the Glue Catalog. | ||
|
||
```ts | ||
new glue.Database(this, 'MyDatabase', { | ||
databaseName: 'my_database', | ||
description: 'my_database_description', | ||
}); | ||
``` | ||
|
||
## Table | ||
|
||
A Glue table describes a table of data in S3: its structure (column names and types), location of data (S3 objects with a common prefix in a S3 bucket), and format for the files (Json, Avro, Parquet, etc.): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come the documentation for Database
, Table
and many others are being deleted here? The code files are not being modified, is it still WIP or will they be removed or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with @natalie-white-aws and the removal of these docs are unintended and will be updated to add them back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this huge effort on re-writing L2 glue. Left some feedback on it. Still working my way to finish the review.
A general feedback on the README is to add more examples with usage.
from multiple sources for analytics, machine learning (ML), and application | ||
development. | ||
|
||
Wihout an L2 construct, developers define Glue data sources, connections, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wihout an L2 construct, developers define Glue data sources, connections, | |
Without an L2 construct, developers define Glue data sources, connections, |
|
||
There are 3 types of jobs supported by AWS Glue: Spark ETL, Spark Streaming, and Python Shell jobs. | ||
1. **ETL Jobs** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give an example of ETL job?
/** | ||
* The IAM role assumed by Glue to run this job. | ||
* | ||
* @default - undefined | ||
*/ | ||
readonly role?: iam.IRole; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary for the import attribute?
* | ||
* @see https://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html | ||
**/ | ||
readonly role: iam.IRole; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slight improvement: Instead of making it required, a common practice L2 does is to auto-generate a new role if users don't supply one. Can we do the same here? How hard would it be?
* @param id The construct's id. | ||
* @param attrs Attributes for the Glue Job we want to import | ||
*/ | ||
public static fromJobAttributes(scope: constructs.Construct, id: string, attrs: JobImportAttributes): IJob { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is fromXXX
function even needed on an abstract class where users won't be able to instantiate anyway? Should we consider adding it to the concrete class?
} | ||
|
||
private setupSparkUI(role: iam.IRole, sparkUiProps: SparkUIProps) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Spark ETL Jobs class | ||
* ETL jobs support pySpark and Scala languages, for which there are separate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit misleading as this file is dedicated for PySpark. I see there's another one for Scala. Can you update the docstrings?
this.role = props.role, { | ||
assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'), | ||
managedPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole')], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you're using a comma operator to assign the values to this.role
. Then why not make role
optional?
|
||
// Enable CloudWatch metrics and continuous logging by default as a best practice | ||
const continuousLoggingArgs = this.setupContinuousLogging(this.role, props.continuousLogging); | ||
const profilingMetricsArgs = { '--enable-metrics': '' }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question as pyspark-etl
one.
const jobResource = new CfnJob(this, 'Resource', { | ||
name: props.jobName, | ||
description: props.description, | ||
role: this.role.roleArn, | ||
command: { | ||
name: JobType.ETL, | ||
scriptLocation: this.codeS3ObjectUrl(props.script), | ||
pythonVersion: PythonVersion.THREE, | ||
}, | ||
glueVersion: props.glueVersion ? props.glueVersion : GlueVersion.V3_0, | ||
workerType: props.workerType ? props.workerType : WorkerType.G_1X, | ||
numberOfWorkers: props.numberOfWorkers ? props.numberOfWorkers : 10, | ||
maxRetries: props.maxRetries, | ||
executionProperty: props.maxConcurrentRuns ? { maxConcurrentRuns: props.maxConcurrentRuns } : undefined, | ||
notificationProperty: props.notifyDelayAfter ? { notifyDelayAfter: props.notifyDelayAfter.toMinutes() } : undefined, | ||
timeout: props.timeout?.toMinutes(), | ||
connections: props.connections ? { connections: props.connections.map((connection) => connection.connectionName) } : undefined, | ||
securityConfiguration: props.securityConfiguration?.securityConfigurationName, | ||
tags: props.tags, | ||
executionClass: ExecutionClass.FLEX, | ||
defaultArguments, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general feedback, the standard and the flex constructs look very similar and a lot of codes are duplicated. It would make future maintaining hard where the same changes need to apply to two places. Do we really need two separate files and constructs for it? Can we make it the same construct?
This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week. |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Issue # (if applicable)
RFC 0497-glue-l2-construct
Link: https://github.com/aws/aws-cdk-rfcs/blob/main/text/0497-glue-l2-construct.md
Reason for this change
Refactor existing Glue L2 alpha construct to adhere to CDK best practices, enforce Job types and parameters via interface constructs, and improve the developer experience pushing validations left and disallowing deprecated Glue and language versions.
Description of changes
The existing Job and Job Executable monoliths have been decomposed into Job Type and Language specific classes that implement and extend an abstract Job parent class. Developers will be able to see mandatory and optional parameters that apply just to their selected job type and language, rather than having to reference documentation and examples or find out during synth or deploy time that they've selected the wrong configuration.
BREAKING CHANGE: Developers must refactor their existing Job instantiation method calls to choose the right job type and language, and use the new constants static values to define the associated Job configuration settings. See the RFC and/or new README for examples.
Description of how you validated changes
We've added unit and integration tests for each valid job type and language combination and validated the creation of new Glue Jobs adhere to opinionated best practices such as continuous logging and runtime versions.
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license