Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50561][SQL] Improve type coercion and boundary checking for UNIFORM SQL function #49237

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dtenedor
Copy link
Contributor

@dtenedor dtenedor commented Dec 18, 2024

What changes were proposed in this pull request?

This PR improve type coercion and boundary checking for UNIFORM SQL function.

@srielau found the following issues and wrote them down in SPARK-50561:

  • TINYINT and BIGINT and DECIMAL types were not supported.
  • No type coercion from floating-point numbers was implemented.
  • No explicit error checking for negative numbers was implemented, resulting in weird stacktraces instead.

Why are the changes needed?

This PR fixes the above problems to make the function work in more cases and produce better error messages when it fails.

For example:

SELECT uniform(cast(10 as decimal(10, 3)), cast(20 as decimal(10, 3)), 0.0D) AS result;
> 17.605

SELECT uniform(-20L, -10L, 0) AS result
> -12

SELECT uniform(0, cast(10 as tinyint), 0) AS result
> 7

SELECT uniform(0, cast(10 as bigint), 0) AS result
> 7

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds golden file based test coverage, and updates existing coverage.

Was this patch authored or co-authored using generative AI tooling?

No.

commit
@dtenedor dtenedor marked this pull request as ready for review December 18, 2024 23:36
@github-actions github-actions bot added the SQL label Dec 18, 2024
@dtenedor
Copy link
Contributor Author

cc @MaxGekk @HyukjinKwon @cloud-fan Greetings, this small PR is ready for a review at your convenience :)

case _ =>
throw SparkException.internalError(
s"Unexpected argument data types: ${min.dataType}, ${max.dataType}")
}
}

private def integer(t: DataType): Boolean = t match {
case _: ShortType | _: IntegerType | _: LongType => true
case _: ByteType | _: ShortType | _: IntegerType | _: LongType => true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just check IntegralType?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up just updating this to use the ExpectsInputTypes trait to simplify this code.

@@ -49,6 +49,9 @@ trait RDG extends Expression with ExpressionWithRandomSeed {
@transient protected lazy val seed: Long = seedExpression match {
case e if e.dataType == IntegerType => e.eval().asInstanceOf[Int]
case e if e.dataType == LongType => e.eval().asInstanceOf[Long]
case e if e.dataType == FloatType => e.eval().asInstanceOf[Float].toLong
case e if e.dataType == DoubleType => e.eval().asInstanceOf[Double].toLong
case e if e.dataType.isInstanceOf[DecimalType] => e.eval().asInstanceOf[Decimal].toLong
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncation and overflow may happen here, how shall we deal it it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW do we really expect users to use float/double/decimal as the random seed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and the existing RAND and RANDN functions only accept IntegerType or LongType for the random seed (but positive, zero, and negative values are allowed). I updated this PR to be consistent with that.

case (_, DoubleType) | (DoubleType, _) => DoubleType
case (_, FloatType) | (FloatType, _) => FloatType
case (_, d: DecimalType) => d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we require the other side to be integral type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or at least numeric type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is done.

@@ -229,16 +232,20 @@ case class Uniform(min: Expression, max: Expression, seedExpression: Expression,
if Seq(first, second).forall(integer) => IntegerType
case (_, ShortType) | (ShortType, _)
Copy link
Contributor

@cloud-fan cloud-fan Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea to generalize it

case (left: IntegralType, right: IntegralType) =>
  if (UpCastRule.legalNumericPrecedence(left, right)) right else left
case float double stuff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is done.

case (_, DoubleType) | (DoubleType, _) => DoubleType
case (_, FloatType) | (FloatType, _) => FloatType
case (_, d: DecimalType) => d
case (d: DecimalType, _) => d
Copy link
Contributor

@cloud-fan cloud-fan Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's two decimals, shouldn't we pick the wider one instead of preferring the right-side one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is done.

respond to code review comments
Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MaxGekk and @cloud-fan for your reviews! Responded to your comments, please take another look.

case _ =>
throw SparkException.internalError(
s"Unexpected argument data types: ${min.dataType}, ${max.dataType}")
}
}

private def integer(t: DataType): Boolean = t match {
case _: ShortType | _: IntegerType | _: LongType => true
case _: ByteType | _: ShortType | _: IntegerType | _: LongType => true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up just updating this to use the ExpectsInputTypes trait to simplify this code.

@@ -49,6 +49,9 @@ trait RDG extends Expression with ExpressionWithRandomSeed {
@transient protected lazy val seed: Long = seedExpression match {
case e if e.dataType == IntegerType => e.eval().asInstanceOf[Int]
case e if e.dataType == LongType => e.eval().asInstanceOf[Long]
case e if e.dataType == FloatType => e.eval().asInstanceOf[Float].toLong
case e if e.dataType == DoubleType => e.eval().asInstanceOf[Double].toLong
case e if e.dataType.isInstanceOf[DecimalType] => e.eval().asInstanceOf[Decimal].toLong
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and the existing RAND and RANDN functions only accept IntegerType or LongType for the random seed (but positive, zero, and negative values are allowed). I updated this PR to be consistent with that.

@@ -229,16 +232,20 @@ case class Uniform(min: Expression, max: Expression, seedExpression: Expression,
if Seq(first, second).forall(integer) => IntegerType
case (_, ShortType) | (ShortType, _)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is done.

case (_, DoubleType) | (DoubleType, _) => DoubleType
case (_, FloatType) | (FloatType, _) => FloatType
case (_, d: DecimalType) => d
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is done.

case (_, DoubleType) | (DoubleType, _) => DoubleType
case (_, FloatType) | (FloatType, _) => FloatType
case (_, d: DecimalType) => d
case (d: DecimalType, _) => d
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants