-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50561][SQL] Improve type coercion and boundary checking for UNIFORM SQL function #49237
base: master
Are you sure you want to change the base?
Conversation
cc @MaxGekk @HyukjinKwon @cloud-fan Greetings, this small PR is ready for a review at your convenience :) |
case _ => | ||
throw SparkException.internalError( | ||
s"Unexpected argument data types: ${min.dataType}, ${max.dataType}") | ||
} | ||
} | ||
|
||
private def integer(t: DataType): Boolean = t match { | ||
case _: ShortType | _: IntegerType | _: LongType => true | ||
case _: ByteType | _: ShortType | _: IntegerType | _: LongType => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just check IntegralType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up just updating this to use the ExpectsInputTypes
trait to simplify this code.
@@ -49,6 +49,9 @@ trait RDG extends Expression with ExpressionWithRandomSeed { | |||
@transient protected lazy val seed: Long = seedExpression match { | |||
case e if e.dataType == IntegerType => e.eval().asInstanceOf[Int] | |||
case e if e.dataType == LongType => e.eval().asInstanceOf[Long] | |||
case e if e.dataType == FloatType => e.eval().asInstanceOf[Float].toLong | |||
case e if e.dataType == DoubleType => e.eval().asInstanceOf[Double].toLong | |||
case e if e.dataType.isInstanceOf[DecimalType] => e.eval().asInstanceOf[Decimal].toLong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truncation and overflow may happen here, how shall we deal it it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW do we really expect users to use float/double/decimal as the random seed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and the existing RAND
and RANDN
functions only accept IntegerType
or LongType
for the random seed (but positive, zero, and negative values are allowed). I updated this PR to be consistent with that.
case (_, DoubleType) | (DoubleType, _) => DoubleType | ||
case (_, FloatType) | (FloatType, _) => FloatType | ||
case (_, d: DecimalType) => d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we require the other side to be integral type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or at least numeric type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this is done.
@@ -229,16 +232,20 @@ case class Uniform(min: Expression, max: Expression, seedExpression: Expression, | |||
if Seq(first, second).forall(integer) => IntegerType | |||
case (_, ShortType) | (ShortType, _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One idea to generalize it
case (left: IntegralType, right: IntegralType) =>
if (UpCastRule.legalNumericPrecedence(left, right)) right else left
case float double stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this is done.
case (_, DoubleType) | (DoubleType, _) => DoubleType | ||
case (_, FloatType) | (FloatType, _) => FloatType | ||
case (_, d: DecimalType) => d | ||
case (d: DecimalType, _) => d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's two decimals, shouldn't we pick the wider one instead of preferring the right-side one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this is done.
respond to code review comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MaxGekk and @cloud-fan for your reviews! Responded to your comments, please take another look.
case _ => | ||
throw SparkException.internalError( | ||
s"Unexpected argument data types: ${min.dataType}, ${max.dataType}") | ||
} | ||
} | ||
|
||
private def integer(t: DataType): Boolean = t match { | ||
case _: ShortType | _: IntegerType | _: LongType => true | ||
case _: ByteType | _: ShortType | _: IntegerType | _: LongType => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up just updating this to use the ExpectsInputTypes
trait to simplify this code.
@@ -49,6 +49,9 @@ trait RDG extends Expression with ExpressionWithRandomSeed { | |||
@transient protected lazy val seed: Long = seedExpression match { | |||
case e if e.dataType == IntegerType => e.eval().asInstanceOf[Int] | |||
case e if e.dataType == LongType => e.eval().asInstanceOf[Long] | |||
case e if e.dataType == FloatType => e.eval().asInstanceOf[Float].toLong | |||
case e if e.dataType == DoubleType => e.eval().asInstanceOf[Double].toLong | |||
case e if e.dataType.isInstanceOf[DecimalType] => e.eval().asInstanceOf[Decimal].toLong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and the existing RAND
and RANDN
functions only accept IntegerType
or LongType
for the random seed (but positive, zero, and negative values are allowed). I updated this PR to be consistent with that.
@@ -229,16 +232,20 @@ case class Uniform(min: Expression, max: Expression, seedExpression: Expression, | |||
if Seq(first, second).forall(integer) => IntegerType | |||
case (_, ShortType) | (ShortType, _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this is done.
case (_, DoubleType) | (DoubleType, _) => DoubleType | ||
case (_, FloatType) | (FloatType, _) => FloatType | ||
case (_, d: DecimalType) => d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this is done.
case (_, DoubleType) | (DoubleType, _) => DoubleType | ||
case (_, FloatType) | (FloatType, _) => FloatType | ||
case (_, d: DecimalType) => d | ||
case (d: DecimalType, _) => d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this is done.
What changes were proposed in this pull request?
This PR improve type coercion and boundary checking for UNIFORM SQL function.
@srielau found the following issues and wrote them down in SPARK-50561:
Why are the changes needed?
This PR fixes the above problems to make the function work in more cases and produce better error messages when it fails.
For example:
Does this PR introduce any user-facing change?
Yes, see above.
How was this patch tested?
This PR adds golden file based test coverage, and updates existing coverage.
Was this patch authored or co-authored using generative AI tooling?
No.