[SPARK-50561][SQL] Improve type coercion and boundary checking for UNIFORM SQL function #49237

dtenedor · 2024-12-18T23:36:17Z

What changes were proposed in this pull request?

This PR improves type coercion and boundary checking for the UNIFORM SQL function.

@srielau found the following issues and wrote them down in SPARK-50561:

TINYINT, BIGINT, and DECIMAL types were not supported.
No type coercion from floating-point numbers was implemented.
No explicit error checking for negative numbers was implemented, resulting in confusing stack traces instead of clear error messages.

Why are the changes needed?

This PR fixes the above problems to make the function work in more cases and produce better error messages when it fails.

For example:

SELECT uniform(cast(10 as decimal(10, 3)), cast(20 as decimal(10, 3)), 0.0D) AS result;
> 17.605

SELECT uniform(-20L, -10L, 0) AS result;
> -12

SELECT uniform(0, cast(10 as tinyint), 0) AS result;
> 7

SELECT uniform(0, cast(10 as bigint), 0) AS result;
> 7

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds golden-file-based test coverage and updates existing coverage.

Was this patch authored or co-authored using generative AI tooling?

No.

commit

dtenedor · 2024-12-19T18:04:17Z

cc @MaxGekk @HyukjinKwon @cloud-fan Greetings, this small PR is ready for a review at your convenience :)

MaxGekk · 2024-12-19T18:55:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

      case _ =>
        throw SparkException.internalError(
          s"Unexpected argument data types: ${min.dataType}, ${max.dataType}")
    }
  }

  private def integer(t: DataType): Boolean = t match {
-    case _: ShortType | _: IntegerType | _: LongType => true
+    case _: ByteType | _: ShortType | _: IntegerType | _: LongType => true


Can you just check IntegralType?

I ended up just updating this to use the ExpectsInputTypes trait to simplify this code.
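To illustrate the reviewer's suggestion, here is a minimal standalone sketch of why matching on a shared parent type is less error-prone than enumerating members. The types below are a toy model that only mimics the shape of Spark's hierarchy (where ByteType, ShortType, IntegerType, and LongType extend IntegralType); the object and method names are hypothetical and this is not the code that was merged.

```scala
// Toy model, not Spark's classes: a small type hierarchy with an integral parent.
object IntegralCheckSketch {
  sealed trait DataType
  sealed trait IntegralType extends DataType
  case object ByteType extends IntegralType
  case object ShortType extends IntegralType
  case object IntegerType extends IntegralType
  case object LongType extends IntegralType
  case object DoubleType extends DataType

  // Enumerating members: any type left off the list (here ByteType) is silently rejected.
  def integerByEnumeration(t: DataType): Boolean = t match {
    case ShortType | IntegerType | LongType => true
    case _ => false
  }

  // Matching on the shared parent covers every integral type at once.
  def integerByParent(t: DataType): Boolean = t match {
    case _: IntegralType => true
    case _ => false
  }
}
```

With the enumerated version, ByteType is rejected until someone remembers to add it; the parent-type version accepts it without further changes.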

cloud-fan · 2024-12-20T07:21:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

@@ -49,6 +49,9 @@ trait RDG extends Expression with ExpressionWithRandomSeed {
  @transient protected lazy val seed: Long = seedExpression match {
    case e if e.dataType == IntegerType => e.eval().asInstanceOf[Int]
    case e if e.dataType == LongType => e.eval().asInstanceOf[Long]
+    case e if e.dataType == FloatType => e.eval().asInstanceOf[Float].toLong
+    case e if e.dataType == DoubleType => e.eval().asInstanceOf[Double].toLong
+    case e if e.dataType.isInstanceOf[DecimalType] => e.eval().asInstanceOf[Decimal].toLong


Truncation and overflow may happen here; how shall we deal with it?

BTW do we really expect users to use float/double/decimal as the random seed?

I checked and the existing RAND and RANDN functions only accept IntegerType or LongType for the random seed (but positive, zero, and negative values are allowed). I updated this PR to be consistent with that.
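The truncation and overflow concern can be seen with plain Scala, independent of Spark. The sketch below only shows how `Double.toLong` behaves on the JVM; the object and value names are hypothetical.

```scala
// Plain Scala, no Spark: how a floating-point value converts to a Long seed.
object SeedConversionSketch {
  // The fractional part is dropped (rounding toward zero).
  val truncatedPositive: Long = 1.9.toLong
  val truncatedNegative: Long = (-1.9).toLong

  // Values beyond the Long range saturate at the bound instead of failing.
  val saturatedHigh: Long = 1e30.toLong

  // NaN converts to zero.
  val fromNaN: Long = Double.NaN.toLong

  // Consequence: distinct floating-point seeds can collapse to the same Long seed.
  val seedsCollide: Boolean = 0.1.toLong == 0.9.toLong
}
```

Because several distinct inputs map to the same seed without any error, restricting the seed to integer types avoids surprising users.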

cloud-fan · 2024-12-20T07:23:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

      case (_, DoubleType) | (DoubleType, _) => DoubleType
      case (_, FloatType) | (FloatType, _) => FloatType
+      case (_, d: DecimalType) => d


Shall we require the other side to be an integral type?

Or at least a numeric type?

Good idea, this is done.

cloud-fan · 2024-12-20T07:31:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

@@ -229,16 +232,20 @@ case class Uniform(min: Expression, max: Expression, seedExpression: Expression,
        if Seq(first, second).forall(integer) => IntegerType
      case (_, ShortType) | (ShortType, _)


One idea to generalize it:

      case (left: IntegralType, right: IntegralType) =>
        if (UpCastRule.legalNumericPrecedence(left, right)) right else left
      // followed by the float/double cases

Good idea, this is done.
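As a rough illustration of the idea, the sketch below picks the wider of two integral types using an ordered precedence list. It is a standalone toy model: `narrowerThan` is a hypothetical stand-in for a precedence check, and nothing here is claimed to match the behavior of Spark's `UpCastRule` or the merged code.

```scala
// Toy model, not Spark's classes: choose the wider of two integral types.
object IntegralWideningSketch {
  sealed trait IntegralType
  case object ByteType extends IntegralType
  case object ShortType extends IntegralType
  case object IntegerType extends IntegralType
  case object LongType extends IntegralType

  // Ordered from narrowest to widest.
  private val precedence: Seq[IntegralType] =
    Seq(ByteType, ShortType, IntegerType, LongType)

  // Hypothetical precedence check: true when `from` is strictly narrower than `to`.
  def narrowerThan(from: IntegralType, to: IntegralType): Boolean =
    precedence.indexOf(from) < precedence.indexOf(to)

  // One rule replaces the pairwise cases: keep whichever side is wider.
  def wider(left: IntegralType, right: IntegralType): IntegralType =
    if (narrowerThan(left, right)) right else left
}
```

A single rule like this handles every pair of integral types, so supporting an additional type only requires adding it to the precedence list.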

cloud-fan · 2024-12-20T07:33:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

      case (_, DoubleType) | (DoubleType, _) => DoubleType
      case (_, FloatType) | (FloatType, _) => FloatType
+      case (_, d: DecimalType) => d
+      case (d: DecimalType, _) => d


If both sides are decimals, shouldn't we pick the wider one instead of preferring the right-side one?

Good idea, this is done.
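One way to read "the wider one" is a decimal type that can hold every value of both inputs. The sketch below is a standalone model under that assumption: `Dec` and `wider` are hypothetical names, the 38-digit cap reflects Spark's maximum decimal precision, and the merged code may compute this differently.

```scala
// Toy model, not Spark's DecimalType: compute a decimal type wide enough for both inputs.
object WiderDecimalSketch {
  // Assumed cap on total digits, matching Spark's maximum decimal precision.
  final val MaxPrecision = 38

  // precision = total digits, scale = digits after the decimal point.
  final case class Dec(precision: Int, scale: Int) {
    require(precision >= 1 && precision <= MaxPrecision, "precision out of range")
    require(scale >= 0 && scale <= precision, "scale must be within [0, precision]")
  }

  def wider(a: Dec, b: Dec): Dec = {
    // Keep enough fractional digits for both sides.
    val scale = math.max(a.scale, b.scale)
    // Keep enough integer digits for both sides.
    val intDigits = math.max(a.precision - a.scale, b.precision - b.scale)
    // Cap the total at the maximum precision.
    Dec(math.min(intDigits + scale, MaxPrecision), scale)
  }
}
```

Note that the result can differ from both inputs: for decimal(10, 3) and decimal(5, 4), neither side can represent the other, and this sketch yields decimal(11, 4).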

respond to code review comments

dtenedor

Thanks @MaxGekk and @cloud-fan for your reviews! Responded to your comments, please take another look.

commit

1585219

commit

dtenedor marked this pull request as ready for review December 18, 2024 23:36

github-actions bot added the SQL label Dec 18, 2024

commit

1f0197b

MaxGekk reviewed Dec 19, 2024

cloud-fan reviewed Dec 20, 2024

respond to code review comments

f98cb95

respond to code review comments

dtenedor commented Dec 20, 2024

dtenedor requested review from cloud-fan and MaxGekk December 20, 2024 22:30

fix test

2ad7227
