
Write unit test(s) for Spark functions #17

Open · sjrusso8 opened this issue Apr 16, 2024 · 10 comments · Fixed by #68
Labels: good first issue (Good for newcomers)

Comments

@sjrusso8 (Owner) commented Apr 16, 2024

Description

There are many functions for Spark, and most of them are created via a macro. However, not all of them have unit test coverage. Create additional unit tests based on similar test conditions from the existing Spark API test cases.

I have been mirroring the docstring tests from the PySpark API for reference.
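
For instance, a mirrored test for sqrt might look roughly like the sketch below. This is a sketch only: setup() stands in for whatever session fixture the existing tests use, and the exact range/select signatures are assumptions.

// Rough sketch of the test style; `setup()` is an assumed helper that
// returns a connected SparkSession, mirroring the crate's test fixtures.
#[tokio::test]
async fn test_func_sqrt() -> Result<(), Box<dyn std::error::Error>> {
    let spark = setup().await;

    // one row with id = 0; apply `sqrt` to it, as a PySpark doctest would
    let df = spark
        .range(None, 1, 1, None)
        .select([sqrt("id").alias("r1")]);

    let res = df.collect().await?;
    // assert on the returned RecordBatch against the doctest's expected output
    Ok(())
}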

sjrusso8 added the good first issue label Apr 16, 2024
@lexara-prime-ai (Contributor)

Hi, would the scope for this be the following path: core/src/functions/mod.rs? I can see quite a number of tests there for other modules, like column.rs.

@sjrusso8 (Owner, Author)

Hi @lexara-prime-ai, welcome! Great question!

I have been trying to mirror the doctests from PySpark 3.5, and all the column unit tests in those docs technically use the function col. So when I was writing some of the tests, I just lumped them into functions. Was that correct? In hindsight, probably not; I think I was just in the zone of writing tests and kept moving forward.

It might be worthwhile to move those tests into column.rs.

@lexara-prime-ai (Contributor)

Thanks @sjrusso8! The explanation makes a lot of sense. I appreciate the groundwork you've laid with the existing test cases; they've been incredibly helpful as I piece everything together. I'll do my best to contribute and assist as much as possible.

@lexara-prime-ai (Contributor)

Hi @sjrusso8, hope you're doing well.

Quick question on how the isnotnull function has been implemented. I assumed it would be the exact opposite of the isnull implementation; however, the function generated by the generate_functions! macro requires an extra argument.

Moving the "definition" from here

// functions that require exactly two col arguments
generate_functions!(
    two_cols: nvl,
    nullif,
    isnotnull,
    ifnull,
    // ...

to here

// functions that require a single col argument
generate_functions!(
    one_col: isnan,
    isnull,
    isnotnull, // <---
    sqrt,
    // ...

allows compilation and the test passes...

res -> RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "r1", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  1,
], BooleanArray
[
  false,
  true,
]], row_count: 2 }
res -> RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "r1", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  1,
], BooleanArray
[
  true,
  false,
]], row_count: 2 }
test functions::tests::test_func_is_not_null ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 117 filtered out; finished in 0.11s

INITIAL COMPILER DIAGNOSTIC

  --> core/src/functions/mod.rs:1180:32
     |
1180 |             .select([col("a"), isnotnull("a").alias("r1")])
     |                                ^^^^^^^^^-----
     |                                         ||
     |                                         |expected all arguments to be this `&str` type because they need to match the type of this parameter
     |                                         an argument of type `&str` is missing
     |
note: function defined here
    --> core/src/functions/mod.rs:468:5
     |
294  |             pub fn $func_name<T: expressions::ToExpr>(col1: T, col2: T) -> Column
     |                               -                       -------  ------- this parameter needs to match the `&str` type of `col1`
     |                               |                       |
     |                               |                       `col2` needs to match the `&str` type of this parameter
     |                               `col1` and `col2` all reference this parameter T
...
468  |     isnotnull,
     |     ^^^^^^^^^
help: provide the argument
     |
1180 |             .select([col("a"), isnotnull("a", /* &str */).alias("r1")])
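
For context, the difference between the two macro arms presumably comes down to something like the simplified sketch below. This is not the crate's actual definition; Column, ToExpr, and invoke_func here are self-contained toy stand-ins. It shows how listing a function under two_cols forces the two-argument signature the compiler complained about:

// Toy, self-contained sketch of how a macro like `generate_functions!`
// might expand its `one_col` and `two_cols` arms.
pub struct Column(String);

pub trait ToExpr {
    fn to_expr(&self) -> String;
}

impl ToExpr for &str {
    fn to_expr(&self) -> String {
        self.to_string()
    }
}

// stand-in for whatever the crate does to build a function expression
fn invoke_func(name: &str, args: Vec<String>) -> Column {
    Column(format!("{name}({})", args.join(", ")))
}

macro_rules! generate_functions {
    (one_col: $($name:ident),* $(,)?) => {
        $(pub fn $name<T: ToExpr>(col: T) -> Column {
            invoke_func(stringify!($name), vec![col.to_expr()])
        })*
    };
    (two_cols: $($name:ident),* $(,)?) => {
        $(pub fn $name<T: ToExpr>(col1: T, col2: T) -> Column {
            invoke_func(stringify!($name), vec![col1.to_expr(), col2.to_expr()])
        })*
    };
}

// Listing `isnotnull` under `two_cols` expands it with two parameters,
// which is exactly why the compiler demanded a second `&str` above.
generate_functions!(one_col: isnull, isnotnull);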

@sjrusso8 (Owner, Author) commented Aug 7, 2024

@lexara-prime-ai I see what you mean. It should take only one input argument, so your change is correct. It probably wasn't caught earlier because of the lack of unit tests :)

I'm getting the list of arguments from the docs here -> https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
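
With the one-argument definition, the test body matches the .select from the diagnostic, minus the extra argument. Something along these lines (a sketch; setup() and the SQL-based DataFrame construction are assumptions):

#[tokio::test]
async fn test_func_is_not_null() -> Result<(), Box<dyn std::error::Error>> {
    let spark = setup().await; // assumed session fixture

    // two rows, a = NULL and a = 1, matching the RecordBatch output above
    let df = spark
        .sql("SELECT * FROM VALUES (NULL), (1) AS tab(a)")
        .await?
        .select([col("a"), isnotnull("a").alias("r1")]); // single argument now

    let res = df.collect().await?;
    // expect r1 = [false, true]
    Ok(())
}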

@lexara-prime-ai (Contributor)

Awesome, thanks for the confirmation! I'll note the subtle change in the commit message to keep track of it.

sjrusso8 pushed a commit that referenced this issue Aug 9, 2024
* Add test for ascending order with NULLs first.

* feat(tests): add test for ascending order with NULLs first (#17)

    * feat(tests): implement `test_func_asc_nulls_first`
      - Verify sorting functionality for ascending order with NULL values positioned first.

* feat(tests): add test for ascending order with NULLs last (#17)

    * feat(tests): implement `test_func_asc_nulls_last`
      - Verify sorting functionality for ascending order with NULL values positioned last.

* feat(tests): add tests for descending order with NULLs first and last (#17)

    * feat(tests): implement `test_func_desc_nulls_first`
      - Verify sorting functionality for descending order with NULL values positioned first.

    * feat(tests): implement `test_func_desc_nulls_last`
      - Verify sorting functionality for descending order with NULL values positioned last.

    * validate robustness and accuracy of the sorting mechanism

* feat(tests): add test for LIKE filter functionality (#17)

    * feat(tests): implement `test_func_like`
      - Verify filter functionality using the LIKE operator.
      - Ensure the filter correctly selects the name "Alice".

    * validate robustness and accuracy of the LIKE filter mechanism

* feat(tests): add test for ILIKE filter functionality (#17)

    * feat(tests): implement `test_func_ilike`
      - Verify filter functionality using the ILIKE operator.
      - Ensure the filter correctly selects the name "Alice" using a case-insensitive pattern "%Ice".

    * validate robustness and accuracy of the ILIKE filter mechanism

* feat(tests): add test for RLIKE filter functionality (#17)
   * feat(tests): implement `test_func_rlike`
     - Implemented a test to validate the behavior of the 'rlike' filter on a DataFrame.
     - Ensured that the filter correctly identifies records with names matching the regex pattern ice.
     - Added necessary setup and assertions for expected results.

* feat(tests): add test for EQ filter functionality (#17)
    * feat(tests): implement `test_func_eq`
      - Verify filter functionality using the EQ operator.

* feat(tests): add test for OR filter functionality (#17)
    * feat(tests): implement `test_func_or`
      - Verify filter functionality using the OR operator.

    * validate robustness and accuracy of the OR filter mechanism

* feat(tests): add test for IS_NULL filter functionality (#17)
   * feat(tests): implement `test_func_is_null`
     - Implemented a test to validate the behavior of the 'is_null' filter on a DataFrame.
     - Ensured that the filter correctly identifies records with null values.
     - Added necessary setup and assertions for expected results.

* Exclude additional is_null test.

* feat(tests): add test for IS_NOT_NULL filter functionality (#17)
   * feat(tests): implement `test_func_is_not_null`
     - Implemented a test to validate the behavior of the 'isnotnull' filter on a DataFrame.
     - Ensured that the filter correctly identifies records with non-null values.
     - Added necessary setup and assertions for expected results.
     - Moved `isnotnull` function definition to `
        // functions that require a single col argument
        generate_functions!(
            one_col: isnan,
            isnull,
            isnotnull,
            ...)
     `

* feat(tests): add tests for window functions and IS_NAN filter functionality (#17)

* feat(tests): implement `test_func_over`
  - Implemented a test to validate the behavior of window functions on a DataFrame.
  - Ensured that the window functions (`rank` and `min`) correctly compute values within the specified window.
  - Added necessary setup, sorting, and assertions for expected results.

* feat(tests): implement `test_func_isnan`
  - Implemented a test to validate the behavior of the `isnan` function on a DataFrame.
  - Ensured that the `isnan` function correctly identifies NaN values.
  - Added necessary setup and assertions for expected results.

* feat(tests): add tests for column aliasing, casting, and substring functions (#17)

- feat(tests): implement `test_func_cast`
  - Added a test to validate the behavior of casting an integer column to a string.
  - Ensured that the cast operation correctly transforms the data type and assigns the alias.

- feat(tests): implement `test_func_alias`
  - Added a test to verify column aliasing functionality.
  - Confirmed that the DataFrame correctly applies aliases to columns.

- feat(tests): implement `test_func_substr`
  - Added a test to validate the substring function on a string column.
  - Ensured that the substring operation works as expected and assigns the alias.

* feat(fix): Fixed typo in test name - test_func_substract to test_func_subtract.
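
To make the pattern behind these commits concrete, one of the filter tests might look roughly like the sketch below. This is illustrative only: setup() is an assumed session fixture, and the VALUES-based DataFrame plus the exact filter/like signatures are assumptions about the crate's API.

#[tokio::test]
async fn test_func_like() -> Result<(), Box<dyn std::error::Error>> {
    let spark = setup().await; // assumed session fixture

    // small two-row DataFrame to filter on
    let df = spark
        .sql("SELECT * FROM VALUES (2, 'Alice'), (5, 'Bob') AS tab(age, name)")
        .await?
        .filter(col("name").like("Alice"));

    let res = df.collect().await?;
    // expect a single row whose name column is "Alice"
    Ok(())
}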
sjrusso8 reopened this Aug 9, 2024
@sjrusso8 (Owner, Author) commented Aug 9, 2024

Column code coverage was added; keeping this open for now to track coverage of the other functions.

@lexara-prime-ai (Contributor)

@sjrusso8 I was looking through the other modules as well. Are there any areas you'd consider high priority, so I can start there?

@sjrusso8 (Owner, Author) commented Aug 14, 2024

@lexara-prime-ai there are a bunch of functions that have not been implemented yet. I attempted to make an inventory in issue #60.

Any of the functions that involve UDFs probably can't be implemented; Spark UDFs only really work with Python and Scala right now.

@lexara-prime-ai (Contributor)

OK, I just went through the list; I can take a look at the other issues in the meantime.
