
Write unit test(s) for Spark functions #17

Open · sjrusso8 opened this issue Apr 16, 2024 · 10 comments · Fixed by #68
Labels: good first issue (Good for newcomers)

Comments

@sjrusso8 (Owner) commented Apr 16, 2024

Description

There are many functions for Spark, and most of them are created via a macro. However, not all of them have unit test coverage. Create additional unit tests based on similar test conditions from the existing Spark API test cases.

I have been mirroring the docstring tests from the PySpark API for reference.
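
For instance, a mirrored test for sqrt might look roughly like the sketch below. This is a sketch only: setup() stands in for whatever session fixture the existing tests use, and the exact range/select signatures are assumptions.

// Rough sketch of the test style; `setup()` is an assumed helper that
// returns a connected SparkSession, mirroring the crate's test fixtures.
#[tokio::test]
async fn test_func_sqrt() -> Result<(), Box<dyn std::error::Error>> {
    let spark = setup().await;

    // one row with id = 0; apply `sqrt` to it, as a PySpark doctest would
    let df = spark
        .range(None, 1, 1, None)
        .select([sqrt("id").alias("r1")]);

    let res = df.collect().await?;
    // assert on the returned RecordBatch against the doctest's expected output
    Ok(())
}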

sjrusso8 added the good first issue label Apr 16, 2024
@lexara-prime-ai (Contributor)

Hi, would the scope for this be the following path: core/src/functions/mod.rs? I can see quite a number of tests there for other modules, like column.rs.

@sjrusso8 (Owner, Author)

Hi @lexara-prime-ai, welcome! Great question!

I have been trying to mirror the doctests from PySpark 3.5, and all the column unit tests in those docs technically use the function col. So when I was writing some of the tests, I just lumped them into functions. Was that correct? In hindsight, probably not; I think I was just in the zone of writing tests and kept moving forward.

It might be worthwhile to move those tests into column.rs.

@lexara-prime-ai (Contributor)

Thanks @sjrusso8! The explanation makes a lot of sense. I appreciate the groundwork you've laid with the existing test cases; they've been incredibly helpful as I piece everything together. I'll do my best to contribute and assist as much as possible.

@lexara-prime-ai (Contributor)

Hi @sjrusso8, hope you're doing well.

Quick question on how the isnotnull function has been implemented. I assumed it would be the exact opposite of the isnull implementation; however, the function generated by the generate_functions! macro requires an extra argument.

Moving the "definition" from here

// functions that require exactly two col arguments
generate_functions!(
    two_cols: nvl,
    nullif,
    isnotnull,
    ifnull,
    // ...

to here

// functions that require a single col argument
generate_functions!(
    one_col: isnan,
    isnull,
    isnotnull, // <---
    sqrt,
    // ...

allows compilation and the test passes...

res -> RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "r1", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  1,
], BooleanArray
[
  false,
  true,
]], row_count: 2 }
res -> RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "r1", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  1,
], BooleanArray
[
  true,
  false,
]], row_count: 2 }
test functions::tests::test_func_is_not_null ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 117 filtered out; finished in 0.11s

INITIAL COMPILER DIAGNOSTIC

  --> core/src/functions/mod.rs:1180:32
     |
1180 |             .select([col("a"), isnotnull("a").alias("r1")])
     |                                ^^^^^^^^^-----
     |                                         ||
     |                                         |expected all arguments to be this `&str` type because they need to match the type of this parameter
     |                                         an argument of type `&str` is missing
     |
note: function defined here
    --> core/src/functions/mod.rs:468:5
     |
294  |             pub fn $func_name<T: expressions::ToExpr>(col1: T, col2: T) -> Column
     |                               -                       -------  ------- this parameter needs to match the `&str` type of `col1`
     |                               |                       |
     |                               |                       `col2` needs to match the `&str` type of this parameter
     |                               `col1` and `col2` all reference this parameter T
...
468  |     isnotnull,
     |     ^^^^^^^^^
help: provide the argument
     |
1180 |             .select([col("a"), isnotnull("a", /* &str */).alias("r1")])
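
For context, the difference between the two macro arms presumably comes down to something like the simplified sketch below. This is not the crate's actual definition; Column, ToExpr, and invoke_func here are self-contained toy stand-ins. It shows how listing a function under two_cols forces the two-argument signature the compiler complained about:

// Toy, self-contained sketch of how a macro like `generate_functions!`
// might expand its `one_col` and `two_cols` arms.
pub struct Column(String);

pub trait ToExpr {
    fn to_expr(&self) -> String;
}

impl ToExpr for &str {
    fn to_expr(&self) -> String {
        self.to_string()
    }
}

// stand-in for whatever the crate does to build a function expression
fn invoke_func(name: &str, args: Vec<String>) -> Column {
    Column(format!("{name}({})", args.join(", ")))
}

macro_rules! generate_functions {
    (one_col: $($name:ident),* $(,)?) => {
        $(pub fn $name<T: ToExpr>(col: T) -> Column {
            invoke_func(stringify!($name), vec![col.to_expr()])
        })*
    };
    (two_cols: $($name:ident),* $(,)?) => {
        $(pub fn $name<T: ToExpr>(col1: T, col2: T) -> Column {
            invoke_func(stringify!($name), vec![col1.to_expr(), col2.to_expr()])
        })*
    };
}

// Listing `isnotnull` under `two_cols` expands it with two parameters,
// which is exactly why the compiler demanded a second `&str` above.
generate_functions!(one_col: isnull, isnotnull);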

@sjrusso8 (Owner, Author) commented Aug 7, 2024

@lexara-prime-ai I see what you mean. It should take only one input argument, so your change is correct. It probably wasn't caught earlier because of the lack of unit tests :)

I'm getting the list of arguments from the docs here -> https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
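
With the one-argument definition, the test body matches the .select from the diagnostic, minus the extra argument. Something along these lines (a sketch; setup() and the SQL-based DataFrame construction are assumptions):

#[tokio::test]
async fn test_func_is_not_null() -> Result<(), Box<dyn std::error::Error>> {
    let spark = setup().await; // assumed session fixture

    // two rows, a = NULL and a = 1, matching the RecordBatch output above
    let df = spark
        .sql("SELECT * FROM VALUES (NULL), (1) AS tab(a)")
        .await?
        .select([col("a"), isnotnull("a").alias("r1")]); // single argument now

    let res = df.collect().await?;
    // expect r1 = [false, true]
    Ok(())
}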

@lexara-prime-ai (Contributor)

Awesome, thanks for the confirmation! I'll note the subtle change in the commit message to keep track of it.

sjrusso8 pushed a commit that referenced this issue Aug 9, 2024
* Add test for ascending order with NULLs first.

* feat(tests): add test for ascending order with NULLs first (#17)

    * feat(tests): implement `test_func_asc_nulls_first`
      - Verify sorting functionality for ascending order with NULL values positioned first.

* feat(tests): add test for ascending order with NULLs last (#17)

    * feat(tests): implement `test_func_asc_nulls_last`
      - Verify sorting functionality for ascending order with NULL values positioned last.

* feat(tests): add tests for descending order with NULLs first and last (#17)

    * feat(tests): implement `test_func_desc_nulls_first`
      - Verify sorting functionality for descending order with NULL values positioned first.

    * feat(tests): implement `test_func_desc_nulls_last`
      - Verify sorting functionality for descending order with NULL values positioned last.

    * validate robustness and accuracy of the sorting mechanism

* feat(tests): add test for LIKE filter functionality (#17)

    * feat(tests): implement `test_func_like`
      - Verify filter functionality using the LIKE operator.
      - Ensure the filter correctly selects the name "Alice".

    * validate robustness and accuracy of the LIKE filter mechanism

* feat(tests): add test for ILIKE filter functionality (#17)

    * feat(tests): implement `test_func_ilike`
      - Verify filter functionality using the ILIKE operator.
      - Ensure the filter correctly selects the name "Alice" using a case-insensitive pattern "%Ice".

    * validate robustness and accuracy of the ILIKE filter mechanism

* feat(tests): add test for RLIKE filter functionality (#17)
   * feat(tests): implement `test_func_rlike`
     - Implemented a test to validate the behavior of the 'rlike' filter on a DataFrame.
     - Ensured that the filter correctly identifies records with names matching the regex pattern ice.
     - Added necessary setup and assertions for expected results.

* feat(tests): add test for EQ filter functionality (#17)
    * feat(tests): implement `test_func_eq`
      - Verify filter functionality using the EQ operator.

* feat(tests): add test for OR filter functionality (#17)
    * feat(tests): implement `test_func_or`
      - Verify filter functionality using the OR operator.

    * validate robustness and accuracy of the OR filter mechanism

* feat(tests): add test for IS_NULL filter functionality (#17)
   * feat(tests): implement `test_func_is_null`
     - Implemented a test to validate the behavior of the 'is_null' filter on a DataFrame.
     - Ensured that the filter correctly identifies records with null values.
     - Added necessary setup and assertions for expected results.

* Exclude additional is_null test.

* feat(tests): add test for IS_NOT_NULL filter functionality (#17)
   * feat(tests): implement `test_func_is_not_null`
     - Implemented a test to validate the behavior of the 'isnotnull' filter on a DataFrame.
     - Ensured that the filter correctly identifies records with non-null values.
     - Added necessary setup and assertions for expected results.
     - Moved `isnotnull` function definition to `
        // functions that require a single col argument
        generate_functions!(
            one_col: isnan,
            isnull,
            isnotnull,
            ...)
     `

* feat(tests): add tests for window functions and IS_NAN filter functionality (#17)

* feat(tests): implement `test_func_over`
  - Implemented a test to validate the behavior of window functions on a DataFrame.
  - Ensured that the window functions (`rank` and `min`) correctly compute values within the specified window.
  - Added necessary setup, sorting, and assertions for expected results.

* feat(tests): implement `test_func_isnan`
  - Implemented a test to validate the behavior of the `isnan` function on a DataFrame.
  - Ensured that the `isnan` function correctly identifies NaN values.
  - Added necessary setup and assertions for expected results.

* feat(tests): add tests for column aliasing, casting, and substring functions (#17)

- feat(tests): implement `test_func_cast`
  - Added a test to validate the behavior of casting an integer column to a string.
  - Ensured that the cast operation correctly transforms the data type and assigns the alias.

- feat(tests): implement `test_func_alias`
  - Added a test to verify column aliasing functionality.
  - Confirmed that the DataFrame correctly applies aliases to columns.

- feat(tests): implement `test_func_substr`
  - Added a test to validate the substring function on a string column.
  - Ensured that the substring operation works as expected and assigns the alias.

* feat(fix): Fixed typo in test name - test_func_substract to test_func_subtract.
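
To make the pattern behind these commits concrete, one of the filter tests might look roughly like the sketch below. This is illustrative only: setup() is an assumed session fixture, and the VALUES-based DataFrame plus the exact filter/like signatures are assumptions about the crate's API.

#[tokio::test]
async fn test_func_like() -> Result<(), Box<dyn std::error::Error>> {
    let spark = setup().await; // assumed session fixture

    // small two-row DataFrame to filter on
    let df = spark
        .sql("SELECT * FROM VALUES (2, 'Alice'), (5, 'Bob') AS tab(age, name)")
        .await?
        .filter(col("name").like("Alice"));

    let res = df.collect().await?;
    // expect a single row whose name column is "Alice"
    Ok(())
}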
sjrusso8 reopened this Aug 9, 2024
@sjrusso8 (Owner, Author) commented Aug 9, 2024

Column code coverage was added; keeping this open for now to track coverage of the other functions.

@lexara-prime-ai (Contributor)

@sjrusso8 I was looking through the other modules as well. Are there any areas you'd consider high priority, so I can start there?

@sjrusso8 (Owner, Author) commented Aug 14, 2024

@lexara-prime-ai there are a bunch of functions that have not been implemented yet. I attempted to make an inventory in issue #60.

Any of the functions that involve UDFs probably can't be implemented; Spark UDFs only really work with Python and Scala right now.

@lexara-prime-ai (Contributor)

OK, I just went through the list; I can take a look at the other issues in the meantime.
