Make random data in Python tests deterministic #14071

vuule · 2023-09-08T19:23:09Z

Description

Some random data generators in cuDF default to seed=None, which means that an OS- or time-dependent seed is used, so the test data differs between systems and between runs.
This PR changes the default to a fixed integer so that the same data is generated every time.

Contributes to #17045.
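The effect of the default can be illustrated with a minimal sketch. It assumes a hypothetical generator, make_random_column, built on Python's standard random module rather than on cuDF; only the shape of the seed default mirrors the change described above.

```python
import random


def make_random_column(nrows=5, seed=1):
    """Hypothetical stand-in for a test-data generator; NOT a cuDF function."""
    # random.Random(None) would seed from the OS or the clock, so every run
    # would differ. An integer seed makes the stream reproducible.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(nrows)]


# Two calls that rely on the fixed default produce the same data.
first = make_random_column()
second = make_random_column()
```

With seed=None as the default, first and second would almost always differ, which is what made the test inputs vary between runs.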

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vuule · 2023-09-11T18:21:16Z

python/cudf/cudf/tests/test_array_ufunc.py

-        for arg in args:
-            set_random_null_mask_inplace(arg)
+        for idx, arg in enumerate(args):
+            set_random_null_mask_inplace(arg, seed=idx)


Using seed=idx so that different columns get different null masks.
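Why a per-column seed matters once the default is fixed can be sketched as follows. The helper random_null_mask is hypothetical and uses the standard random module; it is not set_random_null_mask_inplace, whose implementation is not shown in this PR excerpt.

```python
import random


def random_null_mask(nrows, null_probability=0.5, seed=0):
    """Hypothetical helper: True marks a valid row, False marks a null."""
    rng = random.Random(seed)
    return [rng.random() >= null_probability for _ in range(nrows)]


columns = ["a", "b", "c"]

# Same seed for every column: all three masks are identical, so the columns
# would have nulls in exactly the same rows.
shared = [random_null_mask(64, seed=0) for _ in columns]

# seed=idx: each column draws from its own reproducible stream.
per_column = [random_null_mask(64, seed=idx) for idx, _ in enumerate(columns)]
```

Both variants are deterministic; only the second keeps the null patterns of the columns independent of each other.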

vuule · 2023-09-11T18:21:55Z

CC @galipremsagar

wence-

I realise that tracking down all uses of random sampling in the test suite is a big thing, and providing a default fixed seed everywhere is a pragmatic choice to get deterministic tests, but I think I don't want to break API compatibility with pandas for the two sample calls.

wence- · 2023-09-13T07:48:31Z

python/cudf/cudf/core/groupby/groupby.py

@@ -950,7 +950,7 @@ def sample(
        frac: Optional[float] = None,
        replace: bool = False,
        weights: Union[abc.Sequence, "cudf.Series", None] = None,
-        random_state: Union[np.random.RandomState, int, None] = None,
+        random_state: Union[np.random.RandomState, int, None] = 1,


issue: I am not sure I like this change. It means that user code that previously drew a sequence of independent samples from groupby objects now gets the same result for every sample.
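The concern can be reproduced with a small sketch. Both samplers below are hypothetical and use the standard random module; they are not the cuDF groupby implementation, and only the random_state default differs between them.

```python
import random


def sample_fixed_default(values, n, random_state=1):
    """Hypothetical sampler with a fixed default seed."""
    return random.Random(random_state).sample(values, n)


def sample_none_default(values, n, random_state=None):
    """Hypothetical sampler that draws fresh entropy when no seed is given."""
    return random.Random(random_state).sample(values, n)


values = list(range(1000))

# Every call restarts from seed 1, so the "independent" draws are identical.
fixed_draws = [sample_fixed_default(values, 5) for _ in range(3)]

# With None, each call is seeded from the OS or the clock and the draws vary.
fresh_draws = [sample_none_default(values, 5) for _ in range(3)]
```

Callers who want reproducibility can still get it from the second sampler by passing random_state explicitly.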

wence- · 2023-09-13T07:51:07Z

python/cudf/cudf/core/indexed_frame.py

@@ -3346,7 +3346,7 @@ def sample(
        frac=None,
        replace=False,
        weights=None,
-        random_state=None,
+        random_state=1,


issue: Similarly here, I don't think we should set a specific seed as the default argument to sample. It also makes the default API differ from pandas, which defaults to None (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
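One possible way to keep the pandas-style default while still getting deterministic tests is for the tests to pass a seed themselves. The sketch below assumes a hypothetical sample function and a hypothetical TEST_SEED constant; the signature check is only an illustration of how a default could be guarded, not something this PR does.

```python
import inspect
import random


def sample(values, n, random_state=None):
    """Hypothetical sampler that keeps None as the default."""
    return random.Random(random_state).sample(values, n)


def default_of(func, name):
    """Return the declared default of one parameter of func."""
    return inspect.signature(func).parameters[name].default


# Hypothetical constant owned by the test suite, not by the library API.
TEST_SEED = 42

# The test pins the seed at the call site instead of relying on the default.
a = sample(list(range(100)), 3, random_state=TEST_SEED)
b = sample(list(range(100)), 3, random_state=TEST_SEED)
```

A check such as default_of(sample, "random_state") is None could then catch an accidental change of the public default.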

wence- · 2023-09-13T07:53:27Z

python/cudf/cudf/datasets.py

@@ -80,7 +80,7 @@ def timeseries(
    return gdf


-def randomdata(nrows=10, dtypes=None, seed=None):
+def randomdata(nrows=10, dtypes=None, seed=1):


note (non-blocking): I am on the fence about these defaults. I suppose it is OK. Perhaps better would be to flip this to a no-default keyword-only argument, forcing the caller to specify a seed:

Suggested change

-def randomdata(nrows=10, dtypes=None, seed=1):
+def randomdata(nrows=10, dtypes=None, *, seed):
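The behaviour of a keyword-only argument without a default can be sketched like this. randomdata_sketch is a hypothetical stand-in that copies only the signature shape from the suggestion; its body is an assumption and does not reflect cudf.datasets.randomdata.

```python
import random


def randomdata_sketch(nrows=10, *, seed):
    """Hypothetical stand-in; `seed` is keyword-only and has no default."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(nrows)]


# Omitting the seed is rejected at call time with a TypeError.
try:
    randomdata_sketch(3)
    missing_seed_error = None
except TypeError as exc:
    missing_seed_error = exc

# Callers must spell the seed out, which makes the choice visible in the test.
rows = randomdata_sketch(3, seed=0)
```

The trade-off is that every existing call site has to be updated, whereas a fixed default changes behaviour silently.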

vuule added 7 commits September 8, 2023 10:40

deterministic default seed in randomdata

c1689dc

deterministic set_random_null_mask_inplace

aa8d4c8

deterministic rand_dataframe

52f439a

deterministic timeseries

10e9e41

style

a87641f

deterministic sample

515ceac

Merge branch 'branch-23.10' of https://github.com/rapidsai/cudf into bug-deterministic-tests

614f911

vuule added the tests (Unit testing for project), tech debt, and non-breaking (Non-breaking change) labels Sep 8, 2023

vuule self-assigned this Sep 8, 2023

github-actions bot added the Python (Affects Python cuDF API) label Sep 8, 2023

vuule added the improvement (Improvement / enhancement to an existing function) label and removed the Python (Affects Python cuDF API) label Sep 8, 2023

allow None seed in get_dataframe

6bcd0e8

github-actions bot added the Python (Affects Python cuDF API) label Sep 8, 2023

vuule added 3 commits September 8, 2023 15:20

make sure masks are unique in test_binary_ufunc_series_array

6e35758

Merge branch 'branch-23.10' into bug-deterministic-tests

29ccfbb

make sure masks are unique in a few more spots

fbd023d

vuule commented Sep 11, 2023

vuule marked this pull request as ready for review September 11, 2023 18:21

vuule requested review from a team as code owners September 11, 2023 18:21

vuule requested review from shwina and mroeschke September 11, 2023 18:21

galipremsagar self-requested a review September 11, 2023 19:20

wence- requested changes Sep 13, 2023

vuule changed the base branch from branch-23.10 to branch-23.12 September 22, 2023 16:22

vuule added the 0 - Waiting on Author (Waiting for author to respond to review) label Sep 22, 2023

vyasr removed the tech debt label Feb 23, 2024

vuule commented Sep 8, 2023 (edited by vyasr)