Chapter 2: Creating a test set, Stratify #689

hady42 · 2024-06-06T11:14:00Z

I would appreciate clarification on a few points regarding Chapter 2.

Why do we need to introduce the random seed? If it is only there to keep the train/test sets consistent across multiple runs, why do we need multiple runs at all?
If using the hash function keeps the test set consistent, can new instances still end up in the test set whenever the hash of their id satisfies the condition crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32?
What is the point of using stratified sampling in the first place?
Why can't we just use the normal train_test_split method instead of StratifiedShuffleSplit?
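
For context, here is a minimal sketch of the hash check from my second question, as I understand it. The function name `is_id_in_test_set` and the id ranges are my own assumptions for illustration; only the condition itself is taken from the chapter.

```python
from zlib import crc32

import numpy as np


def is_id_in_test_set(identifier, test_ratio):
    # Hash the id as a 64-bit integer; the mask keeps the checksum
    # in the unsigned 32-bit range before comparing to the threshold.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32


# Hypothetical ids: an "old" batch and a later batch of new instances.
old_ids = range(1000)
new_ids = range(1000, 1200)

old_in_test = [is_id_in_test_set(i, 0.2) for i in old_ids]
new_in_test = [is_id_in_test_set(i, 0.2) for i in new_ids]

# The decision depends only on the id, so recomputing it later
# gives the same assignment for the old instances.
assert old_in_test == [is_id_in_test_set(i, 0.2) for i in old_ids]
```

If I read this correctly, each id is assigned independently of the others, so adding the new batch should not move any old instance between the train and test sets.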

Thank you for your kindness and your time.
