Data Split

The Data Split module splits data into train, test, and/or validate sets of arbitrary sizes. Splitting is done by sampling.

Use

Data Split supports local (same as homogeneous) and heterogeneous (only Guest has y) modes.

The supported split modes and scenarios are listed below.

Split Mode	Federated Heterogeneous	Federated Homogeneous (Local)
Random	✓	✓
Stratified	✓	✓
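
To make the difference between the two split modes concrete, here is a minimal single-machine sketch using only the Python standard library. It is a generic illustration of random versus stratified sampling, not the module's actual implementation. The function names `random_split` and `stratified_split`, the `train_frac` and `seed` parameters, and the rounding rule are all assumptions made for this example.

```python
import random
from collections import defaultdict


def random_split(ids, train_frac, seed=0):
    """Shuffle all ids together and cut once (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_frac))
    return shuffled[:cut], shuffled[cut:]


def stratified_split(ids, labels, train_frac, seed=0):
    """Shuffle and cut within each label group (hypothetical helper).

    Each label keeps roughly the same share in both output sets.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample_id, y in zip(ids, labels):
        by_label[y].append(sample_id)

    train, rest = [], []
    for y in sorted(by_label):  # sorted for a reproducible group order
        group = by_label[y]
        rng.shuffle(group)
        cut = int(round(len(group) * train_frac))
        train.extend(group[:cut])
        rest.extend(group[cut:])
    return train, rest


# 100 samples: 80 with label 0, 20 with label 1
ids = list(range(100))
labels = [0] * 80 + [1] * 20
strat_train, strat_rest = stratified_split(ids, labels, train_frac=0.8)
rand_train, rand_rest = random_split(ids, train_frac=0.8)
```

With the stratified variant the 80/20 label ratio is preserved in both outputs, while the random variant only preserves it on average.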

The Data Split module takes a single data input, as specified in the job config file, and always outputs three tables (train, test, and validate data sets). Each data output may be used as input to another module. The rules for set sizes are:

if all three set sizes are None, the input data is split 80% to the train set and 20% to the validate set, with an empty test set;
if only the test size or the validate size is given, the train size is set to the complement of the given size;
only one of the three sizes is needed to split the input data, but all three may be specified.

The module accepts either int (instance count) or float (fraction) values for set sizes; mixed-type inputs are not accepted.
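
The size rules can be sketched as a small resolver. This is an illustration under stated assumptions, not the module's code: the function name `resolve_sizes` and its parameters are hypothetical, and the handling of cases the rules do not spell out (a missing test or validate size is treated as empty, and a missing train size is the complement of whatever else is given) is an assumption of this sketch.

```python
def resolve_sizes(n_rows, train_size=None, test_size=None, validate_size=None):
    """Return (train, test, validate) sizes following the rules above.

    Hypothetical helper. Sizes are either all int (instance counts)
    or all float (fractions of the input).
    """
    given = [s for s in (train_size, test_size, validate_size) if s is not None]

    # Rule 1: nothing specified -> 80% train, empty test, 20% validate.
    if not given:
        return 0.8, 0.0, 0.2

    # Mixed int/float inputs are rejected.
    kinds = {type(s) for s in given}
    if len(kinds) != 1 or not kinds <= {int, float}:
        raise TypeError("set sizes must be all int or all float")

    is_count = kinds == {int}
    total = n_rows if is_count else 1.0
    zero = 0 if is_count else 0.0

    # ASSUMPTION: an unspecified test or validate size means an empty set.
    test = zero if test_size is None else test_size
    validate = zero if validate_size is None else validate_size

    # Rule 2: train size defaults to the complement of the given sizes.
    train = total - test - validate if train_size is None else train_size

    if train < 0 or train + test + validate > total:
        raise ValueError("set sizes exceed the size of the input data")
    return train, test, validate


default_sizes = resolve_sizes(100)
count_sizes = resolve_sizes(100, test_size=20)
fraction_sizes = resolve_sizes(100, validate_size=0.25)
```

Here `count_sizes` resolves an instance count of 20 for the test set against 100 input rows, and `fraction_sizes` resolves a validate fraction of 0.25; in both cases the train set receives the remainder.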
