It's taking more than 20h to sample the data #14

mhnbitece · 2020-10-18T09:07:05Z

Hi Nick,

I am seeing a huge runtime for my input data, which is 28K * 59.
It has been running for more than a day.
I have even standardized the input data.
Is there any possible solution?

dist_matrix: 5%|4 | 276/5671 [50:48<16:50:38, 11.24s/it]
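
For reference, the remaining-time figure in that progress bar can be reproduced with simple arithmetic: (5671 - 276) iterations left at the reported 11.24 s each. A minimal sketch follows; the helper names are illustrative and not part of smogn or tqdm.

```python
def remaining_seconds(done, total, sec_per_it):
    """Estimate the remaining time from iterations left and the current rate."""
    return (total - done) * sec_per_it


def fmt_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    seconds = int(round(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Numbers taken from the progress bar above: 276/5671 done at 11.24 s/it.
eta = remaining_seconds(done=276, total=5671, sec_per_it=11.24)
print(fmt_hms(eta))  # 16:50:40
```

The result is within a couple of seconds of the 16:50:38 shown in the bar; the small gap is consistent with the displayed rate being rounded to two decimals.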

maxiuw · 2020-11-25T14:03:28Z

same issue here

snigdhasen · 2021-04-29T13:30:10Z

Same issue here. It's been more than a day for data of size 5 lakh (500,000) * 32.

luna57-lr · 2021-11-16T06:32:24Z

same issue!

naeemmrz · 2022-01-22T18:46:04Z

Me too, it's extremely slow on relatively large datasets. A CUDA implementation and/or an `n_jobs` option would be great.
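
To illustrate the kind of speedup being asked for, here is a sketch of a different but related technique: computing a pairwise Euclidean distance matrix with vectorized NumPy instead of Python loops. This is not smogn's code, and it is neither CUDA nor `n_jobs`. It assumes purely numeric features and plain Euclidean distance, which may not match what the library computes for mixed data. The function name and the random test data are made up for the example.

```python
import numpy as np


def euclidean_dist_matrix(X):
    """Pairwise Euclidean distances between the rows of X, without Python loops."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.einsum("ij,ij->i", X, X)  # squared norm of each row
    # Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from rounding
    np.fill_diagonal(sq_dists, 0.0)          # distance of a row to itself
    return np.sqrt(sq_dists)


# Synthetic stand-in data: 500 rows with 59 numeric columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 59))
D = euclidean_dist_matrix(X)
print(D.shape)
```

Note that the full matrix needs memory proportional to the square of the row count, so for very large inputs it would have to be computed in chunks.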

MouadEt-tali · 2023-10-14T16:45:56Z

I think I have a potential solution for this problem, and it MIGHT work for you:

My problem was that I was using the default settings without specifying anything.

Here is my previous code, which was extremely slow:
`dataframe_oversampled = smogn.smoter(data=dataframe, y='TARGET_VARIABLE')`

However, the moment I started tinkering with the parameters, it somehow got 15 times faster: code that used to take me 6 hours took only 30 minutes!

Here is how I changed my code. I hope similar tinkering will help you too.

PS: In my project I wrote a special function to handle all missing data because I had special cases, so the drop_na_col and drop_na_row settings in these parameters are just for good measure.
```python
import smogn

# Apply SMOGN to balance the dataset.
# `dataframe` and `rg_mtrx` (the manual relevance control points) must be defined beforehand.
dataframe_oversampled = smogn.smoter(
    data=dataframe,
    y='TARGET_VARIABLE',
    k=9,                    ## positive integer (k < n)
    pert=0.04,              ## real number (0 < R < 1)
    samp_method='balance',  ## string ('balance' or 'extreme')
    drop_na_col=True,       ## boolean (True or False)
    drop_na_row=True,       ## boolean (True or False)
    replace=False,          ## boolean (True or False)

    ## phi relevance arguments
    rel_thres=0.10,         ## real number (0 < R < 1)
    rel_method='manual',    ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
    # rel_coef = 1.50,        ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg=rg_mtrx ## 2d array (format: [x, y])
)
```
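
The call above passes `rg_mtrx` without showing how it is built. As a sketch only, here is what a control-point matrix in the `[x, y]` format named in the parameter comment might look like. The values are hypothetical and must be replaced with ones that fit your own target variable. The `check_ctrl_pts` helper is illustrative and not part of smogn, and the rules it checks (relevance within 0 to 1, strictly increasing x) are my assumptions about what a sensible matrix looks like.

```python
# Hypothetical control points: each row is [target value, relevance in 0..1].
rg_mtrx = [
    [10.0, 0.0],  # common target values: low relevance
    [50.0, 0.0],
    [90.0, 1.0],  # rare, high target values: high relevance
]


def check_ctrl_pts(points):
    """Illustrative sanity checks on an [x, y] control-point matrix."""
    previous_x = None
    for row in points:
        if len(row) != 2:
            raise ValueError("each row must have the form [x, y]")
        x, y = row
        if not 0.0 <= y <= 1.0:
            raise ValueError("relevance y must lie in [0, 1]")
        if previous_x is not None and x <= previous_x:
            raise ValueError("x values must strictly increase")
        previous_x = x
    return True


print(check_ctrl_pts(rg_mtrx))  # True
```

Since the control points decide which target values count as rare, they plausibly affect how many rows get over-sampled, which may be part of why the runtime changed so much.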
