It's taking more than 20h to sample the data #14

Open
mhnbitece opened this issue Oct 18, 2020 · 5 comments

Comments

@mhnbitece

mhnbitece commented Oct 18, 2020

Hi Nick,

I am seeing a huge runtime for my input data, which is 28K × 59.
It has been running for more than a day.
I have even standardized the input data.
Is there any possible solution?

dist_matrix: 5%|4 | 276/5671 [50:48<16:50:38, 11.24s/it]
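
As a quick sanity check on the progress line above, the remaining-time estimate can be reproduced from the other numbers in the bar. This is only a sketch of the arithmetic; the figures are copied from the bar, the helper name `hms` is made up for illustration, and the result covers the `dist_matrix` stage only, not the whole run.

```python
# Figures copied from the progress line above.
done, total = 276, 5671        # iterations finished / iterations in total
sec_per_iter = 11.24           # reported seconds per iteration
elapsed_sec = 50 * 60 + 48     # elapsed time shown in the bar (50:48)

# Remaining work at the current rate, and the implied total for this stage.
remaining_sec = (total - done) * sec_per_iter
stage_total_sec = remaining_sec + elapsed_sec


def hms(seconds):
    """Format a duration in seconds as H:MM:SS (hypothetical helper)."""
    seconds = int(seconds)
    return f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


print("remaining:", hms(remaining_sec))
print("stage total:", hms(stage_total_sec))
```

At the reported rate this stage alone would need well over 17 hours, which is consistent with the run lasting more than a day once the other stages are included.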

@maxiuw

maxiuw commented Nov 25, 2020

same issue here

@snigdhasen

Same issue here. It has been more than a day for data of 5 lakh (500,000) × 32.

@luna57-lr

same issue!

@naeemmrz

Me too, it's extremely slow on relatively large datasets. A CUDA implementation and/or an n_jobs option would be great.

@MouadEt-tali

I think I have a potential solution for this problem, and it MIGHT work for you:

My problem was that I was using the default settings without specifying anything.

Here is my previous code, which was extremely slow:
dataframe_oversampled = smogn.smoter(data=dataframe, y='TARGET_VARIABLE')

However, the moment I started tinkering with the parameters, it somehow got 15 times faster: code that used to take me 6 hours took only 30 minutes!

Here is how I changed my code. I hope similar tinkering will help you too.

PS: In my project I wrote a special function to handle all missing data because I had special cases, so the drop_na_col and drop_na_row parameters here are just for good measure.
```python
import smogn

# dataframe and rg_mtrx are defined earlier (not shown in this comment)

# Apply SMOGN to balance the dataset
dataframe_oversampled = smogn.smoter(
    data=dataframe,
    y='TARGET_VARIABLE',
    k=9,                    ## positive integer (k < n)
    pert=0.04,              ## real number (0 < R < 1)
    samp_method='balance',  ## string ('balance' or 'extreme')
    drop_na_col=True,       ## boolean (True or False)
    drop_na_row=True,       ## boolean (True or False)
    replace=False,          ## boolean (True or False)

    ## phi relevance arguments
    rel_thres=0.10,         ## real number (0 < R < 1)
    rel_method='manual',    ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
    # rel_coef = 1.50,        ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg=rg_mtrx ## 2d array (format: [x, y])
)
```
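
The call above passes `rg_mtrx` without showing how it was built. Below is a minimal sketch of what such a matrix might look like, assuming the `[x, y]` format named in the code comment, where `x` is a value of the target variable and `y` is its relevance between 0 and 1. The numbers are placeholders, not the poster's actual settings, and `check_ctrl_pts` is a hypothetical helper written for this example, not part of smogn.

```python
# Hypothetical control points: the thread never shows rg_mtrx, so these
# values are placeholders only. Each row is [x, y]: a target value x and
# the relevance y (0 = common, 1 = rare) assigned to it.
rg_mtrx = [
    [10.0, 0],  # low target values treated as common
    [50.0, 0],  # mid-range target values treated as common
    [90.0, 1],  # high target values treated as rare
]


def check_ctrl_pts(points):
    """Illustrative sanity check for [x, y] control points (not part of smogn)."""
    for row in points:
        if len(row) != 2:
            raise ValueError("each control point must have the form [x, y]")
        if not 0 <= row[1] <= 1:
            raise ValueError("relevance y must lie between 0 and 1")
    xs = [row[0] for row in points]
    if xs != sorted(xs):
        raise ValueError("control points should be ordered by target value x")
    return True


check_ctrl_pts(rg_mtrx)
```

As far as the parameter comments suggest, target values whose relevance exceeds `rel_thres` are the ones treated as rare, so the choice of control points plausibly changes how many rows are resampled and therefore how long the run takes. That link to the reported speed-up is an assumption, not something the thread confirms.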

6 participants