Replace spam by imdb text data #76

Open
rhiever opened this issue Oct 30, 2016 · 3 comments

@rhiever
Contributor

rhiever commented Oct 30, 2016

Per the TODO file. Maybe @amueller can elaborate on this issue.

@amueller
Owner

Well, the IMDb text data is bigger, and it's what is used in the book. It does take a while to process, though. We could use a subsample, maybe?

@rhiever
Contributor Author

rhiever commented Oct 31, 2016

What's "a while"? Minutes, hours, days? :-)

Either way, yes---using a subsample is probably the way to go.
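
A minimal sketch of what drawing such a subsample could look like, using only the standard library. The helper name `shuffled_subsample`, the toy data, and the sample size are assumptions for illustration, not code from the tutorial. Sampling at random rather than taking the first N items matters if the reviews are stored grouped by label, since a plain slice could then contain only one class.

```python
import random


def shuffled_subsample(texts, labels, n_samples, seed=0):
    """Return a random subsample of texts and their labels (hypothetical helper).

    Indices are drawn without replacement, so the result is shuffled and
    every text stays paired with its own label.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    indices = rng.sample(range(len(texts)), k=min(n_samples, len(texts)))
    return [texts[i] for i in indices], [labels[i] for i in indices]


# Toy stand-in for the real reviews: 1000 documents with alternating labels.
texts = [f"review {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

small_texts, small_labels = shuffled_subsample(texts, labels, n_samples=100)
```

The same seed always yields the same subsample, which helps keep tutorial results comparable across runs.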

@rasbt
Collaborator

rasbt commented Nov 1, 2016

> Either way, yes---using a subsample is probably the way to go.

I agree. I think one idea was to motivate why we sometimes need to opt for a hashing vectorizer and/or an out-of-core learning algorithm when the data doesn't fit into memory. However, having a smaller subsample would be fine (after shuffling).

Coincidentally, I've used the dataset in my book as well :P And yeah, people were complaining that it takes too long (~5-10 minutes), and when they chose a subsample, the performance was really bad. In other words, people sometimes want the best of both worlds ... However, for the tutorial, I agree that a subsample would be necessary to stay on schedule ;)
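
To make the hashing and out-of-core idea concrete, here is a rough pure-Python sketch, not the tutorial's code. It hashes tokens into a fixed number of columns, so no vocabulary has to be held in memory, and it updates a simple perceptron one mini-batch at a time. All names (`hash_features`, `stream_batches`, `OnlinePerceptron`), the feature-space size, and the toy reviews are assumptions for illustration. In practice one would more likely reach for scikit-learn's `HashingVectorizer` together with an estimator that supports `partial_fit`.

```python
import zlib

N_FEATURES = 2 ** 18  # fixed size of the hashed feature space (assumed value)


def hash_features(text, n_features=N_FEATURES):
    """Map a document to {column index: count} without storing a vocabulary."""
    counts = {}
    for token in text.lower().split():
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] = counts.get(index, 0) + 1
    return counts


def stream_batches(documents, batch_size):
    """Yield lists of (text, label) pairs so only one batch is held at a time."""
    batch = []
    for item in documents:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


class OnlinePerceptron:
    """Linear classifier for 0/1 labels, updated one mini-batch at a time."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def predict(self, features):
        score = self.bias + sum(
            self.weights.get(i, 0.0) * v for i, v in features.items()
        )
        return 1 if score > 0 else 0

    def partial_fit(self, batch):
        for text, label in batch:
            features = hash_features(text)
            error = label - self.predict(features)  # -1, 0, or +1
            if error:
                for i, v in features.items():
                    self.weights[i] = self.weights.get(i, 0.0) + error * v
                self.bias += error


# Toy stand-in for a stream of reviews read from disk (1 = positive).
toy_reviews = [
    ("a wonderful and moving film", 1),
    ("terrible plot and awful acting", 0),
    ("wonderful acting", 1),
    ("awful film", 0),
]

model = OnlinePerceptron()
for _ in range(5):  # a few passes over the stream
    for batch in stream_batches(iter(toy_reviews), batch_size=2):
        model.partial_fit(batch)
```

Because the column index comes straight from the hash, memory use is bounded by `N_FEATURES` no matter how many distinct words appear. The cost is that distinct tokens can occasionally collide in the same column.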

3 participants