Replace spam by imdb text data #76

rhiever · 2016-10-30T17:31:43Z

Per the TODO file. Maybe @amueller can elaborate on this issue.

amueller · 2016-10-31T17:37:00Z

Well, the IMDb text data is bigger and is what is used in the book. It does take a while to process, though. We could use a subsample, maybe?

rhiever · 2016-10-31T20:48:14Z

What's "a while"? Minutes, hours, days? :-)

Either way, yes---using a subsample is probably the way to go.

rasbt · 2016-11-01T01:06:20Z

> Either way, yes---using a subsample is probably the way to go.

I agree. I think one idea was to motivate why we sometimes need to opt for a hashing vectorizer and/or an out-of-core learning algorithm when the data doesn't fit into memory. However, having a smaller subsample would be fine (after shuffling).

Coincidentally, I've used the dataset in my book as well :P And yeah, people were complaining that it takes too long (~5-10 minutes), and when they chose a subsample, the performance was really bad -- or in other words, people want the best of both worlds sometimes ... However, for the tutorial, I agree that having a subsample would be really necessary to keep on schedule ;)
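
To make the two ideas in this thread concrete, here is a rough, standard-library-only sketch of (a) drawing a subsample after shuffling and (b) the hashing trick that a hashing vectorizer relies on. The helper names `shuffled_subsample` and `hash_features`, the toy reviews, and the bucket count are all made up for illustration; this is not the tutorial's code, and in practice scikit-learn's `HashingVectorizer` would take the place of the simplified `hash_features` below.

```python
import random
import zlib
from collections import Counter


def shuffled_subsample(docs, labels, n, seed=0):
    """Shuffle (doc, label) pairs with a fixed seed and keep the first n.

    Shuffling first matters when the data is stored sorted by class.
    """
    pairs = list(zip(docs, labels))
    random.Random(seed).shuffle(pairs)
    return pairs[:n]


def hash_features(text, n_features=2 ** 10):
    """Simplified hashing trick: map each token to one of n_features buckets.

    No vocabulary is stored, so memory use does not grow with the corpus.
    Different tokens may collide in the same bucket.
    """
    counts = Counter()
    for token in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] += 1
    return dict(counts)


# Hypothetical toy reviews standing in for the IMDb data
docs = [
    "great movie",
    "terrible plot",
    "great acting great cast",
    "boring and terrible",
]
labels = [1, 0, 1, 0]

sample = shuffled_subsample(docs, labels, n=2, seed=42)
features = [hash_features(doc) for doc, _ in sample]
```

Because the feature dimension is fixed up front, documents can be processed in small batches, which is what makes this approach a natural fit for out-of-core learning.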
