Parallelize Pretraining #8842
-
I'd love to have this, but it's non-trivial. I'm not even sure what the best algorithm would be -- I guess asynchronous SGD?
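For reference, a minimal, purely illustrative sketch of what asynchronous (Hogwild-style) SGD looks like: several processes update one shared parameter vector without locking. The toy regression problem, worker count, and learning rate are all made up for the example; none of this is spaCy/Thinc code.

```python
# Toy Hogwild-style asynchronous SGD: workers update a shared parameter
# vector in place, without any locking, on a synthetic regression problem.
import multiprocessing as mp
import numpy as np

N_FEATURES = 10
N_WORKERS = 4
STEPS_PER_WORKER = 2000
LR = 0.01


def worker(shared_w, seed):
    rng = np.random.default_rng(seed)
    w = np.frombuffer(shared_w)               # writable view on shared memory
    true_w = np.arange(N_FEATURES, dtype="d")
    for _ in range(STEPS_PER_WORKER):
        x = rng.normal(size=N_FEATURES)
        y = x @ true_w                        # synthetic target
        grad = (w @ x - y) * x                # grad of 0.5 * (w.x - y)^2
        w -= LR * grad                        # possibly racing with other workers


if __name__ == "__main__":
    shared_w = mp.RawArray("d", N_FEATURES)   # lock-free shared parameters
    procs = [mp.Process(target=worker, args=(shared_w, i))
             for i in range(N_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(np.frombuffer(shared_w))            # should approach [0, 1, ..., 9]
```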
-
I've been reading about asynchronous SGD: the main issue is delayed (stale) gradients. Several alternatives have been proposed to deal with this, e.g. https://arxiv.org/abs/1609.08326, https://arxiv.org/abs/1601.04033, or http://aclweb.org/anthology/D18-1332, among others, each with its own trade-offs. Whatever method best suits Thinc's architecture, I definitely think it would be great to have parallelization, and maybe also for other routines like training NER, Textcat, etc. That could be very helpful for scaling up experiments on clusters, cloud providers, etc.
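For what it's worth, here's a toy, single-process illustration of the delayed-gradient problem and of one common mitigation from the staleness-aware ASGD line of work: scale the learning rate down by how stale each gradient is when it's applied. The fixed delay and the toy regression setup are invented for the example and aren't taken from any of the papers above.

```python
# Simulate delayed gradients: each gradient is computed against the current
# weights but only applied a few steps later, with the learning rate scaled
# down by the gradient's staleness at the time it is finally applied.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 1.0])
w = np.zeros(3)
base_lr = 0.05
delay = 4
inbox = []                                   # (step_sent, gradient)

for step in range(5000):
    x = rng.normal(size=3)
    y = x @ true_w
    inbox.append((step, (w @ x - y) * x))    # gradient against *current* w

    if len(inbox) > delay:                   # gradient "arrives" late
        sent_at, stale_grad = inbox.pop(0)
        staleness = step - sent_at + 1
        w -= (base_lr / staleness) * stale_grad   # staleness-aware step size

print(w)                                     # should end up near [2, -3, 1]
```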
-
Thanks for the links! I think the most relevant literature is probably the papers on really massive multi-GPU training. For multi-CPU training we'll have a fair amount of latency from inter-process communication, so techniques designed to work despite network latency would be good. This is a new library with interesting results: https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html . Another thing to look at is what Chainer is doing; they've been publishing pretty good benchmarks for large-scale training.
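One latency-tolerant option in that spirit (just a sketch, not tied to any spaCy/Thinc API) is periodic parameter averaging, sometimes called local SGD: each process trains its own replica for a number of steps, and the replicas are only synchronised occasionally, so inter-process communication stays cheap.

```python
# Toy local SGD / periodic parameter averaging with a process pool: each
# worker trains its own copy of the weights for LOCAL_STEPS, and the copies
# are averaged once per round. The map/average per round is the only
# communication between processes.
import multiprocessing as mp
import numpy as np

N_WORKERS = 4
ROUNDS = 20
LOCAL_STEPS = 50
LR = 0.05
TRUE_W = np.array([1.0, -2.0, 0.5, 3.0])


def local_train(args):
    w, seed = args
    w = w.copy()
    rng = np.random.default_rng(seed)
    for _ in range(LOCAL_STEPS):
        x = rng.normal(size=TRUE_W.size)
        y = x @ TRUE_W                        # synthetic regression target
        w -= LR * (w @ x - y) * x
    return w


if __name__ == "__main__":
    w = np.zeros(TRUE_W.size)
    with mp.Pool(N_WORKERS) as pool:
        for r in range(ROUNDS):
            jobs = [(w, r * N_WORKERS + i) for i in range(N_WORKERS)]
            replicas = pool.map(local_train, jobs)
            w = np.mean(replicas, axis=0)     # the only synchronisation point
    print(w)                                  # should approach TRUE_W
```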
-
Hello, great feature! I'm currently running some experiments on my specific use cases.
However, I've noticed that pretraining is considerably slow: one epoch took almost two days on a 1B-word corpus, at an average of 4800 w/s.
When I checked the task's resource usage, it was only using one core at a time, even though I'm training on a single machine with dual 12-core Xeon CPUs (24 cores total) and no GPU.
Would it be possible to add an option for the desired number of workers on this task? Then we could use all available cores to parallelize the pretraining and speed up processing.
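Purely as an illustration of the kind of multi-worker setup I mean (none of these functions are existing spaCy/Thinc APIs): shard the corpus, run one pretraining process per shard, then combine the per-shard weights afterwards, e.g. by averaging.

```python
# Hypothetical sketch: split a one-text-per-line corpus into shards and fan
# the shards out over a pool of worker processes. `pretrain_on_shard` is a
# stand-in for the real per-shard pretraining loop; here it only counts
# tokens so the script runs end to end. "corpus.txt" is a placeholder path.
import multiprocessing as mp
from pathlib import Path


def shard_corpus(path, n_shards, out_dir):
    """Write the corpus into n_shards round-robin shard files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = [out_dir / f"shard{i}.txt" for i in range(n_shards)]
    outs = [p.open("w", encoding="utf8") for p in shard_paths]
    with open(path, encoding="utf8") as f:
        for i, line in enumerate(f):
            outs[i % n_shards].write(line)
    for out in outs:
        out.close()
    return shard_paths


def pretrain_on_shard(shard_path):
    """Stand-in for running the single-core pretraining loop on one shard."""
    with open(shard_path, encoding="utf8") as f:
        return sum(len(line.split()) for line in f)


if __name__ == "__main__":
    shards = shard_corpus("corpus.txt", n_shards=24, out_dir="shards")
    with mp.Pool(24) as pool:
        results = pool.map(pretrain_on_shard, shards)
    # In a real setup, the per-shard weights would be combined here.
    print(sum(results), "tokens processed across", len(results), "workers")
```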
Best regards