Gensim is an NLP library which claims to be highly efficient during training, with near-linear performance scaling as the number of threads increases.
Currently, that is not true on machines with a large number of cores (>10) and large data files. The reason is that almost all Gensim models which support multithreaded training work in the following way: there is a single job producer, a worker which reads the data and pushes chunks into the job queue, and there are many job consumers, workers which pull chunks and update the model parameters in parallel.
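The current architecture can be sketched roughly as follows. This is an illustrative toy, not Gensim's actual training code: the function and variable names are made up, and the "model update" is stubbed out with a list append under a lock.

```python
import queue
import threading

def run_training(lines, num_workers=4, chunk_size=2):
    """Toy version of Gensim's training loop: ONE producer, MANY consumers."""
    job_queue = queue.Queue(maxsize=8)  # bounded queue of job chunks
    processed = []                      # stands in for the shared model
    lock = threading.Lock()

    def producer():
        # The single producer reads lines, batches them into chunks,
        # and pushes the chunks into the job queue.
        chunk = []
        for line in lines:
            chunk.append(line.split())
            if len(chunk) == chunk_size:
                job_queue.put(chunk)
                chunk = []
        if chunk:
            job_queue.put(chunk)
        for _ in range(num_workers):
            job_queue.put(None)  # one end-of-data sentinel per consumer

    def consumer():
        # Each consumer pulls chunks and "updates the model" in parallel.
        while True:
            job = job_queue.get()
            if job is None:
                break
            with lock:  # placeholder for the parameter-update step
                processed.extend(job)

    workers = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for w in workers:
        w.start()
    producer()  # the producer runs in the calling thread here
    for w in workers:
        w.join()
    return processed

sentences = ["a b c", "d e", "f g h i"]
result = run_training(sentences)
```

With optimized consumers, the single `producer` loop above becomes the bottleneck: the queue runs dry and the worker threads sit idle waiting on `job_queue.get()`.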
The problem is that the consumers' code is well optimized, which leads to a worker starvation problem: the job producer simply cannot fill the queue at a high enough pace. This is the case even with the fastest possible corpus iterator, one that just reads a line, splits it, and yields the result.
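Such an iterator is about as cheap as streaming input can get, a minimal version of what Gensim's `LineSentence` does (the demo file and function name below are illustrative):

```python
import os
import tempfile

def iter_corpus(path):
    """Simplest possible corpus iterator: read a line, split it, yield it."""
    with open(path, encoding="utf8") as fin:
        for line in fin:
            yield line.split()

# Tiny demonstration on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello world\nfoo bar baz\n")
    path = f.name
sentences = list(iter_corpus(path))
os.remove(path)
```

Even this bare loop, running in a single producer thread, cannot keep ten or more optimized consumers fed.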
This problem could be solved by allowing users to pass K data streams (currently only a single stream, i.e. a single job producer thread, is supported), e.g. pointing to K large files, and to use K job producers to fill the job queue.
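A minimal sketch of that proposed multistream setup might look like this. Again, the names and structure are assumptions for illustration, not an existing Gensim API: K producer threads, one per stream, share a single job queue drained by a pool of consumers.

```python
import queue
import threading

def run_multistream(streams, num_workers=4):
    """Toy multistream training loop: K producers, MANY consumers."""
    job_queue = queue.Queue(maxsize=64)
    processed = []              # stands in for the shared model
    lock = threading.Lock()

    def producer(stream):
        # One producer per data stream; all feed the same queue.
        for line in stream:
            job_queue.put(line.split())

    def consumer():
        while True:
            job = job_queue.get()
            if job is None:
                break
            with lock:  # placeholder for the parameter-update step
                processed.append(job)

    producers = [threading.Thread(target=producer, args=(s,)) for s in streams]
    consumers = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()                 # wait until every stream is exhausted
    for _ in range(num_workers):
        job_queue.put(None)      # then send one sentinel per consumer
    for t in consumers:
        t.join()
    return processed

# Three "streams", e.g. three large files split ahead of time.
streams = [["a b", "c d"], ["e f"], ["g h i"]]
result = run_multistream(streams)
```

With K producers the aggregate read-and-split throughput scales with K, so the queue stays full and the consumers no longer starve.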