Gensim: Benchmark ML frameworks on different hardware platforms

Created on 15 Jun 2017 · 8 Comments · Source: RaRe-Technologies/gensim

We are very interested in a robust large-scale benchmark of the ML landscape, especially with regard to hardware, costs and implementation quality.

__Short description:__ compare a neural network algorithm (perhaps w2v / d2v) implementation across popular frameworks, on different cloud platforms, different hardware setups (CPU/GPU), measuring various metrics such as training quality, speed, memory footprint, ease of use and relative $$$ costs.

__Questions we want to answer:__

  • Which hardware provider is the best for deep learning? Cost/performance ratio?
  • What are the fastest frameworks and implementations? Most memory-efficient? Most accurate?
  • Which HW platforms and ML frameworks are the most user-friendly? Easy to set up, launch and debug?

Plan:

  1. Choose a model from gensim that we will compare (probably w2v, as this is the most popular model)
  2. Take the same model from other popular frameworks: Tensorflow, DeepLearning4J, original C implementation, Spark (single node, cluster)
  3. Take a big enough corpus (e.g. Wikipedia or another publicly available corpus)
  4. Take a popular hardware provider: IBM Softlayer, AWS, SkyScale, Hetzner
  5. Choose an execution model: CPU (incl. multicore), GPU
  6. Fit a model, measure and report several metrics:

    • Time to train

    • Peak memory usage

    • Model quality, e.g. on the standard "word analogies task"

    • Total cost of training and the complexity of setup/usage for the given hardware provider

    • Complexity of setup/usage for the given ML framework -- how difficult is it to install, run and debug
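The first two metrics above (time to train, peak memory) can be captured with a small harness. The sketch below is a hypothetical, stdlib-only illustration: `benchmark` and `toy_train` are made-up names, and the toy workload stands in for a real training call (e.g. fitting gensim's Word2Vec). Note that `tracemalloc` only traces Python-level allocations, so for frameworks that allocate in native code you would need an external measurement such as `resource.getrusage` or a process monitor.

```python
import time
import tracemalloc

def benchmark(train_fn):
    """Run one training function and return (seconds, peak_bytes).

    `train_fn` is a hypothetical stand-in for a real training call,
    e.g. fitting gensim's Word2Vec on a corpus.
    """
    tracemalloc.start()
    t0 = time.perf_counter()
    train_fn()  # the workload being measured
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # peak traced Python allocation
    tracemalloc.stop()
    return elapsed, peak

def toy_train():
    # Toy workload standing in for model training.
    data = [i * i for i in range(100_000)]
    return sum(data)

elapsed, peak = benchmark(toy_train)
print(f"time: {elapsed:.3f}s, peak memory: {peak / 1024:.0f} KiB")
```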

The benchmark must be fully reproducible -- all scripts, data and settings must be recorded and versioned. It is also necessary to explicitly describe and set all relevant parameters, random seeds, etc. It is very important to write fully self-contained scripts for repeatable deployment. For example, you can use Docker/Ansible. Run the experiments multiple times, measuring the spread/variance of each metric.
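Measuring the spread of a metric across repeated runs can be sketched as follows. This is a minimal stdlib-only illustration with hypothetical names (`measure_runs` is not part of any framework); in the real benchmark `run_fn` would be one full training pass with fixed parameters and seeds.

```python
import statistics
import time

def measure_runs(run_fn, n_runs=5):
    """Repeat an experiment n_runs times and report the spread of the
    time-to-train metric. `run_fn` is a hypothetical stand-in for one
    full benchmark step (e.g. a single Word2Vec training pass)."""
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_fn()
        times.append(time.perf_counter() - t0)
    return {
        "runs": len(times),
        "mean_s": statistics.mean(times),
        # Sample standard deviation needs at least two observations.
        "stdev_s": statistics.stdev(times) if len(times) > 1 else 0.0,
    }

report = measure_runs(lambda: sum(i * i for i in range(50_000)), n_runs=3)
print(report)
```

The same pattern applies to the other metrics (peak memory, accuracy): collect one value per run and report mean and standard deviation rather than a single number.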

Results:

  1. Answers to the questions above, in the form of hard measurements and clear tables with summaries.
  2. A dedicated Github repo with all the scripts, configs and data links, so anyone can repeat the benchmarks themselves.
  3. A blog post on the RaRe site describing the setup, methodology, results and final recommendations.
Labels: difficulty medium, testing, wishlist


All 8 comments

@menshikh-iv Sounds really useful. I am interested to work on this issue.

Other potentially useful evaluations of word embeddings (along with code) can be found here - https://github.com/mfaruqui/eval-word-vectors
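For concreteness, the mechanics behind the standard word-analogies task (the 3CosAdd rule: find the word closest to b - a + c) can be sketched in plain Python. The vectors below are hand-crafted toy values, not learned embeddings; in practice you would run gensim's built-in analogy evaluation over a trained model and the standard `questions-words.txt` file.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b, c.
    This is the 3CosAdd rule used by the word-analogies task."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Hand-crafted 2-D toy vectors, purely illustrative.
toy = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
print(solve_analogy(toy, "man", "woman", "king"))  # prints "queen"
```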

@menshikh-iv I have finished training Gensim's Word2Vec on a Google Cloud n1-highcpu instance (4-core Xeon E5 with 3.6 GB RAM); it takes around 7.5 hours to train a model on the Wikipedia corpus. I will look into the TensorFlow and original C implementations of Word2Vec next.

@souravsingh @manneshiva maybe you two could work together?

It is very important to write fully self-contained scripts for repeatable deployment.

Before we even start running the benchmarks, we should focus on the setup to make everything (tests, scripts, etc.) reproducible. Using Docker seems to be the easiest way to achieve this. I have built a Docker image that lets us run the Word2Vec implementations of all the popular frameworks. I have also tested (ran) it with the original C, TensorFlow (CPU), Gensim and DL4J code on a small test corpus (text8). I will push the code to a repo as soon as I refactor it and write a few scripts.

@manneshiva you are right, keep it up!

@menshikh-iv I have created a repo to address this issue. Here is the link:
https://github.com/manneshiva/benchmark-word2vec-frameworks
A lot of things still need to be finished. Working on it; will complete it soon.

@manneshiva A similar post, for inspiration: http://minimaxir.com/2017/07/cpu-or-gpu/
