I am involved in a project that aims to improve the resolution of CP problems by using massive parallelism.
So I started learning parallelism as a novice.
At first, I thought it would not be complex to use all the cores of a modern multicore CPU at the same time. Then I learnt that it is not so easy...
My first test was on a simple problem requiring a lot of memory (about 500 MB). I expected that I could run the problem 4 times simultaneously on a 4-core machine at no extra cost, that is, that it would take no more time than on one core (or only slightly more because of the frequency change). The first result was clear: NO, it does not work like that.
So I ran some experiments on different machines, and I learnt that the results can be quite different! I reproduce them here.
Using k threads means that I solve the very same problem k times in parallel (one instance per thread). The ms column gives the elapsed time in milliseconds until all the problems are solved (that is, until all the threads have finished). The perf/thread column is the ms value divided by the number of threads, and the ratio column is the perf/thread value for 1 thread divided by the perf/thread value for k threads.
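As a sanity check on how the table columns relate, here is a minimal sketch that recomputes perf/thread and the ratio from the ms column. The function name `derived_columns` is only illustrative and is not part of the original benchmark; the sample rows are taken from the Intel i7 920 measurements on the 500 MB problem.

```python
def derived_columns(threads, elapsed_ms, baseline_ms):
    """Return (perf_per_thread, ratio) for one row of the tables.

    threads     -- number of threads, each solving one copy of the problem
    elapsed_ms  -- time until all threads have finished (the "ms" column)
    baseline_ms -- elapsed time measured with a single thread
    """
    perf_per_thread = elapsed_ms / threads      # ms spent per solved instance
    ratio = baseline_ms / perf_per_thread       # gain relative to 1 thread
    return perf_per_thread, ratio


# (threads, ms) rows measured on the Intel i7 920 with the 500 MB problem.
rows = [(1, 835), (2, 1000), (4, 1365), (8, 2460)]
baseline = rows[0][1]

for threads, ms in rows:
    perf, ratio = derived_columns(threads, ms, baseline)
    print(f"{threads:2d} threads: {perf:6.1f} ms per instance, ratio {ratio:.1f}")
```

A ratio equal to the number of threads would mean perfect scaling; the further below that it falls, the more the copies slow each other down.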
Intel i7 920: 4 cores, 2 memory channels, 2.66 GHz, 2 threads/core
Intel i7 3820: 4 cores, 4 memory channels, 3.6 GHz, 2 threads/core

| #threads | i7 920 ms | i7 920 perf/thread | i7 920 ratio / 1 thread | i7 3820 ms | i7 3820 perf/thread | i7 3820 ratio / 1 thread |
|---|---|---|---|---|---|---|
| 1 | 835 | 835 | 1.0 | 680 | 680 | 1.0 |
| 2 | 1000 | 500 | 1.7 | 750 | 375 | 1.8 |
| 4 | 1365 | 341 | 2.4 | 915 | 229 | 3.0 |
| 6 | 1900 | 317 | 2.6 | 1220 | 203 | 3.3 |
| 8 | 2460 | 308 | 2.7 | 1530 | 191 | 3.6 |
| 10 | 3240 | 324 | 2.6 | 1970 | 197 | 3.5 |
| 12 | 3850 | 321 | 2.6 | 2250 | 188 | 3.6 |

Intel i7 3930: 6 cores, 4 memory channels, 3.2 GHz, 2 threads/core
AMD FX8150: 8 cores, 2 memory channels, 3.6 GHz, 1 thread/core

| #threads | i7 3930 ms | i7 3930 perf/thread | i7 3930 ratio / 1 thread | FX8150 ms | FX8150 perf/thread | FX8150 ratio / 1 thread |
|---|---|---|---|---|---|---|
| 1 | 715 | 715 | 1.0 | 757 | 757 | 1.0 |
| 2 | 790 | 395 | 1.8 | 1100 | 550 | 1.4 |
| 4 | 900 | 225 | 3.2 | 1800 | 450 | 1.7 |
| 6 | 1090 | 182 | 3.9 | 2600 | 433 | 1.7 |
| 8 | 1390 | 174 | 4.1 | 3325 | 416 | 1.8 |
| 10 | 1630 | 163 | 4.4 | 4120 | 412 | 1.8 |
| 12 | 1890 | 158 | 4.5 | 5000 | 417 | 1.8 |

Intel 4x E7-4870: 10 cores/CPU, 4 memory channels, 2.4 GHz, 2 threads/core

| #threads | ms | perf/thread | ratio / 1 thread |
|---|---|---|---|
| 1 | 1744 | 1744 | 1.0 |
| 2 | 1733 | 867 | 2.0 |
| 4 | 2089 | 522 | 3.3 |
| 40 | 2712 | 68 | 25.7 |
| 80 | 4484 | 56 | 31.1 |
| 120 | 8350 | 70 | 25.1 |
| 160 | 10433 | 65 | 26.7 |
| 320 | 20566 | 64 | 27.1 |

What do we learn? For memory-intensive problems, one of the most important factors is the number of memory channels. We also learn that the latest Intel CPUs (3820 and 3930) are really good.
Note that in these benchmarks there is no concurrent access to memory: the memory is split into several parts at the beginning.
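To make the protocol concrete, here is a minimal sketch of such a measurement: every thread gets its own memory, allocated before the clock starts, and the elapsed time runs until the last thread has finished. All the names (`run_copies`, `make_state`, `solve`) and the toy workload are hypothetical stand-ins, not the original benchmark. Also note that in CPython the global interpreter lock serializes pure-Python code, so this only illustrates the protocol; the real measurements were presumably made with native code.

```python
import threading
import time


def run_copies(make_state, solve, k):
    """Solve k private copies of a problem in parallel; return elapsed ms."""
    # Allocate each thread's memory up front, so threads never share data.
    states = [make_state() for _ in range(k)]
    barrier = threading.Barrier(k + 1)   # k workers plus the timing thread

    def worker(state):
        barrier.wait()                   # all workers start at the same moment
        solve(state)

    threads = [threading.Thread(target=worker, args=(s,)) for s in states]
    for t in threads:
        t.start()

    barrier.wait()                       # release the workers
    start = time.perf_counter()
    for t in threads:
        t.join()                         # wait for the slowest thread
    return (time.perf_counter() - start) * 1000.0


def make_state(size=1_000_000):
    # Hypothetical stand-in for the private memory of one solver instance.
    return bytearray(size)


def solve(state):
    # Hypothetical workload: touch one byte out of every 64 in the buffer.
    for i in range(0, len(state), 64):
        state[i] = (state[i] + 1) & 0xFF


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, "threads:", round(run_copies(make_state, solve, k)), "ms")
```

Because the buffers are private, any slowdown observed as k grows comes from shared hardware (memory channels, caches, frequency scaling) rather than from synchronization between the threads.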
Can we observe these results for all problems? The answer is NO. When we do not need so much memory, the results are quite different, as the following tables show. These benchmarks use the same protocol, but the problem solved by each thread requires at most 1 MB of memory.
Intel i7 920: 4 cores, 2 memory channels, 2.66 GHz, 2 threads/core
Intel i7 3820: 4 cores, 4 memory channels, 3.6 GHz, 2 threads/core

| #threads | i7 920 ms | i7 920 perf/thread | i7 920 ratio / 1 thread | i7 3820 ms | i7 3820 perf/thread | i7 3820 ratio / 1 thread |
|---|---|---|---|---|---|---|
| 1 | 1790 | 1790 | 1.0 | 1100 | 1100 | 1.0 |
| 2 | 1790 | 895 | 2.0 | 1140 | 570 | 1.9 |
| 4 | 2000 | 500 | 3.6 | 1300 | 325 | 3.4 |
| 6 | 2800 | 467 | 3.8 | 1830 | 305 | 3.6 |
| 8 | 3610 | 451 | 4.0 | 2260 | 283 | 3.9 |
| 10 | 4500 | 450 | 4.0 | 2810 | 281 | 3.9 |
| 12 | 5400 | 450 | 4.0 | 3300 | 275 | 4.0 |

Intel i7 3930: 6 cores, 4 memory channels, 3.2 GHz, 2 threads/core
AMD FX8150: 8 cores, 2 memory channels, 3.6 GHz, 1 thread/core

| #threads | i7 3930 ms | i7 3930 perf/thread | i7 3930 ratio / 1 thread | FX8150 ms | FX8150 perf/thread | FX8150 ratio / 1 thread |
|---|---|---|---|---|---|---|
| 1 | 1210 | 1210 | 1.0 | 2000 | 2000 | 1.0 |
| 2 | 1210 | 605 | 2.0 | 2051 | 1026 | 2.0 |
| 4 | 1285 | 321 | 3.8 | 2160 | 540 | 3.7 |
| 6 | 1650 | 275 | 4.4 | 2625 | 438 | 4.6 |
| 8 | 1880 | 235 | 5.1 | 2830 | 354 | 5.7 |
| 10 | 2240 | 224 | 5.4 | 3850 | 385 | 5.2 |
| 12 | 2620 | 218 | 5.5 | 4490 | 374 | 5.3 |

Intel 4x E7-4870: 10 cores/CPU, 4 memory channels, 2.4 GHz, 2 threads/core

| #threads | ms | perf/thread | ratio / 1 thread |
|---|---|---|---|
| 1 | 1910 | 1910 | 1.0 |
| 2 | 1920 | 960 | 2.0 |
| 4 | 1920 | 480 | 4.0 |
| 40 | 3030 | 76 | 25.2 |
| 80 | 4080 | 51 | 37.5 |
| 120 | 6230 | 52 | 36.8 |
| 160 | 8230 | 51 | 37.1 |
| 320 | 16260 | 51 | 37.6 |

In this case, the AMD CPU clearly does better than in the previous case. However, it is still clearly outperformed by the recent Intel CPUs.
For those who may be interested, here are the results for an Intel i7 2600:
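Both halves of that statement can be checked against the 8-thread rows of the tables. The sketch below is only an illustration (the names `MEASUREMENTS` and `summarize` are mine): it recomputes the time per solved instance and the ratio to the 1-thread run for the AMD FX8150 and the Intel i7 3930 on both problems.

```python
# (ms with 1 thread, ms with 8 threads), copied from the tables above.
MEASUREMENTS = {
    "AMD FX8150, 500 MB problem": (757, 3325),
    "AMD FX8150, 1 MB problem": (2000, 2830),
    "Intel i7 3930, 500 MB problem": (715, 1390),
    "Intel i7 3930, 1 MB problem": (1210, 1880),
}
THREADS = 8


def summarize(ms_one, ms_k, k):
    """Return (ms per solved instance, ratio to the 1-thread run)."""
    per_instance = ms_k / k
    return per_instance, ms_one / per_instance


for name, (ms_one, ms_k) in MEASUREMENTS.items():
    per_instance, ratio = summarize(ms_one, ms_k, THREADS)
    print(f"{name}: {per_instance:.0f} ms per instance, ratio {ratio:.1f}")
```

On the small problem the AMD ratio rises from about 1.8 to about 5.7, which is the improvement noted above, yet its time per instance (about 354 ms) remains well above that of the i7 3930 (235 ms).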

Intel i7 2600

| #threads | ms | perf/thread | ratio / 1 thread |
|---|---|---|---|
| 1 | 685 | 685 | 1.0 |
| 2 | 841 | 421 | 1.6 |
| 4 | 1423 | 356 | 1.9 |
| 6 | 1943 | 324 | 2.1 |
| 8 | 2693 | 337 | 2.0 |
| 10 | 3532 | 353 | 1.9 |
| 12 | 4338 | 362 | 1.9 |

I also compared Windows (VS 2012) and Linux (latest version of GCC) on the very same machine (Core i7 3930): the maximum difference is 10% (Windows is faster when we do not need too much memory).
These results give me some input for my quest for a 100x improvement machine.
These results also show that we have to be careful when reporting results for parallelism. The impact of the machine used is greater than for sequential results.