			    Threadripper 2 Tests
				Matthew Dillon
				 DragonFlyBSD
			    2-Sep-2018 / 14-Oct-2018

    1-socket 32-core/64-thread Threadripper 2990WX
	(4 channels of DDR4 memory, cpu core advertised 2.9GHz, 3.4 actual)
    1-socket 8-core/16-thread Ryzen 2700X
	(2 channels of DDR4 memory, cpu core advertised 3.7GHz, 3.9 actual)
    2-socket 16-core/32-thread Xeon (2 x E5-2620)
	(12 channels of DDR4 memory, cpu core advertised 2.1GHz)
    4-socket 48-core/48-thread Opteron 6168
	(16 channels of DDR3 memory, cpu core advertised 1.9GHz)

    (The Xeon is using 2133 memory, I am not sure what the Opteron is using, and the 2990WX is running at 2666C15.)

TEST 1 - Concurrent compile test.  Compiles sys/net/altq/altq_red.c, a small 16K source file that is part of our kernel sources.  Each cpu thread runs a script locked to that thread which executes the compile in a loop, ~200 times.  The aggregate number of compiles is the same for each machine, so 200 iterations per thread for the threadripper, 400 for the Xeon, and ~266 for the Opteron (and we scale the results).  A rough sketch of the per-thread loop appears below, after the results and notes.

TEST 2 - time make -j 128 nativekernel NO_MODULES=TRUE > /dev/null.  This tests a kernel compile without modules.  There is a single-threaded make depend stage, then all kernel files are compiled concurrently with no gaps, then a single-threaded link stage.

TEST 1 - Scripted concurrent compile

    TR2@2666 4-channels, Stock cpu speeds
	3893.002u 1142.101s 1:30.32 5574.7% 7860+721k 153600+0io 0pf+0w

    TR2@3000 4-channels, Stock cpu speeds
	3601.388u 1081.815s 1:26.18 5434.1% 7820+718k 153600+0io 0pf+0w

    2700X@3000 2-channels, XFR2 (190W)
	2213.833u 464.188s 2:49.49 1580.0% 7924+727k 153600+0io 0pf+0w

    2xXeon
	3301.402u 763.107s 2:16.08 2986.8% 7773+714k 153600+0io 0pf+0w

    Monster, the quad Opteron (note: 48 cores, ran 48 scripts, so the result must be scaled to 64 for equivalence):
	3024.338u 798.413s 1:43.58 3690.6% 7740+711k 115200+0io 0pf+0w
	103.58 seconds -> 138.10 seconds
	------------------
	2:18.09 (roughly the same as the Xeon)

TEST 2 - make -j 128 nativekernel NO_MODULES=TRUE > /dev/null

    TR2@3000 4-channels, Stock cpu speeds
	702.136u 129.311s 0:48.12 1727.8% 8034+737k 18250+8io 136pf+0w

    TR2@2166 4-channels, Stock cpu speeds
	732.324u 128.462s 0:50.68 1698.4% 8026+736k 18890+8io 180pf+0w

    TR2@2133 2-channels, Stock cpu speeds
	1396.239u 361.471s 1:04.88 2709.1% 7981+732k 18032+8io 124pf+0w

    RYZEN2700X@3000 2-channels, XFR 190W
	425.862u 68.100s 0:58.25 848.0% 7940+729k 21996+936io 124pf+0w

    2xXeon
	638.787u 113.125s 1:11.37 1053.5% 7863+723k 61048+0io 136pf+0w

    Opteron
	820.761u 166.210s 2:07.32 775.1% 7862+724k 60994+0io 788pf+0w

A note on DragonFlyBSD

DragonFlyBSD is a well known BSD OS project that began well over 15 years ago and has been kept up-to-date ever since.  The primary goal for DFly is maximal SMP efficiency.  These tests run with almost zero SMP contention, and that is a big deal for compiles, which make an unbelievable number of system calls over their lifetimes.  All the tests are well cached and use tmpfs for output files, so there is zero actual disk I/O.

    DragonFly v5.3.0.18645.g2c5cc-DEV
    DragonFly v5.3.0.18645.g2c5cc-DEV + NUMA fixes

On October 14th I ran additional tests explicitly intended to measure memory latency, using a separate jail with all files copied (no null mounts) for each concurrent test.  In addition, I found a bug in DragonFly's kernel allocation code that was causing a slight amount of contention on tests with more than 16 cpu threads.  With that bug fixed, the 32-thread and 64-thread tests did not diverge as much as in the original tests.  The NUMA fixes only affect tests with more than 16 threads (i.e. tests that have to make use of the secondary CPU dies that do not have a directly-attached memory controller).  Generally speaking, the fix had a noticeable, but not huge, effect.  The NO_MODULES=TRUE kernel compile test (TEST 2) improved by 3.8 seconds out of ~50 or so.
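As a rough illustration of the TEST 1 harness, here is a minimal sketch of the per-thread compile loop.  This is NOT the original test script: pin_to_cpu() is a hypothetical placeholder for the OS's cpu-binding facility, and the include paths/flags needed to actually build a kernel source file are elided.

    /*
     * TEST 1 sketch (not the original test script).  Fork one worker per
     * cpu thread; each worker repeatedly compiles the same small source
     * file to its own object file in /tmp (tmpfs).
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NCPUS  64    /* 64 on the 2990WX, 32 on the Xeon, 48 on Monster */
    #define LOOPS  200   /* per thread; chosen so the aggregate count matches */

    /* hypothetical: bind the calling process to the given cpu thread */
    static void pin_to_cpu(int cpu) { (void)cpu; /* platform-specific, omitted */ }

    int main(void)
    {
	int cpu, i;

	for (cpu = 0; cpu < NCPUS; ++cpu) {
	    if (fork() == 0) {
		char cmd[256];

		pin_to_cpu(cpu);
		snprintf(cmd, sizeof(cmd),
		    "cc -c sys/net/altq/altq_red.c -o /tmp/altq_red.%d.o",
		    cpu);
		for (i = 0; i < LOOPS; ++i)
		    system(cmd);	/* kernel include paths omitted */
		_exit(0);
	    }
	}
	while (wait(NULL) > 0)		/* the whole run is wrapped in time(1) */
	    ;
	return 0;
    }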
Overclocking the CPU

I ran the concurrent compile test with the CPU power envelope increased from 250W to 500W.  At stock speeds the cpu ran at around 3.4GHz on all cores at full load.  Overclocked, it ran at 3.9GHz-4.0GHz on all cores at full load.  However, the concurrent compile test was only 2 seconds faster.  This was with memory running at 2666.  I honestly don't want to run the test again with memory at 3000 or with the later fixes because I only have a 650W PSU in the box, and 500W is cutting it too close.  It's a scary test to run that much power into a cpu socket!

What I think is going on here is that the concurrent compile test is limited by memory bandwidth, so overclocking the CPU doesn't help.  Otherwise, though, for workloads that have more computation and less memory bandwidth, I would expect significant scaling going from 3.4GHz to 4.0GHz (essentially another 20%).  But I wouldn't do this on a production system because power efficiency goes to hell.

This is somewhat validated by the 2700X test, where a maximal amount of CPU is available but with only 2 memory channels.  The concurrent compile (TEST 1) took almost precisely twice the time.  So it is clearly possible for the 2990WX to limit-out on memory bandwidth.

Synth Test

One of the fun tests we do on these beefy machines is a full bulk build of ports, now around 30500 applications (including flavors).  This is not a well-controlled test because the ports set is getting bigger all the time, but it does give us a very good performance metric for a heavily loaded system doing a massive amount of concurrent compilation and filesystem access.  It allows us to test performance-enhancing algorithms in the kernel, among other things.

    Threadripper 2990WX       64G of ram,  96G of SSD swap
    Dual socket Xeon         128G of ram, 200G of SSD swap
    Quad socket Opteron      128G of ram, 593G of SSD swap

The Synth workload is memory-heavy and system-call-heavy, but all extracted working sets for the build are in tmpfs, so 'permanent' storage is barely touched and basically not relevant to the test.

    http://apollo.backplane.com/DFlyMisc/synth_times.txt

A full bulk build with a recent DPorts (with flavors) takes around 21 hours on the Xeon and 18 hours on the Opteron system.  The 2990WX runs the build in 12 hours, which is 75% faster than the 2-socket Xeon and 50% faster than the quad-socket Opteron.  Not only that, but since I only had 64G of ram in the TR2 (the Xeon and Opteron systems have 128G), I had to balance the workload against paging load, with the result being significantly more cpu idle time on the TR2 during the test than on either of the other systems.  This means the TR2 will be even faster on this test once I get 128G of ram into it.

Statistics graph for the 2990WX synth run:

    http://apollo.backplane.com/DFlyMisc/synth_tr2_64g.jpg

Memory Timings and Power Limits

WARNING!  YOU CAN TURN YOUR VERY EXPENSIVE CPU INTO SCRAP METAL IF YOU START MESSING WITH CPU OVERCLOCKING.  OVERCLOCKING MEMORY VIA XMP IS RELATIVELY SAFE, BUT OVERCLOCKING A THREADRIPPER's CORE CPU FREQUENCY CAN BE DANGEROUS AS HELL!

Running the concurrent compile test, again on all 64 threads, I wanted to determine whether boosting the CPU mattered.
The answer, at least for the compile test (which is fairly memory intensive), is that boosting the cpu just wastes energy.  I then tried going the other way, capping power consumption, and got some very interesting results in the efficiency metrics.

(1) Note that we are getting an indirect measure of the infinity fabric power consumption (though the memory sticks' own consumption is also included).  At reduced power caps there is a trade-off between memory speed and CPU frequency that the BIOS has to make, and it is very obvious in the test results.

(2) This particular compile test doesn't actually lose ANY performance even at bounds that are 80W lower, and still gets within 10% when over 110W lower, when adjusting the fabric speed a little lower to allow the CPUs to run a little faster.

(3) Even when I drop power consumption all the way down to 153W at the wall, there's a sweet spot using 2666MHz memory that allows the CPU to run at a modest frequency for the same consumption at the wall.  I have to say, it's pretty insane that I can get numbers with power capped at 153W that are within 20% of the numbers at 353W for a memory-intensive workload.  I think the sweet spot for server-like operations where power efficiency is important is probably around 220W at the wall (the 150W PPT setting).

(4) By my reckoning, AMD's 'stock' parameters are actually a bit goosed, but not too much.  This would ensure that people running with only XMP (memory speed boost) adjustments don't blow something up, and also not cut computation-heavy (memory-light) workloads off at the knees.  I would conclude that most people buying a threadripper system should just mess with the memory (XMP) profile and not try to goose the CPU any more... and, in fact, depending on the situation, might want to reduce the power envelope via the PPT cap to improve efficiency.

(5) NUMA vs UMA.  It appears that UMA cannot be set for the 2990WX, but can be set for the 2950X.  UMA mode is normally set via AMD CBS, Zen options for memory.  Set the interleaving mode to DIE and the interleave size to whatever (I used 2KB).  This will not do anything on a 2990WX so don't bother; the 2990WX is NUMA-only.

The memory fabric really gets costly in terms of power at 2933 or higher.  This argues for NOT running your memory full-out, because you can bump up the cpu frequency a bit more if you don't, but also for not running your memory at a lowly 2133.

Times are in minutes:seconds, lower is better.  CPU frequency is free-floating; ranges are listed at full load.  The BIOS is set to cap by wattage only.

Set ~350W (PPT,TDC,EDC) BIOS limit (cpu socket)

    memory 3000MHz   1:26   353W at wall   83W idle, 3.6 GHz
	(The CPU frequency seems wrong here, but I don't want to blow my machine up so I'm not going to re-test.  I might have meant to say 3.8 or 3.9.)

Stock CPU power limits - Concurrent compile test

    memory 3000MHz   1:27   300W at wall   84W idle, 3.3-3.4 GHz
    memory 2800MHz   1:32   293W at wall   83W idle, 3.3-3.4 GHz
    memory 2666MHz   1:34   252W at wall   66W idle, 3.3-3.4 GHz
    memory 2400MHz   1:42   241W at wall   63W idle, 3.3-3.4 GHz
    memory 2133MHz   1:57   229W at wall   62W idle, 3.3-3.4 GHz
	(note: the highest stock wattage is around 330W at the wall with 3000MHz memory; this test only gets to 300W)

    memory 2666MHz (NOTE1)   1:07   316W at wall   67W idle, 3.3-3.4 GHz

    NOTE1 - After scheduler work on the kernel; the other results are from before the scheduler work.  Compare this 1:07 against the 1:13 we get at 225W at the wall (150W PPT).
Set 210W PPT BIOS limit (cpu socket)  (SLIGHTLY LOWER THAN STOCK)

    memory 3000MHz   1:27   295W at wall   83W idle, 3.3-3.4 GHz
    memory 2800MHz   1:32   293W at wall   83W idle, 3.3-3.4 GHz
    memory 2666MHz   1:35   295W at wall   68W idle, 3.5-3.6 GHz
    memory 2400MHz   1:44   279W at wall   65W idle, 3.6 GHz
    memory 2133MHz   1:57   275W at wall   65W idle, 3.6 GHz (note: power capped)

Set 150W PPT BIOS limit (cpu socket)  (LOWER THAN STOCK)

    memory 3000MHz   1:29   220W at wall   83W idle, 2.6-2.7 GHz
    memory 2800MHz   1:34   220W at wall   82W idle, 2.6-2.7 GHz
    memory 2666MHz   1:36   220W at wall   65W idle, 3.0-3.1 GHz
    memory 2400MHz   1:47   219W at wall   62W idle, 3.1-3.2 GHz
    memory 2133MHz   1:59   218W at wall   62W idle, 3.2-3.3 GHz (note: power capped)

    memory 2666MHz (NOTE1)   1:13   225W at wall   62W idle, 2.7-2.8 GHz

    NOTE1 - After scheduler work on the kernel; the other results are from before the scheduler work.

Set 125W PPT BIOS limit (cpu socket)  (MUCH LOWER THAN STOCK)

    memory 3000MHz   1:53   188W at wall   86W idle, 1.4-1.5 GHz
    memory 2800MHz   1:49   187W at wall   82W idle, 1.7-1.8 GHz
    memory 2666MHz   1:38   190W at wall   65W idle, 2.6-2.7 GHz
    memory 2400MHz   1:47   189W at wall   65W idle, 2.8-2.9 GHz
    memory 2133MHz   1:58   187W at wall   62W idle, 2.9-3.0 GHz (note: power capped)

Set 100W PPT BIOS limit (cpu socket)  (RIDICULOUSLY LOWER THAN STOCK)

    memory 3000MHz   3:42   153W at wall   86W idle, 655 MHz
    memory 2800MHz   3:33   153W at wall   86W idle, 724 MHz
    memory 2666MHz   1:47   160W at wall   65W idle, 1.7-1.8 GHz
    memory 2400MHz   1:50   160W at wall   65W idle, 2.2-2.3 GHz
    memory 2133MHz   2:00   158W at wall   62W idle, 2.4-2.5 GHz (note: power capped)

Conclusions

The Threadripper 2990WX is a beast.  It is at *least* 50% faster than both our quad-socket Opteron and the dual-socket Xeon system I tested against.  The primary limitation for the 2990WX is likely its 4 channels of DDR4 memory, and like all Zen and Zen+ CPUs, memory performance matters more than CPU frequency (and costs almost no power to pump up the performance).  That said, it still blows away a dual-socket Xeon with 3x the number of memory channels.  That is impressive!

My particular TR2 system is air-cooled (the Noctua NH-U14S).  Airflow is inline front-to-back (including the CPU cooler's fan).  At full load (3000MHz DDR4, stock CPU speeds) the system pulls 250W-350W at the wall.  (In contrast, the Xeon system pulls 200W and the old Opteron system pulls 1000W.)  But for memory-heavy workloads I can bring the Threadripper's power envelope all the way down to 220W without losing much performance, and at that power point the power efficiency is just insane.

At full load, after ~15 minutes, I read around 55C on the VRM heatsink, 32C for the memory on the ingress side of the fans, and 55C for the memory on the egress side.  ACPI shows the cpu at 52C.  For me, not being a dedicated overclocker, those temps are ok, but I don't think I would want to O.C. the CPU much (if at all).  Just dropping the memory speed a bit makes a big difference in power consumption.  This puts the 2990WX at par efficiency vs a dual-socket Xeon system, and better than the dual-socket Xeon with slower memory and a power cap.  This is VERY impressive.

I should note that the 2990WX is more specialized, with its asymmetric NUMA architecture and 32 cores.  I think the sweet spot in terms of CPU pricing and efficiency is likely going to be the 2950X (16-cores/32-threads).  It is clear that the 2990WX (32-cores/64-threads) will max out 4-channel memory bandwidth for many workloads, making it a more specialized part.  But still awesome.
Increasing the power envelope might help with cpu-centric workloads, but it won't make one iota of difference for memory-centric workloads such as compiles.  In fact, detuning the power consumption can give you the same performance for such workloads at far greater power efficiency.  For the same reason, buying 3000MHz memory will improve performance for memory-centric workloads, but there is a trade-off against power consumption at those levels and I expect that for most people it will be better to go with 2666MHz memory.

This thing is an incredible beast.  I'm glad I got it.

ECC + DECREASING THE POWER ENVELOPE

Since the TR2 is primarily limited by its 4 channels of DDR4 memory, if you have a system with ECC memory running at a fairly low frequency (2133, 2400, 2666 MHz EUDIMMs), you might actually want to use XFR2's PPT settings to *REDUCE* the power envelope of the CPU instead of increasing it.  For example, you can run the CPU at 200W or 230W instead of at 300W and, depending on your workload, potentially get the same performance as you would at 330W.  Many workloads are going to be limited by memory bandwidth.  For example, bulk compiles are almost certainly going to be limited by memory bandwidth, and running the cpu cores at a higher frequency just won't give you anything.

This is particularly true when running 128GB of memory.  128GB memory configurations cannot be overclocked to the same degree that configurations with fewer or less-dense DDR4 sticks can be, and doubly so if you are running ECC.  In this situation, test your workload at various power envelopes all the way down to 180W to determine the most efficient operating point for the 2990WX.  When the power envelope is used in this manner, the 2990WX can be made significantly more power/performance efficient than most dual-socket Xeon systems.

CONCURRENT COMPILE TESTS, 260W RESTRICTION AT THE WALL (PPT=175, 2666 memory @C15) (WITHOUT NUMA BUG FIX)

This is compiling the same .c file to different .o's in /tmp, and exec'ing the same compiler binaries concurrently, so there is a considerable amount of resource sharing going on.  Timing test for 200 serial compiles x N concurrency, N = 1 to 64.  Each compile is essentially a 'cc', meaning cc -> cc1 | as.

For the locked case, note that once we get to N=32 we are starting to lock to hyperthreads, and the overall result will be governed by the worst-case hyperthread pair.  Thus we see an immediate jump in time because all other cores go idle while the few with hyperthread pairs continue to churn (albeit at a higher CPU frequency).  When we don't lock the load to particular cpus, the scheduler will do a better job smoothing out this kind of asymmetry, but at the cost of some churn.  In the non-locked case we see a small step-function loss of performance due to scheduler churn.

Remember that cc1 and as for a single job in the loop are exec'd concurrently so, in fact, they will be scheduled to different cpus if possible.  That said, cc1 tends to 'burst' the assembly out to as all at once at the end, so there is not much actual concurrency on a per-job basis.  When we go over N=32, however, the scheduler begins to shift the odd-man-out threads between cores, smoothing the result, so there is no step function in time going from 32 to 33 and the non-locked test is suddenly doing better than the locked test.  It doesn't last, scheduler churn catches up again, but it's an interesting effect.
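For reference, the Nx figures listed below are consistent with the locked-case wall times in the table further down.  With t(N) the wall time for N concurrent loops (each loop doing the same amount of work), the aggregate speedup over a single loop works out to:

    speedup(N) = N * t(1) / t(N)

    e.g. N=32:  32 * 29.19s / 48.42s  =~  19.29x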
* We observe 6.89x performance going from N=1 to N=8; the perfect case would be 8x.

* We observe 12.99x performance going from N=1 to N=16; the perfect case would be 16x.

* We observe 19.29x performance going from N=1 to N=32; the perfect case would be 32x.

* We observe 22.40x performance going from N=1 to N=64.  This is the boost we get from hyperthreading... around another 3 (full-bandwidth) cores worth of performance going from 32 to 64.

* We observe 1.48x performance going from N=16 to N=32; the perfect case would be 2x.

* We observe 1.16x performance going from N=32 to N=64.  This is the hyperthreaded case, so in this test hyperthreading improved IPC by an additional 16%.

* Finally, if we increase the power envelope to 330W at the wall (250W -> 330W) we observe only an 8% improvement on the N=64 test.

These losses are almost certainly related to hitting memory bandwidth caps on the TR2's 4 memory channels, and to cache/sharing overhead in the kernel.  However, these tests have other variables, in particular OS resource sharing, NUMA memory allocation, and CPU auto-turboing within the power envelope.  See the set after this one for the fixed-frequency / non-shared compile test.  I was able to remove the OS shared resources and most of the memory allocation SMP conflicts (due to a bug) in the MEMORY LATENCY tests, which I ran a few weeks later.

         [ EACH LOOP LOCKED TO A CPU ]            [ NOT LOCKED TO A CPU ]
    01    24.358u    4.458s 0:29.19   98.6%      25.379u    6.435s 0:29.80  106.7%
    02    49.570u    9.085s 0:29.72  197.3%      51.068u   13.647s 0:30.73  210.5%
    03    77.036u   15.034s 0:31.17  295.3%      81.992u   18.893s 0:32.91  306.5%
    04   104.166u   21.221s 0:31.83  393.9%     108.576u   27.282s 0:34.04  399.0%
    05   130.015u   28.054s 0:32.41  487.6%     139.098u   36.134s 0:35.04  500.0%
    06   156.970u   35.131s 0:32.73  586.9%     169.537u   42.250s 0:35.53  596.0%
    07   185.297u   41.739s 0:32.98  688.3%     199.329u   50.386s 0:36.37  686.5%
    08   215.101u   51.916s 0:33.86  788.5%     230.967u   63.317s 0:37.23  790.4%
    09   243.638u   57.243s 0:34.22  879.2%     265.157u   67.484s 0:37.82  879.5%
    10   270.026u   63.240s 0:34.17  975.2%     296.513u   83.845s 0:39.22  969.7%
    11   301.748u   68.491s 0:34.19 1082.8%     326.524u   96.977s 0:39.09 1083.3%
    12   329.669u   79.172s 0:34.58 1182.2%     358.294u  104.100s 0:39.10 1182.5%
    13   358.051u   86.439s 0:34.70 1280.9%     396.529u  124.403s 0:40.49 1286.5%
    14   387.708u   92.283s 0:34.86 1376.8%     432.112u  133.455s 0:41.13 1375.0%
    15   415.959u  101.557s 0:35.24 1468.5%     466.407u  146.063s 0:41.59 1472.6%
    16   449.786u  110.657s 0:35.96 1558.4%     506.392u  160.839s 0:42.42 1572.8%
    17   479.247u  125.413s 0:35.99 1680.0%     541.989u  175.557s 0:43.04 1667.1%
    18   510.136u  131.999s 0:36.59 1754.9%     575.657u  181.133s 0:42.93 1762.8%
    19   542.507u  142.056s 0:38.90 1759.7%     619.813u  195.384s 0:43.74 1863.7%
    20   574.530u  154.355s 0:40.22 1812.2%     659.492u  201.122s 0:44.37 1939.6%
    21   613.670u  162.686s 0:40.76 1904.6%     697.902u  223.302s 0:45.03 2045.7%
    22   645.295u  176.658s 0:41.18 1995.9%     741.871u  232.805s 0:45.68 2133.6%
    23   677.973u  190.436s 0:41.64 2085.4%     779.278u  250.854s 0:46.33 2223.4%
    24   718.547u  206.327s 0:42.49 2176.6%     826.102u  265.695s 0:46.43 2351.4%
    25   756.149u  218.904s 0:43.05 2264.9%     867.463u  276.337s 0:47.40 2413.0%
    26   790.960u  241.969s 0:43.86 2355.0%     915.360u  294.381s 0:48.30 2504.6%
    27   831.009u  254.781s 0:44.39 2446.0%     956.663u  307.738s 0:48.74 2594.1%
    28   874.643u  271.044s 0:44.88 2552.7%    1004.950u  327.544s 0:49.72 2679.9%
    29   919.603u  290.522s 0:45.80 2642.1%    1053.129u  342.459s 0:49.78 2803.4%
    30   956.533u  317.450s 0:46.57 2735.6%    1099.098u  359.170s 0:51.03 2857.6%
    31  1003.951u  332.093s 0:47.36 2821.0%    1152.805u  370.670s 0:51.51 2957.6%
    32  1051.647u  355.312s 0:48.42 2905.7%    1202.240u  391.165s 0:52.32 3045.4%
    33  1115.878u  362.188s 0:56.15 2632.3%    1253.501u  408.950s 0:52.98 3137.8%
    34  1169.432u  381.370s 0:56.91 2725.0%    1312.901u  428.970s 0:54.05 3222.7%
    35  1234.850u  399.358s 0:57.83 2825.8%    1367.018u  445.754s 0:55.06 3292.3%
    36  1309.884u  407.763s 0:57.22 3001.8%    1426.144u  468.195s 0:55.94 3386.3%
    37  1368.728u  427.967s 0:58.65 3063.3%    1484.853u  488.646s 0:56.40 3499.0%
    38  1436.710u  444.548s 0:59.39 3167.6%    1542.732u  510.618s 0:57.24 3587.2%
    39  1518.529u  457.861s 1:00.28 3278.6%    1601.704u  525.208s 0:58.81 3616.5%
    40  1583.535u  476.039s 1:00.45 3407.0%    1661.849u  551.896s 0:58.94 3755.9%
    41  1672.601u  484.721s 1:01.56 3504.4%    1724.686u  576.029s 1:00.01 3833.8%
    42  1717.921u  502.885s 1:00.85 3649.6%    1794.104u  592.641s 1:00.88 3920.4%
    43  1777.458u  521.377s 1:00.83 3779.0%    1854.301u  612.750s 1:01.66 4001.0%
    44  1838.486u  533.679s 1:01.66 3847.1%    1925.946u  634.533s 1:02.62 4088.9%
    45  1927.931u  551.890s 1:02.24 3984.2%    1980.492u  654.785s 1:04.15 4107.9%
    46  1994.668u  564.091s 1:01.99 4127.6%    2056.067u  681.279s 1:04.52 4242.6%
    47  2064.817u  587.681s 1:02.67 4232.4%    2118.048u  706.150s 1:05.95 4282.3%
    48  2149.009u  599.155s 1:02.80 4376.0%    2194.818u  731.134s 1:06.54 4397.2%
    49  2207.048u  623.900s 1:07.92 4168.0%    2259.866u  754.583s 1:07.68 4453.9%
    50  2284.487u  649.000s 1:09.02 4250.1%    2330.967u  775.348s 1:08.72 4520.2%
    51  2376.840u  676.010s 1:10.80 4311.9%    2401.277u  804.394s 1:09.83 4590.6%
    52  2446.429u  703.630s 1:11.99 4375.6%    2477.763u  831.367s 1:10.94 4664.6%
    53  2548.868u  732.945s 1:12.79 4508.5%    2551.761u  856.038s 1:11.92 4738.3%
    54  2610.208u  769.572s 1:13.55 4595.2%    2615.837u  882.833s 1:13.22 4778.2%
    55  2674.063u  793.926s 1:13.67 4707.4%    2690.139u  905.051s 1:14.15 4848.5%
    56  2755.808u  825.292s 1:14.97 4776.6%    2767.776u  936.510s 1:14.95 4942.3%
    57  2864.763u  862.357s 1:15.68 4924.8%    2846.835u  964.574s 1:16.34 4992.6%
    58  2904.320u  908.749s 1:15.99 5017.8%    2919.594u  987.038s 1:17.44 5044.7%
    59  2997.571u  936.030s 1:17.42 5080.8%    2988.816u 1013.862s 1:18.63 5090.5%
    60  3130.319u  972.251s 1:17.67 5282.0%    3064.238u 1033.900s 1:19.68 5143.2%
    61  3225.914u 1020.308s 1:19.45 5344.5%    3136.046u 1063.830s 1:21.10 5178.6%
    62  3351.754u 1050.025s 1:20.51 5467.3%    3213.474u 1093.905s 1:21.91 5258.6%
    63  3422.268u 1107.337s 1:21.36 5567.3%    3283.901u 1119.551s 1:23.87 5250.3%
    64  3556.139u 1188.396s 1:23.41 5688.1%    3360.159u 1136.321s 1:24.41 5326.9%

MEMORY LATENCY TEST

CONCURRENT COMPILE, FIXED FREQUENCY, NO SHARED RESOURCES (WITH BUG FIX TO KERNEL NUMA)  Added 14-Oct-2018

Here we run the same tests as above, but with the CPU clocks fixed at 3.4 GHz.  We do this by setting manual OC mode in the BIOS and spinning in idle (no HLT).  The cpu frequency has been verified for all tests.  Note that the baseline is slower than our PPT test because the CPU frequency is fixed at a lower value even for tests with just a few threads.  In addition, each individual loop runs in its own jail with no shared system resources (each jail has its own compiler binary, include files, source files, libraries, etc).  DragonFlyBSD tracks lock contention and latency, and I verified that the contention is close to 0.

With these settings, the CPU temperature stabilized at 55C.  The power curve was more interesting, though.  The system pulls 380-390W or so up to 32 threads.  But when we go above 32 threads we continue to gain marginal performance improvements by adding hyperthreads, and power consumption actually goes DOWN rather than up.
Another interesting facet occurs in the transition from the primary memory-connected nodes (through 16 threads) to the secondary nodes (through 32 threads).  The times are for the slowest thread.  What we can see here is that the primary nodes actually starved the secondary nodes for memory bandwidth a bit, causing a jump going from 16 to 24 threads.  And then, when we start digging into the sibling hyperthreads above 32 threads, something strange occurs... power consumption goes down even as aggregate performance incrementally improves.

We can come to a number of conclusions from these numbers:

* A 7% loss on the primary (direct memory connected) nodes, 1 thread versus 16 threads.

* An 18% loss from 1 thread to 32 threads, using only cores.  This is mainly due to the primary nodes starving the secondary nodes quite a bit, so the primary nodes finish the test first (by a good margin).  This is something AMD could improve on.

* And finally, a 50% performance improvement with hyperthreading, going from 32 threads (on only cores) to 64 threads (on all cpu threads).  And not only that, but also a 10% reduction in power consumption.  So very, very serious power efficiencies are gained from hyperthreading.

         [ EACH LOOP LOCKED TO A CPU ]
    01    36.530u    6.572s 0:43.14   3.4 GHz, 384W
    02    75.216u   12.914s 0:44.51   3.4 GHz, 384W   -3.2% 1->2
    04   154.884u   27.376s 0:46.14   3.4 GHz, 384W   -3.7% 2->4
    08   310.280u   54.430s 0:46.37   3.4 GHz, 383W   -0.5% 4->8
    16   622.688u  104.389s 0:46.50   3.4 GHz, 386W   -5.5% 8->16
    24   964.627u  166.673s 0:49.10   3.4 GHz, 390W   -5.5% 16->24
    32  1337.391u  238.400s 0:50.93   3.4 GHz, 393W   -3.7% 24->32
    40  1818.287u  342.507s 0:59.89   3.4 GHz, 380W   <--- note drop in W
    48  2347.969u  474.800s 1:02.76   3.4 GHz, 368W   <--- note drop in W
    56  3023.862u  629.234s 1:11.69   3.4 GHz, 355W   <--- note drop in W
    64  3871.123u  822.167s 1:19.31   3.4 GHz, 340W   <--- note drop in W

I can also intentionally load the indirect CPU dies.  Here are the 01-count and 04-count runs on ONLY the indirect CPU dies.  We can see that the timings are almost the same as above for the 1-thread and 4-thread tests.

         [ EACH LOOP LOCKED TO A CPU ]
    01    37.146u    5.676s 0:42.94   3.4 GHz, 364W
    04   161.184u   27.614s 0:47.39   3.4 GHz, 370W

CONCURRENT COMPILE, FIXED PPT (PPT 150 = ~225W at wall), NO SHARED RESOURCES (WITH BUG FIX TO KERNEL NUMA)  Added 14-Oct-2018

If we run this with a PPT limit instead of a fixed core frequency, here are the results using PPT=150 (2666 memory @C15).  We are on primary nodes from 1-16 threads, secondary nodes from 16-32 threads, and then start eating into hyperthread siblings from 32-64 threads.  We can see a clear transition once we hit the secondary nodes, probably due to higher memory fabric power consumption, and then again once we start to dig into the hyperthread siblings.

A number of conclusions can be reached:

* For cores, power efficiency normalized to 16 threads worth of work is included below in watt-seconds (total energy).  We can see that adding cores, and also sibling hyperthreads, improves efficiency linearly.

* Even though power efficiency improves, we also see that time efficiency does not improve a whole lot with hyperthreading in this situation (where wall power is capped at 225W).  Total time for 32 threads is 45.62 (91.23 for the same aggregate amount of work), versus 82.45 for 64 threads using hyperthreads.  That is only equivalent to a 10% improvement in time.
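For reference, the EFF column in the table below is consistent with total wall energy normalized to 16 threads' worth of work:

    E(N) = t(N) * P(N) * 16 / N   (watt-seconds)

    e.g. N=1:  28.76s * 107W * 16  =~  49237 Ws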
         [ EACH LOOP LOCKED TO A CPU ]
    ITER                                  POWER            LOSS        EFF
    01    24.071u    4.584s 0:28.76   4.1 GHz, 107W     -         49237 Ws
    02    49.591u    9.806s 0:29.99   4.1 GHz, 119W    -4.2%      28550 Ws
    04   105.278u   20.214s 0:31.95   4.0 GHz, 145W    -6.5%      18530 Ws
    08   214.092u   40.870s 0:32.72   4.0 GHz, 200W    -2.4%      13087 Ws
    16   465.457u   84.157s 0:35.39   3.7 GHz, 215W    -8.1%       7608 Ws
    24   786.285u  154.921s 0:40.87   3.3 GHz, 217W   -15.4%       5912 Ws
    32  1181.459u  232.569s 0:45.62   3.1 GHz, 225W   -11.6%       5132 Ws
    40  1756.814u  342.938s 1:02.87   3.0 GHz, 225W               5658 Ws
    48  2420.125u  475.546s 1:06.43   2.9 GHz, 226W               5004 Ws
    56  3217.801u  636.215s 1:15.61   2.8 GHz, 226W               4882 Ws
    64  4181.927u  817.397s 1:22.45   2.8 GHz, 227W               4679 Ws

As a side note, with the NUMA bug fix that reduces kernel memory allocation contention, the 64-thread test came in at 6088.0% versus 5688% without the fix.  That is, 60.8x concurrency versus 56.88x, so the bug fix definitely made a difference.

I can also intentionally load the indirect CPU dies.  Here are the 01-count and 04-count runs on ONLY the indirect CPU dies.  We can see that the timings are almost the same as above for the 1-thread and 4-thread tests.  The 4-thread test is only slightly slower, but probably within the margin of error.  The additional memory latency does not appear to cause problems on the alternative nodes for a small number of threads when they are not competing against the primary nodes.

         [ EACH LOOP LOCKED TO A CPU ]
    01    24.035u    4.640s 0:28.70   4.1 GHz, 108W
    04   112.019u   22.174s 0:33.66   4.1 GHz, 150W

PPT POWER TESTS W/COMPILE (WITHOUT NUMA BUG FIX)

We can very easily show that the system is constrained by memory bandwidth by running the same test at three different wattages (and thus three different CPU frequencies).

    16   518.192u  118.058s 0:40.97 1552.9%   (PPT=100 - 160W at wall)
    16   449.786u  110.657s 0:35.96 1558.4%   (PPT=175 - 260W at wall)
    16   417.448u  102.839s 0:33.27 1563.7%   (PPT=400 - 425W at wall)

    64  5848.872u 1514.124s 2:02.32 6019.4%   (PPT=100 - 160W at wall, 1.6GHz)
    64  3556.139u 1188.396s 1:23.41 5688.1%   (PPT=175 - 260W at wall, 3.0GHz)
    64  3308.483u 1112.276s 1:21.77 5406.3%   (PPT=400 - 425W at wall, 3.7GHz)

I can also boost the memory from 2666 to 2800 and show the improvement due to running the memory faster.

    64  3493.878u 1100.301s 1:20.37 5716.2%   (PPT=175 - 256W at wall, 2.8GHz)
    64  3089.999u 1078.154s 1:16.01 5483.6%   (PPT=400 - 480W at wall, 3.8GHz)

My conclusion is that, at least for these kinds of loads, it's a waste of watts to run the cpu at high power levels.  Even more importantly (and I hope this isn't lost on people), look at the efficiency at lower power levels!  For many workloads we can get nearly the same performance for significantly less power.  Running the TR2 at 260W (at the wall) is 70W lower than stock and it gets nearly the same performance on this test.

ANONYMOUS MEMORY FAULT RATE, PPT=175 (260W at wall) (WITHOUT NUMA BUG FIX)

These tests give us a fairly good idea of core scaling.  Note that we are power-capped, so I also show what frequency the CPUs are running at.  The BIOS/HW reduces the CPU frequency to keep the system within the specified power envelope.

mmap()/bzero()/munmap() sequence in a loop x N threads.  Unlocked.  This test gives us a fairly good idea of basic system memory-zeroing overheads due to faults.  Memory bandwidth limitations are self-evident.
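A minimal sketch of such a fault-rate loop is shown below.  This is not the original test program; the mapping size and thread count are illustrative assumptions, and the fault rate itself is read from system statistics while it runs.

    /*
     * Anonymous memory fault-rate loop (sketch).  Each pass maps a fresh
     * anonymous region, touches every page with bzero() (forcing faults
     * and page zeroing), then unmaps it.  Run with 1..64 threads.
     */
    #include <pthread.h>
    #include <stdlib.h>
    #include <strings.h>
    #include <sys/mman.h>

    #define MAP_SIZE   (16UL * 1024 * 1024)   /* per-pass mapping size (assumed) */
    #define NTHREADS   8                      /* vary to match the table below */

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                           MAP_ANON | MAP_PRIVATE, -1, 0);
            if (p == MAP_FAILED)
                abort();
            bzero(p, MAP_SIZE);       /* every page faults in and gets zeroed */
            munmap(p, MAP_SIZE);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; ++i)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; ++i)
            pthread_join(tid[i], NULL);   /* workers loop forever; kill to stop */
        return 0;
    }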
    threads   faults/sec
     1         647K                        4.1 GHz
     2         1.2M                        4.1 GHz
     4         2.5M  ( 3.8x)               4.1 GHz
     8         4.1M  ( 6.3x)               4.0 GHz (power cap)
    16         6.9M  (10.6x)               3.9 GHz (power cap)
    32         8.4M  (12.98x)              3.4 GHz (power cap)
    64          11M  (1.3x vs 32 threads)  3.2 GHz (power cap)

    NOTE: If we correct for frequency, 12.98 -> 15.6 with 32 threads, about 50% of the best case, but CPU frequency is unlikely to be the limitation here.  It is virtually certain to be memory bandwidth or latency.

--

getuid() loop (uncached) - system calls/sec x N threads.  Unlocked.  This test gives us a fairly good idea of the unfettered core performance when doing a simple getuid() system call loop.

    threads   syscalls/sec
     1          14M                         4.1 GHz
     2          28M                         4.1 GHz
     4          56M  ( 4.0x vs 1)           4.1 GHz
     8         114M  ( 8.1x vs 1) (see note)  4.0 GHz (power cap)
    16         216M  (14.4x vs 1)           3.8 GHz (power cap)
    32         355M  (25.3x vs 1)           3.3 GHz (power cap)
    64         430M  (1.21x vs 32 threads)  3.0 GHz (power cap)

    NOTE: The reported 14M for 1 thread is probably 14-point-something; systat rounds it off, so the calculations have a certain amount of error.

    NOTE: If we correct for CPU frequency, 14.4 -> 15.53 (almost perfect) and 25.3 -> 31.4 (also basically perfect scaling).  For the hyperthreaded case, 64 threads scales to 41.97... almost 42 true cores worth, which is around 1.31x IPC added by hyperthreading.

From this I conclude that the Zen+ cores themselves are extremely capable, and that on the threadripper the main limitation is memory bandwidth due to having only 4 memory channels.

    Ryzen            2 channels
    Threadripper     4 channels
    EPYC             8 channels
    EPYC x 2        16 channels

    Core             2 channels
    Xeon             6 channels
    Xeon x 2        12 channels

						-Matt