HEXL v1.1.0 integration
Initial integration with Intel HEXL (https://github.com/intel/hexl).
Preferred over !284 (closed)
Closes #297 (closed)
On an ICX machine, with compiler g++-10, the build was configured with:
-- WITH_NTL: OFF
-- WITH_TCM: OFF
-- WITH_INTEL_HEXL: ON
-- WITH_OPENMP: OFF
-- NATIVE_SIZE: 64
-- CKKS_M_FACTOR: 1
-- WITH_NATIVEOPT: ON
I'm seeing:
WITH_INTEL_HEXL=OFF
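./bin/benchmark/lib-benchmark-hexl
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------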
NTTTransform1024 14.3 us 14.2 us 98352
INTTTransform1024 13.7 us 13.7 us 101888
NTTTransform4096 67.3 us 67.2 us 20827
INTTTransform4096 64.1 us 64.0 us 21876
NTTTransformInPlace1024 14.1 us 14.1 us 99464
INTTTransformInPlace1024 13.6 us 13.6 us 102788
NTTTransformInPlace4096 67.1 us 67.0 us 20884
INTTTransformInPlace4096 63.3 us 63.2 us 22150
BFVrns_KeyGen 1132 us 1131 us 1236
BFVrns_MultKeyGen 1820 us 1818 us 768
BFVrns_EvalAtIndexKeyGen 1860 us 1859 us 758
BFVrns_Encryption 1286 us 1284 us 1089
BFVrns_Decryption 305 us 304 us 4597
BFVrns_Add 21.0 us 21.0 us 66308
BFVrns_AddInPlace 17.8 us 17.8 us 78919
BFVrns_MultNoRelin 4388 us 4384 us 319
BFVrns_MultRelin 5101 us 5097 us 275
BFVrns_EvalAtIndex 555 us 554 us 2525
CKKS_KeyGen 2362 us 2361 us 592
CKKS_MultKeyGen 6009 us 6006 us 232
CKKS_EvalAtIndexKeyGen 6084 us 6080 us 230
CKKS_Encryption 2380 us 2362 us 592
CKKS_Decryption 788 us 783 us 1795
CKKS_Add 43.6 us 43.3 us 32455
CKKS_AddInPlace 35.4 us 35.2 us 38980
CKKS_MultNoRelin 251 us 250 us 5596
CKKS_MultRelin 4079 us 4061 us 345
CKKS_Relin 4267 us 4251 us 330
CKKS_Rescale 787 us 784 us 1922
CKKS_RescaleInPlace 722 us 720 us 1939
CKKS_EvalAtIndex 3575 us 3565 us 377
BGVrns_KeyGen 2364 us 2358 us 592
BGVrns_MultKeyGen 6071 us 6058 us 231
BGVrns_EvalAtIndexKeyGen 6203 us 6190 us 226
BGVrns_Encryption 2765 us 2760 us 507
BGVrns_Decryption 393 us 393 us 3595
BGVrns_Add 50.1 us 50.0 us 27946
BGVrns_AddInPlace 44.7 us 44.7 us 31614
BGVrns_MultNoRelin 243 us 243 us 5719
BGVrns_MultRelin 4168 us 4163 us 337
BGVrns_Relin 4339 us 4334 us 323
BGVrns_ModSwitch 738 us 737 us 1902
BGVrns_ModSwitchInPlace 732 us 731 us 1915
BGVrns_EvalAtIndex 3617 us 3614 us 388
WITH_INTEL_HEXL=ON
./bin/benchmark/lib-benchmark-hexl
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
NTTTransform1024 1.24 us 1.24 us 1099334
INTTTransform1024 1.35 us 1.35 us 1037487
NTTTransform4096 6.57 us 6.57 us 213796
INTTTransform4096 7.05 us 7.04 us 198808
NTTTransformInPlace1024 1.18 us 1.18 us 1185657
INTTTransformInPlace1024 1.25 us 1.25 us 1121451
NTTTransformInPlace4096 5.81 us 5.80 us 241158
INTTTransformInPlace4096 6.22 us 6.22 us 225011
BFVrns_KeyGen 759 us 758 us 1848
BFVrns_MultKeyGen 1303 us 1293 us 1079
BFVrns_EvalAtIndexKeyGen 1358 us 1349 us 1039
BFVrns_Encryption 816 us 811 us 1725
BFVrns_Decryption 118 us 117 us 10949
BFVrns_Add 21.0 us 20.9 us 66312
BFVrns_AddInPlace 19.2 us 19.1 us 73093
BFVrns_MultNoRelin 2339 us 2329 us 603
BFVrns_MultRelin 2465 us 2455 us 574
BFVrns_EvalAtIndex 223 us 223 us 6301
CKKS_KeyGen 1755 us 1749 us 800
CKKS_MultKeyGen 4223 us 4211 us 336
CKKS_EvalAtIndexKeyGen 4157 us 4146 us 327
CKKS_Encryption 1628 us 1624 us 830
CKKS_Decryption 658 us 657 us 2133
CKKS_Add 44.1 us 44.0 us 32038
CKKS_AddInPlace 35.2 us 35.2 us 39823
CKKS_MultNoRelin 76.5 us 76.4 us 19422
CKKS_MultRelin 2341 us 2337 us 599
CKKS_Relin 3188 us 3184 us 465
CKKS_Rescale 173 us 173 us 7901
CKKS_RescaleInPlace 170 us 170 us 8572
CKKS_EvalAtIndex 2009 us 2005 us 686
BGVrns_KeyGen 1655 us 1653 us 850
BGVrns_MultKeyGen 4083 us 4079 us 343
BGVrns_EvalAtIndexKeyGen 4218 us 4214 us 335
BGVrns_Encryption 1625 us 1623 us 859
BGVrns_Decryption 168 us 168 us 8173
BGVrns_Add 49.9 us 49.9 us 28724
BGVrns_AddInPlace 43.5 us 43.5 us 32781
BGVrns_MultNoRelin 49.5 us 49.4 us 27436
BGVrns_MultRelin 2343 us 2341 us 596
BGVrns_Relin 3140 us 3138 us 488
BGVrns_ModSwitch 200 us 200 us 7214
BGVrns_ModSwitchInPlace 188 us 188 us 7920
BGVrns_EvalAtIndex 2035 us 2033 us 697
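To make the two lib-benchmark-hexl runs above easier to compare, here is a minimal sketch that parses a few rows in the Google Benchmark console layout and computes the OFF/ON ratio of the Time column. The helper names (`parse_times`, `speedups`) are hypothetical and not part of this MR; the rows are copied verbatim from the output above, and only a handful are included for brevity.

```python
import re

# A few rows copied from the WITH_INTEL_HEXL=OFF run above.
HEXL_OFF = """\
NTTTransform1024 14.3 us 14.2 us 98352
BFVrns_MultRelin 5101 us 5097 us 275
CKKS_Rescale 787 us 784 us 1922
BGVrns_MultNoRelin 243 us 243 us 5719
BFVrns_Add 21.0 us 21.0 us 66308
"""

# The same rows from the WITH_INTEL_HEXL=ON run above.
HEXL_ON = """\
NTTTransform1024 1.24 us 1.24 us 1099334
BFVrns_MultRelin 2465 us 2455 us 574
CKKS_Rescale 173 us 173 us 7901
BGVrns_MultNoRelin 49.5 us 49.4 us 27436
BFVrns_Add 21.0 us 20.9 us 66312
"""

# name, wall time, CPU time, iterations (times assumed to be in microseconds)
ROW = re.compile(r"^(\S+)\s+([\d.]+)\s+us\s+([\d.]+)\s+us\s+(\d+)$")


def parse_times(text):
    """Map benchmark name -> wall time in microseconds; non-matching lines are skipped."""
    times = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            times[match.group(1)] = float(match.group(2))
    return times


def speedups(baseline, optimized):
    """Return baseline/optimized time ratios for benchmarks present in both runs."""
    return {name: baseline[name] / optimized[name]
            for name in baseline if name in optimized}


result = speedups(parse_times(HEXL_OFF), parse_times(HEXL_ON))
for name, ratio in result.items():
    print(f"{name:<22} {ratio:5.2f}x")
```

A ratio near 1 (as for BFVrns_Add) indicates the operation is not affected by the HEXL code path in these runs; larger ratios indicate a speedup with WITH_INTEL_HEXL=ON.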
To show that the modulus size changes in lib-benchmark-hexl are required, I also ran the same benchmarks with lib-benchmark. Observe that the NTT and BFVrns runtimes are slower than with lib-benchmark-hexl, while the CKKS and BGVrns runtimes are roughly unchanged.
WITH_INTEL_HEXL=ON
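./bin/benchmark/lib-benchmark
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------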
NTTTransform1024 2.85 us 2.84 us 384658
INTTTransform1024 3.17 us 3.17 us 441572
NTTTransform4096 14.2 us 14.1 us 99072
INTTTransform4096 15.3 us 15.2 us 91806
NTTTransformInPlace1024 2.85 us 2.85 us 491763
INTTTransformInPlace1024 3.11 us 3.11 us 450279
NTTTransformInPlace4096 13.4 us 13.4 us 103985
INTTTransformInPlace4096 14.5 us 14.5 us 96490
BFVrns_KeyGen 1885 us 1884 us 739
BFVrns_MultKeyGen 2978 us 2975 us 468
BFVrns_EvalAtIndexKeyGen 3253 us 3251 us 433
BFVrns_Encryption 2001 us 2000 us 699
BFVrns_Decryption 301 us 301 us 4644
BFVrns_Add 41.5 us 41.5 us 33735
BFVrns_AddInPlace 38.0 us 38.0 us 36843
BFVrns_MultNoRelin 5865 us 5861 us 239
BFVrns_MultRelin 6389 us 6385 us 221
BFVrns_EvalAtIndex 610 us 607 us 2306
CKKS_KeyGen 1753 us 1740 us 802
CKKS_MultKeyGen 4309 us 4281 us 323
CKKS_EvalAtIndexKeyGen 4202 us 4178 us 337
CKKS_Encryption 1625 us 1617 us 889
CKKS_Decryption 661 us 658 us 2128
CKKS_Add 41.8 us 41.6 us 33117
CKKS_AddInPlace 36.8 us 36.6 us 38243
CKKS_MultNoRelin 74.0 us 73.7 us 19129
CKKS_MultRelin 2474 us 2467 us 561
CKKS_Relin 2900 us 2892 us 478
CKKS_Rescale 179 us 178 us 7940
CKKS_RescaleInPlace 170 us 169 us 8329
CKKS_EvalAtIndex 1977 us 1973 us 716
BGVrns_KeyGen 1598 us 1595 us 878
BGVrns_MultKeyGen 4057 us 4049 us 344
BGVrns_EvalAtIndexKeyGen 4191 us 4184 us 334
BGVrns_Encryption 1607 us 1604 us 862
BGVrns_Decryption 167 us 167 us 8379
BGVrns_Add 50.9 us 50.8 us 27451
BGVrns_AddInPlace 44.8 us 44.7 us 31218
BGVrns_MultNoRelin 49.7 us 49.6 us 28248
BGVrns_MultRelin 2319 us 2316 us 598
BGVrns_Relin 2871 us 2868 us 491
BGVrns_ModSwitch 186 us 186 us 7462
BGVrns_ModSwitchInPlace 178 us 178 us 7862
BGVrns_EvalAtIndex 2016 us 2015 us 697
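As a rough check of the lib-benchmark versus lib-benchmark-hexl comparison (both with WITH_INTEL_HEXL=ON), the sketch below divides the lib-benchmark time by the lib-benchmark-hexl time for a few rows copied from the two runs. The 1.5x cutoff and the names `ROWS`, `slowdown` and `classify` are assumptions made for illustration only.

```python
# (lib-benchmark-hexl time, lib-benchmark time) in microseconds,
# both measured with WITH_INTEL_HEXL=ON; values copied from the runs above.
ROWS = {
    "NTTTransform1024": (1.24, 2.85),
    "NTTTransform4096": (6.57, 14.2),
    "BFVrns_Add": (21.0, 41.5),
    "BFVrns_MultRelin": (2465, 6389),
    "CKKS_MultRelin": (2341, 2474),
    "BGVrns_MultRelin": (2343, 2319),
}

THRESHOLD = 1.5  # arbitrary cutoff, chosen only for this illustration


def slowdown(hexl_bench_us, default_bench_us):
    """How many times slower lib-benchmark is than lib-benchmark-hexl."""
    return default_bench_us / hexl_bench_us


def classify(rows, threshold=THRESHOLD):
    """Label each benchmark 'slower' or 'similar' relative to the cutoff."""
    return {name: "slower" if slowdown(a, b) > threshold else "similar"
            for name, (a, b) in rows.items()}


labels = classify(ROWS)
for name, (a, b) in ROWS.items():
    print(f"{name:<18} {slowdown(a, b):4.2f}x  {labels[name]}")
```

With these rows, the NTT and BFVrns entries land above the cutoff while the CKKS and BGVrns entries stay close to 1x.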
For the polynomial benchmarks, I see:
WITH_INTEL_HEXL=OFF
poly-benchmark-16k
-------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------
Native_add 22.3 us 22.3 us 31366
DCRT_add/towers:1 22.6 us 22.6 us 30802
DCRT_add/towers:2 47.4 us 47.3 us 14811
DCRT_add/towers:4 103 us 102 us 6835
DCRT_add/towers:8 205 us 205 us 3418
Native_mul 57.8 us 57.7 us 12128
DCRT_mul/towers:1 57.9 us 57.8 us 12093
DCRT_mul/towers:2 118 us 118 us 5915
DCRT_mul/towers:4 245 us 245 us 2858
DCRT_mul/towers:8 490 us 489 us 1431
Native_ntt 323 us 322 us 2167
DCRT_ntt/towers:1 323 us 322 us 2172
DCRT_ntt/towers:2 646 us 645 us 1083
DCRT_ntt/towers:4 1293 us 1292 us 539
DCRT_ntt/towers:8 2587 us 2585 us 269
Native_intt 299 us 299 us 2336
DCRT_intt/towers:1 299 us 299 us 2342
DCRT_intt/towers:2 599 us 599 us 1168
DCRT_intt/towers:4 1199 us 1198 us 583
DCRT_intt/towers:8 2398 us 2397 us 291
WITH_INTEL_HEXL=ON
poly-hexl-benchmark-16k
-------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------
Native_add 21.9 us 21.9 us 31894
DCRT_add/towers:1 22.2 us 22.2 us 31308
DCRT_add/towers:2 46.2 us 46.2 us 15158
DCRT_add/towers:4 99.3 us 99.2 us 7057
DCRT_add/towers:8 199 us 198 us 3528
Native_mul 9.16 us 9.16 us 76442
DCRT_mul/towers:1 9.36 us 9.35 us 74907
DCRT_mul/towers:2 21.0 us 21.0 us 33477
DCRT_mul/towers:4 52.4 us 52.4 us 13350
DCRT_mul/towers:8 117 us 117 us 5977
Native_ntt 42.3 us 42.2 us 16566
DCRT_ntt/towers:1 42.4 us 42.2 us 16567
DCRT_ntt/towers:2 87.1 us 86.4 us 8097
DCRT_ntt/towers:4 183 us 182 us 3850
DCRT_ntt/towers:8 379 us 376 us 1857
Native_intt 41.3 us 41.1 us 15828
DCRT_intt/towers:1 41.3 us 41.1 us 17043
DCRT_intt/towers:2 86.1 us 85.6 us 8175
DCRT_intt/towers:4 185 us 184 us 3793
DCRT_intt/towers:8 385 us 383 us 1826
WITH_INTEL_HEXL=OFF
poly-benchmark-4k
-------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------
Native_add 5.22 us 5.21 us 133395
DCRT_add/towers:1 5.23 us 5.22 us 134075
DCRT_add/towers:2 10.6 us 10.6 us 66104
DCRT_add/towers:4 22.8 us 22.8 us 30762
DCRT_add/towers:8 48.6 us 48.5 us 14427
Native_mul 14.1 us 14.1 us 49776
DCRT_mul/towers:1 14.0 us 14.0 us 49937
DCRT_mul/towers:2 28.3 us 28.2 us 24805
DCRT_mul/towers:4 58.0 us 58.0 us 12079
DCRT_mul/towers:8 119 us 119 us 5885
Native_ntt 69.5 us 69.5 us 10071
DCRT_ntt/towers:1 69.5 us 69.5 us 10077
DCRT_ntt/towers:2 139 us 139 us 5036
DCRT_ntt/towers:4 278 us 278 us 2521
DCRT_ntt/towers:8 556 us 555 us 1259
Native_intt 65.7 us 65.7 us 10660
DCRT_intt/towers:1 65.7 us 65.6 us 10668
DCRT_intt/towers:2 131 us 131 us 5332
DCRT_intt/towers:4 263 us 263 us 2665
DCRT_intt/towers:8 526 us 526 us 1331
WITH_INTEL_HEXL=ON
poly-hexl-benchmark-4k
-------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------
Native_add 5.22 us 5.19 us 134880
DCRT_add/towers:1 5.30 us 5.27 us 132848
DCRT_add/towers:2 10.7 us 10.7 us 65424
DCRT_add/towers:4 22.7 us 22.6 us 30917
DCRT_add/towers:8 47.7 us 47.5 us 14722
Native_mul 1.91 us 1.91 us 366919
DCRT_mul/towers:1 2.06 us 2.05 us 341413
DCRT_mul/towers:2 4.29 us 4.28 us 163604
DCRT_mul/towers:4 10.0 us 10.00 us 70020
DCRT_mul/towers:8 22.8 us 22.7 us 30872
Native_ntt 8.35 us 8.32 us 84105
DCRT_ntt/towers:1 8.37 us 8.34 us 84062
DCRT_ntt/towers:2 17.0 us 16.9 us 41334
DCRT_ntt/towers:4 33.9 us 33.8 us 20678
DCRT_ntt/towers:8 67.8 us 67.6 us 10362
Native_intt 8.81 us 8.79 us 79758
DCRT_intt/towers:1 8.83 us 8.80 us 79497
DCRT_intt/towers:2 17.8 us 17.7 us 39483
DCRT_intt/towers:4 37.2 us 37.2 us 18843
DCRT_intt/towers:8 71.0 us 70.9 us 9910
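The polynomial results can be summarized per operation. The following sketch takes the Time column of the DCRT towers:8 rows from the four tables above (OFF value first, ON value second) and prints the OFF/ON ratio for each ring size. The name `speedup_table` and the data layout are hypothetical choices for this illustration.

```python
# Time column (us) of the DCRT_*/towers:8 rows, copied from the tables above.
# Each entry is (WITH_INTEL_HEXL=OFF, WITH_INTEL_HEXL=ON).
TOWERS8 = {
    "16k": {"add": (205, 199), "mul": (490, 117),
            "ntt": (2587, 379), "intt": (2398, 385)},
    "4k": {"add": (48.6, 47.7), "mul": (119, 22.8),
           "ntt": (556, 67.8), "intt": (526, 71.0)},
}


def speedup_table(rows):
    """Return {ring size: {operation: OFF time / ON time}}."""
    return {size: {op: off / on for op, (off, on) in ops.items()}
            for size, ops in rows.items()}


table = speedup_table(TOWERS8)
for size, ops in table.items():
    summary = ", ".join(f"{op} {ratio:.1f}x" for op, ratio in ops.items())
    print(f"{size}: {summary}")
```

In these numbers, addition is essentially unchanged, while multiplication, NTT and inverse NTT are several times faster with WITH_INTEL_HEXL=ON.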
Edited by Fabian Boemer