...

Commits (2)
 % IPDPS requirement !Abstract (Maximum 250 words)! \begin{abstract} Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this~view. In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNL and KNM architecturally deviate at one important point: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic. Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision units with little-to-no performance implications. With the failing of Moore's law, our results partially reinforce the direction taken by modern industry (e.g., the upcoming Fujitsu ARM64FX) to integrate hybrid-precision hardware units. %\cJD{PLACEHOLDER: (Maximum 250 words)!} %Common perception in supercomputing is that double precision floating point ... ...
 ... ... @@ -21,7 +21,7 @@ With the ending of Dennard's scaling~\cite{dennard_design_1974} and the ending o there is today an ever-increasing need to oversee how we allocate the silicon to various functional units in modern many-core processors. Among those decisions is how we distribute the hardware support for various levels of compute precision. Historically, most of the compute silicon has been allocated to double-precision (DP; 64-bit) compute. Nowadays -- in processors such as the forthcoming A64FX~\cite{yoshida_fujitsu_2018} and NVIDIA Volta~\cite{choquette_volta:_2018} -- the trend, mostly driven by market/AI demands, is to replace some of the double-precision units with lower-precision units. ... ... @@ -41,16 +41,16 @@ a decade (Knights Ferry was announced in 2010), and has changed drastically sinc The latest (and also last) two revisions -- Knights Landing and Knights Mill -- are of particular importance since they arguably reflect two different schools of thought. Knights Landing has relatively large support for double-precision (64-bit) computations and follows the more traditional school of thought. Knights Mill, in contrast, follows a different direction: the replacement of double-precision compute units with lower-precision (single-precision, half-precision, and integer) compute capabilities. In the present paper, we quantify and analyze the performance and compute bottlenecks of Intel's Knights Landing~\cite{sodani_knights_2016} and Knights Mill~\cite{bradford_knights_2017} architectures -- two processors with identical micro-architecture whose main difference lies in the relative allocation of double-precision units.
We stress both processors with numerous realistic benchmarks from both the Exascale Computing Project (ECP) proxy applications~\cite{noauthor_ecp_2018} and the RIKEN R-CCS Fiber Miniapp Suite~\cite{riken_aics_fiber_2015} -- benchmarks used in HPC system acquisition. Through an extensive (and robust) performance measurement process (which we also open-source), we empirically show the architectures' relative weaknesses. In short, the contributions of the present paper are: \begin{enumerate} ... ...
 ... ... @@ -22,8 +22,8 @@ In this section, we present our rigorous benchmarking approach into investigatin % \struc{assume BMs are well tuned} Because the benchmarks listed in Section~\ref{ssec:bm} are, firstly, realistic proxies of the original applications~\cite{aaziz_methodology_2018} and, secondly, used in the procurement process, we can assume that these benchmarks are well tuned and come with appropriate compiler options for a variety of compilers -- \added{a hypothesis we will test in Section~\ref{ssec:eval_roof}}. Hence, we refrain from both manual code optimization and alterations of the compiler options. % %\struc{how we compiled} ... ... @@ -55,17 +55,17 @@ aim is~\unit[1]{sec}--\unit[10]{min} due to the large sample size we have to cov realistic amount of main memory (e.g., avoid cache-only executions)? Are the results repeatable (randomness/seeds)? We optimize for the metrics reported by the benchmark (e.g., select the input with the highest~\unit[]{Gflop/s} rate). % % \struc{explain parameter-sweep for num\_mpi and num\_omp, why, reason, how on all systems (and examples maybe)} % Furthermore, one of the most important considerations while selecting the right inputs is \textit{strong-scaling}. We require strong-scaling properties of the benchmark for two reasons: the results collected in Step~(2) need to be comparable, and even more importantly, the results of Step~(3) must be comparable between different architectures, since we may have to use different numbers of MPI processes for KNL and KNM (and our BDW reference architecture) due to their difference in core counts.
The only exception is MiniAMR, for which we are unable to find a strong-scaling input configuration and instead optimized for the reported~\unit[]{Gflop/s} of the benchmark. Accordingly, we then choose the same number of MPI processes on our KNL and KNM compute nodes for MiniAMR. In Step~(2), we evaluate numerous combinations of MPI processes and OpenMP threads %\cJD{any other threadin models?} -> candle uses omp-ish too, see ... ... @@ -73,11 +73,12 @@ In Step (2), we evaluate numerous combinations of MPI processes and OpenMP threa for each benchmark, including combinations which over-/undersubscribe the CPU cores, and test each combination with three runs to minimize the potential for outliers due to system noise. For all subsequent measurements, we select the number of processes and threads based on the ``best'' (w.r.t.\ time-to-solution of the solver) combination among these tested versions, see Table~\ref{table:rest} at the end of this paper for details. %\cJD{no specific intel' mpi tuning (except hpgc, babel) because initial test consistently resulted %in worse time to solution when non-default options where used} We are not applying specific tuning options to Intel's MPI library, except for using Intel's recommended settings for HPCG with respect to thread affinity and MPI\_Allreduce.
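The selection logic of Step~(2) -- sweep MPI$\times$OpenMP combinations, run each combination three times, and keep the one with the best solver time-to-solution -- can be sketched as follows. This is our own illustration, not the paper's actual scripts; the function name and all timing values are hypothetical:

```python
from itertools import product

def pick_best_configuration(timings):
    """timings maps (n_mpi, n_omp) -> list of solver time-to-solution
    samples (three runs per combination in this setup); returns the
    combination whose fastest run is the fastest overall."""
    return min(timings, key=lambda combo: min(timings[combo]))

# Hypothetical sweep on a 64-core node, including over-/undersubscription.
cores = 64
combos = list(product([1, 2, 4, 8, 16, 32, 64], repeat=2))
oversubscribed = [(m, t) for m, t in combos if m * t > cores]

# Illustrative (made-up) measurements, in seconds, for three combinations:
timings = {
    (16, 4): [12.1, 11.9, 12.3],
    (32, 2): [10.4, 10.6, 10.5],
    (64, 1): [11.0, 10.9, 11.2],
}
best = pick_best_configuration(timings)  # -> (32, 2)
```

Taking the minimum over the three repetitions (rather than the mean) matches the stated goal of discarding outliers caused by system noise.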
The reason is that our pretests (with a subset of the benchmarks) with non-default parameters for Intel MPI consistently resulted in longer time-to-solution. ... ... @@ -87,8 +88,8 @@ For Step (3), we run each benchmark ten times to identify the fastest time-to-so (compute) kernel of the benchmark. Additionally, for the profiling runs, we execute the benchmark once for each of the profiling tools and/or metrics (in case the tool is used for multiple metrics), see Section~\ref{ssec:metrics} for details. Finally, we perform frequency scaling experiments for each benchmark, where we throttle the CPU frequency to all of the available lower CPU states below the maximum CPU frequency, which we use for the performance runs, and record the lowest kernel time-to-solution among ten trials per frequency. The reason for and results of the frequency scaling test will be further explained in Section~\ref{ssec:eval_freq}. One may argue for more than ten runs per benchmark to find the optimal time-to-solution, however, ... ... @@ -169,7 +170,7 @@ all presented data will be based exclusively on the kernel portion of each bench START\_ASSAY\; } \caption{Injecting analysis instructions} \vspace{-.5em} \end{algorithm} %\struc{what does each tool have in terms of capabilities, how is it applied to the benchmarks, ... ...
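One way to read such frequency-scaling data -- a sketch of our own, not necessarily the exact analysis of Section~\ref{ssec:eval_freq} -- is that a kernel whose runtime grows proportionally to 1/frequency is likely compute-bound, while a runtime that is insensitive to the core clock points to memory-boundedness. A minimal illustration with made-up timings:

```python
def frequency_sensitivity(times_by_freq):
    """times_by_freq: {frequency_GHz: best_kernel_time_to_solution_s}.
    Returns the relative slowdown per relative clock reduction between
    the highest and lowest tested frequency.  A value near 1 suggests
    compute-bound behavior (time ~ 1/f); a value near 0 suggests the
    kernel is bound by something other than the core clock (memory)."""
    fmax, fmin = max(times_by_freq), min(times_by_freq)
    slowdown = times_by_freq[fmin] / times_by_freq[fmax] - 1.0
    clock_loss = fmax / fmin - 1.0
    return slowdown / clock_loss

# Hypothetical data: throttling from 1.4 GHz down to 1.0 GHz.
compute_bound = {1.4: 100.0, 1.2: 116.7, 1.0: 140.0}  # time ~ 1/f
memory_bound  = {1.4: 100.0, 1.2: 101.0, 1.0: 103.0}  # nearly flat
print(frequency_sensitivity(compute_bound))  # ~1.0
print(frequency_sensitivity(memory_bound))   # ~0.075
```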
@@ -178,8 +179,8 @@ all presented data will be based exclusively on the kernel portion of each bench For tool stability reasons, attention to detail/accuracy, and overlap with our needs, we settle on the use of the MPI API for runtime measurements, alongside Intel's Processor Counter Monitor (PCM)~\cite{willhalm_intel_2017}, Intel's Software Development Emulator (SDE)~\cite{raman_calculating_2015}, and Intel's VTune Amplifier~\cite{sobhee_intel_2018}\footnote{~To avoid persistent compute node crashes (likely due to incompatibilities\\$~~~\,\quad$with the Spectre/Meltdown patches), we had to disable VTune's built-in \\$~~~\,\quad$sampling driver and instead rely on Linux' \texttt{perf} tool.}. Furthermore, as auxiliary tools we rely on RRZE's Likwid~\cite{treibig_likwid:_2010} for frequency scaling\footnote{~Our Linux kernel version required us to disable the default Intel P-State\\$~~~\,\quad$driver to have full access to the fine-grained frequency scaling.} and ... ... @@ -189,7 +190,7 @@ Section~\ref{ssec:bm}, is shown in Table~\ref{tb:Mtools}. Furthermore, derived m as~\unit[]{Gflop/s}, will be explained on-demand in Section~\ref{sec:eval}. % \begin{table}[tp] %\vspace{-0.5em} \centering\scriptsize \caption{\label{tb:Mtools}Summary of metrics and method/tool to collect these metrics} \begin{tabular}{|l|l|} ... ... @@ -204,7 +205,7 @@ as~\unit[]{Gflop/s}, will be explained on-demand in Section~\ref{sec:eval}. SIMD instructions per cycle & perf + VTune (`hpc-performance') \\\hline \rC Memory/Back-end boundedness & perf + VTune (`memory-access') \\\hline \end{tabular} \vspace{-.5em} \end{table} % ... ...
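The derived~\unit[]{Gflop/s} metric referred to above is simply an operation count (e.g., FP64 Giga-operations as counted by SDE) divided by the kernel time-to-solution. A trivial sketch with hypothetical numbers (not taken from the paper's measurements):

```python
def gflops(giga_ops, t2sol_seconds):
    """Derived sustained Gflop/s: executed Giga-operations (e.g., the
    FP64 count reported by an instruction-counting tool such as SDE)
    divided by the kernel time-to-solution in seconds."""
    return giga_ops / t2sol_seconds

# Hypothetical example: 842 Giga FP64 operations in a 60 s kernel run.
print(gflops(842.0, 60.0))  # ~14.03 Gflop/s
```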
 \begin{table*}[tbp] \caption{\label{table:rest} Application configuration and measured metrics; Missing data for CANDLE due to SDE crashes on Phi; Measurements indicate CANDLE/MKL-DNN ignores OpenMP settings and tries to utilize the full chip $\rightarrow$ listed in italic; Label explanation: t2sol = time-to-solution (kernel), Gop (D $|$ S $|$ I) = Giga operations (FP64 $|$ FP32 $|$ Integer), SIMDi/cyc = SIMD instructions per cycle, FPAIp[R $|$ W] = FP arithmetic instructions per memory [read $|$ write], [B $|$ M]Bd = [Back-end $|$ Memory] Bound (see~\cite{sobhee_intel_2018} for details), L2h = L2 cache hit rate, LLh = Last level cache hit rate (L3 for BDW, MCDRAM for KNL/KNM), Gbra/s = Giga branches/s;\qquad\added{Note: SIMDi/cyc and FPAIp* as well as BBd and MBd occupy the same columns due to their similarity and space constraints}} \centering\scriptsize \begin{tabular}{|l|r|r|r|r|r|r|r|c|r|r|r|r|} \hline \hC ... ...
 ... ... @@ -9,9 +9,10 @@ the three architectures, this section summarizes the relevant points to consider from our study, which should be taken into account when moving forward. \subsection{Performance Metrics} The de facto performance metric reported in HPC is \unit[]{flop/s}. However, reporting \unit[]{flop/s} is not limited to applications that are compute-bound. Benchmarks that are designed to resemble realistic workloads, e.g., the memory-bound HPCG benchmark, typically report performance in \unit[]{flop/s}. The proxy-/mini-apps in this study likewise typically report \unit[]{flop/s}, despite the fact that only six out of the 20 proxy-/mini-apps we analyze appear to be compute-bound (including NGSA, which is bound by ALUs, not FPUs). We argue that convening on reporting relevant metrics would shift the focus of the community to be less \unit[]{flop/s}-centered. %It is important to mention that reporting only time-to-solution and scalability, without reporting performance, is a common pitfall that distorts the interpretation of results in HPC~\cite{hoefler_scientific_2015}. ... ... @@ -35,7 +36,7 @@ scientific domain for different supercomputing facilities (based on yearly reports of mentioned facilities).
For instance, by simply mapping the scientific domains in Figure~\ref{fid:disc:breakdown} to representative proxies, ANL's ALCF and \mbox{R-CCS's} K-computer would be achieving $\approx$14\% and $\approx$11\% of the peak \unit[]{flop/s}, respectively, when projecting for the annual node-hours. %oversimplification? It is worth mentioning that the relevance of \unit[]{flop/s} is even more of an issue for supercomputers dedicated to specific workloads: the relevance of ... ... @@ -44,10 +45,10 @@ mainly to weather forecasting, e.g., the~\unit[18]{Pflop/s} system recently installed at Japan's Meteorological Agency~\cite{japan_meteorological_agency_jma_jma_2018}, should give minimal relevance to \unit[]{flop/s} since the proxy representing this workload on that supercomputer achieves $\approx$6\% of the peak \unit[]{flop/s}, because those workloads are typically memory-bound. On the other hand, a supercomputer dedicated to AI/ML such as ABCI, the world's 5\textsuperscript{th} fastest supercomputer as of June 2018, would put high emphasis on \unit[]{flop/s} because current deep learning workloads rely heavily on dense matrix multiplications. \subsection{Memory-bound Applications} ... ... @@ -55,12 +56,13 @@ As demonstrated in Figure~\ref{fig:flops}, the performance of memory-bound applications is mostly not affected by the peak \unit[]{flop/s} available. Accordingly, investment in data-centric architectures and programming models should take priority over paying a premium for \unit[]{flop/s}-centric systems.
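The projection arithmetic behind such facility-wide numbers can be sketched as a node-hour-weighted average of per-proxy efficiencies. All numbers below are hypothetical illustrations, not the actual ALCF or K-computer data:

```python
def projected_peak_fraction(workload_mix):
    """workload_mix: list of (share_of_annual_node_hours, efficiency)
    pairs, where efficiency is the representative proxy's achieved
    fraction of peak flop/s.  Returns the node-hour-weighted fraction
    of peak flop/s for the whole facility."""
    total_share = sum(share for share, _ in workload_mix)
    return sum(share * eff for share, eff in workload_mix) / total_share

# Entirely hypothetical mix: 50% memory-bound workloads (6% of peak),
# 30% moderately compute-bound (25%), 20% compute-bound (40%).
mix = [(0.5, 0.06), (0.3, 0.25), (0.2, 0.40)]
print(projected_peak_fraction(mix))  # -> 0.185, i.e. ~18.5% of peak
```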
In one motivating instance, an investigation conducted by the NASA Ames Research Center for a planned upgrade of the Pleiades supercomputer in 2016~\cite{saini_performance_2016} concluded that the performance gain of their applications from upgrading to Intel Haswell processors was insignificant in comparison to using the older Ivy Bridge-based processors (the newer processor offered double the peak \unit[]{flop/s} at almost the same memory bandwidth). Hence, the choice was to only do a partial upgrade to Haswell processors. \subsection{Compute-bound Applications} Investing more in data-centric architectures to accommodate memory-bound ... ...
 \section{Conclusion}\label{sec:conclusion} % goal: 1/4 page % %\struc{what did we learn which can be beneficial for others in the HPC community} %\struc{what is our recommendation for vendors and centers buying new systems} %\struc{show our github w/ link so that others can perform similar stuff and %check, study, validate our results, also link to our TR or extended version %with appendix of less interesting results if we have any, etc.} We compared two architecturally similar processors that have different double-precision silicon budgets. By studying a large number of HPC proxy applications, we found no significant ... ...
 \section*{Acknowledgment \& Author Contributions} \added{ This work was supported by MEXT, JST special appointed survey 30593/2018 as well as JST-CREST under Grant Number JPMJCR1303, and the AIST/TokyoTech Real-world Big-Data Computation Open Innovation Laboratory, Japan. Furthermore, we would like to thank Intel for their technical support. The authors K.M., J.D., H.Z., K.Y., T.T. and Y.T. performed the required experiments and data collection. J.D., M.W., and A.P. designed the study, analyzed the data, and supervised its execution together with S.M., while all authors contributed to writing and editing. } \ No newline at end of file
 \appendices
%
%\section{Reproducibility}\label{apx:reprod}
%to infinity
%
%\struc{code and logs in git, explain how to pull, install, compile, config, run}
%\struc{explain about proprietary code and packages we dont ship with the repo}
%\struc{explain how we analyze the codes, or tools/scripts we used}
%\struc{detailes about software version if necessary}
%\struc{have we patched any bugs? in the codes?}
%
%\section{Detailed Input/Parameters for Benchmarks}\label{apx:inputs}
%and beyond
%
%\struc{title says it all}
%
\section{Additionally Evaluated Metrics}\label{apx:metrics}
%woooshhhh
%
%\struc{here comes everything text/figs/etc we left out of the main eval section}
%
\input{41-rest-table} \ No newline at end of file
 ... ... @@ -15,7 +15,7 @@ $(PAPER).pdf:$(TEX) $(BIB)$(FIG) cleanall rm $(PAPER).dvi clean: rm -f *.ilg *.aux *.log *.dvi *.idx *.toc *.lof *.lot *.soc $(PAPER).out rm -f *.blg *.bbl *~ cleanall: clean ... ...
set terminal svg size 1600,600 dynamic enhanced fname 'Times' fsize 28 butt dashlength 1.0
set output "../figures/flops-relA.svg"
set grid
set auto x
set auto y
set xrange [-0.5:23.5]
set yrange [0:4]
set ytics 0,1,4
set xtic font ",24" rotate by -45 scale 0 left
set key left top vertical Right maxrows 3 box width +2
set datafile missing '-'
bdw = "#A61A00"
knl = "#00B358"
knm = "#1924B1"
set ylabel "Rel. Perf. (Gflop/s) Improvement over BDW"
# max DP flops lines
bdwmax(x)=( -1 < x && x < 24 ) ? 1.0 : 1/0
plot \
    "../data/flops.data" u ($4/$2):xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL_{rel}', \
    "" u ($6/$2):xtic(1) pt 9 ps 0.8 lc rgb knm title 'KNM_{rel}', \
    bdwmax(x) with lines lt 0 lw 2 lc rgb bdw title 'BDW_{rel}'
set terminal svg size 1600,600 dynamic enhanced fname 'Times' fsize 28 butt dashlength 1.0
set output "../figures/flops-relB.svg"
set grid
set auto x
set auto y
set xrange [-0.5:23.5]
set yrange [0:100]
set ytics 0,20,100
set xtic font ",24" rotate by -45 scale 0 left
set key left top vertical Right maxrows 3 box width +2
set datafile missing '-'
bdw = "#A61A00"
knl = "#00B358"
knm = "#1924B1"
set ylabel "Abs. achieved Gflop/s out of Peak [in %]"
# max DP flops lines
bdwmax(x)=( -1 < x && x < 24 ) ? 1.0 : 1/0
plot \
    "../data/flops.data" u (100.0*$4/$9):xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL_{abs}', \
    "" u (100.0*$6/$10) pt 9 ps 0.8 lc rgb knm title 'KNM_{abs}', \
    "" u (100.0*$2/$8) pt 3 ps 0.8 lc rgb bdw title 'BDW_{abs}'
 ... ... @@ -5,10 +5,10 @@
set grid
set auto x
set auto y
set xrange [0:27]
set yrange [0:500]
set xtic font ",24" rotate by -45 scale 0 left
set key opaque left top vertical Right box width +2 #reverse noenhanced autotitle columnhead box
set datafile missing '-'
... ... @@ -23,10 +23,10 @@
knlmax(x)=( -0.5 < x && x < 27.5 ) ? 439 : 1/0
knmmax(x)=( -0.5 < x && x < 27.5 ) ? 430 : 1/0
plot \
    "../data/bytes-n-flops.data" u 6:xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL', \
    knlmax(x) with lines lt 0 lw 2 lc rgb knl notitle, \
    "" u 10:xtic(1) pt 9 ps 0.8 lc rgb knm title 'KNM', \
    knmmax(x) with lines lt 0 lw 2 lc rgb knm notitle, \
    "" u 2:xtic(1) pt 3 ps 0.8 lc rgb bdw title 'BDW'
set terminal svg size 1600,1200 dynamic enhanced fname 'Times' fsize 32 butt dashlength 1.0
set output "../figures/roofline-bdw.svg"
# gflops
knl_fpeak = 2662.0
knm_fpeak = 1728.0
bdw_fpeak = 691.0
# gb/s
knl_mpeak = 439.0
knm_mpeak = 430.0
bdw_mpeak = 122.0
xmin = 0.001
xmax = 100
ymin = 0.1
ymax = 2000
set xtics nomirror
set xrange [xmin:xmax]
set logscale x 10
set yrange [ymin:ymax]
set logscale y 10
# Functions
mem(x,y) = exp( log( y ) - log( x ))
min(a,b) = (a < b) ? a : b
max(a,b) = (a > b) ? a : b
knl_froof(x) = knl_fpeak
knl_mroof(x) = mem(knl_fpeak / knl_mpeak, knl_fpeak) * x
knl_rflne(x) = min(knl_froof(x), knl_mroof(x))
knm_froof(x) = knm_fpeak
knm_mroof(x) = mem(knm_fpeak / knm_mpeak, knm_fpeak) * x
knm_rflne(x) = min(knm_froof(x), knm_mroof(x))
bdw_froof(x) = bdw_fpeak
bdw_mroof(x) = mem(bdw_fpeak / bdw_mpeak, bdw_fpeak) * x
bdw_rflne(x) = min(bdw_froof(x), bdw_mroof(x))
set grid
#set key left top vertical Right box width +2
unset key
set xlabel "Arithmetic Intensity (flop/byte)"
set ylabel "Gflop/s"
bdw = "#A61A00"
knl = "#00B358"
knm = "#1924B1"
set label 1 "Theor. Peak Performance (FP64)" at xmax-10, 1.25*bdw_froof(xmax) right
set label 2 "Stream Triad Bandwidth (GB/s)" at 1.25*xmin, 1.6*bdw_mroof(xmin) left rotate by 42
plot bdw_rflne(x) lt 1 lc rgb "black" lw 4 notitle, \
    "../data/bytes-n-flops.data" u ($3/$5)/($2):($3/$5) pt 28 ps 0.6 lc rgb bdw title 'BDW', \
    "" u ($3/$5)/($2):($3/$5):($1) with labels offset 0,-1 font "Times,22" point pt 28 ps 0.6 lc rgb bdw notitle
#plot knl_rflne(x) lt 1 lc rgb knl lw 4 notitle, \
#    knm_rflne(x) lt 1 lc rgb knm lw 4 notitle, \
#    bdw_rflne(x) lt 1 lc rgb bdw lw 4 notitle, \
#    "../data/bytes-n-flops.data" u ($7/$9)/($6):($7/$9) pt 20 ps 0.6 lc rgb knl title 'KNL', \
#    "" u ($11/$13)/($10):($11/$13) pt 9 ps 0.6 lc rgb knm title 'KNM', \
#    "" u ($11/$13)/($10):($11/$13):($1) with labels offset -2.5,-.3 font "Times,24" point pt 9 ps 0.6 lc rgb knm notitle, \
#    "" u ($3/$5)/($2):($3/$5) pt 28 ps 0.6 lc rgb bdw title 'BDW', \
#    "" u ($3/$5)/($2):($3/$5):($1) with labels offset 0,-1 font "Times,24" point pt 28 ps 0.6 lc rgb bdw notitle
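The constants in the roofline script above can be cross-checked: theoretical FP64 peak is cores x clock x SIMD DP lanes x 2 (FMA) x VPUs per core, and the roofline caps attainable performance at min(compute peak, arithmetic intensity x bandwidth). A sketch follows; the 64-core/1.3 GHz KNL and 72-core/1.5 GHz KNM configurations are inferred from the script's constants, not stated in this excerpt:

```python
def peak_dp_gflops(cores, ghz, simd_dp_lanes, fma_factor, vpus):
    """Theoretical FP64 peak in Gflop/s:
    cores x clock (GHz) x SIMD DP lanes x FMA factor x VPUs per core."""
    return cores * ghz * simd_dp_lanes * fma_factor * vpus

def attainable_gflops(ai, fpeak_gflops, mpeak_gbs):
    """Classic roofline: attainable Gflop/s at arithmetic intensity
    `ai` (flop/byte) is capped by the compute roof and the memory roof."""
    return min(fpeak_gflops, ai * mpeak_gbs)

# Assumed KNL: 64 cores x 1.3 GHz x 8 DP lanes (AVX-512) x 2 (FMA) x 2 VPUs
knl_fpeak = peak_dp_gflops(64, 1.3, 8, 2, 2)   # 2662.4 ~ script's 2662.0
# Assumed KNM: 72 cores x 1.5 GHz x 8 DP lanes x 2 (FMA) x 1 DP-capable VPU
knm_fpeak = peak_dp_gflops(72, 1.5, 8, 2, 1)   # 1728.0, matches the script

# A kernel with AI = 0.1 flop/byte on KNL (439 GB/s triad bandwidth)
# sits far below the compute roof, i.e., it is memory-bound:
print(attainable_gflops(0.1, knl_fpeak, 439.0))  # ~43.9 Gflop/s
```

The factor-of-two gap between the two compute roofs (one DP VPU per core instead of two) is exactly the silicon-budget difference the paper exploits.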
 ... ... @@ -5,8 +5,8 @@
set grid
set auto x
set auto y
set xrange [-0.5:23.5]
set yrange [0:3]
set ytics 0,1,3
set xtic font ",24" rotate by -45 scale 0 left
set key left top vertical Right box width +2
... ... @@ -19,9 +19,10 @@
knm = "#1924B1"
set ylabel "Speedup (w.r.t Time-to-Solution)"
bdwmax(x)=( -1 < x && x < 24 ) ? 1.0 : 1/0
plot \
    "../data/t2solv.data" u ($2/$4):xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL', \
    "" u ($2/$6):xtic(1) pt 9 ps 0.8 lc rgb knm title 'KNM', \
    bdwmax(x) with lines lt 0 lw 2 lc rgb bdw title 'BDW'
 ... ... @@ -316,7 +316,35 @@ \begin{document} %%%% DONT TOUCH FOR DRAFT \iftoggle{highlightChanges}{ \begin{titlepage} \mbox{}\\{\Large \textbf{Cover Letter for Submission:}\\\\Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?} \newline\newline\newline\newline\large Summary of the implemented changes (also highlighted in blue on subsequent pages): \begin{itemize} \item Added discussion of why we report flop/s and why it is necessary in this paper despite our later recommendation that the HPC community should not (only) report flop/s; \item Split of Fig.~\ref{fig:flops} (rel./abs. flop/s comparison) into two subfigures for easier readability and modification of text in Sec.~\ref{ssec:eval_flops} to reflect the change; \item Added Fig.~\ref{fig:t2s-rel} for ``Time-to-Solution'' and its explanation/discussion in Sec.~\ref{ssec:eval_flops}; \item Added roofline analysis in Sec.~\ref{ssec:eval_roof} to determine the optimization status of the FP-intensive proxy-apps, which we used for this study (incl.\ 2 additional references for this part); \item Added details about the theoretical peak speedup with turbo boost shown in Fig.~\ref{fig:freq} and explanation of why a pessimistic +100\,MHz was chosen in this case and why this resulted in ``superlinear speedup'' for some benchmarks; \item Added acknowledgement of funding sources and authors' contributions; \item Added note to Tab.~\ref{table:rest} to point out the multiuse of two columns by similar metrics (VTune reports slightly different metrics for BDW vs. KNM/KNL for arithmetic intensity and memory-boundedness; readers can consult Ref. [41] for an in-depth documentation of these metrics (as stated previously in the table's caption)); \item (+ multiple smaller grammar and text adjustments which will not be highlighted). \end{itemize} \end{titlepage} }{} %%%% DONT TOUCH FOR DRAFT => TODO take out if we buy 1 page \bstctlcite{IEEEexample:BSTcontrol} %%%% DONT TOUCH FOR DRAFT ... ...
@@ -450,7 +478,6 @@ \input{70-conclusion} % 1/2 page \iftoggle{includeacknowl}{ \input{80-acknowledgment} % 1/2 page ... ... @@ -630,16 +657,12 @@ %No appendix in first submission \input{41-rest-table} \begin{comment} \iftoggle{includeappendix}{ \clearpage \input{90-appendix} }{ \input{41-rest-table} } \end{comment} % that's all folks \end{document}