...
 
%
-% PACKAGES
+% PACKAGES and CONDITIONALS
%
\usepackage{etoolbox}
\usepackage{units}
@@ -14,26 +14,42 @@
\DontPrintSemicolon
\SetAlFnt{\scriptsize}
%
\usepackage[table,xcdraw]{xcolor}
%\usepackage{booktabs}
%
% TODO: take out rebuttal flag for final submission
\newtoggle{highlightChanges}
\toggletrue{highlightChanges}
%\togglefalse{highlightChanges}
%
\iftoggle{highlightChanges}{
\usepackage{changes}
}{
\newcommand{\added}[1]{{#1}}
}
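%
% Usage sketch (illustration only): when the toggle is true, \added{...} from
% the changes package highlights revised text in blue; when false, the
% fallback above prints its argument unmarked, e.g., in the body text:
%   ... we can assume that these benchmarks are well tuned~-- \added{a
%   hypothesis we will test in Section~\ref{ssec:eval_roof}}.
%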
%
\newtoggle{doubleblind}
%\toggletrue{doubleblind}
\togglefalse{doubleblind}
%
\newtoggle{includeappendix}
-\toggletrue{includeappendix}
-%\togglefalse{includeappendix}
+%\toggletrue{includeappendix}
+\togglefalse{includeappendix}
%
\newtoggle{includeacknowl}
-%\toggletrue{includeacknowl}
-\togglefalse{includeacknowl}
+\toggletrue{includeacknowl}
+%\togglefalse{includeacknowl}
% NEW COMMANDS
%
\usepackage[table,xcdraw]{xcolor}
%\usepackage{booktabs}
% only during writing
%
% NEW COMMANDS
%\newcommand{\struc}[1]{\textcolor{blue}{ToWrite: #1 \\}}
%\newcommand{\todo}[1]{\textcolor{red}{TODO: #1}}
%\newcommand{\cJD}[1]{\textcolor{red}{\small{FIX: #1}}}
%
% colored tables, rows, etc
%
\newcommand{\struc}[1]{\textcolor{blue}{ToWrite: #1 \\}}
\newcommand{\todo}[1]{\textcolor{red}{TODO: #1}}
\newcommand{\cJD}[1]{\textcolor{red}{\small{FIX: #1}}}
\newcommand{\rC}[0]{\rowcolor[HTML]{CCCCCC}}
\newcommand{\hC}[0]{\rowcolor[HTML]{333333}}
\newcommand{\tH}[1]{\multicolumn{1}{c}{\textcolor{white}{#1}}}
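%
% Usage sketch (illustration only; numbers are placeholders): \hC colors a
% header row, \tH sets white header text, and \rC shades a data row, e.g.:
%   \begin{tabular}{|l|r|}
%   \hline \hC
%   \tH{Benchmark} & \tH{t2sol [s]} \\\hline \rC
%   MiniAMR & 42.0 \\\hline
%   \end{tabular}
%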
......
% IPDPS requirement !Abstract (Maximum 250 words)!
\begin{abstract}
-Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications need for large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this~view.
+Part of the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for large amounts of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this~view.
-In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy application on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNM and KNL architecturally deviate at one important point: the silicon area devoted to double-precision arithmetics. This fortunate discrepancy allows us to empirically quantify the performance impact in reducing the amount of hardware double-precision arithmetic. %this part reads strange
+In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNL and KNM architecturally deviate in one important respect: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic.
-Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision with little-to-no performance implications. With the advent of a failing of Moore's law, our results partially reinforce the view taken by modern industry (e.g. upcoming Fujitsu ARM64FX) to integrate hybrid-precision hardware units.
+Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision hardware with little-to-no performance implications. With Moore's law failing, our results partially reinforce the view taken by modern industry (e.g., the upcoming Fujitsu A64FX) to integrate hybrid-precision hardware units.
%\cJD{PLACEHOLDER: (Maximum 250 words)!}
%Common perception in supercomputing is that double precision floating point
......
@@ -21,7 +21,7 @@ With the ending of Dennard's scaling~\cite{dennard_design_1974} and the ending o
there is today an ever-increasing need to oversee how we allocate the silicon to various functional units in modern many-core
processors. Amongst those decisions is how we distribute the hardware support for various levels of compute-precision.
-Historically, most of the compute silicon has been allocated to double-precision (64-bit) compute.
+Historically, most of the compute silicon has been allocated to double-precision (DP; 64-bit) compute.
Nowadays -- in processors such as the forthcoming A64FX~\cite{yoshida_fujitsu_2018} and NVIDIA
Volta~\cite{choquette_volta:_2018} -- the trend, mostly driven by market/AI demands, is to replace
some of the double-precision units with lower-precision units.
@@ -41,16 +41,16 @@ a decade (Knights Ferry was announced in 2010), and has changed drastically sinc
The latest (and also last) two revisions -- the Knights Landing and Knights Mill -- are of
particular importance since they arguably reflect two different ways of thinking. Knights Landing
has relatively large support for double-precision (64-bit) computations, and follows a
-more traditional school of thought. The Knights Mill follows a different direction, which is the replacement
+more traditional school of thought. Knights Mill, in contrast, follows a different direction: the replacement
of double-precision compute units with lower-precision (single-precision, half-precision, and integer)
compute capabilities.
In the present paper, we quantify and analyze the performance and compute bottlenecks of
-Intel's Knights Landing~\cite{sodani_knights_2016} and Mill architectures~\cite{bradford_knights_2017} -- two
+Intel's Knights Landing~\cite{sodani_knights_2016} and Knights Mill~\cite{bradford_knights_2017} architectures -- two
processors with identical micro-architecture, where the main difference lies in the relative allocation of double-precision units.
We stress both processors with numerous realistic benchmarks from both the
Exascale Computing Project (ECP) proxy applications~\cite{noauthor_ecp_2018} and
-RIKEN-CCS Fiber Miniapp Suite~\cite{riken_aics_fiber_2015} -- benchmarks used in HPC system acquisition.
+RIKEN R-CCS Fiber Miniapp Suite~\cite{riken_aics_fiber_2015} -- benchmarks used in HPC system acquisition.
Through an extensive (and robust) performance measurement process (which we also open-source), we
empirically show the architectures' relative weaknesses. In short, the contributions of the present paper are:
\begin{enumerate}
......
@@ -22,8 +22,8 @@ In this section, we present our rigorous benchmarking approach into investigatin
% \struc{assume BMs are well tuned}
Because the benchmarks, listed in Section~\ref{ssec:bm}, are firstly realistic proxies of the
-original applications~\cite{aaziz_methodology_2018} and secondly are used in the procurement process, we can confidently assume
-that these benchmarks are well tuned and come with appropriate compiler options for a variety of compilers.
+original applications~\cite{aaziz_methodology_2018} and secondly are used in the procurement process, we can assume
+that these benchmarks are well tuned and come with appropriate compiler options for a variety of compilers -- \added{a hypothesis we will test in Section~\ref{ssec:eval_roof}}.
Hence, we refrain from both manual code optimization and alterations of the compiler options.
%
%\struc{how we compiled}
@@ -55,17 +55,17 @@ aim is~\unit[1]{sec}--\unit[10]{min} due to the large sample size we have to cov
realistic amount of main memory (e.g., avoid cache-only executions)? Are the results repeatable
(randomness/seeds)? We optimize for the metrics reported by the benchmark (e.g., select the input
with the highest~\unit[]{Gflop/s} rate).
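%
% Typesetting note (illustration only): the units package sets a value and its
% unit with a thin space, e.g., \unit[10]{min}, while an empty optional
% argument, as in \unit[]{Gflop/s}, typesets the bare unit.
%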
%
% \struc{explain parameter-sweep for num\_mpi and num\_omp, why, reason, how on all system,( and examples maybe)}
%
-Furthermore, one of the most important considerations while selecting the right inputs is
+Furthermore, one of the most important considerations when selecting the right inputs is
\textit{strong-scaling}. We require strong-scaling properties of the benchmark for two reasons:
the results collected in Step~(2) need to be comparable, and even more importantly, the results
of Step~(3) must be comparable between different architectures, since we may have to use different
numbers of MPI processes for KNL and KNM (and our BDW reference architecture) due to their difference
in core counts. The only exception is MiniAMR, for which we are unable to find a strong-scaling
-input configuration and instead optimized for the reported~\unit[]{Gflop/s} of the benchmark. Accordingly, we then choose
-the same amount of MPI processes on our KNL and KNM compute nodes for MiniAMR.
+input configuration and instead optimized for the reported~\unit[]{Gflop/s} of the benchmark.
+Accordingly, we then choose the same number of MPI processes on our KNL and KNM compute nodes for MiniAMR.
In Step (2), we evaluate numerous combinations of MPI processes and OpenMP threads
%\cJD{any other threadin models?} -> candle uses omp-ish too, see
@@ -73,11 +73,12 @@ In Step (2), we evaluate numerous combinations of MPI processes and OpenMP threa
for each benchmark, including combinations which over-/undersubscribe the CPU cores, and test each
combination with three runs to minimize the potential for outliers due to system noise.
For all subsequent measurements, we select the number of processes and threads based on the ``best'' (w.r.t
-time-to-solution of the solver) combination among these tested versions, see Table~\ref{table:rest} for details.
+time-to-solution of the solver) combination among these tested versions; see Table~\ref{table:rest} % for details.
+at the end of this paper for details.
%\cJD{no specific intel' mpi tuning (except hpgc, babel) because initial test consistently resulted
%in worse time to solution when non-default options where used}
-We are not applying specific tuning options to the Intel MPI library, except for using Intel's recommended
-settings for HPCG with respect to thread affinity and MPI\_allreduce implementation.
+We do not apply specific tuning options to Intel's MPI library, except for using Intel's recommended
+settings for HPCG with respect to thread affinity and MPI\_Allreduce. % implementation.
The reason is that our pretests (with a subset of the benchmarks) with non-default parameters for
Intel MPI consistently resulted in longer time-to-solution.
@@ -87,8 +88,8 @@ For Step (3), we run each benchmark ten times to identify the fastest time-to-so
(compute) kernel of the benchmark. Additionally, for the profiling runs, we execute the benchmark
once for each of the profiling tools and/or metrics (in case the tool is used for multiple metrics);
see Section~\ref{ssec:metrics} for details. Finally, we perform frequency scaling experiments
-for each benchmark, where we throttle the CPU frequency to all the available lower CPU states
-below the maximum CPU frequency we use for the performance runs, and record the lowest kernel
+for each benchmark, where we throttle the CPU frequency to all of the available lower CPU states
+below the maximum CPU frequency, which we use for the performance runs, and record the lowest kernel
time-to-solution among ten trials per frequency. The rationale for, and results of, the frequency scaling
test will be explained further in Section~\ref{ssec:eval_freq}.
One may argue for more than ten runs per benchmark to find the optimal time-to-solution; however,
@@ -169,7 +170,7 @@ all presented data will be based exclusively on the kernel portion of each bench
START\_ASSAY\;
}
\caption{Injecting analysis instructions}
-% \vspace{-0.5em}
+\vspace{-.5em}
\end{algorithm}
%\struc{what does each tool have in terms of capabilities, how is it applied to the benchmarks,
@@ -178,8 +179,8 @@
For reasons of tool stability, attention to detail/accuracy, and overlap with our needs, we settle
on the use of the MPI API for runtime measurements, alongside Intel's Processor Counter
Monitor (PCM)~\cite{willhalm_intel_2017}, Intel's Software Development Emulator (SDE)~\cite{raman_calculating_2015}, and Intel's VTune
-Amplifier~\cite{sobhee_intel_2018}\footnote{~To avoid persistent compute node crashes, we had to use disable VTune's
-\\$~~~\,\quad$build-in sampling driver and instead rely on Linux' perf tool.}.
+Amplifier~\cite{sobhee_intel_2018}\footnote{~To avoid persistent compute node crashes (likely due to incompatibilities\\$~~~\,\quad$with the Spectre/Meltdown patches), we had to disable VTune's built-in
+\\$~~~\,\quad$sampling driver and instead rely on Linux' \texttt{perf} tool.}.
Furthermore, as auxiliary tools we rely on RRZE's Likwid~\cite{treibig_likwid:_2010} for frequency
scaling\footnote{~Our Linux kernel version required us to disable
the default Intel P-State\\$~~~\,\quad$driver to have full access to the fine-grained frequency scaling.} and
@@ -189,7 +190,7 @@ Section~\ref{ssec:bm}, is shown in Table~\ref{tb:Mtools}. Furthermore, derived m
as~\unit[]{Gflop/s}, will be explained on demand in Section~\ref{sec:eval}.
%
\begin{table}[tp]
-\vspace{-0.5em}
+%\vspace{-0.5em}
\centering\scriptsize
\caption{\label{tb:Mtools}Summary of metrics and method/tool to collect these metrics}
\begin{tabular}{|l|l|}
@@ -204,7 +205,7 @@ as~\unit[]{Gflop/s}, will be explained on-demand in Section~\ref{sec:eval}.
SIMD instructions per cycle & perf + VTune (`hpc-performance') \\\hline \rC
Memory/Back-end boundedness & perf + VTune (`memory-access') \\\hline
\end{tabular}
-\vspace{-0.5em}
+\vspace{-.5em}
\end{table}
%
......
\begin{table*}[tbp]
-\caption{\label{table:rest} Application configuration and measured metrics; Missing data for CANDLE due to SDE crashes on Phi; Measurements indicate CANDLE/MKL-DNN ignores OpenMP settings and tries to utilize full chip $\rightarrow$ listed in italic; Label explanation: t2sol = time-to-solution (kernel), Gop (D $|$ S $|$ I) = Giga operations (FP64 $|$ FP32 $|$ Integer), SIMDi/cyc = SIMD instructions per cycle, FPAIp[R $|$ W] = FP Arithmetic instructions per memory [read $|$ write], [B $|$ M]Bd = [Back-end $|$ Memory] Bound (see~\cite{sobhee_intel_2018} for details), L2h = L2 cache hit rate, LLh = Last level cache hit rate (L3 for BDW, MCDRAM for KNL/KNM), Gbra/s = Giga branches/s}%\cJD{highlight important?}}
+\caption{\label{table:rest} Application configuration and measured metrics; Missing data for CANDLE due to SDE crashes on Phi; Measurements indicate CANDLE/MKL-DNN ignores OpenMP settings and tries to utilize the full chip $\rightarrow$ listed in italic; Label explanation: t2sol = time-to-solution (kernel), Gop (D $|$ S $|$ I) = Giga operations (FP64 $|$ FP32 $|$ Integer), SIMDi/cyc = SIMD instructions per cycle, FPAIp[R $|$ W] = FP Arithmetic instructions per memory [read $|$ write], [B $|$ M]Bd = [Back-end $|$ Memory] Bound (see~\cite{sobhee_intel_2018} for details), L2h = L2 cache hit rate, LLh = Last level cache hit rate (L3 for BDW, MCDRAM for KNL/KNM), Gbra/s = Giga branches/s;\qquad\added{Note: SIMDi/cyc and FPAIp* as well as BBd and MBd occupy the same columns due to their similarity and space constraints}}
\centering\scriptsize
\begin{tabular}{|l|r|r|r|r|r|r|r|c|r|r|r|r|}
\hline \hC
......
@@ -9,9 +9,10 @@ the three architectures, this section summarizes the relevant points to consider
from our study, which should be taken into account when moving forward.
\subsection{Performance Metrics}
-The de facto performance metric reported in HPC is \unit[]{flop/s}. Reporting \unit[]{flop/s} is not limited to applications that are compute-bound. Benchmarks that are designed to resemble realistic workloads, e.g., the
-memory-bound HPCG benchmark, typically report performance in \unit[]{flop/s}. The proxy-/mini-apps in this study as well typically report \unit[]{flop/s} despite only six out of 20 proxy-/mini-apps we analyze in this study appearing to be
-compute-bound (including NGSA that is bound by ALUs, not FPUs). We
+The de facto performance metric reported in HPC is \unit[]{flop/s}. However, reporting \unit[]{flop/s} is not limited to applications that are compute-bound. Benchmarks that are designed to resemble realistic workloads, e.g., the
+memory-bound HPCG benchmark, typically report performance in \unit[]{flop/s}. The proxy-/mini-apps in this study
+likewise report \unit[]{flop/s}, despite the fact that only six of the 20 we analyze
+appear to be compute-bound (including NGSA, which is bound by ALUs, not FPUs). We
argue that converging on the reporting of relevant metrics would shift the focus of the community to be less \unit[]{flop/s}-centered.
%It is important to mention that reporting only time-to-solution and scalability, without reporting performance, is a common pitfall that distorts the interpretation of results in HPC~\cite{hoefler_scientific_2015}.
@@ -35,7 +36,7 @@ scientific domain for different supercomputing facilities (based on yearly
reports of the mentioned facilities). For instance, by simply mapping the scientific
domains in Figure~\ref{fid:disc:breakdown} to representative proxies,
ANL's ALCF and \mbox{R-CCS's} K-computer would achieve $\approx$14\% and
-$\approx$11\%, respectively, of the peak \unit[]{flop/s} when projecting
+$\approx$11\% of the peak \unit[]{flop/s}, respectively, when projecting
for the annual node-hours. %oversimplification?
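%
% (Sketch of the assumed projection: with $s_d$ denoting the node-hour share
% of scientific domain $d$ and $e_d$ the peak-\unit[]{flop/s} efficiency of
% its representative proxy, the facility-wide estimate is $\sum_d s_d \, e_d$,
% which yields the quoted $\approx$14\% and $\approx$11\%.)
%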
It is worth mentioning that the relevance of \unit[]{flop/s} is even more
of an issue for supercomputers dedicated to specific workloads: the relevance of
@@ -44,10 +45,10 @@ mainly to weather forecasting, e.g., the~\unit[18]{Pflop/s} system recently
installed at Japan's Meteorological Agency~\cite{japan_meteorological_agency_jma_jma_2018},
should give minimal relevance to \unit[]{flop/s} since the proxy representing
this workload on that supercomputer achieves $\approx$6\% of the peak \unit[]{flop/s},
-since those workloads are typically memory-bound. On the other hand, a
-supercomputer dedicated to AI/ML such as ABCI, the world 5\textsuperscript{th}
+because those workloads are typically memory-bound. On the other hand, a
+supercomputer dedicated to AI/ML such as ABCI, the world's 5\textsuperscript{th}
fastest supercomputer as of June 2018, would put high emphasis on \unit[]{flop/s}
-since deep learning workloads rely heavily on dense matrix multiplication operations.
+as current deep learning workloads rely heavily on dense matrix multiplications.
\subsection{Memory-bound Applications}
@@ -55,12 +56,13 @@ As demonstrated in Figure~\ref{fig:flops}, the performance of memory-bound
applications is mostly not affected by the peak \unit[]{flop/s} available.
Accordingly, investment in data-centric architectures and programming models
should take priority over paying a premium for \unit[]{flop/s}-centric systems.
-In one motivating instance, during the investigation that NASA Ames Research
-Center conducted to identify planned upgrade of the Pleiades supercomputer in
-2016~\cite{saini_performance_2016}, the study concluded that the performance gain from upgrading to
+In one motivating instance, an investigation conducted by NASA Ames Research Center
+for a planned upgrade of the Pleiades supercomputer in 2016~\cite{saini_performance_2016}
+concluded that the performance gain of their applications from upgrading to
Intel Haswell processors was insignificant in comparison to using the older
Ivy Bridge-based processors (the newer processor offered double the peak
-\unit[]{flop/s} at almost the same memory bandwidth). And hence the choice was only do a partial upgrade to Haswell processors.
+\unit[]{flop/s} at almost the same memory bandwidth).
+Hence, the choice was to do only a partial upgrade to Haswell processors.
\subsection{Compute-bound Applications}
Investing more in data-centric architectures to accommodate memory-bound
......
\section{Conclusion}\label{sec:conclusion}
% goal: 1/4 page
%
-\begin{comment}
-\struc{what did we learn which can be beneficial for others in the HPC community}
-\struc{what is our recommendation for vendors and centers buying new systems}
-\struc{praise our github w/ link so that others can perform similar stuff and
-check, study, validate our results, also link to our TR or extended version
-with appendix of less interesting results, etc.}
-\end{comment}
+%\struc{what did we learn which can be beneficial for others in the HPC community}
+%\struc{what is our recommendation for vendors and centers buying new systems}
+%\struc{show our github w/ link so that others can perform similar stuff and
+%check, study, validate our results, also link to our TR or extended version
+%with appendix of less interesting results if we have any, etc.}
We compared two architecturally similar processors that have different double-precision
silicon budgets. By studying a large number of HPC proxy applications, we found no significant
......
\section*{Acknowledgment \& Author Contributions}
\added{
This work was supported by MEXT, JST special appointed survey 30593/2018 as well as JST-CREST
under Grant Number JPMJCR1303, and the AIST/TokyoTech Real-world Big-Data Computation Open
Innovation Laboratory, Japan.
Furthermore, we would like to thank Intel for providing technical support.
-K.M., J.D., H.Z., K.Y., T.T. and Y.T. performed the required experiments and data collection.
+The authors K.M., J.D., H.Z., K.Y., T.T. and Y.T. performed the required experiments and data collection.
J.D., M.W., A.P. designed the study, analyzed the data, and supervised its execution together
with S.M., while all authors contributed to writing and editing.
}
\ No newline at end of file
\appendices
%
-\section{Reproducibility}\label{apx:reprod}
-to infinity
-\struc{code and logs in git, explain how to pull, install, compile, config, run}
-\struc{explain about proprietary code and packages we dont ship with the repo}
-\struc{explain how we analyze the codes, or tools/scripts we used}
-\struc{detailes about software version if necessary}
-\struc{have we patched any bugs? in the codes?}
-\section{Detailed Input/Parameters for Benchmarks}\label{apx:inputs}
-and beyond
-\struc{title says it all}
+%\section{Reproducibility}\label{apx:reprod}
+%to infinity
+%
+%\struc{code and logs in git, explain how to pull, install, compile, config, run}
+%\struc{explain about proprietary code and packages we dont ship with the repo}
+%\struc{explain how we analyze the codes, or tools/scripts we used}
+%\struc{detailes about software version if necessary}
+%\struc{have we patched any bugs? in the codes?}
+%
+%\section{Detailed Input/Parameters for Benchmarks}\label{apx:inputs}
+%and beyond
+%
+%\struc{title says it all}
+%
\section{Additionally Evaluated Metrics}\label{apx:metrics}
-woooshhhh
-\struc{here comes everything text/figs/etc we left out of the main eval section}
+%woooshhhh
+%
+%\struc{here comes everything text/figs/etc we left out of the main eval section}
+%
+\input{41-rest-table}
\ No newline at end of file
@IEEEtranBSTCTL{IEEEexample:BSTcontrol,
CTLuse_forced_etal = "yes",
CTLmax_names_forced_etal = "2",
CTLnames_show_etal = "1" }
CTLnames_show_etal = "1",
CTLname_url_prefix = "URL: "
}
@misc{lawrence_livermore_national_laboratory_sierra_nodate,
title = {Sierra {Advanced} {Technology} {System}},
@@ -129,7 +130,7 @@
@techreport{dongarra_hpcg_2015,
title = {{HPCG} {Benchmark}: a {New} {Metric} for {Ranking} {High} {Performance} {Computing} {Systems}},
-url = {http://www.eecs.utk.edu/resources/library/file/1047/ut-eecs-15-736.pdf},
+url = {https://library.eecs.utk.edu/pub/594},
number = {ut-eecs-15-736},
institution = {University of Tennessee},
author = {Dongarra, Jack and Heroux, Michael and Luszczek, Piotr},
@@ -647,7 +648,8 @@
urldate = {2018-10-01},
institution = {ExaNoDe},
author = {Asifuzzaman, Kazi and Radulovic, Milan and Radojkovic, Petar},
-year = {2017}
+year = {2017},
+file = {Asifuzzaman et al. - 2017 - Report on the HPC application bottlenecks.pdf:/home/domke/Documents/Zotero/storage/CHGMBEUU/Asifuzzaman et al. - 2017 - Report on the HPC application bottlenecks.pdf:application/pdf}
}
@inproceedings{saini_performance_2016,
@@ -711,4 +713,42 @@
year = {2013},
keywords = {Finite Element Method, Fluid Analysis, Improving Performance, The K computer},
pages = {2496--2499}
-}
\ No newline at end of file
+}
@techreport{lento_whitepaper:_2014,
title = {Whitepaper: {Optimizing} {Performance} with {Intel}{\textregistered} {Advanced} {Vector} {Extensions}},
url = {https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf},
urldate = {2018-12-01},
institution = {Intel Corporation},
author = {Lento, Gregory},
year = {2014},
file = {Lento - 2014 - Whitepaper Optimizing Performance with Intel{\textregistered} Adv.pdf:/home/domke/Documents/Zotero/storage/P9P8CA2W/Lento - 2014 - Whitepaper Optimizing Performance with Intel{\textregistered} Adv.pdf:application/pdf}
}
@inproceedings{ofenbeck_applying_2014,
title = {Applying the {Roofline} {Model}},
doi = {10.1109/ISPASS.2014.6844463},
abstract = {The recently introduced roofline model plots the performance of executed code against its operational intensity (operations count divided by memory traffic). It also includes two platform-specific performance ceilings: the processor's peak performance and a ceiling derived from the memory bandwidth, which is relevant for code with low operational intensity. The model thus makes more precise the notions of memory- and compute-bound and, despite its simplicity, can provide an insightful visualization of bottlenecks. As such it can be valuable to guide manual code optimization as well as in education. Unfortunately, to date the model has been used almost exclusively with back-of-the-envelope calculations and not with measured data. In this paper we show how to produce roofline plots with measured data on recent generations of Intel platforms. We show how to accurately measure the necessary quantities for a given program using performance counters, including threaded and vectorized code, and for warm and cold cache scenarios. We explain the measurement approach, its validation, and discuss limitations. Finally, we show, to this extent for the first time, a set of roofline plots with measured data for common numerical functions on a variety of platforms and discuss their possible uses.},
booktitle = {2014 {IEEE} {International} {Symposium} on {Performance} {Analysis} of {Systems} and {Software} ({ISPASS})},
author = {Ofenbeck, Georg and Steinmann, Ruedi and Caparros, Victoria and Spampinato, Daniele G. and P{\"u}schel, Markus},
month = mar,
year = {2014},
keywords = {cache storage, Computational modeling, software performance evaluation, memory bandwidth, Microarchitecture, Radiation detectors, Bandwidth, back-of-the-envelope calculations, bottleneck visualization, Bridges, cold cache scenarios, compute-bound, executed code performance, Intel platforms, manual code optimization, memory traffic, memory-bound, multi-threading, operation count, operational intensity, performance counters, platform-specific performance ceilings, processor peak performance, program compilers, Q measurement, roofline model, roofline plots, source code (software), threaded code, Time measurement, vectorized code, warm cache scenarios},
pages = {76--85},
file = {Ofenbeck et al. - 2014 - Applying the Roofline Model.pdf:/home/domke/Documents/Zotero/storage/JJM98V4T/Ofenbeck et al. - 2014 - Applying the Roofline Model.pdf:application/pdf}
}
@inproceedings{jouppi_-datacenter_2017,
address = {New York, NY, USA},
series = {{ISCA} '17},
title = {In-{Datacenter} {Performance} {Analysis} of a {Tensor} {Processing} {Unit}},
isbn = {978-1-4503-4892-8},
doi = {10.1145/3079856.3080246},
booktitle = {Proceedings of the 44th {Annual} {International} {Symposium} on {Computer} {Architecture}},
publisher = {ACM},
author = {Jouppi, Norman P. and Young, Cliff and Patil, Nishant and Patterson, David and Agrawal, Gaurav and Bajwa, Raminder and Bates, Sarah and Bhatia, Suresh and Boden, Nan and Borchers, Al and Boyle, Rick and Cantin, Pierre-luc and Chao, Clifford and Clark, Chris and Coriell, Jeremy and Daley, Mike and Dau, Matt and Dean, Jeffrey and Gelb, Ben and Ghaemmaghami, Tara Vazir and Gottipati, Rajendra and Gulland, William and Hagmann, Robert and Ho, C. Richard and Hogberg, Doug and Hu, John and Hundt, Robert and Hurt, Dan and Ibarz, Julian and Jaffey, Aaron and Jaworski, Alek and Kaplan, Alexander and Khaitan, Harshit and Killebrew, Daniel and Koch, Andy and Kumar, Naveen and Lacy, Steve and Laudon, James and Law, James and Le, Diemthu and Leary, Chris and Liu, Zhuyuan and Lucke, Kyle and Lundin, Alan and MacKean, Gordon and Maggiore, Adriana and Mahony, Maire and Miller, Kieran and Nagarajan, Rahul and Narayanaswami, Ravi and Ni, Ray and Nix, Kathy and Norrie, Thomas and Omernick, Mark and Penukonda, Narayana and Phelps, Andy and Ross, Jonathan and Ross, Matt and Salek, Amir and Samadiani, Emad and Severn, Chris and Sizikov, Gregory and Snelham, Matthew and Souter, Jed and Steinberg, Dan and Swing, Andy and Tan, Mercedes and Thorson, Gregory and Tian, Bo and Toma, Horia and Tuttle, Erick and Vasudevan, Vijay and Walter, Richard and Wang, Walter and Wilcox, Eric and Yoon, Doe Hyun},
year = {2017},
keywords = {accelerator, CNN, deep learning, DNN, domain-specific architecture, GPU, LSTM, MLP, neural network, RNN, TensorFlow, TPU},
pages = {1--12},
file = {Jouppi et al. - 2017 - In-Datacenter Performance Analysis of a Tensor Pro.pdf:/home/domke/Documents/Zotero/storage/YG5GSTHI/Jouppi et al. - 2017 - In-Datacenter Performance Analysis of a Tensor Pro.pdf:application/pdf}
}
@@ -15,7 +15,7 @@ $(PAPER).pdf: $(TEX) $(BIB) $(FIG) cleanall
rm $(PAPER).dvi
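# note (assumption based on typical LaTeX toolchains): *.soc is the
# list-of-changes file written by the changes package, and $(PAPER).out is
# hyperref's bookmark file -- hence both joined the clean target below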
clean:
-rm -f *.ilg *.aux *.log *.dvi *.idx *.toc *.lof *.lot
+rm -f *.ilg *.aux *.log *.dvi *.idx *.toc *.lof *.lot *.soc $(PAPER).out
rm -f *.blg *.bbl *~
cleanall: clean
......
set terminal svg size 1600,600 dynamic enhanced fname 'Times' fsize 28 butt dashlength 1.0
set output "../figures/flops-relA.svg"
set grid
set auto x
set auto y
set xrange [-0.5:23.5]
set yrange [0:4]
set ytics 0,1,4
set xtic font ",24" rotate by -45 scale 0 left
set key left top vertical Right maxrows 3 box width +2
set datafile missing '-'
bdw = "#A61A00"
knl = "#00B358"
knm = "#1924B1"
set ylabel "Rel. Perf. (Gflop/s) Improvement over BDW"
# max DP flops lines
bdwmax(x)=( -1 < x && x < 24 ) ? 1.0 : 1/0
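# note: outside the x-range the ternary evaluates to 1/0 ("undefined"), so
# gnuplot draws the BDW baseline (rel. improvement = 1.0) only across the
# 24 benchmark positions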
plot \
"../data/flops.data" u ($4/$2):xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL_{rel}', \
"" u ($6/$2):xtic(1) pt 9 ps 0.8 lc rgb knm title 'KNM_{rel}', \
bdwmax(x) with lines lt 0 lw 2 lc rgb bdw title 'BDW_{rel}'
set terminal svg size 1600,600 dynamic enhanced fname 'Times' fsize 28 butt dashlength 1.0
set output "../figures/flops-relB.svg"
set grid
set auto x
set auto y
set xrange [-0.5:23.5]
set yrange [0:100]
set ytics 0,20,100
set xtic font ",24" rotate by -45 scale 0 left
set key left top vertical Right maxrows 3 box width +2
set datafile missing '-'
bdw = "#A61A00"
knl = "#00B358"
knm = "#1924B1"
set ylabel "Abs. achieved Gflop/s out of Peak [in %]"
# max DP flops lines
bdwmax(x)=( -1 < x && x < 24 ) ? 1.0 : 1/0
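# note: same baseline helper as in the flops-relA script; it appears unused in
# the plot below, which shows absolute percentages of peak for all three chips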
plot \
"../data/flops.data" u (100.0*$4/$9):xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL_{abs}', \
"" u (100.0*$6/$10) pt 9 ps 0.8 lc rgb knm title 'KNM_{abs}', \
"" u (100.0*$2/$8) pt 3 ps 0.8 lc rgb bdw title 'BDW_{abs}'
@@ -5,10 +5,10 @@ set grid
set auto x
set auto y
set xrange [0:27]
-set yrange [0:450]
+set yrange [0:500]
set xtic font ",24" rotate by -45 scale 0 left
-set key left top vertical Right box width +2
+set key opaque left top vertical Right box width +2
#reverse noenhanced autotitle columnhead box
set datafile missing '-'
@@ -23,10 +23,10 @@ knlmax(x)=( -0.5 < x && x < 27.5 ) ? 439 : 1/0
knmmax(x)=( -0.5 < x && x < 27.5 ) ? 430 : 1/0
plot \
"../data/bytes-n-flops.data" u 2:xtic(1) pt 3 ps 0.8 lc rgb bdw title 'BDW', \
bdwmax(x) with lines lt 0 lw 2 lc rgb bdw notitle, \
"" u 6:xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL', \
"../data/bytes-n-flops.data" u 6:xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL', \
knlmax(x) with lines lt 0 lw 2 lc rgb knl notitle, \
"" u 10:xtic(1) pt 9 ps 0.8 lc rgb knm title 'KNM', \
knmmax(x) with lines lt 0 lw 2 lc rgb knm notitle
knmmax(x) with lines lt 0 lw 2 lc rgb knm notitle, \
"" u 2:xtic(1) pt 3 ps 0.8 lc rgb bdw title 'BDW'
set terminal svg size 1600,1200 dynamic enhanced fname 'Times' fsize 32 butt dashlength 1.0
set output "../figures/roofline-bdw.svg"
# gflops
knl_fpeak = 2662.0
knm_fpeak = 1728.0
bdw_fpeak = 691.0
# gb/s
knl_mpeak = 439.0
knm_mpeak = 430.0
bdw_mpeak = 122.0
xmin = 0.001
xmax = 100
ymin = 0.1
ymax = 2000
set xtics nomirror
set xrange [xmin:xmax]
set logscale x 10
set yrange [ymin:ymax]
set logscale y 10
# Functions
mem(x,y) = exp( log( y ) - log( x ))
min(a,b) = (a < b) ? a : b
max(a,b) = (a > b) ? a : b
knl_froof(x) = knl_fpeak
knl_mroof(x) = mem(knl_fpeak / knl_mpeak, knl_fpeak) * x
knl_rflne(x) = min(knl_froof(x), knl_mroof(x))
knm_froof(x) = knm_fpeak
knm_mroof(x) = mem(knm_fpeak / knm_mpeak, knm_fpeak) * x
knm_rflne(x) = min(knm_froof(x), knm_mroof(x))
bdw_froof(x) = bdw_fpeak
bdw_mroof(x) = mem(bdw_fpeak / bdw_mpeak, bdw_fpeak) * x
bdw_rflne(x) = min(bdw_froof(x), bdw_mroof(x))
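# Roofline sketch (model as in the Ofenbeck et al. reference): attainable
# Gflop/s = min(peak Gflop/s, AI * peak GB/s); mem(x,y) reduces to y/x, so
# *_mroof(x) is simply mpeak * x, i.e., the bandwidth ceiling.
# Example: BDW at AI = 1 flop/byte -> min(691, 122 * 1) = 122 Gflop/s,
# i.e., memory-bound below the ridge point 691/122 ~ 5.7 flop/byte.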
set grid
#set key left top vertical Right box width +2
unset key
set xlabel "Arithmetic Intensity (flop/byte)"
set ylabel "Gflop/s"
bdw = "#A61A00"
knl = "#00B358"
knm = "#1924B1"
set label 1 "Theor. Peak Performance (FP64)" at xmax-10, 1.25*bdw_froof(xmax) right
set label 2 "Stream Triad Bandwidth (GB/s)" at 1.25*xmin, 1.6*bdw_mroof(xmin) left rotate by 42
plot bdw_rflne(x) lt 1 lc rgb "black" lw 4 notitle, \
"../data/bytes-n-flops.data" u ($3/$5)/($2):($3/$5) pt 28 ps 0.6 lc rgb bdw title 'BDW', \
"" u ($3/$5)/($2):($3/$5):($1) with labels offset 0,-1 font "Times,22" point pt 28 ps 0.6 lc rgb bdw notitle
#plot knl_rflne(x) lt 1 lc rgb knl lw 4 notitle, \
# knm_rflne(x) lt 1 lc rgb knm lw 4 notitle, \
# bdw_rflne(x) lt 1 lc rgb bdw lw 4 notitle, \
# "../data/bytes-n-flops.data" u ($7/$9)/($6):($7/$9) pt 20 ps 0.6 lc rgb knl title 'KNL', \
# "" u ($11/$13)/($10):($11/$13) pt 9 ps 0.6 lc rgb knm title 'KNM', \
# "" u ($11/$13)/($10):($11/$13):($1) with labels offset -2.5,-.3 font "Times,24" point pt 9 ps 0.6 lc rgb knm notitle, \
# "" u ($3/$5)/($2):($3/$5) pt 28 ps 0.6 lc rgb bdw title 'BDW', \
# "" u ($3/$5)/($2):($3/$5):($1) with labels offset 0,-1 font "Times,24" point pt 28 ps 0.6 lc rgb bdw notitle
@@ -5,8 +5,8 @@
set auto x
set auto y
set xrange [-0.5:23.5]
-set yrange [0.5:1.5]
-set ytics 0.5,0.5,1.5
+set yrange [0:3]
+set ytics 0,1,3
set xtic font ",24" rotate by -45 scale 0 left
set key left top vertical Right box width +2
@@ -19,9 +19,10 @@ knm = "#1924B1"
set ylabel "Speedup (w.r.t Time-to-Solution)"
-knlmax(x)=( -1 < x && x < 24 ) ? 1.0 : 1/0
+bdwmax(x)=( -1 < x && x < 24 ) ? 1.0 : 1/0
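# note (assumption inferred from usage): columns 2/4/6 of t2solv.data hold
# the BDW/KNL/KNM kernel times, so $2/$4 and $2/$6 below plot speedup over
# BDW, and bdwmax() draws the BDW = 1.0 reference line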
plot \
"../data/t2solv.data" u ($4/$6):xtic(1) pt 20 ps 0.8 lc rgb knm title 'KNM', \
knlmax(x) with lines lt 0 lw 2 lc rgb knl title 'KNL'
"../data/t2solv.data" u ($2/$4):xtic(1) pt 20 ps 0.8 lc rgb knl title 'KNL', \
"" u ($2/$6):xtic(1) pt 9 ps 0.8 lc rgb knm title 'KNM', \
bdwmax(x) with lines lt 0 lw 2 lc rgb bdw title 'BDW'
@@ -316,7 +316,35 @@
\begin{document}
%%%% DONT TOUCH FOR DRAFT
\iftoggle{highlightChanges}{
\begin{titlepage}
\mbox{}\\{\Large \textbf{Cover Letter for Submission:}\\\\Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?}
\newline\newline\newline\newline\large
Summary of the implemented changes (also highlighted in blue on subsequent pages):
\begin{itemize}
\item Added discussion of why we report flop/s and why it is necessary in this paper
despite our later recommendation that the HPC community should not (only) report flop/s;
\item Split of Fig.~\ref{fig:flops} (rel./abs. flop/s comparison) into two subfigures for
easier readability and modification of text in Sec.~\ref{ssec:eval_flops} to reflect the
change;
\item Added Fig.~\ref{fig:t2s-rel} for ``Time-to-Solution'' and its explanation/discussion
in Sec.~\ref{ssec:eval_flops};
\item Added roofline analysis in Sec.~\ref{ssec:eval_roof} to determine the optimization
status of the FP-intensive proxy-apps, which we used for this study (incl.\ two additional
references for this part);
\item Added details about the theoretical peak speedup with turbo boost shown in
Fig.~\ref{fig:freq}, an explanation of why a pessimistic +100~MHz was chosen in this case,
and why this resulted in ``superlinear speedup'' for some benchmarks;
\item Added acknowledgment of funding sources and authors' contributions;
\item Added note to Tab.~\ref{table:rest} to point out the multiuse of two columns by
similar metrics (VTune reports slightly different metrics for BDW vs. KNM/KNL for
arithmetic intensity and memory-boundedness; and readers can consult Ref. [41] for an
in-depth documentation on these metrics (as stated previously in the table's caption));
\item (+ multiple smaller grammar and text adjustments, which will not be highlighted).
\end{itemize}
\end{titlepage}
}{}
%%%% DONT TOUCH FOR DRAFT => TODO take out if we buy 1 page
\bstctlcite{IEEEexample:BSTcontrol}
%%%% DONT TOUCH FOR DRAFT
@@ -450,7 +478,6 @@
\input{70-conclusion}
% 1/2 page
\iftoggle{includeacknowl}{
\input{80-acknowledgment}
% 1/2 page
@@ -630,16 +657,12 @@
-%No appendix in first submission
-\input{41-rest-table}
-\begin{comment}
\iftoggle{includeappendix}{
-\clearpage
-\input{90-appendix}}
-{
+\clearpage
+\input{90-appendix}
+}{
\input{41-rest-table}
}
-\end{comment}
% that's all folks
\end{document}