Commit 8e839ef9 authored by Jens Domke

minor text adjustments and take watermark out

parent a1429a0a
@@ -8,7 +8,7 @@
\usepackage{hyperref}
\usepackage{breakurl}
\usepackage{comment} % TODO remove later
\usepackage{draftwatermark}
%\usepackage{draftwatermark}
%
\usepackage[ruled,vlined]{algorithm2e}
\DontPrintSemicolon
@@ -44,20 +44,20 @@
\renewcommand{\algorithmcfname}{PseudoCode}
\makeatother
\makeatletter
\renewcommand\sc@wm@print[1]{% redefine positioning of mark (-1in to 330pt)
\if@sc@wm@stamp
\setbox\@tempboxa\vbox to \z@{%
\vskip 100pt \moveleft -0.5in \vbox{%
\hbox to \z@{%
#1\hss}}\vss}
\dp\@tempboxa\z@
\box\@tempboxa
\fi}
\makeatother
\SetWatermarkText{DRAFT}
\SetWatermarkScale{1}
\SetWatermarkColor[rgb]{.95,.95,.95}
%\makeatletter
%\renewcommand\sc@wm@print[1]{% redefine positioning of mark (-1in to 330pt)
% \if@sc@wm@stamp
% \setbox\@tempboxa\vbox to \z@{%
% \vskip 100pt \moveleft -0.5in \vbox{%
% \hbox to \z@{%
% #1\hss}}\vss}
% \dp\@tempboxa\z@
% \box\@tempboxa
% \fi}
%\makeatother
%\SetWatermarkText{DRAFT}
%\SetWatermarkScale{1}
%\SetWatermarkColor[rgb]{.95,.95,.95}
%% preprint rules (this should cover most venues for us)
% IEEE (2018-10-22)
......
@@ -2,7 +2,7 @@
\begin{abstract}
Part of the (uncontested) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this~view.
In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy application on two processors: Intel’s Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNM and KNL architecturally deviate at one important point: the silicon area devoted to double-precision arithmetic’s. This fortunate discrepancy allows us to empirically quantify the performance impact in reducing the amount of hardware double-precision arithmetic. %this part reads strange
In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNM and KNL deviate at one important architectural point: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic.
Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision hardware with little-to-no performance implications. With Moore's law failing, our results partially reinforce the view taken by modern industry (e.g., the upcoming Fujitsu A64FX) to integrate hybrid-precision hardware units.
......
@@ -22,7 +22,7 @@ there is today an ever-increasing need to oversee how we allocate the silicon to
processors. Amongst those decisions is how we distribute the hardware support across various levels of compute-precision.
Historically, most of the compute silicon has been allocated to double-precision (64-bit) compute.
Nowadays -- in processors such as the forthcoming AA64FX~\cite{yoshida_fujitsu_2018} and Nvidia
Nowadays -- in processors such as the forthcoming A64FX~\cite{yoshida_fujitsu_2018} and NVIDIA
Volta~\cite{choquette_volta:_2018} -- the trend, mostly driven by market/AI demands, is to replace
some of the double-precision units with lower-precision units.
Lower-precision units occupy less area (up to $\approx$3x going from double- to single-precision
......
@@ -37,7 +37,7 @@ To understand and explore the intersection of architectures with high-amount of
CPU Model & \textbf{7210F} & \textbf{7295} & 2x E5-2650v4 \\ \hline \rC
\#\{Cores\} (HT) & \textbf{64} (4x) & \textbf{72} (4x) & 24 (2x) \\ \hline
Base Frequency & \textbf{\unit[1.3]{GHz}} & \textbf{\unit[1.5]{GHz}} & \unit[2.2]{GHz} \\ \hline \rC
Max Turbo Freq. & \textbf{\unit[1.4]{GHz}} & \textbf{\unit[1.6]{GHz}} & \unit[2.9]{GHz} \\ \hline
Max Turbo Freq. & \textbf{\unit[1.5]{GHz}} & \textbf{\unit[1.6]{GHz}} & \unit[2.9]{GHz} \\ \hline
CPU Mode & Quadrant & Quadrant & \textit{N/A} \\ \hline \rC
TDP & \textbf{\unit[230]{W}} & \textbf{\unit[320]{W}} & \unit[210]{W} \\ \hline
DRAM Size & \unit[96]{GiB} & \unit[96]{GiB} & \unit[256]{GiB} \\ \hline
@@ -56,7 +56,7 @@ To understand and explore the intersection of architectures with high-amount of
%https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf
%\rowcolor[HTML]{C0C0C0}
%\multicolumn{4}{|c|}{Hardware (other) } \\ \hline
%SSD &\multicolumn{3}{c|}{} \\ \hline
%SSD &\multicolumn{3}{c|}{} \\ \hline
% \rowcolor[HTML]{CCCCCC}
% \multicolumn{4}{|c|}{Software} \\ \hline
% OS &\multicolumn{3}{c|}{Linux version 3.10.0-693.11.6.el7.x86\_64} \\ \hline
@@ -81,7 +81,7 @@ Both processors come with two types of external memory: MCDRAM (or, Hybrid Memory
There are several policies governing where data is homed. A common high-performance configuration~\cite{gawande_scaling_2017}, which is also the one we used in our study, is the quadrant mode. Quadrant mode means that the physical cores are divided into four logical parts, where each logical part is assigned two memory controllers; each logical group is treated as a unique Non-Uniform Memory-Access (NUMA) node, allowing the operating system to perform data-locality optimizations.
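To make this layout tangible, the NUMA nodes exposed to the operating system can be queried programmatically; the following is a minimal sketch, assuming libnuma is available (compile with \texttt{-lnuma}), and is not part of our measurement setup:
\begin{verbatim}
/* Minimal sketch: list the NUMA nodes the OS sees; under the quadrant
 * configuration described above, each logical part appears as its own
 * node. Assumes libnuma (compile with: cc numa_probe.c -lnuma). */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    printf("visible NUMA nodes: %d\n", nodes);
    for (int n = 0; n < nodes; n++) {
        long long free_b = 0;
        long long size = numa_node_size64(n, &free_b);
        printf("node %d: %lld MiB total, %lld MiB free\n",
               n, size >> 20, free_b >> 20);
    }
    return 0;
}
\end{verbatim}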
Table~\ref{table:HW} surveys and contrasts the processors against each other, where the main differences are highlighted. The main architectural difference -- and the one whose impact we seek to empirically quantify -- is the Floating-Point Unit (FPU). In KNL, this unit features two 512-bit wide vector units (AVX), together capable of executing 32 double-precision or 64 single-precision operations per cycle, totaling~\unit[2.6]{Tflop/s} of double- and~\unit[5.3]{Tflop/s} of single-precision performance, respectively, across all 64 processing cores. In KNM, however, the FPU is redesigned to replace one 512-bit vector unit with two Virtual Neural Network Instruction (VNNI) units. Those units, although specializing in hybrid-precision FMA, can execute single-precision vector instructions, but have no support for double-precision compute. Thus, in total, the KNM can execute up to~\unit[1.7]{Tflop/s} of double-precision or~\unit[13.8]{Tflop/s} of single-precision computations. In summary, the KNM has~2.59x more single-precision compute, while the KNL has~1.54x more double-precision compute.
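For transparency, these peak figures follow from the usual product of cores, frequency, and flops per cycle; a back-of-the-envelope sketch (KNM's per-core rate of 128 single-precision flops per cycle is inferred here from the quoted peak):
\begin{align*}
P^{\mathrm{KNL}}_{\mathrm{dp}} &= 64 \times \unit[1.3]{GHz} \times 32 \approx \unit[2.66]{Tflop/s}, &
P^{\mathrm{KNL}}_{\mathrm{sp}} &= 64 \times \unit[1.3]{GHz} \times 64 \approx \unit[5.32]{Tflop/s},\\
P^{\mathrm{KNM}}_{\mathrm{dp}} &= 72 \times \unit[1.5]{GHz} \times 16 \approx \unit[1.73]{Tflop/s}, &
P^{\mathrm{KNM}}_{\mathrm{sp}} &= 72 \times \unit[1.5]{GHz} \times 128 \approx \unit[13.82]{Tflop/s},
\end{align*}
which also reproduces the quoted ratios: $13.82/5.32 \approx 2.59$ (single-precision, KNM over KNL) and $2.66/1.73 \approx 1.54$ (double-precision, KNL over KNM).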
While both the KNL and KNM are functionally and architectural similar, there are some note-worth differences. First, the operating frequency of the processors vary: the KNL operates at a frequency of~\unit[1.3]{GHz} (and up to~\unit[1.5]{GHz} in Turbo mode), while KNM operates at~\unit[1.5]{GHz} (\unit[1.6]{GHz} turbo). Hence, KNM executes~15\% more cycles per second over KNM. Furthermore, although the cores of KNM and KNL are similar (except the FPU), the number of cores are different: KNL has~64 cores while KNM has~72 cores. Both processors are manufactured in~\unit[14]{nm} technology. Finally, the amount of on-chip last-level cache between the two processors is different, where KNM has a~\unit[4]{MiB} advantage over KNL.
While the KNL and KNM are functionally and architecturally similar, there are some noteworthy differences. First, the operating frequencies of the processors differ: the KNL operates at~\unit[1.3]{GHz} (and up to~\unit[1.5]{GHz} in Turbo mode), while the KNM operates at~\unit[1.5]{GHz} (\unit[1.6]{GHz} turbo). Hence, KNM executes~15\% more cycles per second than KNL. Furthermore, although the cores of KNM and KNL are similar (except for the FPU), the core counts differ: KNL has~64 cores while KNM has~72 cores. Both processors are manufactured in~\unit[14]{nm} technology. Finally, the amount of on-chip last-level cache differs as well, with KNM holding a~\unit[4]{MiB} advantage over KNL.
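Concretely, as a quick sanity check:
\begin{equation*}
\frac{\unit[1.5]{GHz}}{\unit[1.3]{GHz}} \approx 1.15
\qquad\text{and}\qquad
\frac{72 \times \unit[1.5]{GHz}}{64 \times \unit[1.3]{GHz}} = \frac{108}{83.2} \approx 1.30,
\end{equation*}
i.e., KNM sustains $\approx$15\% more cycles per second per core, and $\approx$30\% more in aggregate across all cores.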
Additionally, for verification reasons, we include a modern dual-socket Xeon-based compute node in our evaluation. Despite being vastly different from the Xeon Phi systems, our Xeon Broadwell-EP (BDW) general-purpose processor is used to cross-check metrics and experiments, such as execution time and performance (the Xeon Phis should perform better), frequency-scaling experiments (BDW has more frequency domains), and performance counters (BDW exposes more performance counters).
Aside from those differences mentioned above (and highlighted in Table~\ref{table:HW}), the setup between the Xeon Phi nodes (and BDW node) is \textit{identical}, including the same operating system, software stack, and solid state disk. %(one Crucial mSSD SATA with \unit[480]{GiB} per node).
@@ -156,12 +156,12 @@ Laghos proxy-app.
Scalable I/O proxy designed to closely mimic realistic I/O workloads of
HPC applications~\cite{dickson_replicating_2016}. Our input causes MACSio to write a total of \unit[433.8]{MB} to disk.
\paragraph{MiniAMR (MAMR)} is a adaptive mesh refinement proxy application of the Mantevo
\paragraph{MiniAMR (MAMR)} is an adaptive mesh refinement proxy application of the Mantevo
project~\cite{heroux_improving_2009} which applies a stencil computation on a 3-dimensional space,
in our case a sphere moving diagonally through a cubic medium.
%The cube is evenly distributed onto the processes, and adaptive meshing is performed for workload balancing.
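For readers unfamiliar with this kernel class, the following is an illustrative 7-point stencil sweep (a generic sketch only, not MiniAMR's actual implementation):
\begin{verbatim}
/* Illustrative 7-point stencil sweep over the interior of an n^3 grid;
 * a generic sketch of the kernel class, not MiniAMR's code. */
#include <stddef.h>

#define IDX(i, j, k, n) ((size_t)(i)*(n)*(n) + (size_t)(j)*(n) + (size_t)(k))

void stencil_sweep(const double *in, double *out, int n) {
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                out[IDX(i, j, k, n)] = (in[IDX(i, j, k, n)]
                    + in[IDX(i - 1, j, k, n)] + in[IDX(i + 1, j, k, n)]
                    + in[IDX(i, j - 1, k, n)] + in[IDX(i, j + 1, k, n)]
                    + in[IDX(i, j, k - 1, n)] + in[IDX(i, j, k + 1, n)]) / 7.0;
}
\end{verbatim}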
\paragraph{MiniFE (MiFE)} is a reference implementation of an implicit finite elements
\paragraph{MiniFE (MiFE)} is a reference implementation of an implicit finite-element
solver~\cite{heroux_improving_2009} for scientific methods that give rise to unstructured 3-dimensional grids.
For our study, we use 128$\times$128$\times$128 input dimensions for the grid.
@@ -257,7 +257,7 @@ input for a $32^3 \times 32$ lattice discretization.
MODYLAS & Physics and Chemistry & N-body & Fortran \\ \hline
NTChem & Chemistry & Dense matrix & Fortran \\ \hline \rC
QCD & Lattice QCD & Stencil & Fortran/C \\ \hline
\end{tabular}
\end{tabular}
\vspace{-0.4em}
\end{table}
......
\section{Methodology}\label{sec:methods}
%
%\struc{small introduction sentence to this section}
In this section, we present our rigor benchmarking approach into investigating the characteristics of each architecture, and extracting the necessary information for our study.
In this section, we present our rigorous benchmarking approach for investigating the characteristics of each architecture and for extracting the information necessary for our study.
%
% goal: 1.5 page
% no changes to any code (unless fixing bug)
@@ -312,4 +312,4 @@ as~\unit[]{Gflop/s}, will be explained on-demand in Section~\ref{sec:eval}.
% LLC: measured using PERF.
% FLOPS: via the execution time from the time command plus Intel SDE; for the time command, the shortest of five measured runs is used.
% PCM and SDE measurements are restricted to the execution of the kernel portion of each program.
% https://gitlab.m.gsic.titech.ac.jp/precision_experiments/mtmr/blob/master/ave_core.sh
\ No newline at end of file
% https://gitlab.m.gsic.titech.ac.jp/precision_experiments/mtmr/blob/master/ave_core.sh
@@ -8,7 +8,7 @@ be discussed in the next Section~\ref{sec:discuss}.
%Section~\ref{ssec:eval_mem}, by itself is not a good indication about the system's bottlenecks\cJD{This full sentence is ambiguous}.
Analyzing the instruction mix, \unit[]{flop/s}, or memory throughput
(see Sections~\ref{ssec:eval_ops},~\ref{ssec:eval_flops}, and~\ref{ssec:eval_mem})
in a isolated fashion is not a good indication about the system's bottlenecks,
in an isolated fashion is a poor indicator of the system's bottlenecks,
and hence, especially when reasoning about FPU requirements, we also have to understand
the applications' compute-boundedness, which we evaluate in Section~\ref{ssec:eval_freq}.
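A standard lens for combining these isolated metrics is the well-known roofline model (a sketch of the general idea, not necessarily the exact methodology applied here): given an application's arithmetic intensity $I$ in flops per byte moved, its attainable performance is bounded by
\begin{equation*}
P_{\mathrm{attainable}} = \min\bigl(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}}\bigr),
\end{equation*}
so an application is compute-bound when $I \cdot B_{\mathrm{mem}} > P_{\mathrm{peak}}$ and memory-bound otherwise.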
Table~\ref{tb:Mtools} summarizes the primary metrics and the methods/tools used to collect them. Table~\ref{table:rest} includes additional metrics.
@@ -281,4 +281,4 @@ a direct effect of the high-bandwidth MCDRAM in \textit{cache} mode.
HPL & 24 & 1 & 271.794 & 181484.240 & 0 & 31919.479 & 189.37 & \,~~2.280 : 122.693 & 3.9 & 10 & 3 & 2.147 \\ \hline
\end{tabular}
\end{table*}
\end{comment}
\ No newline at end of file
\end{comment}
@@ -25,9 +25,9 @@ argue that convening on reporting relevant metrics would shift the focus of the
\end{figure}
\end{comment}
%
This paper highlights the diminishing relevance of \unit[]{flop/s} when
This paper highlights the diminishing relevance of \unit[]{flop/s} when
considering the actual requirements of representative proxy-apps.
The relevance of \unit[]{flop/s} on a given supercomputer can be further
The relevance of \unit[]{flop/s} on a given supercomputer can be further
diminished when considering how node-hours are spent yearly on
different scientific domains at supercomputing facilities.
Figure~\ref{fid:disc:breakdown} summarizes the breakdown of node-hours by
@@ -38,39 +38,39 @@ ANL's ALCF and \mbox{R-CCS's} K-computer would be achieving $\approx$14\% and
$\approx$11\%, respectively, of the peak \unit[]{flop/s} when projecting
for the annual node-hours. %oversimplification?
It is worth mentioning that the relevance of \unit[]{flop/s} is even more
of an issue for supercomputers to dedicated to specific workloads: the relevance of
of an issue for supercomputers dedicated to specific workloads: the relevance of
\unit[]{flop/s} can vary widely. For instance, a supercomputer dedicated
mainly to weather forecasting, e.g., the~\unit[18]{Pflop/s} system recently
mainly to weather forecasting, e.g., the~\unit[18]{Pflop/s} system recently
installed at Japan's Meteorological Agency~\cite{japan_meteorological_agency_jma_jma_2018},
should give minimal relevance to \unit[]{flop/s}: the proxy representing
this workload on that supercomputer achieves only $\approx$6\% of the peak \unit[]{flop/s},
as those workloads are typically memory-bound. On the other hand, a
supercomputer dedicated to AI/ML, such as ABCI, the world's 5\textsuperscript{th}
fastest supercomputer as of June 2018, would put high emphasize on \unit[]{flop/s}
fastest supercomputer as of June 2018, would put high emphasis on \unit[]{flop/s}
since deep learning workloads rely heavily on dense matrix multiplication operations.
\subsection{Memory-bound Applications}
As demonstrated in Figure~\ref{fig:flops}, the performance of memory-bound
applications is mostly not affected by the peak \unit[]{flop/s} available.
Accordingly, investment in data-centric architectures and programming models
As demonstrated in Figure~\ref{fig:flops}, the performance of memory-bound
applications is mostly not affected by the peak \unit[]{flop/s} available.
Accordingly, investment in data-centric architectures and programming models
should take priority over paying a premium for \unit[]{flop/s}-centric systems.
In one motivating instance, during the investigation that NASA Ames Research
Center conducted to identify planned upgrade of the Pleiades supercomputer in
In one motivating instance, during the investigation that NASA Ames Research
Center conducted to evaluate a planned upgrade of the Pleiades supercomputer in
2016~\cite{saini_performance_2016}, the study concluded that the performance gain from upgrading to
Intel Haswell processors was insignificant in comparison to using the older
Ivy Bridge-based processors (the newer processor offered double the peak
\unit[]{flop/s} at almost the same memory bandwidth). Hence, the choice was to perform only a partial upgrade to Haswell processors.
\subsection{Compute-bound Applications}
Investing more in data-centric architectures to accommodate memory-bound
applications can have a negative impact on the remaining minority of
Investing more in data-centric architectures to accommodate memory-bound
applications can have a negative impact on the remaining minority of
applications: compute-bound applications. Considering the market trends that
are already pushing away from dedicating the majority of chip area to
FP64 units, it is likely that libraries with compute-bound code (e.g., BLAS)
would support mixed precision or emulation by lower precision FPUs. The
remaining applications that do not relay on external libraries might suffer a
performance hit.
FP64 units, it is likely that libraries with compute-bound code (e.g., BLAS)
would support mixed precision or emulation by lower precision FPUs. The
remaining applications that do not rely on external libraries might suffer a
performance hit.
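As one illustration of what such emulation can look like in practice, the following sketch uses classic Kahan (compensated) summation -- a standard technique, not a claim about any specific BLAS implementation -- to recover much of double-precision accuracy from single-precision hardware:
\begin{verbatim}
/* Sketch: Kahan (compensated) summation in single precision -- one
 * classic way to approach double-precision accuracy on SP-only FPUs.
 * Compile without -ffast-math, which would cancel the compensation. */
#include <stdio.h>

static float kahan_sum(const float *x, int n) {
    float sum = 0.0f, c = 0.0f;   /* c carries the lost low-order bits */
    for (int i = 0; i < n; i++) {
        float y = x[i] - c;
        float t = sum + y;
        c = (t - sum) - y;        /* recover the part of y rounded away */
        sum = t;
    }
    return sum;
}

int main(void) {
    enum { N = 1000000 };
    static float x[N];
    double ref = 0.0;             /* double-precision reference */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f / (float)(i + 1);
        ref += 1.0 / (double)(i + 1);
    }
    float naive = 0.0f;
    for (int i = 0; i < N; i++) naive += x[i];
    printf("naive SP: %.7f, compensated SP: %.7f, DP ref: %.7f\n",
           naive, kahan_sum(x, N), ref);
    return 0;
}
\end{verbatim}
The compensated single-precision result tracks the double-precision reference far more closely than the naive loop, at the cost of a few extra flops per element.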
%\subsection{More Diversity in FPUs\cJD{not less?}}
% - option: suggest to ditch fp32 and emulate fp32 ops in fp64 units
......