Commit 3512e38b authored by Jens Domke's avatar Jens Domke

general docu and license for the PAstudy

parent 9a5a83d7
# Links
- Spreadsheet: https://docs.google.com/spreadsheets/d/1un0TIi31LXI9yURmwobPkCOatNXTvVPofXbEu_W-SvA/
# Settings
```sh
source /opt/intel/parallel_studio_xe_2018.1.038/bin/psxevars.sh intel64 >/dev/null
export I_MPI_CC=icc
export I_MPI_CXX=icpc
export I_MPI_F77=ifort
export I_MPI_F90=ifort
ulimit -s unlimited
ulimit -n 4096
```
- `OMP_NUM_THREADS` **must** be several.
- The number of MPI processes **should** be one.
# MEMORY THROUGHPUTS : LYON0
```sh
lyon0 % perf stat -e l2_requests.miss sleep 1 2>&1 >/dev/null | grep l2_requests | tr -d ',' | awk '{printf ("%d B\n", $1 * 64)}'
487488 B
```
- Note: In order to calculate bandwidth, you must divide bytes by elapsed time outputted by `perf`.
- Each tile has LLC(L2) of 1MB.
- List of counters : https://github.com/TomTheBear/perfmondb/blob/master/KNL/KnightsLanding_core_V6.tsv
### MPI
```sh
lyon0% mkdir p
lyon0% mpiexec -n 8 bash -c 'perf stat -e mem_load_uops_retired.l3_miss sleep 1 >/dev/null 2>p/"$MPI_LOCALRANKID".txt'
lyon0% { for i in p/*.txt; do cat $i | egrep 'sec|miss' | tr -d ',' | sed -e 's/\s\+/ /g' | cut -d ' ' -f 2 | tr '\n' ' '; echo; done } | awk '{ s += $1 / $2 } END { printf ("%f GB/sec\n", s * 64 / (1000 ** 3)) }'
0.000300 GB/sec
```
# MEMORY THROUGHPUTS : mill\[0-1\] (KNM)
```sh
mill0 % perf stat -e cache-misses sleep 1 2>&1 >/dev/null | grep cache-misses | tr -d ',' | awk '{printf ("%d B\n", $1 * 64)}'
366528 B
```
- The `cache-misses` above is the same as the raw counter `r412e` of KNL's `l2_requests.miss`.
# MEMORY THROUGHPUTS : KIEV0
```sh
kiev0 % perf stat -e mem_load_uops_retired.l3_miss sleep 1 2>&1 >/dev/null | grep mem_load | tr -d ',' | awk '{printf ("%d B\n", $1 * 64)}'
10880 B
```
- Note: In order to calculate bandwidth, you must divide bytes by elapsed time outputted by `perf`.
- LLC is L3 of 30MB.
- This is following https://github.com/RRZE-HPC/likwid/blob/master/groups/broadwell/L3CACHE.txt
According to ["Detecting Memory-Boundedness with Hardware Performance Counters" Daniel Molka et al.]( http://www.readex.eu/wp-content/uploads/2017/06/ICPE2017_authors_version.pdf ) :
```
Therefore, the proportion of main memory accesses is severely underestimated by the OFFCORE_RESPONSE events.
However, the sum of the L3 hit and L3 miss events is very close to the number of L1 misses in both cases,
so the number of cache line transfers from the uncore to each core can be measured quite accurately.
```
# FLOP
- Note: `-knl` option must be replaced by `-bdw` on KIEV0.
- sde means [Intel SDE](https://software.intel.com/en-us/articles/intel-software-development-emulator)
- _Considering `FMA FLOP` and `MASKED FLOP` is future work._
- https://software.intel.com/en-us/articles/calculating-flop-using-intel-software-development-emulator-intel-sde
( https://matsulab.slack.com/files/U755Q4FC0/F8XL55459/calculate.py )
```py
#!/bin/python
##################################
#
# First, get result.txt:
# mpirun -np 1 bash -c '../../sde-external-8.12.0-2017-10-23-lin/sde64 -knl -iform 1 -omix tmp/"$MPI_LOCALRANKID".txt -- ./exe'
# for i in tmp/*.txt; do cat $i | egrep '\*total|elements' | sort -t ' ' -k1,1 -k 2rn | uniq -w 22; done >> result.txt
#
# Second, get time.txt
# (time mpirun -n 1 ./exe 2>/dev/null;)2>time.txt
#
##################################
f=open("result.txt","r")
f2=open("time.txt","r")
tmp=f.readline()
time=f2.readline()
time=f2.readline()
time=time.split('\t')
timem=time[1].split('m')[0]
times=time[1].split('m')[1]
times=times.split('s')[0]
#print(timem,times)
timem=float(timem)
times=float(times)
time=times+timem*60
element={}
while tmp!= '':
key=tmp.split(' ')[0]
number=tmp.split(' ')[-1]
number=int(number)
if element.has_key(key):
element[key]+=number
else:
element.setdefault(key, 0)
element[key]+=number
tmp=f.readline()
print('read file success!')
result={'single':0,'double':0,'int':0,'total':0}
for i in element:
temp=i.split('_')
if temp[-1] == 'masked':
temp[-1] = temp[-2]
if temp[0]=='*total':
continue
if temp[1][0]=='i':
len = int(temp[1][1:]) * int(temp[-1])
if len <= 64:
result['int']+=element[i]
else :
result['int']+=element[i] * len/64
else:
if temp[2]=='single':
result['single']+=element[i]*int(temp[-1])
else:
result['double']+=element[i]*int(temp[-1])
result['total']=result['single']+result['double']+result['int']
print(result)
print('Percentage of FP64: %2.2f'%(result['double']*100.0/result['total']))
print('Percentage of FP32: %2.2f'%(result['single']*100.0/result['total']))
print('Percentage of INT: %2.2f'%(result['int']*100.0/result['total']))
print('total time is: %fs'%(time))
print('Performance (sp): %f GFLOPS' % (result['single']/time/1000/1000/1000))
print('Performance (dp): %f GFLOPS' % (result['double']/time/1000/1000/1000))
print('TOTAL GFLOPS (sp): %f GFLOP' % (result['single']/1.0/1000/1000/1000))
print('TOTAL GFLOPS (dp): %f GFLOP' % (result['double']/1.0/1000/1000/1000))
f.close()
f2.close()
```
git clone --recurse-submodules git@gitlab.m.gsic.titech.ac.jp:Jens/precision.git
./inst/amg.sh
sudo tuned-adm profile latency-performance; tuned-adm verify
./run/amg/test.sh
BSD 3-Clause License
Copyright (c) 2018,
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# PAstudy
# Proxy-App Study
Proxy-App Study (content coming soon)
\ No newline at end of file
## Cloning the repo
```
git clone --recurse-submodules https://gitlab.com/domke/PAstudy.git
```
## Preparation of the environment
```
cd ./PAstudy/
vim ./conf/intel.cfg # configure correct path to Intel's Parallel Studio XE
vim ./conf/host.cfg # specify hostnames and CPU freqs (replace kiev,etc.)
./inst/_init.sh # follow instructions if printed (eps. freq. settings)
```
(Note: root access and setcap will be required; setcap usually does not work
on remote file systems, so the repo should be cloned onto a local disk)
## Examplified installation of a proxy-app
```
./inst/amg.sh
```
## Examplified modification of proxy-app configurations
* allows to specify binary names, changes in input settings, number of benchmark
trials, maximum running time of the benchmark, which performance tools is
executed, and MPI/OMP sweep configurations per host system, etc
```
vim ./conf/amg.sh
```
## Running one of the proxy-apps (e.g. AMG)
```
./run/amg/test.sh # run the MPI/OMP sweep test to find best config
vim ./conf/amg.sh # replace BESTCONF with config found in last step
./run/amg/best.sh # execute the presumably best MPI/OMP config many times
./run/amg/freq.sh # run freqency scaling tests
./run/amg/prof.sh # use performance tools to extract raw metrics
```
* all logs (results+raw data) will be placed in ./log/<hostname>/<test>/<proxy>
## Author's previously colleted results for reference
* untar the files with our raw data
```
cd ./PAstudy/
tar xjf paper/data/kiev3.tar.bz2 # Broadwell
tar xjf paper/data/lyon1.tar.bz2 # KNL
tar xjf paper/data/mill1.tar.bz2 # KNM
```
* afterwards, raw data will be in ./log/<hostname>
## Additional notes:
* this repo does not include 3rd party and proprietary libraries
* if a dependency is not met, the scripts usually point it out and ask the
user to install the missing lib/tool/etc.
* the license (see ./LICENSE) applies ONLY to all source code, logs, etc.
written/produced by the authors of this repo (for everything else: please,
refer to individual licenses of the submodules (proxy-apps) and other
libraries/dependencies after pulling those)
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment