Sydney Hauke · c608e26e
--- a/LDPC-decoder.md
+++ b/LDPC-decoder.md
+This project is being developed under the SDR Makerspace activity *DVB-S2 LDPC SIMD* by the **ReDS Institute** of Yverdon-les-Bains in Switzerland.
+
+# Introduction
+
+LDPC decoding is the most computationally demanding task in a DVB-S2 receiver chain. This work aims to provide an optimized decoder for DVB-S2 LDPC codewords. The decoder must offer error-correction performance close to the state of the art, together with high throughput and low latency.
+
+This repository hosts two software components :
+- A testing environment to validate developed decoders
+- A library of optimized LDPC decoders
+
+The [testing environment](https://github.com/blegal/Fast_LDPC_decoder_for_x86) has been developed by *Bertrand Legal et al.* to validate their own optimized LDPC decoders. It is being reused for this activity.
+This environment serves two main purposes: benchmarking the throughput of LDPC decoders and testing their error-correction performance.
+
+The library is a collection of LDPC decoders optimized with vector operations for x86 processors (SSE 1-4). It decodes DVB-S2 codes with code rates *1/2*, *8/9* and *9/10* on long codewords only (64800-bit codewords). It also offers two decoding strategies, **flooded** and **layered**, both of which were investigated for their performance.
+
+# How to use
+## Library
+In the root directory of the project, do :
+
+```
+mkdir -p build
+cd build
+cmake ..
+make -j4
+sudo make install
+```
+
+The header files and the library itself should now be installed in `/usr/local/include/` and `/usr/local/lib/` respectively.
+
+To develop with the library, include the following header :
+
+```c
+#include <reds-dvb-decoder.h>
+```
+
+First declare and initialize a decoder :
+
+```c
+REDS_DVB_DECODER_Decoder_t *decoder;
+REDS_DVB_DECODER_Scheduler_t *scheduler = REDS_DVB_DECODER_SCHEDULER_FLOODED;
+REDS_DVB_DECODER_Code_t *code = REDS_DVB_DECODER_CODE_DVB_S2_64800_32400;
+REDS_DVB_DECODER_Error_t error;
+
+error = REDS_DVB_DECODER_Init(decoder, scheduler, code);
+```
+
+Several schedulers and codes can be passed to the decoder initializer. The available schedulers are :
+- REDS_DVB_DECODER_SCHEDULER_FLOODED
+- REDS_DVB_DECODER_SCHEDULER_LAYERED
+
+The available codes are :
+- REDS_DVB_DECODER_CODE_DVB_S2_64800_32400
+- REDS_DVB_DECODER_CODE_DVB_S2_64800_7200
+- REDS_DVB_DECODER_CODE_DVB_S2_64800_6480
+
+Once the decoder is initialized you can pass it to the decode function like this :
+
+```c
+#define CODEWORD_LEN    64800
+
+REDS_DVB_DECODER_Decoder_t *decoder; /* initialized as shown earlier */
+char soft_bits[CODEWORD_LEN];
+char hard_bits[CODEWORD_LEN];
+int nb_iter = 10;
+
+/* 
+ * Fetch a codeword of soft bits encoded in signed 8 bit values 
+ */
+
+REDS_DVB_DECODER_Decode(decoder, soft_bits, hard_bits, nb_iter);
+
+/*
+ * Each byte of hard_bits now holds one decoded hard bit, either 0 or 1
+ */
+```
+
+Note that the **soft_bits and hard_bits arrays must have the correct length**, which depends on the code used. For example, with a 64800x32400 code, each array must hold 64800 elements, that is 64800 * sizeof(char) bytes. Failing to do so leads to undefined behavior.
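
The decode call expects soft bits as signed 8-bit values. If the demodulator produces floating-point LLRs, one possible way to obtain them is to scale, round and saturate each value. The sketch below is an assumption and not part of the library: the helper name `quantize_llr`, the scale factor and the symmetric clamp range of [-127, 127] are illustrative choices, and the sign convention and dynamic range the decoder actually expects should be checked before use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the library): scale, saturate and
 * round floating-point LLRs into signed 8-bit soft bits. */
void quantize_llr(const float *llr, signed char *soft_bits,
                  size_t len, float scale)
{
    assert(llr != NULL && soft_bits != NULL);

    for (size_t i = 0; i < len; i++) {
        float v = llr[i] * scale;

        /* Saturate to a symmetric range so that +x and -x map to
         * values of equal magnitude. */
        if (v > 127.0f)
            v = 127.0f;
        if (v < -127.0f)
            v = -127.0f;

        /* Round half away from zero; the int cast truncates. */
        int q = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
        soft_bits[i] = (signed char)q;
    }
}
```

The resulting buffer could then be passed as the `soft_bits` argument of the decode call, with a cast to `char *` if needed, since whether plain `char` is signed depends on the platform.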
+
+Once you are done with the decoder, free it with :
+
+```c
+REDS_DVB_DECODER_Terminate(decoder);
+```
+
+## Testing environment
+
+**The library is integrated in the testing environment**. To use the testing environment to evaluate the performance of the library's decoders, compile it as follows :
+
+```
+mkdir -p build
+cd build
+cmake ..
+make -j4
+```
+
+Change directory to `Fast_LDPC_decoder_for_x86` :
+
+```
+cd Fast_LDPC_decoder_for_x86
+```
+
+After compiling the testing environment, run the executable *main.icc* like so :
+
+`./main.icc -fixed -<implementation> -MS -iter <num iter> -thread <num threads> -encoder -min <min SNR> -max <max SNR>`
+
+There are multiple implementations to choose from :
+- layered (baseline)
+- layered-sse (optimized)
+- flooded (baseline)
+- flooded-sse (optimized)
+
+For example, to run the *layered-sse* implementation on 4 threads, with 10 iterations and an SNR range of 1 to 5 dB, execute the command :
+
+`./main.icc -fixed -layered-sse -MS -iter 10 -thread 4 -encoder -min 1.0 -max 5.0`
+
+Optionally, append the argument `-timer <time in seconds>` to make the environment run for the specified amount of time. This also enables the throughput measurements.
+
+# Performance
+## Error correction
+
+As the plots below show, the error-correction performance depends strongly on both :
+- the code rate used
+- the number of decoding iterations
+
+The data used for the plots has been extracted from the testing environment by varying the number of iterations, the code rate and the decoding strategy. Note that **OMS 0** is a reference decoder already integrated in the testing environment. This decoder uses layered scheduling with Offset-Min-Sum belief propagation and an offset of 0.
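
The Offset-Min-Sum rule mentioned above approximates the belief-propagation check-node update: each outgoing message takes the product of the signs of the other incoming messages and the smallest of their magnitudes, reduced by a fixed offset and floored at zero. The sketch below is a generic scalar illustration, not the code of the OMS 0 reference decoder or of this library; the function name and the `int` message type are assumptions made for readability.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Illustrative Offset-Min-Sum check-node update (assumed names).
 * in[]  : incoming variable-to-check messages
 * out[] : outgoing check-to-variable messages
 * deg   : check-node degree, must be at least 2 */
void oms_check_node_update(const int *in, int *out, int deg, int offset)
{
    assert(deg >= 2);

    int min1 = INT_MAX;   /* smallest magnitude */
    int min2 = INT_MAX;   /* second smallest magnitude */
    int min_idx = -1;     /* position of the smallest magnitude */
    int sign_prod = 1;    /* product of all incoming signs */

    for (int i = 0; i < deg; i++) {
        int mag = abs(in[i]);
        if (in[i] < 0)
            sign_prod = -sign_prod;
        if (mag < min1) {
            min2 = min1;
            min1 = mag;
            min_idx = i;
        } else if (mag < min2) {
            min2 = mag;
        }
    }

    for (int i = 0; i < deg; i++) {
        /* Exclude edge i itself: use min2 where edge i holds min1. */
        int mag = (i == min_idx ? min2 : min1) - offset;
        if (mag < 0)
            mag = 0;
        /* Removing edge i's own sign from the overall product. */
        int sign = (in[i] < 0) ? -sign_prod : sign_prod;
        out[i] = sign * mag;
    }
}
```

With `offset` set to 0 the rule reduces to plain Min-Sum. A larger offset shrinks every outgoing magnitude, which is the usual way to compensate for the overestimation Min-Sum introduces.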
+
+### Coderate 1/2
+
+Note how, at the same number of iterations, the layered and flooded decoders have different error-correction performance. When benchmarking their throughput, it is important to adjust their iteration counts so that throughput is compared at the same error-correction behavior.
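
To compare two decoders at equal error-correction behavior, their bit error rates have to be measured in the same way. A minimal sketch of such a measurement is given here, assuming the transmitted codeword is known and that both buffers hold one bit (0 or 1) per byte, as in the `hard_bits` output format. The helper name `bit_error_rate` is hypothetical and this is not how the testing environment itself is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fraction of positions where the decoded hard
 * bits differ from the known reference bits (one bit per byte). */
double bit_error_rate(const char *decoded, const char *reference,
                      size_t len)
{
    assert(len > 0);

    size_t errors = 0;
    for (size_t i = 0; i < len; i++) {
        if (decoded[i] != reference[i])
            errors++;
    }
    return (double)errors / (double)len;
}
```

In practice the error count would be accumulated over many codewords per SNR point before dividing, since a single 64800-bit codeword cannot resolve very low error rates.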
+
+![BER_vs_SNR_coderate_1_2_5_iterations.svg](uploads/3fb708586789d97424e2b7b85f43a8bf/BER_vs_SNR_coderate_1_2_5_iterations.svg)
+
+![BER_vs_SNR_coderate_1_2_10_iterations.svg](uploads/8023982487eb7ea8e4bdc64f8c0614b8/BER_vs_SNR_coderate_1_2_10_iterations.svg)
+
+![BER_vs_SNR_coderate_1_2_20_iterations.svg](uploads/555c5f0d523e52f6aa64186f86cdea96/BER_vs_SNR_coderate_1_2_20_iterations.svg)
+
+### Coderate 8/9
+
+![BER_vs_SNR_coderate_8_9_5_iterations.svg](uploads/ff57d21b7c41d7fb73967a2b6155556f/BER_vs_SNR_coderate_8_9_5_iterations.svg)
+
+![BER_vs_SNR_coderate_8_9_10_iterations.svg](uploads/44b253aa7dbc417388e3c4f1cdbeaba5/BER_vs_SNR_coderate_8_9_10_iterations.svg)
+
+![BER_vs_SNR_coderate_8_9_20_iterations.svg](uploads/7d686b46e813fe9c58a547050cabe221/BER_vs_SNR_coderate_8_9_20_iterations.svg)
+
+### Coderate 9/10
+
+![BER_vs_SNR_coderate_9_10_5_iterations.svg](uploads/13124b6f5274736ed812199e70f58617/BER_vs_SNR_coderate_9_10_5_iterations.svg)
+
+![BER_vs_SNR_coderate_9_10_10_iterations.svg](uploads/78b49e075f6bf7886a98694de49b3ea1/BER_vs_SNR_coderate_9_10_10_iterations.svg)
+
+![BER_vs_SNR_coderate_9_10_20_iterations.svg](uploads/bb0b9fd095360b486b9f42180a600168/BER_vs_SNR_coderate_9_10_20_iterations.svg)
+
+## Throughput and latency
+TODO: comparison between optimized and baseline code
+
+The reference machine has a quad-core i7-3770 Intel CPU. Its nominal frequency is 3.40 GHz and its maximum frequency is 3.9 GHz.
+
+### Baseline performance (4 threads enabled) :
+
+| Algorithm | Latency per codeword (ms) | Throughput (Mb/s) |
+| :---- | ------ | --- |
+| Flooded 20 iterations coderate 1/2 | 45.92 |  5.10 |
+| Layered 10 iterations coderate 1/2 |  22.33 | 11.06 |
+| Flooded 20 iterations coderate 8/9 | 42.71 | 6.03 |
+| Layered 10 iterations coderate 8/9 | 72.19 | 3.69 |
+| Flooded 20 iterations coderate 9/10 | 34.03 | 4.74 |
+| Layered 10 iterations coderate 9/10 |  79.28 | 3.32 |
+
+### Optimized performance (4 threads enabled) :
+
+| Algorithm | Latency per codeword (ms) | Throughput (Mb/s) |
+| :---- | ------ | --- |
+| Flooded 20 iterations coderate 1/2 | 12.84 | 8.29  |
+| Layered 10 iterations coderate 1/2 |  8.28 | 30.16 |
+| Flooded 20 iterations coderate 8/9 | 8.59 | 13.27 |
+| Layered 10 iterations coderate 8/9 | 5.13 | 53.08 |
+| Flooded 20 iterations coderate 9/10 | 9.51 | 11.61 |
+| Layered 10 iterations coderate 9/10 | 5.13 | 53.08 |
+