Commit 7cae3cbd authored by Jaroslaw Zola

Release 3.001. Fully parallel and highly scalable.

parent 1f89b016
-std=c++14
-ISABNAtk/include
-Iinclude
-Isrc
-I.
3.001 - 2019.01.16
* Release 3.001.
This is the last release focusing on shared memory systems.
We will be providing bug fixes if necessary, but we will not
be developing new features.
* New parallel and efficient structure search algorithm.
* OMP migrated to Intel TBB for even better parallel performance.
* Support for Windows cross-compiling removed.
2.103 - 2019.04.25
* A set of small fixes to SABNA tools.
......
CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
CMAKE_MINIMUM_REQUIRED(VERSION 3.1)
PROJECT(SABNA CXX)
SET(CMAKE_CXX_STANDARD 14)
SET(CMAKE_CXX_STANDARD 17)
SET(CMAKE_CXX_STANDARD_REQUIRED ON)
SET(CMAKE_CXX_EXTENSIONS OFF)
SET(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake/")
ADD_SUBDIRECTORY(SABNAtk)
IF (NOT CMAKE_BUILD_TYPE)
SET(CMAKE_BUILD_TYPE "Release")
ENDIF()
SET(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")
SET(CMAKE_CXX_FLAGS_RELEASE "-O3")
INCLUDE(CheckCXXCompilerFlag)
CHECK_CXX_SOURCE_COMPILES("int main() { __builtin_popcountl(0xFFFF); return 0; }" HAS_POPCOUNT)
SET(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -pthread -O0")
SET(CMAKE_CXX_FLAGS_RELEASE "-g -pthread -O3")
IF (HAS_POPCOUNT)
SET(CXX_NO_SSE "${CMAKE_CXX_FLAGS}")
SET(CMAKE_CXX_FLAGS "-msse4 ${CMAKE_CXX_FLAGS}")
CHECK_CXX_SOURCE_COMPILES("int main() { return 0; }" HAS_SSE)
IF (HAS_SSE)
MESSAGE(STATUS "SSE enabled via -msse4")
ELSE()
SET(CMAKE_CXX_FLAGS "${CXX_NO_SEE}")
MESSAGE(WARNING "No SSE, performance will be affected!")
ENDIF()
ELSE()
MESSAGE(FATAL_ERROR "__builtin_popcountl intrinsic required!")
ENDIF()
FIND_PACKAGE(TBB REQUIRED)
FIND_PACKAGE(OpenMP)
IF (OPENMP_FOUND OR OpenMP_FOUND)
SET(CMAKE_CXX_FLAGS "${OpenMP_CXX_FLAGS} ${CMAKE_CXX_FLAGS}")
SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
IF (TBB_FOUND AND TBB_MALLOC_FOUND)
MESSAGE(STATUS "Found Intel TBB ${TBB_VERSION_MAJOR}.${TBB_VERSION_MINOR}")
SET(CMAKE_CXX_FLAGS "-I${TBB_INCLUDE_DIRS} ${CMAKE_CXX_FLAGS}")
ELSE()
MESSAGE(WARNING "No OpenMP, performance will be affected!")
MESSAGE(FATAL_ERROR "Intel TBB required!")
ENDIF()
INCLUDE_DIRECTORIES(SABNAtk/include/ include/ .)
......
The MIT License (MIT)
Copyright (c) 2016-2018 SCoRe Group http://www.score-group.org/
Copyright (c) 2016-2020 SCoRe Group http://www.score-group.org/
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......
......@@ -15,6 +15,8 @@ Grant Iraci <[email protected]>
SABNA is an open source software suite of efficient algorithms for exact (i.e., globally optimal) Bayesian networks learning. The main idea behind the toolkit is to combine various data optimization techniques and advanced parallel algorithms to achieve scalable implementations capable of processing data instances with hundreds of variables.
All development efforts going into SABNA have been supported by the [National Science Foundation (NSF)](https://www.nsf.gov/) under the award [OAC-1845840](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1845840).
## Quick Install
......@@ -25,14 +27,14 @@ Run `./build.sh` in the project's root directory. The script will create `./bin/
Follow these steps:
1. Make sure you have a recent C++ compiler with support for C++14, SIMD vectorization, and OpenMP.
2. Make sure you have `cmake` 2.8 or newer.
3. Enter SABNA root directory.
4. Optional: edit `config.h` to adjust compilation.
1. Make sure you have a recent C++ compiler that supports C++17 and SIMD vectorization, and that can work with the [Intel TBB Library](https://software.intel.com/en-us/tbb).
2. Make sure that you have Intel TBB installed and exposed to the compiler. Note that all major Linux distributions provide TBB out of the box.
3. Make sure that you have `cmake` 3.1 or newer.
4. Enter SABNA root directory.
5. Invoke `build.sh` (see examples below).
6. Enjoy!
The package is implemented in C++14 and it can be compiled using any standard-conforming compiler. Support for SIMD SSE and OpenMP are highly recommended for improved performance (but are not strictly required). We tested extensively `g++` >= 7.3.0, `clang++` >= 4.0.1 and Intel `icpc` >= 18.0.2. If you are familiar with the `cmake` toolchain, you can use standard `cmake` options to customize the installation process, either by editing `build.sh` or setting environment variables.
The package is implemented in C++17 with [Intel TBB](https://software.intel.com/en-us/tbb). It can be compiled using any standard-conforming compiler, provided that Intel TBB is available to it. Support for SIMD SSE is highly recommended for improved performance (but is not strictly required). We extensively tested `g++` >= 7.3.0, `clang++` >= 4.0.1, and Intel `icpc` >= 18.0.2, with Intel TBB 19.x and 20.x. If you are familiar with the `cmake` toolchain, you can use standard `cmake` options to customize the installation process, either by editing `build.sh` or by setting environment variables.
Examples:
......@@ -46,7 +48,7 @@ Examples:
SABNA provides a set of standalone tools, each implementing different functionality. All tools come with an intuitive command line interface. Use `--help` to get a short description of the available CLI options for each tool.
* `sabna-exsl-mpsbuild` precomputes maximal parent sets (MPS) using a selected scoring function (currently available: AIC, MDL, or BDeu) for any categorical data provided in a csv file.
* `sabna-exsl-bfs-ope` finds optimal network structure for a given MPS using the standard BFS with our _Optimal Path Extension_ (OPE) technique. This is currently the most efficient structure search method.
* `sabna-exsl-ssearch` finds optimal network structure for a given MPS using our _Optimal Path Extension_ (OPE) technique combined with decomposition by strongly connected components. This is currently the most efficient structure search method.
* `sabna-order2net` converts input ordering into the corresponding DAG written in bif or sif format.
### Example
......@@ -59,14 +61,15 @@ The example below shows how to learn an exact BN under the MDL score using BFS w
#### End-to-end example
The example below shows how to learn exact BN given some categorical data (with variables and their states represented by strings). We use `asia.csv` dataset with 8 variables and 200 observations (available in `data/`).
The example below shows how to learn an exact BN given some categorical data (with variables and their states represented by strings). We use the `asia.csv` dataset with 8 variables and 200 observations (available in `data/`). Note that below we do not use the actual paths to the tools and data. The complete working workflow can be found in `./test.sh`.
1. Convert the input data into the SABNA-compatible format: `./csv-prepare -H True asia.csv asia`
2. Run SABNA learning steps on the resulting data:
* `./sabna-exsl-mpsbuild --csv-file asia/asia.sabna.csv --mps-file asia.sabna.mps`
* `./sabna-exsl-bfs-ope -n 8 --mps-file asia.sabna.mps --ord-file asia.sabna.ord`
* `./sabna-order2net --csv-file asia/asia.sabna.csv --mps-file asia.sabna.mps --ord-file asia.sabna.ord --net-name asia.sabna --format net`
3. Convert final network back to annotated format: `./net-format.py --map-variables asia/asia.sabna.variables --map-states asia/asia.sabna.states asia.sabna.net asia.net`
* `./sabna-exsl-ssearch -n 8 --mps-file asia.sabna.mps --ord-file asia.sabna.ord`
* `./sabna-order2net --csv-file asia/asia.sabna.csv --mps-file asia.sabna.mps --ord-file asia.sabna.ord --net-name asia.sabna --format sif`
* `./sabna-pl-mle --csv-file asia/asia.sabna.csv --sif-file asia.sabna.sif --bif-file asia.sabna.bif`
3. Convert the final network back to the annotated format: `./bif-format.py --map-variables asia/asia.sabna.variables --map-states asia/asia.sabna.states asia.sabna.bif asia.bif`
## Priors
......
CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
CMAKE_MINIMUM_REQUIRED(VERSION 3.1)
PROJECT(sabnatk CXX)
SET(CMAKE_CXX_STANDARD 14)
SET(CMAKE_CXX_STANDARD_REQUIRED ON)
SET(CMAKE_CXX_EXTENSIONS OFF)
......@@ -14,8 +15,9 @@ SET(CMAKE_CXX_FLAGS_RELEASE "-O3")
INCLUDE(CheckCXXCompilerFlag)
CHECK_CXX_SOURCE_COMPILES("int main() { __builtin_popcountl(0xFFFF); return 0; }" HAS_POPCOUNT)
CHECK_CXX_SOURCE_COMPILES("int main() { __builtin_clzll(1LL << 63); return 0; }" HAS_CLZ)
CHECK_CXX_SOURCE_COMPILES("int main() { return __builtin_popcountl(0xFFFF); }" HAS_POPCOUNT)
CHECK_CXX_SOURCE_COMPILES("int main() { return __builtin_clzll(1LL << 63); }" HAS_CLZ)
CHECK_CXX_SOURCE_COMPILES("int main() { return __builtin_ffsll(1L << 63); }" HAS_FFS)
IF (HAS_POPCOUNT)
ELSE()
......@@ -27,6 +29,11 @@ ELSE()
MESSAGE(FATAL_ERROR "__builtin_clzll intrinsic required!")
ENDIF()
IF (HAS_FFS)
ELSE()
MESSAGE(FATAL_ERROR "__builtin_ffsll intrinsic required!")
ENDIF()
SET(CXX_NO_SSE "${CMAKE_CXX_FLAGS}")
SET(CMAKE_CXX_FLAGS "-msse4 ${CMAKE_CXX_FLAGS}")
......@@ -39,22 +46,27 @@ ELSE()
MESSAGE(WARNING "No SSE, performance will be affected!")
ENDIF()
#FIND_PACKAGE(Boost 1.59 COMPONENTS python)
#IF (BOOST_STATIC)
# MESSAGE(STATUS "Using static Boost libraries")
# SET(Boost_USE_STATIC_LIBS ON)
#ENDIF()
#FIND_PACKAGE(Boost 1.67 COMPONENTS python37)
#IF (Boost_FOUND)
# INCLUDE_DIRECTORIES(${Boost_INCLUDE_DIRS})
#ELSE()
# MESSAGE(WARNING "No Boost.Python support!")
#ENDIF()
#FIND_PACKAGE(PythonLibs 2.7)
#FIND_PACKAGE(PythonLibs 3.7)
INCLUDE_DIRECTORIES(include)
IF (PYTHONLIBS_FOUND AND Boost_FOUND)
INCLUDE_DIRECTORIES(${PYTHON_INCLUDE_DIRS})
ADD_SUBDIRECTORY(src)
ADD_SUBDIRECTORY(examples)
ELSE()
#IF (PYTHONLIBS_FOUND AND Boost_FOUND)
# INCLUDE_DIRECTORIES(${PYTHON_INCLUDE_DIRS})
# ADD_SUBDIRECTORY(src)
# ADD_SUBDIRECTORY(examples)
#ELSE()
# MESSAGE(WARNING "Skipping Python support!")
# INSTALL(CODE "MESSAGE(\"Nothing to install...\")")
ENDIF()
#ENDIF()
......@@ -5,15 +5,100 @@ Subhadeep Karan <[email protected]>,
Matthew Eichhorn <[email protected]>,
Blake Hurlburt <[email protected]>,
Grant Iraci <[email protected]>,
Mohammad Umair <[email protected]>,
Jaroslaw Zola <[email protected]>
## About
SABNAtk is a small C++14 library, together with Python bindings, to efficiently execute counting queries over categorical data. Such queries are common to Machine Learning applications, for example, they show up in Probabilistic Graphical Models, regression analysis, etc.. In practical applications, SABNAtk [significantly outperforms](https://gitlab.com/SCoRe-Group/SABNAtk-Benchmarks) typical approaches based on e.g., hash tables or ADtrees. Currently, SABNAtk is powering [SABNA](https://gitlab.com/SCoRe-Group/SABNA-Release), our Bayesian networks learning engine. We are working on further improving performance, so stay tuned!
SABNAtk is a small C++14 library, together with Python 3.7 bindings, to efficiently execute counting queries over categorical data. Such queries are common in Machine Learning applications; for example, they show up in Probabilistic Graphical Models, regression analysis, etc. In practical applications, SABNAtk [significantly outperforms](https://gitlab.com/SCoRe-Group/SABNAtk-Benchmarks) typical approaches based on, e.g., hash tables or ADtrees. Currently, SABNAtk is powering [SABNA](https://gitlab.com/SCoRe-Group/SABNA-Release), our Bayesian networks learning engine. We are working on further improving performance (including support for GPGPUs), so stay tuned!
This project has been supported by [NSF](https://www.nsf.gov/) under the award [OAC-1845840](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1845840).
## User Guide
In preparation; we will provide extended documentation soon. Please refer to the `examples/` directory for several usage examples. If you have immediate questions, please do not hesitate to contact Jaric Zola <[email protected]>.
### Quick Install
* If you are impatient, run `./build.sh` in the project root directory. If the `Boost.Python` library is installed in a non-standard location, you may want to edit `build.sh` to specify `BOOST_ROOT`.
### Install
1. Make sure you have a recent C++ compiler with support for C++14 and SIMD vectorization.
2. Make sure that you have `python >= 3.7` and the `Boost.Python >= 1.67` library installed; they are required if you want the Python bindings.
3. Make sure that you have `cmake >= 3.1` available.
4. Run `./build.sh` in the project root directory (see Quick Install).
5. If the run is successful, you should see the `sabnatk.so` object, which implements the Python bindings.
6. If you do not need the Python bindings, you can use the `include/` folder only, since from the C++ perspective `SABNAtk` is a header-only library.
7. To integrate `SABNAtk` with your C++ project, you can make the entire `SABNAtk` source tree a subfolder of your source tree and then use the `ADD_SUBDIRECTORY` directive in your `CMakeLists.txt`. Alternatively, you can copy the `include/` directory and pass it via the `-I` switch when compiling.
8. To integrate `SABNAtk` with your Python project, treat `sabnatk.so` as a Python module (see the documentation below).
### C++ API
If you have questions not answered by this documentation, please do not hesitate to contact Jaric Zola <[email protected]>.
To use `SABNAtk` we will follow this simple workflow (we assume that you are somewhat familiar with the original UAI 2018 paper on `SABNAtk`). First, we will create a counts enumeration object, either `BVCounter` or `RadCounter`. Then we will implement a functor to aggregate our counts, and finally we will combine the two and start applying queries.
1. The example below shows how to create a counts enumeration object. We recommend that you use `RadCounter`, as it offers the best performance most of the time:
#include <cstdint>
#include <vector>

#include <RadCounter.hpp>

// 3 variables (say X0, X1, X2), 8 observations, stored row-wise
std::vector<uint8_t> Data{ 0, 1, 1, 0, 1, 1, 0, 0,
                           0, 0, 2, 0, 1, 2, 0, 1,
                           1, 1, 1, 0, 1, 1, 1, 0 };

int n = 3;       // number of variables (rows)
int m = 8;       // number of observations (columns)
const int N = 1; // number of words used to manage the data

auto rad = create_RadCounter<N>(n, m, std::begin(Data));
In this example, `Data` is a sequence storing our data (any sequence supporting forward iteration will do). The data must be stored row-wise (i.e., one variable per row, with rows placed contiguously in memory). `n` is the number of variables (i.e., the number of rows in the data), and `m` is the number of columns (i.e., the number of instances/observations in the data). Parameter `N` tells `SABNAtk` how many words it should use to manage the data; specifically, we require that `n < 64 * N`. Note that once the `rad` object is created, `Data` is no longer needed.
2. To process the counts generated for a specific counting query we have to implement a functor that we will pass to our counts enumeration object. Consider a counting query with variables $`X_i`$ and $`Pa(X_i)`$, where $`Pa(X_i)`$ is a set of variables. `SABNAtk` extracts two types of counts: $`N_{ij}`$ and $`N_{ijk}`$. $`N_{ij}`$ is the count of data instances that support state $`j`$ of variables $`Pa(X_i)`$. For example, for `Data` and variables $`Pa(X_0)=\{X_1,X_2\}`$ we could assign state $`j=0`$ to $`(X_1=0,X_2=0)`$, $`j=1`$ to $`(X_1=0,X_2=1)`$, and so on, with a total of $`q_i=3 \times 2 = 6`$ states. Then, $`N_{00}`$ would be $`1`$ and $`N_{01}`$ would be $`3`$, with `Data` supporting $`5`$ different states of $`Pa(X_0)`$. $`N_{ijk}`$, on the other hand, is the count of instances such that variables $`Pa(X_i)`$ are in state $`j`$ and variable $`X_i`$ is in state $`k`$. For example, if we assign $`k=0`$ to $`X_0=0`$ and $`k=1`$ to $`X_0=1`$, then $`N_{010}`$ will be $`2`$. `SABNAtk` does not care what you do with the counts $`N_{ij}`$ and $`N_{ijk}`$, but it offers an extremely fast mechanism to deliver them. Consider a toy example of the log-likelihood score defined as: $`\mathcal{L}(X_i|Pa(X_i))=\sum_{j}\sum_{k}N_{ijk}\log\left(\frac{N_{ijk}}{N_{ij}}\right)`$. We can implement $`\mathcal{L}`$ as a functor:
#include <cmath>

class L {
public:
    typedef double score_type;

    // Count enumeration object will call this method
    // when starting counting
    void init(int ri, int qi) {
        score_ = 0.0; // we ignore ri and qi
    }

    // Count enumeration object will call this method
    // to finalize counting, and qi will be the actual number
    // of distinct states of Pa(Xi) found in the data
    void finalize(int qi) { }

    // Count enumeration object will call this method
    // when the count for a new state j of Pa(Xi) is obtained
    void operator()(int Nij) { }

    // Count enumeration object will call this method
    // when the count for a new state k of Xi is obtained;
    // Nij is the count for the corresponding state j of Pa(Xi)
    void operator()(int Nijk, int Nij) {
        // accumulate Nijk * log2(Nijk / Nij), using floating-point division
        score_ += Nijk * std::log2(static_cast<double>(Nijk) / Nij);
    }

    // Extra method to access the final result
    // (not used by the count enumeration object)
    score_type score() const { return score_; }

private:
    score_type score_;
};
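As a quick sanity check, the counts quoted above for the toy `Data` can be plugged into this functor by hand: for state $`j=1`$ of $`Pa(X_0)=\{X_1,X_2\}`$ (i.e., $`(X_1=0,X_2=1)`$) we have $`N_{01}=3`$ and $`N_{010}=2`$, hence $`N_{011}=1`$, so the two-argument `operator()` is called twice for this state and contributes $`2\log_2\left(\frac{2}{3}\right)+1\log_2\left(\frac{1}{3}\right)\approx -2.75`$ to the final score.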
`SABNAtk` requires that the user-specified functor supports the following methods:
* `void init(int ri, int qi);` This method is called at the beginning of counting. The count enumeration object passes `ri`, the number of states of the query variable $`X_i`$, and `qi`, the expected number of states of the query variables $`Pa(X_i)`$.
* `void finalize(int qi);` This method is called at the end of counting. The count enumeration object passes `qi`, the number of states of variables $`Pa(X_i)`$ actually observed in the data.
* `void operator()(int Nij);` This operator is called each time the total count of instances supporting state $`j`$ of variables $`Pa(X_i)`$ is obtained.
* `void operator()(int Nijk, int Nij);` This operator is called each time the count of instances with variables $`Pa(X_i)`$ in state $`j`$ and variable $`X_i`$ in state $`k`$ is obtained; `Nij` is the count for the corresponding state $`j`$ of $`Pa(X_i)`$.
### Python API
The fastest way to learn the Python API is to look at the code in the `examples/mdl.py`, `examples/bdeu.py` and `examples/bnscore.py` files. This section is still under construction! If you have immediate questions, please do not hesitate to contact Jaric Zola <[email protected]>.
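Until then, the sketch below condenses `examples/mdl.py` (shown further down in this commit) into the typical flow: read data in the SABNA csv format (one variable per row, integer states), initialize a scoring object, describe the sets $`X_i`$ and $`Pa(X_i)`$ as little-endian `bitarray`s, and query the score. The data path, the selected variable indices, and the choice of `MDL256` are illustrative only.

```python
import csv

from bitarray import bitarray
from sabnatk.BVCounter import MDL256  # examples/mdl.py uses MDL64, which supports <= 64 variables

# read input data (in SABNA csv format): one variable per row,
# integer states, space-separated values
n, m, D = 0, 0, []
with open("data/alarm.csv", "rt") as cf:  # illustrative path, as in examples/mdl.py
    for row in csv.reader(cf, delimiter=" "):
        n = n + 1
        m = len(row)
        D = D + list(map(int, row))

# create the scoring function and hand it the data
mdl = MDL256()
mdl.init(n, m, D)

# sets of variables are passed as little-endian bitarrays of length n
xi = bitarray(n, endian="little")
xi.setall(0)
xi[0] = 1

pa = bitarray(n, endian="little")
pa.setall(0)
pa[3] = 1
pa[5] = 1

# compute MDL(Xi|Pa(Xi))
S = mdl.score(xi.tobytes(), pa.tobytes())
print("MDL score:", S)
```

As the comment in `examples/mdl.py` notes, a `memoryview` is used internally on the byte buffers to eliminate extra copies; `examples/bnscore.py` shows that the `AIC*` and `BDeu*` classes expose the same `init()`/`score()` interface.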
## References
......
......@@ -8,6 +8,7 @@ rm -rf build/*
cd build/
# Pick one of the cmake calls depending on your Boost.Python configuration
# Use -DBOOST_STATIC=True to use static version of Boost libraries
cmake ../ -DCMAKE_INSTALL_PREFIX=$DIR
#cmake ../ -DCMAKE_INSTALL_PREFIX=$DIR \
......
......@@ -16,7 +16,7 @@ if __name__ == "__main__":
for row in rd:
n = n + 1
m = len(row)
D = D + map(int, row)
D = D + list(map(int, row))
bdeu = BDeu64()
......@@ -25,23 +25,23 @@ if __name__ == "__main__":
xi = bitarray(37, endian = "little")
xi.setall(0)
xi[0] = 1;
xi[7] = 1;
xi[9] = 1;
xi[0] = 1
xi[7] = 1
xi[9] = 1
pa = bitarray(37, endian = "little")
pa.setall(0)
pa[3] = 1;
pa[5] = 1;
pa[8] = 1;
pa[29] = 1;
pa[30] = 1;
pa[31] = 1;
pa[3] = 1
pa[5] = 1
pa[8] = 1
pa[29] = 1
pa[30] = 1
pa[31] = 1
S = bdeu.score(xi.tobytes(), pa.tobytes())
print "test 0.1:", S
print("test 0.1:", S)
bdeu.alpha = 10.0
S = bdeu.score(xi.tobytes(), pa.tobytes())
print "test 10:", S
print("test 10:", S)
#!/usr/bin/env python
__author__ = "Jaroslaw Zola"
__copyright__ = "Copyright (c) 2018 SCoRe Group http://www.score-group.org/"
__license__ = "MIT"
__version__ = "1.0.0"
__maintainer__ = "Jaroslaw Zola"
__email__ = "[email protected]"
__status__ = "Development"
import argparse
import csv
import os
import sys
from bitarray import bitarray
from sabnatk.BVCounter import AIC64
from sabnatk.BVCounter import BDeu64
from sabnatk.BVCounter import MDL64
from sabnatk.BVCounter import AIC256
from sabnatk.BVCounter import BDeu256
from sabnatk.BVCounter import MDL256
def extant_file(fname):
if not os.path.isfile(fname): raise argparse.ArgumentTypeError("file {0} not found".format(fname))
return fname
if __name__ == "__main__":
if len(sys.argv) != 3:
print("usage: bnscore.py data network")
exit(-1)
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("data", metavar="", help = "input csv file", type = extant_file)
parser.add_argument("network", metavar="", help = "input sif file", type = extant_file)
parser.add_argument("--sep", metavar="", help = "data separator, guessed by default", nargs='?', const = None, type = str)
parser.add_argument("-s", "--score", metavar="", help = "scoring function [aic|bdeu|mdl]", type = str, default = "mdl")
if len(sys.argv)==1:
parser.print_help()
sys.exit(-1)
data = sys.argv[1]
args = parser.parse_args()
data = args.data
net = args.network
sep = args.sep
score = args.score
# read data
X = {}
n = 0
m = 0
T = {}
with open(data, "rt") as csvfile:
if not csv.Sniffer().has_header(csvfile.read(100000)):
print("error: csv header missing")
sys.exit(-1)
csvfile.seek(0)
if not sep:
dialect = csv.Sniffer().sniff(csvfile.read(100000))
csvfile.seek(0)
sep = dialect.delimiter
h = next(csvfile)
h = h.replace("\n", "").replace("\r", "").split(sep)
n = len(h)
if (n > 255):
print("too many variables")
sys.exit(-1)
for xi in range(n):
X[h[xi]] = xi
T[xi] = []
for l in csvfile:
l = l.replace("\n", "").replace("\r", "").split(sep)
for xi in range(n):
T[xi].append(l[xi])
m = m + 1
# transform data
D = []
# read data
with open(data, "rt") as cf:
rd = csv.reader(cf, delimiter = ' ')
for row in rd:
n = n + 1
m = len(row)
D = D + row
for xi in range(n):
t = 0
M = {}
for val in T[xi]:
if val not in M:
D.append(t)
M[val] = t
t = t + 1
else:
D.append(M[val])
D = list(map(int, D))
# init graph
G = []
......@@ -38,25 +105,29 @@ if __name__ == "__main__":
G = G + [u]
# read network
net = sys.argv[2]
with open(net, "rt") as nf:
for l in nf.readlines():
s, _, t = l.rstrip().split(" ")
if (s != t):
G[int(t)][int(s)] = 1;
G[X[t]][X[s]] = 1;
# init score
s = MDL64()
s = MDL256()
if (score == "aic"):
s = AIC256()
elif (score == "bdeu"):
s = BDeu256()
s.init(n, m, D)
# get score
score = 0.0
S = 0.0
for i in range(n):
xi = bitarray(n, endian = "little")
xi.setall(0)
xi[i] = 1
score = score + s.score(xi.tobytes(), G[i].tobytes())[0][0]
S = S + s.score(xi, G[i])[0][0]
print(score)
print(score + " score: " + str(S) + "\n")
from sabnatk.BVCounter import MDL64
from sabnatk.RadCounter import BDeu64
from sabnatk.BVCounter import MDL256
from sabnatk.RadCounter import BDeu256
if __name__ == "__main__":
mdl = MDL64()
bdeu = BDeu64()
mdl = MDL256()
bdeu = BDeu256()
# we use bitarray to represent sets
from bitarray import bitarray
from sabnatk.BVCounter import MDL64
......@@ -5,6 +6,7 @@ import csv
if __name__ == "__main__":
# read input data (in SABNA csv format)
dat = "data/alarm.csv"
n = 0
......@@ -16,25 +18,31 @@ if __name__ == "__main__":
for row in rd:
n = n + 1
m = len(row)
D = D + map(int, row)
D = D + list(map(int, row))
# create scoring function with support for <=64 variables
mdl = MDL64()
mdl.init(n, m, D)
# create set of Xi variables
# endian must be little
xi = bitarray(37, endian = "little")
xi.setall(0)
xi[0] = 1;
xi[7] = 1;
xi[9] = 1;
xi[0] = 1
xi[7] = 1
xi[9] = 1
# create set of Pa(Xi) that is shared by all Xi
pa = bitarray(37, endian = "little")
pa.setall(0)
pa[3] = 1;
pa[5] = 1;
pa[8] = 1;
pa[29] = 1;
pa[30] = 1;
pa[31] = 1;
pa[3] = 1
pa[5] = 1
pa[8] = 1
pa[29] = 1
pa[30] = 1
pa[31] = 1
# compute MDL(Xi|Pa(Xi)) for all Xi
# internally we use memoryview to eliminate extra memory copies
S = mdl.score(xi.tobytes(), pa.tobytes())
print "test:", S
print("test:", S)
......@@ -6,12 +6,17 @@ struct call {
// called by query engine before processing the stream
// engine passes ri, i.e. the number of states of Xi,
// and qi, the number of states parents of Xi MAY assume
void init(int ri, int qi) { count = 0; }
void init(int ri, int qi) {
std::cout << "ri=" << ri << ", qi=" << qi << std::endl;
count = 0;
} // init
// called by query engine after processing of the stream is done
// engine passes qi, the ACTUAL number of states
// that parents of Xi assumed
void finalize(int qi) { }
void finalize(int qi) {
std::cout << "qi=" << qi << std::endl;
} // finalize
void operator()(int Nij) {
std::cout << "call from CQE with Nij=" << Nij << std::endl;
......@@ -32,9 +37,9 @@ struct call {
int main(int argc, char* argv[]) {
// 3 variables (say X0, X1, X2), 8 observations
std::vector<char> D{0, 1, 1, 0, 1, 1, 0, 0, \
0, 0, 2, 0, 1, 2, 0, 1, \
1, 1, 1, 0, 1, 1, 1, 0 };
std::vector<uint8_t> D{ 0, 1, 1, 0, 1, 1, 0, 0, \
0, 0, 2, 0, 1, 2, 0, 1, \
1, 1, 1, 0, 1, 1, 1, 0 };
// use one word (64bit) because n < 64
BVCounter<1> bvc = create_BVCounter<1>(3, 8, std::begin(D));
......@@ -56,12 +61,17 @@ int main(int argc, char* argv[]) {
std::vector<char> spa{0, 1};
// callback for each xi
std::vector<call> C(1);
std::vector<call> C(set_size(xi));
// and here we go
bvc.apply(xi, pa, sxi, spa, C);
std::cout << C[0].score() << std::endl;
// and no state required
bvc.apply(xi, pa, C);
std::cout << C[0].score() << std::endl;
return 0;
} // main
......@@ -13,13 +13,13 @@ if __name__ == "__main__":
for row in rd:
n = n + 1
m = len(row)
D = D + map(int, row)
D = D + list(map(int, row))
q = Query256()
q.init(n, m, D)
L = q.run([0], [3, 5], [1], [0, 0])
print L
print(L)
L = q.run([0, 2], [3, 5], [1, 0], [0, 0])
print L
print(L)
......@@ -17,4 +17,4 @@ D = rbn(N, m)