While traditional data mining techniques have been used extensively for discovering patterns in databases, they are not always suitable for incorporating user-specified constraints. To overcome this issue, recent research has begun connecting Data Mining to Constraint Programming (CP). Such cross-fertilization leads to a flexible way to tackle data mining tasks, such as itemset or association rule mining. In this paper, we introduce a new library for solving itemset mining problems with Choco Solver.
## Constraint Programming (CP)
Constraint Programming (CP) is a powerful paradigm for solving combinatorial optimization problems [@rossi2006handbook]. It provides a declarative approach to problem-solving by defining a set of variables, domains, and constraints that capture the problem's requirements. CP solvers explore the space of possible solutions systematically, leveraging powerful search algorithms and constraint propagation techniques to efficiently find valid solutions. The flexibility of CP allows for modeling a wide range of problems, including scheduling [@baptiste2001constraint], resource allocation [@zhang2013constraint], and planning [@van1999cplan]. Its ability to handle complex constraints, discrete variables, and global properties makes it particularly suitable for tackling real-world problems. CP has demonstrated remarkable success in various domains, offering a high-level modeling language and a diverse set of solving techniques. Its integration with other optimization methods and technologies further enhances its applicability and effectiveness. Overall, Constraint Programming is a valuable tool for addressing challenging optimization problems, offering a powerful approach to problem modeling, solving, and decision support.
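To make the variables/domains/constraints vocabulary concrete, here is a minimal sketch of a CP-style search written in plain Python. It is only an illustration under stated assumptions: the `solve` function, the toy problem, and the representation of a constraint as a `(scope, check)` pair are all hypothetical, and the sketch performs simple backtracking with constraint checking rather than the propagation used by real solvers such as Choco-solver.

```python
def solve(domains, constraints, assignment=None):
    """Yield every complete assignment satisfying all constraints (depth-first search)."""
    assignment = assignment or {}
    if len(assignment) == len(domains):
        yield dict(assignment)
        return
    # Pick the first variable that has no value yet.
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        # Check only the constraints whose variables are all assigned.
        if all(check(assignment)
               for scope, check in constraints
               if all(v in assignment for v in scope)):
            yield from solve(domains, constraints, assignment)
        del assignment[var]


# Toy problem (hypothetical): x, y, z in {0, 1, 2, 3} with x < y and x + y = z.
domains = {"x": range(4), "y": range(4), "z": range(4)}
constraints = [
    (("x", "y"), lambda a: a["x"] < a["y"]),
    (("x", "y", "z"), lambda a: a["x"] + a["y"] == a["z"]),
]

solutions = [(s["x"], s["y"], s["z"]) for s in solve(domains, constraints)]
```

The model (domains and constraints) is stated separately from the search procedure, which is the declarative aspect described above: changing the problem means editing the model, not the solver.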
## Itemset Mining
Itemset mining is a fundamental data mining technique.
In recent years, CP has proven effective for modelling and solving itemset mining problems [@guns2011itemset;@lazaar2016global;@ugarte2017skypattern] and sequence mining problems [@aaai12081;@seq2pat2022;@Frontiers2022;@DPM2022]. The main advantage of using CP rather than specialised approaches for solving pattern mining problems is that the user can easily add custom constraints without having to modify the underlying system. Multiple user-specified constraints have been proposed in the literature to model and solve several pattern mining problems.
# Statement of need
Having a generic prototypical approach that can be parameterized to declaratively and efficiently discover patterns of interest using the available constraint solving tools is crucial to promote the use of CP for itemset mining. Multiple constraints designed for different mining tasks have been proposed in recent years. However, few alternatives exist that bring all of these constraints together in the same place. A user interested in using these constraints in their own project would have to implement them from scratch, which takes time and may lead to bugs. To ease this burden, we propose a new CP library that gathers multiple reference constraints for itemset mining in the same place.
# Features and Functionality

We propose a new CP library called **Choco-Mining** that is based on Choco-solver [@prud2022choco]. The architecture of the library is illustrated in \autoref{fig:app}. As we can see, multiple constraints dedicated to different itemset mining tasks are available in the Choco-Mining library. Each constraint takes as input a transactional database $D$ and a vector of Boolean variables $x$ used for representing itemsets, where $x[i]$ represents the presence/absence of the item $i$ in the searched itemset. These constraints are then used to define the problem at hand in terms of constraint programming. The following constraints are available in Choco-Mining:
- $CoverSize_{D}(x,f)$ [@SchausAG17]: Given an integer variable $f$ that represents the frequency (denoted $freq$) of an itemset $x$, the constraint ensures that $f = freq(x)$.
- $CoverClosure_{D}(x)$ [@SchausAG17]: The constraint ensures that $x$ is closed w.r.t. the frequency, i.e., $\nexists ~y \supset x: freq(x) = freq(y)$.
- $AdequateClosure_{D,M}(x)$ [@ijcai2022p0261]: Given a set of measures $M$, the constraint ensures that $x$ is closed w.r.t. $M$, i.e., $\nexists~ y \supset x$ such that for all $m \in M : m(x) = m(y)$.
- $FrequentSubs_{D,s}(x)$ [@Belaid2BL19]: Given a frequency threshold $s$, the constraint ensures that all the subsets of $x$ are frequent, i.e., $\forall y \subset x : freq(y) \ge s$.
- $InfrequentSupers_{D,s}(x)$ [@Belaid2BL19]: Given a frequency threshold $s$, the constraint ensures that all the supersets of $x$ are infrequent, i.e., $\forall y \supset x : freq(y) < s$.
- $Generator_{D}(x)$ [@BelaidBL19]: The constraint ensures that $x$ is a generator, i.e., $\nexists ~y \subset x : freq(y) = freq(x)$.
- $ClosedDiversity_{D,\mathcal{H},j,s}(x)$ [@HienLALLOZ20]: Given a history of itemsets $\mathcal{H}$, a diversity threshold $j$ and a minimum frequency threshold $s$, the constraint ensures that $x$ is a diverse itemset (i.e., $\nexists ~y \in \mathcal{H} : jaccard(x,y) \ge j$), $x$ is closed w.r.t. the frequency and $freq(x) \ge s$.
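
The definitions behind several of these constraints can be checked by brute force on a small example. The sketch below is plain Python and does not use the Choco-Mining API: the toy database `D`, the threshold `s`, and the helper names (`freq`, `is_closed`, `is_generator`, `has_infrequent_supers`) are assumptions introduced for illustration only. It enumerates all itemsets exhaustively, whereas the library's constraints rely on propagation inside the solver.

```python
from itertools import combinations

# Toy transactional database (hypothetical data): one set of items per transaction.
D = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c", "d"},
    {"a", "b", "c"},
]
ITEMS = sorted(set().union(*D))


def freq(x):
    """Number of transactions of D that contain every item of x."""
    return sum(1 for t in D if x <= t)


def is_closed(x):
    """No strict superset of x has the same frequency (CoverClosure definition)."""
    # freq never increases when items are added, so one-item extensions suffice.
    return all(freq(x | {i}) < freq(x) for i in ITEMS if i not in x)


def is_generator(x):
    """No strict subset of x has the same frequency (Generator definition)."""
    # For the same reason, removing one item at a time is enough.
    return all(freq(x - {i}) > freq(x) for i in x)


def has_infrequent_supers(x, s):
    """Every strict superset of x has frequency below s (InfrequentSupers definition)."""
    return all(freq(x | {i}) < s for i in ITEMS if i not in x)


def all_itemsets():
    """Enumerate every subset of ITEMS, including the empty itemset."""
    for k in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            yield frozenset(combo)


s = 2  # minimum frequency threshold (assumed value)
frequent = [x for x in all_itemsets() if freq(x) >= s]
closed_frequent = {x for x in frequent if is_closed(x)}
frequent_generators = {x for x in frequent if is_generator(x)}
maximal_frequent = {x for x in frequent if has_infrequent_supers(x, s)}
```

Each mining task here is a conjunction of simple conditions on the same itemset, which mirrors how a task is modelled by posting several constraints on the same Boolean vector $x$.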
We can model different problems using these constraints. \autoref{fig:app} shows examples of mining tasks (in blue) with the constraints (in red) involved in their modelling: