CartiClus: Cartification-based Subspace Cluster Finder
This project is an implementation of the CartiClus algorithm, which was introduced in the paper:
"Cartification: A Neighborhood Preserving Transformation for Mining High Dimensional Data" by Emin Aksehirli, Bart Goethals, Emmanuel Müller, and Jilles Vreeken, in Proceedings of the Thirteenth IEEE International Conference on Data Mining (ICDM 2013), IEEE, 2013.
CartiClus is packaged as a runnable .jar file (carticlus.jar). The .jar file includes the source code as well. You can run the application from the command line with the following command:
java -jar carticlus.jar data-file k minsup numOfdimensions [cartLog] [outputfile]
The following command cartifies the file without running the mining step. It creates the cartified files in the same directory as the source file.
java -cp carticlus.jar cart.CartifierDriver data-file k
CartiClus accepts parameters as command line arguments in a specified order. If the optional parameters are omitted, their default values will be used instead.
data-file: Path to the multi-dimensional data file that will be cartified. The required properties of the data file are described below.
k: Parameter for the k nearest neighbors.
minsup: Minimum support count for the mining step. This is an absolute count, not a percentage.
numOfdimensions: The number of dimensions in the data-file.
cartLog: (optional) File to which the output of the mining step is directed instead of /dev/null.
outputfile: (optional) File to which the output is directed instead of standard output.
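Because the arguments are positional, it can help to assemble them programmatically when running many experiments. The following is a minimal sketch, not part of CartiClus itself: the class `CartiClusCommand` and its `build` method are hypothetical helpers, and the rule that outputfile requires cartLog is an assumption derived from the argument order shown above.

```java
import java.util.ArrayList;
import java.util.List;

class CartiClusCommand {
    /**
     * Builds the command line in the documented argument order.
     * cartLog and outputFile may be null to fall back to the defaults.
     */
    static List<String> build(String jarPath, String dataFile, int k, int minsup,
                              int numOfDimensions, String cartLog, String outputFile) {
        // Assumption: arguments are positional, so outputfile cannot be
        // given unless cartLog is given before it.
        if (cartLog == null && outputFile != null) {
            throw new IllegalArgumentException("outputfile requires cartLog to be set");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add(dataFile);
        cmd.add(Integer.toString(k));
        cmd.add(Integer.toString(minsup));
        cmd.add(Integer.toString(numOfDimensions));
        if (cartLog != null) {
            cmd.add(cartLog);
        }
        if (outputFile != null) {
            cmd.add(outputFile);
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("carticlus.jar", "data/10c20d.mime", 125, 300, 20, null, null);
        // Prints: java -jar carticlus.jar data/10c20d.mime 125 300 20
        System.out.println(String.join(" ", cmd));
        // To launch it: new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The returned list can be passed directly to `java.lang.ProcessBuilder`, which avoids shell quoting problems with paths that contain spaces.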
About the data file:
- it includes space-separated real values,
- each row represents an instance,
- each column represents a feature/dimension,
- it does not include any header rows,
- all instances have the same number of features,
- there are no missing values,
- real values are written in the US locale (use . as the decimal separator).
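The properties listed above can be checked before running CartiClus. The sketch below is an illustration only and is not how CartiClus itself reads its input: `DataFileCheck` and `parse` are hypothetical names, and skipping blank lines is a leniency of this sketch rather than documented behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class DataFileCheck {
    /** Parses the lines of a data file and enforces the listed properties. */
    static double[][] parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            if (line.trim().isEmpty()) {
                continue; // leniency of this sketch: ignore blank lines
            }
            String[] tokens = line.trim().split(" +");
            double[] row = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                try {
                    // Double.parseDouble is locale independent and expects '.'
                    // as the decimal separator, so "0,5" or a header is rejected.
                    row[i] = Double.parseDouble(tokens[i]);
                } catch (NumberFormatException e) {
                    throw new IllegalArgumentException(
                        "line " + lineNo + ": not a real value: " + tokens[i]);
                }
                if (Double.isNaN(row[i])) {
                    throw new IllegalArgumentException("line " + lineNo + ": missing value");
                }
            }
            // Every instance must have the same number of features.
            if (!rows.isEmpty() && rows.get(0).length != row.length) {
                throw new IllegalArgumentException(
                    "line " + lineNo + ": expected " + rows.get(0).length
                    + " values but found " + row.length);
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) throws IOException {
        double[][] data = parse(Files.readAllLines(Paths.get(args[0])));
        int dims = data.length == 0 ? 0 : data[0].length;
        System.out.println(data.length + " instances, " + dims + " dimensions");
    }
}
```

The number of dimensions it reports is the value to pass as numOfdimensions.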
CartiClus outputs the found clusters to the standard output. Each line of output represents a subspace cluster. Output format:
Subspaces for cluster [Size of cluster] Objects of the cluster
For example, 1 1 0 0 0 1 0 [10] 0 1 2 3 4 5 6 7 8 9 means that a cluster is detected in the 1st, 2nd, and 6th subspaces and that it has 10 objects, namely 0 1 2 3 4 5 6 7 8 9.
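For post-processing, an output line can be split back into its three parts. The following is a minimal sketch under the assumption that the cluster size is written in square brackets, as in the format line above; the class `ClusterLine` and its fields are hypothetical names, not part of CartiClus.

```java
import java.util.ArrayList;
import java.util.List;

class ClusterLine {
    final List<Integer> subspaces = new ArrayList<>(); // 1-based positions of the 1 flags
    final List<Integer> objects = new ArrayList<>();   // object ids of the cluster
    int size;                                          // value of the [size] field

    /** Parses one line of the form: flags [size] objects. */
    static ClusterLine parse(String line) {
        ClusterLine c = new ClusterLine();
        String[] tokens = line.trim().split("\\s+");
        int i = 0;
        // Leading 0/1 flags: one per subspace, up to the bracketed size.
        for (; i < tokens.length && !tokens[i].startsWith("["); i++) {
            if (tokens[i].equals("1")) {
                c.subspaces.add(i + 1);
            } else if (!tokens[i].equals("0")) {
                throw new IllegalArgumentException("expected 0 or 1 but found " + tokens[i]);
            }
        }
        if (i == tokens.length || !tokens[i].endsWith("]")) {
            throw new IllegalArgumentException("no [size] field found");
        }
        c.size = Integer.parseInt(tokens[i].substring(1, tokens[i].length() - 1));
        // Everything after the size field is an object id.
        for (i++; i < tokens.length; i++) {
            c.objects.add(Integer.parseInt(tokens[i]));
        }
        return c;
    }

    public static void main(String[] args) {
        ClusterLine c = parse("1 1 0 0 0 1 0 [10] 0 1 2 3 4 5 6 7 8 9");
        // Prints: subspaces=[1, 2, 6] size=10 objects=10
        System.out.println("subspaces=" + c.subspaces + " size=" + c.size
            + " objects=" + c.objects.size());
    }
}
```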
Example:
java -jar carticlus.jar data/10c20d.mime 125 300 20
The artificial datasets with many irrelevant dimensions are available [here][datasets]. For the datasets from OpenSubspace, please refer to http://dme.rwth-aachen.de/en/OpenSubspace/evaluation
The code repository of the project is located at https://gitlab.com/adrem/carticlus