# SparkIsomap: Scalable Manifold Learning with Apache Spark

**Authors:**
Frank Schoeneman <fvschoen@buffalo.edu>,
Jaroslaw Zola <jaroslaw.zola@hush.com>

## About
SparkIsomap is a tool to efficiently execute Isomap for learning manifolds from high-dimensional data. Isomap remains an important data analytics technique, as the vast majority of big data, coming from, e.g., high-performance high-fidelity numerical simulations, high-resolution scientific instruments, or Internet of Things streams, is the result of complex non-linear processes that can be characterized by complex manifolds. SparkIsomap can process datasets with tens to hundreds of thousands of high-dimensional points using a relatively small Spark cluster. The method is built on Apache Spark and is implemented entirely in PySpark, with compute-intensive linear algebra routines offloaded to BLAS.

## User Guide

SparkIsomap is a Python 2.7 application, self-contained in `SparkIsomap.py`. When executing SparkIsomap, you must provide the following command-line parameters:

* `-f` Input data (in .tsv format).
* `-o` Output file name. 
* `-e` Spark event log directory (must already exist).
* `-C` Spark checkpoint directory. 
* `-b` Submatrix block size. 
* `-p` Number of partitions. 
* `-n` Number of points. 
* `-D` Input data dimensionality. 
* `-k` Neighborhood size.
* `-d` Reduced dimensionality. 
* `-l` Maximum iterations for power iteration. 
* `-t` Convergence threshold for power iteration (provide $EXPONENT$ such that $10^{-EXPONENT}$ is the convergence threshold).
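
To illustrate how `-l` and `-t` interact, here is a minimal single-machine sketch of power iteration in plain Python. It is not taken from `SparkIsomap.py`: the function name `power_iteration`, the list-of-lists matrix, and the stopping rule (change in the eigenvalue estimate between consecutive iterations) are assumptions made for illustration, and the actual implementation may use a different convergence criterion.

```python
import math


def power_iteration(matrix, max_iter, exponent):
    """Estimate the dominant eigenpair of a square matrix (list of lists).

    max_iter plays the role of `-l`; exponent plays the role of `-t`,
    so the tolerance is 10**(-exponent). Hypothetical helper, for illustration.
    """
    tol = 10.0 ** (-exponent)           # e.g. exponent=9 gives 1e-9
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n        # unit-length starting vector
    eigval = None
    for iteration in range(1, max_iter + 1):
        # w = A v
        w = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # Rayleigh quotient v . (A v); valid because v has unit length
        new_eigval = sum(x * y for x, y in zip(v, w))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v, iteration    # v lies in the null space of A
        v = [x / norm for x in w]
        converged = eigval is not None and abs(new_eigval - eigval) < tol
        eigval = new_eigval
        if converged:
            return eigval, v, iteration
    return eigval, v, max_iter          # iteration budget exhausted


# Small symmetric example with the same settings as `-l 100 -t 9`
A = [[2.0, 1.0], [1.0, 3.0]]
eigval, eigvec, iterations = power_iteration(A, max_iter=100, exponent=9)
```

A larger exponent tightens the tolerance and typically requires more iterations, so `-l` acts as a safety cap when the requested tolerance is not reached.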

Example invocation (please make sure that Spark is using Python 2.7):

`spark-submit SparkIsomap.py -k 10 -n 50000 -D 3 -d 2 -l 100 -t 9 -b 1250 -p 1431 -f swiss50k.tsv -o swiss50k_d2.tsv -C chkpt_swiss50 -e elogs/`
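
If you need a small test input to try an invocation like the one above, the following sketch writes a synthetic 3-D swiss roll as a `.tsv` file. It rests on several assumptions: that the input holds one point per line with `D` tab-separated coordinates and no header (the exact layout expected by `SparkIsomap.py` is not specified here), and the helper name `write_swiss_roll_tsv` and the roll constants are hypothetical choices following a common swiss-roll parametrization.

```python
import math
import random


def write_swiss_roll_tsv(path, n_points, seed=0):
    """Write n_points synthetic 3-D swiss-roll samples to a TSV file.

    Assumed layout: one point per line, three tab-separated floats, no header.
    """
    rng = random.Random(seed)           # seeded for reproducible output
    with open(path, "w") as handle:
        for _ in range(n_points):
            # Angle along the spiral, drawn from [1.5*pi, 4.5*pi)
            t = 1.5 * math.pi * (1.0 + 2.0 * rng.random())
            # Position across the width of the roll, drawn from [0, 21)
            height = 21.0 * rng.random()
            point = (t * math.cos(t), height, t * math.sin(t))
            handle.write("\t".join("%.6f" % c for c in point) + "\n")
```

For instance, `write_swiss_roll_tsv("my_roll.tsv", 5000)` would produce a file to pass via `-f my_roll.tsv` together with `-n 5000 -D 3`; the remaining parameters would still need to be chosen for that dataset size.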

If you have immediate questions regarding the method or software, please do not hesitate to contact Jaric Zola <jaroslaw.zola@hush.com>.

## References

To cite SparkIsomap, refer to this repository and our paper:

* F. Schoeneman, J. Zola, _Scalable Manifold Learning for Big Data with Apache Spark_, IEEE International Conference on Big Data (IEEE BigData), pp. 272-281, 2018. <http://arxiv.org/abs/1808.10776>.