Commit f74b2ec4 authored by Frank Schoeneman

Created branch.

# SparkIsomap: Scalable Manifold Learning with Apache Spark
**Authors:**
Frank Schoeneman <fvschoen@buffalo.edu>,
Jaroslaw Zola <jaroslaw.zola@hush.com>
## About
SparkIsomap is a tool for efficiently executing Isomap to learn manifolds from high-dimensional data. Isomap remains an important data analytics technique because the vast majority of big data, coming from, for example, high-performance, high-fidelity numerical simulations, high-resolution scientific instruments, or Internet of Things streams and feeds, is the result of complex non-linear processes that can be characterized by complex manifolds. SparkIsomap can process data sets with tens to hundreds of thousands of high-dimensional points on a relatively small Spark cluster. The method is built on Apache Spark, is implemented entirely in PySpark, and offloads compute-intensive linear algebra routines to BLAS.
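For readers new to the method, the following is a minimal single-machine sketch of classical Isomap in NumPy: build a k-nearest-neighbor graph, approximate geodesic distances with all-pairs shortest paths, then embed with classical multidimensional scaling. It is an illustration only and is not the distributed, blocked implementation in `SparkIsomap.py`. The function name `isomap` and its arguments are hypothetical, and it uses a dense eigendecomposition for brevity, whereas SparkIsomap relies on power iteration.

```python
import numpy as np


def isomap(X, k, d):
    """Illustrative single-machine Isomap: kNN graph -> geodesics -> classical MDS.

    X is an (n, D) array, k the neighborhood size, d the reduced dimensionality.
    Assumes the kNN graph is connected; otherwise distances stay infinite.
    """
    n = X.shape[0]

    # Pairwise Euclidean distances between all points.
    sq = np.sum(X * X, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    np.fill_diagonal(D, 0.0)

    # Keep only edges to each point's k nearest neighbors, then symmetrize.
    G = np.full((n, n), np.inf)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    G[rows, nbrs.ravel()] = D[rows, nbrs.ravel()]
    G = np.minimum(G, G.T)
    np.fill_diagonal(G, 0.0)

    # Floyd-Warshall: shortest paths in the graph approximate geodesics.
    for m in range(n):
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])

    # Classical MDS: double-center the squared geodesic distances.
    J = np.eye(n) - 1.0 / n
    B = -0.5 * J @ (G ** 2) @ J

    # Top-d eigenpairs (eigh returns eigenvalues in ascending order).
    w, V = np.linalg.eigh(B)
    w, V = w[::-1][:d], V[:, ::-1][:, :d]
    return V * np.sqrt(np.maximum(w, 0.0))
```

Calling `isomap(X, k=10, d=2)` on an `(n, 3)` array mirrors the roles of `-k`, `-D`, and `-d` described in the User Guide. Memory and time grow as O(n^2) and O(n^3) respectively, which is why a distributed approach is needed at larger scales.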
## User Guide
SparkIsomap is entirely self-contained in `SparkIsomap.py`. When executing SparkIsomap, provide the following command line parameters:
* `-f` Input data (in .tsv format).
* `-o` Output file name.
* `-e` Spark event log directory.
* `-C` Spark checkpoint directory.
* `-b` Submatrix block size.
* `-p` Number of partitions.
* `-n` Number of points.
* `-D` Input data dimensionality.
* `-k` Neighborhood size.
* `-d` Reduced dimensionality.
* `-l` Maximum iterations for power iteration.
* `-t` Convergence threshold for power iteration (provide EXPONENT such that 10^-EXPONENT is the convergence threshold; for example, `-t 9` sets the threshold to 10^-9).
Example invocation:
`python SparkIsomap.py -k 10 -n 50000 -D 3 -d 2 -l 100 -t 9 -b 1250 -p 1431 -f swiss50k.tsv -o swiss50k_d2.tsv -C chkpt_swiss50 -e elogs/`
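To make the `-l` and `-t` parameters concrete, here is a minimal sketch of power iteration for a single dominant eigenpair of a symmetric matrix. The iteration cap plays the role of `-l`, and the exponent is turned into a threshold of `10**-exponent` as for `-t`. The function name `power_iteration`, the random start vector, and the stopping rule (change between successive normalized iterates) are assumptions made for illustration; the exact criterion used in `SparkIsomap.py`, which extracts `d` eigenvectors, may differ.

```python
import numpy as np


def power_iteration(A, max_iter, exponent, seed=0):
    """Dominant eigenpair of a symmetric matrix A (illustrative sketch).

    Stops after max_iter steps, or earlier once the change between
    successive unit-norm iterates drops below 10**-exponent.
    Returns (eigenvalue estimate, eigenvector estimate, iterations used).
    """
    tol = 10.0 ** (-exponent)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)

    it = 0
    for it in range(1, max_iter + 1):
        w = A @ v
        w /= np.linalg.norm(w)
        # Compare up to sign, since a negative eigenvalue flips the iterate.
        delta = min(np.linalg.norm(w - v), np.linalg.norm(w + v))
        v = w
        if delta < tol:
            break

    # Rayleigh quotient gives the eigenvalue estimate for the unit vector v.
    return float(v @ A @ v), v, it
```

With `max_iter=100` and `exponent=9`, matching `-l 100 -t 9` in the invocation above, the loop stops as soon as the iterate changes by less than 10^-9 and never runs more than 100 steps.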
If you have immediate questions regarding the method or software, please do not hesitate to contact Jaric Zola <jaroslaw.zola@hush.com>.
## References
To cite SparkIsomap, refer to this repository and our paper:
* F. Schoeneman, J. Zola, _Scalable Manifold Learning for Big Data with Apache Spark_, 2018. <http://arxiv.org/abs/1808.10776>.