Commit 76ea38bb authored by Jaroslaw Zola

Small code/documentation cleaning.

parent 5e91c102
Frank Schoeneman <fvschoen@buffalo.edu>,
Jaroslaw Zola <jaroslaw.zola@hush.com>
## About
SparkIsomap is a tool to efficiently execute Isomap for learning manifolds from high-dimensional data. Isomap remains an important data analytics technique, as the vast majority of big data, coming, for example, from high-performance, high-fidelity numerical simulations, high-resolution scientific instruments (microscopes, DNA sequencers, etc.), or Internet of Things streams and feeds, is the result of complex non-linear processes that can be characterized by complex manifolds. SparkIsomap can be used to process data sets with tens to hundreds of thousands of high-dimensional points using a relatively small Spark cluster. The method is implemented entirely in PySpark on top of Apache Spark, and offloads compute-intensive linear algebra routines to BLAS.
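As a rough illustration of that last point, the sketch below, which is our example and not code taken from SparkIsomap, shows how a PySpark job can keep data as NumPy submatrix blocks, so that the expensive step of each task (here, a block pairwise-distance computation) executes inside BLAS. The block shapes and function names are illustrative assumptions.

```python
import numpy as np
from pyspark.sql import SparkSession

def block_distances(pair):
    # Unpack two submatrix blocks of points; the heavy lifting,
    # the matrix product A.dot(B.T), runs in BLAS through NumPy.
    (i, A), (j, B) = pair
    sq = (A * A).sum(axis=1)[:, None] + (B * B).sum(axis=1)[None, :]
    d2 = sq - 2.0 * A.dot(B.T)  # BLAS gemm under the hood
    return (i, j), np.sqrt(np.maximum(d2, 0.0))

spark = SparkSession.builder.appName("blas-offload-sketch").getOrCreate()
rng = np.random.default_rng(0)
# Four blocks of 1250 points in 3 dimensions (mirroring -b 1250 and -D 3 below).
blocks = spark.sparkContext.parallelize(
    [(i, rng.standard_normal((1250, 3))) for i in range(4)])
distances = blocks.cartesian(blocks).map(block_distances)
print(distances.count())  # 16 block pairs
spark.stop()
```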
## User Guide
SparkIsomap is entirely self-contained in `SparkIsomap.py`. When executing SparkIsomap, provide the following command line parameters:
* `-f` Input data (in .tsv format).
* `-o` Output file name.
* `-e` Spark event log directory.
* `-C` Spark checkpoint directory.
* `-b` Submatrix block size.
* `-p` Number of partitions.
* `-n` Number of points.
* `-D` Input data dimensionality.
* `-k` Neighborhood size.
* `-d` Reduced dimensionality.
* `-l` Maximum iterations for power iteration.
* `-t` Convergence threshold for power iteration (provide EXPONENT, such that 10^-EXPONENT is the convergence threshold; see the sketch after this list).
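To make the roles of `-l` and `-t` concrete, here is a generic, single-machine power-iteration loop. This is a sketch of the textbook method, written under the assumption that SparkIsomap's distributed version follows the same stopping logic; it is not code from the tool itself.

```python
import numpy as np

def power_iteration(M, max_iter=100, exponent=9):
    # -l caps the number of iterations; -t supplies EXPONENT, so the
    # loop stops once the update falls below 10**-EXPONENT.
    tol = 10.0 ** -exponent
    v = np.random.default_rng(0).standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(max_iter):
        w = M @ v
        w /= np.linalg.norm(w)
        done = np.linalg.norm(w - v) < tol
        v = w
        if done:
            break
    return v, float(v @ M @ v)  # dominant eigenvector and its eigenvalue

_, eigval = power_iteration(np.diag([5.0, 2.0, 1.0]))
print(round(eigval, 6))  # ~5.0
```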
Example invocation:
```
python SparkIsomap.py -k 10 -n 50000 -D 3 -d 2 -l 100 -t 9 -b 1250 -p 1431 -f swiss50k.tsv -o swiss50k_d2.tsv -C chkpt_swiss50 -e elogs/
```
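To try this invocation you need an input file. The snippet below generates a 50,000-point Swiss roll and writes it as tab-separated values; it is our example, not part of SparkIsomap, and it assumes the expected input layout is simply one point per row with D tab-separated coordinates.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# 50,000 points in 3 dimensions, matching -n 50000 and -D 3 above.
X, _ = make_swiss_roll(n_samples=50000, random_state=0)
np.savetxt("swiss50k.tsv", X, delimiter="\t")
```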
If you have immediate questions regarding the method or software, please do not hesitate to contact Jaric Zola <jaroslaw.zola@hush.com>.
## References
To cite SparkIsomap, refer to this repository and our paper:
* F. Schoeneman, J. Zola, _Scalable Manifold Learning for Big Data with Apache Spark_, In Proc. IEEE International Conference on Big Data (BigData), 2018. <http://arxiv.org/abs/1808.10776>.
if __name__ == "__main__":
    t_init = time.time()
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", help="Input data (.tsv format).")
    parser.add_argument("-o", help="Output file name.")
    parser.add_argument("-e", help="Spark event log directory.")
    parser.add_argument("-C", help="Spark checkpoint directory.")
    parser.add_argument("-b", help="Submatrix block size.")
    parser.add_argument("-p", help="Number of partitions.")
    parser.add_argument("-n", help="Number of points.")
    parser.add_argument("-D", help="Input data dimensionality.")
    parser.add_argument("-k", help="Neighborhood size.")
    parser.add_argument("-d", help="Reduced dimensionality.")
    parser.add_argument("-l", help="Maximum iterations for power iteration.")