Commit 2e2920f2 authored by Alan

Ready for first release

parent 277477b4
# Change Log
All notable changes to this project will be documented in this file. This change log follows the conventions of [keepachangelog.com](http://keepachangelog.com/).
## [Unreleased]
### Changed
- Add a new arity to `make-widget-async` to provide a different widget shape.
## [0.1.1] - 2018-08-30
### Changed
- Documentation on how to make the widgets.
### Removed
- `make-widget-sync` - we're all async, all the time.
### Fixed
- Fixed widget maker to keep working when daylight savings switches over.
## 0.1.0 - 2018-08-30
### Added
- Files from the new template.
- Widget maker public API - `make-widget-sync`.
[Unreleased]: https://github.com/your-name/clj-boost/compare/0.1.1...HEAD
[0.1.1]: https://github.com/your-name/clj-boost/compare/0.1.0...0.1.1
## 0.0.1 - 2018-09-23
### First release
- Released on Clojars
@@ -4,14 +4,162 @@ A Clojure wrapper for [XGBoost4J](https://github.com/dmlc/xgboost/tree/master/jvm-packages)
## Rationale
Clojure is a great language for doing many things, but there's a field where it could shine and it doesn't: **data science** & **machine learning**. The main reason is the lack of domain libraries that would help practitioners use off-the-shelf algorithms and solutions to do their work.
Python didn't become the leader in the field because it's inherently better or more performant, but because of [scikit-learn](http://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org) and so on. While as Clojurists we don't really need [pandas](https://pandas.pydata.org) (dataframes) or similar tools (everything is just a **map**, or a **record** if you care more about memory and performance), we don't have something like [scikit-learn](http://scikit-learn.org/stable/) that makes it really easy to train many kinds of machine learning models and somewhat easier to deploy them.
**clj-boost** clearly isn't a shot at [scikit-learn](http://scikit-learn.org/stable/) - something like that would require years of development - but it aims to give people a better way to test and deploy their models. Clojure is robust, reliable and fast enough for most of the possible uses out there.
## Disclaimer
This project is at a very early stage and it's not ready for production use yet. By battle-testing and improving it together we will be able to get the best out of it and make data science with Clojure more reliable and more fun.
## Installation
-
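Assuming the `0.0.1` release mentioned in the change log is published to Clojars under the same coordinates as this repository's `project.clj` (an assumption, not something this README states), pulling it into a Leiningen project would look roughly like this:
```clojure
;; project.clj of your application
;; NOTE: the [clj-boost "0.0.1"] coordinates are assumed from this repo's
;; project.clj and change log, not taken from official installation docs
(defproject my-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [clj-boost "0.0.1"]])
```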
## Usage
Start by requiring `clj-boost.core` in your namespace:
```clojure
(ns tutorial
  (:require [clj-boost.core :refer :all]))
```
**XGBoost** forces us to use its data structures in exchange for speed and performance. So the first thing to do is to transform your data into a **DMatrix**. You can pass various data structures to `dmatrix`:
### Map
It is possible to pass a map with `:x` and optionally `:y` keys; their values must be either a sequence of sequences or a vector of vectors for `:x` and a flat vector or sequence for `:y`. From now on, every time I use `x` and `y` I mean: `x` -> training data, `y` -> the objective to learn (required for training the model, optional for prediction).
```clojure
(dmatrix {:x [[0 1 0]
              [1 1 0]]
          :y [1 0]})

(dmatrix {:x [[0 1 0]
              [1 1 0]]})
```
### Vector
The input can also be a vector of vectors/sequence of sequences for `x` and optionally a flat vector/sequence for `y`.
```clojure
(dmatrix [[[0 1 0]
           [1 1 0]]
          [1 0]])

(dmatrix [[0 1 0]
          [1 1 0]])
```
### String
When given a string, `dmatrix` tries to load a **DMatrix** stored on disk at the given path.
```clojure
(dmatrix "data/train-set.dmatrix")
```
There's not much we can do with a **DMatrix**: for instance, once it is created it is impossible to go back to a regular data structure. At the moment the only available operation is getting the number of rows from it:
```clojure
(nrow (dmatrix data))
;; 50
```
Now fitting a model is just a matter of calling `fit` with the **DMatrix** as the first argument and, as the second argument, a *config* map with parameters for the model. Parameters are the same across every **XGBoost** interface, so the advice is to use [this page](https://xgboost.readthedocs.io/en/latest/parameter.html) as a reference.
```clojure
(fit (dmatrix data)
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10})
```
`fit` returns an **XGBoost** model instance, or a **Booster** for friends, that can be stored, used for prediction or used as a baseline for further training. For the latter option just add a `:booster` key with an already trained **Booster** instance to the *config* map.
```clojure
(fit (dmatrix data)
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10
      :booster my-booster})
```
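Here `my-booster` stands for any previously trained **Booster**, for example the result of an earlier `fit` call (a minimal sketch assuming `:watches` and `:early-stopping` can be omitted):
```clojure
;; a previously trained model to continue training from
;; (assumes a config with only :params and :rounds is accepted)
(def my-booster
  (fit (dmatrix data)
       {:params {:eta 0.1
                 :objective "binary:logistic"}
        :rounds 2}))
```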
`cross-validation` is basically the same, only you don't get a **Booster** in return, but the cross-validation results:
```clojure
(cross-validation (dmatrix data)
                  {:params {:eta 0.1
                            :objective "binary:logistic"}
                   :rounds 2
                   :nfold 3})
```
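The results come back as a sequence of per-round maps (per the `cross-validation` docstring), so they can be inspected like any other Clojure data; for example, to look only at the last round (a sketch, the exact keys depend on the configured metrics):
```clojure
;; metrics of the final round; the exact map keys depend on :params/metrics
(-> (cross-validation (dmatrix data)
                      {:params {:eta 0.1
                                :objective "binary:logistic"}
                       :rounds 2
                       :nfold 3})
    last)
```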
To get predictions there's the `predict` function, which takes a model (a **Booster** instance) and the data to predict on.
```clojure
(-> (fit (dmatrix data)
         {:params {:eta 0.1
                   :objective "binary:logistic"}
          :rounds 2
          :watches {:train (dmatrix data)
                    :valid (dmatrix valid)}
          :early-stopping 10})
    (predict (dmatrix test-data)))
```
Let's say you're either working with large data or building an automated pipeline. Of course you would want to `persist` your models and your data for later use or as intermediate results. Finally, you can `predict` on new data by loading the stored model with `load-model`:
```clojure
(persist (dmatrix data) "path/to/my-data")
(persist (dmatrix new-data) "path/to/my-new-data")

(-> (dmatrix "path/to/my-data")
    (fit
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10})
    (persist "path/to/my-model"))

(-> (load-model "path/to/my-model")
    (predict (dmatrix "path/to/my-new-data")))
```
Since this is a common pattern you might want to take a look at the `pipe` function: it takes *train-dmatrix*, *test-dmatrix*, *config* and optionally a *path*. `pipe` will train a model using *config* as parameters, make predictions on the given test data and, if a *path* is given, store the model at *path*.
```clojure
(pipe (dmatrix data)
      (dmatrix new-data)
      {:params {:eta 0.1
                :objective "binary:logistic"}
       :rounds 2
       :watches {:train (dmatrix data)
                 :valid (dmatrix valid)}
       :early-stopping 10}
      "path/to/my-model")
```
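Conceptually, the `pipe` call above is just the fit/persist/predict pattern from the previous section rolled into one call; a rough sketch of the equivalent steps (based on the description above, not on `pipe`'s actual implementation) would be:
```clojure
;; roughly what the pipe call above amounts to (sketch, not the real internals)
(let [model (fit (dmatrix data)
                 {:params {:eta 0.1
                           :objective "binary:logistic"}
                  :rounds 2
                  :watches {:train (dmatrix data)
                            :valid (dmatrix valid)}
                  :early-stopping 10})]
  (persist model "path/to/my-model")
  (predict model (dmatrix new-data)))
```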
## To do
- [ ] Make some tutorials and posts about `clj-boost` usage
- [ ] Add a method to generate *config* programmatically (atom?)
- [ ] Add a way to perform grid search over parameters
- [ ] Some evaluation facilities (accuracy, confusion matrix, etc.)?
## License
© Alan Marazzi, 2018. Licensed under an [Apache-2](https://gitlab.com/alanmarazzi/clj-boost/blob/master/LICENSE) license.
(ns tutorial
  (:require [clj-boost.core :as boost]
            [clojure.java.io :as io]
            [clojure.data.csv :as csv]))
......
# Introduction to clj-boost
TODO: write [great documentation](http://jacobian.org/writing/what-to-write/)
(defproject clj-boost "0.0.1"
  :description "A Clojure wrapper for XGBoost"
  :url "https://gitlab.com/alanmarazzi/clj-boost"
  :license {:name "Apache 2.0"
            :url "https://www.apache.org/licenses/LICENSE-2.0"}
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [org.clojure/data.csv "0.1.4"]
                 [ml.dmlc/xgboost-jvm "0.72" :extension "pom"]
                 [ml.dmlc/xgboost4j "0.72" :extension "pom"]
                 [ml.dmlc/xgboost4j-example "0.72" :extension "pom"]]
  :plugins [[lein-cloverage "1.0.13"]])
(ns clj-boost.core
  "The main clj-boost namespace.
  You shouldn't need anything other than this to productively use clj-boost.
  By requiring this namespace you get all functions for serializing,
  training and predicting XGBoost models."
  (:import (ml.dmlc.xgboost4j.java XGBoost
                                   Booster
                                   DMatrix)
@@ -145,8 +143,7 @@
  "Train an XGBoost model on the given data.
  The training set must be a `DMatrix` instance (see
  `dmatrix`) while `config` is a regular map.
  It returns a trained model (a `Booster` instance) that can
  be used for prediction or as a base margin for further training.
@@ -189,8 +186,7 @@
  "Perform cross-validation on the training set.
  The training set must be a `DMatrix` instance (see
  `dmatrix`) while `config` is a regular map.
  It returns a sequence of maps containing train and test
  error for every round as defined in the `config` map.
......
(ns clj-boost.dmatrix
  "All the `DMatrix` serialization helpers."
  (:import (ml.dmlc.xgboost4j.java
            DMatrix)
           (clojure.lang PersistentArrayMap
                         PersistentVector
                         LazySeq)))
(defn- coll->array
  "Helper function to convert a nested vector to a flat array."
  [coll]
  (float-array (flatten coll)))
(defmulti dmatrix
  "Serializes given data to `DMatrix`.

  It is required by the XGBoost API to serialize data structures to use
  the library (https://xgboost.readthedocs.io/en/latest/jvm/java_intro.html).
  `dmatrix` tries to make the process as painless as possible, internally
  dealing with types and variable arguments.

  It is possible to pass a map with `:x` and optionally `:y` keys,
  their values must be either a sequence of sequences or a vector of vectors
  for `:x` and a flat vector or sequence for `:y`.

  The input can also be a vector of vectors/sequence of sequences for `x`
  and optionally a flat vector/sequence for `y`.

  If given a string as input `dmatrix` tries to load a `DMatrix` from the
  given path in string form.

  `y` is required only for training, not for prediction.

  Examples:

  (def map-with-y {:x [[1 0] [0 1]] :y [1 0]})
  (dmatrix map-with-y)

  (def map-without-y {:x [[1 0] [0 1]]})
  (dmatrix map-without-y)

  (def vec-x [[1 0] [0 1]])
  (def vec-y [1 0])
  (dmatrix vec-x vec-y)
  (dmatrix vec-x)

  (def seq-x '((1 0) (0 1)))
  (def seq-y '(1 0))
  (dmatrix seq-x seq-y)
  (dmatrix seq-x)

  (dmatrix \"path/to/stored/dmatrix\")"
  (fn [d & other] (type d)))
(defmethod dmatrix PersistentArrayMap
  [{:keys [x y]}]
  (if y
    (doto (DMatrix. (coll->array x)
                    (count y)
                    (count (first x)))
      (.setLabel (coll->array y)))
    (DMatrix. (coll->array x)
              (count x)
              (count (first x)))))
(defmethod dmatrix PersistentVector
  ([x]
   (DMatrix. (coll->array x)
             (count x)
             (count (first x))))
  ([x y]
   (doto (DMatrix. (coll->array x)
                   (count y)
                   (count (first x)))
     (.setLabel (coll->array y)))))
(defmethod dmatrix LazySeq
  ([x]
   (DMatrix. (coll->array x)
             (count x)
             (count (first x))))
  ([x y]
   (doto (DMatrix. (coll->array x)
                   (count y)
                   (count (first x)))
     (.setLabel (coll->array y)))))
(defmethod dmatrix String
  [path]
  (DMatrix. path))
(defn nrow
  "Returns the number of rows in a `DMatrix`.

  Example:

  (nrow (dmatrix train))
  ;; 50"
  [dmatrix]
  (.rowNum dmatrix))
(ns clj-boost.utils
  "Utilities to make life better with clj-boost."
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io])
  (:import [ml.dmlc.xgboost4j.java
            DMatrix
            Booster]))
(defmulti persist
  "Save datasets and models in a format suited for XGBoost.

  Save either a `DMatrix` or a `Booster` instance to retrieve it for later use.

  Example:

  (persist (dmatrix dataset) \"path/to/dataset\")
  (persist (fit (dmatrix dataset) config) \"path/to/model\")"
  (fn [d other] (type d)))
(defmethod persist DMatrix [dmatrix path]
  (.saveBinary dmatrix path))

(defmethod persist Booster [model path]
  (.saveModel model path))
@@ -171,14 +171,13 @@
:metrics ["error" "auc"]}))))))
(deftest persist-dmatrix-test
  (are [d fname] (do
                   (swap! filename conj fname)
                   (persist (dmatrix d) (last @filename))
                   (.exists (io/file (last @filename))))
    train-map "resources/dmat"
    (first train-vec) "resources/dmat2"
    [[]] "resources/dmat3"))
(deftest persist-booster-test
  (are [d fname] (do
......