Netlify status - main site:
Netlify status - mobile site:
NAIL-BITER ALERTER - FSDL PROJECT Spring 2021 - README
The project proposal:
The big disclaimer:
This project was designed to address privacy issues, in particular by avoiding server-side deep learning and using only client-side or edge inference with limited computing resources. To facilitate the review of this project, many web endpoints are left 'open' (MLflow server, Flask server, unencrypted images, logging test images in Weights & Biases...) but would not be in real life.
The Loom video about the final project:
https://www.loom.com/share/9a5309ab6cb247e7943e76cffb69f628
What has been accomplished for this Capstone project:
This project provides:
- an MLflow server deployed on a Google Kubernetes Engine Autopilot cluster, with a Cloud SQL backend and GCP object storage, in a fully CI/CD manner
- a classifier for nail-biting alerting, trained with transfer learning on a MobileNet base (target usage is edge / low-compute resources); see the sketch after this list
- use of MLflow Tracking, Models, and the Model Registry (Projects: not finished)
- use of Weights & Biases in comparison to MLflow
- conversion of the trained tf.keras model to TFLite, with TFLite quantization and flat-buffer compilation for the Google Coral Dev Board (or Coral USB accelerator)
- CI/CD deployment on GCP (MLflow server, API, Netlify sites)
- logging with Sentry (error logging and capture messages); an API checker is up with Flask (Sentry logging is really useful when dealing with Kubernetes pods)
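A minimal sketch of the transfer-learning setup on a MobileNet-family base (the head layers, image size, and dataset handling below are illustrative assumptions, not the exact code of this repo):

```python
import tensorflow as tf

# Pretrained MobileNetV2 base, frozen as a feature extractor.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

# Small classification head for the two classes (nail-biting / no nail-biting).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```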
Project documentation
This project is automatically documented using Sphinx:
- Go to the docs directory:
cd docs
- Automatically generate the .rst files in docs/source:
sphinx-apidoc -o source ../nailbiter/
- Build the HTML files:
make html
- Open the index file in docs/build/html
Project deployment
This project is deployed on Kubernetes Engine using Gitlab CI/CD.
It has two Docker containers: one for the app and one for the MLflow server. It also sets up three Kubernetes deployments (development / staging / production) using Kubernetes namespaces, one Kubernetes Job that needs to be triggered manually from GitLab, and a Kubernetes LoadBalancer service to access the app.
Continuous integration
There are two steps to do tests in our project:
- one for unit tests from local files to test functions locally (test-app)
- one to test the API in the development environment (test-deploy-dev)
For now, those steps are disabled and need to be debugged.
Continuous delivery
The continuous delivery has three steps:
- one step to build and push the two Docker images
- one step to deploy to the development environment automatically
- one step to deploy manually to the staging and production environments
As of now, there are no steps to destroy deployments, but this could be implemented to free cluster resources automatically and in a centralized way.
Accesses
All credentials are stored in Kubernetes secrets but managed and created from GitLab CI/CD.
How to run the project locally
To be able to run the project locally you need to have access to Google Cloud Storage.
### 1. Set up the Python environment
We use `conda` for managing Python and `pip-tools` for managing Python package dependencies.
We add a `Makefile` for making setup simple.
First: Install the environment
First, go to the root folder of the project.
Run `make conda-update` to create an environment called `nail-biter`, as defined in `environment.yml`.
This environment will provide us with the right Python version.
If you edit `environment.yml`, just run `make conda-update` again to get the latest changes.
Next, activate the conda environment:
conda activate nail-biter
IMPORTANT: every time you work in this directory, make sure to start your session with `conda activate nail-biter`.
Next: install Python packages
Next, install all necessary Python packages by running `make pip-tools`.
If you add, remove, or need to update versions of some requirements, edit the `.in` files, and simply run `make pip-tools` again.
Quick Python environment setup
Once you have `conda` and your environment is active, you can just run `make all` to execute:
- `make conda-update`: creates/updates the conda env
- `make pip-tools`: resolves and installs all Python packages
### 2. Build Docker images
For this section, you need to have Docker installed.
Build the Docker images for the app and the MLflow server (note: the two builds need distinct tags, otherwise the second overwrites the first):
docker build -t nailbiter . -f ./docker/app/Dockerfile
docker build -t nailbiter-mlflow . -f ./docker/mlflow/Dockerfile
### 3. Set up the GCloud SQL Proxy
Download the proxy and allow it to execute. Log in to Google Cloud, then run:
./cloud_sql_proxy -instances=mlops-jmp:europe-west1:mlops-jmp=tcp:5432
### 4. Run the API
Run the app:
docker run -p 5000:5000 nailbiter
In your browser, go to localhost:5000 and experiment with the Swagger UI.
Run the MLflow server locally (starting the Cloud SQL proxy first is mandatory):
./cloud_sql_proxy -instances=mlops-jmp:europe-west1:mlops-jmp=tcp:5432
mlflow server --backend-store-uri postgresql+psycopg2://postgres:postgres@127.0.0.1:5432/mlflow_fsdl --default-artifact-root gs://mlops-2021-jmp/mlflow-fsdl --host 0.0.0.0 --port 5000
In your browser, go to localhost:5000
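To sanity-check the local server, a run can be logged against it; here is a minimal sketch (the experiment name, parameter, and metric are illustrative):

```python
import mlflow

# Point the client at the local MLflow server started above.
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("nail-biter-smoke-test")  # illustrative name

with mlflow.start_run():
    mlflow.log_param("base_model", "mobilenet")
    mlflow.log_metric("val_accuracy", 0.93)
```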
Access to the remote MLflow server (on GCP)
Model Tracking records all experiment parameters:
I realise I missed adding the classification report logger as an artifact...
The Model Registry is accessible at: http://34.76.75.116:5000/#/models
The Staging and Production versions are available and can be served directly:
By design, all my trained models are registered at the end of training in the 'Staging' stage.
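As a sketch, promoting a registered version and loading a model by stage could look like this (the model name 'nailbiter' and the version number are hypothetical, not the exact registry entries of this repo):

```python
import mlflow
import mlflow.pyfunc
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://34.76.75.116:5000")

# Promote a registered version from 'Staging' to 'Production'
# (hypothetical name and version).
MlflowClient().transition_model_version_stage(
    name="nailbiter", version="1", stage="Production")

# Load whatever version currently sits in 'Staging' via the models:/ URI.
model = mlflow.pyfunc.load_model("models:/nailbiter/Staging")
```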
Access to Weights & Biases
https://wandb.ai/macfly1202/my-keras-nail_biter-integration
*! Crashes with Sweeps due to a bug with W&B & fit_transform; visible in the logs: https://fullstackdeeplearning.slack.com/archives/C01NYV9M8DR/p1620763864129000 *
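For reference, a minimal sketch of a Keras + Weights & Biases integration (the config values are illustrative assumptions; the project name matches the link above, but this is not necessarily the exact setup of this repo):

```python
import wandb
from wandb.keras import WandbCallback

wandb.init(project="my-keras-nail_biter-integration",
           config={"epochs": 10, "base_model": "mobilenet"})
# Attach the callback so losses/metrics stream to the W&B dashboard:
# model.fit(train_ds, validation_data=val_ds,
#           epochs=wandb.config.epochs, callbacks=[WandbCallback()])
```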
Access to the Netlify sites
Netlify helps deploy modern static websites in a few clicks or with CI/CD.
Netlify has an access token to GitLab, which triggers a deploy from each publish directory.
In this project, the publish directories are:
nailbiter/application/export_models/netlify
is deployed on:
https://nailbiting-alerter.netlify.app/
On Desktop computer:
On Mobile and Smartphones:
For better rendering on smartphones, I made a custom canvas size.
<!> Disclaimer: having responsive sites is a better approach, but I am neither a JavaScript nor a front-end developer.
nailbiter/application/export_models/netlify_mobile
is deployed on:
https://nailbiting-alerter-mobile.netlify.app/
The Netlify sites (main and mobile-optimized) are automatically published and deployed with GitLab CI/CD.
Local testing for HTML/JS:
Chrome extension: Web Server for Chrome:
https://chrome.google.com/webstore/detail/web-server-for-chrome/ofhbbkphhbklhfoeikjpcbhemlocgigb/related
P5.js editor:
https://editor.p5js.org/
Access to AWS S3 static hosting (another advantage of TensorFlow.js)
An Amazon S3 bucket can be served as a static site. The CORS policy must be adapted:
https://demo-poc-ml-nail-biter-alerter.s3.eu-west-3.amazonaws.com/index.html
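For reference, a minimal sketch of setting such a CORS policy with boto3 (the bucket name matches the link above; the permissive origins are an illustrative assumption and should be tightened in practice):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_cors(
    Bucket="demo-poc-ml-nail-biter-alerter",
    CORSConfiguration={"CORSRules": [{
        "AllowedMethods": ["GET", "HEAD"],
        "AllowedOrigins": ["*"],   # tighten to the real site origins
        "AllowedHeaders": ["*"],
    }]},
)
```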
Google Coral dev board
A development board to quickly prototype on-device ML products. Scale from prototype to production with a removable system-on-module (SoM)
https://coral.ai/products/dev-board/
The TensorFlow model must be converted to a quantized flat-buffer model.
Before use on the Google Coral Dev Board, the *.tflite model must be compiled. The Edge TPU Compiler (edgetpu_compiler) is a command-line tool that compiles a TensorFlow Lite model (.tflite file) into a file that's compatible with the Edge TPU.
More info:
https://coral.ai/docs/edgetpu/models-intro/#compiling
However, this .tflite file still uses floating-point values for the parameter data, and we need to fully quantize the model to int8 format. To fully quantize the model, we need to perform post-training quantization with a representative dataset, which requires a few more arguments for the TFLiteConverter and a function that builds a dataset representative of the training dataset. That representative dataset is built in the representative_data_gen() function.
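A minimal sketch of this post-training quantization (the saved-model path is illustrative, and the sample generator below yields placeholder noise; the repo's own representative_data_gen() draws from the real training data):

```python
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # Yield ~100 input samples; in the real pipeline these come from
    # the training dataset, not random noise.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # illustrative path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full int8 quantization so edgetpu_compiler accepts the model.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model_quant.tflite", "wb") as f:
    f.write(converter.convert())
```

The resulting model_quant.tflite can then be passed through edgetpu_compiler as described above.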
The Coral model in this project is generated into: nailbiter/application/export_models/quant_compile
Coral models are stored at https://github.com/macfly1202/quantizedmodels_coral (private)
On the Google Coral Dev Board shell:
mdt shell
Testing with simple images
cd ~/tflite/python/examples/classification/nail/quantizedmodels_coral
git pull
cd ~/tflite/python/examples/classification
(or cd ../../)
python3 classify_image.py --model nail/quantizedmodels_coral/model_quant_input_saved_modelname_edgetpu.tflite --labels nail/quantizedmodels_coral/labels.txt --input nail/quantizedmodels_coral/images/no-nail-biting-\!-3033.jpg
python3 classify_image.py --model nail/quantizedmodels_coral/model_quant_input_saved_modelname_edgetpu.tflite --labels nail/quantizedmodels_coral/labels.txt --input nail/quantizedmodels_coral/images/nail-biting-\!-3033.jpg
<!> This is not a test of the algorithm; it only tests the Python Coral TPU API <!>
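For reference, a minimal sketch of what such a classification script does with the Edge TPU delegate (the model path matches the commands above; the input image handling is an illustrative placeholder):

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the compiled model and attach the Edge TPU delegate.
interpreter = tflite.Interpreter(
    model_path="nail/quantizedmodels_coral/model_quant_input_saved_modelname_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Placeholder input; a real script resizes a photo to the input shape.
image = np.zeros(inp["shape"], dtype=np.uint8)
interpreter.set_tensor(inp["index"], image)
interpreter.invoke()
print(interpreter.get_tensor(out["index"])[0])  # class scores
```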
Testing the live feed with the Coral TPU classify server
edgetpu_classify_server --model nailbiter_detector/quantizedmodels_coral/r13/model_quant_input_saved_modelname_edgetpu.tflite --labels nailbiter_detector/quantizedmodels_coral/labels.txt
The shell gives feedback that the Edge TPU vision server has launched:
INFO:edgetpuvision.streaming.server:Listening on ports tcp: 4665, web: 4664, annexb: 4666
On the local desktop, go to 192.168.1.23:4664
Server rendering feedback:
On the local desktop:
A short Loom video of the Coral Dev Board demo, with a quick comparison to the Netlify site & S3 (tf.js):
https://www.loom.com/share/7745aafb869747ed8b2a51f4282a266c
Access to the Flask API with load balancer
This API provides (a minimal sketch follows this list):
- a health checker
- a Sentry log checker (launches a division by zero to trigger an error that can be checked in Sentry)
- the MLflow render URL
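A minimal sketch of such an API, assuming the standard sentry-sdk Flask integration (the routes and DSN are illustrative, not the exact ones deployed here):

```python
import sentry_sdk
from flask import Flask
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(dsn="https://<key>@sentry.io/<project>",  # illustrative DSN
                integrations=[FlaskIntegration()])
app = Flask(__name__)

@app.route("/health")
def health():
    return {"status": "ok"}

@app.route("/sentry-check")
def sentry_check():
    return 1 / 0  # deliberate ZeroDivisionError, captured by Sentry

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```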
Performance / results comments:
- TF.js gives extremely stable and fast results compared to the Edge TPU
- false detections can sometimes occur with a mug/coffee
TODO / IMPROVEMENTS
- add TensorFlow.js training client-side with tf.js (I was a few hours short of accomplishing this)
- create the GCP MLflow bucket, S3 bucket, and Cloud SQL instance with Deployment Manager (and CI/CD)
- add the MLflow tracking URI to the GitLab CI variables
- add a Streamlit app with a Kubernetes deployment for testing
- change the MLflow serving stage of the model ('Production') via the API
- track preprocessing and all conversion accuracies and artifacts for Edge / TF.js with MLflow / W&B
- learn Markdown to center and align images in the README :-)
To improve the 'privacy by design' approach:
- encrypt all photos used in the project
- build on Google's teachable-machine project: https://github.com/google-coral/project-teachable
- use the imprinting method or retrain! Due to mismatched versions of the Coral Python API, I was unable to test the new PyCoral API on this project, and reinstalling Coral, backing up all existing projects, and redoing all the tooling on the Edge TPU would take multiple days!
What I regret about this project:
- not using Coral weight imprinting (https://coral.ai/docs/edgetpu/retrain-classification-ondevice/)
- not using the new PyCoral API instead of the one currently used in this project (Edge TPU Python API: https://coral.ai/docs/edgetpu/api-intro/#install-the-library)
- not using the GPIO on the Google Coral for visual alerting. I haven't found a solution yet, but I will. As an anecdote, I had the same kind of problem with the GPIO of an Nvidia Jetson Nano for a DIY velocity meter; it took me several hours to find a solution. The GPIOs on these ML boards are not as simple as those on Raspberry Pi boards (they are simple on paper only, not in real life)... My goal is to use a WS2812B LED strip to display custom effects according to the type of detection (frequency, duration, confidence)
- not having cleaner code (linting and everything that was present in the labs)
- not adding working unit tests
- not debugging the Sphinx doc build; I was unable to successfully build the HTML...
- I will definitely use VS Code instead of PyCharm, which is more and more disappointing (PyCharm 2021.1.1 (Community Edition) has multiple bugs and slow, strange behavior)
- not having better overall coherence between functions/classes: with deployments on edge, desktop, AWS, ... the project was too ambitious, but I learned so much in this 4-week sprint
- on the classifier: not adding a counter for tracking events over time and measuring KPIs (time of day when events occur most frequently, duration, clustering, prediction with Prophet), and eventually sending these metrics with Kafka to the PostgreSQL DB on GCP
- on the collected metrics: not applying dataviz & data analytics with Apache Superset (v1.1) deployed as a Kubernetes pod on GCP (like the MLflow server, the Flask-RESTX API, and their own load balancers)
- not having enough time; the project was so interesting!
... The FSDL project deadline does not mark the end of this project...
... I will continue to improve it in the next few weeks on all of the points mentioned above!
A short list of resources I used for this project
- the FSDL GitHub & courses
- 'Deep Learning for Computer Vision with Python - ImageNet Bundle' by Adrian Rosebrock (https://www.pyimagesearch.com/)
- the MLflow site & tutorials from Jules Damji of Databricks (Strata workshop)
- 'Deep Learning with Python' by François Chollet (1st and 2nd editions)
- 'Docker in Practice, Second Edition' by Ian Miell and Aidan Hobson Sayers
- Kubernetes website & 'Learn Kubernetes in a Month of Lunches' by Elton Stoneman
- Hyperopt & MLflow: https://dzlab.github.io/ml/2020/08/16/mlflow-hyperopt/
- the Coral website & GitHub
Thank you very much to the FSDL team
I would like to thank the whole FSDL team so much, especially Sergey Karayev, Josh Tobin, and Pieter Abbeel.
Sergey impressed me by staying so calm with all kinds of strange questions/remarks during the Thursday Q&As, and by presenting all the concepts in a way that made me passionate to learn and dig into them!
My first big wave in the deep learning area was in 2017. FSDL 2021 was so cool and interesting that it's probably my second biggest wave in the deep learning area.
I strongly recommend signing up for this paid course to have deadlines, meet passionate people on Slack, and work seriously on a nice and challenging project.