Netlify status - main site: [Netlify Status]

Netlify status - mobile site: [Netlify Status]

    NAIL-BITER ALERTER - FSDL PROJECT Spring 2021 - README


    The project proposal:

    https://docs.google.com/presentation/d/14Y0atAdc7y_BYAAuyFJmtsEVaY8HVy0fNlAbJ9FnnxY/edit#slide=id.gce5cd8f037_0_5


    The big disclaimer:

This project has been designed to address some privacy issues, in particular the non-use of server-side deep learning: inference runs only client-side or at the edge with limited computing resources. To make reviewing the project easier, many web accesses are 'open' (MLflow server, Flask server, unencrypted images, logging test images in Weights & Biases...) but would not be in real life.


    The Loom video about the final project:

    https://www.loom.com/share/9a5309ab6cb247e7943e76cffb69f628


    What has been accomplished for this Capstone project:

    This project uses GitLab CI/CD: 2021-05-15_22-08-16.png

    This project provides:

    • an MLflow server deployed on a Google Cloud Kubernetes Autopilot cluster with a Cloud SQL backend and GCP Object Storage, in a fully CI/CD manner
    • a classifier for nail-biting alerting, trained with transfer learning using MobileNet (target usage is edge / low-compute resources) - a minimal training sketch follows this list
    • use of MLflow Tracking, Models, and the Model Registry (Projects: not finished)
    • use of Weights & Biases in comparison to MLflow
    • conversion of the trained tf.keras model to TFLite, with TFLite quantization & flat buffer export for the Google Coral dev board (or Coral USB)
    • use of CI/CD to deploy on GCP (MLflow server, API, Netlify site)
    • logging operated with Sentry (error logging and capture message); an API checker is up with Flask (logging with Sentry is really useful when dealing with Kubernetes pods)
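
    For context, here is a minimal transfer-learning sketch with tf.keras and MobileNet; the dataset path, input size and classifier head are illustrative assumptions, not the project's exact code:

```python
# Hypothetical sketch: transfer learning on MobileNet with tf.keras.
import tensorflow as tf

IMG_SIZE = (224, 224)   # MobileNet's default input resolution (assumed)
NUM_CLASSES = 2         # nail-biting / no-nail-biting

base = tf.keras.applications.MobileNet(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# "data/train" is an assumed path, not the project's actual dataset location.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/train", image_size=IMG_SIZE, batch_size=32)
model.fit(train_ds, epochs=5)
```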

    Project documentation

    This project is automatically documented using Sphinx:

    1. Go to the docs directory: cd docs
    2. Automatically generate the .rst files in docs/source: sphinx-apidoc -o source ../nailbiter/
    3. Build the HTML files: make html
    4. Open the index file in docs/build/html

    Project deployment

    This project is deployed on Kubernetes Engine using GitLab CI/CD.

    It has two Docker containers: one for the MLflow server and a load balancer. It also sets up three Kubernetes deployments (development / staging / production) using Kubernetes namespaces, one Kubernetes Job that needs to be triggered manually from GitLab, and a Kubernetes LoadBalancer service to access the app.

    Continuous integration

    There are two test steps in our project:

    • one for unit tests from local files to test functions locally (test-app)
    • one to test the API in the development environment (test-deploy-dev)

    For now, those steps are disabled and need to be debugged.

    Continuous delivery

    The continuous delivery has three steps:

    • one step to build and push the 2 Docker images
    • one step to deploy automatically to the development environment
    • one step to deploy manually to the staging and production environments

    As of now, there are no steps to destroy deployments, but they could be implemented to free cluster resources automatically and in a centralized way.

    Accesses

    All credentials are stored in Kubernetes secrets but managed and created from GitLab CI/CD.

    How to run the project locally

    To be able to run the project locally you need to have access to Google Cloud Storage.

    ### 1. Set up the Python environment

    We use conda for managing Python and pip-tools for managing Python package dependencies.

    We add a Makefile for making setup simple.

    First: Install the environment

    First, go to the root folder of the project.

    Run make conda-update to create an environment called nail-biter, as defined in environment.yml. This environment will provide us with the right Python version.

    If you edit environment.yml, just run make conda-update again to get the latest changes.

    Next, activate the conda environment.

    conda activate nail-biter

    IMPORTANT: every time you work in this directory, make sure to start your session with conda activate nail-biter.

    Next: install Python packages

    Next, install all necessary Python packages by running make pip-tools.

    If you add, remove, or need to update versions of some requirements, edit the .in files, and simply run make pip-tools again.

    Quick Python environment setup

    Once you have conda and your environment is active, you can just run make all to execute:

    • make conda-update creates/updates the conda env
    • make pip-tools resolves and installs all Python packages

    ### 2. Build Docker images

    For this section, you need to have Docker installed.

    Build the Docker images for the app and the MLflow server (the second image presumably needs its own tag, e.g. nailbiter-mlflow, so it does not overwrite the first):

    docker build -t nailbiter . -f ./docker/app/Dockerfile
    docker build -t nailbiter-mlflow . -f ./docker/mlflow/Dockerfile

    ### 3. Set up the Cloud SQL Proxy

    Download the proxy and make it executable. Log in to Google Cloud.

    Run ./cloud_sql_proxy -instances=mlops-jmp:europe-west1:mlops-jmp=tcp:5432

    ### 4. Run the API

    Run the app:

    docker run -p 5000:5000 nailbiter

    In your browser, go to localhost:5000 and experiment with the Swagger UI.

    Run the MLflow server locally. Starting the Cloud SQL proxy first is mandatory:

    ./cloud_sql_proxy -instances=mlops-jmp:europe-west1:mlops-jmp=tcp:5432

    mlflow server --backend-store-uri postgresql+psycopg2://postgres:postgres@127.0.0.1:5432/mlflow_fsdl --default-artifact-root gs://mlops-2021-jmp/mlflow-fsdl --host 0.0.0.0 --port 5000

    In your browser, go to localhost:5000
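
    Once the server is up, training code can point at it through the tracking URI. A minimal sketch, assuming a local server on port 5000 and an illustrative experiment name:

```python
# Minimal sketch: point an MLflow run at the local tracking server.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("nail-biter")          # experiment name is an assumption

with mlflow.start_run():
    mlflow.log_param("base_model", "mobilenet")
    mlflow.log_metric("val_accuracy", 0.93)  # illustrative value only
```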


    Access to the MLflow remote server (on GCP)

    http://34.77.223.218:5000/

    Model Tracking records all experiment parameters:

    imagesWiki/2021-05-14_14-05-26.png

    imagesWiki/img.png

    imagesWiki/2021-05-14_14-07-50.png

    imagesWiki/2021-05-14_14-09-51.png

    imagesWiki/2021-05-14_14-10-56.png

    I realise I missed logging the classification report as an artifact...
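
    For reference, a hedged sketch of how the report could have been logged as an MLflow artifact (the file name and the use of scikit-learn's classification_report are assumptions):

```python
# Hypothetical sketch: log a classification report as an MLflow artifact.
import mlflow
from sklearn.metrics import classification_report

def log_classification_report(y_true, y_pred, path="classification_report.txt"):
    report = classification_report(y_true, y_pred)
    with open(path, "w") as f:
        f.write(report)
    mlflow.log_artifact(path)  # attaches the file to the active run
```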

    The Model Registry is accessible at: http://34.76.75.116:5000/#/models

    The Staging and Production versions are available and can be served directly:

    By design, all my trained models are registered at the end of training in the 'Staging' stage.
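
    A minimal sketch of that registration step, assuming the model name "nail-biter-classifier" (an illustrative name, not the project's actual registry entry):

```python
# Hypothetical sketch: register a run's model and move it to the Staging stage.
import mlflow
from mlflow.tracking import MlflowClient

with mlflow.start_run() as run:
    # ... training and model logging (e.g. mlflow.keras.log_model) happen here ...
    model_uri = f"runs:/{run.info.run_id}/model"

result = mlflow.register_model(model_uri, "nail-biter-classifier")

client = MlflowClient()
client.transition_model_version_stage(
    name="nail-biter-classifier",
    version=result.version,
    stage="Staging",
)
```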

    imagesWiki/2021-05-14_13-46-07.png


    Access to Weights & Biases

    https://wandb.ai/macfly1202/my-keras-nail_biter-integration

    imagesWiki/2021-05-14_14-16-16.png
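
    As a point of comparison with MLflow, a hedged sketch of the Keras-side W&B integration (the config values are assumptions; model and train_ds come from a training script like the MobileNet sketch above):

```python
# Hypothetical sketch: track a Keras training run with Weights & Biases.
import wandb
from wandb.keras import WandbCallback

wandb.init(project="my-keras-nail_biter-integration",
           config={"base_model": "mobilenet", "epochs": 5})

# model and train_ds are assumed to be defined as in the MobileNet sketch above.
model.fit(train_ds,
          epochs=wandb.config.epochs,
          callbacks=[WandbCallback()])
```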

    *! Sweeps crash due to a bug with wandb & fit_transform; visible in the logs; https://fullstackdeeplearning.slack.com/archives/C01NYV9M8DR/p1620763864129000 *


    Access to the Netlify site

    Netlify helps deploy modern static websites in a few clicks or with CI/CD.
    Netlify has an access token to GitLab that triggers publishing of the configured directory.
    In this project, the publish directory
    nailbiter/application/export_models/netlify
    is deployed to:
    https://nailbiting-alerter.netlify.app/

    imagesWiki/2021-05-14_18-39-42.png

    On Desktop computer:

    On Mobile and Smartphones:

    For better rendering on smartphones, I made a custom canvas size.
    <!> Disclaimer: a responsive site would be a better approach, but I am neither a JavaScript nor a front-end developer.

    nailbiter/application/export_models/netlify_mobile is deployed to:
    https://nailbiting-alerter-mobile.netlify.app/

    The Netlify sites (main and mobile-optimized) are automatically published and deployed with GitLab CI/CD.

    imagesWiki/2021-05-14_12-12-40.png

    Local test for HTML/JS:

    Chrome extension: Web server for chrome:
    https://chrome.google.com/webstore/detail/web-server-for-chrome/ofhbbkphhbklhfoeikjpcbhemlocgigb/related

    P5.js editor:
    https://editor.p5js.org/


    Access to AWS S3 static hosting (another advantage of TensorFlow.js)

    An Amazon S3 bucket can be served as a static site. The CORS policy must be adapted.

    https://demo-poc-ml-nail-biter-alerter.s3.eu-west-3.amazonaws.com/index.html


    Google Coral dev board

    A development board to quickly prototype on-device ML products. Scale from prototype to production with a removable system-on-module (SoM)
    https://coral.ai/products/dev-board/

    IMG_8976.jpg

    The TensorFlow model must be converted to a quantized flat-buffer model.

    Before use on the Google Coral Dev Board, the *.tflite model must be compiled. The Edge TPU Compiler (edgetpu_compiler) is a command line tool that compiles a TensorFlow Lite model (.tflite file) into a file that's compatible with the Edge TPU.

    More info:
    https://coral.ai/docs/edgetpu/models-intro/#compiling

    imagesWiki/imagesWiki/edge_tpu_compile-workflow.png

    The figure above illustrates the basic process to create a model that's compatible with the Edge TPU. Most of the workflow uses standard TensorFlow tools. Once you have a TensorFlow Lite model, you then use the Edge TPU compiler to create a .tflite file that's compatible with the Edge TPU.

    imagesWiki/imagesWiki/compile-tflite-to-edgetpu.png

    The compiler creates a single custom op for all Edge TPU compatible ops, until it encounters an unsupported op; the rest stays the same and runs on the CPU.

    However, this .tflite file still uses floating-point values for the parameter data, and we need to fully quantize the model to int8 format. To fully quantize the model, we perform post-training quantization with a representative dataset, which requires a few more arguments for the TFLiteConverter and a function that builds a dataset representative of the training dataset. We do this post-training quantization in the representative_data_gen() function.
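
    A hedged sketch of that post-training quantization step; the saved-model path is an assumption, and the project's actual representative_data_gen() iterates over real training images rather than random arrays:

```python
# Hypothetical sketch: full-integer post-training quantization for the Edge TPU.
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # In the real project this would yield batches of actual training images;
    # random arrays are used here only to keep the sketch self-contained.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("export_models/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Require int8 ops end to end so the Edge TPU compiler can map them.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_quant_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```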

    imagesWiki/2021-05-14_13-19-06.png

    The Coral model in this project is generated into: nailbiter/application/export_models/quant_compile

    Coral models are stored at https://github.com/macfly1202/quantizedmodels_coral (private)

    On the Google Coral dev board shell:

    mdt shell

    Testing with simple images

    cd ~/tflite/python/examples/classification/nail/quantizedmodels_coral
    git pull

    cd ~/tflite/python/examples/classification (or cd ../../)

    python3 classify_image.py --model nail/quantizedmodels_coral/model_quant_input_saved_modelname_edgetpu.tflite --labels nail/quantizedmodels_coral/labels.txt --input nail/quantizedmodels_coral/images/no-nail-biting-\!-3033.jpg

    python3 classify_image.py --model nail/quantizedmodels_coral/model_quant_input_saved_modelname_edgetpu.tflite --labels nail/quantizedmodels_coral/labels.txt --input nail/quantizedmodels_coral/images/nail-biting-\!-3033.jpg

    <!> This is not a test of the algorithm; it's only for testing the Python Coral TPU API <!>

    Testing the live feed with the Coral TPU classify server

    edgetpu_classify_server --model nailbiter_detector/quantizedmodels_coral/r13/model_quant_input_saved_modelname_edgetpu.tflite --labels nailbiter_detector/quantizedmodels_coral/labels.txt
    The shell gives feedback that the Edge TPU vision server has launched:
    INFO:edgetpuvision.streaming.server:Listening on ports tcp: 4665, web: 4664, annexb: 4666

    On the local desktop, go to 192.168.1.23:4664. The server renders the feedback: imagesWiki/imgTPUshell.png

    On the local desktop:

    imagesWiki/2021-05-14_17-51-13.png

    imagesWiki/2021-05-14_17-51-33.png

    Short Loom video of the Coral dev board demo, with a quick comparison with the Netlify site & S3 (tf.js):

    https://www.loom.com/share/7745aafb869747ed8b2a51f4282a266c


    Access to the Flask API with load balancer

    http://35.241.207.58:5000

    This API provides:

    • a health checker
    • a Sentry log checker (launches a division by zero to trigger an error that can be checked in Sentry)
    • the MLflow render URL

    imagesWiki/2021-05-14_23-16-11.png
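
    A hedged sketch of how such a checker can be wired up with Flask and the Sentry SDK; the DSN, routes, and port are assumptions, not the project's exact endpoints:

```python
# Hypothetical sketch: Flask API with Sentry error capture.
import sentry_sdk
from flask import Flask
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(dsn="https://<key>@sentry.io/<project>",  # placeholder DSN
                integrations=[FlaskIntegration()])

app = Flask(__name__)

@app.route("/health")
def health():
    return {"status": "ok"}

@app.route("/sentry-check")
def sentry_check():
    1 / 0  # deliberate division by zero; Sentry captures the resulting exception
    return "unreachable"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```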


    Performance / results comments:

    • TF.js gives extremely stable and fast results in comparison with the Edge TPU
    • false detections can sometimes occur with a mug/coffee

    TODO / IMPROVEMENTS

    • add client-side training with tensorflow.js (I was a few hours short of accomplishing this)
    • create the GCP MLflow bucket, S3 bucket and Cloud SQL with Deployment Manager (and CI/CD)
    • add the MLflow tracking URI to the GitLab CI variables
    • add a Streamlit app with a Kubernetes deployment for testing
    • change the MLflow serving stage model ('Production') with the API
    • track preprocessing and all conversion accuracy and artifacts for Edge / TF.js with MLflow / wandb
    • learn Markdown to center and align images in the README :-)

    To improve the 'privacy by design' approach

    • encrypt all photos used in the project
    • derive the teachable machine from Google: https://github.com/google-coral/project-teachable
    • use the imprinting method or retrain! Due to mismatching versions of the Coral Python API, I was unable to test the new PyCoral API on this project, and reinstalling Coral, backing up the existing project and all the tooling on the Edge TPU would take multiple days!

    What I regret about this project:

    • Using Coral weights imprinting (https://coral.ai/docs/edgetpu/retrain-classification-ondevice/)
    • Using the new PyCoral API instead of the one currently used in this project (Edge TPU Python API: https://coral.ai/docs/edgetpu/api-intro/#install-the-library)
    • Not using the GPIO on the Google TPU for visual alerting. I haven't found a solution yet, but I will. As an anecdote, I had the same kind of problem with the GPIO of an Nvidia Jetson Nano for a DIY velocity meter; it took me several hours to find a solution. The GPIOs on these ML boards are not as simple as those on Raspberry Pi boards (they are simple only on paper, not in real life)... My goal is to use a WS2812B LED strip to display custom effects according to the type of detection (frequency, duration, confidence)
    • Having cleaner code (linting & everything that was present in the lab)
    • Adding working unit tests
    • Debugging the Sphinx doc build - I was unable to build the HTML successfully...
    • Definitely using VS Code instead of PyCharm, which has become more and more disappointing (PyCharm 2021.1.1 (Community Edition) has multiple bugs & slow, strange behavior)
    • Having better general coherence between functions/classes: with deployment on Edge, desktop, AWS, ... the project was too ambitious, but I learned so much in this 4-week sprint
    • On the classifier: adding a counter for tracking events over time and measuring KPIs (time of day when the most frequent events occur, duration, clustering, prediction with Prophet), eventually sending these metrics with Kafka to the PostgreSQL DB on GCP
    • On the collected metrics: applying dataviz & data analytics with Apache Superset (v1.1) deployed as a K8s pod on GCP (like the MLflow server, the Flask-RESTX API and their own load balancer)
    • Not having enough time - it was so interesting!

    ... The FSDL project deadline does not mark the end of this project...
    ... I will continue to improve it in the next few weeks on all of the points mentioned above!

    A short list of resources I used for this project

    • FSDL GitHub & courses
    • 'Deep Learning for Computer Vision - ImageNet Bundle' by Adrian Rosebrock (https://www.pyimagesearch.com/)
    • MLflow site & tutorials from Jules Damji of Databricks (Strata workshop)
    • 'Deep Learning with Python' by François Chollet (1st and 2nd editions)
    • 'Docker in Practice, Second Edition' from Eli Stevens, Luca Antiga, and Thomas Viehmann
    • Kubernetes website & 'Learn Kubernetes in a Month of Lunches' by Elton Stoneman
    • Hyperopt & MLflow: https://dzlab.github.io/ml/2020/08/16/mlflow-hyperopt/
    • Coral website & GitHub

    Thank you very much to the FSDL team

    I would like to thank the whole FSDL team so much, especially Sergey Karayev, Josh Tobin and Pieter Abbeel.

    Sergey impressed me by staying so calm with all kinds of strange questions/remarks during the Thursday Q&A, and by presenting all the concepts in a way that made me passionate about learning and digging into them!

    My first big wave in the deep learning area was in 2017. FSDL 2021 was so cool and interesting that it's probably my second biggest wave in the deep learning area.

    I strongly recommend signing up for this paid course to have deadlines, meet passionate people on Slack and work seriously on a nice and challenging project.