# Parcours data SHS
Launch the notebook by clicking the button [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gl/open-scientist%2Fparcours-data-shs/master?urlpath=lab%2Ftree%2Fnotebooks%2Fsession4%2Findex.ipynb)
Launch the notebook by clicking the button [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gl/open-scientist%2Fparcours-data-shs/master?urlpath=lab%2Ftree%2Fnotebooks%2Fsession5%2Findex.ipynb)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## K-means clustering\n",
"\n",
"[Hierarchical clustering](k-means-clustering.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear regression\n",
"\n",
"[Exercise with the population by country](world_population_prediction.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting data from XML\n",
"\n",
"[Demo with the questions to the Assemblée Nationale](pandas-questions-answers.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! cp ../session4/Untitled.ipynb ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predicting the population of each country"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset comes from World Bank data: the evolution of the world population, country by country, since 1960: https://data.worldbank.org/indicator/SP.POP.TOTL\n",
"\n",
"For countries whose growth is monotonic, a linear regression can be fitted and the countries can then be grouped into categories.\n",
"\n",
"This exercise is meant as data exploration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieving the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=csv \\\n",
" -O ../../data/raw/API_SP.POP.TOTL_DS2_en_csv_v2_566132.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!cd ../../data/raw/ && unzip API_SP.POP.TOTL_DS2_en_csv_v2_566132.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"world_population_notindexed = pandas.read_csv(\"../../data/raw/API_SP.POP.TOTL_DS2_en_csv_v2_713131.csv\", skiprows=4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract and process the data\n",
"\n",
"- Drop the non-essential columns\n",
"- Index by country"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualize the data\n",
"\n",
"- plot the population over time for one country\n",
"- plot the population over time for a group of countries on the same chart\n",
"- plot the population over time for a group of countries on a grid of charts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit a linear regression on the population of Sweden over time\n",
"\n",
"- fit a linear regression on the population of Sweden over time\n",
"- write a function that performs this linear regression for every country"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract the linear regression coefficients into a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cluster the countries by their growth coefficient relative to total population"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Studying a second dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dataset on the countries of the world is available on the Kaggle platform.\n",
"\n",
"Note that downloading the data requires a Kaggle account. The data is already available in the data-public directory.\n",
"\n",
"https://www.kaggle.com/fernandol/countries-of-the-world#countries%20of%20the%20world.csv\n",
"\n",
"\n",
"## Exercise: reproduce the following analysis (written in R) in Python\n",
"\n",
"https://rpubs.com/aphalin11/clust_country"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
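The regression steps of the notebook above (fit a line for one country, repeat for every country, collect the coefficients) can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions: the population figures are made up rather than taken from the World Bank file, `fit_line` and `relative_growth` are hypothetical names, and dividing the slope by the mean population is just one possible reading of "growth coefficient relative to total population".

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up figures for illustration only, NOT World Bank data.
years = [1960, 1970, 1980, 1990]
populations = {
    "Country A": [100.0, 110.0, 120.0, 130.0],
    "Country B": [1000.0, 1500.0, 2000.0, 2500.0],
}

# One regression per country, collected in a dictionary of coefficients.
coefficients = {}
for country, values in populations.items():
    slope, intercept = fit_line(years, values)
    mean_population = sum(values) / len(values)
    coefficients[country] = {
        "slope": slope,
        "intercept": intercept,
        # Assumption: growth normalised by the mean population of the country.
        "relative_growth": slope / mean_population,
    }

for country, coeffs in coefficients.items():
    print(country, coeffs)
```

In the notebook itself, the same computation would more naturally be applied row by row to the DataFrame indexed by country, and the resulting coefficients could then feed the clustering step.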
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Navigation: [index](index.ipynb) [pandas session](session-pandas.ipynb)\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XPath tutorial"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from lxml import etree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An XML document is a tree, similar to a folder hierarchy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree = etree.parse(\"data/questions-reponses-AN/QUESTION_ECRITE20110040.xml\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(tree)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To reach the root of the tree, lxml provides the `getroot()` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dir(tree)  # lists the attributes and methods of the tree, like Tab completion after `tree.`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.getroot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"!cat data/questions-reponses-AN/QUESTION_ECRITE20110040.xml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This element has the tag `QUESTION_ECRITE`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting a list of elements (possibly containing a single element) is done with the `xpath` method:\n",
"\n",
"`tree.xpath(XPATH_EXPRESSION)`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, XPath expressions are called either absolute or relative:\n",
"\n",
"- absolute: they start with `/` and spell out the path from the document root\n",
"- relative: they start with `//` and match the path at any depth in the document\n",
"\n",
"A tag name then follows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access the elements with the tag `QUESTION_ECRITE` located directly under the document root:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access the elements with the tag `QUESTION_ECRITE`, whatever their parent:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"//QUESTION_ECRITE\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both calls return the same list, containing a single element. The schema is carefully designed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE\")[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el = tree.xpath(\"/QUESTION_ECRITE\")[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(el)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el.tag"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access the first 10 children of the selected element: selecting an element selects its whole subtree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE\")[0].getchildren()[0:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE/QE/DONNEES\")[0].getchildren()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relative"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el2 = tree.xpath(\"//QE/DONNEES/RUBRIQUE\")[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el2.tag"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el2.text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Iterating over the elements to print their tags:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for element in tree.xpath(\"/QUESTION_ECRITE\")[0].getchildren()[:10]:\n",
" print(element.tag)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Attributes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = etree.fromstring(\"\"\"\n",
"<xml>\n",
" <noeud name=\"Tom\">\n",
" connexion\n",
" </noeud>\n",
" <node name=\"Bob\">\n",
" déconnexion\n",
" </node>\n",
"</xml>\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.xpath(\"//noeud[@name='Tom']\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.xpath(\"//noeud[@name='Tom']\")[0].text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a namespace is defined, declare it in a mapping and pass it to the query:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`ns = {\"alias\": \"http://url.example.org\"}`\n",
"\n",
"`data.xpath(\"//alias:tag\", namespaces=ns)`"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
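The attribute and namespace selections from the tutorial above can be tried without any data file. The sketch below is an illustration under stated assumptions: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath through `findall`, instead of lxml, and the document, the `alias` prefix and the URL are invented for the example. With lxml, the equivalent query would pass the same mapping as `namespaces=ns` to `xpath`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the namespace URL and the tag names are made up.
XML = """
<root xmlns:alias="http://url.example.org">
  <alias:tag name="Tom">connexion</alias:tag>
  <alias:tag name="Bob">déconnexion</alias:tag>
</root>
"""

root = ET.fromstring(XML)

# The prefix used in the query is resolved through this mapping,
# so it does not have to match the prefix used in the document.
ns = {"alias": "http://url.example.org"}

# Every <alias:tag> element at any depth below the root.
tags = root.findall(".//alias:tag", ns)
names = [element.get("name") for element in tags]

# Attribute predicate combined with the namespace prefix.
tom = root.findall(".//alias:tag[@name='Tom']", ns)
tom_text = tom[0].text

print(names)
print(tom_text)
```

Without the `ns` mapping, the prefixed query cannot be resolved, which is why the namespace has to be declared on the Python side as well.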