# Parcours data SHS
Launch the notebook by clicking the button [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gl/open-scientist%2Fparcours-data-shs/master?urlpath=lab%2Ftree%2Fnotebooks%2Fsession3%2Findex.ipynb)
Launch the notebook by clicking the button [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gl/open-scientist%2Fparcours-data-shs/master?urlpath=lab%2Ftree%2Fnotebooks%2Fsession4%2Findex.ipynb)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Navigation : [index](index.ipynb) [session pandas](session-pandas.ipynb)\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutoriel xpath"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from lxml import etree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Un document XML est un arbre, similaire à une arborescence de dossiers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree = etree.parse(\"data/questions-reponses-AN/QUESTION_ECRITE20110040.xml\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(tree)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pour atteindre la racine de l'arbre, lxml fournit la méthode `getroot()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.getroot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"!cat data/questions-reponses-AN/QUESTION_ECRITE20110040.xml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cet élément a pour tag `QUESTION_ECRITE`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"La sélection d'une liste d'éléments (contenant possiblement un seul élément) se fait à l'aide de la méthode xpath :\n",
"\n",
"`tree.xpath(XPATH_EXPRESSION)`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Les expressions xpath sont soit absolues, soit relatives :\n",
"\n",
"- absolues : commencent par `/`\n",
"- relatives : commencent par `//`\n",
"\n",
"Ensuite est placé un tag."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accès aux éléments ayant le tag `QUESTION_ECRITE` et placés juste en dessous de la racine"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accès au éléments ayant le tag `QUESTION_ECRITE`, quel que soit leur parent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"//QUESTION_ECRITE\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"C'est la même liste comprenant un seul élément. Le schéma est fait soigneusement."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE\")[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el = tree.xpath(\"/QUESTION_ECRITE\")[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(el)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el.tag"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accéder aux 10 premiers enfants de cet élément sélectionné : la sélection d'un élément sélectionne le sous-arbre."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE\")[0].getchildren()[0:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.xpath(\"/QUESTION_ECRITE/QE/DONNEES\")[0].getchildren()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relative"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el2 = tree.xpath(\"//QE/DONNEES/RUBRIQUE\")[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el2.tag"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"el2.text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Itération sur les éléments pour afficher le tag :"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for element in tree.xpath(\"/QUESTION_ECRITE\")[0].getchildren()[:10]:\n",
" print(element.tag)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Attributs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = etree.fromstring(\"\"\"\n",
"<xml>\n",
" <noeud name=\"Tom\">\n",
" connexion\n",
" </noeud>\n",
" <node name=\"Bob\">\n",
" déconnexion\n",
" </node>\n",
"</xml>\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.xpath(\"//noeud[@name='Tom']\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.xpath(\"//node[@name='Tom']\")[0].text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lorsqu'un namespace est défini, le reproduire :"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`ns = {\"alias\": \"http://url.example.org\"}`\n",
"\n",
"`data.xpath(\"//alias:tag\")`"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
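
As a recap, the selections covered in the notebook can be replayed on a small inline document, without the data files. This is only a sketch: the XML content, the values, and the `a:racine` element are invented for illustration, and only the tag names echo the tutorial.

```python
from lxml import etree

# Invented sample document; only the tag names echo the tutorial.
SAMPLE = """
<QUESTION_ECRITE>
  <QE>
    <DONNEES>
      <RUBRIQUE>transports</RUBRIQUE>
    </DONNEES>
  </QE>
  <noeud name="Tom">connexion</noeud>
  <noeud name="Bob">deconnexion</noeud>
</QUESTION_ECRITE>
"""
root = etree.fromstring(SAMPLE)

# Leading "/": absolute path, every level named from the document root.
by_path = [el.text for el in root.xpath("/QUESTION_ECRITE/QE/DONNEES/RUBRIQUE")]

# Leading "//": match the tag at any depth in the document.
anywhere = [el.text for el in root.xpath("//RUBRIQUE")]

# No leading slash: path relative to the element xpath() is called on.
from_root = [el.text for el in root.xpath("QE/DONNEES/RUBRIQUE")]

# Attribute predicate, then the attribute values themselves.
tom_text = root.xpath("//noeud[@name='Tom']")[0].text
names = [str(name) for name in root.xpath("//noeud/@name")]

# Namespaced document: the alias used in the query is chosen freely,
# only the URI has to match the one declared in the XML.
NS_SAMPLE = '<a:racine xmlns:a="http://url.example.org"><a:tag>valeur</a:tag></a:racine>'
ns = {"alias": "http://url.example.org"}
ns_root = etree.fromstring(NS_SAMPLE)
ns_text = ns_root.xpath("//alias:tag", namespaces=ns)[0].text

print(by_path, anywhere, from_root)
print(tom_text, names, ns_text)
```

The three `RUBRIQUE` queries return the same element here because the tag appears only once. On a document that reuses a tag at several depths, the `//` form would return more matches than the absolute path.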