Commit 4be7008b authored by frankie

web scraping with beautiful soup

parent 96133441
......@@ -22,6 +22,47 @@ scripts and notes of the workshop
**web05_server.py** : listing files uploaded & providing links to files
### scraping
**beautifulsoup00.py** : simple example on how to extract content from a webpage
requires the installation of python-bs4, on linux:
```
sudo apt install python-bs4
```
**beautifulsoup01.py** : advanced example using selenium's webdriver
requirements:
```
sudo apt install python-bs4
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
sudo apt-get install xvfb
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
sudo -H python3 -m pip install pyvirtualdisplay selenium
```
source: https://christopher.su/2015/selenium-chromedriver-ubuntu/
test your installation with the script **selenium_chrome.py**
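
Before launching a browser, it can help to confirm that the pieces installed above are visible to Python at all. The following is a minimal sketch using only the standard library; the function name `check_setup` is hypothetical and not part of the workshop scripts, and the module names are assumed from the install commands listed above.

```python
import importlib.util
import shutil

# Modules named in the install commands above (assumed import names).
REQUIRED_MODULES = ("bs4", "selenium", "pyvirtualdisplay")


def check_setup(modules=REQUIRED_MODULES, binary="chromedriver"):
    """Return a dict mapping each requirement to True (found) or False."""
    # find_spec returns None when a top-level module cannot be located
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # shutil.which looks the executable up on PATH, like the shell does
    report[binary] = shutil.which(binary) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{name:<18} {'ok' if ok else 'MISSING'}")
```

This only checks that the modules and the `chromedriver` binary can be found; it does not start Chrome, so **selenium_chrome.py** remains the real end-to-end test.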
### during course
**webserver.py** : script developed during the first day
## resources
- [Simple HTTP server](https://docs.python.org/2/library/simplehttpserver.html)
......@@ -36,4 +77,12 @@ scripts and notes of the workshop
- [All possible Content-Type for HTTP headers](https://stackoverflow.com/questions/23714383/what-are-all-the-possible-values-for-http-content-type-header#37416922)
all images of folder *html/images* are from https://libreshot.com/tag/red/ and use a CC0 license
\ No newline at end of file
- [Installation of selenium with chrome driver](https://christopher.su/2015/selenium-chromedriver-ubuntu/)
all images of folder *html/images* are from https://libreshot.com/tag/red/ and use a CC0 license
```
sudo python3 -m pip install
sudo -H pip3 install --upgrade pip
sudo -H python3 -m pip install selenium
```
\ No newline at end of file
# sudo apt install python-bs4
'''
simple demo of beautiful soup on the home page of imal
'''
from bs4 import BeautifulSoup
import requests

URL = "https://imal.org"
r = requests.get(URL)
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.content, "html.parser")
#print( soup.prettify() )
print( soup.title.string )
# print the target of every link on the page
for link in soup.find_all('a'):
    print(link.get('href'))
# print the contents of every <div class="content">
for div in soup.find_all('div',{'class':'content'}):
    divc = div.get('class')  # class list of the current div
    print( '####################################' )
    #print(div.encode_contents())
    print(div.contents)
\ No newline at end of file
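
The link loop above prints `href` values exactly as they appear in the page, so the output can contain `None` (anchors without an `href`), in-page fragments, and relative paths. A minimal sketch of one way to normalise them with the standard library follows; the helper `absolute_links` and the sample paths are made up for illustration and are not part of the workshop scripts.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve href values against base_url, skipping empty and in-page ones."""
    links = []
    for href in hrefs:
        # link.get('href') returns None for <a> tags without an href
        if not href or href.startswith("#"):
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones
        links.append(urljoin(base_url, href))
    return links


if __name__ == "__main__":
    # sample values standing in for what link.get('href') might return
    sample = ["/en", None, "about.html", "#top", "https://example.org/page"]
    for link in absolute_links("https://imal.org/", sample):
        print(link)
```

In the script this could be fed with `[a.get('href') for a in soup.find_all('a')]`, which keeps the scraping and the URL clean-up as separate steps.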
# see README.md for dependencies
'''
for pages rendered with javascript, we need a 'headless' browser to render the page
'''
from bs4 import BeautifulSoup
from selenium import webdriver

url="https://www.flipkart.com/hp-pentium-quad-core-4-gb-1-tb-hdd-dos-15-be010tu-notebook/product-reviews/itmeprzhy4hs4akv?page1&pid=COMEPRZBAPXN2SNF"
# let chrome load the page and run its javascript, then grab the resulting html
browser = webdriver.Chrome()
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, "html.parser")
print( soup.prettify() )
print( soup.title.string )
# print the target of every link on the page
for link in soup.find_all('a'):
    print(link.get('href'))
# print the contents of every <div class="content">
for div in soup.find_all('div',{'class':'content'}):
    divc = div.get('class')  # class list of the current div
    print( '####################################' )
    #print(div.encode_contents())
    print(div.contents)
\ No newline at end of file
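from pyvirtualdisplay import Display
from selenium import webdriver

# start a virtual X display (xvfb) so chrome can run without a screen
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome()
driver.get('http://christopher.su')
print( driver.title )
# close the browser and stop the virtual display once done
driver.quit()
display.stop()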
\ No newline at end of file
......@@ -48,7 +48,6 @@ class Handler(SimpleHTTPRequestHandler):
except:
self.boom()
self.send_response(200)
self.send_header('Content-Type', 'text/html')
self.end_headers()
......