Commit 28818893 authored by everet

Refactor code and clarify docs.

parent 18605e8d
# TODO
* single quotes don't work in CL searches. Must use double quotes only. Is there
a way to clean input so single quotes are treated as double quotes? Or perhaps
add an `exact_phrase` boolean option?
* single quotes don't work in CL searches. Must use double quotes only. Is
there a way to clean input so single quotes are treated as double quotes? Or
perhaps add an `exact_phrase` boolean option?
* Let user specify ALL possible CL search options: Price range, ZIP code, beds,
baths, etc. Don't forget: Availability and Open House (sale_date) dates.
* Let user specify which variables, how many pages, and how many ads to scrape.
* Decide: Should timestamps reflect when each ad was scraped, or when the search
began? The former differs across *pages* when `deep=False` and differs across
*ads* when `deep=True`. But `post_id` already uniquely IDs ads and I don't need
an identifier for pages. If I change timestamps to reflect search begin time,
they'll differ across search objects. Is that desirable?
* Decide: Should timestamps reflect when each ad was scraped, or when the
search began? The former differs across *pages* when `deep=False` and differs
across *ads* when `deep=True`. But `post_id` already uniquely IDs ads and I
don't need an identifier for pages. If I change timestamps to reflect search
begin time, they'll differ across search objects. Is that desirable?
* Replace `requests` dependency with `urllib3`? Because minimalism.
* No more CSS selectors?
* CLI
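
One possible approach to the single-quote item above, as a rough sketch only. `normalize_quotes` and `as_query` are hypothetical helpers, not part of `craigapts`, and the assumption that CL treats only double-quoted text as an exact phrase comes from the TODO note itself. Apostrophes inside words (e.g. `don't`) are left alone.

```python
import re


def normalize_quotes(query):
    """Hypothetical helper: turn 'single-quoted phrases' into "double-quoted" ones.

    Only quotes that are not adjacent to a word character on the outside
    are treated as phrase delimiters, so apostrophes in words are kept.
    """
    return re.sub(r"(?<!\w)'([^']+)'(?!\w)", r'"\1"', query)


def as_query(text, exact_phrase=False):
    """Hypothetical helper: clean a query and optionally force an exact phrase."""
    text = normalize_quotes(text)
    # Wrap in double quotes only if the text is not already a quoted phrase.
    if exact_phrase and not (text.startswith('"') and text.endswith('"')):
        text = f'"{text}"'
    return text


print(as_query("'no section 8'"))                   # "no section 8"
print(as_query("no section 8", exact_phrase=True))  # "no section 8"
print(as_query("don't 'pet friendly'"))             # don't "pet friendly"
```

The cleaned string could then be passed as `query=` when building a search; whether to do this silently or behind an `exact_phrase` flag is still the open question.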
......@@ -190,7 +190,7 @@ class CLSearch:
finally:
self.next_page_url = self._build_url(sfx) if sfx else None
def _get_info_from(self, css, attr=None, pat=".*"):
def _get_info_from(self, css, attr=None, pat=None):
"""Scrape HTML nodes.
Scrapes data from nodes identified by given CSS selector, HTML
......@@ -200,8 +200,11 @@ class CLSearch:
A list, where each element contains info from a node.
"""
nodes = self.soup.select(css)
info = [n.text if attr is None else n[attr] for n in nodes]
if pat != ".*":
if attr:
info = [n[attr] for n in nodes]
else:
info = [n.text for n in nodes]
if pat:
info = ["".join(findall(pat, i)) for i in info]
info = [None if i == "" else i for i in info]
return info or [None]
......
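
For reference, the pattern step that this hunk changes can be tried in isolation. The sketch below is not the method itself: `apply_pattern` is a hypothetical stand-in that takes plain strings instead of BeautifulSoup nodes, so it only illustrates the `pat=None` default, the joining of `findall` matches, and the empty-string-to-`None` cleanup.

```python
from re import findall


def apply_pattern(info, pat=None):
    """Hypothetical stand-in for the post-processing in `_get_info_from`."""
    if pat:
        # Keep only the matching pieces of each string, joined together.
        # (Patterns with capture groups would make findall return tuples,
        # so this sketch assumes group-free patterns.)
        info = ["".join(findall(pat, i)) for i in info]
    # Empty strings carry no information; mark them as missing.
    info = [None if i == "" else i for i in info]
    # An empty result list becomes [None] so callers always get one element.
    return info or [None]


print(apply_pattern(["$1,200", "2br", ""], pat=r"\d+"))  # ['1200', '2', None]
print(apply_pattern(["abc"]))                            # ['abc']
print(apply_pattern([]))                                 # [None]
```

With `pat=None` the regex pass is skipped entirely, which is what the old `pat != ".*"` check was approximating.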
......@@ -18,7 +18,7 @@ URL = 'https://gitlab.com/everetr/craigapts'
EMAIL = ''
AUTHOR = 'Everet Rummel'
REQUIRES_PYTHON = '>=3.7.0'
VERSION = '2020.3.7.9000'
VERSION = '2020.3.16'
# What packages are required for this module to be executed?
REQUIRED = [
......
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
sys.path.append("/PATH/TO/craigapts/craigapts")
sys.path.append("/PATH/TO/craigapts")
import pandas as pd
from search import CLSearch
import craigapts as cg
nnj_ns8 = CLSearch(geo="newjersey", query='"no section 8"')
nnj_ns8 = cg.CLSearch(geo="newjersey", query='"no section 8"')
df = nnj_ns8.data
any(df.duplicated("post_id"))