expand and clean up date validation routines

We used to have a long list of fields. Now we just iterate over the
item, then the feed, and look for the fields we want. It's cleaner
visually and might even make some feeds validate, as we now look for
`created_parsed` from feeds as well.

Order should be otherwise unchanged.
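
A minimal sketch of the lookup order described above, using plain dicts
in place of the feedparser item and feed objects. The standalone
signature (`item` and `feed` passed as arguments) and the `FIELDS`
constant are assumptions for illustration; the actual change uses a
closure inside the `Feed` class.

```python
import time

# Fields to inspect, in priority order (same order as in the change).
FIELDS = ('updated_parsed', 'published_parsed', 'created_parsed')


def pick_first_date(item, feed, fields=FIELDS):
    """Return the first truthy date field, checking the item before the feed."""
    for scope in (item, feed):
        for field in fields:
            value = scope.get(field)
            if value:
                return value
    # nothing usable in either scope
    return None


# Hypothetical sample data: the item only has a creation date.
feed = {'updated_parsed': time.gmtime(0)}
item = {'created_parsed': time.gmtime(86400)}

picked = pick_first_date(item, feed)        # item wins over feed
fallback = pick_first_date({}, feed)        # empty item falls back to feed
missing = pick_first_date({}, {})           # no date anywhere
```

Every item field is tried before any feed field, so a feed-level date
is only used when the item has no usable date at all.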

This was useful in diagnosing issues with invalid dates, as before
this change, we couldn't tell which field was picked for the item
date.
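
To illustrate the diagnostic value, here is a hedged sketch that
captures the debug message in a list so it can be inspected. The
`ListHandler` class, the `pick_and_log` function, the `date-demo`
logger name and the sample item are all hypothetical; only the message
format is taken from the change.

```python
import logging

FIELDS = ('updated_parsed', 'published_parsed', 'created_parsed')


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def pick_and_log(item, feed, logger):
    """Pick the first truthy date field and log which field was used."""
    for scope in (item, feed):
        for field in FIELDS:
            if scope.get(field):
                logger.debug('picked field %s for item %s: %s',
                             field, item.get('id'), scope.get(field))
                return scope.get(field)
    return None


logger = logging.getLogger('date-demo')
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

# Hypothetical item that only carries a publication date.
item = {'id': 'item-1', 'published_parsed': (2019, 2, 14, 0, 0, 0, 3, 45, 0)}
result = pick_and_log(item, {}, logger)
```

With debug logging enabled, the captured message names the field that
supplied the date, which the previous nested `get()` chain could not
report.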

See #7.
parent 2e339c79
@@ -304,10 +304,7 @@ class Feed(feedparser.FeedParserDict):
         we do the following operation:
         1. add more defaults to item dates (`issue #113
-        <https://github.com/kurtmckee/feedparser/issues/113>`_):
-        * created_parsed of the item
-        * updated_parsed of the feed
+        <https://github.com/kurtmckee/feedparser/issues/113>`_)
         2. missing GUID in some feeds (`issue #112
         <https://github.com/kurtmckee/feedparser/issues/112>`_)
@@ -316,11 +313,26 @@ class Feed(feedparser.FeedParserDict):
         where feeds are /foo instead of https://github.com/foo.
         unreported for now.
         """
         # 1. add more defaults (issue #113)
+        def pick_first_date():
+            """find a valid date in item or feed"""
+            fields = ('updated_parsed', 'published_parsed', 'created_parsed')
+            # first check the item itself, then fall back on the feed
+            for scope in (item, self):
+                # all the fields to inspect
+                for field in fields:
+                    if scope.get(field, False):
+                        logging.debug('picked field %s for item %s: %s',
+                                      field, item.get('id'), scope.get(field))
+                        return scope.get(field)
         # ignore deprecation warnings from feedparser:
         # https://github.com/kurtmckee/feedparser/issues/151
         with warnings.catch_warnings():
             warnings.simplefilter("ignore")
-            item['updated_parsed'] = item.get('updated_parsed', item.get('published_parsed', item.get('created_parsed', self.get('updated_parsed', self.get('published_parsed', False))))) # noqa
-            assert item.get('updated_parsed') is not None
+            item['updated_parsed'] = pick_first_date()
+            if not item.get('updated_parsed'):
+                logging.warning('no parseable date found in feed item %s from feed %s, using current time instead',
+                                item.get('id'), self.get('url'))
......