---
layout: post
title: Podcasts Dataset
tags:
- podcasts
- open data
- dataset
---
This March several podcast publishers are participating in
"[Try pod](http://www.npr.org/about-npr/516454568/top-podcast-hosts-ask-their-listeners-to-try-a-pod),"
a campaign to encourage people to give podcasts a try. Unlike most people using the
[#trypod hashtag](https://twitter.com/hashtag/trypod), I'm not going to use the event as
an excuse to hawk my own podcast (I don't have one). Instead, I am publishing a large dataset
of podcasts and episodes on [data.world](https://data.world/brandon-telle/podcasts-dataset)
in the hopes that it will inspire some cool analytics or tools.
In this post, I'll demonstrate some of the [ETL logic](https://github.com/btelle/podcasts-dataset)
used to create the dataset and run some queries against it using Google BigQuery.
## ETL
The first step in creating this dataset was to create a large list of podcast feeds to
scrape. Since the de facto podcast provider, iTunes, doesn't expose a public API, I
turned to the remnants of the early days of podcasting: feed directories.
The [build_lists](https://github.com/btelle/podcasts-dataset/tree/master/build_lists)
directory contains scripts that scrape some of the largest podcast directories I found
that were still operating, including
[All Podcasts](http://www.allpodcasts.com/podcast-directory.html),
[Podcastpedia](https://www.podcastpedia.org/categories), and
[Godcast1000](http://www.godcast1000.com/). Fearing those directories were out of date,
I also wrote a script to scrape the top podcasts in each category from
[iTunes](https://github.com/btelle/podcasts-dataset/blob/master/build_lists/itunes_rss_feed_extract.py).
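Each directory scraper follows the same basic pattern: fetch a listing page, pull out anything that looks like a feed URL, and add it to the master list. Here's a minimal sketch of that pattern using requests and BeautifulSoup; the directory URL and the link-matching heuristic are placeholders, not the selectors the real scripts use.
```python
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical directory page; the real scripts target specific directories.
DIRECTORY_URL = 'https://example-podcast-directory.com/all'

def scrape_feed_urls(url):
    """Collect anything that looks like an RSS feed link from a directory page."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')

    feeds = set()
    for link in soup.find_all('a', href=True):
        href = link['href']
        # Crude heuristic: feed URLs usually contain "rss", "feed", or end in .xml
        if re.search(r'(rss|feed|\.xml$)', href, re.IGNORECASE):
            feeds.add(href)
    return feeds

if __name__ == '__main__':
    for feed in sorted(scrape_feed_urls(DIRECTORY_URL)):
        print(feed)
```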
### Shows extract
Once the list of feeds was compiled, I wanted to extract and transform show objects. The
[transform script](https://github.com/btelle/podcasts-dataset/blob/master/podcast_lib.py#L118)
iterates over the tag elements in each feed and builds a show object from them.
Since the list of feeds I extracted basically spanned the entire history of podcasting,
there were *a lot* of feeds that didn't follow today's [best practices](https://help.apple.com/itc/podcasts_connect/#/itc2b3780e76)
or even use valid XML. Trying to accommodate those broken feeds led to a pretty messy
transform step with lots of cases like the following:
```python
# handle different date formats
elif tag == 'lastbuilddate':
    try:
        if re.search('[+-][0-9]+$', child.text.strip()):
            dt = datetime.datetime.strptime(child.text.strip()[0:-5].strip(), '%a, %d %b %Y %H:%M:%S')
        else:
            dt = datetime.datetime.strptime(child.text.strip(), '%a, %d %b %Y %H:%M:%S')
    except (ValueError, AttributeError):
        dt = datetime.datetime.now()
    obj['last_build_date'] = dt.strftime('%Y-%m-%d %H:%M:%S')
```
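That branch lives inside a bigger loop that walks the feed's `<channel>` children and dispatches on each tag name. The sketch below is my rough approximation of that loop's shape using `xml.etree.ElementTree`, not a verbatim copy of the repo's transform:
```python
import xml.etree.ElementTree as ET

def transform_show(feed_xml):
    """Build a show dict from one feed, tolerating missing or odd tags (illustrative sketch)."""
    obj = {}
    channel = ET.fromstring(feed_xml).find('channel')
    if channel is None:
        return obj

    for child in channel:
        # Normalize '{namespace}tag' down to plain 'tag'
        tag = child.tag.split('}')[-1].lower()

        if tag == 'title':
            obj['title'] = child.text.strip() if child.text else None
        elif tag == 'link':
            obj['link'] = child.text.strip() if child.text else None
        elif tag == 'description':
            obj['description'] = child.text.strip() if child.text else None
        # ... plus messier branches like the lastbuilddate handler above
    return obj
```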
### Episodes extract
Using the unique table of transformed shows, I then [extracted episodes](https://github.com/btelle/podcasts-dataset/blob/master/podcast_lib.py#L34)
from the feeds. As with the shows transform step, the logic is full of try/except blocks
guarding against missing or malformed elements.
```python
# some feeds use seconds, some use [HH:]MM:SS
elif tag == 'duration':
    if child.text and ':' in child.text:
        # reverse the parts so index 0 is seconds, 1 is minutes, 2 is hours
        lengths = child.text.split(':')[::-1]
        duration = 0
        for i in range(len(lengths)):
            try:
                duration += (60 ** i) * int(float(lengths[i]))
            except (ValueError, TypeError):
                pass
    else:
        try:
            duration = int(child.text)
        except (ValueError, TypeError):
            duration = None
    obj['length'] = duration
```
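Pulled out into a standalone helper (the function name here is mine, not the repo's), the duration handling behaves like this:
```python
def parse_duration(text):
    """Convert an <itunes:duration> value ('3723', '1:02:03', '62:03') to seconds."""
    if text and ':' in text:
        seconds = 0
        # Reverse the parts so index 0 is seconds, 1 is minutes, 2 is hours
        for i, part in enumerate(text.split(':')[::-1]):
            try:
                seconds += (60 ** i) * int(float(part))
            except (ValueError, TypeError):
                pass
        return seconds
    try:
        return int(text)
    except (ValueError, TypeError):
        return None

assert parse_duration('3723') == 3723       # plain seconds
assert parse_duration('1:02:03') == 3723    # HH:MM:SS
assert parse_duration('62:03') == 3723      # MM:SS with minutes > 59
assert parse_duration(None) is None         # missing tag
```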
## Analysis
To run analytics on the dataset, I loaded the CSV files into Google BigQuery. One of the
advantages of using BigQuery is that it makes aggregations on a single column, even over
millions of rows, absurdly fast. Some of the aggregations shown below took ten times as
long to run on my local MySQL database.
*Note*: the `episodes_flat` table referenced below is the shows-and-episodes join materialized
into a physical table. Querying a logical view that performs the join in BigQuery tends to
cost more over time than unloading the joined result into a table once, since every query
against the view repeats the full-table scans needed to perform the join.
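For reference, one way to materialize the join is to run it as a query job with a destination table. The sketch below uses the google-cloud-bigquery Python client; the project ID, table names, and join columns are placeholders rather than the exact schema of my tables.
```python
from google.cloud import bigquery

client = bigquery.Client(project='my-project')  # placeholder project ID

# Write the joined result into a physical table instead of querying a logical view
dest = bigquery.TableReference.from_string('my-project.podcasts.episodes_flat')
job_config = bigquery.QueryJobConfig(
    destination=dest,
    write_disposition='WRITE_TRUNCATE',
)

sql = """
SELECT e.*, s.title AS show_title, s.category AS show_category
FROM podcasts.episodes e
JOIN podcasts.shows s ON e.show_id = s.id
"""

client.query(sql, job_config=job_config).result()  # wait for the job to finish
```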
### Completeness
The first question I want to answer with this dataset is: how much of the podcast
universe did I manage to scrape and ingest? According to a 2015 [Myndset article](http://myndset.com/2015/10/how-many-podcasts-are-there/),
there are somewhere between 100,000 and 200,000 podcasts. My extract found and ingested
**32,832** shows, or roughly 16-33% of that estimate.
As a check against that number, I wanted to see how many of the shows I subscribe to are
included in the dataset. I used [Exofudge's Pocketcasts API lib](https://github.com/exofudge/Pocket-Casts)
to get a list of the 94 shows I subscribe to on [Pocketcasts](https://www.shiftyjelly.com/pocketcasts/)
and compared it to the shows dataset:
```sql
SELECT in_dataset, COUNT(*)
FROM (
  SELECT
    subs.title,
    CASE WHEN shows.id IS NULL THEN false ELSE true END AS in_dataset
  FROM podcasts.pocketcasts_subs subs
  LEFT JOIN podcasts.shows shows ON subs.title = shows.title
  GROUP BY 1, 2
)
GROUP BY 1;
```
<table>
<tr>
<th>in_dataset</th>
<th>count</th>
</tr>
<tr>
<td>true</td><td>62</td>
</tr>
<tr>
<td>false</td><td>32</td>
</tr>
</table>
So the dataset contains **66%** of the shows I subscribe to. That number is likely skewed
higher by the fact that I mostly listen to top-100 shows on iTunes, which one of my extract
scripts specifically targeted. In any case, I'm pleased with the coverage I achieved.
### Basic aggregations
How many episodes are marked explicit?
```sql
SELECT
  episodes_explicit AS explicit,
  COUNT(*) AS episode_count
FROM podcasts.episodes_flat
GROUP BY 1 ORDER BY 2;
```
<table>
<tr>
<th>explicit</th>
<th>count</th>
</tr>
<tr>
<td>true</td><td>127,784</td>
</tr>
<tr>
<td>false</td><td>1,135,940</td>
</tr>
</table>
What's the average time between episodes?
```sql
-- timestamp differences here are in microseconds
SELECT
  AVG((episodes_pub_date - prev_pub_date)/(1000000 * 60 * 60)) AS diff_in_hours,
  AVG((episodes_pub_date - prev_pub_date)/(1000000 * 60 * 60 * 24)) AS diff_in_days
FROM (
  SELECT
    episodes_id,
    show_id,
    episodes_pub_date,
    LAG(episodes_pub_date, 1) OVER (PARTITION BY show_id ORDER BY episodes_pub_date) AS prev_pub_date
  FROM podcasts.episodes_flat
);
```
<table>
<tr>
<th>diff_in_hours</th>
<th>diff_in_days</th>
</tr>
<tr>
<td>257.65</td><td>10.73</td>
</tr>
</table>
On average, how many episodes are in a feed?
```sql
SELECT AVG(episode_count) AS avg, MIN(episode_count) AS min, MAX(episode_count) AS max
FROM (
  SELECT show_id, COUNT(*) AS episode_count
  FROM podcasts.episodes_flat
  GROUP BY 1
);
```
<table>
<tr>
<th>avg</th>
<th>min</th>
<th>max</th>
</tr>
<tr>
<td>82.95</td>
<td>1</td>
<td>24,350</td>
</tr>
</table>
The feed with **24,350** episodes is the [TSN.ca](http://www.tsn.ca/) podcast, which has
published an average of 13 episodes a day since 2011. They keep their entire history,
going back to 2011, in a single feed that weighs in at a whopping **31 MB**.
### It's not a data post without some charts
What are the most popular audio encodings?
```sql
SELECT episodes_audio_mime_type, COUNT(*)
FROM podcasts.episodes_flat
GROUP BY 1 HAVING COUNT(*) > 1
ORDER BY 2 DESC;
```
<div id="audio-mime"><svg height="400"></svg></div>
Episode categories over time:
```sql
SELECT DATE(episodes_pub_date), categories.category, COUNT(*)
FROM (
  SELECT category, COUNT(*)
  FROM podcasts.shows
  WHERE category IS NOT NULL
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 15
) categories
INNER JOIN podcasts.episodes_flat ep ON categories.category = ep.show_category
WHERE episodes_pub_date > TIMESTAMP('2006-11-27')
  AND episodes_pub_date < TIMESTAMP('2016-11-27')
GROUP BY 1, 2
ORDER BY 1 DESC, 2;
```
<div id="categories-histogram"><svg height="500"></svg></div>
<script src="/js/posts/podcasts-dataset.js"></script>
## Further work
There is a lot of unstructured text in this dataset; using a tool like Elasticsearch
to mine the episode descriptions could surface some interesting insights.
This would also be a great dataset to join with popularity data, but as far as I know
there is no good public source for that.
The dataset could also provide a good starting point for a show recommendation engine
based on NLP processing of descriptions and tag analysis, as sketched below.
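As a rough illustration of that last idea, TF-IDF similarity over show descriptions already gives you a crude "shows like this one" list. Here's a minimal scikit-learn sketch, assuming a shows.csv export with title and description columns:
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Assumes a shows.csv export with 'title' and 'description' columns
shows = pd.read_csv('shows.csv').dropna(subset=['description']).reset_index(drop=True)

tfidf = TfidfVectorizer(stop_words='english', max_features=50000)
matrix = tfidf.fit_transform(shows['description'])

def similar_shows(title, n=5):
    """Return the n shows whose descriptions are closest to the given show's."""
    idx = shows.index[shows['title'] == title][0]
    scores = linear_kernel(matrix[idx], matrix).ravel()
    best = scores.argsort()[::-1][1:n + 1]  # skip the show itself
    return shows['title'].iloc[best].tolist()

# Example call; works for any title present in the dataset
print(similar_shows('99% Invisible'))
```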
Making something cool with this data? [Let me know!](mailto:brandon.telle+edd@gmail.com)