...
 
Commits (2)
---
layout: post
title: Cruise Location Dataset
---
I just got back from my honeymoon, my first cruise to Alaska on the Explorer of the Seas.
Aside from having a fantastic time with my wife, I gained a new fascination with the
breadth and scale of cruise travel. Our ship housed upwards of 2,000 passengers and was
often one of two or three ships docked in a port. Multiply that by the 100+ ships that
are currently running cruises, you can have 200,000 people on the ocean at a time visiting
dozens of ports across the globe.
When we got back, I began looking into cruise tracking sites and found one site that
offers historical ship location data for 200+ ships going back to 2003. Using that data, I
hoped to compile a dataset of ships, cities and locations for analysis and publishing.
## Extract
The scripts used to extract and clean the raw data are available on
[Github](https://github.com/btelle/cruise-scraper). They are quick hacks, but they got the
job done.
## Transformation
Once the raw data was loaded into BigQuery, I began transforming it. First step, I want to
join the location stamps to port cities and find the closest port city to that location.
I tried using a Cartesian join on the whole dataset:
```sql
SELECT callsign, created_at, loc.lat as lat, loc.lon as lon, wind_from, wind_knots,
wind_gust, barometer, air_temp, dew_point, city.lat as city_lat, city.lon as city_lon,
city.id as city_id
FROM `cruise_scraper.locations_clean` loc,
`cruise_scraper.port_cities` city
```
Since the raw dataset is rather large, about 4.7 million location stamps, BigQuery
[couldn't quite handle it](/img/cruise-scraper/bq-failure.png). I limited the query to a
few months worth of data and that worked, so I ran the same query 47 times and finally
ended up with 2.5 billion rows of Cartesian joined goodness.
The next step is to eliminate the rows where the location stamp is paired with any city
other than the one closest to it:
```sql
SELECT callsign, created_at, lat, lon, city_id as nearest_port_city, distance_to_port
FROM (
SELECT
callsign, created_at, lat, lon, wind_from, wind_knots, wind_gust, barometer, air_temp,
dew_point, city_lat as city_lat, city_lon as city_lon, city_id as city_id,
cartesianDistance(lat, lon, city_lat, city_lon) as distance_to_port,
row_number() over(partition by callsign, created_at order by cartesianDistance(lat, lon, city_lat, city_lon)) as rnk
FROM `cruise_scraper.locations_city_join_*`
GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14
)
WHERE rnk=1
```
The `cartesianDistance` function in the above query is a UDF. BigQuery accepts Javascript
UDF's, this one computes the distance in meters between two (lat,lon) pairs:
```sql
CREATE TEMPORARY FUNCTION cartesianDistance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
var R = 6378.137; // Radius of earth in KM
var dLat = lat2 * Math.PI / 180 - lat1 * Math.PI / 180;
var dLon = lon2 * Math.PI / 180 - lon1 * Math.PI / 180;
var a = Math.sin(dLat/2) * Math.sin(dLat/2) +
Math.cos(lat1 * Math.PI / 180) * Math.cos(lat2 * Math.PI / 180) *
Math.sin(dLon/2) * Math.sin(dLon/2);
var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
var d = R * c;
return d * 1000; // meters
""";
```
Using the nearest city location stamp data, I aggregated down to daily granularity:
```sql
SELECT
trim(ships_v.name) as ship_name,
trim(loc.callsign) as callsign,
trim(ships_v.major_cruise_line) as major_cruise_line,
date,
cartesianDistance(x1, y1, x2, y2) as meters_traveled,
x1 as lat_start,
y1 as lon_start,
x2 as lat_end,
y2 as lon_end,
case when stopped_minutes >= 180 then TRUE else FALSE end as in_port, -- 3 hours spent stationary
stopped_minutes,
num_periods,
port_city as port_city_id,
cities.city as port_city_name,
cities.country_code as port_city_country
FROM (
SELECT
callsign,
date(created_at) as date,
max(case when day_start=1 then lat else null end) as x1,
max(case when day_start=1 then lon else null end) as y1,
max(case when day_end=1 then lat else null end) as x2,
max(case when day_end=1 then lon else null end) as y2,
sum(case when abs(lat-last_lat) < 0.1 and abs(lon-last_lon) < 0.1 then timestamp_diff(created_at, case when last_created_at is null then timestamp(date(created_at)) else last_created_at end, MINUTE) else 0 end) as stopped_minutes,
sum(timestamp_diff(created_at, last_created_at, MINUTE)) as total_minutes,
max(case
when last_lat is null and last_lon is null then nearest_port_city
when abs(lat-last_lat) < 0.1 and abs(lon-last_lon) < 0.1 then nearest_port_city
else null end) as port_city,
count(*) as num_periods
FROM (
SELECT
callsign,
created_at,
lat,
lon,
nearest_port_city,
distance_to_port,
lag(lat) over(partition by callsign, date(created_at) order by created_at) as last_lat,
lag(lon) over(partition by callsign, date(created_at) order by created_at) as last_lon,
lag(created_at) over(partition by callsign, date(created_at) order by created_at) as last_created_at,
row_number() over (partition by callsign, date(created_at) order by created_at) as day_start,
row_number() over (partition by callsign, date(created_at) order by created_at desc) as day_end
FROM cruise_scraper.locations_with_city
-- WHERE created_at > timestamp('2017-06-01')
GROUP BY 1,2,3,4,5,6
)
GROUP BY 1,2
) loc
LEFT JOIN `cruise_scraper.ships_v` ships_v ON LOWER(TRIM(loc.callsign))=LOWER(TRIM(ships_v.callsign))
LEFT JOIN `cruise_scraper.port_cities` cities ON port_city = cities.id
```
## Analysis
Which cruise line is operating the most ships:
```sql
SELECT major_cruise_line, COUNT(*)
FROM cruise_scraper.ships_v
GROUP BY 1
ORDER BY 2 DESC
```
<table>
<tr>
<th>major_cruise_line</th>
<th>count</th>
</tr>
<tr>
<td>Other</td>
<td>149</td>
</tr>
<tr>
<td>Royal Caribbean International</td>
<td>25</td>
</tr>
<tr>
<td>Carnival Cruise Line</td>
<td>24</td>
</tr>
<tr>
<td>Norwegian Cruise Line</td>
<td>16</td>
</tr>
<tr>
<td>Holland America Line</td>
<td>15</td>
</tr>
<tr>
<td>Costa Crociere</td>
<td>15</td>
</tr>
<tr>
<td>Princess Cruises</td>
<td>14</td>
</tr>
<tr>
<td>Celebrity Cruises</td>
<td>11</td>
</tr>
<tr>
<td>Silversea Cruises</td>
<td>4</td>
</tr>
<tr>
<td>Disney Cruises</td>
<td>3</td>
</tr>
</table>
How many day/ship combinations are missing data?
```sql
SELECT
COUNT(*)
FROM(
SELECT
ship_name,
date,
lag(date) over(partition by callsign order by date) as last_date
FROM `home-data-warehouse.cruise_scraper.daily_ship_location`
)
WHERE date_diff(date, last_date, DAY) > 1
```
<blockquote>83,106</blockquote>
What are the most popular port cities?
```sql
SELECT
port_city_name as city,
port_city_country as country,
COUNT(*) as count
FROM cruise_scraper.daily_ship_location
WHERE major_cruise_line <> 'Other' and in_port = True
GROUP BY 1,2
ORDER BY 3 DESC
```
<table>
<tr>
<th>city</th>
<th>country</th>
<th>count</th>
</tr>
<tr>
<td>Fort Lauderdale (Port Everglades)</td>
<td>US</td>
<td>3705</td>
</tr>
<tr>
<td>Vancouver</td>
<td>CA</td>
<td>3236</td>
</tr>
<tr>
<td>Cozumel</td>
<td>MX</td>
<td>3074</td>
</tr>
<tr>
<td>San Diego</td>
<td>US</td>
<td>2762</td>
</tr>
<tr>
<td>St. Maarten</td>
<td>SX</td>
<td>2046</td>
</tr>
</table>
## Data Issues
The raw data comes from an unmonitored data pipeline, there may be data missing.
Additionally, it looks like ships do not report location details when they are outside
of service areas. This leads to many days where there's only one reported location.
Since my definition of an in-port day depends on a time difference between stationary
locations, these days cannot be shown as in-port. This may be leading to under-reported
port days.
\ No newline at end of file