A Long-Running Flight Scraper on AWS
A year ago I set up a long-running data scraper to fetch flight info from FlightRadar24.com. I created a free Amazon VM that ran the scraper every six minutes for about eight months last year. It pulled about 200MB of raw data a day and collected a total of 56GB over the eight months. Now that I've retired the scraper, I thought I'd comment on how it worked.
In Search of Data
As I've written before, I started getting interested in geospatial tracks last year when I read about how different websites let you see where planes and ships are located in real time. Thinking that it'd be fun to get into geospatial data analysis, I started looking around for open datasets I could use for my own experiments. I quickly realized that there aren't that many datasets available, and that data owners like the FAA only provide their feeds to companies that have a legitimate business need for the data (and are willing to pay a subscription fee). Luckily, I stumbled upon a question posted on Stack Overflow where someone else wanted to know where they could get data. Buried in the replies was a comment from someone who noted that FlightRadar24.com aggregates crowd-sourced airline data and that you could get a current listing of plane locations in JSON format just by querying a URL. I tried it out and was surprised to find that a single wget operation returned a JSON file with the locations and stats of thousands of airplanes. The simplicity of it all was the kind of thing scrapers dream about.
Cleaning the Data
I set up a simple cron job on my desktop at home to retrieve the JSON data every six minutes, and let it run from 7am-11pm over a long weekend. The fields weren't labeled in the data, so I had to do a lot of comparisons to figure out what everything meant. Fortunately, FR24's website has a nice graphical mode that lets you click on a plane and see its instantaneous stats. I grabbed some JSON data, picked out a specific flight to work with, and then compared the scraped data to the labeled fields the website GUI was reporting. It was a little tricky since the stats on the website change continuously as the plane moves, but it was enough to identify each field in the JSON data array.
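A schedule like this could be expressed as a crontab entry along the following lines. This is a sketch under assumptions, not the original entry: the script name fetch_fr24.sh and its location are hypothetical placeholders for whatever performs the wget, and the hour range 7-22 is one reading of "7am-11pm" (the last run of the day lands at 10:54pm).

```shell
#!/bin/bash
# Sketch of a crontab entry: every 6 minutes, during hours 7 through 22.
# fetch_fr24.sh and its path are hypothetical placeholders, not from the original setup.
CRON_LINE='*/6 7-22 * * * $HOME/bin/fetch_fr24.sh'

# Append the entry to the current user's crontab, keeping any existing entries.
# (Defined here but not called; run install_cron_line by hand to apply it.)
install_cron_line() {
    { crontab -l 2>/dev/null; echo "$CRON_LINE"; } | crontab -
}
```

The five leading fields are minute, hour, day of month, month, and day of week, so `*/6` in the minute field fires at :00, :06, and so on through :54.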
The next challenge was converting the instantaneous data into actual tracks (i.e., instead of sorting by time, I wanted to sort by plane ID). It was a perfect use case for Go: I wrote something that read in a day's worth of data, parsed the JSON data using an existing library, did the regrouping, and then dumped the output into a new format that would be easier for me to use (e.g., one plane per line, with track points listed in a WKT linestring). The conversion reduces the 200MB of a day's data down to about 68MB of (ASCII) track data. These track files are convenient for me because I can use command line tools (grep, awk, or Python) to filter, group, and plot the things I want without having to do much thinking.
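The regrouping idea itself is small enough to sketch with sort and awk. This is only an illustration of the technique, not the Go converter described above: it assumes the snapshots have already been flattened to whitespace-separated lines of "epoch_seconds plane_id lon lat", which is a made-up layout rather than FR24's actual field order, and points_to_tracks is a hypothetical name.

```shell
#!/bin/bash
# Sketch: regroup time-ordered position samples into one WKT linestring per plane.
# Assumed (hypothetical) input on stdin: "epoch_seconds plane_id lon lat" per line.
points_to_tracks() {
    # Sort by plane id, then numerically by time, so each plane's
    # points end up contiguous and in chronological order.
    sort -k2,2 -k1,1n |
    awk '
        # Plane id changed: emit the finished track and start a new one.
        # Appending "" forces a string comparison of the ids.
        NR > 1 && ($2 "") != (id "") {
            print id "\tLINESTRING (" pts ")"
            pts = ""
        }
        # Append this sample as "lon lat" to the current track.
        { id = $2; pts = pts (pts == "" ? "" : ", ") $3 " " $4 }
        # Emit the last track, if there was any input at all.
        END { if (NR > 0) print id "\tLINESTRING (" pts ")" }
    '
}
```

Each output line is a plane ID, a tab, and that plane's track, so the result can be filtered with grep or awk. A plane that appears in only one snapshot produces a one-point linestring, which strict WKT parsers may reject.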
Running in Amazon for Free
Amazon sounded like the right place to run the scraper for longer runs. Since I hadn't signed up for AWS before, I found I qualified for their free usage tier, which lets you run a micro instance continuously for a year for free. The main limitation of the micro instance for me was storage: it only had about 5GB, so I had to be careful about compressing the data and remembering to retrieve it off Amazon every few weeks. The latter wound up becoming a problem in November. I got tied up with Thanksgiving week and forgot to move data off the instance. The instance ran out of space and dropped data for a few weeks (during the busiest and most interesting time of year for US air travel, unfortunately). On the bright side, I was able to get the previous data out of the system and restart it all without much trouble.
Below is the script I used in the instance to go fetch data. After I noticed the initial grabber was slipping by a few seconds every run, I added something to correct the sleep interval by the fetch delay.
#!/bin/bash
MIN_SECONDS_DELAY=360
URL="http://www.flightradar24.com/zones/full_all.json"

while true; do
    oldsec=$(date +%s)
    # One date call feeds both the directory and the file name,
    # so the two can't disagree if a run straddles midnight
    mytime=$(date +%F/%F-%H%M)
    mydir=${mytime%%/*}
    mkdir -p "$mydir"

    # Only compress the snapshot if the download succeeded
    if wget -q -O "$mytime.json" "$URL"; then
        bzip2 -9 "$mytime.json"
    else
        rm -f "$mytime.json"
    fi

    # Find how long this took to grab, then subtract from sleep interval
    newsec=$(date +%s)
    gap=$((newsec - oldsec))
    left=$((MIN_SECONDS_DELAY - gap))
    if [ "$left" -gt 0 ]; then
        sleep "$left"
    fi
done
Migrating Data Off Amazon
The next thing I wrote was a script to repack the data on the instance. It looked at a single day and converted the data from a series of bzip'd files to a bzip'd tar of uncompressed files. Repacking gave better compression for the download and was more convenient for later use. I ran this process by hand every few weeks. Below is the script I wrote. It was a big help using date's built-in day math. I wish I'd realized earlier that you can use it to walk through date ranges in bash.
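#!/bin/bash
CURRENT=$(date +%Y-%m-%d --date="1 week ago")
END=$(date +%Y-%m-%d)

# Ask the user for the beginning date
read -r -p "Start Date [$CURRENT] "
if [ "$REPLY" != "" ]; then
    CURRENT=$REPLY
fi

mkdir -p out

# Loop over all days since then, stopping before today.
# ISO dates compare correctly as strings, so a start date
# after today ends the loop instead of running forever.
while [[ "$CURRENT" < "$END" ]]; do
    echo "$CURRENT"
    if [[ ! -e "out/$CURRENT.tar" ]]; then
        echo "packaging"
        tar -cvf "out/$CURRENT.tar" "$CURRENT"
    else
        echo "skipping"
    fi
    # Stop if date can't parse the value (e.g. a mistyped start date)
    CURRENT=$(date +%Y-%m-%d -d "$CURRENT +1 day") || break
done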
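The date-walking idiom in the script above can be pulled out into a small reusable function. This is a minimal sketch, assuming GNU date (the -d option behaves differently on BSD and macOS); date_range is a hypothetical helper name, not something from the original scripts.

```shell
#!/bin/bash
# Sketch: print every day from START up to, but not including, END.
# Both arguments are expected as YYYY-MM-DD. Requires GNU date.
date_range() {
    local current=$1 end=$2
    # ISO dates sort lexically, so a plain string comparison is enough
    while [[ "$current" < "$end" ]]; do
        echo "$current"
        # Let date do the calendar math (month ends, leap years);
        # bail out if it can't parse the value
        current=$(date +%Y-%m-%d -d "$current +1 day") || return 1
    done
}
```

A loop such as `for day in $(date_range 2013-12-30 2014-01-02); do ...; done` then visits December 30, December 31, and January 1, crossing the year boundary without any special handling.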
Finally, I had to write something for my desktop to go out and pull the archives off the instance. Amazon gives you an ssh key file for logging into your instance, so all I had to do was just scp the files I needed.
#!/bin/bash
HOST=ec2-user@my-long-instance-name.compute.amazonaws.com
KEY=../my-aws-key.pem

# Set range to be from a week ago, stopping before today
CURRENT=$(date +%Y-%m-%d --date="1 week ago")
END=$(date +%Y-%m-%d)
read -r -p "Start Date [$CURRENT] "
if [ "$REPLY" != "" ]; then
    CURRENT=$REPLY
fi

mkdir -p downloads bups

# Grab each day
while [[ "$CURRENT" < "$END" ]]; do
    echo "$CURRENT"
    if [[ ! -d "$CURRENT" ]]; then
        echo "Downloading $CURRENT"
        # Skip the unpack/repack steps if the day's archive can't be fetched
        if scp -i "$KEY" "$HOST:data/out/$CURRENT.tar" downloads/; then
            tar xf "downloads/$CURRENT.tar"
            # Decompress the individual snapshots, then compress the day as a whole
            bunzip2 "$CURRENT"/*.bz2
            tar cjf "bups/$CURRENT.tar.bz2" "$CURRENT"
        fi
    fi
    CURRENT=$(date +%Y-%m-%d -d "$CURRENT +1 day") || break
done
While my approach meant that I had to manually check on things periodically, it worked pretty well for what I needed. By compressing the data, I found I only had to check the instance about once a month.
End of the Scraper
I turned the scraper off back in February for two reasons: FR24 started shutting down its API and my free AWS instance was expiring soon. The FR24 shutdown was kind of interesting. On January 20th the interface still worked, but all the flight IDs became "F2424F" and one of the fields said "UPDATE-YOUR-FR24-APP". Around February 3, the API stopped working altogether. Given that FR24 still has a webapp, I'll bet you can still retrieve data from them somehow. However, I'm going to respect their interface and not dig into it.