Saturday Scrape — Surf Reports

It’s Saturday. The surf is bad today, which is why I decided to write a surfline.com scraper. This is a freestyle post written while I code, so don’t sweat the small stuff. It’s entirely a quick, dirty solution for getting some data to play with.

Tools used during this session:

  1. Ruby 1.9.2
  2. Nokogiri (HTML Parsing)
  3. Typhoeus (HTTP Library for Fetching HTML)

First… Why write a scraper? No API exists for SurfLine.com and I want data.

SurfLine.com offers two ways of accessing data:

  1. Consume the web page HTML
  2. Consume the widget HTML

Each presents its own problems. As I go through this short tutorial I’ll show you when things change and why knowing how to pivot is important.

The data that I want (right now) is the height of the waves for a surfing spot I frequent. The URL is http://www.surfline.com/surf-report/38th-ave-central-california_4191/. Hey, cool, turns out the height is wrapped in a nicely identifiable DOM element:

<p id="text-surfheight">1-2 ft</p>

So a quick XPath selection on the Nokogiri::HTML document gets what we want…

elem = Nokogiri::HTML(page).search("//p[@id = 'text-surfheight']")

elem now contains a Nokogiri node set of the elements matched by our search. Let’s pull the first one out and grab the inner_text:

elem.first.inner_text

We’re done right? Unfortunately, surf reports are user reported and not always in the format we’d expect. I quickly discovered some pages don’t contain a text-surfheight id, but instead a short sentence describing the height:

<p class="text-data bottom-space-10">Inconsistent occ. 2 ft. </p>

That’s frustrating, since now our code can’t simply look for the same element every time. So we improvise. Instead of spending time figuring out how to triangulate what I want out of this big page, I start to look and see if there is a widget or API that could give me the surf report; there is a widget service. It makes a JavaScript call that loads up an HTML iframe. Great. So I jump right in and check out the new HTML page I’m looking at.

The good thing about this widget is that it’s only the surf report, not a bunch of web site features, videos, and links that I don’t need to look at. And, most importantly, the widget _always_ displays something about the wave height. Unfortunately, the widget HTML is disgustingly ugly and has no apparent patterns. Sometimes the surf report height is contained in a span element and other times it’s thrown into a div; neither has an id. Iterating over a few different surf report pages, I find that the widget does have one pattern: CSS styling. (I know. Yucky.)

But the nice thing about extensive hardcoded styling in HTML is that it can actually serve as a set of uniquely identifiable keys when you’re looking at a small amount of HTML (like a widget!). So we can write an XPath search:

# Helper method to take a Nokogiri search and return nil
# or the value of a non-empty element
def inner_text(nokogiri_search)
  nokogiri_search.first.inner_text rescue nil # nil if nothing matched
end
 
# spot_id comes out of a hash. Check out the full code linked @ the
# bottom of this page to see more
n = Nokogiri::HTML(
  grab_page(
    "http://www.surfline.com/widgets2/widget_camera_mods.cfm?id=#{spot_id}&mdl=0111&ftr=&units=e&lan=en"
  )
)
 
height = inner_text(n.xpath("//span[@style='font-size:21px;font-weight:bold']")) ||
  inner_text(n.xpath("//div[@style='font-size:12px;padding-left:10px;margin-bottom:7px;']")) ||
  "Report Not Available"

If the first clause passes, then we have a wave height. If the second clause passes, we have a short sentence describing the surf conditions. If we can’t find anything, we just default to “Report Not Available”.
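The fallback chain is easy to see in isolation. Here’s a toy version with plain arrays standing in for Nokogiri node sets (the data and helper name here are mine, just for illustration):

```ruby
# Toy version of the fallback chain: a helper that returns nil for an
# empty result set, so || walks down the selectors until one matches.
def first_text(results)
  results.first && results.first.strip
end

span_hits = []                            # span selector found nothing
div_hits  = ["Inconsistent occ. 2 ft. "]  # div selector matched

height = first_text(span_hits) || first_text(div_hits) || "Report Not Available"
# height => "Inconsistent occ. 2 ft."
```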

Okay, so it is not pretty, but we’ve now got a decent way to identify wave heights from surf reports in Surfline.com’s widgets. I’ve tested it across 10 surf spots and it seems to work OK for this initial prototype.

What’s next? Adding in the tides table from the widget. It’s also a fun trickster, since you have to search for the text “TIDES,” take the first search result, and grab the parent element:

tides = n.xpath("//div//small[contains(text(),'TIDES')]").first.parent

Which gives us:

"\nTIDES:\n\n \n \n \n \n 02/24\u00A0\u00A0\u00A005:48AM\u00A0\u00A0\u00A01.23ft.\u00A0\u00A0\u00A0LOW\n \n \n \n \n 02/24\u00A0\u00A0\u00A011:46AM\u00A0\u00A0\u00A04.45ft.\u00A0\u00A0\u00A0HIGH\n \n \n \n \n 02/24\u00A0\u00A0\u00A005:49PM\u00A0\u00A0\u00A01.07ft.\u00A0\u00A0\u00A0LOW\n \n \n \n \n 02/25\u00A0\u00A0\u00A012:07AM\u00A0\u00A0\u00A04.97ft.\u00A0\u00A0\u00A0HIGH\n"

That looks ugly. Why is there Unicode in there? (Those \u00A0 characters are non-breaking spaces.) Let’s pull out just what we want…

prettier_tides = tides.text.gsub("\u00A0\u00A0\u00A0"," ").scan(/\d{2}\/\d{2}.*/)
# => ["02/24 05:48AM 1.23ft. LOW", "02/24 11:46AM 4.45ft. HIGH", "02/24 05:49PM 1.07ft. LOW", "02/25 12:07AM 4.97ft. HIGH"]
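From here it’s a one-liner to split each tide line into structured fields. The Tide struct below is my own invention, not from the original code:

```ruby
# Hypothetical struct for one tide entry.
Tide = Struct.new(:date, :time, :height_ft, :phase)

lines = ["02/24 05:48AM 1.23ft. LOW", "02/24 11:46AM 4.45ft. HIGH"]
tides = lines.map do |line|
  date, time, height, phase = line.split
  Tide.new(date, time, height.to_f, phase) # "1.23ft.".to_f => 1.23
end
# tides.first.height_ft => 1.23, tides.last.phase => "HIGH"
```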

What you do with this data is now up to you. I store it in a SQLite database and run the script every hour or so to get updates from 8 AM to 2 PM PST.

The full project lives in the ShouldISurf GitHub repo; for the scraping code, look at lib/grab_reports.rb.


As of 4/12/2012 this code has been running daily for almost two months, serving up surf tides on shouldisurf.com. Let me go knock on wood. Okay, back. The code base is small and effective. I’m glad I didn’t invest any time in making a more robust solution!

Ruby OpenURI open() returns StringIO & Tempfile

Ahh, the little things in life. I was hacking out some code the other day and I was doing something like…

require 'open-uri'
require 'fastercsv'

report_data = open(report_url)
data_set = FasterCSV.read(report_data.path)
data_set.each { |row| coolness(row) } # coolness is my own processing method

And I ran into an error coming out of FasterCSV:

TypeError: can't convert nil into String

After a quick headache or two I realized that calling .path on an unknown class might be a problem. While my test code and production code always saw a Tempfile returned from the open() method, the particular use case I was now going through was returning a StringIO from open(). StringIO, of course, does not have a .path method. The realization of why this was happening came from digging into the implementation of open-uri.rb in Ruby 1.8.7:
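The gotcha is easy to reproduce without any network access, since both classes are in the standard library:

```ruby
require 'stringio'
require 'tempfile'

# open-uri hands back a StringIO for small responses and a Tempfile
# for large ones; only the Tempfile knows its #path.
small = StringIO.new("a,b,c\n")
large = Tempfile.new("open-uri-demo")

small.respond_to?(:path) # => false
large.respond_to?(:path) # => true
```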

  class Buffer # :nodoc:
    def initialize
      @io = StringIO.new
      @size = 0
    end
    attr_reader :size
 
    StringMax = 10240
    def <<(str)
      @io << str
      @size += str.length
      if StringIO === @io && StringMax < @size
        require 'tempfile'
        io = Tempfile.new('open-uri')
        io.binmode
        Meta.init io, @io if @io.respond_to? :meta
        io << @io.string
        @io = io
      end
    end
    # ... (rest of Buffer omitted)
  end

The Buffer implementation in open-uri checks the accumulated size before switching to a Tempfile. Anything under 10 KB and you’re looking at a StringIO object.
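Here’s a toy replica of that switch, my own simplification of the Buffer logic above rather than the real open-uri code:

```ruby
require 'stringio'
require 'tempfile'

# Simplified stand-in for open-uri's Buffer: stay in memory at or
# below the 10 KB threshold, spill to a Tempfile above it.
STRING_MAX = 10_240

def buffer_for(bytes)
  if bytes.bytesize <= STRING_MAX
    StringIO.new(bytes)
  else
    io = Tempfile.new("buffer-demo")
    io.binmode
    io << bytes
    io
  end
end

buffer_for("x" * 100).class    # => StringIO
buffer_for("x" * 20_000).class # => Tempfile
```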

Fortunately, FasterCSV will operate on an IO object…

# from
FasterCSV.read(report_data.path)
# to
FasterCSV.read(report_data)
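FasterCSV went on to become Ruby 1.9’s standard CSV library, and the same fix applies there too. With a StringIO standing in for a small open-uri download:

```ruby
require 'csv'
require 'stringio'

# A StringIO standing in for what open-uri returns on a small fetch.
io   = StringIO.new("date,height\n02/24,1-2 ft\n")
rows = CSV.new(io, headers: true).read # parse the IO directly, no #path needed

rows.first["height"] # => "1-2 ft"
```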

It was still a little startling to see such behavior (an optimization?) going on in open-uri.rb. Pretty cool, but it reminded me that I need a few more test cases to uncover behaviors on different data-set sizes.