Saturday Scrape — Surf Reports

It’s Saturday. The surf is bad today, which is why I decided to write a scraper. This is a freestyle post written while I code, so don’t sweat the small stuff. It’s entirely a quick-and-dirty way to get some data to play with.

Tools used during this session:

  1. Ruby 1.9.2
  2. Nokogiri (HTML Parsing)
  3. Typhoeus (HTTP Library for Fetching HTML)

First… why write a scraper? The site has no API and I want the data. It offers two ways of accessing it:

  1. Consume the web page HTML
  2. Consume the widget HTML

Each presents its own problems. As I go through this short tutorial I’ll show you when things change and why knowing how to pivot is important.

The data that I want (right now) is the height of the waves for a surfing spot I frequent. Loading up the spot’s report page, hey, cool: turns out the height is wrapped in a nicely identifiable DOM element:

<p id="text-surfheight">1-2 ft</p>

So a quick XPath selection on the Nokogiri::HTML document gets what we want…

elem = Nokogiri::HTML(page).search("//p[@id = 'text-surfheight']")

elem now holds a NodeSet of the elements our search found. Let’s pull the first one out and grab the inner_text.


We’re done right? Unfortunately, surf reports are user reported and not always in the format we’d expect. I quickly discovered some pages don’t contain a text-surfheight id, but instead a short sentence describing the height:

<p class="text-data bottom-space-10">Inconsistent occ. 2 ft. </p>

That’s frustrating, since now our code can’t simply look for the same element every time. So we improvise. Instead of spending time figuring out how to triangulate what I want out of this big page, I look to see if there is a widget or API that could give me the surf report; there is a widget service. It makes a JavaScript call to load up an HTML iFrame. Great.

So I jump right in and check out the new HTML page I’m looking at. The good thing about this widget is that it’s only the surf report, not a bunch of web site features, videos, and links I don’t need to look at. And, most importantly, the widget _always_ displays something about the wave height.

Unfortunately, the widget HTML is disgustingly ugly and has no apparent patterns. Sometimes the surf report height is contained in a span element and other times it’s thrown into a div; neither has an id. Iterating over a few different surf report pages, I find that the widget does have one pattern: CSS styling. (I know. Yucky.)

But, the nice thing about extensive hardcoded styling in HTML is that it can actually serve as a uniquely identifiable key when looking at a small amount of HTML (like a widget!). So we can write an XPath search:

# Helper method to take a Nokogiri search and return nil
# or the value of a non-empty element
def inner_text nokogiri_search
  nokogiri_search.first.inner_text rescue nil # element missing? return nil
end

# spot_id comes out of a hash. Check out the full code linked @ the
# bottom of this page to see more
n = Nokogiri::HTML(page) # page is the widget HTML fetched for spot_id
height = inner_text(n.xpath("//span[@style='font-size:21px;font-weight:bold']")) ||
  inner_text(n.xpath("//div[@style='font-size:12px;padding-left:10px;margin-bottom:7px;']")) ||
  "Report Not Available"

If the first clause passes then we have a wave height. If the second conditional passes we have a short sentence describing the surf conditions. If we can’t find anything we just default to “Report Not Available”

Okay, so it is not pretty, but we’ve now got a decent way to identify wave heights from surf reports in the site’s widgets. I’ve tested it across 10 surf spots and it seems to work OK for this initial prototype.

What’s next? Adding in the tides table on the widget. It’s also a fun trickster, since you have to search for the text “TIDES,” take the first search result, and grab its parent element:

tides = n.xpath("//div//small[contains(text(),'TIDES')]").first.parent

Which gives us:

"\nTIDES:\n\n \n \n \n \n 02/24\u00A0\u00A0\u00A005:48AM\u00A0\u00A0\u00A01.23ft.\u00A0\u00A0\u00A0LOW\n \n \n \n \n 02/24\u00A0\u00A0\u00A011:46AM\u00A0\u00A0\u00A04.45ft.\u00A0\u00A0\u00A0HIGH\n \n \n \n \n 02/24\u00A0\u00A0\u00A005:49PM\u00A0\u00A0\u00A01.07ft.\u00A0\u00A0\u00A0LOW\n \n \n \n \n 02/25\u00A0\u00A0\u00A012:07AM\u00A0\u00A0\u00A04.97ft.\u00A0\u00A0\u00A0HIGH\n"

That looks ugly. Why is there Unicode in there? (Those \u00A0s are non-breaking spaces.) Let’s pull out just what we want…

prettier_tides = tides.text.gsub("\u00A0\u00A0\u00A0"," ").scan(/(\d.*?)\n/).flatten
# => ["02/24 05:48AM 1.23ft. LOW", "02/24 11:46AM 4.45ft. HIGH", "02/24 05:49PM 1.07ft. LOW", "02/25 12:07AM 4.97ft. HIGH"]
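
From here it’s easy to break each line into fields. A quick sketch (the hash keys are my own naming, not anything the widget provides):

```ruby
# Sample lines taken from the scan output above
prettier_tides = ["02/24 05:48AM 1.23ft. LOW", "02/24 11:46AM 4.45ft. HIGH"]

tide_rows = prettier_tides.map do |line|
  date, time, height, phase = line.split(" ")
  # to_f stops at the first non-numeric character, so "1.23ft." => 1.23
  { date: date, time: time, height_ft: height.to_f, phase: phase }
end
# => [{ date: "02/24", time: "05:48AM", height_ft: 1.23, phase: "LOW" }, ...]
```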

What you do with this data is now up to you. I store it in a SQLite database and run the script every hour or so to get updates from 8 am to 2 pm PST.

Check out the ShouldISurf GitHub repo; for the scraping code you should look at lib/grab_reports.rb

As of 4/12/2012 this code has been running daily for almost 2 months, serving up surf tides on the site. Let me go knock on wood. Okay, back. The code base is small and effective. I’m glad now that I didn’t invest any time in making a more robust solution!

HelloBirthday Grows Up, Goes Private

It must have been two years ago that I missed my friend’s birthday and she was super upset with me. Naturally, I devoted 48 hours to building an app that would never let me forget a birthday. (I think this may have pissed her off even more…)

Fast forward… I’ve decided to change HelloBirthday: as of January 25th, 2012, new users will only see a forecast of birthdays, and automatic wishing will not occur. This only affects new users; current users, you’re OK.

HelloBirthday still has the capability to automate wishing, and I’m letting friends, family, and people who know me continue to use it. If you want to use HelloBirthday, please add it and then e-mail me (