Building Learnivore.com: a Ruby screencasts aggregator

December 15, 2009

Learnivore.com was a side project I built over a few months to aggregate Ruby and Rails screencasts from publishers such as PragProgs, RailsCasts, PeepCode, BDDCasts, ThinkCode.tv and others. It offered full-text search plus filtering by tag, publisher, pricing (free or paid) and language.

I presented the technical architecture at the 15th Paris.rb meetup in November 2009. Here’s a write-up of how it worked.

Technical stack

The app was built with Ramaze (after an initial attempt with Merb) and ActiveRecord, and was served by Thin. Full-text and faceted search were powered by Sphinx through ThinkingSphinx. Other notable libraries included Hpricot for HTML/XML parsing, ActsAsTaggableOnSteroids for tagging, WillPaginate for pagination, and Craken for cron scheduling.

The data model

The core model was an Item (a screencast), with ThinkingSphinx indexing for faceted search:

class Item < ActiveRecord::Base
  acts_as_taggable

  validates_presence_of :title, :url, :source,
    :summary, :thumbnail_img_tag, :pricing
  validates_uniqueness_of :url # our key

  define_index do
    indexes title
    indexes summary
    indexes source, :facet => true
    indexes tags.name, :as => :tag, :facet => true
    indexes pricing, :facet => true
    indexes language, :facet => true
    has update_date
    where "update_date <= curdate()"

    set_property :delta => true
  end
end
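
To make the idea of a facet concrete, here is a minimal plain-Ruby sketch of what faceted search returns: a count per value for each faceted field, and a way to narrow results by picking values. It does not use Sphinx or the ThinkingSphinx API; the method names and sample rows are made up for illustration, and only the field names come from the index definition above.

```ruby
# Plain-Ruby illustration of facet counts; Sphinx is not involved.
# Facet names mirror faceted fields of Item; the sample rows are made up.
FACETS = %i[source pricing language].freeze

SAMPLE_ITEMS = [
  { title: 'Sample A', source: 'peepcode',   pricing: 'paid', language: 'en' },
  { title: 'Sample B', source: 'railscasts', pricing: 'free', language: 'en' },
  { title: 'Sample C', source: 'bddcasts',   pricing: 'free', language: 'en' }
].freeze

# Returns { facet => { value => count } } for the given rows.
def facet_counts(rows, facets = FACETS)
  facets.to_h do |facet|
    [facet, rows.map { |row| row[facet] }.tally]
  end
end

# Keeps only the rows that match every selected facet value.
def filter_by_facets(rows, selection)
  rows.select do |row|
    selection.all? { |facet, value| row[facet] == value }
  end
end

free_items = filter_by_facets(SAMPLE_ITEMS, pricing: 'free')
puts facet_counts(free_items)[:source].keys.join(', ')
```

In the real application Sphinx computed these counts from its index, which is what made them cheap to display next to every search result.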

Import pipeline

The data aggregation was built on TinyTL, a lightweight ETL DSL I wrote for this project (a precursor to what would later become Kiba ETL). The pipeline ran hourly and for each source would:

  • fetch an RSS feed or scrape an HTML page
  • normalise and conform each item
  • clean and conform tags
  • upsert (update or insert) using the target URL as key
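
The four steps above can be sketched in plain Ruby as follows. This is not TinyTL's code: `ItemStore`, `conform`, `clean_tags` and `import_source` are hypothetical names, and an in-memory hash stands in for the ActiveRecord table so the example runs on its own.

```ruby
# In-memory stand-in for the items table, keyed by URL (hypothetical).
class ItemStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Update the existing row for this URL, or insert a new one.
  def upsert(row)
    key = row.fetch(:url)
    action = @rows.key?(key) ? :updated : :inserted
    @rows[key] = (@rows[key] || {}).merge(row)
    action
  end
end

# Normalise one raw row: trim strings and downcase the pricing value.
def conform(row)
  row.merge(
    title: row[:title].to_s.strip,
    url: row[:url].to_s.strip,
    pricing: row[:pricing].to_s.strip.downcase
  )
end

# Clean tags: trim, downcase, drop blanks and duplicates.
def clean_tags(tags)
  tags.map { |tag| tag.strip.downcase }.reject(&:empty?).uniq
end

# Run the steps for one source. `fetcher` is any callable returning raw rows,
# standing in for the RSS fetch or HTML scrape.
def import_source(fetcher, store)
  fetcher.call.map do |raw|
    row = conform(raw)
    row[:tag_list] = clean_tags(raw.fetch(:tag_list, []))
    store.upsert(row)
  end
end
```

Because the URL is the key, running the same import twice updates rows in place instead of duplicating them, which is what makes an hourly schedule safe.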

Source declarations

Each source was declared with a concise DSL. Here’s how PeepCode was scraped:

source(:peepcode) {
  fresh_get(PEEPCODE_HOST + '/products').
    at('ul.products').search('li a').
    map { |e| PEEPCODE_HOST + e['href'] }
}
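
For readers curious how such declarations can work, here is a minimal sketch of a block-registering DSL. It is an assumption about the general technique, not TinyTL's actual implementation: the `Pipeline` class and its `declare` and `run` methods are hypothetical, and only the `source` and `each_row` names come from the examples in this post.

```ruby
# Hypothetical mini-DSL: registers one fetch block and one optional
# processing block per source name, then runs them on demand.
class Pipeline
  def initialize
    @sources = {}
    @processors = {}
  end

  # Register a block that returns the raw items for a source.
  def source(name, &block)
    @sources[name] = block
  end

  # Register a block that fills in each row of a source.
  def each_row(name, &block)
    @processors[name] = block
  end

  # Evaluate a declaration block with this pipeline as the receiver,
  # so `source` and `each_row` can be called without a prefix.
  def declare(&block)
    instance_eval(&block)
    self
  end

  # Fetch the items, wrap each one in a row hash, then process every row.
  def run(name)
    rows = @sources.fetch(name).call.map { |content| { content: content } }
    processor = @processors[name]
    rows.each { |row| processor.call(row) } if processor
    rows
  end
end

pipeline = Pipeline.new.declare do
  source(:demo) { ['<title>Hello</title>'] }
  each_row(:demo) { |row| row[:title] = row[:content][%r{<title>(.*?)</title>}, 1] }
end
puts pipeline.run(:demo).first[:title]
```

Keeping the fetch block and the per-row block separate means a source can change how it is fetched (feed or scrape) without touching the code that conforms its rows.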

Source-specific processing

Each source had its own processing block to handle the differences in data format. For instance, BDDCasts combined RSS metadata with HTML scraping:

each_row(:bddcasts) { |row|
  # resource from rss
  row[:title] = grab(row[:content], 'title')
  row[:update_date] =
    Date.parse(grab(row[:content], 'published'))
  row[:url] =
    grab(row[:content], 'feedburner:origlink')
  row[:summary] = SimpleSanitizer.sanitize(
    CGI.unescapeHTML(
      grab(row[:content], 'content')))

  # resource from page
  page = get(row[:url])
  img_src = "http://bddcasts.com" +
    page.at("div.episode div.image img")
      .attributes['src']
  tags = grab_all(page,
    "div.episode div.details p.tags a", ', ')
  episode_css_classes =
    page.at("div.episode").attributes['class']
  pricing = case episode_css_classes
    when /orange/ then 'paid'
    when /green/ then 'free'
    else raise "Bddcasts pricing guess failed: " \
      "episode css class: '#{episode_css_classes}'"
  end

  row[:thumbnail_img_tag] =
    "<img src='#{img_src}' height='90' " \
    "style='border: 1px solid #aaa'/>"
  row[:pricing] = pricing
  row[:tag_list] = tags
}

This pattern of mixing RSS feeds with HTML scraping to fill in missing data was common across sources. Each publisher had a different format, and the ETL had to adapt to each one.

Tag cleaning

Tags from different sources (publishers + Delicious) had to be conformed and deduplicated. I used a spreadsheet-driven approach: export all tags to a CSV, decide on mappings (e.g. activerecord -> active-record, admin -> sysadmin), and import those rules back into the pipeline.
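
A minimal sketch of the import side of that loop, assuming the spreadsheet is exported as a two-column CSV with `from` and `to` headers. The column names and the `load_tag_rules` and `conform_tags` helpers are hypothetical; the two rules are the examples given above.

```ruby
require 'csv'

# Mapping rules as they might come back from the spreadsheet.
TAG_RULES_CSV = <<~CSV
  from,to
  activerecord,active-record
  admin,sysadmin
CSV

# Parse "from,to" lines into a lookup hash of normalised tag names.
def load_tag_rules(csv_text)
  CSV.parse(csv_text, headers: true).each_with_object({}) do |line, rules|
    rules[line['from'].strip.downcase] = line['to'].strip.downcase
  end
end

# Normalise each tag, apply the mapping if one exists, then deduplicate.
def conform_tags(tags, rules)
  tags.map { |tag| tag.strip.downcase }
      .map { |tag| rules.fetch(tag, tag) }
      .uniq
end

rules = load_tag_rules(TAG_RULES_CSV)
puts conform_tags(['ActiveRecord', 'active-record', 'admin'], rules).join(', ')
```

Deduplicating after the mapping matters: two different raw tags can conform to the same target, as `ActiveRecord` and `active-record` do here.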

Data validation (“Screens”)

To catch regressions when sources changed format, I wrote validation assertions that ran against the imported data:

assert_include %w(free paid), row[:pricing]
assert_match(/\Ahttps?:\/\//, row[:url], :url)
assert_not_nil row[:update_date]
assert row[:update_date].is_a?(Date)
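
These assertion helpers are not shown in the post, so the following is only a sketch of how such screens could be implemented in plain Ruby. `ScreenError`, the `Screens` module and the optional field argument are assumptions made for illustration; the four checks themselves mirror the ones above.

```ruby
require 'date'

# Raised when a row fails a screen (hypothetical name).
class ScreenError < StandardError; end

module Screens
  module_function

  def assert(condition, message = 'assertion failed')
    raise ScreenError, message unless condition
  end

  def assert_include(allowed, value, field = :value)
    assert allowed.include?(value),
           "#{field}: #{value.inspect} not in #{allowed.inspect}"
  end

  def assert_match(pattern, value, field = :value)
    assert pattern.match?(value.to_s),
           "#{field}: #{value.inspect} does not match #{pattern.inspect}"
  end

  def assert_not_nil(value, field = :value)
    assert !value.nil?, "#{field}: must not be nil"
  end

  # Run every screen against one imported row; raises on the first failure.
  def check(row)
    assert_include %w[free paid], row[:pricing], :pricing
    assert_match(%r{\Ahttps?://}, row[:url], :url)
    assert_not_nil row[:update_date], :update_date
    assert row[:update_date].is_a?(Date), 'update_date: must be a Date'
    true
  end
end

row = { pricing: 'free', url: 'http://example.com/ep1', update_date: Date.new(2009, 11, 1) }
puts Screens.check(row)
```

Failing loudly with the field name in the message makes it quick to spot which publisher changed its markup when an hourly import breaks.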

Content distribution

The aggregated data was distributed through three channels:

  • Web, with full-text and faceted search (ThinkingSphinx)
  • RSS, sorted by freshness (which required grabbing the update date of each screencast)
  • Twitter, posted semi-automatically with bit.ly short links

Looking back

Working on this side project was a very rewarding learning experience. It also paid off indirectly, by giving me ideas and reusable code for my customers.

The key takeaways from this project:

  • Partner with your data sources — contacting publishers directly helped get better data
  • ThinkingSphinx is powerful — faceted search with delta indexing worked great
  • Code is a small portion of the work — most of the effort went into understanding, cleaning and conforming the data
  • All data needs cleaning before consumption — a lesson I’ve applied to every ETL project since