Building Learnivore.com: a Ruby screencasts aggregator

December 15, 2009


Learnivore.com was a side project I built over a few months to aggregate Ruby and Rails screencasts from publishers such as PragProgs, RailsCasts, PeepCode, BDDCasts and ThinkCode.tv. It offered full-text search, plus filtering by tag, publisher, pricing (free or paid) and language.

I presented the technical architecture at the 15th Paris.rb meetup in November 2009. Here’s a write-up of how it worked.

Technical stack

The app was built with Ramaze (after an initial attempt with Merb) and ActiveRecord, and was served by Thin. Full-text and faceted search were powered by Sphinx, accessed through ThinkingSphinx. Other notable libraries included Hpricot for HTML/XML parsing, ActsAsTaggableOnSteroids for tagging, WillPaginate for pagination, and Craken for cron scheduling.

The data model

The core model was an Item (a screencast), with ThinkingSphinx indexing for faceted search:

class Item < ActiveRecord::Base
  acts_as_taggable

  validates_presence_of :title, :url, :source,
    :summary, :thumbnail_img_tag, :pricing
  validates_uniqueness_of :url # our key

  define_index do
    indexes title
    indexes summary
    indexes source, :facet => true
    indexes tags.name, :as => :tag, :facet => true
    indexes pricing, :facet => true
    indexes language, :facet => true
    has update_date
    where "update_date <= curdate()"

    set_property :delta => true
  end
end
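
To make "faceted" concrete: for a given result set, a facet is a count of results per distinct value of an attribute (source, tag, pricing, language). The snippet below is a plain-Ruby sketch of that idea over an in-memory array. It is not how Sphinx computes facets, and `facet_counts` and the sample rows are made up for illustration.

```ruby
# Plain-Ruby illustration of facet counts; Sphinx does this server-side.
# `facet_counts` and the sample data below are hypothetical.
def facet_counts(items, *facets)
  facets.each_with_object({}) do |facet, result|
    counts = Hash.new(0)
    items.each do |item|
      # A facet can be multi-valued (e.g. tags), so normalise to an array.
      Array(item[facet]).each { |value| counts[value] += 1 }
    end
    result[facet] = counts
  end
end

ITEMS = [
  { :title => 'Sample A', :source => 'bddcasts', :pricing => 'paid', :tag => %w(rspec testing) },
  { :title => 'Sample B', :source => 'bddcasts', :pricing => 'free', :tag => %w(cucumber testing) },
  { :title => 'Sample C', :source => 'peepcode', :pricing => 'paid', :tag => %w(git) }
]

FACETS = facet_counts(ITEMS, :source, :pricing, :tag)
```

Each facet declared with `:facet => true` in the index corresponds to one key in such a result, which is what drives the filter links next to the search results.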

Import pipeline

The data aggregation was built on TinyTL, a lightweight ETL DSL I wrote for this project (a precursor to what would later become Kiba ETL). The pipeline ran hourly and for each source would:

fetch an RSS feed or scrape an HTML page
normalise and conform each item
clean and conform tags
upsert (update or insert) using the target URL as key
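
The upsert step is what makes hourly re-runs safe: importing the same screencast twice updates the existing record instead of duplicating it. Here is a minimal in-memory sketch of that idea. `InMemoryStore` and its methods are hypothetical stand-ins; the real pipeline wrote to the ActiveRecord `Item` model.

```ruby
# In-memory sketch of "upsert by URL". InMemoryStore is hypothetical;
# the real target was the ActiveRecord Item model.
class InMemoryStore
  attr_reader :rows

  def initialize
    @rows = {} # url => attributes
  end

  # Update the row with this URL if it exists, insert it otherwise.
  # Returns :inserted or :updated so a run can report what changed.
  def upsert(attributes)
    url = attributes.fetch(:url)
    if @rows.key?(url)
      @rows[url] = @rows[url].merge(attributes)
      :updated
    else
      @rows[url] = attributes.dup
      :inserted
    end
  end
end

STORE = InMemoryStore.new
FIRST_RUN = STORE.upsert({ :url => 'http://example.com/episodes/1',
                           :title => 'Episode 1', :pricing => 'paid' })
SECOND_RUN = STORE.upsert({ :url => 'http://example.com/episodes/1',
                            :pricing => 'free' })
```

After both calls the store still holds a single row for that URL, with the pricing changed and the title preserved.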

Source declarations

Each source was declared with a concise DSL. Here’s how PeepCode was scraped:

source(:peepcode) {
  fresh_get(PEEPCODE_HOST + '/products').
    at('ul.products').search('li a').
    map { |e| PEEPCODE_HOST + e['href'] }
}
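
For readers curious how such declarations can work: a `source` call only needs to store its block under a name, so the fetch runs later, when the pipeline asks for it. The sketch below is a guess at the general shape, not TinyTL's actual implementation. `MiniEtl`, `run`, the `:demo` source and the `row[:content]` convention are all hypothetical, and the stub returns fixed paths rather than hitting the network.

```ruby
# A guess at the general shape of such a DSL, not TinyTL's real code.
class MiniEtl
  def initialize(&definition)
    @sources = {}    # name => block returning an array of raw items
    @processors = {} # name => block that mutates one row
    instance_eval(&definition)
  end

  # Declare where rows come from; the block runs lazily, at run time.
  def source(name, &block)
    @sources[name] = block
  end

  # Declare per-source processing (optional).
  def each_row(name, &block)
    @processors[name] = block
  end

  # Fetch every source, apply its processor, and return all rows.
  def run
    @sources.flat_map do |name, fetch|
      fetch.call.map do |content|
        row = { :source => name, :content => content }
        @processors[name].call(row) if @processors[name]
        row
      end
    end
  end
end

etl = MiniEtl.new do
  source(:demo) { ['/products/1', '/products/2'] } # stub, no HTTP
  each_row(:demo) { |row| row[:url] = 'http://example.com' + row[:content] }
end

ROWS = etl.run
```

Keeping the declaration separate from the execution is what lets each publisher live in its own small block while the scheduling, upsert and validation logic stay shared.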

Source-specific processing

Each source had its own processing block to handle the differences in data format. For instance, BDDCasts combined RSS metadata with HTML scraping:

each_row(:bddcasts) { |row|
  # resource from rss
  row[:title] = grab(row[:content], 'title')
  row[:update_date] =
    Date.parse(grab(row[:content], 'published'))
  row[:url] =
    grab(row[:content], 'feedburner:origlink')
  row[:summary] = SimpleSanitizer.sanitize(
    CGI.unescapeHTML(
      grab(row[:content], 'content')))

  # resource from page
  page = get(row[:url])
  img_src = "http://bddcasts.com" +
    page.at("div.episode div.image img").
      attributes['src']
  tags = grab_all(page,
    "div.episode div.details p.tags a", ', ')
  episode_css_classes =
    page.at("div.episode").attributes['class']
  pricing = case episode_css_classes
    when /orange/ then 'paid'
    when /green/ then 'free'
    else raise "Bddcasts pricing guess failed: " \
      "episode css class: '#{episode_css_classes}'"
  end

  row[:thumbnail_img_tag] =
    "<img src='#{img_src}' height='90' " \
    "style='border: 1px solid #aaa'/>"
  row[:pricing] = pricing
  row[:tag_list] = tags
}

This pattern of mixing RSS feeds with HTML scraping to fill in missing data was common across sources. Each publisher had a different format, and the ETL had to adapt to each one.

Tag cleaning

Tags came from the publishers themselves and from Delicious, so they had to be conformed and deduplicated. I used a spreadsheet-driven approach: export all tags to a CSV, decide on mappings (e.g. activerecord -> active-record, admin -> sysadmin), and import those rules back into the pipeline.
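
Applying such rules can be as simple as a lookup table. The following is a minimal sketch, assuming a two-column CSV with `from` and `to` headers; that layout, and the `load_tag_rules` and `clean_tags` helpers, are illustrative rather than the original code.

```ruby
require 'csv'

# Mapping rules as they might come back from the spreadsheet.
# The two-column layout (from,to) is an assumption.
RULES_TEXT = <<~RULES
  from,to
  activerecord,active-record
  admin,sysadmin
RULES

# Build a lookup hash: raw tag => conformed tag.
def load_tag_rules(csv_text)
  CSV.parse(csv_text, :headers => true).each_with_object({}) do |line, rules|
    rules[line['from'].strip.downcase] = line['to'].strip.downcase
  end
end

# Normalise case and whitespace, apply the mapping, drop blanks and duplicates.
def clean_tags(raw_tags, rules)
  raw_tags.map { |tag| tag.strip.downcase }
          .reject(&:empty?)
          .map { |tag| rules.fetch(tag, tag) }
          .uniq
end

TAG_RULES = load_tag_rules(RULES_TEXT)
CLEANED = clean_tags(['ActiveRecord', ' admin', 'active-record', 'rspec', ''], TAG_RULES)
```

Tags without a rule pass through unchanged, so the spreadsheet only has to list the exceptions.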

Data validation (“Screens”)

To catch regressions when sources changed format, I wrote validation assertions that ran against the imported data:

assert_include %w(free paid), row[:pricing]
assert_match(/\Ahttps?:\/\//, row[:url], :url)
assert_not_nil row[:update_date]
assert row[:update_date].is_a?(Date)
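
One way to run checks like these is to collect every failure for a row instead of stopping at the first one, so a single import run reports everything that broke. The `Screen` class and `screen_row` helper below are hypothetical and only mirror the assertion names used above; they are not TinyTL's real API.

```ruby
require 'date'

# Hypothetical Screen class: runs checks on a row and collects failures
# instead of raising on the first one.
class Screen
  attr_reader :failures

  def initialize
    @failures = []
  end

  def assert(condition, label = 'assertion')
    @failures << "#{label} failed" unless condition
  end

  def assert_include(allowed, value, label = 'value')
    assert(allowed.include?(value), "#{label} #{value.inspect} in #{allowed.inspect}")
  end

  def assert_match(pattern, value, label = 'value')
    assert(pattern.match?(value.to_s), "#{label} #{value.inspect} matching #{pattern.inspect}")
  end

  def assert_not_nil(value, label = 'value')
    assert(!value.nil?, "#{label} not nil")
  end
end

# Apply the checks to one row and return the list of failure messages.
def screen_row(row)
  screen = Screen.new
  screen.assert_include %w(free paid), row[:pricing], :pricing
  screen.assert_match(/\Ahttps?:\/\//, row[:url], :url)
  screen.assert_not_nil row[:update_date], :update_date
  screen.assert row[:update_date].is_a?(Date), 'update_date is a Date'
  screen.failures
end

GOOD = screen_row({ :pricing => 'free', :url => 'http://example.com/1',
                    :update_date => Date.new(2009, 11, 1) })
BAD = screen_row({ :pricing => 'cheap', :url => 'ftp://example.com/1',
                   :update_date => nil })
```

A row with no failures goes on to the upsert; a row with failures can be logged and skipped, which is how a silent format change at a publisher becomes visible.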

Content distribution

The aggregated data was distributed through three channels:

Web with faceted search and full-text (ThinkingSphinx)
RSS sorted by freshness (which required grabbing update dates from each screencast)
Twitter semi-automatically via bit.ly

Looking back

Working on this side project was a very rewarding learning experience. It also earned me money indirectly, by giving me ideas and reusable code that I could apply to customer projects.

The key takeaways from this project:

Partner with your data sources — contacting publishers directly helped get better data
ThinkingSphinx is powerful — faceted search with delta indexing worked great
Code is a small portion of the work — most of the effort went into understanding, cleaning and conforming the data
All data needs cleaning before consumption — a lesson I’ve applied to every ETL project since