
Learnivore.com was a side-project I built over a few months to aggregate Ruby and Rails screencasts from publishers like PragProgs, RailsCasts, PeepCode, BDDCasts, ThinkCode.tv and more. It offered full-text search, filtering by tag, publisher, pricing (free/paid) and language.
I presented the technical architecture at the 15th Paris.rb meetup in November 2009. Here’s a write-up of how it worked.
Technical stack
The app was built with Ramaze (after an initial attempt with Merb), ActiveRecord, and Thin for serving. Full-text search was powered by ThinkingSphinx + Sphinx with faceted search. Other notable libraries included Hpricot for HTML/XML parsing, ActsAsTaggableOnSteroids for tagging, WillPaginate, and Craken for cron scheduling.
The data model
The core model was an Item (a screencast), with ThinkingSphinx indexing for faceted search:
class Item < ActiveRecord::Base
  acts_as_taggable
  validates_presence_of :title, :url, :source,
                        :summary, :thumbnail_img_tag, :pricing
  validates_uniqueness_of :url # our key
  define_index do
    indexes title
    indexes summary
    indexes source, :facet => true
    indexes tags.name, :as => :tag, :facet => true
    indexes pricing, :facet => true
    indexes language, :facet => true
    has update_date
    where "update_date <= curdate()"
    set_property :delta => true
  end
end
Import pipeline
The data aggregation was built on TinyTL, a lightweight ETL DSL I wrote for this project (a precursor to what would later become Kiba ETL). The pipeline ran hourly and for each source would:
- fetch an RSS feed or scrape an HTML page
- normalise and conform each item
- clean and conform tags
- upsert (update or insert) using the target URL as key
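
The four steps above can be sketched in plain Ruby. This is a minimal illustration and not TinyTL itself: `conform`, `run_pipeline`, `TAG_RULES` and the in-memory `store` hash are hypothetical names, and the hash stands in for the ActiveRecord-backed table of items.

```ruby
require 'date'

# Hypothetical tag mapping, standing in for the rules described under "Tag cleaning".
TAG_RULES = { 'activerecord' => 'active-record', 'admin' => 'sysadmin' }.freeze

# Normalise one raw row: trim strings, parse the date, clean and dedupe the tags.
def conform(row)
  {
    url:         row[:url].strip,
    title:       row[:title].strip,
    update_date: Date.parse(row[:update_date].to_s),
    tags:        row[:tags].map { |t| t.downcase.strip }
                           .map { |t| TAG_RULES.fetch(t, t) }
                           .uniq
  }
end

# Upsert each conformed row into the store, using the URL as the key.
# Returns how many rows were inserted and how many were updated.
def run_pipeline(rows, store)
  stats = { inserted: 0, updated: 0 }
  rows.each do |raw|
    row = conform(raw)
    if store.key?(row[:url])
      stats[:updated] += 1
    else
      stats[:inserted] += 1
    end
    store[row[:url]] = row
  end
  stats
end
```

Because the URL is the key, running the pipeline again on the same feed updates existing items instead of duplicating them, which is what makes an hourly schedule safe.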
Source declarations
Each source was declared with a concise DSL. Here’s how PeepCode was scraped:
source(:peepcode) {
  fresh_get(PEEPCODE_HOST + '/products').
    at('ul.products').search('li a').
    map { |e| PEEPCODE_HOST + e['href'] }
}
Source-specific processing
Each source had its own processing block to handle the differences in data format. For instance, BDDCasts combined RSS metadata with HTML scraping:
each_row(:bddcasts) { |row|
  # resource from rss
  row[:title] = grab(row[:content], 'title')
  row[:update_date] =
    Date.parse(grab(row[:content], 'published'))
  row[:url] =
    grab(row[:content], 'feedburner:origlink')
  row[:summary] = SimpleSanitizer.sanitize(
    CGI.unescapeHTML(
      grab(row[:content], 'content')))
  # resource from page
  page = get(row[:url])
  img_src = "http://bddcasts.com" +
    page.at("div.episode div.image img")
        .attributes['src']
  tags = grab_all(page,
    "div.episode div.details p.tags a", ', ')
  episode_css_classes =
    page.at("div.episode").attributes['class']
  pricing = case episode_css_classes
    when /orange/; 'paid'
    when /green/; 'free'
    else raise "Bddcasts pricing guess failed: " \
      "episode css class: '#{episode_css_classes}'"
  end
  row[:thumbnail_img_tag] =
    "<img src='#{img_src}' height='90' " \
    "style='border: 1px solid #aaa'/>"
  row[:pricing] = pricing
  row[:tag_list] = tags
}
This pattern of mixing RSS feeds with HTML scraping to fill in missing data was common across sources. Each publisher had a different format, and the ETL had to adapt to each one.
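
TinyTL's internals are not shown in this write-up, so the following is only a guess at the general shape such a DSL could take: a registry that keeps one `source` block and one `each_row` block per source name and runs them in sequence. `MiniEtl` and its `run` method are hypothetical names, and the demo source returns a hard-coded string instead of fetching anything.

```ruby
# Hypothetical mini-DSL: one fetch block and one processing block per source.
class MiniEtl
  def initialize
    @sources    = {}
    @processors = {}
  end

  # Register the block that returns the raw items for a source.
  def source(name, &block)
    @sources[name] = block
  end

  # Register the block that conforms a single row for a source.
  def each_row(name, &block)
    @processors[name] = block
  end

  # Run one source: fetch, wrap each raw item in a row hash, then process it.
  # A source without a processing block passes its rows through unchanged.
  def run(name)
    fetch     = @sources.fetch(name)
    processor = @processors.fetch(name) { ->(row) { row } }
    fetch.call.map do |content|
      row = { content: content, source: name.to_s }
      processor.call(row)
      row
    end
  end
end

etl = MiniEtl.new
etl.source(:demo) { ['<title>Intro to RSpec</title>'] }
etl.each_row(:demo) { |row| row[:title] = row[:content][/<title>(.*?)<\/title>/, 1] }
rows = etl.run(:demo)
```

Keeping the fetch step and the per-row step in separate blocks means a publisher changing its page layout only touches that publisher's `each_row` block.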
Tag cleaning
Tags from different sources (publishers + Delicious) had to be conformed and deduplicated. I used a spreadsheet-driven approach: export all tags to a CSV, decide on mappings (e.g. activerecord -> active-record, admin -> sysadmin), and import those rules back into the pipeline.
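
As an illustration of the spreadsheet-driven approach, here is a minimal sketch using Ruby's `csv` library. The two-column `from,to` layout, the `RULES_CSV` constant and the `load_tag_rules` and `clean_tags` helpers are assumptions made for the example; the original rules file may have looked different.

```ruby
require 'csv'

# Hypothetical rules file content: one "from,to" mapping per line.
RULES_CSV = <<~RULES
  from,to
  activerecord,active-record
  admin,sysadmin
RULES

# Build a lookup table from the CSV rules.
def load_tag_rules(csv_text)
  CSV.parse(csv_text, headers: true).each_with_object({}) do |line, rules|
    rules[line['from'].strip.downcase] = line['to'].strip.downcase
  end
end

# Lowercase and trim each tag, drop blanks, map through the rules, dedupe.
def clean_tags(tags, rules)
  tags.map { |t| t.to_s.strip.downcase }
      .reject(&:empty?)
      .map { |t| rules.fetch(t, t) }
      .uniq
end
```

Tags without a rule pass through unchanged, so the mapping file only needs to list the exceptions.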
Data validation (“Screens”)
To catch regressions when sources changed format, I wrote validation assertions that ran against the imported data:
assert_include %w(free paid), row[:pricing]
assert_match(/https{0,1}:\/\//, row[:url], :url)
assert_not_nil row[:update_date]
assert row[:update_date].is_a?(Date)
Content distribution
The aggregated data was distributed through three channels:
- Web, with faceted and full-text search (ThinkingSphinx)
- RSS, sorted by freshness (which required grabbing update dates from each screencast)
- Twitter, posted semi-automatically with bit.ly short links
Looking back
Working on this side-project was an amazingly rewarding learning experience. It also brought me money indirectly, by giving me ideas and reusable code for my customers.
The key takeaways from this project:
- Partner with your data sources — contacting publishers directly helped get better data
- ThinkingSphinx is powerful — faceted search with delta indexing worked great
- Code is a small portion of the work — most of the effort went into understanding, cleaning and conforming the data
- All data needs cleaning before consumption — a lesson I’ve applied to every ETL project since