How to write solid data-processing code?

April 05, 2015

When dealing with data tasks, didn’t you already find yourself writing nested loops with deeply imbricated conditionals right inside your Rakefile, a bit similar to what’s below?

upsert = Upsert.new connection, 'customers'
CSV.foreach('customers.csv', headers: true) do |row|
  case row[:contract_start_year]
  when 1
    row[:customer_key] = # SNIP
  when 2
    # SNIP
  else
    if row[:original_row_type] == 'CUSTOMER'
      # SNIP
    end
  end
  upsert.row(row) if row[:compliance_check] == '1'
end

While such code is common and works decently well for one-off & simple tasks, the situation degrades if:

  • You need to write many similar scripts & want to keep things DRY.
  • Your code must remain easy to maintain & able to evolve for several years.
  • Your processing needs just became more complex.

One technique available to keep a sane codebase in such cases is what I could call “declarative row-based processing”.

A different way to structure data-processing code

First introduced by Anthony Eden’s activewarehouse-etl in 2006, today reimplemented from scratch in my gem Kiba (more background here), declarative row-based processing describes the processing as a pipeline, in Ruby:

source CsvSource, 'incoming/customers_*.csv'

transform GeocodeTransform, [:adress1, :zip, :city], [:lat, :lon, :geocoding_status]
transform { |row| row[:geocoding_status] ? row : nil } # dismiss non geocoded rows
transform VerifyCompliance
transform { |row| row[:compliance_check] == '1' ? row : nil }

destination UpsertDestination, config[:reporting], table: 'customers', key: :cust_key

This is a Domain-Specific-Language, where the Domain is very narrow: “data row processing”. Your declaration gets parsed by the ETL, then at execution time, the ETL fetches rows from the source(s), passing them to each transform and ultimately to the destination(s).

Depending on your taste, this may either look appealing or weird! Either way, it has a number of interesting characteristics from a maintenance & code-writing point of view:

Well-defined & testable components

In your declaration, most features of Ruby, including require are allowed. This makes it possible to use your own classes for sources, transforms & destinations.

Since the responsabilities of each part are well defined and you can define them as classes with a very narrow scope, it makes it easier to unit-test your ETL components, make sure they do not regress over time & reuse them across many ETL scripts, just like regular code.

Quick reusable helpers for DRY code

Sometimes a class isn’t necessary, but you’ll want a quick reusable helper, like:

def parse_french_date(string)
  Date.strptime(string,'%d/%m/%Y')
end

This allows for script-level reuse, or even quick sharing between ETL scripts.

Macro-transform definitions

It is also possible to define a macro-transform like this:

def my_macro_transform(field)
  transform ThisThing, field
  transform { |row| do_that }
  transform ThatThing, :a, :b, :c
end

Then later call my_macro_transform instead of a transform: it will register the series of transforms as expected in lieu of the macro transform.

This allows to reuse subparts inside a single ETL script or throughout all your scripts via require.

Meta-declaration

Since the DSL is mostly regular Ruby, you can do things at parsing time that will affect your declaration, e.g.:

Dir['incoming/cust*.csv'].each do |f|
  if f =~ /this|that/
    source CsvSource, f
  end
end

Even if this means that the files will be listed at parsing time, it still comes handy.

Ease of maintenance (even for non-rubyists)

Rails gives structure to a project, so you mostly know where things are. In declarative row-based processing, it’s even more notable, because the structure is fairly constraining.

This makes it very easy to maintain ETL scripts that are in production for several years (as I discovered), and it also makes it much easier for non-rubyists (or beginners) to update the processing without breaking everything.

Isn’t that very similar to “good code” anyway?

You could say that all those characteristics are those of quality coding anyway, and I’d agree with you!

It’s just that when it comes to data processing, the very nature of the flow of rows make us more likely to write “quick scripts” rather than properly structured code, because we’re less used to data processing.

The “declarative row-based processing” is just a technique to bring back the quality you’re used to in regular code, right into data processing tasks.

What’s next?

In my upcoming posts, I’ll use Ruby & this type of syntax (and more specifically, Kiba) to share more ETL patterns and techniques with you. Stay tuned!