When dealing with data tasks, have you ever found yourself writing nested loops with deeply nested conditionals right inside your Rakefile, a bit like what’s below?
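Here is a hypothetical sketch of that style. The data, the field names and the `import_orders` method are made up for illustration, and in-memory rows stand in for real CSV files:

```ruby
# Hypothetical one-off import; the hash below stands in for files on disk.
ORDERS_BY_FILE = {
  'orders_2014.csv' => [
    { 'id' => '1', 'status' => 'paid',    'amount' => '12,50', 'country' => 'FR' },
    { 'id' => '2', 'status' => 'pending', 'amount' => '8,00',  'country' => 'FR' }
  ],
  'orders_2015.csv' => [
    { 'id' => '3', 'status' => 'paid', 'amount' => '0,00', 'country' => 'FR' },
    { 'id' => '4', 'status' => 'paid', 'amount' => '5,25', 'country' => 'BE' },
    { 'id' => '5', 'status' => 'paid', 'amount' => '7,10', 'country' => 'FR' }
  ]
}

def import_orders(files)
  imported = []
  files.each do |filename, rows|
    rows.each do |row|
      # Filtering, parsing and writing are all tangled together here.
      if row['status'] == 'paid'
        if row['country'] == 'FR'
          amount = row['amount'].tr(',', '.').to_f
          if amount > 0
            imported << { id: row['id'].to_i, amount: amount, file: filename }
          end
        end
      end
    end
  end
  imported
end

IMPORTED = import_orders(ORDERS_BY_FILE)
```

Every new rule adds another level of nesting, and none of the steps can be tested or reused on its own.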
While such code is common and works decently well for one-off & simple tasks, the situation degrades if:
- You need to write many similar scripts & want to keep things DRY.
- Your code must remain easy to maintain & able to evolve for several years.
- Your processing needs just became more complex.
One technique available to keep a sane codebase in such cases is what I could call “declarative row-based processing”.
A different way to structure data-processing code
First introduced by Anthony Eden’s activewarehouse-etl in 2006, today reimplemented from scratch in my gem Kiba (more background here), declarative row-based processing describes the processing as a pipeline, in Ruby:
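To make the idea concrete without depending on any gem, here is a minimal sketch of such a pipeline. `TinyETL`, `ArraySource` and `ArrayDestination` are hypothetical stand-ins written for this example; they are not Kiba's code, and only mimic the `source` / `transform` / `destination` vocabulary:

```ruby
# TinyETL is a toy stand-in for illustration, NOT Kiba's implementation.
module TinyETL
  # Records the declaration without touching any data.
  class Declaration
    attr_reader :sources, :transforms, :destinations

    def initialize
      @sources = []
      @transforms = []
      @destinations = []
    end

    def source(klass, *args)
      @sources << [klass, args]
    end

    # A transform is either a block or a class (plus constructor arguments).
    def transform(klass = nil, *args, &block)
      @transforms << (block || [klass, args])
    end

    def destination(klass, *args)
      @destinations << [klass, args]
    end
  end

  # Parsing time: evaluate the block and record what was declared.
  def self.parse(&block)
    declaration = Declaration.new
    declaration.instance_eval(&block)
    declaration
  end

  # Execution time: pull rows from each source, pass them through every
  # transform (a nil result drops the row), then write to each destination.
  def self.run(declaration)
    transforms = declaration.transforms.map do |t|
      t.is_a?(Proc) ? t : t[0].new(*t[1]).method(:process)
    end
    destinations = declaration.destinations.map { |klass, args| klass.new(*args) }
    declaration.sources.each do |klass, args|
      klass.new(*args).each do |row|
        transforms.each do |t|
          row = t.call(row)
          break if row.nil?
        end
        next if row.nil?
        destinations.each { |d| d.write(row) }
      end
    end
    destinations.each(&:close)
  end
end

# Hypothetical in-memory source and destination.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

class ArrayDestination
  def initialize(sink)
    @sink = sink
  end

  def write(row)
    @sink << row
  end

  def close; end
end

OUTPUT = []

JOB = TinyETL.parse do
  source ArraySource, [{ name: ' alice ', age: '31' }, { name: 'bob', age: '' }]
  transform { |row| row[:age].empty? ? nil : row } # drop rows without an age
  transform { |row| row.merge(name: row[:name].strip.capitalize, age: row[:age].to_i) }
  destination ArrayDestination, OUTPUT
end

TinyETL.run(JOB)
```

The declaration at the bottom is the part you would write day to day; the rest is plumbing that an ETL library provides for you.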
This is a domain-specific language (DSL), where the domain is very narrow: “data row processing”. Your declaration gets parsed by the ETL, then at execution time, the ETL fetches rows from the source(s), passing them to each transform and ultimately to the destination(s).
Depending on your taste, this may either look appealing or weird! Either way, it has a number of interesting characteristics from a maintenance & code-writing point of view:
Well-defined & testable components
In your declaration, most features of Ruby, including require, are allowed. This makes it possible to use your own classes for sources, transforms & destinations.
Since the responsibilities of each part are well defined and you can define them as classes with a very narrow scope, it becomes easier to unit-test your ETL components, make sure they do not regress over time & reuse them across many ETL scripts, just like regular code.
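As a sketch of what such a narrow component can look like, here is a hypothetical transform class. The name `ParseFrenchAmount` and the field names are made up for this example; the only assumption about the ETL is that a class-based transform exposes a `process(row)` method returning the row:

```ruby
# Hypothetical transform: parses a French-formatted amount such as "1 234,50".
class ParseFrenchAmount
  def initialize(field)
    @field = field
  end

  # Receives one row (a Hash) and returns it with the field converted to a Float.
  # Float() raises on malformed input instead of silently returning 0.0.
  def process(row)
    row[@field] = Float(row[@field].delete(' ').tr(',', '.'))
    row
  end
end

# The class can be exercised on its own, without running a whole pipeline.
PARSED_ROW = ParseFrenchAmount.new(:amount).process({ amount: '1 234,50', label: 'Invoice' })
```

Because the class knows nothing about where rows come from or where they go, a plain unit test on `process` is enough to guard against regressions.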
Quick reusable helpers for DRY code
Sometimes a class isn’t necessary, but you’ll want a quick reusable helper, like:
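For instance, a small date-parsing method defined at the top of the script. The helper name, the `:created_at` field and the lambda standing in for a transform block are illustrative assumptions:

```ruby
require 'date'

# Hypothetical helper, defined once and callable from any transform block.
def parse_french_date(string)
  Date.strptime(string, '%d/%m/%Y')
end

# Stand-in for what a block-based transform would do with the helper.
CONVERT_CREATED_AT = lambda do |row|
  row.merge(created_at: parse_french_date(row[:created_at]))
end

CONVERTED_ROW = CONVERT_CREATED_AT.call({ id: 1, created_at: '25/12/2014' })
```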
This allows for script-level reuse, or even quick sharing between ETL scripts.
It is also possible to define a macro-transform like this:
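A minimal sketch of the idea: the macro is a plain method that calls `transform` several times. The two transform classes are hypothetical, and `RecordingContext` is a tiny stand-in for the ETL's declaration context so that the example runs on its own:

```ruby
# Hypothetical transforms, for illustration only.
class StripWhitespace
  def process(row)
    row.transform_values { |value| value.is_a?(String) ? value.strip : value }
  end
end

class DropBlankValues
  def process(row)
    row.reject { |_key, value| value.nil? || value == '' }
  end
end

# The macro: registers a series of transforms, in order.
def my_macro_transform
  transform StripWhitespace
  transform DropBlankValues
end

# Stand-in for the declaration context (an assumption, not the ETL's real class).
class RecordingContext
  attr_reader :transforms

  def initialize
    @transforms = []
  end

  def transform(klass)
    @transforms << klass
  end
end

CONTEXT = RecordingContext.new
CONTEXT.instance_eval do
  my_macro_transform
end

# Apply the registered transforms to one sample row.
MACRO_RESULT = CONTEXT.transforms.reduce({ name: ' Ann ', note: '  ' }) do |row, klass|
  klass.new.process(row)
end
```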
Then later call my_macro_transform instead of a transform: it will register the series of transforms as expected in lieu of the macro transform.
This lets you reuse subparts inside a single ETL script or throughout all your scripts via
Since the DSL is mostly regular Ruby, you can do things at parsing time that will affect your declaration, e.g.:
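One possible illustration is declaring one source per file found in a directory. `CsvFileSource`, `SourceRecorder` and `declare_sources` are hypothetical names; the recorder stands in for the ETL's declaration context, and a temporary directory keeps the example self-contained:

```ruby
require 'tmpdir'

# Hypothetical source class; a real one would open and parse the file in #each.
class CsvFileSource
  attr_reader :path

  def initialize(path)
    @path = path
  end
end

# Stand-in for the declaration context (an assumption, not the ETL's real class).
class SourceRecorder
  attr_reader :sources

  def initialize
    @sources = []
  end

  def source(klass, *args)
    @sources << [klass, args]
  end
end

def declare_sources(dir)
  recorder = SourceRecorder.new
  recorder.instance_eval do
    # Plain Ruby, evaluated while the declaration is being parsed:
    # the directory is listed now, not when rows start flowing.
    Dir[File.join(dir, '*.csv')].sort.each do |file|
      source CsvFileSource, file
    end
  end
  recorder.sources
end

DECLARED = Dir.mktmpdir do |dir|
  %w[b.csv a.csv notes.txt].each { |name| File.write(File.join(dir, name), '') }
  declare_sources(dir).map { |klass, args| [klass, File.basename(args.first)] }
end
```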
Even if this means that the files will be listed at parsing time, it still comes in handy.
Ease of maintenance (even for non-rubyists)
Rails gives structure to a project, so you mostly know where things are. In declarative row-based processing, it’s even more notable, because the structure is fairly constraining.
This makes it very easy to maintain ETL scripts that are in production for several years (as I discovered), and it also makes it much easier for non-rubyists (or beginners) to update the processing without breaking everything.
Isn’t that very similar to “good code” anyway?
You could say that all those characteristics are those of quality coding anyway, and I’d agree with you!
It’s just that when it comes to data processing, the very nature of the flow of rows makes us more likely to write “quick scripts” rather than properly structured code, because we’re less used to data processing.
“Declarative row-based processing” is just a technique to bring the quality you’re used to in regular code right back into data processing tasks.
In my upcoming posts, I’ll use Ruby & this type of syntax (and more specifically, Kiba) to share more ETL patterns and techniques with you. Stay tuned!