Update: check out kiba-common's EnumerableExploder (available with Kiba v2) for a better way to achieve this today.
When processing data, it's common to need to explode a row whose buyers field holds several values (what is known as a multivalued attribute) into one row per buyer, as illustrated below.
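For illustration, here is a hypothetical before/after sample (the data used in the original article may have differed): an orders file whose buyers column holds several comma-separated values.

```text
# before: one row, with a multivalued "buyers" column
order_id,buyers
1,"John Doe,Jane Smith"
2,"Alice Wonder"

# after: one row per buyer
order_id,buyer
1,John Doe
1,Jane Smith
2,Alice Wonder
```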
In this article you are going to see how to achieve this with Kiba ETL, ultimately with some very reusable ETL components.
A first way to normalize data from a source
As a first attempt, you could write a Kiba source that reads the data, splits the buyers
field, then yields one row per buyer, like this:
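Here is a minimal sketch of such a source (class, field and file names are assumptions for the example, not the article's original code):

```ruby
require 'csv'
# facets provides deep_copy (Marshal-based); the exact require path may vary
require 'facets/kernel/deep_copy'

# Reads a CSV file and explodes the multivalued :buyers field into one row per buyer.
class CsvBuyersExplodingSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    CSV.foreach(@filename, headers: true, header_converters: :symbol) do |csv_row|
      row = csv_row.to_hash
      row[:buyers].to_s.split(',').each do |buyer|
        sub_row = row.deep_copy # avoid sharing values between sub-rows
        sub_row.delete(:buyers)
        sub_row[:buyer] = buyer.strip
        yield sub_row
      end
    end
  end
end
```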
deep_copy (provided by the facets gem) ensures each sub-row does not share any value with the others. It relies on Marshal, so keep an eye on performance if that matters in your case.
You can use such a source this way:
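For instance, in a Kiba ETL script (a sketch: CsvDestination is a made-up minimal destination, and the file names are hypothetical):

```ruby
require 'csv'

# A tiny CSV destination, just for the sake of the example.
class CsvDestination
  def initialize(filename:)
    @csv = CSV.open(filename, 'w')
    @headers_written = false
  end

  def write(row)
    unless @headers_written
      @csv << row.keys
      @headers_written = true
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end

# declarations below go in your Kiba script, run with the `kiba` command
source CsvBuyersExplodingSource, filename: 'orders.csv'
destination CsvDestination, filename: 'orders-by-buyer.csv'
```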
This works, yet you now have a single class dealing with two concerns at once (1/ CSV parsing, 2/ normalization), which makes it less reusable.
What if we could separate the parsing and the normalization completely?
Splitting the parsing and the normalization
A Kiba source is simply a class that responds to each
and takes a couple of parameters at instantiation time (see the documentation). As such, you can create a RowNormalizer
source that instantiates any source you provide and applies a given normalization logic, like this:
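A possible sketch (the constructor signature is an assumption; the original implementation may wire things differently):

```ruby
# Wraps any source together with a normalizer: each upstream row is replaced
# by the rows returned by the normalizer.
class RowNormalizer
  def initialize(normalizer:, source:, **source_args)
    @normalizer = normalizer.new
    @source = source.new(**source_args)
  end

  def each
    @source.each do |row|
      @normalizer.normalize(row).each do |normalized_row|
        yield normalized_row
      end
    end
  end
end
```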
You can also turn your CSV source into a more general-purpose one, like this:
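For example (a sketch with minimal options):

```ruby
require 'csv'

# A plain CSV source: it only parses, and knows nothing about normalization.
class CsvSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    CSV.foreach(@filename, headers: true, header_converters: :symbol) do |row|
      yield row.to_hash
    end
  end
end
```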
This new source has no knowledge whatsoever about the normalization process.
Now you can rewrite your ETL script this way:
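A sketch of the rewritten script, reusing the classes above (the normalization logic now lives in its own small class; in practice you would probably keep these classes in separate files and require them):

```ruby
require 'facets/kernel/deep_copy' # deep_copy, as discussed above (require path may vary)

# Turns one row with a multivalued :buyers field into N rows with a :buyer field.
class BuyersNormalizer
  def normalize(row)
    row[:buyers].to_s.split(',').map do |buyer|
      sub_row = row.deep_copy
      sub_row.delete(:buyers)
      sub_row[:buyer] = buyer.strip
      sub_row
    end
  end
end

source RowNormalizer,
  normalizer: BuyersNormalizer,
  source: CsvSource,
  filename: 'orders.csv'

destination CsvDestination, filename: 'orders-by-buyer.csv'
```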
Reusing your reusable ETL components
Now equipped with reusable components, you can recycle them to transform data for other scenarios, such as the transformation illustrated below.
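For example (a hypothetical scenario; the original article's data may have differed), unpivoting monthly columns into one row per month:

```text
# before: one row per product, one column per month
product,jan_sales,feb_sales,mar_sales
Widget,120,95,140

# after: one row per product and month
product,month,sales
Widget,jan,120
Widget,feb,95
Widget,mar,140
```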
At this point you just need the following code:
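A sketch of the code for this scenario: only the normalizer changes, while RowNormalizer and CsvSource are reused untouched (class and field names are assumptions):

```ruby
# Turns one "wide" row into one row per month.
class MonthlySalesNormalizer
  MONTHS = %i[jan_sales feb_sales mar_sales]

  def normalize(row)
    MONTHS.map do |field|
      {
        product: row[:product],
        month: field.to_s.sub('_sales', ''),
        sales: row[field]
      }
    end
  end
end

source RowNormalizer,
  normalizer: MonthlySalesNormalizer,
  source: CsvSource,
  filename: 'monthly-sales.csv'

destination CsvDestination, filename: 'sales-by-month.csv'
```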
Note that you can also switch the source at this stage, for instance to read the data from a database instead of a CSV file!
Credits
Many thanks to Lucas Di Cioccio, Lukas Fittl, Jordon Bedwell and Franck Verrot for the inspiration!