Update: check out kiba-common's EnumerableExploder (available with Kiba v2) for a better way to achieve this today.
When processing data, it's common to need to explode a row whose buyers field contains several comma-separated values (buyers is known as a multivalued attribute) into one row per buyer.
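To make this concrete, here is a hypothetical illustration (the field names and values are made up for the example):

```ruby
# A row whose `buyers` attribute holds several comma-separated values:
row = { order_id: 1, buyers: "John,Mary" }

# Exploding it yields one row per buyer:
exploded = row[:buyers].split(",").map { |buyer| row.merge(buyers: buyer) }
# => [{ order_id: 1, buyers: "John" }, { order_id: 1, buyers: "Mary" }]
```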
In this article, you will see how to achieve this with Kiba ETL, ultimately building some very reusable ETL components.
A first way to normalize data from a source
In a first attempt, you could write a Kiba source that reads the data, splits the buyers field, then yields one row per buyer, like this:
deep_copy (coming from the facets gem) ensures that each sub-row does not share any values with the others. It relies on Marshal, so watch out for performance if that is a concern here.
You can use such a source this way:
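A Kiba ETL script using it could look like this (the destination and filenames are illustrative; this is a DSL fragment meant to be run with the kiba command, so it is not runnable on its own):

```ruby
# my_etl_script.etl -- hypothetical Kiba declaration
source NormalizingCSVSource, "orders.csv"

# ... transforms here if needed ...

destination SomeDestination, "output.csv"
```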
This works, yet you now have a single class dealing with two concerns at once (1/ CSV parsing, 2/ normalization), which makes it less likely to be reusable.
What if we could separate the parsing and the normalization completely?
Splitting the parsing and the normalization
A Kiba source is simply a class that responds to each and takes a couple of parameters at instantiation time (see the documentation). As such, you can create a RowNormalizer source able to instantiate any provided source and apply a given normalization logic to its rows, like this:
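A sketch of such a wrapper (the class and parameter names follow the article's description, but the exact signature is an assumption), together with the extracted normalization logic:

```ruby
# Wraps any source class and applies a normalizer to each of its rows.
class RowNormalizer
  def initialize(normalizer, source_class, *source_args)
    @normalizer = normalizer
    @source = source_class.new(*source_args)
  end

  def each
    @source.each do |row|
      @normalizer.normalize(row) { |sub_row| yield sub_row }
    end
  end
end

# The normalization logic now lives in its own class, e.g. an exploder
# for the `buyers` field (illustrative implementation):
class BuyersExploder
  def self.normalize(row)
    row[:buyers].split(",").each do |buyer|
      # Marshal-based deep copy, so sub-rows share no values
      sub_row = Marshal.load(Marshal.dump(row))
      sub_row[:buyers] = buyer
      yield sub_row
    end
  end
end
```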
You can also transform your CSV source into a more general purpose one, like:
This new source has no knowledge whatsoever about the normalization process.
Now you can rewrite your ETL script this way:
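Assuming components along the lines sketched above, the rewritten declaration might look like this (a DSL fragment for the kiba command; the destination is illustrative):

```ruby
# Hypothetical rewritten Kiba declaration: normalization is now
# composed with a plain CSV source instead of being baked into it.
source RowNormalizer, BuyersExploder, CSVSource, "orders.csv"

destination SomeDestination, "output.csv"
```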
Reusing your reusable ETL components
Now equipped with reusable components, you can recycle them to transform data for other scenarios, such as exploding a different multivalued attribute in another dataset.
At this point you just need the following code:
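For example, a hypothetical second normalizer exploding a comma-separated tags field instead of buyers (field and class names are made up for the example):

```ruby
# Same pattern as BuyersExploder, applied to a `tags` field.
class TagsExploder
  def self.normalize(row)
    row[:tags].split(",").each do |tag|
      sub_row = Marshal.load(Marshal.dump(row))
      sub_row[:tags] = tag
      yield sub_row
    end
  end
end

# The ETL declaration then simply recombines the generic components:
#   source RowNormalizer, TagsExploder, CSVSource, "articles.csv"
```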
Note that you can also swap the source at this stage, for instance to read the data from a database instead of a CSV file!