Also published the same day: What's ahead for Kiba ETL? (announcing Kiba Pro and an upcoming ebook).
Today’s scenario: you’re working for a thriving US company who just bought a French company to expand to the EU market.
You are responsible to build an aggregated sales reports and you’d like to integrate the French company sales data, coming as an original CSV flavor:
As a US company, you would like to have this kind of format instead:
The key differences are:
- the columns are separated by a semi-colon (not a comma)
- the prices are using a comma for the decimal separator (instead of the dot)
- the column names are in French (rather than in English)
- the dates are formatted as 8/3/2015 (we’d like ISO instead)
You'd usually would have to transcode data from ISO-8859-1 to UTF-8 too. I'll leave that for a future post.
The aim of this post is to show you how to convert the data to your target format, using Kiba ETL.
There are plenty of other tools to manipulate CSV! I'm just using CSV as an example of how Kiba works as a general-purpose ETL tool.
How plain Ruby and Kiba ETL compare?
A quick, one-off version conversion script using plain Ruby could look like this:
As this kind of script grows though, they get harder and harder to maintain. It’s easy to introduce an error in there! And the various parts are not really easily reusable nor testable.
To fix this, I’m going to take you step by step, incrementally, so we rewrite and refactor the ETL script with Kiba, to end up this way:
Don't be fooled! This version requires a lot more (but: reusable, testable) code that will sit in common.rb.
Setup your Kiba ETL project
First create a Gemfile
to specify our dependencies:
Install the dependencies:
Create our script:
Verify things work as expected:
You’re good to go!
You can also use the .rb extension, this is regular Ruby. I often change the extension to make sure the file won't get loaded by mistake inside a given Rails app, for instance.
Create a reusable CSV source
Kiba ETL focuses on creating reusable, maintainable ETL components. We’re going to do just that for the source of rows.
Create a repository of reusable bits:
Edit it and add this simple CSV source declaration:
You can read more about the syntax used below in the Kiba documentation.
If your project becomes much larger, you could put the CsvSource in its own file, with proper unit tests etc.
Now modify your convert-csv.etl
file to use exactly this instead:
The way we declare the source here makes it easier to quickly modify the script to use an alternate implementation (eg: using JRuby CSV reader). We could even declare the source conditionally based on what is currently running.
Try out your brand new CSV source
Create a commandes.csv
file with the data provided at the beginning of the post, then run Kiba:
At this stage, nothing should show up. How do we know we have what we need?
Let’s use awesome_print
right after the declaration to debug our pipeline:
Run the ETL script again:
You should get:
This means we’ve been able to parse our data made of semi-colon separated values, yay!
Since using ap
here seems handy, let’s make a reusable helper in common.rb:
Now you can quickly enjoy calling show_me!
whenever you want it.
Remember, we’ve been able to parse the semi-colon data, but we still have a number of differences in the format that we want to smooth out. Let’s tackle this!
Parse the numbers
The montant_eur
field (amount in euros) uses the comma as the decimal separator.
Let’s fix this and also rename the field at the same time:
Make sure to understand float precision and validate the input as needed, depending on your needs.
Now verify that things work as expected:
Looks good again! Now that we have this working, maybe we can extract a reusable component again?
Add this to common.rb
:
I'm using Ruby 2.1+ required keywords arguments here. This allows flexible yet fail-fast code.
You may want to make this source more robust or configurable depending on your needs.
We can now update the ETL script:
Much better!
Another way to tackle this is to separate the parsing and the renaming into 2 separate steps. This is very common in ETL scripts. I'm keeping a common step here to make sure we see the old and new values at once, for learning purposes.
Reformat your dates
Let’s parse our invoicing date in convert-csv.etl
now:
Again run kiba and check the output:
Looks like things work as expected! We could keep the date as a Ruby date and convert down the road (if we need to apply more transforms) or just convert it back to a string:
Run again and we’ll have what we want:
Let’s now extract a reusable component. Put this in common.rb
:
Then update the ETL script:
Run again, and you’ll see we’re good to go forward!
Rename the remaining column
"Rename the remaining". It's a bit difficult to pronounce. Say it loud 10 times.
The previous transforms already took care to assign a target name to the fields they were processing. We still have to rename one column though. Let’s do this:
Although that’s a bit overkill (but may not be, if you have many columns to process), we can refactor the code this way:
Add this to common.rb
:
Protect your script from input changes
It’s not unusual for a data source to suddenly change of format. It’s a good practice to put guards in place to detect such changes and fail fast.
For instance, we could just verify the source columns are there and provide a non-blank value, and raise an error otherwise.
Edit the Gemfile (we’re going to add a useful gem here):
Install the new gem:
Then modify your ETL script:
The check can be implemented this way:
Voilà! If the French subsidiary start sending a modified format, you’ll get an error.
You have a more robust ETL script now!
Prepare your CSV destination
Finally, it’s time to write our target CSV destination. Add this to common.rb
:
This destination will write the rows, flushing the file on close, and will also make sure that the provided columns are in the row.
Let’s use it now:
Run kiba on the script and you’ll get the expected CSV format!
Where do you go from here?
Obviously, you had to write more lines of code compared to the original Ruby version!
But you also now have a bunch of reusable components, and your ETL scripts are easier to maintain and to evolve.
For instance, if you write a database destination, modifying the ETL script to target it will only involve a one-line change, without much risk to break anything.
I hope you have a better understanding of the process of writing and refactoring ETL jobs with Kiba now. Drop me a line or a comment otherwise!
Stay tuned for more ETL articles!
Thank you for sharing this article!