Rubyists - Are you doing ETL unknowingly?

March 25, 2015

Have you ever had to carry out one of these tasks while building or maintaining a Ruby-based app?

Write a script to migrate a legacy database to a new schema.
Automate processing of your data to generate a report.
Synchronize all or part of the data between two systems on a regular basis.
Prepare your data for indexing/searching.
Aggregate heterogeneous data sources into one consistent database.
Clean up dirty or bogus data.
Geocode rows in an app to present them through a map app.
Implement a data export process for your users.

If so, there’s one keyword you definitely need to know about: ETL, which stands for Extract-Transform-Load. Here’s why!

All these data tasks can be completed with ETL

Here’s a typical quote I get after introducing ETL to teams I work with:

Before, I didn’t realize that what I was doing had a name. Becoming aware of the word “ETL” has been tremendously useful to seek help, books, tools and to discover and apply efficient patterns.

If you google ETL, you will indeed find a wealth of resources (including the classic ETL book by Ralph Kimball) and numerous tools such as the Ruby ETL by Square and Apache Storm, along with analytical tools like Mondrian.

If you were not aware of the term ETL, a whole new world opens up for completing the tasks mentioned earlier efficiently! I’ll cover more about that here in the upcoming weeks.

But: is Ruby appropriate to write ETL processes?

You bet! I’ve been doing this for a decade (see, for instance, my talk on the now-unmaintained activewarehouse-etl, which I’m rewriting from the ground up as kiba).

Ruby works very nicely for creating well-tested, agile ETL processes, either by processing the data directly or by acting as a master of ceremonies for other tools.
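To give a feel for the "processing the data directly" option, here is a minimal sketch of an extract-transform-load pipeline in plain Ruby. It is not the API of any of the tools mentioned above: the names ArraySource, ArrayDestination, TRANSFORMS and run_etl are made up for illustration, and the in-memory array stands in for what would normally be a file or a database.

```ruby
# Extract: a source yields rows one at a time. This hypothetical source reads
# from an in-memory array; a real one might read a CSV file or query a database.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

# Load: a destination receives each transformed row.
class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(row)
    @rows << row
  end
end

# Transform: each step takes a row and returns a new row, or nil to drop it.
TRANSFORMS = [
  ->(row) { row.merge(name: row[:name].to_s.strip) },   # clean up whitespace
  ->(row) { row[:name].empty? ? nil : row },            # drop rows with no name
  ->(row) { row.merge(amount: Float(row[:amount])) }    # parse amounts as numbers
].freeze

# Runs every row from the source through the transforms, in order,
# and writes the surviving rows to the destination.
def run_etl(source, transforms, destination)
  source.each do |row|
    transforms.each do |transform|
      row = transform.call(row)
      break if row.nil? # a transform dropped the row, skip the remaining steps
    end
    destination.write(row) unless row.nil?
  end
  destination
end

legacy_rows = [
  { name: "  Alice ", amount: "12.50" },
  { name: "   ",      amount: "3.00" },
  { name: "Bob",      amount: "7" }
]

destination = run_etl(ArraySource.new(legacy_rows), TRANSFORMS, ArrayDestination.new)
destination.rows.each { |row| puts "#{row[:name]}: #{row[:amount]}" }
```

Because each transform is a small callable that takes a row and returns a row, every step can be unit-tested in isolation, which is a large part of what makes this style of ETL pleasant to maintain.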

But you do not have to take my word for it: subscribe to this blog and I’ll be sharing some very concrete examples in the coming weeks. Stay tuned!