A beginner's guide to Datawarehouse

January 19, 2008

In this article I’ll share some ressources, tools and tips I found useful while learning and building datawarehouses.

What is this Datawarehouse thing and how can it be useful to me?

For me, a Datawarehouse is a “system which purpose is to make data speak” so that we can actually get a good understanding of what’s happening to a business or a system. It’s all about building tools to make informed decisions – be them business related or of technical matter.

To give real-life examples, you can use datawarehousing and what’s called dimensional modeling to fulfill the following needs:

  • build monthly or weekly sales reports, sliced by customer, country, or city
  • analyze your web logs and proceed to custom analysis beyond what existing tools can do
  • give managers of a call-center statistics to organise themselves (at which time of the day do we face a peak of calls? Is it useful to hire an additional person to handle calls during the holidays or the week-ends?)

Facts, dimensions, and why pure SQL doesn’t cut it

I suggest you first read Data Warehousing for Cavemen. This article by Philip Greenspun helped me understand the fundamentals of OLTP vs. OLAP and data warehouseing. It also made me laugh a lot because of those pictures of burning cars (see by yourself!).

In dimensional modeling, you’ll split the data into facts (usually additive things, such as sales amount for instance) and into dimensions (elements by which to want to slice the data, such as customer, region, date, time…).

Instead of writing technical ad-hoc sql queries, you can build views on joins between one fact table and one or more dimension table, and have a non technical user query on them using Excel pivot tables for instance.

For most developers used to keep things DRY and rely on heavily normalized tables, dimensional modeling is surprising at first, and you’ll have to un-learn some things first. I found the two following books very valuable to get a good grasp on those concepts.

The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling by Ralph Kimball and Margy Ross will give you a deep understanding of what is dimensional modeling, both through theory and through analysis of real-life use cases to demonstrate each concept. I consider this book to be a fundamental reading.

I actually started implementing my first datawarehouse without finishing the book. Instead, I picked up chapters which were most relevant as I implemented it, and ended up reading the whole book.

If I had to remember just one point from this book, it would be that a datawarehouse is as good as its dimensions. Hence you’ll have to carefully craft them, one after another, and you’ll have to pick relevant attributes only to avoid getting a non-functional system.

The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning also by Ralph Kimball and Joe Caserta really is the sister book of the first. ETL (extract/transform/load) is the data transformation toolbox which helps you prepare the data before filling it into your dimensional model.

Again if I had to remember one thing about ETL, it is that data from live systems is very dirty (most of the time). GUI bugs let invalid or unreliable data enter into the system, and sometimes it just cannot be cleaned (you’ll have to introduce the idea of “unreliable” or “unknown” data in the datawarehouse system). One more reason to keep datawarehoused tables and columns to the strict minimum.

If you’re not a book guy or just want to have a first look at these subjects online, have a look at those sites:

  • the articles by Ralph Kimball
  • the Data Warehousing Review web site
  • the articles at DW Info Center

Which tools can I use?

Here’s my current setup of choice. Please note it’s a matter of personal taste – you’ll find plenty of tools on the web for that. Just be aware that many overly complicated tools exist – prototyping a bit to see if a tool matches your preferences is worth it before taking the plunge:

  • ActiveWarehouse – a Ruby open-source project to handle analytical stuff
  • ActiveWarehouse-ETL – the sister project to handle extract transform and load
  • a simple MySQL back-end
  • Microsoft Excel equipped with the MySQL ODBC driver
  • Tableau Software whenever I can (a fantastic tool to dive into data and analyze it very quickly)
  • Ruby Reports aka Ruport to generate canned reports with grace

I’ve been using other software before (and still continue to do it), including Microsoft SSIS to build the data aggregation engine behind the Villanao website.

While it’s true that Ruby is slow compared to other languages or tools, I found out that focusing on a highly hackable solution was worth it, in my cases. SSIS and other solutions were kind of hard to deploy and maintain on the long run (my own opinion – and it reflects my current employer opinion, myself).

Thanks

Many thanks to Anthony Eden for building ActiveWarehouse and for his support on the ActiveWarehouse mailing-list.