ETL top-down 1 – Architecting abstraction layering

April 22, 2009
168 Views

I work at a very exciting BI-program where we have been able to do things right. The project is part of a larger program that contains Business Consulting, Master Data Management and Business Intelligence and, who knows, maybe even Information Lifecycle Management.

My work on the project includes ETL on the Enterprise Data Warehouse, a central project in the program.

My trade is, and has for some time been architecting developer. This gives me advantages early in the ETL process when it comes to what I like to call “Abstraction Layering”

Abstraction Layering – customer perspective (why)

Abstraction layering helps to set the balance between “keeping things open” vs. “delivering as soon as possible”.

For my current project we need to deliver quickly while handling a few issues (from the top of my head):

  • Loosely defined long term goal – I think
  • Distributed developers, both geographically, and experience-wise
  • Many source systems
  • Large master data management and other projects changing the environment
  • Real-time
  • Right time
  • Traceability

When implementation is done, we need to focus on simple measurable tasks. One way to do this is to model the work on well defined levels of abstraction. This way we


I work at a very exciting BI-program where we have been able to do things right. The project is part of a larger program that contains Business Consulting, Master Data Management and Business Intelligence and, who knows, maybe even Information Lifecycle Management.

My work on the project includes ETL on the Enterprise Data Warehouse, a central project in the program.

My trade is, and has for some time been architecting developer. This gives me advantages early in the ETL process when it comes to what I like to call “Abstraction Layering”

Abstraction Layering – customer perspective (why)

Abstraction layering helps to set the balance between “keeping things open” vs. “delivering as soon as possible”.

For my current project we need to deliver quickly while handling a few issues (from the top of my head):

  • Loosely defined long term goal – I think
  • Distributed developers, both geographically, and experience-wise
  • Many source systems
  • Large master data management and other projects changing the environment
  • Real-time
  • Right time
  • Traceability

When implementation is done, we need to focus on simple measurable tasks. One way to do this is to model the work on well defined levels of abstraction. This way we can design top-down by having the most abstract discussions first, then some intermediate discussions and lastly the implementation details.

Things we do interesting to ETL includes:

  1. Selecting reference architecture.
    Master Data Management, Hub and spoke EDW with 2G, full archive of source systems, data marts, custom Meta Data Repository.
  2. Create ETL “horizontal” layering – interfaces and documentation.
    Packages take data from one architecture layer to another, grouping functionality and enabling measurability.
  3. Create ETL “vertical” layering – restrictions and grouping.
    Jobs uses “job packages” uses “aggregated packages” that groups functionality in measurable chunks.
  4. Specify update intervals and delivery.
    We plan for nightly job, hourly job and real-time job. Monthly, weekly reports, operational BI and more.
  5. Define deployment, operations, etc.
    Operations implements ITIL, we should interface with it as it matures.

We deliver.

Abstraction Layering – architect perspective (how)

Architecting abstraction layering is done to serve the data integration projects by empowering a few roles, these includes:

  • Project manager
    Work breakdown structure gets easier because one for any integration task have some nice metaphors.
  • Developer
    Gets assignments with a predictable scope.
  • Tester
    Can reuse tests because many atomic items has the same scope.

The architect gets a bag of expression for reuse in the modeling of all the ETL jobs and test templates. It gets possible to create templates for kinds of functionality used often or placeholders for functionality other systems depends on.

Abstraction Layering – project manager perspective (what)

The developer lead gets measurability and some nice metaphors from the abstraction layering, in our current project they are

  • Job
    Roughly equivalent to executable, e.g.: “Nightly job”.
  • Agg
    Typically one for each of the different states a job goes through, e.g.: “Source extract”
  • Fun
    Specific function for an entity, e.g.: “Extract <<customer tables>> from source system X”
  • Task / Tsk
    Task is part of a function, it moves, changes or collects data. Data in the warehouse knows where it comes from by reference to a such task id, a sample task might be “Extract customer address from source system X”.

The project manager must choose when these metaphors are appropriate, of course.

Abstraction Layering – developer perspective (what)

When assigned a task a developer can by the name of the delivery see how it fits into the wide picture on three dimensions

  • Job/Agg/Fun/Tsk
    Dictates the level along the low-level to high-level axis.
  • Context
    Horizontal layers in the architecture touched, for instance SourceDsa or DmsaDm.
  • Function
    Typically the ‘T’ in “ETL”.

Most work repetitive by nature should have current templates controlled by the architect.

Abstraction Layering – test perspective (what)    

Too early to say, but it looks like a combination of the abstract layering, “data haven” and MR shall make test-driven development beyond simple testing possible. Looks like integration and performance testing shall come relatively cheap.

Fading out

OK, this grew pretty long, looks like I’ll have to do more on parts of this later, with more concrete samples. Hope I find the time.

G.