Data Warehousing - Tom's Ten Data Tips

Data Warehousing was an innovation from the 1990s that promised to change the data landscape for good. How far have we come? Many vendors have entered the marketplace because it makes sense to bring together data from throughout the organization, and this will continue to make sense in the future. How large the Data Warehouse market will grow nobody knows yet. But for sure it is still growing fast, and is currently estimated at 4.5 billion dollars per year (IDC).

1. Why Do Data Warehouse Projects Run Into Scope Creep?
To quote Bill Inmon (guru and author of several great books on Data Warehousing): "Traditional projects start with requirements and end with data. Data Warehousing projects start with data and end with requirements." As soon as the project gets under way, users will find new applications, and with them will come new requests for data. Interestingly, these projects are often justified by moving Q&R work away from the 'data people'. What we've seen is that the first thing that happens as soon as the project delivers is that more requests for special queries are submitted to these same 'data people'. This may appear to undermine the initial business case, but it actually signals the onset of value creation from the DWH project.

2. Star Schema Versus Entity Relation Model?
There has been enormous debate in the community about the merits of different data models. At the risk of oversimplifying: Star models tend to have better performance (processing time) for the end user, and are often perceived as "easier" to understand by end users. Drawbacks are that Star models require more disk space and, because of the intrinsic redundancy in the data, have consistency problems from a maintenance perspective. Having said this, in practice some combination of the two is often unavoidable, despite the preferences (ER or Star) of the chief architects. Overall, Star models seem to have gained the most ground.

3. The Importance of a Data Warehouse Business Case
Much has been written about the business case for a Data Warehouse. What goes into a good business case? IT savings are ubiquitous in DWH business cases. The important point is not to limit this to 'pure' savings, but to connect to primary business processes as much as possible. As an example, faster turnaround cycles for list selections are fine (when quantified in hourly rates), but it is even better if the revenue from the additional customer acquisitions that follow from these selections can be tied in. Not only will the relation to revenue growth rather than savings make for a more balanced business case; more important is the intrinsic business buy-in that results from a direct connection to the company bottom line. These days, changes in legislation (in particular Sarbanes-Oxley) play a major role in justifying business cases. This may be either through a higher company valuation for its transparent information gathering, or through fewer sleepless nights for the CEO, which is of course priceless...

4. Why Do Data Warehouse Projects 'Never' Go Wrong?
Actually, Data Warehouse projects do sometimes fail. But they fail so rarely that it is actually very hard to believe... Especially after having talked to so many disgruntled end-users. And there are many ways a Data Warehouse project can go wrong: failing to deliver on time, data administration issues, and unavoidable data quality issues in feeding systems. Corporate politics (see Tip 7) are probably the best explanation for this phenomenon of near 100% success rates on DWH projects. In my experience, the reason why a failure or 'semi-failure' can go unnoticed is that senior management is either not aware or, let's say, "unmotivated" to talk about misspending of company funds. As a result, not enough is learned. Maybe we as consultants have a stake in this as well, as this assures the industry plenty of ongoing business... :-)

5. What is Different About Warehousing Web Data?
Kimball & Merz (2000): "Although this clickstream data in many cases is raw and unvarnished, it has the potential of providing unprecedented detail about every gesture made by every human being using the Web medium". The subatomic nature of clickstream data poses unique challenges. There are fewer built-in feedback mechanisms to ensure data quality, compared to other data streams. The relation between user mouse clicks and server log records is not as tight as in "traditional" transaction processing, due to technical issues like proxy servers and caching. Because of these differences, IT people need to adapt to the web process flow, rather than having the process adapt to IT needs as is common for most other DWH interfaces.

6. Which Data Should Be Loaded In The Data Warehouse?
The data that enter the DWH ultimately determine its place in the organization. A "let's load all data, to be safe" attitude is a surefire way to derail your DWH project. Choices as to what should and should not be included need to be made early on, to keep the project manageable. After proven success of the delivered, deployed, and profitably exploited DWH, there will always be funding somewhere to include previously ignored interfaces. Given the anticipated lifecycle of the DWH, it makes perfect sense to consciously exclude certain sources. The choice as to what data to include needs to be driven by business considerations, and in particular by reference to the company bottom line. If it can't be shown how data will be put to use profitably, they stay out! See also Tip #3.

7. Data Warehousing & Company Politics
Data Warehouses have an impact on the company bottom line. Hence, they are likely candidates for turf battles, and are also at risk of becoming "small change" in budget allocation negotiations. None of these considerations benefits corporate long-term goals. Managing a DWH project is hard enough as it is, and budget issues shouldn't make it any harder. Because DWH investments are in the present and revenues lie in the future, it is even more important to secure funding through a sound business case and buy-in from the appropriate (high) management level. See also Tip #3. Access to data means power, and talking about power is one of the greatest management taboos still around. Sensitive as they are, even budgets are more readily discussed...

8. Data Warehouse Project Traps
Some commonly recurring 'roadblocks' on the path to timely delivery of a Data Warehouse project:
- ETL processes have eaten up so much time (and still need "babysitters") that little if any time is left to develop the applications needed to exploit the DWH
- Some data are needed, but turn out to be unavailable, or not available in a timely fashion
- Maintenance required for tuning, indexing, and backup and recovery is severely underestimated
- Different ways of calculating the same phenomenon lead to different results, and nobody is able to conclusively explain the difference(s)
- The data that are loaded (and recombined) turn out to contain previously unknown inconsistencies in the source systems, the 'classic' data quality issues that trip up DWH projects
- Metadata are lacking, and developers spend inordinate amounts of time finding out what a field really 'means'

9. DWH Hardware and Software Go Hand in Hand
In Data Warehousing, it is not about hardware, and not about software: it is about the perfect integration of these two. Those who begin their project from either end will pay dearly for this mistake. Reasons are:
· in terms of price/performance, new, pre-integrated hardware-software combinations are taking the lead
· from a project management perspective, you never want to be caught between vendors when a proposed solution doesn't work as expected
· database tuning and indexing is very important and a hugely complex job, necessarily left to specialists (in-house trained)

10. Performance is Key
Although I don't often find technology factors to be this important, in Data Warehouse acceptance no other factor will be as important as performance. As size increases over time, this factor becomes even more important. There are three reasons for this:
   1. performance has a huge impact on the development speed (initial load is always very time consuming), and hence the overall maturity of the DWH at delivery time
   2. performance can make or break end-user acceptance, in particular the predictability of performance
   3.