Recently I ran into an interesting ETL problem while using a source system "last update" field. Let me give you some background.
We have an ETL process that reads from a source system that was developed in-house. Every extract query filters on the last update field of its table.
While we were in UAT, several reports turned out to be missing rows. After investigating, I found that the rows had never made it to the data mart. Needless to say, this was very worrisome.
I researched and tried to find out why these rows were excluded. There seemed to be no pattern, just random rows.
While looking at my morning logs, I noticed something strange. The ETL Last Update table showed times about 5 hours after the ETL had run, when it should have shown the time of the run itself.
I looked in the source system, and there were 3 rows that had update dates in the future! 5 hours to be exact!
It turns out that under certain circumstances, the source system was using the wrong date/time to update the last update field. And this date was GMT, so it was 5 hours in the "future" when it was applied to the last update field.
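The result was missed records in the ETL. Once a future-dated row pushed the stored last update time ahead, the next run's "greater than" filter skipped everything updated in between, so we lost 5 hours' worth of updates any time this occurred in the source system.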
So my recommendation, which I am now kicking myself for not implementing to begin with, is this: Always use a date range, not just a "Greater than" for last update fields. For example:
WHERE LastUpdate BETWEEN '3/5/09 11:00:00' AND GETDATE()
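To see why the upper bound matters, here is a minimal sketch in Python using an in-memory SQLite table. Everything in it is made up for illustration: the table name src, the function names, the timestamps, and the assumption that the "greater than" version advances its watermark to the newest LastUpdate it read. It is not the real ETL, just a small model of the failure. It also uses a half-open window (greater than the last run, less than or equal to this run) rather than an inclusive BETWEEN, so a row sitting exactly on a boundary is not read twice.

```python
import sqlite3


def extract_greater_than(conn, watermark):
    """Naive strategy (assumed): read every row newer than the watermark,
    then advance the watermark to the newest last_update that was read."""
    rows = conn.execute(
        "SELECT id, last_update FROM src WHERE last_update > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    new_watermark = max((ts for _, ts in rows), default=watermark)
    return [row_id for row_id, _ in rows], new_watermark


def extract_window(conn, watermark, run_time):
    """Bounded strategy: read rows in (watermark, run_time], then advance
    the watermark to run_time, never to a value supplied by the source."""
    rows = conn.execute(
        "SELECT id FROM src WHERE last_update > ? AND last_update <= ? "
        "ORDER BY id",
        (watermark, run_time),
    ).fetchall()
    return [row_id for (row_id,) in rows], run_time


def simulate():
    """Run both strategies over two ETL runs and return the ids each loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, last_update TEXT)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [
            (1, "2009-03-05 09:00:00"),  # normal row
            (2, "2009-03-05 15:30:00"),  # stamped in the "future" by the bug
        ],
    )

    # First run at 11:00 on 3/5; both strategies start from the same point.
    start, run1 = "2009-03-04 11:00:00", "2009-03-05 11:00:00"
    naive_ids, naive_wm = extract_greater_than(conn, start)
    window_ids, window_wm = extract_window(conn, start, run1)

    # A correctly stamped update arrives between the two runs.
    conn.execute("INSERT INTO src VALUES (3, '2009-03-05 13:00:00')")

    # Second run at 11:00 on 3/6.
    run2 = "2009-03-06 11:00:00"
    more_naive, _ = extract_greater_than(conn, naive_wm)
    more_window, _ = extract_window(conn, window_wm, run2)

    return sorted(naive_ids + more_naive), sorted(window_ids + more_window)


if __name__ == "__main__":
    naive, windowed = simulate()
    print("greater-than only:", naive)
    print("bounded window:   ", windowed)
```

In this model the "greater than" version never loads row 3, because the future-dated row 2 dragged its watermark past 13:00. The bounded version holds row 2 back until the window catches up to it and loads all three rows.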
And the other rule... Never trust the source system to be accurate 100% of the time. Anticipate issues like this.
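One cheap way to anticipate this kind of problem is to check for future-dated rows before each run and log them. The sketch below is again Python against an in-memory SQLite table. The table name src, the function names, and the 5 minute tolerance for clock drift are all assumptions picked for the example, not anything from the actual source system.

```python
import sqlite3
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def find_future_rows(conn, run_time, tolerance=timedelta(minutes=5)):
    """Return (id, last_update) for rows stamped later than run_time plus
    a small tolerance. Any hit means the source wrote a bad timestamp."""
    cutoff = (run_time + tolerance).strftime(TS_FORMAT)
    return conn.execute(
        "SELECT id, last_update FROM src WHERE last_update > ? ORDER BY id",
        (cutoff,),
    ).fetchall()


def demo_future_check():
    """Build a tiny hypothetical source table and run the check against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, last_update TEXT)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [
            (1, "2009-03-05 09:00:00"),  # normal row
            (2, "2009-03-05 15:30:00"),  # hours ahead of the run: suspicious
            (3, "2009-03-05 11:03:00"),  # slightly ahead, within tolerance
        ],
    )
    return find_future_rows(conn, datetime(2009, 3, 5, 11, 0, 0))


if __name__ == "__main__":
    for row_id, stamp in demo_future_check():
        print(f"row {row_id} has a future last_update: {stamp}")
```

A check like this would not fix the source system, but it would have put the bad rows in the morning logs on day one instead of leaving them to be found in UAT.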
Anyway, that's all for now.
peace