One problem that presents itself is the limitation on length. I am assuming (for now) that the limit is 2 GB of characters (the VARCHAR(MAX) maximum of 2^31 - 1 bytes), and since each output of HASHBYTES is 16 characters (for MD5), you are restricted to a total of 134,217,727 rows.
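For the record, the arithmetic is just the VARCHAR(MAX) capacity divided by the hash size:

```sql
-- 2^31 - 1 bytes of VARCHAR(MAX) capacity, 16 bytes per MD5 hash
SELECT 2147483647 / 16 AS max_rows;  -- 134217727
```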
This is fairly reasonable. Other ways must be found for larger data sets.
Another question comes up as to how the CONCAT function deals with various types. Does it expand the entire DATETIME2(7) value? How are floating point numbers treated? Collation is another way that differences can be lost; I have not tested cases where columns of different collations would be rolled up.
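One quick way to check, which I have not yet run against the real data, is to compare CONCAT's implicit conversions against explicit ones (the values below are purely illustrative):

```sql
-- Probe how CONCAT renders a DATETIME2(7) and a FLOAT as strings.
-- If the implicit forms drop precision, two different values could hash identically.
DECLARE @dt DATETIME2(7) = '2024-01-15 12:34:56.1234567';
DECLARE @fl FLOAT        = 1.0 / 3.0;

SELECT
    CONCAT(@dt, '|', @fl)          AS concat_output,       -- implicit conversions
    CONVERT(VARCHAR(27), @dt, 121) AS dt_full_precision,   -- explicit, full precision
    CONVERT(VARCHAR(40), @fl, 2)   AS fl_full_precision;   -- explicit, 16-digit scientific
```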
Some nice features are that there is no limit on the number of columns, and that differences in column data types do not get in the way (CONCAT converts everything to a string).
In terms of performance, binary forms of the data would certainly take less space, and so should compare faster. Unfortunately, I know of no way to do the comparison at the binary level.
What makes it semi-general is, of course, that the columns here are specific to the dataset being loaded, and therefore lend themselves to being the output of a code generator rather than being manually created and maintained each time.
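As a rough sketch of what such a generator might emit (my own illustration, not part of the original load code), the column list for the CONCAT call can be pulled from the catalog:

```sql
-- Build the comma-separated column list from the catalog, skipping the batch key
-- and keeping ordinal order so the generated hash stays stable between runs.
DECLARE @cols NVARCHAR(MAX);

SELECT @cols = STUFF((
    SELECT ', ' + QUOTENAME(c.COLUMN_NAME)
    FROM INFORMATION_SCHEMA.COLUMNS c
    WHERE c.TABLE_SCHEMA = 'loading'
      AND c.TABLE_NAME   = 'Event_Profiles'
      AND c.COLUMN_NAME <> 'batch_run_id'
    ORDER BY c.ORDINAL_POSITION
    FOR XML PATH('')), 1, 2, '');

-- @cols can then be spliced into the HASHBYTES/CONCAT query below.
SELECT @cols AS concat_column_list;
```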
```sql
SELECT batch_run_id, HASHBYTES('MD5', all_hashes) AS superhash
FROM (
    SELECT batch_run_id,
           All_Hashes = STUFF((SELECT '' + CAST(HASHBYTES('MD5', CONCAT(
                                   event_ccb_id, event_grouping_ccb_id, event_grouping_nm,
                                   group_ccb_id, group_nm, organizer_mbr_ccb_id, organizer_nm,
                                   event_type_ccb_id, event_type_nm, location_nm, location_strt_adr,
                                   listed_ind, public_calendar_listed_ind, ins_on_srvr_nm,
                                   creator_mbr_ccb_id, creator_mbr_nm, cr_dttm,
                                   modifier_mbr_ccb_id, modifier_mbr_nm, mod_dttm)) AS VARCHAR(MAX)) AS [text()]
                               FROM loading.Event_Profiles xt
                               WHERE xt.batch_run_id = t.batch_run_id
                               FOR XML PATH('')), 1, 2, '')
    FROM loading.Event_Profiles t
    GROUP BY batch_run_id
) x
```

More testing is required, and performance numbers will be added. For a set of 2,600 rows (with the above columns), it took 16 ms. I need to get some larger data sets. With three identical sets of data loaded, I got the following results:
| Batch # | Super Hash |
|---|---|
| 29 | 0xC8EC5BE4F39C133D444F2F9C1D81D14E |
| 30 | 0xC8EC5BE4F39C133D444F2F9C1D81D14E |
| 31 | 0xC8EC5BE4F39C133D444F2F9C1D81D14E |
For change detection, I am only comparing against the previously loaded set, not all prior sets. Say, for example, we load set A and process it through the ETL. Then, perhaps the next day, we load and process set B. Now, on the third day, we accidentally re-load set A. This could be an error, or it could be that someone backed out a change in a very low-activity set. We don't know, but regardless, we need to reprocess set A, because set B has altered the master set and has been propagated out to various downstream systems.
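A minimal sketch of that comparison, assuming the superhash from the query above is saved per batch in a hypothetical loading.Batch_Superhash table (batch_run_id, superhash, load_dttm):

```sql
-- Compare the newest batch's superhash only against the immediately previous load.
;WITH ranked AS (
    SELECT batch_run_id, superhash,
           ROW_NUMBER() OVER (ORDER BY load_dttm DESC) AS rn
    FROM loading.Batch_Superhash   -- hypothetical history table
)
SELECT cur.batch_run_id,
       CASE WHEN cur.superhash = prev.superhash
            THEN 'Unchanged from previous load - skip reprocessing'
            ELSE 'Differs from previous load - reprocess'
       END AS change_status
FROM ranked cur
JOIN ranked prev ON prev.rn = 2
WHERE cur.rn = 1;
```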