Monday, June 27, 2016

Detecting change in a feed cycle

I find it difficult to detect change in my incoming feeds in a consistent way from system to system, so I have begun working on a semi-general way to detect whether there was any change, in any column, in any row, between one import cycle and the previous cycle. This matters because a manually triggered feed load can accidentally be run twice, or a feed may be triggered manually overnight and then again by the scheduled run with nothing having changed in between, and there is no sense in loading the same data twice. Theoretically, double-loading the same data should have no impact, but doing so implies that "no change" occurred between two legitimate samplings of the source data, which is not necessarily true. Here is the SQL Server code for a system I currently maintain. It uses the HASHBYTES function to compute a hash of each row, and then aggregates those hashes into a single string.

One problem that presents itself is the limitation on length. I am assuming (for now) that the limit is 2 GB, the 2^31 - 1 byte ceiling of VARCHAR(MAX). Since each MD5 output of HASHBYTES is 16 bytes, you are restricted to a total of

134,217,727 rows

(fewer if the hashes are carried as 32-character hex strings, as in the query below). This is fairly reasonable. Other ways must be found for larger data sets.
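If that ceiling ever becomes a concern, a rough per-batch count makes the actual length visible. This is only a sketch, counting 32 characters per row to match the hex form used in the query below:

-- Rough size check: how close does each batch come to the VARCHAR(MAX) ceiling?
SELECT batch_run_id,
       COUNT_BIG(*)        AS row_cnt,
       COUNT_BIG(*) * 32   AS concat_length_chars   -- 32 hex characters per MD5 hash
FROM loading.Event_Profiles
GROUP BY batch_run_id;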

Another question is how the CONCAT function handles various types. Does it expand the entire DATETIME2(7) value? How are floating-point numbers rendered? Collation is another way that differences can be lost; I have not tested cases where columns of different collation types would be rolled up together.
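As a quick probe of the first two questions, a throwaway query like the one below (the literal values are invented purely for illustration) shows exactly what CONCAT emits for a DATETIME2(7) value and for two floats that differ only past the sixth significant digit; inspecting the output is the easiest way to see whether precision survives the implicit conversion to text.

-- Throwaway probe: what does CONCAT actually emit for these types?
-- The values are made up for illustration only.
SELECT
    CONCAT(CAST('2016-06-27 13:45:30.1234567' AS DATETIME2(7)), '') AS datetime2_as_text,
    CONCAT(CAST(1.2345678 AS FLOAT), '')                            AS float_a_as_text,
    CONCAT(CAST(1.2345679 AS FLOAT), '')                            AS float_b_as_text;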

Some nice features are that there is no limit on the number of columns, and that column data type differences are ignored, since CONCAT implicitly converts every argument to a string.

In terms of performance, binary forms of the data would take less space and so should compare faster. Unfortunately, I know of no way to do the aggregation and comparison at the binary level.

What makes it semi-general is, of course, that the columns here are specific to the dataset being loaded, and therefore lend themselves to being the output of a code generator rather than being created and managed by hand each time (see the sketch after the query below).

SELECT
    batch_run_id,
    -- Hash of the concatenated per-row hashes: one "superhash" per batch.
    -- Hashing a value longer than 8000 bytes requires SQL Server 2016 or later.
    HASHBYTES('MD5', all_hashes) AS superhash
FROM (
    SELECT
        batch_run_id,
        All_Hashes =
        (SELECT
            -- Per-row hash, rendered as a 32-character hex string; the hex form
            -- contains no control characters that FOR XML cannot serialize.
            CONVERT(CHAR(32), HASHBYTES('MD5', CONCAT(
              event_ccb_id
            , event_grouping_ccb_id
            , event_grouping_nm
            , group_ccb_id
            , group_nm
            , organizer_mbr_ccb_id
            , organizer_nm
            , event_type_ccb_id
            , event_type_nm
            , location_nm
            , location_strt_adr
            , listed_ind
            , public_calendar_listed_ind
            , ins_on_srvr_nm
            , creator_mbr_ccb_id
            , creator_mbr_nm
            , cr_dttm
            , modifier_mbr_ccb_id
            , modifier_mbr_nm
            , mod_dttm
            )), 2) AS [text()]
         FROM loading.Event_Profiles xt
         WHERE xt.batch_run_id = t.batch_run_id
         -- A stable order keeps the concatenation, and so the superhash,
         -- deterministic; event_ccb_id is assumed unique within a batch.
         ORDER BY xt.event_ccb_id
         FOR XML PATH(''))
    FROM loading.Event_Profiles t
    GROUP BY batch_run_id
) x
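Here is a minimal sketch of the generator idea mentioned above. It assumes nothing beyond the standard catalog views: it builds the comma-separated column list for the CONCAT call from INFORMATION_SCHEMA.COLUMNS, leaving out batch_run_id since that is the grouping key rather than part of the row hash.

-- Sketch only: emit the column list to paste into the CONCAT(...) call above.
-- Swap in any other loading table as needed.
SELECT STUFF((
    SELECT ', ' + c.COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS c
    WHERE c.TABLE_SCHEMA = 'loading'
      AND c.TABLE_NAME   = 'Event_Profiles'
      AND c.COLUMN_NAME <> 'batch_run_id'   -- grouping key, not part of the row hash
    ORDER BY c.ORDINAL_POSITION
    FOR XML PATH('')), 1, 2, '') AS concat_column_list;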
More testing is required, and performance numbers will be added. For a set (with the above columns) of 2600 rows, it took 16 ms. I need to get some larger data sets. With three identical sets of data loaded, I got the following results:
Batch #    Super Hash
29         0xC8EC5BE4F39C133D444F2F9C1D81D14E
30         0xC8EC5BE4F39C133D444F2F9C1D81D14E
31         0xC8EC5BE4F39C133D444F2F9C1D81D14E


For change detection, I am only looking at the previous load's set, not all prior sets. Say, for example, we load set A and process it through the ETL. Then, perhaps the next day, we load and process set B. Now on the third day we accidentally re-load set A. This could be an error, or it could be that someone backed out a change in a very low-activity set. We don't know, but regardless, we need to reprocess set A, because set B has altered the master set and has been propagated out to various downstream systems.
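As a sketch of that comparison, suppose each run's superhash is written to a small log table, say loading.Batch_Superhash (batch_run_id, superhash); that table name is hypothetical, not part of the system above. The decision is then simply whether the newest hash matches the one immediately before it:

-- Hypothetical log table, written after the superhash query runs:
-- CREATE TABLE loading.Batch_Superhash (batch_run_id INT PRIMARY KEY, superhash VARBINARY(16));
WITH ranked AS (
    SELECT batch_run_id,
           superhash,
           ROW_NUMBER() OVER (ORDER BY batch_run_id DESC) AS rn
    FROM loading.Batch_Superhash
)
SELECT CASE WHEN cur.superhash = prev.superhash
            THEN 1 ELSE 0 END AS same_as_previous_load   -- 1 = safe to skip reprocessing
FROM ranked cur
LEFT JOIN ranked prev ON prev.rn = 2
WHERE cur.rn = 1;

Because only the immediately preceding batch is consulted, re-loading set A after set B still produces a mismatch, and set A is reprocessed exactly as described above.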