Friday, July 1, 2016

Creating a Singleton table-row

One of the nice design patterns from Java is the Singleton. I don't see it as much in new code, but I like to use it in my database designs. There is no SQL Server-supported method for building a singleton, so I have manufactured a clumsy way to build one.

The idea is simply a one-row table, like DUAL in Oracle, that cannot be deleted, truncated, or appended to. It requires a seed value once.
Here is the SQL to build the object:

CREATE TABLE Loading._Batch_Run_Master_Key(
 -- Computed, persisted constant: this key can only ever be 1.
 batch_run_master_key_id AS ((1)) PERSISTED PRIMARY KEY,
 -- Example payload columns; substitute your own attributes.
 cur_batch_run_id int NULL,
 prev_batch_run_id int NULL
)
GO
Factors that make this a singleton:
  1. The master key is a computed column. It is not a column that can be set to a value, so it will always be 1. It is persisted so that we can apply a foreign key reference constraint against it (see the sketch after this list).
  2. The primary key constraint on this column is partly documentation, in that it tells us that this column value must be unique; since it is always 1, there can only ever be one row.
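As a quick sanity check, here is a sketch of what happens if you try to supply the key yourself (the exact error wording may vary by version):

-- The computed key cannot be the target of an INSERT.
INSERT INTO Loading._Batch_Run_Master_Key(batch_run_master_key_id)
 VALUES(2);
-- Expected: an error to the effect that the column cannot be modified
-- because it is a computed column.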

The listing above takes care of preventing more than one row. It does not seed the initial row; this is not something SQL Server can do declaratively, the way C# can initialize an array with "int[] x = {1,2,3};".
Here we instantiate our table with its minimal values. An INSERT statement needs at least one column, which seems like more than should be necessary just to create placeholders.

INSERT INTO Loading._Batch_Run_Master_Key(cur_batch_run_id) 
 VALUES(DEFAULT);

This creates, somewhat clumsily, our Singleton row. If I run this command again, I get the following error:
Msg 2627, Level 14, State 1, Line 32
Violation of PRIMARY KEY constraint 'PK___Batch_R__1C148BAAC381C4FC'. Cannot insert duplicate key in object 'Loading._Batch_Run_Master_Key'. The duplicate key value is (1).
The statement has been terminated.


This partially meets the requirements for our Singleton: More than one row cannot be inserted.
Our other use cases are:
  1. Delete
  2. Truncate
  3. Update

Let us examine the first case, deletion. Accidental deletion of singletons can and does happen, so we will put a guard on this record. A minimum-row constraint would be nice, but we do not have such a device, so we use a trigger to prevent deletion.
Here it is:
CREATE TRIGGER Loading._Batch_Run_Master_Key_Lock
   ON Loading._Batch_Run_Master_Key
   AFTER INSERT, DELETE, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Prevent deletion of our singleton row.
    IF (
        (SELECT COUNT(*) FROM inserted) = 0
        AND
        (SELECT COUNT(*) FROM deleted) = 1
       )
    BEGIN
        PRINT 'Singleton cannot be deleted. Rolling back.';
        ROLLBACK;
    END
END
Earlier today I had shown a way to determine if we were doing a delete action from within a trigger. I realized while I was coding this new trigger that I may want to ensure that a delete has actually happened.
Take the following case:

DELETE FROM Loading._Batch_Run_Master_Key WHERE 1=0;

I tested this, and the trigger actually does fire even when no rows are affected, which I had not known. By verifying that the deleted count is one and the inserted count is zero, we know that a delete action got past the WHERE filter and needs to be rolled back. For the case where no rows are deleted, it does not seem right to roll back when nothing occurred.
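With the guard in place, a genuine delete attempt is now rejected. A sketch of what that looks like (the exact follow-up error text may vary):

-- A real delete attempt is rolled back by the trigger.
DELETE FROM Loading._Batch_Run_Master_Key;
-- Expected output:
--   Singleton cannot be deleted. Rolling back.
-- followed by an error stating that the transaction ended in the trigger
-- and the batch has been aborted.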
The second case, truncation, is more complicated to control. This is not a permissions issue, but rather part of the definition of the row itself: even a DBA is not supposed to delete the row (he could instead drop the table). The trigger above does not help here, because TRUNCATE TABLE does not fire DELETE triggers.
To prevent truncation of one table, we must have another table referencing it. Here is the code:
CREATE TABLE Utility.BLOCK_TRUNCATE_INT(k int not null primary key)

ALTER TABLE Utility.BLOCK_TRUNCATE_INT WITH CHECK ADD CONSTRAINT [FK Master Key has truncate command blocked] FOREIGN KEY(k)
REFERENCES Loading._Batch_Run_Master_key (batch_run_master_key_id)

Now, when I run a truncate command,
TRUNCATE TABLE Loading._Batch_Run_Master_Key

..we get the following error:

Msg 4712, Level 16, State 1, Line 60
Cannot truncate table 'Loading._Batch_Run_Master_Key' because it is being referenced by a FOREIGN KEY constraint.

The third case, the update, is there to apply business rules to the various attributes you add to your Singleton, to make sure they are set properly and not improperly nulled.
As I always consolidate trigger actions into a single trigger, I will update the one trigger on my master key Singleton so that it now reads as follows:
ALTER TRIGGER Loading._Batch_Run_Master_Key_Lock
   ON Loading._Batch_Run_Master_Key
   AFTER INSERT, DELETE, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Prevent deletion of our singleton row.
    IF (
        (SELECT COUNT(*) FROM inserted) = 0
        AND
        (SELECT COUNT(*) FROM deleted) = 1
       )
    BEGIN
        PRINT 'Singleton cannot be deleted. Rolling back.';
        ROLLBACK;
    END

    -- Conditional non-nulls: cur_batch_run_id cannot be updated to NULL.
    IF ((SELECT COUNT(*) FROM inserted) = 1 AND (SELECT COUNT(*) FROM deleted) = 1)
    BEGIN
        IF (UPDATE(cur_batch_run_id) AND ((SELECT inserted.cur_batch_run_id FROM inserted) IS NULL))
        BEGIN
            PRINT 'Cannot nullify cur_batch_run_id';
            ROLLBACK;
        END
    END

END
The section at the bottom detects when an update statement is occurring. The way almost all databases logically enact an update is a delete followed by an insert. Internally, the implementation may not actually do that, and probably doesn't, but logically we see both counts: one row in deleted and one row in inserted.

In the case where an update is being done, we apply our business rule: the batch run id cannot be nulled. This is a simplistic rule, but it demonstrates a behavior that cannot be implemented with a simple NULL/NOT NULL constraint. It may be difficult to see, but the original INSERT of the hollow record allows nulls in all the attributes; it is only after a legitimate batch id has been added that it cannot be removed. One reason for this is that, upon creation of the Singleton, I cannot possibly know a valid batch run id, since none have yet occurred. Most developers would create a dummy value, say -1, and if referential integrity were applied, an actual dummy row would sit in the main Batch Run table with a key of -1. That is a fine way to do it, and there are cases where it is necessary, but I like the elegance of this technique, where no dummy value is required: a query joining the batch run table and the master key table will only ever return real batch run records.
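A minimal sketch of the rule in action (42 is just an illustrative batch id):

-- Setting a real batch id is allowed.
UPDATE Loading._Batch_Run_Master_Key SET cur_batch_run_id = 42;

-- Attempting to null it back out is rolled back by the trigger:
--   Cannot nullify cur_batch_run_id
UPDATE Loading._Batch_Run_Master_Key SET cur_batch_run_id = NULL;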

To clarify, the columns cur_batch_run_id and prev_batch_run_id were just example payloads I added to this structure. You would of course add your own.

To use the singleton, you simply join against it:
SELECT ep.* FROM Loading.Event_Profiles ep
JOIN Loading._Batch_Run_Master_Key mk on ep.batch_run_id = mk.cur_batch_run_id
Before any batches have run, cur_batch_run_id will be null, but after that cur_batch_run_id and prev_batch_run_id will always point to the most recent run and the run before it.
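Here is a sketch of how a load process might advance the singleton at the start of each run (@new_batch_run_id is a hypothetical variable holding the id of the newly registered run):

-- Shift the current run into "previous" and record the new run as current.
DECLARE @new_batch_run_id int = 42;  -- illustrative value
UPDATE Loading._Batch_Run_Master_Key
SET prev_batch_run_id = cur_batch_run_id,
    cur_batch_run_id  = @new_batch_run_id;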
One issue that might occur is that a run can never be deleted while it is referenced by the singleton. The trigger logic would probably not suffice for anything more complex, or the rule blocking nulls in the current batch id may simply have to be relaxed. It is more of a demonstration than anything else.

Another way to use the Singleton is to store global values that are used as constants against other columns. This would simply be a CROSS JOIN to pick up those values.
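A sketch of that pattern, pairing the singleton's values with every row of the feed table used earlier:

-- Every row of the feed is paired with the single row of constants.
SELECT ep.*, mk.cur_batch_run_id, mk.prev_batch_run_id
FROM Loading.Event_Profiles ep
CROSS JOIN Loading._Batch_Run_Master_Key mk;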

Determining the action in a multi-action trigger

I found a small trigger trick that I haven't seen anywhere else. In the interest of promoting my own re-use of code, I have been trying to create triggers that manage all three actions in one block of code. Those actions are DELETE, UPDATE, and INSERT. Unfortunately, unlike the MERGE statement, there is no "$action" variable to tell us which action is taking place.
Here is a trigger example of testing for the delete action:
ALTER TRIGGER Loading._Batch_Run_Master_Key_Lock
   ON Loading._Batch_Run_Master_Key
   AFTER INSERT, DELETE, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Block deletion of our singleton row.
    IF ((SELECT COUNT(*) FROM inserted) = 0)
    BEGIN
        PRINT 'DELETE!';
        ROLLBACK;
    END
END
An example use case is when you are creating a Singleton row in a table: you need to prevent the accidental deletion of that row. Interestingly, the singleton row in Oracle's DUAL table was not protected, and occasionally DBAs and developers have had headaches with this ubiquitously referenced table. This may appear to be overdoing your design, but I see it as defining as much as you can in declarative rules.
To detect an update, you can do as follows:

IF (
 (SELECT COUNT(*) FROM inserted)=1 
   AND 
 (SELECT COUNT(*) FROM deleted)=1
   )
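For completeness, the remaining action, an insert, can be detected the same way (a sketch, under the same single-row assumption as the warning below):

IF (
 (SELECT COUNT(*) FROM inserted)=1
   AND
 (SELECT COUNT(*) FROM deleted)=0
   )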

WARNING: I am only using this technique with single-row commands. In a multi-row update, I don't know what the counts would be. This remains to be tested.
Some may disagree that this is the best way to organize code. They may feel it is better to have three separate triggers, one for each action.
I cannot see any valid argument for this, except that each block of code is simpler, since it does not have to determine which action is taking place. However, I believe the three sets of code are more like facets of the same problem: how does this table deal with change? As far as I can recall, my update triggers have always been closely related to my insert triggers. If I am updating the modified date in the update action, did I set a default modified date on insert? Or if the update action increments a counter, it is closely tied to the insert action (or a default constraint) setting it to 0. I find it difficult to keep the other triggers in mind when I am working on one of them.
And what about the delete trigger? Mostly this is used for auditing. If I add a column to a table, I like to browse all the actions in a single expression or block of code, so that I can examine the new column's relationship with each action.

Monday, June 27, 2016

Detecting change in a feed cycle

I find it difficult to detect change in my incoming feeds in a consistent way from system to system, so I have begun working on a semi-general way to detect whether there was any change, in any column, in any row, between one import cycle and the previous cycle. This matters because a manually triggered feed load can accidentally be triggered twice, or a feed may be triggered manually in the night with no change occurring before a scheduled run takes place; in either case there is no sense loading the same data twice. Theoretically, there should not be any impact from double-loading the same data, but it implies that "no change" occurred between two legitimate samplings of the source data, which is not necessarily true. Here is the SQL Server code for a system I currently maintain. It uses the HASHBYTES function to compute a hash of each row, and then aggregates these into a single string.

One problem that presents itself is the limitation on length. I am assuming (for now) that the limit is 2 GB of characters (2,147,483,647). Since each output of HASHBYTES is 16 characters (for MD5, once cast to VARCHAR), you are restricted to a total of

134,217,727 rows.

This is fairly reasonable. Other ways must be found for larger data sets.

Another question comes up as to how the CONCAT function deals with various types. Does it expand the entire DATETIME2(7) value? How are floating-point numbers treated? Collation is another way that differences can be lost; I have not tested cases where columns of different collations would be rolled up.
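One way to answer those questions empirically is simply to look at what CONCAT produces for each type; a throwaway sketch:

-- Inspect how CONCAT renders various types before trusting the hash.
SELECT CONCAT(
    CAST('2016-06-27 10:30:00.1234567' AS DATETIME2(7)), '|',
    CAST(0.1 AS FLOAT), '|',
    CAST(12345.6700 AS DECIMAL(10,4))
) AS rendered;
-- The output shows exactly which characters feed into HASHBYTES.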

Some nice features are the lack of limitation on number of columns, and the fact that column data type differences are ignored.

In terms of performance, binary forms of the data would definitely take less space, and so should compare faster. Unfortunately I know of no way to compare at the binary level.

What makes it semi-general is, of course, that the columns here are specific to the dataset being loaded, and therefore lend themselves to being generated by a code generator rather than being created and managed by hand every time.

SELECT
    batch_run_id,
    HASHBYTES('MD5', all_hashes) AS superhash
FROM (
    SELECT
        batch_run_id,
        All_Hashes =
            STUFF((SELECT '' + CAST(HASHBYTES('MD5', CONCAT(
                  event_ccb_id
                , event_grouping_ccb_id
                , event_grouping_nm
                , group_ccb_id
                , group_nm
                , organizer_mbr_ccb_id
                , organizer_nm
                , event_type_ccb_id
                , event_type_nm
                , location_nm
                , location_strt_adr
                , listed_ind
                , public_calendar_listed_ind
                , ins_on_srvr_nm
                , creator_mbr_ccb_id
                , creator_mbr_nm
                , cr_dttm
                , modifier_mbr_ccb_id
                , modifier_mbr_nm
                , mod_dttm
                )) AS VARCHAR(MAX)) AS [text()]
            FROM loading.Event_Profiles xt
            WHERE xt.batch_run_id = t.batch_run_id
            FOR XML PATH('')), 1, 2, '')
    FROM loading.Event_Profiles t
    GROUP BY batch_run_id
) x
More testing is required, and performance numbers will be added. For a set of 2,600 rows (with the above columns), it took 16 ms. I need to get some larger data sets. With three identical sets of data loaded, I got the following results:
Batch #   Super Hash
29        0xC8EC5BE4F39C133D444F2F9C1D81D14E
30        0xC8EC5BE4F39C133D444F2F9C1D81D14E
31        0xC8EC5BE4F39C133D444F2F9C1D81D14E


For change detection, I am only looking at the previous load set, not all the sets. Say, for example, we were to load set A and process it through the ETL. Then, perhaps the next day, we load and process set B. Now on the third day we accidentally re-load set A. This could be an error, or it could be that someone backed out a change in a very low-activity set. We don't know, but regardless, we need to reprocess set A, because set B has altered the master set and been propagated out to various downstream systems.
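To make that comparison concrete, here is a sketch of the check itself. It assumes the superhash of each batch is saved into a hypothetical history table, Loading.Batch_Run_Superhash(batch_run_id, superhash); only the newest batch and the one immediately before it are compared:

-- Compare the latest superhash to the previous one; equal means skip reprocessing.
;WITH ranked AS (
    SELECT batch_run_id, superhash,
           ROW_NUMBER() OVER (ORDER BY batch_run_id DESC) AS rn
    FROM Loading.Batch_Run_Superhash
)
SELECT CASE WHEN cur.superhash = prev.superhash
            THEN 'No change since previous load - skip reprocessing'
            ELSE 'Change detected - reprocess'
       END AS verdict
FROM ranked cur
JOIN ranked prev ON prev.rn = 2
WHERE cur.rn = 1;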