Viewing entries tagged


Understanding different kinds of Data Reporting. Why and how is done.

In business, the words reporting, data reporting, audit reports, big data, and analytics are used very loosely sometimes. So, I thought I'd put in writing some thoughts about reporting as a reference point for anyone scratching their heads.

We've all heard the parrots say "DATA is King", but how do we get "DATA"?

Data for almost ANY application can be categorized into one of 3 categories:

  1. Operational Data (aka. Configuration Data)
  2. Event Data (aka. Business Event Data)
  3. Audit Data (aka. Tech Data)

The patterns to display and materialize these data sources generally follow a very ordered and logical process where:

  1. Operational Data is shown in small digestible chunks so a human being can easily analyze and modify what is presented to him/her at the moment.
    • Most often than not, it takes the form of a UI in a webpage or a native app.
    • Some other times, the configuration data is shown in printable format. More often than not, these take the form of counters and aggregate live data. A common use case is to keep counters of live metrics,  KPI alerts, and application heartbeats.
    • Operational data is modeled to support full CRUD operations.
    • Regardless of the format, this kind of data is never presented with thousands of rows or configuration entries... and the reason for that is: is not human readable. Just imagine your reaction if every time you do right-click on a file in Windows you get a menu with 2 thousand lines; not very useful.
    • Last but not least, configuration data is ALWAYS live data, it NEVER holds historical significance.
  1. Event Data exists with the sole purpose of fulfilling business intelligenceand analytics.
    • Event Data is always decentralized, meaning there is no magic bullet that can fulfill all applications, since the nature of the events being logged is usually tightly coupled with the actions that occur within applications, and those have some semantic meaning around the business of the application.
    • Event Data usually take the form of a separate database, or a different schema within the same database where events are logged in an asynchronous way so they don't interrupt the main flow of the application in question.
    • Event Data is WRITE-ONLY data. There are NO UPDATES, and NO DELETES, only writes to drop the events when they occur and reads to generate the reports. This usually makes event databases grow very rapidly depending on the number of events being reported.
    • Event data generally follows a star data model with multiple correlated tables holding DUPLICATED data from the operational data, with the additional time stamp on each event dropped.
    • Depending on the needs and performance boundaries allowed by the reporting tool, reports can run directly from the Event database, or from a Data Warehouse (DW). The DW is nothing more than a centralized database or collections of databases that are highly optimized to run reports as fast and efficient as possible, so business folks can run analytics and complex searches on the data from different perspectives.
    • Reports running from a Data Warehouse repository usually perform better and faster, since the tables are flattened to match exactly the dimensions of the reports. The data is said to be 'curated' when is transformed and loaded into the DW, since the schema of the data after ETLs are run, can be totally different from the way it was designed on for the operation of the application (the Operational Data)
    • If you run your reports from Event database, your queries will likely have a lot of joins; whereas if you run them from the DW, very few joins will be necessary (because the data is curated to match those reporting needs, and perform faster).
    • Last but not least, event data ALWAYS holds historical significance and ALWAYS report on BUSINESS data.
  1. Audit Dataexists with the sole purpose of reporting on Technology events. That is everything that is NOT BUSINESS data, but relates to the performance, security, functionality, etc, from the application.
    • Audit Data usually follows a very similar path to the Event data in terms of implementation and consumption (like having an Audit database with Star schema, and moving to the DW, blah, blah, blah).
    • The main differentiator between Audit Data and Business Data is that one relates to the business events, while the other relates to technology events.
    • Just like events, audit data ALWAYS holds historical significance and ALWAYS reports TECHNOLOGY-DRIVEN data.

When asked to generate reports and such, ask yourself what type of report it is and don't be afraid to ask questions and challenge the true business need for the 'report'. If someone is looking for operational (live) data, and there are UI/Screens already available that show such data for configuration or anything else, then ask yourself if you want to go on a limb and create a lot of overhead duplicating the effort, or simply adding a "Save as PDF" button or link in the screen will do the trick.

The last piece is about BIG DATA. That's the new hot word around IT these days. Big Data is nothing more than a group of very large data sets (in the order or exabytes) whose relationships and business semantics are so broad and messy, that is very difficult to drive analytics using the traditional DBA tools found in a DW environment. You can think of it like the next level to data warehousing.

When a DW gets big enough, or when massive amounts of data from different DW needs business analysis, then it becomes BIG DATA. The challenges are different and harder when you are trying to put order to this kind of mess, but once you do, a lot of goodies can come out of it, including predictive analytics based on past performance and events. Predictive business analytic engines are one of the hardest things to do, and it is a valued asset, especially for large corporations that have been hoarding data for decades, and now want to use it as a competitive advantage to plan ahead.

Good luck with your reporting adventures!



Implementing SQL Source Control

Every developer working with databases and backend magic needs DB source control. If you think you don’t, then you are crazy, or will become crazy soon after you realize it. But not every developer has the money to pay for the awesome RedGate SQL Source Control tool (~$1500 per developer, or ~$6000 for a team of 5). Sometimes, the ramen noodles expression must be interpreted literally. So how do you resolve/implement this problem? You may want to follow K. Scott Allen’s database version control blog series (links below).

  1. Three rules for database work
  2. The Baseline
  3. Change Scripts
  4. Views, Stored Procedures and the Like
  5. Branching and Merging

These posts are great for the conceptual background and practices when doing the change scripts and such. The problem is there is no much implementation in the series to help speed up the process. To complement the series, you can DOWNLOAD my implementation of SQL Source Control. It consist of roughly of 4 tables and 4 sprocs. If you want to implement it on your SQL Server Database, download this update script and run it against your database (make a backup first, just in case).

Here is the ERD of the version control entities:

To use it, just create your change script (maybe follow some of the advice from Scott Allen), and follow this template to add the final touches to the script. Essentially all you have to do is call the Database_Update sproc and inspect the DatabaseErrorLog table whenever you encounter a problem to find more details.

[sourcecode language="sql"] BEGIN TRY


DECLARE @DatabaseVersionLogId INT , --This is for later, you may needed if you want to report to a CI Server. @UpdateFromDatabaseVersion VARCHAR(32) = '1.2.3' ,--The database version you want to update. This is your starting point. @UpdateToDatabaseVersion VARCHAR(32) = '1.2.4' ,--The new version of the database after successfuly updated @UpdateScriptName VARCHAR(64) = N'''1.2.3 to 1.2.4''' ,--A friendly name for this script @UpdateScriptFileName VARCHAR(1024)= N'''123 to 124.sql''' ,--The file name containing this script. @CreateScriptName VARCHAR(64) = N'1.2.4' ,--A friendly name for ANOTHER script (not this one) that generates the new schema from scratch as opposed to from the @UpdateFromDatabaseVersion database @CreateScriptFileName VARCHAR(1024)= N'124.sql' ,--The file name containing the @CreateScriptName script @Tags VARCHAR(MAX) = N'1.2.4, 1.2.3, super release'--A CSV for tagging and searching your releases later on.


EXEC @return_value = [dbo].[Database_Update] @UpdateFromDatabaseVersion, @UpdateToDatabaseVersion, @UpdateScriptName, @UpdateScriptFileName, @CreateScriptName, @CreateScriptFileName, @Tags, @DatabaseVersionLogId = @DatabaseVersionLogId OUTPUT

SELECT @DatabaseVersionLogId AS N'@DatabaseVersionLogId'

PRINT ( 'All good. Update finished... now go open a beer' )




PRINT ( 'Michael Jackson''s moonwalk. The script is a piece of shit. Fix it and try again. Your database is safe.' )


END CATCH [/sourcecode]

You can download the same script HERE

Enjoy and Happy coding!



Letter from a MS Access hater!

I hate MS AccessI am a very frustrated man right now. I happen to be working on a project to “fix” a MS Access database some genius business user created with a ton of crappy forms, redundant data, thousands of data anomalies, zero relationships, unimaginable naming conventions, and the craziest data structures I have ever seen in my life. Somehow these people have been using this Access database to run their business for YEARS now, and the only reason they needed someone to look at it, is because the geniuses that created it made a giant table with columns that represented a combination of Month and Year where they would add some data as time went by. Every month they will come to this table and add a couple of columns to continue adding crap for the next month and so on. Well, they ran out of columns in the Access table (up to 255) and now they are stuck with it.

I just wanted to shoot myself in the head when I saw the db for the first time. My first suggestion was “Let’s do it right. Let’s port this to SQL Server Express (FREE), normalize it, create a couple of views, and you can still use your forms with the correct data model” but off course, they do not want me to do that, even after I explained that trying to fix this database will take more time and money because of how flawed it is, they still wanted their old stuff, full of errors and inconsistencies. They want me to fix this piece of shit. Unbelievable!

The situation made me think again about how businesses that are not in the IT space, see their IT and software development efforts as a liability instead of an asset. Why is it so hard for some companies to embrace technology? Unless your business is that of fighting against IT (and I can’t think of a business model based on that premise), they should embrace technology as a tool to improve business processes and enable collaboration between your clients, partners and your workforce. Every time I come across a corporation closing the doors of innovation, creativity and the enablement of IT to support and improve business processes, I see a dying mammoth. Such companies are doomed for failure, no matter the size or industry they are in.

Anyways… let me get back to my grinding Access DB. Ufff… I just needed to let out some steam.



SQL Convert CSV to Table. The Better Way.

A very common database operation is to transform a simple CSV string into a table so it can be used in more complex query operations. There is an overabundance of SQL functions online that perform such transformations. Here is one: [sourcecode language="sql"] CREATE FUNCTION [dbo].[fn_ParseCsvString] ( @CSVString VARCHAR(max) , @Delimiter VARCHAR(1) ) RETURNS @tbl TABLE ( s VARCHAR(128) ) AS BEGIN DECLARE @i INT , @j INT SELECT @i = 1 WHILE @i BEGIN SELECT @j = CHARINDEX(@Delimiter, @CSVString, @i) IF @j = 0 BEGIN SELECT @j = LEN(@CSVString) + 1 END INSERT @tbl SELECT SUBSTRING(@CSVString, @i, @j - @i) SELECT @i = @j + LEN(@Delimiter) END RETURN END [/sourcecode]

And you’ll use it like this:

[sourcecode language="sql"] select * from dbo.fn_ParseCSVString ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15', ',') [/sourcecode]

The problem is the way we use those functions and the fact that their performance order is linear to the number of comma separated values in the string. That works out ok for a small number of items, but not when you want to JOIN on a very large number of CSV. When you are using the CSV functions in a sproc or a query, you'd be writing JOIN statements to the result set from the function in order to filter a real table in your database.

Let's look at a case scenario: Suppose you have an Address table in your Geo schema. Your address table has a primary key 'Id' column and other columns that are irrelevant for the purpose of this article. If you wanted to query your Address based on CSV values using your function, you'd do something like this:

[sourcecode language="sql"]

DECLARE @csv VARCHAR(MAX) = '1,2,3,4,5,6,7'

SELECT a.Id FROM Geo.[Address] a JOIN dbo.fn_ParseCsvString(@csv, ',') tbl ON a.Id = tbl.s [/sourcecode]

Here is some benchmarking on the previous query (all in microseconds): 1 second = 1000 milliseconds = 1'000'000 microseconds

1000 CSV  ~ 0.5 secs – T1: 557'272 – T2: 408'185 – T3: 446'485

5000 CSV  ~ 2.3 secs  – T1: 2'376'159 – T2: 2'307'394 – T3: 2'370'191

10000 CSV ~ 4.6 secs  – T1: 4'699'332 – T2: 4'737'877 – T3: 4'641'049

You get the idea and it gets even worse when you do left joins on CSV. Even if your tables are indexed on the actual columns you are joining to your function, sequentially calculating those values and inserting them into a table will take some unnecessary time. So here is a solution that requires not a function but instead a store procedure and the reason for the store procedure is the need to run Dynamic SQL.

[sourcecode language="sql"]

PROCEDURE [dbo].[CsvToTable] @Csv VARCHAR(MAX) , @SchemaName VARCHAR(128) , @TableName VARCHAR(128) , @KeyColumnName VARCHAR(32) = 'Id' , @Debug BIT = 0 AS SET NOCOUNT ON ; BEGIN TRY DECLARE @mySql NVARCHAR(MAX)= 'select distinct ' + @KeyColumnName + ' from ' + QUOTENAME(@schemaName) + '.' + QUOTENAME(@tableName)

IF ( @Csv IS NULL OR @Csv = '' ) BEGIN SET @mySql = @mySql + ' where 1=0' END ELSE BEGIN

SET @mySql = @mySql + ' where ' + @KeyColumnName + ' in (' + @Csv + ')' END IF ( @Debug = 1 ) PRINT @mySql EXEC sp_executesql @mySql END TRY BEGIN CATCH --Your catch logic here --EXEC dbo.Error_Rethrow END CATCH


And this is how you’d use it:

[sourcecode language="sql"] DECLARE @csv VARCHAR(MAX) = '1,2,3,4,5,6,7'

DECLARE @tbl TABLE ( Id INT ) INSERT INTO @tbl EXEC dbo.CsvToTable @csv, N'Geo', N'Address', N'Id'

SELECT a.Id FROM Geo.Address a JOIN @tbl tbl ON tbl.Id = a.Id [/sourcecode]

Here is some benchmarking on the new solution (all in microseconds): 1 second = 1000 milliseconds = 1'000'000 microseconds

10000 CSV ~ 0.2 secs – T1: 212'289 – T2: 183'635 – T3: 204'688

WOW! That's  1/5 of a second on average to query 10K records. MUCH MUCH better ah?!

Let’s check pros and cons of this solution.


  1. There is nothing faster than this solution to convert CSV into a table. Period.
  2. You can configure additional indexes on the columns you want to lookup using a CSV approach and the indexed will have an impact on the speed of the CSV look-up; versus always having a linear execution order because the function is unaware of indexes.
  3. Simple to use.


  1. It’s using dynamic SQL, and that brings all the negative things writing dynamic SQL has. I will recommend having a dedicated role to runners of this sproc so that users do not have permissions to run it. Try to always use this sproc through another sproc.
  2. The fact that is a sproc “feels” wrong, since by their nature sprocs can have side effects, although this one has none.  This is precisely the reason why a function does not allow for execution of dynamic SQL.

I think I can live with the feeling in trade of the unmatched performance this solution offers. A couple of things to note that were done to prevent SQL injection on the Dynamic SQL:

  1. Both schema and table names are provides as separate parameters of the sproc and they are put together in the code with the '.' separator.
  2. Both schema and table name are wrapped using quotename()
  3. When executing the Dynamic SQL, I'm using sp_executesql rather than just the plain EXEC() function.
  4. There is a @Debug parameter that prints the generated Dynamic SQL when debugging. This will safe you a couple of headaches.



SQL MaxInt and MinInt functions

Is kind of frustrating when you are trying to write a complex query and there is that one function not provided by SQL Server out of the box to use it right away. This is the case with a function to compare 2 integers to get the Max or Min between them, similar to the Math.Max and Math.Min methods in .NET. You could create your own function and re-use it everywhere, but the first implementation that comes to your mind is a very inconvenient way to deal with this, since you cannot use it (easily) within an inline Sql query. Let's take a look at the easiest way to implement our function:

[sourcecode language="sql"] create function dbo.MaxInt(@int1 int, @int2 int) returns int as begin if @int1 > @int2 return @int1 return @int2 end [/sourcecode]

As I mentioned, there is no easy way to have this query as a subquery in a complex sql statement. And if you make it possible, the performance implications can be kinda bad. But the world doesn't end here! There is a convenient solution that will allow you to write an efficient function for it, or even better, have it inline right within your query. The trick is to double the desired value and then divide it by 2. So if you want to find the max value, find the value that is 2 times the max and then divide it by 2; if you want to find the min value, find the value that is 2 times the min and then divide it by 2. Let's take a look:

Max Function

[sourcecode language="sql"] create function dbo.MaxInt(@int1 int, @int2 int) returns int as begin return ((@int1 + @int2) + ABS(@int1 - @int2))/2 end [/sourcecode]

Min Function

[sourcecode language="sql"] create function dbo.MinInt(@int1 int, @int2 int) returns int as begin return ((@int1 + @int2) - ABS(@int1 - @int2))/2 end [/sourcecode]


And using this doubling notion with the absolute value tricks, we can elegantly write min and max comparisons right inside the Sql statement without the need of a function. Here is an example of an inline Sql statement:

[sourcecode language="sql"] select a, b, ((a + b) + ABS(a - b))/2 as maxValue from myTable [/sourcecode]

Sweet ahh?! Happy coding!



New product coming...

I'm starting a new project/product now. Let's see how it goes, I have very high expectations with this new product and I'll let things slip thru as it takes shape. I still have to make a new company for it, so I can get some tax benefits too. I keep postponing the company creation thing for tomorrow and then for tomorrow. One more time, I'll do it tomorrow (and this time I'll really do it). I'll kick things up with .NET 4.0 and some Entity Framework. This is my first time working with the ADO.NET entity framework and I like very much what I've seen so far. Initially I thought to use something different from .NET, like RoR (Ruby) or Django, the Python Web framework; but I ended up going back to the land I know the best: .NET world. The reason was simply the speed at which I can work and produce code is important at this time, since I'm the only one writing code on this thing.

I'll try to keep posting about my progress here and the new challenges I face as I move forward.