SQL Server security best practice

Security! This is the word comes in mind of every concerned person when it come...

Change the Collation Settings in MS SQL Server

This post will show you how to change the collation settings in MS SQL Server for specific database...

Resolve collation conflict

In this post I will show you how you can resolve collation conflict error...

Book: SQL Server 2008 High Availability

In this book I have tried to cover every single piece of information that might requires for installing and configuring SQL Server HA option like Clustering, Replication, Log Shipping and Database Mirroring...

Why to recompile Stored Procedure

Generally, we create views and stored procedures (proc here after) ...


Common Mistakes in SQL Server - Part 4

Last week we discussed advantages of recompilation in Common Mistakes in SQL Server – Part 3. This week I will discuss the Null Value and how it can cause an issue!!

What is NULL? 

Null is the value which indicates the value is unknown. Null is different than zero or an empty string, as a zero or empty string indicates that the value is known (refer http://en.wikipedia.org/wiki/Null_%28SQL%29 for further reading.)

Why should we avoid NULL values?

In one of my previous projects I was asked to perform some tests and do analysis on a specific database which had a tremendous growth rate, and it seemed that either the estimation related to growth was wrong, or the application inserts data more often than is expected. The database growing more than 300% in just 2 years wasn’t a good sign - something was really wrong. After I finished the analysis I found that there was a single table which had significantly contributed to the database size - that table itself was 450+ gigs in size; it had 217 columns and more than 50 million records, most of the columns allowing null values, which is where the problem was.

There is a misconception that if we have the NULL values in a table it doesn't occupy storage space. The fact is, a NULL value occupies space – 2 bytes. If we do not consider this point at the estimation stage we will have trouble with the growth of the database. Sometimes, however, application design demands some columns to be there for future use, and in this case, there will be NULL values.

Let me give you an example – create a table “tblEmployee” having thirteen columns of fixed length data type. I will add 150000 records and check the size of table. Remember, we are not creating a primary key/clustered index - thus this will be the Heap.

1. Specify the number of rows that will be present in the table
    Num_Rows = 150000

2. Specify the number of fixed length and variable length columns:
    Num_cols = 13
    Fixed_data_size= 4+15+15+4+25+25+10+10+3+4+4+35+35 = 189
    Num_Variable_Cols = 0
    Max_Var_Size = 0

3. Part of the row, known as the null bitmap, is reserved to manage column nullability:
    Null_Bitmap=2 + ((13+7)/8) = 4.5 = 4

4. Calculating the variable-length data size:
    Variable_Data_size: 0

5. Calculate the total row size:
    Row_size = Fixed_data_size + variable_data_size + Null_bitmap + 4
    Row_size = 189 +0 + 4 + 4
    Row_size = 197

6. Calculate the number of rows per pages (8096 free bytes per page)
    Rows_per_page = 8096 / (122+2) = 40.27 = 40

7. Calculate the number of pages required to store all the rows
    Num_Pages = Num_rows / Rows_per_page
    Num_Pages = 150000 / 40 = 3750

8. Calculate the amount of space required to store the data in the heap (8192 total bytes per page)
    Heap Size (bytes) = 8192 * 3750 = 30720000 Bytes
The size of the table tblEmployee is 30776KB which means storing 150,000 records would occupy 30 MB. Now, lets calculate this for 50 million records - you will get 9990 MB ((50000000/150000)*30). Remember we have calculated this for 13 columns only; if your table has more columns the size could be bigger than this. Download the demo script here

The above example shows how your estimation will be wrong if you do not consider NULL values while estimating database size. To estimate your table size correctly, you must consider this point, and you should try to narrow down the possibility of having NULL values in your tables.



Common Mistakes in SQL Server - Part 3

Last week in Common Mistakes in SQL Server - Part2 we discussed the advantages and disadvantages of using Identity columns. In this article I would like to shed some light on the advantages of recompile , and in what cases you should attempt recompilation!
Generally, we create views and stored procedures (“proc” here after) to avoid the re-write  T-SQL  which is called often by an application or on ad-hoc basis. And, there is strong reasoning to do so: both -  views and procs provides am edge over executing T-SQL as they can reuse cached plan stored in the system table syscacheobjects ( you can query it using DMV sys.syscacheobjects). 

We are aware that once we have created a view and/or proc and call it for the first time it gets compiled and executed(and the execution plan is stored in the syscacheobjects system table).  Then, every time we call this view or proc, it uses the same cached execution plan. By doing this, SQL Server Engine is reducing the overhead of recompiling the same set of T-SQL.

What exactly is an Execution Plan?
There are number of connections that can be made to the database server,  from one or  more different locations, and each of them are querying the database server to return some data. The specific data structure is called an execution plan (i.e. select * from Product where ProductSubCategoryId = 4).

What the SQL Server engine does here is - if the T-SQL came from a view or proc, it searches for the cached execution plan for the same T-SQL statement.  The SQL Server Engine will reuse the plan if it  is found ; otherwise, it will compile and execute the statement.  For example  we have to query Product table where SubCategoryId is 4 and 11 and this query is being called frequently( 500 times a day). If you query the syscacheobjects system table, or query DMV sys.syscacheobjects, you will notice that the Usecount value is not incremented.

Now, if you were to, create a view and execute it 5 times, and then query DMV sys.syscachedobjects again,  you will notice that the Usecount value has been incremented. This is because the SQL Server engine has used the cached execution plan.

Why should we recompile the proc or view?

There are certain events which invalidate the cached execution plan,  and it is advisable to recompile*  a proc or view if one of these events occur:

·         Schema changes / alter table
·         Alter index or drop index
·         Calling sp_recompile or using the  ‘with recompile’ or option (recompile)when executing the T-SQL
·         Large insertion or deletion of data in the tables
·         If you call DML and DDL in single set of T-SQL
·         Change the value of SET option

*Note: Recompilation is an expensive operation,  but it will help to avoid use of old/invalid plans.

Let us query the Product table in the AdventureWorks database; note that it uses the clustered Index Scan (PK_Product_ProductId)

Now we will create a non-clustered index on column ProductSubCategoryId in Product table.

Now we will create a procedure that uses the same select statements that we have used earlier (CMPart3_Proc1)

As we have the non-clustered index on the column ProductSubCategoryId, the relational database engine should use the non-clustered index. Let’s check if it does it by executing the query

Let's create a proc here, and execute it. If you notice, the database engine still uses the same execution plan even though we have  a newly created index on the ProductSubCategoryId.

We will free the procedure cache now and recompile the proc so that relational engine can create a new execution plan and use a newly created non-clustered index

You will note that as soon as we execute DBCC Freeproccache & sp_recompile ‘usp_ProdSubCatId’  Relational engine has created a new plan and is now using the correct index (in our case idx_ProdSubCatId).

That is why it is advisable to recompile proc or view as I've described above. You may download the demo script here.


Common Mistakes in SQL Server - Part 2

Last week we discussed about whether to choose variable or fixed length data type and when to use when to use a Unicode data type like nchar or nvarchar in Common Mistakes in SQL Server – Part 1.
Let’s discuss about the 2nd common mistake I have observed - using Identity column as primary key.

Identity is a property which can be defined on an integer, decimal, numeric, small integer, big integer or tiny integer data type, an identity column is a column which - generally is used as primary key.  The  Identity property creates an incremental value for the specified column automatically, which is why it is widely used by developers when they designed the table and a primary key – an identifier column.

How does it work?
The Identity column property has two components: a seed value and an incremental value. A seed is the intiial value  set for that particular column , where as an incremental is added to last inserted value when a new row is created. 

When we set the Identity properties, we have to define the value for first record aka seed value (for example, 0) and the incremental value (for example 1). Once you have defined this property and you insert a record into the table for the first time, it will add 0 for the Identity column.  upon the 2nd time you insert a record it, will have a value 1 (seed of 0 plus the incremental of, 1),  and at 3rd record insertion it would have a value  of 2  (1+1). (Download the demo script here Part2_1 which illustrates the table definition with Identity column)

Advantage of using an Identity column as a Primary Key:
There are couple of advantages using an Identity column as a Primary Key, including:

  • SQL Server itself manages the  Identity values
  • Developers don’t need to write a code for new primary key value because Identity column value is created automatically
  • The size of index will be low as compared to index size on char based columns

Are there any catches?
While there are couple of advantages, there are some disadvantages which makes managing Identity columns a bit harder including:

1. You cannot control Identity values, because this column and its values are controlled and managed by SQL Server. Please note, however, we may use 'set Identity_insert tblName on' to insert an Identity value explicitly.

2. For any reason, if insertion fails, a value for Identity will get created. Then the next time you insert a record it will create gap in numbering.

Let us check this out with an example. Create a table called TestIdentity with three columns: one Identity column, one varchar(10) and  one varchar(5) column. We’ll insert 10 records here, and then 11th record will fail because we’ll try to insert a value which is greater in size then what is defined in table definition. Once it has failed, we’ll again pass the correct values. Please notice the gap created in the Identity value. (Download the demo script here Part2_2)

3. You cannot add an Identity property to an existing column. There are only two options that you have if you need to add an Identity column to a table - drop and recreate the table or add a new column with Identity value (and then drop the existing column, if applicable). (Download the demo script here Part2_3)

4. There has been a lot of said about the disadvantage of Identity columns , especially when it comes to a replicated database., you have to alter your identity property and say “Not for Replication”.

Then, if you have to restore your replicated database for any reason you will have to be very careful, as you will have lost all your replication settings unless you specify that you wish to  keep_replication. For further information on what all we have to aware of when we back up or restore replicated database please refer to this article.

Whenever the Identity has reached its threshold value you will have to reseed the Identity values and adjust Identity range at publisher end. This happens when we restore backup at subscriber end; you will have to find the last inserted Identity value and adjust the Identity range at publisher, using these steps:
a. Execute select ident_current(‘tblName’) for every subscriber you have
b. Note down the highest value
c. Go to your publication server and execute
dbcc checkident(‘tblName’,reseed,value*+1) where value = value found in step b
d. Finally execute sp_adjust
e. Now, go to your publication server and
execute sp_adjustpublisherIdentityrange PublicationName, TblName

Note: above case is for Merge Replication.

Here is a nice article on MSDN on Replicating Identity Columns, and you may also refer Hilary Cotter’s article Identity Crisis (generally known as Identity Crisis).

I personally believe that one should avoid using  the Identity property , especially as a primary key when you have replicated database to over come  the issues you face with Identity columns. Rob Volk has written an interesting observation on Identity and Primary Key. Yes, doing this will require some more efforts on your part to create incremental values for your Primary key; but it really is just a matter of writing a few more line of code. Additionally, there are alternate ways of doing this - for more information you may refer an article here on how to generate auto incremental values in SQL Server. For further reading you can refer