Data Normalization: with storage being so cheap, what's the point?

Back in the old days, one of the big reasons people enforced such strict data normalization rules was that it greatly increased the amount of information you could fit on a hard drive.  When everything was measured in kilobytes or megabytes, storing redundant information simply wasn't an option; you had to cram as much information as possible into that little database.

These days space isn't the issue.  Storage is cheap, and if you have the money you can buy absolutely massive amounts of it.  1 TB hard drives are easy to find, and once you start RAIDing drives together the limits become just insane.  100 TB of storage isn't what it used to be.

The problem now is that, with so much data for the database platform to search through, we need to make sure the indexes SQL Server is searching are as small as possible, so they can be read from disk into memory as quickly as possible.

Say we have an Employee table which looks like this:

CREATE TABLE Employee
(EmployeeId INT,
FirstName NVARCHAR(50),
LastName NVARCHAR(50),
Department NVARCHAR(100),
Manager_First_Name NVARCHAR(50),
Manager_Last_Name NVARCHAR(50),
Location NVARCHAR(50),
Active VARCHAR(5))
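
To make the "best case indexing" assumption below concrete, here is a minimal sketch of the sort of nonclustered index and search we are talking about; the index name and the department value in the WHERE clause are illustrative, not something from the original post.

-- Index the wide NVARCHAR Department column directly
CREATE NONCLUSTERED INDEX IX_Employee_Department
ON Employee (Department);

-- A typical search that would use that index
SELECT EmployeeId, FirstName, LastName
FROM Employee
WHERE Department = N'Enterprise Data Management and Governance';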

Now this table is clearly not set up as a normalized table.  We are assuming that it has best-case indexing, along the lines of the index sketched above.  If we have 1,000 employee records and we search on the Department column, that index has to be read from disk.  Assuming the department names are nice long bureaucratic names averaging 30 characters each, that works out to roughly 60,000 bytes of index key data (NVARCHAR stores two bytes per character), and more once you account for per-row overhead and the fill factor, all of which has to be read from the disk.  Now compare that with a relational version of the table.

CREATE TABLE Employee
(EmployeeId INT,
FirstName NVARCHAR(50),
LastName NVARCHAR(50),
DepartmentID INT,
ManagerEmployeeID INT,
LocationId INT,
Active BIT)
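
The post doesn't show the lookup tables those ID columns point to, but for illustration a minimal sketch of the Department lookup table, and of the join that brings the department name back, could look like this; the table and column names here are assumptions rather than part of the original schema.

-- Narrow lookup table holding each department name exactly once
CREATE TABLE Department
(DepartmentId INT,
DepartmentName NVARCHAR(100))

-- Join back to the name only when it is actually needed for display
SELECT e.EmployeeId, e.FirstName, e.LastName, d.DepartmentName
FROM Employee e
INNER JOIN Department d ON d.DepartmentId = e.DepartmentID
WHERE d.DepartmentName = N'Enterprise Data Management and Governance';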

When we search the Employee table on DepartmentID we are now only loading integer values from disk into memory: about 4,000 bytes of key data for 1,000 rows. Knowing that SQL Server reads data from the disk in 8 KB pages, even with a 50% fill factor that index fits in a page or two and can be pulled into memory in a couple of read operations. The Department index on the original table, by contrast, needs roughly eight data pages with no fill factor, and somewhere around fifteen if we use the same 50% fill factor.
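
If you would rather measure than estimate, SQL Server can report the actual index sizes. A sizing query along these lines, using the sys.dm_db_index_physical_stats DMV, returns the page count and average record size for every index on the Employee table (this is a generic diagnostic query, not something from the original post, and it assumes the table lives in the dbo schema):

-- Leaf and intermediate level sizes for every index on Employee
SELECT i.name AS IndexName,
ps.index_level,
ps.page_count,
ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Employee'), NULL, NULL, 'DETAILED') AS ps
INNER JOIN sys.indexes AS i
ON i.object_id = ps.object_id
AND i.index_id = ps.index_id;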

With a table this size that isn't much. But the page counts grow linearly with the row count, so as your database grows the number of extra data pages that have to be read for a simple search of the table keeps growing with it: at a million rows the wide NVARCHAR index is on the order of 60 MB of key data while the integer index is on the order of 4 MB, and that difference gets paid on every operation that has to read the index.

Denny


One Response

  1. I agree with Denny that we still need to normalise data even with storage becoming cheap.
    In particular, if we are storing the same data in many places across an enterprise, it becomes a problem for the enterprise
    (a) to maintain the data – more than one system may be maintaining the same data, so maintenance costs increase unnecessarily; and
    (b) to determine which data is up to date – if the same data is stored in more than one place, it is difficult to determine which copy is correct. If a customer who has given his particulars to one business area finds that another business area in the same enterprise does not have that information, the enterprise will not be considered customer-oriented. It cannot use its data effectively while the data is inconsistent, and that hinders its own goals of improving its bottom line and increasing its sales.

    Jagan

