A while back someone posted on the ITKE forum asking what Indexes where, and what they were used for. I put up a quick answer, but I felt that it deserved a more in depth blog post; so here it is.
In the basic view, an Index is a subset of columns from a table. This subset of columns is stored in a sorted order so that the SQL Server can more quickly find the records based on the data in the index.
As an example, say that we have a table called Employee with four columns EmployeeId, FirstName, LastName and ManagerId. The EmployeeId column if the primary key of the table, by default the primary key is the clustered index (I’ll write another post about that later on) for the table.
We want to be able to search the table quickly for an Employee based on the last name. If we search the table as it is the SQL Server has to read every record in the table looking for matches. While at most companies this might be ok, if you have a larger employee base, say the US Army for example you’ll have hundreds of thousands of records in this table and searching it one record at a time will take quite a bit of time.
With an index created on the LastName column when we look for the employees last name the SQL Server knows where the records are in the table because the records are sorted in Alphabetical order.
Until SQL Server 2008 was released an index was required to hold all the records of the table. SQL Server 2008 introduces the concept of filtered indexes. When a filtered index is created the index is created on a subset of the data in the table by specifying a WHERE clause in the CREATE INDEX statement. This is good for making the indexes smaller and there for even faster.
For an example lets look back at the Employee table from above. Lets add a column called CurrentEmployee which is a BIT field. When the employee works for the company the value is set to 1 (shown as true in the HR system), when they leave the company it is set to 0 (shown as false in the HR system). Knowing that most of the queries in the system will only need to look for current employees we can create filtered indexes to improve the performance of those queries so that the index only has to hold the current employees.
One thing to remember about indexes, is that SQL Server will only use a single index per table per query. So if you are using more than one column in your where clause (or JOIN clause or COUNT clause) then you will need to have more than one column in your index creating what are called covering indexes.
In SQL Server 2005 we were introduced to another new feature of the index. This gives us the ability to add columns to the index which aren’t actually indexed. This gives us the ability to improve performance yet again, but this time the performance improvement has a cost; this cost is disk space.
Before the include feature was added to the indexes after finding the rows that you need, the SQL Server would then need to go back to the base table to find the rest of the values that you requested. When the SQL Server has to do this it is called a lookup. As an example, lets look at the Employee table from above again. We have an index on the LastName column, but we also want to display the EmployeeId and FirstName values as well so we write our query like this.
SELECT EmployeeId, FirstName, LastName
FROM Employee
WHERE LastName = N'Smith'
The SQL Server scans the index for the records with the LastName value of Smith. But we want the other columns as well, and they aren’t in the index. Because of this the SQL Server now has to do a lookup for each row being returned and get those values from the base table (or the clustered index). This requires that the SQL Server load all the data pages which hold all the records to get these values. Since the employees with the LastName value of Smith probably were not hired at the same time, the data is probably spread over many data pages all over the disk. This means that the results will take much longer to return while we wait for all this disk activity to complete.
Now, if we include these columns in the Index, all the values of Smith will still be kept together, but the EmployeeId and FirstName will be stored in the index along with the Smith values. So when we run the same query with the change to the index the query will run much faster now as the SQL Server can process all the records in a single internal operation.
What’s the difference between adding the columns to the index, and including them you ask? The easiest way to show this to stick an OrderBy clause on the query.
SELECT EmployeeId, FirstName, LastName
FROM Employee
WHERE LastName = N’Smith’
ORDER BY FirstName
By adding the Orderby clause the SQL Server now has to sort the data in a separate operation instead of a single operation. If we change the index to be on the LastName and FirstName columns and include the EmployeeId column the data will already be sorted in the correct order, and the SQL Server shouldn’t need to do any additional processing of the data before returning it to the client application.
I know that this was a long post, but I hope that you found it worth it.
Denny
2 Responses
I have inheirited three tables with many to many relationships,
and another so called link table containing three foreign keys in each
record. Is the correct index for the link table – a single clustered index
using all three columns? Should clustered indexes be on the primary
file group and non clustered go else where? where exactly?
Many thanks…
Acapsis,
The clustered index must be on the same file group as the table.
The nonclustered index can be placed on the same file group, on in a different file group.
By default the primary key will be the clustered index unless you tell it otherwise. Depending on how you are querying the table you may need to have more than one index on this linking table. I have several linking tables which I need to be able to query from both sides (mine are only two columns wide) and I have two indexes on each. The clustered index with the first column as the first column in the index, and the second index with the second column as the indexed column, and the first column as an included column.