So yesterday was Day 2 of EMC World, and my body is really starting to feel it. All of yesterday's sessions were top notch.
The first session yesterday was by Bruce Zimmerman. For those of you reading this from the SQL Server community, Bruce is to the EMC storage community what Bob Ward is to the SQL Server community. Bruce talks on EMC CLARiiON performance tuning every year at the 400-500 level. Here are some of the highlights.
When using NaviSphere Analyzer to monitor the utilization of your array you may not be getting an accurate picture. Metrics such as utilization are not true measurements but calculations. In the case of the utilization metric, Analyzer looks at the load that each Storage Processor (SP) is putting on the RAID Group and reports the higher of the two numbers. If you have a single LUN on the RAID Group, or all of the LUNs on the RAID Group are owned by a single SP, this isn't an issue since the other number will be 0. But if the LUNs on a single RAID Group are owned by both SPs and each SP is driving the RAID Group to 40%, Analyzer will show a 40% load instead of the actual 80% load on the RAID Group.
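Just to make the math concrete, here's a trivial illustration of reported-versus-actual load (my own sketch, not anything Analyzer actually runs):

    # Analyzer reports the higher per-SP number, not the sum
    SPA_LOAD=40
    SPB_LOAD=40
    echo "Reported: $(( SPA_LOAD > SPB_LOAD ? SPA_LOAD : SPB_LOAD ))%"  # 40%
    echo "Actual:   $(( SPA_LOAD + SPB_LOAD ))%"                        # 80%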
If you need to dump Analyzer data to a CSV file via the naviseccli command, use the -archivedump switch. (Someone asked me about this via Twitter a while back, which is why I made sure to include it.)
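If memory serves, the syntax looks something like this (the file names are just placeholders):

    # Convert a NAR archive into a CSV file for analysis elsewhere
    naviseccli analyzer -archivedump -data spa_archive.nar -out spa_archive.csv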
If you monitor the performance of your Storage Processors you may see the CPU spike to 100% at regular intervals. This interval corresponds to the data logging interval that you have set within NaviSphere Manager. While this CPU spike may worry you, unless your normal CPU load on the Storage Processor is very high it will not affect your throughput. If you are concerned that it is affecting throughput through the storage processors, try disabling the data collection for a period of time in the SP properties.
If you look at the NaviSphere properties for the array you'll see two settings for data logging: one for the background process, and one for live data capture. If these settings are different, the data logging happens at the lower of the two intervals. Most people should set both of these options to 300 seconds unless you need to capture data more frequently than that for a specific reason.
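From memory, the same intervals can also be set from the CLI with something like the line below, though I'd double-check the flag names against the naviseccli documentation for your FLARE release (the SP address is a placeholder):

    # Set the background (archive) and real-time polling intervals to 300 seconds
    naviseccli -h spa_address analyzer -set -narinterval 300 -rtinterval 300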
One improvement you'll notice in FLARE release 29 is that the load placed on the storage processor by the data logging process has been reduced by about 80%, which is a huge savings. You'll also notice that with release 29, when doing a non-disruptive update (NDU), the CPU on each storage processor has to be below 65%. In older versions the CPU load had to be below 50%. This change was made because the background management processes that the array runs can account for about 7-8% of CPU (per SP), and these processes don't fail over.
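If you want to see where your SP CPU sits before kicking off an NDU, the getcontrol command reports SP statistics including the busy percentage (the SP address below is a placeholder):

    # Look for the "Prct Busy" figure in the output
    naviseccli -h spa_address getcontrol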
Another naviseccli trick is to include the -np flag with your commands. This tells naviseccli not to poll the array for response information. If you need information back from the array when you run a command, you'll want to omit this flag. For example, if you create a LUN, have the array assign the LUN id, and want to use that LUN id later in the script, you'll need to exclude the -np switch. However, if you specify the LUN id yourself and don't care about the feedback, including the -np flag will save the Storage Processor quite a bit of work, since the CLI otherwise requests a good deal of information from the Storage Processor for every command that is issued.
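As a rough sketch (the bind syntax varies a bit between FLARE releases, so treat the details as placeholders), binding a LUN where you pick the LUN id yourself means there's no feedback to wait for, so -np is safe to add:

    # Bind LUN 42 as RAID 5 on RAID Group 0 without polling the array for a response
    naviseccli -h spa_address -np bind r5 42 -rg 0 -cap 100 -sq gb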
I also gathered a lot of information about VMware in other sessions yesterday.
I’m not sure if this was supposed to be released, but the next release of vSphere (aka ESX) will be in Q3 of 2010 and will be vSphere 4.1. This next release has a lot of enhancements to ease administration and improve integration between ESX and the EMC storage arrays. You can assume that all of these integrations between vSphere and EMC CLARiiON arrays will require FLARE release 30 which should also be coming out in Q3 2010.
The first improvement is the vStorage APIs. This is a set of APIs within vSphere 4.1 and the EMC arrays that allows the vCenter server, or the vSphere server itself (if not running with a vCenter server), to talk to the array directly and perform some actions.
These actions include Bulk Zero Acceleration. When creating a new file, this allows the vSphere host to tell the array to fill the file with zeros instead of having to transmit all those zeros to the array over Fibre Channel or iSCSI. This is done by the vSphere host writing a single block of all zeros to the array, then telling the array to replicate that block n number of times. While this won't reduce the amount of data that the array has to write, it will reduce your network traffic and because of this may save time. By default this feature will be enabled in vSphere 4.1, but it can be disabled in the advanced settings page of the host.
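If I'm remembering the advanced option name correctly, turning it off from the service console would look something like this (treat the option path as my best recollection rather than gospel):

    # 0 disables hardware-accelerated zeroing; 1 (the default) enables it
    esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedInit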
Another feature is a set of hardware locking changes. Currently when vSphere needs to take a lock on a LUN it locks the entire volume, performs its operation, then releases the lock on the LUN. In vSphere 4.1 it will be able to lock just the specific block on the disk that it wants to work with, then release just that block. This will allow multiple hosts to take locks on the same LUN at the same time without having to wait in line to complete their operations. There are a few places where this benefit will be seen, including boot storms (where you've got lots of machines booting at the exact same time), and it will allow for more snapshotting to take place (since when each snapshot is created a lock has to be taken on the LUN as the new file is created). By default this feature will be enabled in vSphere 4.1, but it can be disabled in the advanced settings page of the host.
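Same caveat as above on the exact option name, but the toggle should look something like:

    # 0 falls back to locking the whole LUN; 1 (the default) uses the per-block locking
    esxcfg-advcfg -s 0 /VMFS3/HardwareAcceleratedLocking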
The next feature is called Full Copy Acceleration. This is a great feature which will reduce the amount of traffic between the array and the host when cloning a virtual machine. Today when you clone a file, the file is copied up from the array to the host, then written from the host back to the array in the new location. With this feature enabled (which it is by default), the API will simply tell the array to copy the blocks which make up the file from one location to another, preventing the entire file from being transferred from the array up to the host. If the network between the array and the host is bandwidth limited, this will reduce the time it takes to clone the virtual machine.
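And again, going from memory on the option name, disabling it should look something like:

    # 0 disables the copy offload; 1 (the default) lets the array copy the blocks itself
    esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedMove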
Of the new VMware features which require array integration, there is only one which doesn't require FLARE 30, and that is the Stop and Resume feature, which requires FLARE 29 on the array. This feature cleans up the way that the guest OSs see that a thin provisioning pool is out of space and the LUN can't consume any additional space. Prior to vSphere 4.1 (also known as today), if a thin LUN can't be expanded as needed because there isn't any space left on the array, the guest OS will throw (within Windows at least) a blue screen of death (BSOD) because the page that it's requesting to write to isn't available. In vSphere 4.1 an error message will be thrown as a popup within the guest OS which effectively says that there was a problem writing to the disks.
Coming in Q2 of 2010 (so probably within the next six weeks or so) is the CLARiiON Provisioning Plugin for vSphere. This will let you provision a new LUN on the storage array and attach it to the VMware cluster from a single screen, which should greatly decrease the amount of time required to provision and attach storage from the array to the server.
I'm curious to see how long it takes other storage vendors to get these APIs working on their arrays (with or without VMware's assistance).
Check back tomorrow for my Day 3 post.
Denny